
Advanced Data Manipulation with PySpark
This lab introduces you to the fundamentals of aggregations, joins, and window functions in PySpark, key techniques for efficiently transforming and analyzing large-scale datasets. Throughout this lab, you will gain hands-on experience in grouping employees by department, computing salary statistics, and performing inner, left, and full outer joins to maintain essential relationships across datasets. You will also leverage window functions to analyze salary trends, rank employees, and compute cumulative totals within departments. By applying these techniques to real-world data, you will develop a deeper understanding of how PySpark processes structured data efficiently in a distributed computing environment. This lab is designed for data engineers, analysts, and developers who want to refine their skills in data manipulation and transformation using PySpark. By the end of this lab, you will have the expertise to implement aggregations, optimize joins, and apply window functions to build scalable, efficient data pipelines that drive analytics and decision-making in PySpark.

-
Challenge
Introduction to Advanced Data Manipulation with PySpark
In this lab, you’ll explore advanced data manipulation techniques in PySpark, focusing on aggregations, joins, ranking, and cumulative calculations. PySpark provides a powerful DataFrame API that enables you to process large datasets efficiently, making it a key tool for data engineers and analysts.
By working with employee and department datasets, you will learn how to aggregate data, merge datasets, rank employees based on salary, and compute running totals using window functions.
🟦 Note:
- PySpark’s aggregations help derive key insights by computing counts, averages, and other statistics on large datasets.
- Joins allow seamless merging of multiple datasets, ensuring data consistency across related tables.
- Window functions enable advanced analytics by ranking records and calculating cumulative values without collapsing datasets.
- Mastering these techniques is essential for real-world data processing, especially in big data environments.
Why It Matters
Understanding data manipulation in PySpark is crucial for handling large-scale datasets in distributed computing. By mastering aggregations, joins, and window functions, you will be able to:
- Extract meaningful insights by summarizing large datasets with aggregations.
- Merge datasets effectively using different types of joins to analyze relational data.
- Rank employees based on salary while handling ties efficiently.
- Compute cumulative totals to track trends and financial calculations.
Key Concepts
-
Aggregations in PySpark
- Purpose: Compute summary statistics like count, average, min, and max.
- Implementation: Uses functions such as `count()`, `avg()`, `min()`, and `max()`.
- Use Case: Finding the average salary per department and the total number of employees.
-
Joining Datasets with PySpark
- Purpose: Merge multiple datasets efficiently using different join types.
- Implementation: Uses the `join()` function with inner, left, and full outer joins.
- Use Case: Combining employees and departments to analyze workforce distribution.
-
Ranking Employees with Window Functions
- Purpose: Assign rankings based on salary while handling ties correctly.
- Implementation: Uses `row_number()` and `rank()` with `Window.partitionBy()`.
- Use Case: Identifying the top 3 highest-paid employees per department.
-
Computing Cumulative Totals
- Purpose: Track progressive salary sums within departments.
- Implementation: Uses `sum().over(Window.partitionBy())` with ordered rows.
- Use Case: Calculating cumulative salaries for financial forecasting.
🟩 Important:
Mastering data aggregation, joins, and window functions allows you to efficiently process and analyze large datasets in PySpark, making it a critical skill for data professionals.
Learning Objectives
- Understand how to aggregate data using PySpark’s built-in functions.
- Learn how to join datasets using different join types to merge relational data.
- Apply window functions to rank employees and compute cumulative totals.
- Work with real-world datasets, using PySpark for scalable data processing.
Now that you have an understanding of advanced data manipulation in PySpark, let’s move on to applying these techniques in real-world scenarios. Click on the Next Step arrow to begin! 🚀
-
Challenge
Grouping and Aggregating Data
In this step, you will work with two datasets: `employees` and `departments`, which contain structured employee and department information. You will begin by loading these datasets into PySpark, verifying their structure, and performing fundamental aggregations to analyze workforce distribution and salary trends.
By the end of this step, you will have a structured view of employees, departments, and their relationships, enabling deeper workforce analytics.
🟦 Why It Matters:
- Understanding employee distribution across departments helps in workforce planning and resource allocation.
- Aggregating salary information allows for insights into compensation trends and potential pay gaps.
- Working with relational data (employees linked to departments) is essential for structured business analysis.
Using PySpark, you will:
- Load and display the datasets to verify their structure and ensure correctness (Task 1.1).
- Group employees by department and count the number of employees per department, handling cases where employees do not belong to any department (Task 1.2).
- Calculate the average salary per department, ensuring departments with no employees are accounted for (Task 1.3).
- Compute salary statistics, including minimum, maximum, and average salaries across all employees (Task 1.4).
Dataset Schema
The datasets include employee details and department mappings, but some employees may not be assigned to a department, and some departments may not have any employees.
Employees Table Schema
| Column Name | Data Type | Description |
|-----------------|-----------|-------------|
| employee_id | Integer | Unique identifier for each employee |
| name | String | Full name of the employee |
| age | Integer | Employee’s age |
| salary | Float | Employee’s salary |
| department_id | Integer | Links to the `departments` table (nullable) |
| experience | Integer | Years of experience in the company |

Departments Table Schema
| Column Name | Data Type | Description |
|-----------------|-----------|-------------|
| department_id | Integer | Unique identifier for each department |
| department_name | String | Name of the department |

By the end of this step, you will have a structured and aggregated dataset ready for further workforce analysis.
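The sketch below shows one way to set these datasets up and run the Task 1.1–1.4 aggregations. The sample rows and the `AggregationsSketch` app name are illustrative assumptions; the lab environment supplies the real data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("AggregationsSketch").getOrCreate()

# Illustrative rows only -- the lab environment provides the real datasets.
employees = spark.createDataFrame(
    [(1, "Alice", 34, 85000.0, 10, 8),
     (2, "Bob", 29, 62000.0, 20, 4),
     (3, "Carol", 41, 98000.0, 10, 15),
     (4, "Dave", 25, 51000.0, None, 2)],
    ["employee_id", "name", "age", "salary", "department_id", "experience"],
)
departments = spark.createDataFrame(
    [(10, "Engineering"), (20, "Marketing"), (30, "Finance")],
    ["department_id", "department_name"],
)

# Task 1.1: verify structure and contents.
employees.printSchema()
employees.show()
departments.show()

# Task 1.2: employee count per department (a NULL department_id forms its own group).
employees.groupBy("department_id").agg(F.count("*").alias("employee_count")).show()

# Task 1.3: average salary per department.
employees.groupBy("department_id").agg(F.avg("salary").alias("avg_salary")).show()

# Task 1.4: overall salary statistics.
employees.agg(
    F.min("salary").alias("min_salary"),
    F.max("salary").alias("max_salary"),
    F.avg("salary").alias("avg_salary"),
).show()
```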
-
Challenge
Performing Joins Between Datasets
In this step, you will combine the `employees` and `departments` datasets using various types of joins in PySpark. This will allow you to analyze the relationships between employees and departments, including cases where employees do not belong to a department and departments have no assigned employees.
Additionally, you will learn how to handle missing values resulting from these joins using PySpark’s `fillna()` function, ensuring a clean and structured dataset for analysis.
By the end of this step, you will have a fully joined and cleaned dataset, enabling better insights into workforce distribution and department structures.
🟦 Why It Matters:
- Joining relational datasets is essential for analyzing employee-department relationships.
- Different types of joins allow you to control how records are matched and retained.
- Handling missing data ensures your dataset is clean and ready for further processing.
Using PySpark, you will:
- Perform an inner join to keep only employees with valid department assignments (Task 2.1).
- Perform a left join to ensure that all employees are retained, even if they do not belong to a department (Task 2.2).
- Perform a full outer join to include all employees and all departments, even when no direct relationship exists (Task 2.3).
- Handle missing values in the dataset by replacing `NULL` values with appropriate defaults (Task 2.4).
Understanding Different Types of Joins in PySpark
Joins are a crucial part of data processing, allowing us to combine information from multiple datasets. PySpark supports multiple types of joins that determine how records are matched and retained.
💡 Inner Join
An inner join returns only the records where a match is found in both datasets.
If an employee does not have a valid department, or if a department has no employees, those records will be excluded from the result.
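A minimal inner join sketch, assuming the `employees` and `departments` DataFrames from the earlier aggregation sketch:

```python
# Inner join: only employees whose department_id matches a department survive.
inner_df = employees.join(departments, on="department_id", how="inner")
inner_df.show()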
💡 Left Join
A left join ensures that all employees are retained, even if they do not belong to a department.
If an employee does not have a matching department, `NULL` values will appear in department-related columns.
Departments without employees are excluded from the result.
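A left join under the same assumptions looks like this:

```python
# Left join: every employee is kept; unmatched department columns come back as NULL.
left_df = employees.join(departments, on="department_id", how="left")
left_df.show()
```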
💡 Right Join
A right join is the opposite of a left join. It ensures that all departments are retained, even if they do not have any employees.
Employees without a valid department are excluded from the result.
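And a right join, again assuming the same DataFrames (the graded tasks use inner, left, and full outer joins; this one is shown for contrast):

```python
# Right join: every department is kept; employees without a valid department drop out.
right_df = employees.join(departments, on="department_id", how="right")
right_df.show()
```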
💡 Full Outer Join
A full outer join retains all employees and all departments, even when there is no direct relationship between them.
If an employee does not belong to a department, their `department_name` will be `NULL`.
If a department has no employees, employee-related fields will be `NULL`.
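A sketch combining a full outer join with `fillna()`, assuming the same DataFrames; the default values "Unassigned" and "No Employee" are illustrative choices, not prescribed by the lab:

```python
# Full outer join: all employees and all departments, matched where possible.
full_df = employees.join(departments, on="department_id", how="outer")

# Replace join-produced NULLs with placeholder defaults (illustrative values).
cleaned_df = full_df.fillna({"department_name": "Unassigned", "name": "No Employee"})
cleaned_df.show()
```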
By understanding and applying these different types of joins, you will be able to merge datasets efficiently, ensuring that your data maintains the correct relationships while preserving important records.
-
Challenge
Applying Window Functions
In this step, you will apply advanced ranking and cumulative aggregation techniques using PySpark’s window functions. These operations allow you to rank employees by salary, identify the top performers in each department, and compute running totals of salaries within each department.
By the end of this step, you will have a clear understanding of how to use window functions for ranking and cumulative calculations, making it easier to analyze hierarchical and sequential data.
🟦 Why It Matters:
- Ranking employees helps in identifying top performers within an organization.
- Partitioning data by department allows for group-wise analysis, making it possible to find top earners in each department.
- Cumulative aggregation is widely used in financial analytics, forecasting, and trend analysis.
Using PySpark, you will:
- Rank employees based on salary using `row_number()` to assign a unique ranking to each employee (Task 3.1).
- Find the top 3 highest-paid employees per department using `rank()` while handling ties (Task 3.2).
- Compute cumulative salary totals within each department using `sum().over(Window.partitionBy())` (Task 3.3).
Understanding Window Functions in PySpark
Window functions in PySpark allow you to perform calculations across a subset of rows that are related to the current row. Unlike aggregate functions that collapse data into a single value, window functions retain all rows while computing values based on a defined window (partition).
💡 Ranking Employees Using `row_number()`
The `row_number()` function assigns a unique rank to each row without duplicates.
If two employees have the same salary, they will receive different ranks (no ties).
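A minimal `row_number()` sketch, assuming the `employees` DataFrame from the earlier sketch and partitioning by department (one reasonable reading of Task 3.1):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# One row_number per employee within each department, highest salary first.
# Ties are broken arbitrarily, but every row gets a distinct number.
salary_window = Window.partitionBy("department_id").orderBy(F.col("salary").desc())
ranked_df = employees.withColumn("row_num", F.row_number().over(salary_window))
ranked_df.show()
```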
💡 Identifying Top Earners Using `rank()`
The `rank()` function assigns the same rank to employees with identical salaries.
If two employees have the same salary, they will both receive the same rank, and the next rank will be skipped.
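A sketch of the top-3 filter with `rank()`, under the same assumptions:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# rank() gives tied salaries the same rank and skips the next one,
# so filtering on rank <= 3 keeps the top 3 earners per department.
rank_window = Window.partitionBy("department_id").orderBy(F.col("salary").desc())
top3_df = (
    employees
    .withColumn("salary_rank", F.rank().over(rank_window))
    .filter(F.col("salary_rank") <= 3)
)
top3_df.show()
```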
💡 Computing Cumulative Totals Using `sum().over()`
The cumulative sum function computes running totals of salary within each department.
This technique is useful in financial reporting, revenue tracking, and progressive salary calculations.
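A cumulative-total sketch, again assuming the same `employees` DataFrame; the explicit `rowsBetween` frame is one way to make the running total unambiguous:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Running salary total per department, ordered by salary.
# The frame pins the sum to "start of partition up to the current row".
cum_window = (
    Window.partitionBy("department_id")
    .orderBy("salary")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
cumulative_df = employees.withColumn("cumulative_salary", F.sum("salary").over(cum_window))
cumulative_df.show()
```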
By mastering these window functions, you will be able to rank, filter, and analyze hierarchical data efficiently, allowing for deeper insights into employee performance and salary trends.
🎉 Congratulations on Completing the Lab!
You have successfully completed the lab on Advanced Data Manipulation with PySpark. Throughout this lab, you explored essential data manipulation techniques, including aggregation, joins, ranking, and cumulative calculations using PySpark’s DataFrame API and window functions.
In this lab, you learned:
- Performing data aggregations using functions like `count()`, `avg()`, `min()`, and `max()` to derive insights from employee data.
- Merging datasets efficiently with inner, left, and full outer joins, ensuring meaningful relationships between employees and departments.
- Handling missing values with `fillna()`, maintaining a clean and structured dataset after performing joins.
- Ranking employees effectively by applying `row_number()`, ensuring a unique ranking based on salary.
- Identifying the top 3 highest-paid employees per department using `rank()`, while correctly handling ties.
- Computing cumulative salary totals with `sum().over(Window.partitionBy())`, allowing for progressive salary analysis within each department.
Key Takeaways
-
Aggregating Data for Insights
- PySpark’s built-in functions allow for quick and efficient aggregations, helping to analyze trends and compute key statistics.
-
Joining Data for Comprehensive Analysis
- Inner, left, and outer joins merge relational datasets, providing meaningful insights into employee-department structures.
-
Using Window Functions for Advanced Analytics
- `row_number()` assigns unique rankings, while `rank()` correctly handles ties when sorting employees by salary.
- `sum().over(Window.partitionBy())` computes running totals, useful for financial forecasting and performance tracking.
Thank you for completing the lab! 🚀