- Learning Path Libraries: This path is only available in the libraries listed. To access this path, purchase a license for the corresponding library.
- Data
Big Data with PySpark
The Big Data with PySpark learning path equips learners with the skills to process, transform, and analyze large datasets efficiently. This path covers data ingestion, ETL workflows, query optimization, and distributed machine learning using PySpark DataFrames, SQL, MLlib, and Structured Streaming. By mastering performance tuning and real-time processing, learners can build scalable data pipelines for analytics, machine learning, and big data applications.
Content in this path
Big Data with PySpark
Watch the following courses to start your Big Data with PySpark learning journey.
- How to perform big data analytics with PySpark
- How to build ETL pipelines with PySpark
- How to perform scalable machine learning with PySpark
- How to perform real-time stream processing with PySpark
- How to build recommendation systems with PySpark
- Learners interested in this path should have a solid understanding of Python programming, SQL, and basic data manipulation using pandas. Familiarity with distributed computing concepts and cloud or big data tools (e.g., Hadoop, Spark, or databases like PostgreSQL) is helpful but not required.
- Apache Spark
- PySpark
- SQL
- MLlib
- ETL
- Big Data