Handling Batch Data with Apache Spark on Databricks

by Janani Ravi

This course teaches you how to transform and aggregate batch data with Apache Spark on the Azure Databricks platform using selection, filter, and aggregation queries and built-in and user-defined functions, and how to perform windowing and join operations on that data.

Course info

Level: Beginner
Updated: Dec 1, 2021
Duration: 2h 22m

What you'll learn

Azure Databricks lets you process and query big data using the Apache Spark unified analytics engine. It also lets you work with a variety of batch sources and makes it seamless to analyze, visualize, and process data on the Azure Cloud Platform.

In this course, Handling Batch Data with Apache Spark on Databricks, you will learn how to perform transformations and aggregations on batch data with selection, filtering, grouping, and ordering queries that use the DataFrame API. You will understand the difference between narrow transformations and wide transformations in Spark, which will help you figure out why certain transformations are more efficient than others. You will also see how you can run these same transformations as SQL queries on your data.
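
As a rough illustration of the narrow versus wide distinction mentioned above, the following plain-Python sketch models partitions as lists. It does not use Spark and is not taken from the course; the data, the function names narrow_filter and wide_group_sum, and the byte-sum partitioner are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical data: two partitions of (city, amount) rows.
partitions = [
    [("Oslo", 10), ("Pune", 5)],
    [("Oslo", 7), ("Lima", 3)],
]


def narrow_filter(parts, predicate):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]


def wide_group_sum(parts, num_out):
    """Wide: rows sharing a key must end up together, so rows may change partition."""
    shuffled = [defaultdict(int) for _ in range(num_out)]
    moved = 0
    for src, part in enumerate(parts):
        for key, value in part:
            # Byte sum is a deterministic stand-in for a real hash partitioner.
            dst = sum(key.encode()) % num_out
            if dst != src:
                moved += 1
            shuffled[dst][key] += value
    return [dict(p) for p in shuffled], moved


filtered = narrow_filter(partitions, lambda row: row[1] >= 7)
grouped, rows_moved = wide_group_sum(partitions, num_out=2)

# Merge per-partition results to inspect the overall totals.
totals = {city: amount for part in grouped for city, amount in part.items()}
```

In this model the filter never looks outside its own partition, while the grouped sum has to route rows by key. That routing stands in for the shuffle that tends to make wide transformations more expensive than narrow ones.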

Next, you will learn how to implement your own custom user-defined functions (UDFs) to process your data. You will write code in Azure Databricks notebooks to define and register your UDFs and use them to transform your data. You will also understand how to define and use different flavors of vectorized UDFs for data processing and learn why vectorized UDFs are often more efficient than regular UDFs. Along the way, you will also see how you can read from Azure Cosmos DB as a source for your batch data.
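
One way to picture why vectorized UDFs are often more efficient is to count how many times the user function is invoked. The sketch below is a plain-Python model only, assuming a made-up temperature conversion and hypothetical helpers apply_row_udf and apply_batch_udf. Real vectorized UDFs in Spark operate on pandas Series transferred in Arrow batches, which this model does not reproduce.

```python
def to_celsius(fahrenheit):
    """Row-at-a-time function: handles a single value per call."""
    return (fahrenheit - 32) * 5 / 9


def to_celsius_batch(batch):
    """Batch function: handles a whole list of values per call."""
    return [(f - 32) * 5 / 9 for f in batch]


def apply_row_udf(values, fn):
    """Call fn once per value and count the calls."""
    out, calls = [], 0
    for value in values:
        out.append(fn(value))
        calls += 1
    return out, calls


def apply_batch_udf(values, fn, batch_size):
    """Call fn once per batch of batch_size values and count the calls."""
    out, calls = [], 0
    for start in range(0, len(values), batch_size):
        out.extend(fn(values[start:start + batch_size]))
        calls += 1
    return out, calls


temps_f = [32.0, 50.0, 212.0, 98.6]  # hypothetical input column
row_out, row_calls = apply_row_udf(temps_f, to_celsius)
batch_out, batch_calls = apply_batch_udf(temps_f, to_celsius_batch, batch_size=2)
```

Both paths produce the same values, but the batch path makes fewer calls. In a real cluster each call also carries serialization and interpreter overhead, so amortizing it over a batch is a large part of the benefit.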

Finally, you will see how you can repartition your data in memory to improve processing performance. You will use window functions to compute statistics on your data, and you will combine data frames using union and join operations.
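
The window functions mentioned above are covered in the course alongside row frames and range frames. As a minimal sketch of how the two frame types differ, the plain-Python functions below compute a sum over "1 preceding through current row" in both styles. The sample rows and the names rows_frame_sum and range_frame_sum are hypothetical, and no Spark API is used.

```python
# Hypothetical (day, amount) rows, already sorted by day. Day 2 appears twice.
rows = [(1, 10), (2, 20), (2, 30), (5, 40)]


def rows_frame_sum(rows, preceding):
    """Row frame: the window is defined by physical row positions."""
    return [
        sum(amount for _, amount in rows[max(0, i - preceding): i + 1])
        for i in range(len(rows))
    ]


def range_frame_sum(rows, preceding):
    """Range frame: the window is defined by the ordering value itself."""
    return [
        sum(amount for d, amount in rows if day - preceding <= d <= day)
        for day, _ in rows
    ]


by_rows = rows_frame_sum(rows, preceding=1)    # [10, 30, 50, 70]
by_range = range_frame_sum(rows, preceding=1)  # [10, 60, 60, 40]
```

The two day-2 rows get different sums under the row frame but the same sum under the range frame, because a range frame includes every row whose ordering value falls inside the interval.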

When you’re finished with this course, you will have the skills to perform advanced transformations and aggregations on batch data, including defining and using user-defined functions for processing.

Course Overview

2mins

Course Overview 2m

Transforming Data Using DataFrames

40mins

Prerequisites and Course Outline 2m
Apache Spark on Databricks 3m
RDDs and Data Frames 6m
Narrow and Wide Transformations 5m
Demo: Configuring Workspace and Cluster 5m
Demo: Operations with Shuffled Writes to Disk 6m
Demo: Basic Transformations 6m
Demo: Aggregation Transformations 8m

Transforming Data Using Spark SQL

31mins

The Catalyst Optimizer 6m
Demo: Creating Global Table 3m
Demo: Running SQL Queries in Spark 6m
Demo: Replacing Table Contents and Partitioning Tables 6m
Demo: Running Interactive Queries on a Notebook on an All-purpose Cluster 4m
Demo: Running a Notebook as a Job on a Job Cluster 6m

Applying User-defined Functions to Transform Data

31mins

User-defined Functions (UDFs) 2m
Vectorized UDFs 3m
Demo: Loading Data into Azure Cosmos DB 4m
Demo: Reading Data from Cosmos DB in Spark 4m
Demo: User-defined Functions (UDFs) 5m
Demo: Vectorized UDFs - Series to Series 5m
Demo: Vectorized UDFs - Iterator of Series to Iterator of Series 2m
Demo: Vectorized UDFs - Iterator of Multiple Series to Iterator of Series 3m
Demo: Vectorized UDFs - Series to Scalar 3m

Processing Data Using Joins and Window Functions

37mins

Partitioning 2m
Demo: Working with Data Partitions 6m
Demo: Repartitioning and Coalescing Data 4m
Demo: Performing Union Operations 2m
Demo: Performing Join Operations 7m
Window Functions 3m
Row Frames and Range Frames 5m
Demo: Applying Window Functions 6m
Summary and Further Study 1m

About the author

Janani Ravi

Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework. After spending years working in tech in the Bay Area, New York, and Singapore at companies such as Microsoft, Google, and Flipkart, Janani finally decided to combine her love for technology with her passion for teaching. She is now the co-founder of Loonycorn, a content studio focused on providing ...
