
Handling Batch Data with Apache Spark on Databricks

This course will teach you how to transform and aggregate batch data with Apache Spark on the Azure Databricks platform using selection, filter, and aggregation queries, built-in and user-defined functions, and windowing and join operations.
Course info
Level
Beginner
Updated
Oct 25, 2021
Duration
2h 21m
Description

Azure Databricks allows you to work with big data processing and queries using the Apache Spark unified analytics engine. It supports a variety of batch sources and makes it seamless to analyze, visualize, and process data on the Azure cloud platform.

In this course, Handling Batch Data with Apache Spark on Databricks, you will learn how to perform transformations and aggregations on batch data using selection, filtering, grouping, and ordering queries built on the DataFrame API. You will understand the difference between narrow transformations and wide transformations in Spark, which will help you see why certain transformations are more efficient than others. You will also learn how to express these same transformations as SQL queries on your data.

Next, you will learn how you can implement your own custom user-defined functions to process your data. You will write code on Azure Databricks notebooks to define and register your UDFs and use them to transform your data. You will also understand how to define and use different flavors of vectorized UDFs for data processing and learn how vectorized UDFs are often more efficient than regular UDFs. Along the way, you will also see how you can read from Azure Cosmos DB as a source for your batch data.

Finally, you will see how you can repartition your data in memory to improve processing performance, use window functions to compute statistics on your data, and combine DataFrames using union and join operations.

When you’re finished with this course, you will have the skills and ability to perform advanced transformations and aggregations on batch data, including defining and using user-defined functions for processing.

About the author

A problem solver at heart, Janani has a Masters degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

Section Introduction Transcripts

Course Overview
Hi, my name is Janani Ravi, and welcome to this course on Handling Batch Data with Apache Spark on Databricks. A little about myself: I have a Master's degree in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. I currently work on my own startup, Loonycorn, a studio for high‑quality video content. Azure Databricks allows you to work with big data processing and queries using the Apache Spark unified analytics engine. Databricks makes it seamless for you to analyze, visualize, and process data on the Azure cloud platform. In this course, you will learn how to perform transformations and aggregations on batch data with selection, filtering, grouping, and ordering queries that use the DataFrame API. You will understand the difference between narrow and wide transformations in Spark, which will help you figure out why certain transformations are more efficient than others. You will also see how you can execute the same transformations by executing SQL queries on your data. Next, you will learn how you can implement your own custom user‑defined functions to process your data. You will write code on Azure Databricks notebooks to define and register your UDFs and use them to transform your data. You'll also understand how to define and use different flavors of vectorized UDFs for data processing and learn how vectorized UDFs are often more efficient than regular UDFs. Along the way, you will also see how you can read from Azure Cosmos DB as a source for your batch data. Finally, you'll see how you can repartition your data in memory to improve processing performance, you will use window functions to compute statistics on your data, and you'll combine DataFrames using union and join operations. When you're finished with this course, you will have the skills and ability to perform advanced transformations and aggregations on batch data, including defining and using user‑defined functions.