
Processing Streaming Data with Apache Spark on Databricks

This course will teach you how to use Spark abstractions for streaming data and how to perform transformations on streams using the Spark Structured Streaming APIs on Azure Databricks.
Course info
Level
Intermediate
Updated
Oct 25, 2021
Duration
2h 51m
Description

Structured streaming in Apache Spark treats real-time data as a table that is being constantly appended. This leads to a stream processing model that uses the same APIs as a batch processing model - it is up to Spark to incrementalize our batch operations to work on the stream. The burden of stream processing shifts from the user to the system, making it very easy and intuitive to process streaming data with Spark.
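
To make the "stream as an ever-growing table" idea concrete, here is a minimal PySpark sketch. The path and schema are made up for illustration; note that only the read call differs between the batch and streaming versions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-as-table").getOrCreate()

# Batch: read the files that exist in the directory right now.
batch_df = (spark.read
            .format("json")
            .schema("id INT, amount DOUBLE")   # hypothetical schema
            .load("/data/events"))             # hypothetical path

# Streaming: treat the same directory as an unbounded, constantly appended table.
stream_df = (spark.readStream
             .format("json")
             .schema("id INT, amount DOUBLE")
             .load("/data/events"))

# The identical transformation works on both; for the stream, Spark
# incrementalizes it to process only newly arrived data.
high_value = stream_df.filter(col("amount") > 100)
```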

In this course, Processing Streaming Data with Apache Spark on Databricks, you’ll learn to stream and process data using abstractions provided by Spark structured streaming. First, you’ll understand the difference between batch processing and stream processing and see the different models that can be used to process streaming data. You will also explore the structure and configurations of the Spark structured streaming APIs.
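
The processing models the course covers map onto Structured Streaming's trigger configuration. As a rough sketch, reusing the hypothetical `stream_df` from the snippet above (sink and checkpoint paths are also placeholders):

```python
# Micro-batch model: Spark runs a batch over new data every 10 seconds.
micro_batch = (stream_df.writeStream
               .format("delta")
               .option("checkpointLocation", "/chk/micro")
               .trigger(processingTime="10 seconds")
               .start("/out/micro"))

# One-shot model: drain everything currently available, then stop (Spark 3.3+).
one_shot = (stream_df.writeStream
            .format("delta")
            .option("checkpointLocation", "/chk/once")
            .trigger(availableNow=True)
            .start("/out/once"))
```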

Next, you will learn how to read from a streaming source using Auto Loader on Azure Databricks. Auto Loader automates the process of reading streaming data from a file system and takes care of file management and tracking of processed files, making it very easy to ingest data from external cloud storage sources. You will then perform transformations and aggregations on streaming data and write data out to storage using the append, complete, and update output modes.
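
A minimal sketch of what this looks like with Auto Loader's `cloudFiles` source on Databricks, where `spark` is the session every notebook provides. The paths and the grouping column are hypothetical; `outputMode` selects how results are written:

```python
# Auto Loader ingestion: cloudFiles is a Databricks-specific streaming source.
events = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")                      # format of incoming files
          .option("cloudFiles.schemaLocation", "/mnt/chk/schema")   # where inferred schema is tracked
          .load("/mnt/raw/events"))                                 # hypothetical landing directory

# Streaming aggregation: running counts per country (hypothetical column).
counts = events.groupBy("country").count()

# "complete" mode re-emits the full aggregate table on every trigger;
# append and update are the other two output modes covered in the course.
query = (counts.writeStream
         .format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/mnt/chk/counts")
         .start("/mnt/out/counts"))
```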

Finally, you will learn how to use SQL-like abstractions on input streams. You will connect to an external cloud storage source, an Amazon S3 bucket, and read in your stream using Auto Loader. You will then run SQL queries to process your data. Along the way, you will make your stream processing resilient to failures using checkpointing and you will also implement your stream processing operation as a job on a Databricks Job Cluster.
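
As a hedged sketch of that flow: the bucket name and paths below are placeholders, and the query assumes a `category` column exists in the data. Registering the streaming DataFrame as a temp view lets plain SQL produce another streaming DataFrame, and the checkpoint location is what makes the query recoverable after a failure or restart:

```python
# Read the stream from S3 using Auto Loader.
events = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.inferColumnTypes", "true")
          .option("cloudFiles.schemaLocation", "s3://my-bucket/checkpoints/schema")
          .load("s3://my-bucket/landing/"))

# Expose the stream to SQL as a temporary view.
events.createOrReplaceTempView("events_stream")

# SQL over the view returns another streaming DataFrame.
result = spark.sql(
    "SELECT category, COUNT(*) AS n FROM events_stream GROUP BY category")

# Checkpointing lets the query resume from where it left off after a failure.
query = (result.writeStream
         .format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "s3://my-bucket/checkpoints/sql_counts")
         .start("s3://my-bucket/out/sql_counts"))
```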

When you’re finished with this course, you’ll have the skills and knowledge of streaming data in Spark needed to process and monitor streams and identify use cases for transformations on streaming data.

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

More from the author
Machine Learning for Financial Services
Beginner
1h 50m
Nov 24, 2021
Machine Learning for Healthcare
Beginner
1h 48m
Nov 24, 2021
Section Introduction Transcripts

Course Overview
Hi. My name is Janani Ravi, and welcome to this course on Processing Streaming Data with Apache Spark on Databricks. A little about myself. I have a master's degree in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. I currently work on my own startup, Loonycorn, a studio for high‑quality video content. Structured Streaming in Apache Spark treats real‑time data as a table that is being constantly appended. This leads to a stream processing model that uses the same APIs as a batch processing model. It's up to Spark to incrementalize our batch operations to work on streams. The burden of stream processing shifts from the user to the system. In this course, you'll learn to stream and process data using abstractions provided by Spark Structured Streaming. First, you'll understand the difference between batch processing and stream processing and see the different models that can be used to process streaming data. You'll also explore the structure and configurations of the Spark Structured Streaming APIs. Next, you will learn how to read from a streaming source using Auto Loader on Azure Databricks. Auto Loader automates the process of reading streaming data from a file system and takes care of file management and tracking of processed files. You will then perform transformations and aggregations on streaming data and write data out to storage using the append, complete, and update modes. Finally, you will learn how to use SQL‑like abstractions on input streams. You will connect to an external cloud storage source, an Amazon S3 bucket, and read in your stream using Auto Loader. You will then run SQL queries to process your data. Along the way, you'll make your stream processing resilient to failures using checkpointing, and you'll also implement your stream processing operation as a job on a Databricks job cluster. When you're finished with this course, you will have the skills and knowledge of streaming data in Spark needed to process and monitor streams.