Processing Streaming Data with Apache Spark on Databricks

by Janani Ravi

This course will teach you how to use Spark abstractions for streaming data and perform transformations on streaming data using the Spark structured streaming APIs on Azure Databricks.

Course info

Rating: (25)
Level: Intermediate
Updated: Oct 25, 2021
Duration: 2h 1m

What you'll learn

Structured Streaming in Apache Spark treats real-time data as a table that is continuously appended to. This leads to a stream processing model that uses the same APIs as the batch processing model; it is up to Spark to incrementalize batch operations so that they work on the stream. The burden of stream processing shifts from the user to the system, so a streaming query can be written much like a batch query.
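To make the idea of incrementalizing a batch operation concrete, here is a minimal sketch in plain Python. It is a conceptual model only and does not use the Spark APIs; the names `batch_count` and `IncrementalCount`, as well as the sample events, are hypothetical and chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List


def batch_count(rows: Iterable[str]) -> Counter:
    """Batch model: count every row of the complete table in one pass."""
    return Counter(rows)


class IncrementalCount:
    """Streaming model: keep running state and fold in each micro-batch."""

    def __init__(self) -> None:
        self.state: Counter = Counter()

    def process(self, micro_batch: Iterable[str]) -> Counter:
        # Only the newly arrived rows are read; earlier rows are never rescanned.
        self.state.update(micro_batch)
        return Counter(self.state)


# Hypothetical event stream, delivered as three micro-batches.
micro_batches: List[List[str]] = [
    ["click", "view"],
    ["click"],
    ["view", "view", "purchase"],
]

query = IncrementalCount()
result = Counter()
for micro_batch in micro_batches:
    result = query.process(micro_batch)

# The same rows, seen as one static table by the batch version.
all_rows = [row for micro_batch in micro_batches for row in micro_batch]
print(result == batch_count(all_rows))
```

The point of the sketch is that the user describes the computation once (a count per event type), while the engine decides how to maintain the result as new rows are appended.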

In this course, Processing Streaming Data with Apache Spark on Databricks, you’ll learn to stream and process data using abstractions provided by Spark structured streaming. First, you’ll understand the difference between batch processing and stream processing and see the different models that can be used to process streaming data. You will also explore the structure and configurations of the Spark structured streaming APIs.

Next, you will learn how to read from a streaming source using Auto Loader on Azure Databricks. Auto Loader automates the process of reading streaming data from a file system and takes care of file management and the tracking of processed files, which simplifies ingesting data from external cloud storage sources. You will then perform transformations and aggregations on streaming data and write data out to storage using the append, complete, and update output modes.
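The difference between the three output modes can be sketched with a small plain-Python model. This is an illustration under simplifying assumptions, not Spark code: the function `emitted_per_trigger` is hypothetical, append mode is modeled on a query without aggregation, and complete and update modes are modeled on a running count per key.

```python
from collections import Counter
from typing import Any, List


def emitted_per_trigger(micro_batches: List[List[str]], mode: str) -> List[Any]:
    """Return what a sink would receive after each trigger in this toy model."""
    counts: Counter = Counter()
    emitted: List[Any] = []
    for batch in micro_batches:
        if mode == "append":
            # No aggregation: each new row is written exactly once.
            emitted.append(list(batch))
            continue
        counts.update(batch)
        if mode == "complete":
            # The entire result table is rewritten on every trigger.
            emitted.append(dict(counts))
        elif mode == "update":
            # Only the result rows that changed in this trigger are written.
            emitted.append({key: counts[key] for key in set(batch)})
        else:
            raise ValueError(f"unknown output mode: {mode}")
    return emitted


batches = [["a", "b"], ["a"]]
for mode in ("append", "complete", "update"):
    print(mode, emitted_per_trigger(batches, mode))
```

In this model the second trigger emits one new row in append mode, both keys in complete mode, and only the key `a` in update mode, because `b` did not change.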

Finally, you will learn how to use SQL-like abstractions on input streams. You will connect to an external cloud storage source, an Amazon S3 bucket, and read in your stream using Auto Loader. You will then run SQL queries to process your data. Along the way, you will make your stream processing resilient to failures using checkpointing and you will also implement your stream processing operation as a job on a Databricks Job Cluster.
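As a rough illustration of why checkpointing makes a file-based stream resilient to restarts, the following plain-Python sketch records which files have already been processed and reloads that record on startup. It is a conceptual model, not the Auto Loader or Spark checkpoint implementation: the class `FileStreamReader`, the JSON checkpoint format, and the file names are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Set


class FileStreamReader:
    """Toy file source that remembers processed files across restarts."""

    def __init__(self, source_dir: Path, checkpoint_file: Path) -> None:
        self.source_dir = source_dir
        self.checkpoint_file = checkpoint_file
        self.processed: Set[str] = set()
        if checkpoint_file.exists():
            # Recover progress that was saved before a failure or restart.
            self.processed = set(json.loads(checkpoint_file.read_text()))

    def next_batch(self) -> List[str]:
        """Return only files not seen before, then persist the progress."""
        new_files = sorted(
            path.name
            for path in self.source_dir.iterdir()
            if path.is_file() and path.name not in self.processed
        )
        self.processed.update(new_files)
        # Simplification: progress is recorded immediately, before any output
        # would be written by a downstream step.
        self.checkpoint_file.write_text(json.dumps(sorted(self.processed)))
        return new_files


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "landing"
    source.mkdir()
    checkpoint = Path(tmp) / "checkpoint.json"

    (source / "a.json").write_text("{}")
    first = FileStreamReader(source, checkpoint).next_batch()

    # Simulate a restart: a new reader object, plus one newly arrived file.
    (source / "b.json").write_text("{}")
    after_restart = FileStreamReader(source, checkpoint).next_batch()

print(first, after_restart)
```

Because the second reader loads the saved state, it skips `a.json` and picks up only `b.json`; without the checkpoint file it would reprocess everything in the directory.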

When you’re finished with this course, you’ll have the skills and knowledge of streaming data in Spark needed to process and monitor streams and identify use-cases for transformations on streaming data.

Course Overview

2mins

Course Overview 2m

Overview of the Streaming Architecture in Apache Spark

43mins

Applying Transformations on Streaming Data

38mins

Streaming Sources and Sinks 2m
Auto Loader 5m
Demo: Auto Loader and Rescued Data 6m
Demo: Writing Streams to File Sinks 5m
Demo: Performing Transformations on Streams 5m
Demo: Stream Processing 1m
Output Modes 5m
Demo: Append Mode 3m
Demo: Complete Mode 3m
Demo: Update Mode 3m

Executing SQL Queries on Streaming Data

37mins

Demo: Executing SQL Queries to Process Streams 7m
Demo: Creating an AWS User and S3 Bucket 4m
Demo: Mounting an S3 Bucket to DBFS 4m
Demo: Auto Loader to Read from an S3 Bucket Source 2m
Demo: Applying UDFs on Streaming Data 2m
Checkpointing 2m
Demo: Checkpointing 7m
Demo: Running a Streaming Job on a Cluster 5m
Demo: Viewing Job Results 2m
Summary and Further Study 1m

About the author

Janani Ravi

Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework. After spending years working in tech in the Bay Area, New York, and Singapore at companies such as Microsoft, Google, and Flipkart, Janani finally decided to combine her love for technology with her passion for teaching. She is now the co-founder of Loonycorn, a content studio focused on providing ...
