Conceptualizing the Processing Model for Apache Spark Structured Streaming

Much real-world data arrives as streams, from self-driving car sensors to weather monitors. Apache Spark 2 is a powerful analytics engine with first-class support for streaming operations using micro-batch and continuous processing.
Course info
Level
Intermediate
Updated
Sep 18, 2020
Duration
2h 56m
Table of contents
Course Overview
Getting Started with Structured Streaming
Executing Streaming Queries
Understanding Scheduling and Checkpointing
Configuring Processing Models
Understanding Query Planning
Description

Structured Streaming in Spark 2 is a unified model that treats batch as a prefix of stream. This allows Spark to perform the same operations on streaming data as on batch data, and Spark takes care of the details involved in incrementalizing the batch operation to work on streams.
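The "batch as a prefix of stream" idea can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark code: the names `batch_word_count` and `IncrementalWordCount` are hypothetical, and the running `Counter` only loosely stands in for the state Spark maintains between triggers. It shows that an incrementally maintained result matches the batch result over the prefix of the stream seen so far.

```python
from collections import Counter


def batch_word_count(lines):
    """Batch computation: count words over a complete, bounded dataset."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


class IncrementalWordCount:
    """Hypothetical incremental version that keeps state across micro-batches."""

    def __init__(self):
        self.state = Counter()

    def process(self, micro_batch):
        # Fold only the newly arrived lines into the running state;
        # earlier input is never re-read.
        self.state.update(batch_word_count(micro_batch))
        return self.state


# Illustrative input: three micro-batches arriving over time.
stream = [["spark streams", "spark sql"], ["streams of data"], ["spark"]]

query = IncrementalWordCount()
seen = []
for micro_batch in stream:
    seen.extend(micro_batch)
    result = query.process(micro_batch)
    # After every micro-batch, the incremental result equals the batch
    # result computed over the prefix of the stream seen so far.
    assert result == batch_word_count(seen)

final_counts = dict(query.state)
```

In Spark itself, the engine derives the incremental plan from the batch-style query; the hand-written state management above only makes the equivalence visible.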

In this course, Conceptualizing the Processing Model for Apache Spark Structured Streaming, you will use the DataFrame API as well as Spark SQL to run queries on streaming sources and write results out to data sinks.

First, you will be introduced to streaming DataFrames in Spark 2 and understand how Structured Streaming in Spark 2 differs from the Spark Streaming API available in earlier versions of Spark. You will also get a high-level understanding of how Spark’s architecture works, and the role of drivers, workers, executors, and tasks.

Next, you will execute queries on streaming data from a socket source as well as a file system source. You will perform basic operations on streaming data using DataFrames and register your data as a temporary view to run SQL queries on input streams. You will explore the append, complete, and update modes to write data out to sinks. You will then understand how scheduling and checkpointing work in Spark and explore the differences between the micro-batch mode of execution and the new experimental continuous processing mode that Spark offers.
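To make the difference between the three output modes concrete, here is a small plain-Python model of what a sink might receive on each trigger. It is a simplified sketch under stated assumptions, not the Spark API: `run_counts` and `run_filter_append` are hypothetical helpers, and the append case is shown on a filter-only query because append emits only rows that will not change later.

```python
from collections import Counter


def run_counts(micro_batches, mode):
    """Model a running count and return what the sink sees per trigger."""
    table = Counter()  # the conceptual "result table"
    emitted = []
    for batch in micro_batches:
        changed = set(batch)  # keys whose counts change in this trigger
        table.update(batch)
        if mode == "complete":
            # Complete: the entire result table is written every trigger.
            emitted.append(dict(table))
        elif mode == "update":
            # Update: only rows that changed since the last trigger.
            emitted.append({key: table[key] for key in sorted(changed)})
        else:
            raise ValueError(f"unsupported mode for this sketch: {mode}")
    return emitted


def run_filter_append(micro_batches, predicate):
    """Model append mode on a filter-only query: each new row is final."""
    return [[row for row in batch if predicate(row)] for batch in micro_batches]


# Illustrative input: three micro-batches of keys.
batches = [["a", "b"], ["a"], ["c", "a"]]

complete_out = run_counts(batches, "complete")
update_out = run_counts(batches, "update")
append_out = run_filter_append(batches, lambda row: row != "b")
```

In this model, the complete output grows with the result table on every trigger, while the update and append outputs stay proportional to the new data.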

Finally, you will explore the Tungsten engine optimizations that make Spark 2 so much faster than Spark 1, and walk through the stages of optimization in the Catalyst optimizer, which works with SQL queries.

At the end of this course, you will be able to build and execute streaming queries on input data, write these out to reliable storage using different output modes, and checkpoint your streaming applications for fault tolerance and recovery.
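The role of checkpointing in recovery can be sketched in plain Python. This is a deliberately simplified model, not Spark's actual checkpoint format: the function `run_query`, the JSON layout, and the `fail_after` crash switch are all assumptions made for illustration. The idea it shows is that progress (an offset) and state (a running total) are saved after each micro-batch, so a restarted query resumes where it left off instead of reprocessing everything.

```python
import json
import os
import tempfile


def run_query(source, checkpoint_path, batch_size=2, fail_after=None):
    """Sum a list in micro-batches, checkpointing offset and state each time."""
    # Restore progress from the checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            checkpoint = json.load(f)
    else:
        checkpoint = {"offset": 0, "total": 0}
    offset, total = checkpoint["offset"], checkpoint["total"]

    batches_done = 0
    while offset < len(source):
        if fail_after is not None and batches_done == fail_after:
            raise RuntimeError("simulated crash")  # hypothetical failure
        batch = source[offset:offset + batch_size]
        total += sum(batch)
        offset += len(batch)
        # Write to a temporary file and rename it, so a crash never
        # leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"offset": offset, "total": total}, f)
        os.replace(tmp_path, checkpoint_path)
        batches_done += 1
    return total


source = [1, 2, 3, 4, 5, 6, 7]
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    try:
        run_query(source, path, fail_after=2)  # crashes after two batches
    except RuntimeError:
        pass
    with open(path) as f:
        resumed_from = json.load(f)["offset"]
    recovered_total = run_query(source, path)  # restart resumes from checkpoint
```

In Spark, the equivalent behavior is enabled by giving the streaming query a checkpoint location on reliable storage; the engine then manages offsets and state itself.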

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

More from the author
More courses by Janani Ravi
Section Introduction Transcripts

Course Overview
[Autogenerated] Hi, my name is Janani Ravi, and welcome to this course on Conceptualizing the Processing Model for Apache Spark Structured Streaming. A little about myself: I have a master's in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. I currently work on my own startup, Loonycorn, a studio for high-quality video content. In this course, you will use the DataFrame API as well as Spark SQL to run queries on streaming sources and write results out to data sinks. First, you'll get introduced to streaming DataFrames in Spark 2 and understand how Structured Streaming in Spark 2 is different from Spark Streaming available in earlier versions of Spark. You'll also get a high-level understanding of how Spark's architecture works and the role of drivers, workers, executors, and tasks. Next, you will execute queries on streaming data from a socket source as well as using a file system source. You will perform basic operations on streaming data using DataFrames. You will then register your data as a temporary view to run SQL queries on input streams. You'll explore the append, complete, and update modes to write data out to sinks. You'll then understand how scheduling and checkpointing work in Spark. And you'll also explore the differences between the micro-batch mode of execution and the new experimental continuous processing mode that Spark offers. Finally, we will discuss the Tungsten engine optimizations, which make Spark 2 so much faster than Spark 1. And we'll also discuss the stages of optimization in the Catalyst optimizer that works with SQL queries. At the end of this course, you'll be able to build and execute streaming queries on input data, write these out to reliable storage using different output modes, and checkpoint your streaming applications for fault tolerance and recovery.