Structured Streaming in Apache Spark 2

Many sources of data in the real world are available in the form of streams, from self-driving car sensors to weather monitors. Apache Spark 2 is a powerful, distributed analytics engine that offers great support for streaming applications.
Course info
Rating: (21)
Level: Beginner
Updated: Jun 22, 2018
Duration: 2h 11m
Table of contents
Course Overview
Understanding the High Level Streaming API in Spark 2.x
Building Advanced Streaming Pipelines Using Structured Streaming
Integrating Apache Kafka with Structured Streaming
Description

Stream processing applications work with continuously updated data and react to changes in real-time. Data frames in Spark 2.x support infinite data, thus effectively unifying batch and streaming applications. In this course, Structured Streaming in Apache Spark 2, you'll focus on using the tabular data frame API to work with streaming, unbounded datasets using the same APIs that work with bounded batch data. First, you'll start off by understanding how structured streaming works and what makes it different and more powerful than traditional streaming applications, covering the basic streaming architecture and the improvements in structured streaming that allow it to react to data in real-time. Then, you'll create triggers to evaluate streaming results and use output modes to write results out to file or to screen. Next, you'll discover how you can build streaming pipelines using Spark by studying event-time aggregations, grouping and windowing functions, and how to perform join operations between batch and streaming data. You'll even work with real Twitter streams and perform analysis on trending hashtags on Twitter. Finally, you'll see how Spark stream processing integrates with the Kafka distributed publisher-subscriber system by ingesting Twitter data from a Kafka producer and processing it using Spark Streaming. By the end of this course, you'll be comfortable performing analysis of stream data using Spark's distributed analytics engine and its high-level structured streaming API.
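
To make those ideas concrete, here is a minimal PySpark sketch of a structured streaming query showing a streaming read, a trigger, and an output mode; the input directory, schema, and threshold are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("StructuredStreamingSketch").getOrCreate()

# File-based streaming sources need an explicit schema.
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("reading", IntegerType()),
])

# Spark treats new files arriving in the directory as rows appended
# to an unbounded table (hypothetical path).
readings = (spark.readStream
            .schema(schema)
            .csv("/tmp/sensor_data"))

# Same DataFrame operations as in batch code.
high = readings.where("reading > 100")

# The trigger controls how often results are evaluated; the output mode
# controls what gets written (append = only newly arrived rows).
query = (high.writeStream
         .outputMode("append")
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())

query.awaitTermination()

Swapping format("console") for a file sink such as format("parquet"), together with a checkpoint location, would write the same results to files rather than to the screen.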

About the author

A problem solver at heart, Janani has a Master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

More from the author
Analyzing Data with Qlik Sense (Intermediate, 2h 11m, Jun 17, 2019)
Using PyTorch in the Cloud: PyTorch Playbook (Intermediate, 2h 21m, Apr 25, 2019)
Section Introduction Transcripts

Course Overview
Hi, my name is Janani Ravi, and welcome to this course on Structured Streaming in Apache Spark 2. A little about myself: I have a master's degree in electrical engineering from Stanford, and have worked at companies such as Microsoft, Google, and Flipkart. At Google, I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content. In this course, we focus on using the tabular data frame API to work with streaming, unbounded datasets using the same APIs that work with bounded batch data. We start off by understanding how structured streaming works and what makes it different and more powerful than traditional streaming applications. We'll understand the basic streaming architecture and the improvements included in structured streaming, allowing it to react to data in real-time. We'll create triggers to evaluate streaming results, and output modes to write results out to file or to screen. We'll then see how we can build streaming pipelines using Spark. We'll study event-time aggregations, grouping and windowing functions, and how we perform join operations between batch and streaming data. We'll work with real Twitter streams, and perform analysis on trending hashtags on Twitter. We'll then see how Spark stream processing integrates with the Kafka distributed publisher-subscriber system. We'll ingest Twitter data from a Kafka producer and process it using Spark Streaming. At the end of this course, you should be comfortable performing analysis of stream data using Spark's distributed analytics engine and its high-level structured streaming API.

Understanding the High Level Streaming API in Spark 2.x
Hi, and welcome to this course on Structured Streaming in Apache Spark 2. Apache Spark today is one of the most popular distributed data processing and analytics engines, and the cool thing is it can work with streaming data as well. Processing streaming data has become extremely important nowadays as we get data in real-time from logs, sensors, and a variety of other sources. We'll start this module off by understanding the need for stream processing, and then move on to identifying the differences between batch and stream architectures. We'll talk briefly about how streaming works in Spark version 1: the core components in streaming are RDDs, which come together to form DStreams. We'll then see how streaming has changed in Spark version 2.x. The latest versions of Spark use the high-performance Tungsten engine and use DataFrames, data represented in rows and columns, as the basic developer API. We'll then talk about structured streaming in Spark 2, which unifies the batch and streaming architectures and uses the same API to work with batch data as well as stream data. This is possible in Spark 2 by modeling streams as unbounded datasets. The sketch below gives a small illustration of that unification.
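
In this PySpark sketch (file paths, format, and columns are hypothetical), the identical transformation runs on a bounded batch DataFrame and on an unbounded streaming one:

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("UnifiedAPISketch").getOrCreate()

def top_products(df: DataFrame) -> DataFrame:
    # Works unchanged whether df is a batch or a streaming DataFrame.
    return df.groupBy("product").count()

# Batch: a bounded dataset, read once (hypothetical path).
batch_df = spark.read.json("/tmp/orders_history")
top_products(batch_df).show()

# Streaming: another directory treated as an unbounded table,
# reusing the schema inferred from the batch read.
stream_df = spark.readStream.schema(batch_df.schema).json("/tmp/orders_live")

# Streaming aggregations need the complete (or update) output mode.
(top_products(stream_df)
 .writeStream
 .outputMode("complete")
 .format("console")
 .start()
 .awaitTermination())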

Building Advanced Streaming Pipelines Using Structured Streaming
Hi, and welcome to this module on Building Advanced Streaming Pipelines Using Structured Streaming. It's in this module that we'll get really hands-on. We'll see how to perform a variety of operations on streaming data, starting off with selections, projections, and aggregations on streaming entities. We'll also perform ad hoc SQL queries, which run on streaming data as opposed to batch data. We'll then see how we can perform windowing operations, which allow us to operate on a subset of streaming data. We'll work on some real-world streaming data by connecting to the Twitter streaming API using Tweepy. We'll then study what features Spark streaming offers to deal with late data. Lateness is the difference between the event time, that is, the time at which the entity was generated, and the processing time. Dealing with late data involves using a Spark feature called watermarks, which determine how late is actually late, and what to do with data when it comes in after it was supposed to.
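
As a rough PySpark sketch of the windowing and watermark ideas above (the socket source stands in for a real tweet stream; host, port, and window sizes are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("WindowWatermarkSketch").getOrCreate()

# Each line arriving on the socket stands in for a hashtag; with
# includeTimestamp, Spark attaches a timestamp column to every row.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .option("includeTimestamp", "true")
         .load())

# Count occurrences per 10-minute window, sliding every 5 minutes.
# The watermark tells Spark to keep windows open for up to 5 minutes
# of lateness before finalizing them and dropping later data.
counts = (lines
          .withWatermark("timestamp", "5 minutes")
          .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"),
                   col("value"))
          .count())

(counts.writeStream
 .outputMode("update")
 .format("console")
 .option("truncate", "false")
 .start()
 .awaitTermination())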

Integrating Apache Kafka with Structured Streaming
Hi, and welcome to this module on integrating Apache Kafka with Structured Streaming in Spark. Now if you haven't worked with Kafka before, Kafka is a powerful publisher/subscriber messaging technology which allows various publishers to publish messages to a queue, from which they can then be received by subscribers. When messages are published to Kafka, they are categorized by topic and stored in partitioned, replicated logs. As a subscriber, you'll specify the topic that you're interested in, and receive all messages on that topic. Kafka is a distributed technology, which means it's highly scalable and can process many millions of messages per second. Internally, Kafka uses Apache ZooKeeper for configuration services and synchronization of the servers in its cluster. Using Kafka as a publisher/subscriber message transport system, and using Spark streaming for processing those messages, is a common setup that you're likely to find in organizations.
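
Here is a minimal PySpark sketch of that setup (the broker address and topic name are assumptions, and the spark-sql-kafka-0-10 connector package needs to be on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaSketch").getOrCreate()

# Subscribe to a topic; Kafka delivers key and value as binary columns.
tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "tweets")  # hypothetical topic name
          .option("startingOffsets", "latest")
          .load())

# Cast the message payload to a string for downstream processing.
messages = tweets.selectExpr("CAST(value AS STRING) AS tweet")

(messages.writeStream
 .outputMode("append")
 .format("console")
 .start()
 .awaitTermination())

From here, the same windowing, aggregation, and watermark operations from the previous module apply to the Kafka-backed stream unchanged.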