Processing Streaming Data Using Apache Spark Structured Streaming

Structured Streaming is the scalable, fault-tolerant stream processing engine in Apache Spark 2, which can be used to process high-velocity data streams.
Course info
Level
Intermediate
Updated
Nov 11, 2020
Duration
2h 35m
Table of contents
Course Overview
Getting Started with the Spark Standalone Cluster
Integrating Spark with Apache Kafka
Performing Windowing Operations on Streams
Performing Join Operations on Streams
Description

Stream processing applications work with continuously updated data and react to changes in real-time. In this course, Processing Streaming Data Using Apache Spark Structured Streaming, you'll focus on integrating your streaming application with the Apache Kafka reliable messaging service to work with real-world data such as Twitter streams.

First, you'll explore Spark's architecture and how it supports distributed processing at scale. Next, you'll install and work with the Apache Kafka reliable messaging service.
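Standing up the standalone cluster covered in the first module comes down to running Spark's bundled daemon scripts. A minimal sketch, assuming `SPARK_HOME` points at a local Spark 2.x installation (in Spark 3.x the worker scripts were renamed `start-worker.sh` / `stop-worker.sh`):

```shell
# Start the cluster master; its web UI is served on http://localhost:8080
# and it listens for workers on spark://localhost:7077 by default.
$SPARK_HOME/sbin/start-master.sh

# Attach a worker process to the master (Spark 2.x script name).
$SPARK_HOME/sbin/start-slave.sh spark://localhost:7077

# Tear the cluster down in the reverse order.
$SPARK_HOME/sbin/stop-slave.sh
$SPARK_HOME/sbin/stop-master.sh
```

Cluster-wide settings such as worker memory and core counts live in `$SPARK_HOME/conf/spark-env.sh`, which these scripts read on startup.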

Finally, you'll perform a number of transformation operations on Twitter streams, including windowing and join operations.
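The core idea behind a tumbling-window aggregation can be pictured without Spark at all: events are bucketed into fixed, non-overlapping time intervals and aggregated per bucket. A plain-Python sketch of that logic (the function name and event format are illustrative, not from the course):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed, non-overlapping windows
    and count occurrences per key, mimicking a tumbling-window aggregation."""
    counts = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        # Every timestamp maps to the start of exactly one window.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start][key] += 1
    return {w: dict(k) for w, k in counts.items()}

events = [(0, "#spark"), (5, "#kafka"), (12, "#spark"), (19, "#spark")]
print(tumbling_window_counts(events, 10))
# {0: {'#spark': 1, '#kafka': 1}, 10: {'#spark': 2}}
```

In Spark Structured Streaming the same grouping is expressed declaratively with the `window()` function on an event-time column, and Spark additionally handles late-arriving data via watermarks.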

When you're finished with this course, you'll have the skills and knowledge to work with high-volume, high-velocity data using Spark and to integrate with Apache Kafka to process streaming data.

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds four patents for its real-time collaborative editing framework.

Section Introduction Transcripts

Course Overview
Hi. My name is Janani Ravi, and welcome to this course on Processing Streaming Data Using Apache Spark Structured Streaming. A little about myself. I have a master's degree in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. At Google, I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content.

In this course, you'll focus on integrating your streaming applications with the Apache Kafka reliable messaging service to work with real-world data such as Twitter streams. First, you'll explore Spark's architecture to support distributed processing at scale. You'll set up a Spark standalone cluster on your local machine and configure the cluster using Spark configuration files. You'll also use the cluster install scripts to start and stop the master and worker processes. Next, you'll install and work with the Apache Kafka reliable messaging service. You'll understand how Kafka publishers, consumers, and topics work, and you'll integrate your Spark streaming application to read and write data to topics in Kafka. You'll also set up a Twitter developer account, which you will use to stream Twitter messages to a Kafka topic, which can then be processed by your streaming application.

Finally, you will perform a number of transformation operations on Twitter streams, including windowing and join operations. You'll also see how you can perform sentiment analysis on each incoming Twitter message. We'll round this course off by exploring how you can perform unit testing and end-to-end testing of your streaming application. When you're finished with this course, you will have the skills and knowledge to work with high-volume and high-velocity data using Spark and integrate with Apache Kafka to process streaming data.
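The Kafka read-and-write integration described above uses Spark's built-in Kafka source and sink. A hedged PySpark sketch, assuming a broker on `localhost:9092`; the topic names `tweets` and `tweet_counts` and the checkpoint path are placeholders, and running it requires a live Spark cluster and Kafka broker, so it is shown untested:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = (SparkSession.builder
         .appName("TwitterStreamSketch")
         .getOrCreate())

# Read from Kafka: each record arrives with binary `key`/`value` columns
# plus metadata such as `timestamp`.
tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "tweets")          # hypothetical topic name
          .load()
          .selectExpr("CAST(value AS STRING) AS text", "timestamp"))

# One-minute tumbling-window counts of incoming tweets.
counts = tweets.groupBy(window(col("timestamp"), "1 minute")).count()

# Serialize each aggregate row to JSON and write it to another topic.
query = (counts.selectExpr("to_json(struct(*)) AS value")
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "tweet_counts")         # hypothetical topic name
         .option("checkpointLocation", "/tmp/tweet_counts_ckpt")
         .outputMode("update")
         .start())

query.awaitTermination()
```

The checkpoint location is mandatory for the Kafka sink: it is where Spark records stream progress so the query can recover exactly where it left off after a failure.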