Stream Processing with Apache Spark Structured Streaming and Azure Databricks

Authors: Eugene Meidinger, Janani Ravi, Mohit Batra

Streaming data is used to make decisions and take actions in real time. The processing of streaming data must support these virtually immediate results, by the stateful analysis...

What You Will Learn

  • The fundamentals of modeling streaming data
  • Governance and quality concerns around streaming data
  • Streaming data processing using Apache Spark Structured Streaming
  • Streaming data processing in Azure Databricks

Prerequisites

  • Java
  • Distributed Systems Literacy
  • Basic SQL
  • Relational Database Design Literacy

Beginner

Understand the processing model for streaming data, as implemented in Apache Spark Structured Streaming.

Modeling Streaming Data for Processing with Apache Spark Structured Streaming

by Eugene Meidinger

Sep 23, 2020 / 1h 19m

Description

Streaming analytics can be difficult to set up, especially when working with late data arrivals and other variables. In this course, Modeling Streaming Data for Processing with Apache Spark Structured Streaming, you’ll learn to model your data for real-time analysis. First, you’ll explore applying batch processing to streaming data. Next, you’ll discover aggregating and outputting data. Finally, you’ll learn how to handle late arrivals and job failures. When you’re finished with this course, you’ll have the skills and knowledge of Spark Structured Streaming needed to combine your batch and streaming analytics jobs.
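
As a taste of the late-arrival handling covered in the final module, here is a minimal sketch using a watermark. It uses Spark’s built-in rate source so it is self-contained; the window and watermark durations are illustrative values, not recommendations.

```python
# A minimal sketch of tolerating late arrivals with a watermark. The built-in
# "rate" source generates (timestamp, value) rows, so no external data is needed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("late-arrivals-sketch").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Events arriving more than 5 minutes behind the latest event time seen so far
# are dropped rather than reopening their window's aggregate state.
counts = (events
          .withWatermark("timestamp", "5 minutes")
          .groupBy(window("timestamp", "10 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("update")   # emit only the windows changed by each micro-batch
         .format("console")
         .start())
query.awaitTermination()
```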

Table of contents
  1. Course Overview
  2. Comparing Batch and Stream Processing
  3. Understanding Structured Streaming
  4. Grouping and Aggregating Data
  5. Handling Late Arrivals and Failures

Conceptualizing the Processing Model for Apache Spark Structured Streaming

by Janani Ravi

Sep 18, 2020 / 2h 56m

Description

Structured Streaming in Spark 2 is a unified model that treats batch as a prefix of stream. This allows Spark to perform the same operations on streaming data as on batch data, and Spark takes care of the details involved in incrementalizing the batch operation to work on streams.

In this course, Conceptualizing the Processing Model for Apache Spark Structured Streaming, you will use the DataFrame API as well as Spark SQL to run queries on streaming sources and write results out to data sinks.

First, you will be introduced to streaming DataFrames in Spark 2 and understand how Structured Streaming in Spark 2 differs from the Spark Streaming API available in earlier versions of Spark. You will also get a high-level understanding of how Spark’s architecture works, and the role of drivers, workers, executors, and tasks.

Next, you will execute queries on streaming data from a socket source as well as a file system source. You will perform basic operations on streaming data using DataFrames and register your data as a temporary view to run SQL queries on input streams. You will explore the append, complete, and update modes for writing data out to sinks. You will then understand how scheduling and checkpointing work in Spark and explore the differences between the micro-batch mode of execution and the new experimental continuous processing mode that Spark offers.
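
A rough sketch of that workflow follows; the host, port, and checkpoint path are placeholders, and a local socket server (for example, one started with `nc -lk 9999`) is assumed.

```python
# A sketch of querying a socket source through a temporary view. Assumes a
# socket server is listening on localhost:9999 (e.g. started with `nc -lk 9999`).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("socket-sql-sketch").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Register the unbounded DataFrame as a view so Spark SQL can query it.
lines.createOrReplaceTempView("lines")
word_counts = spark.sql("""
    SELECT word, COUNT(*) AS n
    FROM (SELECT explode(split(value, ' ')) AS word FROM lines) AS words
    GROUP BY word
""")

query = (word_counts.writeStream
         .outputMode("complete")   # complete mode re-emits the full result table
         .option("checkpointLocation", "/tmp/wordcount-chk")  # hypothetical path
         .format("console")
         .start())
query.awaitTermination()
```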

Finally, you will learn about the Tungsten engine optimizations that make Spark 2 so much faster than Spark 1, and the stages of optimization in the Catalyst optimizer, which works on SQL queries.
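
To peek at those Catalyst stages yourself, Spark’s explain() prints the plans the optimizer produces; a small batch query is enough to see them.

```python
# A small sketch for inspecting the Catalyst optimizer's work: with
# extended=True, explain() prints the parsed, analyzed, and optimized logical
# plans plus the physical plan for a query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-sketch").getOrCreate()

df = (spark.range(1000)
      .filter("id % 2 = 0")
      .selectExpr("id * 2 AS doubled"))

df.explain(extended=True)
```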

At the end of this course, you will be able to build and execute streaming queries on input data, write these out to reliable storage using different output modes, and checkpoint your streaming applications for fault tolerance and recovery.

Table of contents
  1. Course Overview
  2. Getting Started with Structured Streaming
  3. Executing Streaming Queries
  4. Understanding Scheduling and Checkpointing
  5. Configuring Processing Models
  6. Understanding Query Planning

Intermediate

Build a streaming data processing pipeline using Apache Spark Structured Streaming.

Exploring the Apache Spark Structured Streaming API for Processing Streaming Data

by Janani Ravi

Sep 25, 2020 / 2h 48m

Description

Stream processing applications work with continuously updated data and react to changes in real-time. In this course, Exploring the Apache Spark Structured Streaming API for Processing Streaming Data, you'll focus on using the tabular DataFrame API as well as Spark SQL to work with streaming, unbounded datasets using the same APIs that work with bounded batch data.

First, you’ll explore Spark’s support for different data sources and data sinks, the use case for each, and the fault-tolerance semantics they offer. You’ll write data out to the console and file sinks, and customize your write logic with the foreach and foreachBatch sinks.
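
For instance, a hedged sketch of the foreachBatch sink: each micro-batch is handed to your function as an ordinary batch DataFrame, so any batch writer can be reused. The output and checkpoint paths below are hypothetical.

```python
# A sketch of the foreachBatch sink: Spark calls the function once per
# micro-batch, passing the batch as a regular DataFrame plus a batch id.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreach-batch-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

def write_batch(batch_df, batch_id):
    # batch_id lets a writer deduplicate if a failed batch is retried.
    batch_df.write.mode("append").parquet("/tmp/rate-sink")  # hypothetical path

query = (stream.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/rate-chk")      # hypothetical path
         .start())
query.awaitTermination()
```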

Next, you'll see how you can transform streaming data using operations such as selections, projections, grouping, and aggregations, using both the DataFrame API and Spark SQL. You'll also learn how to perform windowing operations on streams using tumbling and sliding windows. You'll then explore relational join operations between streaming and batch sources and learn the limitations on streaming joins in Spark.
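
A short sketch of two of those operations, a sliding window and a stream-static join; the column names and the lookup table are illustrative.

```python
# Sliding windows and a stream-static join, sketched against the built-in
# rate source. A 10-minute window sliding every 5 minutes places each event
# in two overlapping windows; omit the slide argument for a tumbling window.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("window-join-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

windowed = (stream
            .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"))
            .count())

# Stream-static join: enrich the stream with a small batch lookup table.
lookup = spark.createDataFrame([(0, "even"), (1, "odd")], ["key", "label"])
enriched = (stream
            .withColumn("key", col("value") % 2)
            .join(lookup, "key"))

query = enriched.writeStream.format("console").start()
query.awaitTermination()
```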

Finally, you'll explore Spark’s support for managing and monitoring streaming queries using the Spark Web UI and the Spark History server.

When you're finished with this course, you'll have the skills and knowledge to work with different sources and sinks for your streaming data, apply a range of processing operations on input streams, and perform windowing and join operations on streams.

Table of contents
  1. Course Overview
  2. Exploring Sources and Sinks
  3. Processing Streaming Data Frames
  4. Performing Windowing Operations on Streams
  5. Working with Streaming Joins
  6. Managing and Monitoring Streaming Queries

Processing Streaming Data Using Apache Spark Structured Streaming

by Janani Ravi

Nov 11, 2020 / 2h 35m

Description

Stream processing applications work with continuously updated data and react to changes in real-time. In this course, Processing Streaming Data Using Apache Spark Structured Streaming, you'll focus on integrating your streaming application with the Apache Kafka reliable messaging service to work with real-world data such as Twitter streams.

First, you’ll explore Spark’s architecture for supporting distributed processing at scale. Next, you will install and work with the Apache Kafka reliable messaging service.
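
A minimal sketch of wiring Spark to Kafka follows; the broker address and topic name are placeholders, and the spark-sql-kafka connector package must be on the classpath.

```python
# Reading a Kafka topic as a stream. Assumes a broker at localhost:9092 and
# a topic named "tweets" -- both placeholders for this sketch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-sketch").getOrCreate()

tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "tweets")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers binary key/value columns; cast value to a string before
# applying transformations.
messages = tweets.selectExpr("CAST(value AS STRING) AS message")

query = messages.writeStream.format("console").start()
query.awaitTermination()
```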

Finally, you'll perform a number of transformation operations on Twitter streams, including windowing and join operations.

When you're finished with this course, you will have the skills and knowledge to work with high-volume, high-velocity data using Spark and to integrate with Apache Kafka to process streaming data.

Table of contents
  1. Course Overview
  2. Getting Started with the Spark Standalone Cluster
  3. Integrating Spark with Apache Kafka
  4. Performing Windowing Operations on Streams
  5. Performing Join Operations on Streams

Advanced

Apply your streaming data knowledge inside of Azure Databricks.

Conceptualizing the Processing Model for Azure Databricks Service

by Mohit Batra

Jul 21, 2020 / 2h 51m

Description

Modern data pipelines often include streaming data that needs to be processed in real time. While Apache Spark is very popular for big data processing and can help us build reliable streaming pipelines, managing the Spark environment is no cakewalk.

In this course, Conceptualizing the Processing Model for Azure Databricks Service, you will learn how to use Spark Structured Streaming on the Databricks platform, running on Microsoft Azure, and leverage its features to build an end-to-end streaming pipeline quickly and reliably, all while learning about the collaboration options and optimizations the platform brings, without worrying about infrastructure management.

First, you will learn about the processing model of Spark Structured Streaming, about the Databricks platform and its features, and how it runs on Microsoft Azure.

Next, you will see how to set up the environment, including the workspace, clusters, and security; configure streaming sources and sinks; and see how Structured Streaming fault tolerance works.

After that, you will learn how to build each phase of the streaming pipeline by extracting data from a source, transforming it, and loading it into a sink, as sketched below. You will then make the pipeline production-ready and run it using Databricks jobs.
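
As a hedged sketch of that extract-transform-load shape in a Databricks notebook: the paths, schema, and Auto Loader source below are assumptions for illustration, not the course’s exact pipeline.

```python
# Extract-transform-load on Databricks, sketched with Auto Loader ("cloudFiles")
# as the source and Delta Lake as the sink. In a notebook, `spark` is predefined.
from pyspark.sql.functions import col

raw = (spark.readStream
       .format("cloudFiles")                  # Databricks Auto Loader file source
       .option("cloudFiles.format", "json")
       .schema("device STRING, temperature DOUBLE, ts TIMESTAMP")
       .load("/mnt/landing/telemetry"))       # hypothetical mount point

transformed = raw.filter(col("temperature").isNotNull())   # transform phase

(transformed.writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", "/mnt/checkpoints/telemetry")  # enables recovery
 .start("/mnt/tables/telemetry"))             # hypothetical Delta table path
```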

You will also see how to customize the cluster using initialization scripts and Docker containers to suit your business requirements.

Finally, you will explore other aspects: the different workloads available and how pricing works. You will also cover best practices for development, performance, stability, and cost. Lastly, you will see how Spark Structured Streaming on Azure Databricks compares to other managed services, such as Flink on AWS, Azure Stream Analytics, and Beam on Google Cloud.

By the end of this course, you will have the skills and knowledge of the Azure Databricks platform needed to build an end-to-end streaming pipeline using Spark Structured Streaming.

Table of contents
  1. Course Overview
  2. Getting Started with Structured Streaming on Azure Databricks
  3. Setting up Databricks Environment
  4. Configuring Source and Sink Stores
  5. Building Streaming Pipeline Using Structured Streaming
  6. Making Streaming Pipeline Production Ready
  7. Understanding Pricing, Workloads, and Competition
  8. Customizing the Cluster

Handling Streaming Data with Azure Databricks Using Spark Structured Streaming

by Mohit Batra

Nov 25, 2020 / 2h 27m

Description

Modern data pipelines often include streaming data that needs to be processed in real time. In a practical scenario, you would be required to deal with multiple streams and datasets to continuously produce results. In this course, Handling Streaming Data with Azure Databricks Using Spark Structured Streaming, you will learn how to use Spark Structured Streaming on the Databricks platform, running on Microsoft Azure, and leverage its features to build end-to-end streaming pipelines.

First, you will see a quick recap of the Spark Structured Streaming processing model, understand the scenario you will implement, and complete the environment setup.

Next, you will learn how to configure sources and sinks and build each phase of the streaming pipeline: extracting the data from various sources, transforming it, and loading it into multiple sinks (Azure Data Lake, Azure Event Hubs, and Azure SQL). You will also see the different timestamps associated with an event and how to aggregate data using windows. You will then see how to combine a stream with static or historical datasets, and how to combine multiple streams together (see the sketch after this description).

Finally, you will learn how to build a production-ready pipeline, schedule it as a job in Databricks, and manage jobs using the Databricks CLI. When you are finished with this course, you will be comfortable building complex streaming pipelines, running on Azure Databricks, to solve a variety of business problems.
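
A sketch of that last pattern, joining two streams: the rate sources and column names stand in for real feeds, and both sides carry watermarks so Spark can bound its join state.

```python
# Joining two streams. Watermarks on both sides let Spark discard state for
# rows that can no longer find a match. Sources and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("stream-join-sketch").getOrCreate()

impressions = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
               .selectExpr("value AS ad_id", "timestamp AS imp_time")
               .withWatermark("imp_time", "10 minutes"))

clicks = (spark.readStream.format("rate").option("rowsPerSecond", 2).load()
          .selectExpr("value AS ad_id", "timestamp AS click_time")
          .withWatermark("click_time", "20 minutes"))

# Match each click to an impression of the same ad within the previous hour.
joined = impressions.alias("i").join(
    clicks.alias("c"),
    expr("""i.ad_id = c.ad_id AND
            c.click_time BETWEEN i.imp_time AND i.imp_time + INTERVAL 1 HOUR"""))

query = joined.writeStream.format("console").start()
query.awaitTermination()
```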

Table of contents
  1. Course Overview
  2. Setting up the Environment
  3. Building Streaming Pipeline
  4. Working with Timestamps and Windows
  5. Handling Stateful Operations
  6. Working with Multiple Streams and Datasets
  7. Running Streaming Pipeline in Production