Exploring the Apache Beam SDK for Modeling Streaming Data for Processing

Apache Beam is an open-source, unified model for processing batch and streaming data in parallel. Originally built to support Google’s Cloud Dataflow backend, Beam pipelines can now be executed on any supported distributed processing backend.
Course info
Level
Beginner
Updated
Sep 18, 2020
Duration
3h 28m
Table of contents
Course Overview
Understanding Pipelines, PCollections, and PTransforms
Executing Pipelines to Process Streaming Data
Applying Transformations to Streaming Data
Working with Windowing and Join Operations
Perform SQL Queries on Streaming Data
Description

Apache Beam SDKs can represent and process both finite and infinite datasets using the same programming model. All data processing tasks are defined using a Beam pipeline and are represented as directed acyclic graphs. These pipelines can then be executed on multiple execution backends such as Google Cloud Dataflow, Apache Flink, and Apache Spark.

In this course, Exploring the Apache Beam SDK for Modeling Streaming Data for Processing, we will explore Beam APIs for defining pipelines, executing transforms, and performing windowing and join operations.

First, you will understand and work with the basic components of a Beam pipeline: PCollections and PTransforms. You will work with PCollections holding different kinds of elements and see how you can specify the schema for PCollection elements. You will then configure these pipelines using custom options and execute them on backends such as Apache Flink and Apache Spark.

Next, you will explore the different kinds of core transforms that you can apply to streaming data for processing. These include ParDo with DoFns, GroupByKey, CoGroupByKey for join operations, and the Flatten and Partition transforms.
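
To make the grouping and reshaping transforms concrete, here is a minimal plain-Python sketch of what GroupByKey, CoGroupByKey, Flatten, and Partition compute on small in-memory lists. It is not the Beam API: the function names and the sample data are hypothetical stand-ins chosen for illustration, and a real Beam pipeline applies these transforms per window, in parallel, and without a guaranteed ordering of grouped values.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value that shares a key (the idea behind GroupByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def co_group_by_key(**tagged_pairs):
    """Group several keyed collections at once (the idea behind CoGroupByKey).

    Each key maps to one list per input tag; a tag with no values for a key
    gets an empty list, which is what makes outer-join style logic possible.
    """
    result = defaultdict(lambda: {tag: [] for tag in tagged_pairs})
    for tag, pairs in tagged_pairs.items():
        for key, value in pairs:
            result[key][tag].append(value)
    return dict(result)


def flatten(*collections):
    """Merge several collections of the same element type into one."""
    return [element for coll in collections for element in coll]


def partition(elements, num_partitions, partition_fn):
    """Split one collection into a fixed number of smaller collections."""
    parts = [[] for _ in range(num_partitions)]
    for element in elements:
        parts[partition_fn(element, num_partitions)].append(element)
    return parts


# Hypothetical sample data, used only to exercise the helpers above.
emails = [("amy", "amy@example.com")]
phones = [("amy", "111"), ("bob", "222")]
joined = co_group_by_key(emails=emails, phones=phones)
evens_and_odds = partition([1, 2, 3, 4, 5], 2, lambda n, _: n % 2)
```

In this sketch, `joined["bob"]` has an empty `emails` list, which is the case an inner join would drop and an outer join would keep.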

You will then see how you can perform windowing operations on input streams and apply fixed windows, sliding windows, session windows, and global windows to your streaming data. You will use the join extension library to perform inner and outer joins on datasets.
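
The window types named above differ mainly in how an element's timestamp is mapped to one or more time intervals. The following plain-Python sketch models that assignment with integer timestamps in seconds; it is an illustration under simplifying assumptions (hypothetical function names, half-open `[start, end)` windows aligned to zero, no late data or triggers), not Beam's windowing API. A global window would simply place every element in one single window.

```python
def fixed_window(timestamp, size):
    """Return the one fixed window [start, end) that contains the timestamp."""
    start = timestamp - timestamp % size
    return (start, start + size)


def sliding_windows(timestamp, size, period):
    """Return every sliding window [start, end) that contains the timestamp.

    Windows are `size` long and a new one starts every `period` seconds, so
    an element belongs to several overlapping windows when size > period.
    """
    windows = []
    start = timestamp - timestamp % period  # latest window start <= timestamp
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps, gap):
    """Merge timestamps into sessions separated by at least `gap` of inactivity.

    Each element starts as its own window [t, t + gap); overlapping windows
    are merged into one session.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t < sessions[-1][1]:
            # Element falls inside the current session: extend its end.
            sessions[-1] = (sessions[-1][0], max(sessions[-1][1], t + gap))
        else:
            sessions.append((t, t + gap))
    return sessions
```

For example, with 60-second windows sliding every 30 seconds, a timestamp of 75 falls into two windows, while the same timestamp falls into exactly one 60-second fixed window.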

Finally, you will configure the metrics that you want tracked during pipeline execution, including counter, distribution, and gauge metrics, and then round off the course by executing SQL queries on input data.
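
The three metric kinds differ in what they remember about the values reported to them. As a rough plain-Python model (the class names mirror the metric kinds, but these are hypothetical stand-ins, not Beam's metrics classes): a counter accumulates increments, a distribution summarizes reported values as count, sum, minimum, and maximum, and a gauge keeps only the most recently reported value.

```python
class Counter:
    """Accumulates increments, e.g. the number of elements processed."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


class Distribution:
    """Summarizes reported values without storing each one."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.min = None
        self.max = None

    def update(self, value):
        self.count += 1
        self.total += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    @property
    def mean(self):
        return self.total / self.count if self.count else None


class Gauge:
    """Remembers only the latest reported value."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


# Hypothetical usage: track word lengths as elements flow through a step.
words_seen = Counter()
word_lengths = Distribution()
last_length = Gauge()
for word in ["beam", "pipeline", "sql"]:
    words_seen.inc()
    word_lengths.update(len(word))
    last_length.set(len(word))
```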

When you are finished with this course, you will have the skills and knowledge to perform a wide range of data processing tasks using core Beam transforms, and you will be able to track metrics and run SQL queries on input streams.

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

More courses by Janani Ravi
Section Introduction Transcripts

Course Overview
[Autogenerated] Hi, my name is Janani Ravi, and welcome to this course on Exploring the Apache Beam SDK for Modeling Streaming Data for Processing. A little about myself: I have a master's in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. I currently work on my own startup, Loonycorn, a studio for high-quality video content. In this course, we will explore Beam APIs for defining pipelines, executing transforms, and performing windowing and join operations. First, you'll understand and work with the basic components of a Beam pipeline: PCollections and PTransforms. You'll work with PCollections holding different kinds of elements, and you'll see how you can specify the schema for these PCollection elements. You will then configure these pipelines using custom options and execute them on backends such as Apache Flink and Apache Spark. Next, you will explore the different kinds of core transforms that you can apply to streaming data for processing. These include ParDo and DoFns, GroupByKey, CoGroupByKey for join operations, and the Flatten and Partition transforms. You will then see how you can perform windowing operations on input streams and apply fixed windows, sliding windows, session windows, and global windows to your streaming data. You will then use the join extension library to perform inner and outer joins on datasets. Finally, you'll configure metrics that you want tracked during pipeline execution, using counter metrics, distribution metrics, and gauge metrics. You'll round this course off by executing SQL queries on input data. When you're finished with this course, you will have the skills and knowledge to perform a wide range of data processing tasks using core Beam transforms, and you will be able to track metrics and run SQL queries on input streams.