Dataflow represents a fundamentally different approach to Big Data processing than computing engines such as Spark. Dataflow is serverless and fully managed, and it runs pipelines written with the Apache Beam APIs.
Dataflow allows developers to process and transform data using easy, intuitive APIs. Dataflow is built on the Apache Beam architecture and unifies batch and stream processing of data. In this course, Conceptualizing the Processing Model for the GCP Dataflow Service, you will be exposed to the full potential of Cloud Dataflow and its innovative programming model.
First, you will work with an example Apache Beam pipeline performing stream processing operations and see how it can be executed using the Cloud Dataflow runner.
Next, you will learn about the basic optimizations that Dataflow applies to your execution graph, such as fusion and combine optimizations.
Finally, you will explore how to run Dataflow pipelines using built-in templates, without writing any code at all. You will also see how you can create a custom template to execute your own processing jobs.
When you are finished with this course, you will have the skills and knowledge to design Dataflow pipelines using Apache Beam SDKs, integrate these pipelines with other Google services, and run these pipelines on the Google Cloud Platform.
A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.
Course Overview

Hi, my name is Janani Ravi, and welcome to this course on Conceptualizing the Processing Model for the GCP Dataflow Service. A little about myself. I have a master's degree in electrical engineering from Stanford, and have worked at companies such as Microsoft, Google, and Flipkart. At Google, I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content.

Dataflow allows developers to process and transform data using easy, intuitive Apache Beam APIs. In this course, you will be exposed to the full potential of Cloud Dataflow and its innovative programming model.

First, you'll work with an example Apache Beam pipeline performing stream processing, and see how it can be executed using the Cloud Dataflow runner. You'll see how the pipeline can be configured to read and write data using Cloud Storage buckets. You will customize the execution of the pipeline using the PipelineOptions object in Beam. You'll use the Dataflow monitoring interface, the gcloud command-line utility, and the Cloud Monitoring service to monitor and debug your processing application.

Next, you'll explore Dataflow-specific optimizations to your execution graph, and you'll see how Dataflow manages the autoscaling of workers. You will integrate your Dataflow pipeline with Pub/Sub, the scalable messaging service on GCP, and you'll see how you can extract event timestamps and perform windowing operations on your streaming data.

Finally, you'll see how you can execute Dataflow pipelines without writing any code at all using built-in templates. When you're finished with this course, you'll have the skills and knowledge to design Dataflow pipelines using Apache Beam SDKs, and integrate those pipelines with other Google services.