Flink is a stateful, fault-tolerant, large-scale system with excellent latency and throughput characteristics. It works with bounded and unbounded datasets using the same underlying stream-first architecture, with a focus on streaming, or unbounded, data.
Apache Flink is built on a stream-first architecture, where the stream is the source of truth. Flink offers extensive APIs to process both batch and streaming data in an easy and intuitive manner.
In this course, Conceptualizing the Processing Model for Apache Flink, you’ll be introduced to Flink Architecture and processing APIs to get started on your data analysis journey.
First, you’ll explore the differences between processing batch and streaming data, and understand how stream-first architecture works. You’ll study the stream-first processing model that Flink uses to process data at scale, and Flink’s architecture, which uses a JobManager, TaskManagers, and task slots to execute the operators and streams in a Flink application in a data-parallel manner.
Next, you’ll understand the difference between stateless and stateful stream transformations and apply these concepts in a hands-on manner in your Flink stream processing. You’ll process data in a stateless manner using the map(), flatMap(), and filter() transformations, and use keyed streams and rich functions to work with Flink state.
Finally, you’ll round off your understanding of the state persistence and fault-tolerance mechanism that Flink uses by exploring the checkpointing architecture in Flink. You’ll enable checkpoints and savepoints in your streaming application, see how state can be restored from a snapshot in the case of failures, and configure your Flink application to support different restart strategies.
When you’re finished with this course, you’ll have the skills and knowledge to design Flink pipelines performing stateless and stateful transformations, and you’ll be able to build fault-tolerant applications using checkpoints and savepoints.
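To make the stateless versus stateful distinction concrete, here is a minimal sketch in plain Python rather than the Flink API. The helper names (split_words, is_long, stateless_pipeline, keyed_running_count) are hypothetical and only mimic, conceptually, what flatMap(), filter(), map(), and a function using keyed state would do.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def split_words(line: str) -> List[str]:
    """flatMap-like: one input line yields zero or more words."""
    return line.split()


def is_long(word: str) -> bool:
    """filter-like predicate: keep words longer than three characters."""
    return len(word) > 3


def stateless_pipeline(lines: Iterable[str]) -> Iterator[str]:
    """Each output depends only on the current element, never on earlier ones."""
    for line in lines:
        for word in split_words(line):   # flatMap-like step
            if is_long(word):            # filter-like step
                yield word.lower()       # map-like step


def keyed_running_count(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Stateful: a per-key counter is remembered across elements,
    loosely analogous to keyed state on a keyed stream."""
    counts: Dict[str, int] = defaultdict(int)
    for word in words:
        counts[word] += 1
        yield word, counts[word]


# Illustrative input; not taken from the course material.
lines = ["Flink processes streams", "Streams of events"]
words = list(stateless_pipeline(lines))
counts = list(keyed_running_count(words))
```

The stateless steps could process elements in any order or on any worker, because no element depends on another. The running count cannot: its result for a word depends on every earlier occurrence of that same word, which is why such state is what a system like Flink has to persist and restore.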
A problem solver at heart, Janani has a master’s degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.
Getting Started with Apache Flink
Now that we have a big picture understanding of how batch processing differs from stream processing, let's compare and contrast the two side by side. With batch processing, you're working on bounded finite datasets. With stream processing you're working on unbounded infinite datasets. New streaming entities are constantly added to this unbounded data.
Batch processing tends to be a relatively slow pipeline from data ingestion to analysis. This processing can take several hours or even several days. With stream processing, processing is immediate. You want to process the data almost as soon as it is received. Batch processing jobs tend to be high-latency jobs. Latency in minutes and hours is considered acceptable. You may have batch jobs that even run for a few days. Latency is a much more critical aspect of stream processing jobs. Acceptable latencies are usually in seconds or milliseconds.
Batch processing is typically run for periodic updates. You're generating reports maybe every day, every week, or every month. Stream processing jobs provide continuous updates as the jobs are constantly running and monitoring incoming data.
With batch processing, the entire dataset on which you operate is known up front, which means the order in which the data was originally received is completely unimportant. In fact, it's entirely irrelevant. In stream processing, the entities are received in real time, which means the order in which the entities appear in the stream is important. Out-of-order arrival of entities is typically tracked and monitored.
Batch processing pipelines operate within a single global state of the world at any point in time. The global state is always known. With stream processing, there is no one global state. Only the history of events that have been received in the past is tracked. We have no idea what's coming up front.
And that brings us to the last major difference between batch processing and stream processing. The processing code in a batch job knows all the data. The entire data is available up front. There are no unknowns at processing time. With stream processing, the processing code has no idea what lies ahead, what entity is going to come next, and how that entity might affect the results of processing.
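This last difference can be sketched in a few lines of plain Python. This is not Flink API code, and the function names batch_average and streaming_average are made up for illustration. The batch version sees the whole dataset before computing anything, while the streaming version keeps only a small running state and emits an updated result for every event as it arrives.

```python
from typing import Iterable, Iterator, List


def batch_average(dataset: List[float]) -> float:
    """Batch: the entire bounded dataset is available up front."""
    return sum(dataset) / len(dataset)


def streaming_average(events: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated average after each event.

    Only a summary of the history seen so far (count and total) is
    tracked; the code has no knowledge of which event comes next.
    """
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


# Illustrative readings; in a real stream these would never "end".
readings = [10.0, 20.0, 60.0]
batch_result = batch_average(readings)
stream_results = list(streaming_average(iter(readings)))
```

The batch job produces a single answer once all the data has been read. The streaming job produces a sequence of provisional answers, each of which may be revised by the next event, and only its final value matches the batch result for the same input.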