Process Incremental Partitions with Quality Gates in Polars
Build an incremental batch pipeline using Python and Polars that processes new partitions, handles late-arriving records, enforces data quality gates, and guarantees idempotent outputs.
Step 1: Introduction and Environment Setup
Welcome to the "Process Incremental Partitions with Quality Gates in Polars" code lab! In the modern data ecosystem, pipelines cannot simply truncate and reload entire datasets every day. As data volumes grow into the terabytes, engineers must rely on incremental processing—identifying exactly what data is new, processing only that subset, and safely merging it into a curated destination.
In this lab, you are stepping into the role of a Data Engineer tasked with building a daily batch pipeline for an e-commerce platform's transaction logs. You will be using Polars, a lightning-fast DataFrame library written in Rust, which has taken the data engineering world by storm due to its memory efficiency and powerful Lazy API. Unlike Pandas, which loads everything into memory upfront, Polars' Lazy API allows you to define a massive execution graph encompassing multiple files, filters, and aggregations, and then push down optimizations before reading a single byte of data.
Your architecture will follow a classic Medallion-style pattern:
- Raw Zone: Partitioned parquet files arriving continuously.
- Pipeline Core: Your Polars application handling discovery, deduplication, and quality enforcement.
- Curated Zone: The pristine, deduplicated, idempotent outputs ready for reporting.
Throughout the next steps, you will modify the `src/pipeline.py` file to incrementally build out each of these stages. The environment has already been pre-configured with Python 3.11, Polars, and a pytest suite. Familiarize yourself with the file structure in your editor, and move on to Step 2 when you are ready to begin writing code.

Info: If you get stuck, you can refer to the provided solution code for each task, available in the `solution` folder.

This lab experience was developed by the Pluralsight team using Forge, an internally developed AI tool utilizing Gemini technology. All sections were verified by human experts for accuracy prior to publication. For issue reporting, please contact us.
Step 2: Partition Discovery and Incremental Logic
The first responsibility of an incremental pipeline is Partition Discovery. Before reading any data, inspect the storage layer (e.g., an S3 bucket or local directory) to determine which logical partitions exist, and then filter them down to the exact subset to process during this run.
This is typically managed using a Watermark. A watermark is simply a state-tracking mechanism, usually a date or timestamp, representing the high-water mark of the last successful pipeline execution. If the watermark is `2023-10-01`, you only want to process folders partitioned on or after that date. Furthermore, distributed systems often suffer from late-arriving data (or stragglers). A mobile application offline over the weekend might sync data on Tuesday that actually occurred on Saturday. To catch these, data engineers implement a Rolling Lookback Window, safely reprocessing the last N days on every run to absorb delayed events.

Key Terminology
- Partitioning: Dividing a large dataset into smaller, manageable folders (e.g., `date=2023-10-01`).
- Watermark: The timestamp representing the boundary of previously processed data.
- Lookback Window: Subtracting a predefined interval from the watermark to intentionally reprocess recent history.
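As a stdlib-only sketch, the watermark-plus-lookback cutoff might look like this; the dates and the two-day window are hypothetical values, not the lab's configuration:

```python
from datetime import date, timedelta

# Hypothetical partition dates currently present in the raw zone.
available = [date(2023, 9, 28) + timedelta(days=i) for i in range(7)]

watermark = date(2023, 10, 1)   # high-water mark of the last successful run
lookback = timedelta(days=2)    # reprocess 2 days to absorb stragglers

# Subtract the lookback window, then keep partitions on or after the cutoff.
cutoff = watermark - lookback
to_process = [d for d in available if d >= cutoff]
print(to_process)
```

Without the lookback, only dates at or after `2023-10-01` would qualify; the window deliberately pulls two earlier days back into scope.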
What you'll accomplish in this step
- Filter a list of available partitions against a watermark date.
- Implement a time-delta calculation to establish a rolling lookback window.
- Translate logical partition dates into physical, scannable file paths.

With the basic filtering in place, you must now account for real-world networking delays by modifying the cutoff point. By implementing a lookback window, you ensure the pipeline is resilient to late-arriving records without requiring manual intervention from the engineering team.

Now that the logical dates are firmly established (accounting for both the watermark and the lookback window), the final piece of the discovery phase is constructing physical paths so the Polars engine knows exactly where to read the Parquet files from disk.

Excellent work. You have completed the Partition Discovery phase. Your pipeline is now capable of identifying exactly which files need processing based on intelligent, incremental logic. You are now ready to hand these paths over to the Polars execution engine.
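The date-to-path translation can be sketched like this; the `data/raw` root and the Hive-style `date=` folder layout are assumptions for illustration, not necessarily the lab's exact structure:

```python
from datetime import date

RAW_ZONE = "data/raw"  # illustrative root of the raw zone

def partition_paths(dates):
    """Translate logical partition dates into glob patterns Polars can scan."""
    # Hive-style layout: data/raw/date=YYYY-MM-DD/*.parquet
    return [f"{RAW_ZONE}/date={d.isoformat()}/*.parquet" for d in dates]

paths = partition_paths([date(2023, 10, 1), date(2023, 10, 2)])
print(paths)
```

A list of glob patterns like this can be handed straight to `pl.scan_parquet()` in the next step.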
Step 3: Data Ingestion and Deterministic Deduplication
With the target files identified, you enter the core data processing phase. Here, you encounter one of Polars' most powerful features: the Lazy API. When dealing with gigabytes or terabytes of partitioned data, eagerly loading files into RAM will crash a pipeline immediately. By using `pl.scan_parquet()`, Polars simply reads the file metadata and builds an execution graph. No data is moved into memory until explicitly requested.

However, because the pipeline implements a rolling lookback window in Step 2, it is intentionally reading overlapping data. If the pipeline processes the last 3 days of data every single day, the records from 3 days ago will be ingested 3 times! To resolve this, enforce Deterministic Deduplication.
Deterministic deduplication relies on a strict ordering rule, usually a business primary key combined with an `updated_at` timestamp. By sorting the dataset so the most recently updated records float to the top, and then keeping only the first occurrence of each primary key, the pipeline gracefully merges late-arriving updates (upserts) and discards stale historical records.

Key Terminology
- LazyFrame: Polars' representation of an unexecuted query plan.
- Deterministic: Yielding the exact same result no matter how many times a process runs.
- Upsert: A combination of update and insert; resolving the final state of an entity.
What you'll accomplish in this step
- Initialize a Polars LazyFrame spanning multiple partition directories.
- Implement a robust deduplication expression using sorting and uniqueness constraints.

The execution graph now knows how to read the files, but it will currently load overlapping, duplicate records. Fix that by chaining a deterministic deduplication rule directly onto the LazyFrame, ensuring you only ever see the final state of each transaction.

Fantastic. The pipeline is now lazily reading data and guaranteeing entity uniqueness. You have a solid foundation. The next logical step before writing this data to the curated zone is to ensure the data itself is healthy and mathematically sound.
Step 4: Building Data Quality Gates
A pipeline that moves bad data perfectly is still a bad pipeline. In modern DataOps, treat data as a product, which means applying rigorous Data Quality Gates during ingestion. Instead of crashing the entire batch job when a single malformed row is encountered, the industry standard is to evaluate rules row-by-row, flag the anomalies, and allow the pipeline to continue.
Polars makes this incredibly efficient through its Expression API. By using `.with_columns()` combined with boolean logic, you can evaluate millions of rows instantly, appending metadata "flags" to the dataset. In this step, you will implement three distinct types of quality gates:

- Completeness: Ensuring critical identifying columns (like `user_id`) are never null.
- Validity: Ensuring numerical facts (like `amount`) fall within logically sound boundaries (e.g., greater than zero).
- Uniqueness: Ensuring secondary business keys (like `transaction_receipt`) do not violate uniqueness constraints even after primary deduplication.
Key Terminology
- Data Quality Gate: An automated check that prevents malformed data from reaching downstream consumers.
- Polars Expression: A declarative representation of a computation (e.g., `pl.col('x') > 5`) that Polars optimizes before execution.
What you'll accomplish in this step
- Create a boolean flag expression to catch Null values.
- Create a compound expression to catch out-of-range numerical data.
- Utilize column-wide evaluation to catch uniqueness violations.

With the null check in place, you must secure the numerical facts. Financial aggregations are easily skewed by negative values or zeroes where they shouldn't exist. Add a boundary gate.

Finally, even with the primary deterministic deduplication, upstream system bugs can sometimes generate duplicate secondary identifiers. Add one final gate to flag any remaining uniqueness violations across the entire batch.

Great work. The dataset is now thoroughly annotated with quality metadata. Every row carries a series of boolean flags detailing exactly which, if any, business rules it violated. In the final step, you will route this data accordingly.
Step 5: Idempotent Writes and Quality Reporting
You have reached the culmination of the pipeline. The LazyFrame contains the full execution plan: scanning files, deduplicating records, and calculating quality flags. It is time to execute the graph and materialize the outputs.
There are two distinct outputs to generate. First, an aggregated Data Quality Report. This summary counts the total number of violations per partition date, providing vital observability to the engineering team. Second, the actual Curated Dataset. Filter out the flagged anomalies so downstream users only receive pristine data.
Crucially, writing this curated data must be Idempotent. If the job fails during the write phase and is automatically retried by an orchestrator like Airflow, you cannot risk appending duplicate data. To prevent this, use the "Stage and Atomic Move" pattern. Write the partitioned output to a temporary staging directory, and then atomically replace the target directories, guaranteeing data consistency.
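One common sketch of the stage-and-swap pattern is below. The directory names are illustrative, and note the hedge: each rename is atomic on POSIX filesystems, but the swap as a whole is two renames, so it is near-atomic rather than a single atomic operation:

```python
import shutil
import tempfile
from pathlib import Path

def publish_atomically(staging: Path, target: Path) -> None:
    """Replace `target` with `staging` via renames, so a crash leaves
    either the old output or the new one in place, never a partial mix."""
    backup = target.with_name(target.name + ".old")
    if target.exists():
        target.rename(backup)      # step 1: move the old output aside
    staging.rename(target)         # step 2: promote the staged output
    if backup.exists():
        shutil.rmtree(backup)      # step 3: discard the superseded copy

# Demo with throwaway directories (names are hypothetical).
root = Path(tempfile.mkdtemp())
staging, target = root / "staging", root / "curated"
staging.mkdir()
(staging / "part-0.parquet").write_bytes(b"new")
target.mkdir()
(target / "part-0.parquet").write_bytes(b"old")

publish_atomically(staging, target)
print((target / "part-0.parquet").read_bytes())
```

Because a retried run rebuilds the staging directory from scratch and then swaps it in whole, re-running the job can never append duplicates to the curated zone.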
Key Terminology
- Idempotency: The property of an operation where applying it multiple times yields the same result as applying it once.
- Atomic Operation: An operation that either completes entirely or not at all, preventing partial states.
- `.collect()`: The Polars method that triggers the execution of a LazyFrame into a concrete memory-backed DataFrame.
What you'll accomplish in this step
- Aggregate row-level flags into a partition-level summary DataFrame.
- Dynamically filter out any rows containing quality violations.
- Implement an idempotent directory swap to safely write partitioned parquet files.

With the report generated, focus on the payload itself. Do not allow anomalous records to reach the curated zone. Use a horizontal evaluation expression to drop the bad data.

The data is pristine and ready for disk. The final hurdle is ensuring the actual physical write operation is safe, repeatable, and resilient to failure via atomic directory replacements.

Congratulations! You have successfully built a production-grade incremental batch pipeline. You handled partition discovery, late-arriving data, deterministic deduplication, embedded quality gates, and idempotent writes. These patterns are the bedrock of reliable data engineering and can be scaled out to process massive datasets using Polars. Consider extending this pipeline by integrating the quality report with an alerting system or deploying the code to a cloud orchestrator.

## Run Your Pipeline End-to-End
All 10 functions are now implemented. From the terminal, run your pipeline directly to see every stage execute in sequence against a small synthetic dataset:
`python src/pipeline.py`

You should see:
- Partition discovery — which dates are selected after applying the lookback window to the watermark
- Quality Report — a table grouped by partition date showing the count of violations per flag column
- Record summary — total records after deduplication, how many passed the quality gates, and how many were rejected
The demo data contains two intentional violations: a missing `user_id` on one transaction and a negative `amount` on another. Look for non-zero counts in the quality report and two records missing from the final output to confirm your gates are working correctly.