
Process Incremental Partitions with Quality Gates in Polars

Build an incremental batch pipeline using Python and Polars that processes new partitions, handles late-arriving records, enforces data quality gates, and guarantees idempotent outputs.

Lab Info
Level: Intermediate
Last updated: May 08, 2026
Duration: 35m

Table of Contents
  1. Challenge

    Step 1: Introduction and Environment Setup

    Welcome to the "Process Incremental Partitions with Quality Gates in Polars" code lab! In the modern data ecosystem, pipelines cannot simply truncate and reload entire datasets every day. As data volumes grow into the terabytes, engineers must rely on incremental processing—identifying exactly what data is new, processing only that subset, and safely merging it into a curated destination.

    In this lab, you are stepping into the role of a Data Engineer tasked with building a daily batch pipeline for an e-commerce platform's transaction logs. You will be using Polars, a lightning-fast DataFrame library written in Rust that has taken the data engineering world by storm thanks to its memory efficiency and powerful Lazy API. Unlike Pandas, which loads everything into memory upfront, Polars' Lazy API lets you define a massive execution graph encompassing multiple files, filters, and aggregations, and then push down optimizations before reading a single byte of data.
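
    To make the lazy workflow concrete before you start coding, here is a minimal sketch; the file path and column names are illustrative assumptions, not the lab's actual dataset:

    import polars as pl

    # Lazy: scan_parquet() records *what* to read; no bytes are loaded yet.
    lazy = (
        pl.scan_parquet("raw/date=2023-10-01/*.parquet")  # hypothetical path
        .filter(pl.col("amount") > 0)                     # hypothetical column
        .group_by("user_id")
        .agg(pl.col("amount").sum())
    )

    # Polars optimizes the whole graph (predicate and projection pushdown)
    # and reads data only when the plan is explicitly collected.
    df = lazy.collect()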

    Your architecture will follow a classic Medallion-style pattern:

    • Raw Zone: Partitioned parquet files arriving continuously.
    • Pipeline Core: Your Polars application handling discovery, deduplication, and quality enforcement.
    • Curated Zone: The pristine, deduplicated, idempotent outputs ready for reporting.

    Throughout the next steps, you will modify the src/pipeline.py file to incrementally build out each of these stages. The environment has already been pre-configured with Python 3.11, Polars, and a pytest suite. Familiarize yourself with the file structure in your editor, and move on to Step 2 when you are ready to begin writing code.

    info> If you get stuck, you can refer to the provided solution code for each task, available in the solution folder.

    This lab experience was developed by the Pluralsight team using Forge, an internally developed AI tool utilizing Gemini technology. All sections were verified by human experts for accuracy prior to publication. For issue reporting, please contact us.

  2. Challenge

    Step 2: Partition Discovery and Incremental Logic

    The first responsibility of an incremental pipeline is Partition Discovery. Before reading any data, inspect the storage layer (e.g., an S3 bucket or local directory) to determine which logical partitions exist, and then filter them down to the exact subset to process during this run.

    This is typically managed using a Watermark. A watermark is simply a state tracking mechanism—usually a date or timestamp—representing the high-water mark of the last successful pipeline execution. If the watermark is 2023-10-01, you only want to process folders partitioned on or after that date. Furthermore, distributed systems often suffer from late-arriving data (or stragglers). A mobile application offline over the weekend might sync data on Tuesday that actually occurred on Saturday. To catch these, data engineers implement a Rolling Lookback Window, safely reprocessing the last N days on every run to absorb delayed events.

    Key Terminology

    • Partitioning: Dividing a large dataset into smaller, manageable folders (e.g., date=2023-10-01).
    • Watermark: The timestamp representing the boundary of previously processed data.
    • Lookback Window: Subtracting a predefined interval from the watermark to intentionally reprocess recent history.

    What you'll accomplish in this step

    • Filter a list of available partitions against a watermark date.
    • Implement a time-delta calculation to establish a rolling lookback window.
    • Translate logical partition dates into physical, scannable file paths.

    With the basic filtering in place, you must now account for real-world networking delays by modifying the cutoff point. By implementing a lookback window, you ensure the pipeline is resilient to late-arriving records without requiring manual intervention from the engineering team.

    Now that the logical dates are firmly established (accounting for both the watermark and the lookback window), the final piece of the discovery phase is constructing physical paths so the Polars engine knows exactly where to read the Parquet files from disk.

    Excellent work. You have completed the Partition Discovery phase. Your pipeline is now capable of identifying exactly which files need processing based on intelligent, incremental logic. You are now ready to hand these paths over to the Polars execution engine.
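
    As a sketch of how these three tasks might fit together (the watermark value, directory layout, and variable names here are assumptions for illustration, not the lab's exact solution):

    from datetime import date, timedelta

    # Hypothetical inputs.
    watermark = date(2023, 10, 1)    # high-water mark of the last successful run
    lookback = timedelta(days=3)     # rolling lookback window
    available = [date(2023, 9, 26) + timedelta(days=i) for i in range(8)]

    # 1. Shift the cutoff back so late-arriving records are reprocessed.
    cutoff = watermark - lookback

    # 2. Keep only partitions on or after the cutoff.
    selected = [d for d in available if d >= cutoff]

    # 3. Translate logical dates into physical, scannable paths.
    paths = [f"raw/date={d.isoformat()}/*.parquet" for d in selected]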
  3. Challenge

    Step 3: Data Ingestion and Deterministic Deduplication

    With the target files identified, you enter the core data processing phase. Here, you encounter one of Polars' most powerful features: the Lazy API. When dealing with gigabytes or terabytes of partitioned data, eagerly loading every file into RAM can crash a pipeline outright. By using pl.scan_parquet(), Polars simply reads the file metadata and builds an execution graph. No data is moved into memory until explicitly requested.

    However, because the pipeline implements a rolling lookback window in Step 2, it is intentionally reading overlapping data. If the pipeline processes the last 3 days of data every single day, the records from 3 days ago will be ingested 3 times! To resolve this, enforce Deterministic Deduplication.

    Deterministic deduplication relies on a strict ordering rule—usually a business primary key combined with an updated_at timestamp. By sorting the dataset so the most recently updated records float to the top, and then keeping only the first occurrence of each primary key, the pipeline gracefully merges late-arriving updates (upserts) and discards stale historical records.

    Key Terminology

    • LazyFrame: Polars' representation of an unexecuted query plan.
    • Deterministic: Yielding the exact same result no matter how many times a process runs.
    • Upsert: A combination of update and insert; resolving the final state of an entity.

    What you'll accomplish in this step

    • Initialize a Polars LazyFrame spanning multiple partition directories.
    • Implement a robust deduplication expression using sorting and uniqueness constraints.

    The execution graph now knows how to read the files, but it will currently load overlapping, duplicate records. Fix that by chaining a deterministic deduplication rule directly onto the LazyFrame, ensuring you only ever see the final state of each transaction.

    Fantastic. The pipeline is now lazily reading data and guaranteeing entity uniqueness. You have a solid foundation. The next logical step before writing this data to the curated zone is to ensure the data itself is healthy and mathematically sound.
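
    A minimal sketch of these two pieces chained together (paths comes from the discovery step; column names such as transaction_id and updated_at are assumptions):

    import polars as pl

    # Build one plan spanning every selected partition; nothing is read yet.
    lf = pl.scan_parquet(paths)

    deduped = (
        lf
        # Float the most recently updated version of each record to the top...
        .sort("updated_at", descending=True)
        # ...then keep only that first occurrence per business key.
        .unique(subset=["transaction_id"], keep="first", maintain_order=True)
    )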
  4. Challenge

    Step 4: Building Data Quality Gates

    A pipeline that moves bad data perfectly is still a bad pipeline. In modern DataOps, treat data as a product, which means applying rigorous Data Quality Gates during ingestion. Instead of crashing the entire batch job when a single malformed row is encountered, the industry standard is to evaluate rules row-by-row, flag the anomalies, and allow the pipeline to continue.

    Polars makes this incredibly efficient through its Expression API. By using .with_columns() combined with boolean logic, you can evaluate millions of rows instantly, appending metadata "flags" to the dataset. In this step, you will implement three distinct types of quality gates:

    1. Completeness: Ensuring critical identifying columns (like user_id) are never null.
    2. Validity: Ensuring numerical facts (like amount) fall within logically sound boundaries (e.g., greater than zero).
    3. Uniqueness: Ensuring secondary business keys (like transaction_receipt) do not violate uniqueness constraints even after primary deduplication.

    Key Terminology

    • Data Quality Gate: An automated check that prevents malformed data from reaching downstream consumers.
    • Polars Expression: A declarative representation of a computation (e.g., pl.col('x') > 5) that Polars optimizes before execution.

    What you'll accomplish in this step

    • Create a boolean flag expression to catch Null values.
    • Create a compound expression to catch out-of-range numerical data.
    • Utilize column-wide evaluation to catch uniqueness violations.

    With the null check in place, you must secure the numerical facts. Financial aggregations are easily skewed by negative values or zeroes where they shouldn't exist. Add a boundary gate.

    Finally, even with the primary deterministic deduplication, upstream system bugs can sometimes generate duplicate secondary identifiers. Add one final gate to flag any remaining uniqueness violations across the entire batch.

    Great work. The dataset is now thoroughly annotated with quality metadata. Every row carries a series of boolean flags detailing exactly which, if any, business rules it violated. In the final step, you will route this data accordingly.
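
    Here is one possible shape for the three gates (the column and flag names are illustrative assumptions):

    import polars as pl

    flagged = deduped.with_columns(
        # Completeness: critical identifiers must never be null.
        pl.col("user_id").is_null().alias("flag_missing_user_id"),
        # Validity: amounts must be strictly greater than zero.
        (pl.col("amount") <= 0).alias("flag_invalid_amount"),
        # Uniqueness: secondary keys must not repeat anywhere in the batch.
        pl.col("transaction_receipt").is_duplicated().alias("flag_duplicate_receipt"),
    )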
  5. Challenge

    Step 5: Idempotent Writes and Quality Reporting

    You have reached the culmination of the pipeline. The LazyFrame contains the full execution plan: scanning files, deduplicating records, and calculating quality flags. It is time to execute the graph and materialize the outputs.

    There are two distinct outputs to generate. First, an aggregated Data Quality Report. This summary counts the total number of violations per partition date, providing vital observability to the engineering team. Second, the actual Curated Dataset. Filter out the flagged anomalies so downstream users only receive pristine data.
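
    A sketch of both outputs, assuming the flag columns from the previous step and a partition_date column:

    import polars as pl

    flag_cols = ["flag_missing_user_id", "flag_invalid_amount", "flag_duplicate_receipt"]

    # Output 1: violations per partition date (summing booleans counts Trues).
    quality_report = (
        flagged
        .group_by("partition_date")
        .agg([pl.col(c).sum() for c in flag_cols])
        .collect()
    )

    # Output 2: curated rows are those where no flag was raised.
    curated = flagged.filter(~pl.any_horizontal(flag_cols)).collect()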

    Crucially, writing this curated data must be Idempotent. If the job fails during the write phase and is automatically retried by an orchestrator like Airflow, you cannot risk appending duplicate data. To prevent this, use the "Stage and Atomic Move" pattern. Write the partitioned output to a temporary staging directory, and then atomically replace the target directories, guaranteeing data consistency.
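
    A simplified sketch of that pattern on a local filesystem (the helper and directory names are illustrative; rename is only atomic within a single filesystem, and object stores need their own variant):

    import shutil
    from pathlib import Path

    def replace_partition(df, target: Path) -> None:
        # Stage the output next to the target directory.
        staging = target.parent / (target.name + ".staging")
        if staging.exists():
            shutil.rmtree(staging)  # clear leftovers from a failed run
        staging.mkdir(parents=True)
        df.write_parquet(staging / "part-0.parquet")

        # Swap the staged directory into place; the rename itself is atomic,
        # so the curated zone never contains a half-written partition.
        if target.exists():
            shutil.rmtree(target)
        staging.rename(target)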

    Key Terminology

    • Idempotency: The property of an operation where applying it multiple times yields the same result as applying it once.
    • Atomic Operation: An operation that either completes entirely or not at all, preventing partial states.
    • .collect(): The Polars method that triggers the execution of a LazyFrame into a concrete memory-backed DataFrame.

    What you'll accomplish in this step

    • Aggregate row-level flags into a partition-level summary DataFrame.
    • Dynamically filter out any rows containing quality violations.
    • Implement an idempotent directory swap to safely write partitioned parquet files.

    With the report generated, focus on the payload itself. Do not allow anomalous records to reach the curated zone. Use a horizontal evaluation expression to drop the bad data.

    The data is pristine and ready for disk. The final hurdle is ensuring the actual physical write operation is safe, repeatable, and resilient to failure via atomic directory replacements.

    Congratulations! You have successfully built a production-grade incremental batch pipeline. You handled partition discovery, late-arriving data, deterministic deduplication, embedded quality gates, and idempotent writes. These patterns are the bedrock of reliable data engineering and can be scaled out to process massive datasets using Polars. Consider extending this pipeline by integrating the quality report with an alerting system or deploying the code to a cloud orchestrator.

    Run Your Pipeline End-to-End

    All 10 functions are now implemented. From the terminal, run your pipeline directly to see every stage execute in sequence against a small synthetic dataset:

    python src/pipeline.py
    

    You should see:

    • Partition discovery — which dates are selected after applying the lookback window to the watermark
    • Quality Report — a table grouped by partition date showing the count of violations per flag column
    • Record summary — total records after deduplication, how many passed the quality gates, and how many were rejected

    The demo data contains two intentional violations: a missing user_id on one transaction and a negative amount on another. Look for non-zero counts in the quality report and two records missing from the final output to confirm your gates are working correctly.
