Transform and Curate a Partitioned Sales Dataset with Lazy Polars
Build a scalable data transformation pipeline using the Polars library’s Lazy API. You will ingest Parquet data, apply performance-optimizing pushdowns, perform complex transformations, and join multiple tables to produce curated, aggregated outputs.
Challenge
Step 1: Introduction to the Data Pipeline Project
Welcome to the Lazy Polars Pipeline Code Lab! In this scenario, you assume the role of a Data Engineer at a global retail company. The business relies on a daily Python script to process sales transactions and generate aggregate reports—but it has started crashing with Out-Of-Memory (OOM) exceptions as transaction logs have grown and pandas loads entire datasets into RAM at once. Scaling server hardware is becoming prohibitively expensive.
Your task is to refactor this script using Polars, a lightning-fast DataFrame library written in Rust. Specifically, you will use Polars' Lazy API. Instead of loading everything into memory, the Lazy API lets you define a computation graph (a blueprint of your operations). Polars' internal query optimizer then analyzes this blueprint, automatically rearranging tasks to maximize CPU utilization, eliminate unnecessary data loading, and stream data through in batches.
Over the next few steps, you will:
- Implement pushdown optimizations to bypass unneeded data on disk.
- Apply complex, vectorized transformations to clean and derive new columns.
- Execute relational joins against dimension tables to enrich the dataset.
- Trigger execution via a streaming sink to safely write out curated data without exceeding memory limits.
Head to the next step to begin building the ingestion phase of the pipeline.

> Info: This lab experience was developed by the Pluralsight team using Forge, an internally developed AI tool utilizing Gemini technology. All sections were verified by human experts for accuracy prior to publication. For issue reporting, please contact us.
Challenge
Step 2: Data Ingestion and Pushdown Optimizations
In this step, you'll focus on the Extract portion of the ETL pipeline. You'll connect to the raw Parquet files using the Polars Lazy API's pushdown optimizations—instructions that the query optimizer moves down to the scanning phase so unneeded data is never read from disk.
Because Parquet is a columnar format, Polars can apply the following pushdown optimizations:
- Projection pushdown: Allows Polars to ignore columns you do not need, reducing disk I/O.
- Predicate pushdown: Uses metadata such as minimum and maximum values in row groups to skip reading data that does not match your filter conditions, improving performance and memory efficiency.
Key Terminology:
- LazyFrame: A Polars object representing an unevaluated query plan.
- Projection: The relational algebra term for selecting specific columns.
- Predicate: A condition that evaluates to true or false (for example, a filter).
What you'll accomplish:
- Initialize a `LazyFrame` from partitioned Parquet files.
- Apply projection pushdown to limit I/O.
- Apply predicate pushdown to filter data before it loads.

In traditional eager-execution libraries like pandas, invoking a read function immediately pulls all file contents into memory. If the file is 50GB and your machine has 16GB of RAM, the process crashes. Polars solves this by using lazy evaluation.

Lazy evaluation means that instead of executing operations immediately, Polars builds a logical query plan—a directed acyclic graph (DAG) of the operations you define. By using `pl.scan_parquet()` instead of `pl.read_parquet()`, Polars only reads the schema (column names and data types) from the Parquet metadata. It registers the file path and waits until you explicitly tell it to execute. This allows Polars' built-in query optimizer to look ahead at all your future filters and joins, reorganizing them for maximum efficiency before touching a single row of actual data.

With a lazy blueprint established, optimize it by restricting which columns are pulled from disk. This is projection pushdown: because Parquet stores data column by column, the query engine can ignore unneeded columns entirely—pushing the select instruction down to the storage layer so those bytes are never read from disk.

With projection in place, apply row-level filtering at the source using predicate pushdown. Parquet files are divided into row groups, each storing statistical metadata (minimum and maximum column values). Polars reads this metadata first and skips any row group that cannot match your filter condition—so the data is filtered before it ever leaves disk.

By completing Step 2, you have successfully built the ingestion phase. Your pipeline is now configured to read only the exact columns and rows required from disk, taking full advantage of the Parquet format. Next, you will perform transformations on this optimized stream of data.
Challenge
Step 3: Data Transformation and the Expression API
With your data ingestion heavily optimized, you can now move into the Transform phase. Real-world data is rarely ready for analysis and often requires cleaning, formatting, and deriving new business metrics.
In this step, you’ll leverage the Polars expression API. Expressions (such as `pl.col('name').str.to_lowercase()`) are compiled to optimized Rust code and parallelized across all available CPU cores—delivering C-like speeds without slow Python loops.

Key Terminology:
- Expression: A declarative statement defining a transformation on a column.
- Namespace: A collection of specific methods for a data type (for example, `.str` for string operations).
- Vectorization: Applying an operation to an entire column at once, rather than iterating row by row.
What you'll accomplish:
- Clean formatting inconsistencies in string data.
- Parse strings into native date types for temporal accuracy.
- Derive a categorical metric using conditional logic.

Categorical variables are now uniform. Next, you must fix the date formats to ensure accurate time-series analysis. While strings are easy to read and work fine for lexicographical filtering, they are not suitable for robust time-series analysis, aggregations, and joining. Date values should be stored using a native temporal data type. In this task, you will parse the `date_str` column into a Polars `Date` object—specifying the format string explicitly avoids slower inferred parsing.

With dates handled, inject some business logic. In this task, you will create a categorical `discount_band` column from the numerical `discount` value using the `when().then().otherwise()` syntax—Polars' idiomatic `IF-THEN-ELSE` construct, fully integrated into the Expression API and evaluated lazily as part of the query plan.

You've successfully transformed the raw data stream using high-performance vectorized expressions. The data is now clean and enriched with business logic. In the next step, you will expand the dataset by integrating external dimension tables.
Challenge
Step 4: Relational Joins and Aggregations
Your fact table is clean, but it only contains IDs for products and customers. To generate human-readable reports, you need to bring in descriptive attributes via relational joins. Polars mitigates the typical join cost using optimized hash join algorithms, and because your pipeline is lazy, the query optimizer can coordinate the join strategy before any data is materialized.
After enriching the data, you will perform a group-by aggregation to condense row-level transactions into a high-level summary.
Key Terminology:
- Fact Table: A table containing quantitative measurements (for example, sales) and foreign keys.
- Dimension Table: A table containing descriptive attributes (product names, regions) related to the fact table keys.
- Left Join: A join that retains all rows from the primary table and fills missing matches with nulls.
What you'll accomplish:
- Load dimension tables into the lazy computation graph.
- Execute multiple left joins to enrich the fact data.
- Handle `null` values gracefully to maintain data integrity.
- Group and aggregate the final dataset to produce the curated output.

With the fact table and dimension tables ready, join them using left joins to ensure no sales transactions are lost, even if a product or customer ID is missing. A left join retains all rows from the left table and fills unmatched right-side values with `null`. Handling these nulls immediately after the join maintains data integrity downstream.

The dataset is now fully enriched. In this step, you will aggregate the transaction-level data into a curated summary: total daily revenue grouped by product category and region. Polars uses the split-apply-combine strategy—splitting data into groups by the grouping keys, applying the aggregation function to each group, and combining the results.

Step 4 is complete! The logical definition of your ETL pipeline is fully constructed. From ingestion to transformation, joining, and aggregation, the query optimizer now knows how to execute the pipeline. In the final step, you will inspect the query plan and execute it.
-
Challenge
Step 5: Execution, Optimization, and Sinks
You have reached the end of the pipeline definition. Before triggering execution, it is important to understand how the engine will run your plan.
When you build a `LazyFrame`, you define a logical plan. Polars optimizes this into a physical plan—the actual CPU instructions—which you can inspect with `.explain()`. To materialize the data, use a streaming sink rather than `.collect()`, which would load the full dataset into memory. A sink streams results directly to the file system in batches, keeping memory usage flat.

Key Terminology:
- Logical/Physical Plan: The difference between the defined operations and the engine’s execution strategy.
- Sink: A terminal operation that consumes a stream of data and writes it to an external system.
What you'll accomplish:
- Export and analyze the execution plan.
- Execute the pipeline using a streaming Parquet sink.

Before executing a large data pipeline, it is good practice to understand how the engine will run it. By writing your pipeline lazily, you build a logical plan. Polars takes this plan, applies optimizations, and generates a physical plan—the set of instructions executed by the CPU.
You can view this optimized plan using the `.explain()` method. This returns a string representation of the execution graph, read bottom-up: the data source sits at the bottom, and operations stack upward toward the final result. If you examine the output, you will see Polars rearranging operations. For example, you may see `PROJECT` nodes showing which columns are read from Parquet, and `FILTER` nodes showing predicate pushdown conditions applied at the source.

With the plan verified, it is time to execute the streaming process and write the curated files to disk. Use `.sink_parquet()` rather than `.collect()` so the Polars engine streams data through the execution graph in batches, writing output directly to disk without loading the full dataset into memory.

Congratulations! You have successfully refactored a brittle, memory-heavy script into a robust, highly optimized Polars pipeline. You've utilized lazy evaluation to construct a query plan, applied projection and predicate pushdowns to minimize I/O, leveraged the Expression API for lightning-fast transformations, executed complex joins and aggregations, and written the results via a streaming sink. These skills are essential for modern data engineering and scaling Python workflows to out-of-core datasets.