Build KPI Summary Tables with Polars

Build a high-performance analytical data pipeline using the Polars Expressions API. You will construct reporting-ready KPI tables, handle datetime transformations, and implement complex conditional aggregations.

Lab Info

Level: Intermediate
Last updated: May 07, 2026
Duration: 40m

Table of Contents
  1. Challenge

    Step 1: Introduction to Polars and the KPI Pipeline

    Welcome to the Polars KPI Builder lab!

    In modern data engineering, processing large datasets quickly is critical. Pandas has long been the standard for Python data manipulation, but Polars has emerged as a high-performance alternative. Written in Rust, Polars pairs a multi-threaded execution engine with a lazy-evaluation API, taking full advantage of multi-core CPUs while keeping memory usage tight.
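
    As a quick illustration of the lazy API (the file and column names here are hypothetical, not part of this lab):

        import polars as pl

        # scan_csv reads nothing yet: it returns a LazyFrame describing a query plan.
        plan = (
            pl.scan_csv("transactions.csv")                # hypothetical file
            .filter(pl.col("amount") > 0)                  # hypothetical column
            .group_by("category")                          # hypothetical column
            .agg(pl.col("amount").sum().alias("revenue"))
        )

        # Polars optimizes the plan and executes it across CPU cores
        # only when collect() is called.
        df = plan.collect()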

    Scenario Overview

    You are building the backend analytical layer for an executive dashboard. The frontend team needs three distinct summary tables: a Core KPI timeline, a Refund Analysis table, and a Top-N Customer leaderboard.

    The source data is a CSV of raw e-commerce transactions. Your objective is to implement the analytical logic inside the KPIBuilder class in src/kpi_builder.py. The infrastructure, file loading paths, and test suites are already wired up.

    Key Terminology

    • Polars Expressions API: A declarative syntax for defining transformations. Instead of modifying data directly, you describe operations (e.g., pl.col('id').sum()), allowing Polars to optimize execution.
    • Context: The environment where expressions evaluate — for example, .with_columns() for creating columns and .agg() for aggregating grouped data.
    • Group By: Splitting data into buckets based on distinct values in a dimension column before calculating metrics for each bucket.
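
    To make these terms concrete, here is a minimal sketch (the column names are illustrative, not from the lab dataset) showing the same expression evaluated in two different contexts:

        import polars as pl

        df = pl.DataFrame({
            "store": ["A", "A", "B"],
            "sales": [100, 150, 200],
        })

        # An expression is only a description of a computation...
        total = pl.col("sales").sum()

        # ...its result depends on the context it runs in.
        df.with_columns(total.alias("overall_sales"))  # broadcast: 450 on every row
        df.group_by("store").agg(total)                # per group: A -> 250, B -> 200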

    What You Will Accomplish

    • Inspect and enforce strict schemas on raw CSV data.
    • Engineer standardized time dimensions using datetime namespaces.
    • Calculate composite metrics, unique entity counts, and conditional aggregations.
    • Rank and filter categorical entities using window functions.

    Move on to Step 2 to begin ingesting your dataset.

    info> If you get stuck, you can refer to the provided solution code for each task, available in the solution folder.

    This lab experience was developed by the Pluralsight team using Forge, an internally developed AI tool utilizing Gemini technology. All sections were verified by human experts for accuracy prior to publication. For issue reporting, please contact us.

  2. Challenge

    Step 2: Data Ingestion and Typing

    Data engineering pipelines are only as good as the cleanliness of the data they ingest. In this step, you tackle the "Ingest & Prep" phase.

    Concept Deep-Dive

    When a framework reads a CSV file, it guesses data types by scanning a subset of rows — this is called schema inference. Dates in CSVs are plain text strings (e.g., '2023-10-01'). Before doing time-series analysis, you must cast these strings into native Datetime objects.

    Once data is typed as Datetime, Polars exposes the .dt namespace — a suite of functions for temporal feature extraction. From a single timestamp, you can extract the hour, day of the week, or format it into reporting buckets like 'Year-Month'.
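
    As a rough sketch of this cast-then-extract pattern (the file path, format string, and derived column names are assumptions, not the lab's exact specification):

        import polars as pl

        df = pl.read_csv("data/transactions.csv")  # hypothetical path

        df = df.with_columns(
            # Cast the raw text column to a native Datetime.
            pl.col("transaction_date").str.to_datetime("%Y-%m-%d")
        ).with_columns(
            # Once the dtype is Datetime, the .dt namespace is available.
            pl.col("transaction_date").dt.strftime("%Y-%m").alias("year_month"),
            pl.col("transaction_date").dt.hour().alias("hour"),
        )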

    Architecture Context

    You will modify the load_and_prep_data method inside src/kpi_builder.py. This method reads the file, sanitizes the schema, and extracts the primary keys needed for grouping. The output (self.df) is the foundational DataFrame used by all subsequent methods.

    Terminology

    • pl.read_csv(): Ingests comma-separated data.
    • with_columns(): Adds or overwrites columns without dropping the rest of the DataFrame.
    • dt.strftime(): Formats Datetime objects into specific string patterns.

    With the dataset loaded and transaction_date cast to Datetime, you can now use the .dt namespace to extract reporting dimensions.

    The foundational dataset self.df is now loaded, typed, and enriched with the necessary dimensions. Run ./runTest.sh 2.2 in the terminal to confirm your output, then move on to Step 3.
  3. Challenge

    Step 3: Building the Core KPI Table

    With your data prepared, you can now construct the Core KPI timeline — a table that condenses thousands of individual transactions into monthly summary rows.

    Concept Deep-Dive

    Aggregation reduces data granularity via the Split-Apply-Combine strategy: split the data into groups with .group_by(), apply aggregation functions (like sum or mean) to each group, then combine the results into a single DataFrame.

    Polars evaluates all expressions inside .agg() simultaneously on multiple CPU threads — sum, count, and distinct count all run in parallel. Polars expressions are also composable: you can perform math between expressions (like dividing a sum by a count) right inside the aggregation block.
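
    A minimal sketch of this split-apply-combine pattern, including a composed expression (the 'amount' column and metric names are assumptions; df stands for the prepared DataFrame from Step 2):

        import polars as pl

        core_kpis = (
            df.group_by("year_month")
            .agg(
                pl.col("amount").sum().alias("total_revenue"),
                pl.len().alias("order_count"),
                pl.col("customer_id").n_unique().alias("unique_customers"),
                # Expressions compose: divide one aggregate by another in place.
                (pl.col("amount").sum() / pl.len()).alias("avg_order_value"),
            )
            .sort("year_month")
        )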

    Architecture Context

    You will work in the build_core_kpis method. This method uses self.df populated in Step 2 and returns a new DataFrame, leaving self.df untouched for reuse by other methods.

    Terminology

    • .group_by('col'): Segments the dataset based on unique values in the specified column.
    • .agg(...): Evaluates aggregation expressions against each group.
    • .n_unique(): Counts distinct values within a group.

    You have the basic sums and counts. Next, add a unique customer count to understand how many distinct buyers are driving these numbers.

    The final addition to the Core KPI table is Average Order Value — a metric derived directly from the ones you just created.

    The Core KPI table is complete. Run ./runTest.sh 3.3 to verify it is sorted chronologically. In Step 4, you will tackle more complex business logic using conditional metrics.
  4. Challenge

    Step 4: Advanced Aggregations - Conditional Metrics

    Standard sums and counts are straightforward, but real-world business logic often requires conditional metrics — values that apply only to rows meeting specific criteria, such as refunded transactions.

    Concept Deep-Dive

    Polars handles conditional logic with pl.when().then().otherwise() — the equivalent of SQL's CASE WHEN. This construct generates a boolean mask: where the condition is True, the expression returns the value in .then(); elsewhere it returns the value in .otherwise(). The entire chain is a single Expression, so you can chain .sum() directly onto it without creating temporary filtered DataFrames.
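
    A sketch of a conditional aggregation in this style (the 'status' column, its 'refunded' value, and the 'amount' column are assumptions):

        import polars as pl

        refunds = df.group_by("year_month").agg(
            pl.col("amount").sum().alias("total_revenue"),
            # The whole when/then/otherwise chain is a single expression,
            # so .sum() aggregates the masked values directly.
            pl.when(pl.col("status") == "refunded")
            .then(pl.col("amount"))
            .otherwise(0.0)
            .sum()
            .alias("refund_amount"),
        ).with_columns(
            # Refund Rate as a percentage of total revenue.
            (pl.col("refund_amount") / pl.col("total_revenue") * 100)
            .alias("refund_rate_pct")
        )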

    Architecture Context

    You will work inside the build_conditional_kpis method, which generates a refund-focused summary table for stakeholders monitoring business health.

    Terminology

    • pl.when(condition): Entry point for conditional logic.
    • .then(value): Value returned when the condition is True.
    • .otherwise(value): Default value when the condition is False.

    The absolute refund amount is useful, but the Refund Rate percentage gives stakeholders a true measure of system health by comparing refunds against total revenue.

    Run ./runTest.sh 4.2 to confirm your refund metrics. In Step 5, you will apply ranking functions to build a customer leaderboard.
  5. Challenge

    Step 5: Ranking and Top-N Reporting

    For the final dashboard widget, you will shift away from time-based analytics and build a leaderboard of top-spending customers. This requires grouping by entity, ranking the aggregates, and applying a top-N filter.

    Concept Deep-Dive

    The .rank() method assigns numerical positions to values across a DataFrame. To identify top spenders, use descending rank so the largest values receive the lowest rank indexes (1, 2, 3...). Applying the filter after ranking ensures positions reflect the entire dataset before trimming to the top 5.
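
    A sketch of the rank-then-filter pattern (customer_id matches the lab; the 'amount' column and metric name are assumptions):

        import polars as pl

        top_customers = (
            df.group_by("customer_id")
            .agg(pl.col("amount").sum().alias("total_spent"))
            .with_columns(
                # Rank the full table first so positions are global.
                pl.col("total_spent")
                .rank(method="ordinal", descending=True)
                .alias("rank")
            )
            .filter(pl.col("rank") <= 5)  # then trim to the top 5
            .sort("rank")
        )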

    Architecture Context

    You will work in the build_top_customers method, which pivots the grouping dimension from year_month to customer_id, returning a small filtered table optimized for frontend rendering.

    Terminology

    • .rank(descending=True): Assigns leaderboard positions to numeric values.
    • .filter(condition): Removes rows where the condition is False.
    • .sort('col'): Orders rows by the specified column.

    With customer totals calculated, apply the ranking function to establish leaderboard positions.

    All customers are now ranked. Trim the dataset to the top 5 for the frontend widget.

    Lab Complete!

    Congratulations! You have built a complete analytical pipeline using Polars.

    You enforced a strict schema, engineered datetime dimensions, and computed parallel aggregations. You applied conditional expressions to derive refund metrics and used ranking functions to build a Top-N leaderboard.

    Run ./runTest.sh 5.3 to verify your final leaderboard, or run all tests at once with pytest tests/ from the workspace directory.

    The Polars Expressions API makes your pipeline faster and more memory-efficient than traditional iteration. Keep experimenting!

About the author

Pluralsight’s AI authoring technology is designed to accelerate the creation of hands-on, technical learning experiences. Serving as a first-pass content generator, it produces structured lab drafts aligned to learning objectives defined by Pluralsight’s Curriculum team. Each lab is then enhanced by our Content team, who configure the environments, refine instructions, and conduct rigorous technical and quality reviews. The result is a collaboration between artificial intelligence and human expertise, where AI supports scale and efficiency, and Pluralsight experts ensure accuracy, relevance, and instructional quality, helping learners build practical skills with confidence.
