
Build a Data Quality Workflow using Great Expectations

Data quality issues in multi-stage pipelines can silently break downstream analytics and models. In this lab, you will use Great Expectations to build a validation pipeline that catches problems at each stage, from raw ingestion to transformation, and learn how to fix and revalidate data before it reaches production.

Lab info
Last updated: Feb 20, 2026
Duration: 45m

Table of Contents
  1. Challenge

    Introduction to Multi-Step Data Quality Validation with Great Expectations

    In this lab, you will build a multi-step data quality validation pipeline using Great Expectations. You'll learn how to configure a project, connect to datasets at different stages of a data pipeline, define data quality rules using expectation suites, and run validations using checkpoints — all by writing and running Python scripts in the Terminal.

    Instead of working with a single dataset, this lab simulates a real-world pipeline with three stages: raw, cleaned, and transformed data. You'll define increasingly strict expectations for each stage, run validations, analyze results, and troubleshoot data quality issues by fixing data and revalidating.


    🟦 Note:

    Many real-world data pipelines involve multiple transformation stages where data quality requirements differ at each step. This lab equips you with the skills to:

    • Configure a Great Expectations project with a filesystem datasource
    • Register multiple data assets for different pipeline stages
    • Define tailored expectation suites for raw, cleaned, and transformed data
    • Run and analyze validations using checkpoints
    • Troubleshoot failures and revalidate corrected data

    🟩 Prerequisites: Great Expectations and pandas are pre-installed in your environment. The starter files and data generation helpers are already provided; you only need to complete the TODO sections in each file.


    🗂️ Lab Structure

    This lab is organized into three steps, each with its own Python file:

    1. Set Up the Pipeline (setup_pipeline.py) — Generate datasets, initialize a GX project, and register data assets.
    2. Define and Run Validations (validate_pipeline.py) — Create expectation suites for each pipeline stage and run validations.
    3. Analyze and Troubleshoot (analyze_pipeline.py) — Bundle validations into a checkpoint, fix data quality issues, and revalidate.

    Each file contains TODO comments marking the code you need to complete. The rest of the code is provided for you.


    🔍 Key Concepts

    📁 Data Context and Project Initialization

    • Initialize a Great Expectations project using a file-based data context
    • Understand the project structure and configuration files generated on setup

    🔌 Datasources and Data Assets

    • Connect a filesystem datasource pointing to a local data directory
    • Register individual data assets for raw, cleaned, and transformed CSV files

    ✅ Expectation Suites and Validation Rules

    • Define expectation suites with stage-specific data quality rules
    • Add expectations for column presence, null checks, value ranges, uniqueness, and aggregates
    • Run validations independently for each pipeline stage

    🏁 Checkpoints and Troubleshooting

    • Configure checkpoints to bundle and execute multiple validations together
    • Review validation results and identify failed expectations
    • Fix data quality issues and revalidate to confirm resolution

    🟩 Learning Objectives

    By the end of this lab, you will:

    • Generate and manage pipeline datasets across raw, cleaned, and transformed stages
    • Initialize a Great Expectations project using a file-based data context
    • Configure a filesystem datasource and register data assets
    • Create expectation suites with stage-specific data quality rules
    • Run validations using checkpoints and batch definitions
    • Troubleshoot data quality failures and revalidate corrected data

    Now that you have an overview of the pipeline you'll be building, click Next Step to start generating your datasets and initializing your Great Expectations project! 🚀

  2. Challenge

    Configure a Multi-Step Data Quality Workflow

    Configure a Great Expectations Project and Register Data Assets

    In this step, you will set up the foundation for a data validation pipeline using Great Expectations. You will generate three CSV datasets representing different stages of a data pipeline (raw, cleaned, and transformed), initialize a file-backed GX project, configure a pandas filesystem datasource, and register each dataset as a named data asset with a batch definition.

    All of your work in this step will be done in the file setup_pipeline.py. Look for the TODO comments inside each task function — they describe exactly what code to add. The rest of the file contains helper functions that are already complete; you do not need to modify them.

    By the end of this step, your Great Expectations project will have a fully configured datasource with three registered CSV assets, one for each pipeline stage. You will also understand how GX organizes datasources, assets, and batch definitions to connect to your data.

    🟦 Note:

    • Great Expectations uses gx.get_context(mode="file") to create a file-backed project that persists configuration and validation results locally.
    • A datasource tells GX where and how to access your data files.
    • Each CSV file is registered as a data asset with a batch definition that points to a specific file.
    • The raw data asset is pre-filled as an example. You will register the cleaned and transformed assets following the same pattern.

    In this step, you will:

    • Generate three pipeline datasets (raw, cleaned, transformed) with intentional quality issues.
    • Initialize a Great Expectations project using gx.get_context(mode="file").
    • Configure a pandas filesystem datasource pointing to your data directory.
    • Register the cleaned and transformed CSV files as named data assets with batch definitions, following the pattern of the pre-filled raw data asset.

    Once you have completed all the tasks, run the file by clicking RUN in the terminal, or by typing python setup_pipeline.py in the terminal.

    A successful run will print confirmation messages for each completed task. If any task is incomplete, the script will stop and tell you which function still needs work.

    By completing this step, you will gain practical experience configuring a Great Expectations project from scratch and connecting it to local CSV files for validation. Make sure all tasks pass before moving on, as the next step depends on the data and context created here. If you face any issues, you can refer to the solutions folder for the complete solutions.

  3. Challenge

    Create and Validate Expectation Suites

    Define Expectation Suites and Run Validations

    In this step, you will create expectation suites for each stage of your data pipeline (raw, cleaned, and transformed) and run validations against the corresponding datasets. Expectations are declarative rules that describe what your data should look like — such as column presence, null checks, value ranges, and uniqueness constraints.

    All of your work in this step will be done in the file validate_pipeline.py. Look for the TODO comments inside each task function — they describe exactly what code to add. The rest of the file contains helper functions that are already complete; you do not need to modify them.

    By the end of this step, you will see that raw data and transformed data intentionally fail certain expectations (nulls and duplicates), while cleaned data passes all checks. This demonstrates how Great Expectations surfaces data quality issues at each pipeline stage.

    🟦 Note:

    • An expectation suite is a named collection of expectations that define the quality contract for a dataset.
    • Expectations are added to a suite using suite.add_expectation() with expectation classes from gx.expectations.
    • A validation definition wires together a batch definition (data) and an expectation suite, then runs the checks.
    • Some expectations are designed to fail intentionally in this lab — this is how GX detects real data quality issues.

    In this step, you will:

    • Create an expectation suite for raw data with column existence, uniqueness, and null checks.
    • Create an expectation suite for cleaned data with null and range checks.
    • Create an expectation suite for transformed data with set membership, uniqueness, and row count checks.
    • Wire each suite to its corresponding data asset and run all validations.
    • Observe which expectations pass and which fail, and understand why.

    Once you have completed all the tasks, run the file by clicking RUN in the terminal, or by typing python validate_pipeline.py in the terminal.

    A successful run will print confirmation messages for each completed task. If any task is incomplete, the script will stop and tell you which function still needs work.

    By completing this step, you will gain practical experience defining data quality rules and running validations with Great Expectations. Make sure all tasks pass before moving on, as the next step depends on the validation definitions created here. If you face any issues, you can refer to the solutions folder for the complete solutions.

  4. Challenge

    Analyze Results and Troubleshoot Data Quality Issues

    In this step, you will create a checkpoint that bundles all three validation definitions from the previous step into a single, repeatable run. After observing the initial failures, you will fix the underlying data quality issues and adjust expectation thresholds so the entire pipeline passes validation.

    All of your work in this step will be done in the file analyze_pipeline.py. Look for the TODO comments inside each task function — they describe exactly what code to add. The rest of the file contains helper functions that are already complete; you do not need to modify them.

    By the end of this step, you will have practiced the full Great Expectations troubleshooting workflow: run a checkpoint, interpret failures, fix the root causes, and revalidate to confirm everything passes.

    🟦 Note:

    • A checkpoint groups multiple validation definitions so they execute together and produce a single aggregated result.
    • The mostly parameter on expectations like ExpectColumnValuesToNotBeNull sets the minimum fraction of rows that must satisfy the check (e.g., mostly=0.95 means at least 95% of values must be non-null).
    • Fixing data quality issues may involve cleaning the data itself (removing duplicates) or relaxing expectations to tolerate known, acceptable gaps.

    In this step, you will:

    • Create a checkpoint that bundles raw, cleaned, and transformed validation definitions.
    • Execute the checkpoint and observe which validations fail.
    • Fix the transformed data by removing duplicate rows.
    • Adjust the raw data expectations with mostly thresholds to tolerate known nulls.
    • Revalidate to confirm all pipeline stages now pass.

    Once you have completed all the tasks, run the file by clicking RUN in the terminal, or by typing python analyze_pipeline.py in the terminal.

    A successful run will print confirmation messages for each completed task. If any task is incomplete, the script will stop and tell you which function still needs work.

    By completing this step, you will understand how to use checkpoints for batch validation and how to iteratively troubleshoot and resolve data quality issues. If you face any issues, you can refer to the solutions folder for the complete solutions.

    🎊 Congratulations on Completing the Lab!

    You have successfully completed the Data Quality Validation with Great Expectations lab. Throughout this lab, you built a complete data validation pipeline that defines expectations, validates data at every stage, and troubleshoots quality issues using checkpoints and iterative fixes.


    ✅ What You Accomplished

    • Connected to a Great Expectations File Context and created a persistent gx_project directory for storing validation artifacts.
    • Registered a Pandas Filesystem Datasource pointing to CSV files in the data/ directory.
    • Created Data Assets and Batch Definitions for raw, cleaned, and transformed pipeline stages.
    • Built Expectation Suites with checks for column existence, column count, uniqueness, null values, and value ranges using expectations like ExpectColumnToExist, ExpectColumnValuesToBeUnique, and ExpectColumnValuesToBeBetween.
    • Created Validation Definitions that pair each batch definition with its corresponding expectation suite.
    • Ran individual validations and interpreted pass/fail results for each pipeline stage.
    • Bundled all validations into a single Checkpoint using context.checkpoints.add() for repeatable, aggregated runs.
    • Executed the checkpoint and identified two failures: null values in raw data and a duplicate row in transformed data.
    • Fixed data issues by removing duplicates with df.drop_duplicates(subset=["id"], keep="first").
    • Applied the mostly parameter to relax null checks with tolerant thresholds (0.95 for name, 0.85 for email).
    • Revalidated the entire pipeline and confirmed all stages pass after fixes.

    🔑 Key Takeaways

    • Expectation Suites let you define reusable data quality rules that can be applied across multiple datasets and pipeline stages.
    • Validation Definitions decouple what to check (suite) from where to check it (batch), making your validation logic modular and composable.
    • Checkpoints group multiple validations into a single execution unit, enabling full-pipeline validation in one command.
    • The mostly parameter provides a practical way to enforce quality standards while tolerating known, acceptable data gaps.
    • Iterative troubleshooting (run, inspect failures, fix root causes, revalidate) is the core workflow for maintaining data quality in production pipelines.

    Amazing work! You have built a complete data quality validation pipeline that catches issues at every stage, from raw ingestion to final transformation. You are now ready to integrate Great Expectations into real-world data pipelines and ensure your data meets quality standards before it reaches downstream consumers. 🚀

About the author

Pinal Dave is a Pluralsight Developer Evangelist.
