Build a Data Quality Workflow using Great Expectations
Data quality issues in multi-stage pipelines can silently break downstream analytics and models. In this lab, you will use Great Expectations to build a validation pipeline that catches problems at each stage, from raw ingestion to transformation, and learn how to fix and revalidate data before it reaches production.
Challenge
Introduction to Multi-Step Data Quality Validation with Great Expectations
In this lab, you will build a multi-step data quality validation pipeline using Great Expectations. You'll learn how to configure a project, connect to datasets at different stages of a data pipeline, define data quality rules using expectation suites, and run validations using checkpoints — all by writing and running Python scripts in the Terminal.
Instead of working with a single dataset, this lab simulates a real-world pipeline with three stages: raw, cleaned, and transformed data. You'll define increasingly strict expectations for each stage, run validations, analyze results, and troubleshoot data quality issues by fixing data and revalidating.
🟦 Note:
Many real-world data pipelines involve multiple transformation stages where data quality requirements differ at each step. This lab equips you with the skills to:
- Configure a Great Expectations project with a filesystem datasource
- Register multiple data assets for different pipeline stages
- Define tailored expectation suites for raw, cleaned, and transformed data
- Run and analyze validations using checkpoints
- Troubleshoot failures and revalidate corrected data
🟩 Prerequisites: Great Expectations and pandas are pre-installed in your environment. The starter files and data generation helpers are already provided; you only need to complete the `TODO` sections in each file.
🗂️ Lab Structure
This lab is organized into three steps, each with its own Python file:
- Set Up the Pipeline (`setup_pipeline.py`) — Generate datasets, initialize a GX project, and register data assets.
- Define and Run Validations (`validate_pipeline.py`) — Create expectation suites for each pipeline stage and run validations.
- Analyze and Troubleshoot (`analyze_pipeline.py`) — Bundle validations into a checkpoint, fix data quality issues, and revalidate.
Each file contains `TODO` comments marking the code you need to complete. The rest of the code is provided for you.
🔍 Key Concepts
📁 Data Context and Project Initialization
- Initialize a Great Expectations project using a file-based data context
- Understand the project structure and configuration files generated on setup
🔌 Datasources and Data Assets
- Connect a filesystem datasource pointing to a local data directory
- Register individual data assets for raw, cleaned, and transformed CSV files
✅ Expectation Suites and Validation Rules
- Define expectation suites with stage-specific data quality rules
- Add expectations for column presence, null checks, value ranges, uniqueness, and aggregates
- Run validations independently for each pipeline stage
🏁 Checkpoints and Troubleshooting
- Configure checkpoints to bundle and execute multiple validations together
- Review validation results and identify failed expectations
- Fix data quality issues and revalidate to confirm resolution
🟩 Learning Objectives
By the end of this lab, you will:
- Generate and manage pipeline datasets across raw, cleaned, and transformed stages
- Initialize a Great Expectations project using a file-based data context
- Configure a filesystem datasource and register data assets
- Create expectation suites with stage-specific data quality rules
- Run validations using checkpoints and batch definitions
- Troubleshoot data quality failures and revalidate corrected data
Now that you have an overview of the pipeline you'll be building, click Next Step to start generating your datasets and initializing your Great Expectations project! 🚀
Challenge
Configure a Multi-Step Data Quality Workflow
Configure a Great Expectations Project and Register Data Assets
In this step, you will set up the foundation for a data validation pipeline using Great Expectations. You will generate three CSV datasets representing different stages of a data pipeline (raw, cleaned, and transformed), initialize a file-backed GX project, configure a pandas filesystem datasource, and register each dataset as a named data asset with a batch definition.
All of your work in this step will be done in the file `setup_pipeline.py`. Look for the `TODO` comments inside each task function — they describe exactly what code to add. The rest of the file contains helper functions that are already complete; you do not need to modify them.
By the end of this step, your Great Expectations project will have a fully configured datasource with three registered CSV assets, one for each pipeline stage. You will also understand how GX organizes datasources, assets, and batch definitions to connect to your data.
🟦 Note:
- Great Expectations uses `gx.get_context(mode="file")` to create a file-backed project that persists configuration and validation results locally.
- A datasource tells GX where and how to access your data files.
- Each CSV file is registered as a data asset with a batch definition that points to a specific file.
- The raw data asset is pre-filled as an example. You will register the cleaned and transformed assets following the same pattern.
In this step, you will:
- Generate three pipeline datasets (raw, cleaned, transformed) with intentional quality issues.
- Initialize a Great Expectations project using `gx.get_context(mode="file")`.
- Configure a pandas filesystem datasource pointing to your data directory.
- Register the cleaned and transformed CSV files as named data assets with batch definitions, following the pattern of the pre-filled raw data asset.
Once you have completed all the tasks, run the file by clicking RUN in the terminal, or by typing `python setup_pipeline.py` in the terminal. A successful run will print confirmation messages for each completed task. If any task is incomplete, the script will stop and tell you which function still needs work. Make sure all tasks pass before moving to the next step, as the next step depends on the data and context created here. If you face any issues, you can refer to the `solutions` folder for the complete solutions.
By completing this step, you will gain practical experience configuring a Great Expectations project from scratch and connecting it to local CSV files for validation.
Challenge
Create and Validate Expectation Suites
Define Expectation Suites and Run Validations
In this step, you will create expectation suites for each stage of your data pipeline (raw, cleaned, and transformed) and run validations against the corresponding datasets. Expectations are declarative rules that describe what your data should look like — such as column presence, null checks, value ranges, and uniqueness constraints.
All of your work in this step will be done in the file `validate_pipeline.py`. Look for the `TODO` comments inside each task function — they describe exactly what code to add. The rest of the file contains helper functions that are already complete; you do not need to modify them.
By the end of this step, you will see that raw data and transformed data intentionally fail certain expectations (nulls and duplicates), while cleaned data passes all checks. This demonstrates how Great Expectations surfaces data quality issues at each pipeline stage.
🟦 Note:
- An expectation suite is a named collection of expectations that define the quality contract for a dataset.
- Expectations are added to a suite using `suite.add_expectation()` with expectation classes from `gx.expectations`.
- A validation definition wires together a batch definition (data) and an expectation suite, then runs the checks.
- Some expectations are designed to fail intentionally in this lab — this is how GX detects real data quality issues.
In this step, you will:
- Create an expectation suite for raw data with column existence, uniqueness, and null checks.
- Create an expectation suite for cleaned data with null and range checks.
- Create an expectation suite for transformed data with set membership, uniqueness, and row count checks.
- Wire each suite to its corresponding data asset and run all validations.
- Observe which expectations pass and which fail, and understand why.
Once you have completed all the tasks, run the file by clicking RUN in the terminal, or by typing `python validate_pipeline.py` in the terminal. A successful run will print confirmation messages for each completed task. If any task is incomplete, the script will stop and tell you which function still needs work. Make sure all tasks pass before moving to the next step, as it depends on the validation definitions created here. If you face any issues, you can refer to the `solutions` folder for the complete solutions.
By completing this step, you will gain practical experience defining data quality rules and running validations using Great Expectations.
Challenge
Analyze Results and Troubleshoot Data Quality Issues
In this step, you will create a checkpoint that bundles all three validation definitions from the previous step into a single, repeatable run. After observing the initial failures, you will fix the underlying data quality issues and adjust expectation thresholds so the entire pipeline passes validation.
All of your work in this step will be done in the file
analyze_pipeline.py. Look for theTODOcomments inside each task function — they describe exactly what code to add. The rest of the file contains helper functions that are already complete; you do not need to modify them.By the end of this step, you will have practiced the full Great Expectations troubleshooting workflow: run a checkpoint, interpret failures, fix the root causes, and revalidate to confirm everything passes.
🟦 Note:
- A checkpoint groups multiple validation definitions so they execute together and produce a single aggregated result.
- The
mostlyparameter on expectations likeExpectColumnValuesToNotBeNullsets the minimum fraction of rows that must satisfy the check (e.g.,mostly=0.95means at least 95% of values must be non-null). - Fixing data quality issues may involve cleaning the data itself (removing duplicates) or relaxing expectations to tolerate known, acceptable gaps.
In this step, you will:
- Create a checkpoint that bundles raw, cleaned, and transformed validation definitions.
- Execute the checkpoint and observe which validations fail.
- Fix the transformed data by removing duplicate rows.
- Adjust the raw data expectations with
mostlythresholds to tolerate known nulls. - Revalidate to confirm all pipeline stages now pass.
Once you have completed all the tasks, run the file by clicking RUN in the terminal, or by typing `python analyze_pipeline.py` in the terminal. A successful run will print confirmation messages for each completed task. If any task is incomplete, the script will stop and tell you which function still needs work. If you face any issues, you can refer to the `solutions` folder for the complete solutions.
By completing this step, you will understand how to use checkpoints for batch validation and how to iteratively troubleshoot and resolve data quality issues.
## 🎊 Congratulations on Completing the Lab!
You have successfully completed the Data Quality Validation with Great Expectations lab. Throughout this lab, you built a complete data validation pipeline that defines expectations, validates data at every stage, and troubleshoots quality issues using checkpoints and iterative fixes.
✅ What You Accomplished
- Connected to a Great Expectations File Context and created a persistent `gx_project` directory for storing validation artifacts.
- Registered a Pandas Filesystem Datasource pointing to CSV files in the `data/` directory.
- Created Data Assets and Batch Definitions for raw, cleaned, and transformed pipeline stages.
- Built Expectation Suites with checks for column existence, column count, uniqueness, null values, and value ranges using expectations like `ExpectColumnToExist`, `ExpectColumnValuesToBeUnique`, and `ExpectColumnValuesToBeBetween`.
- Created Validation Definitions that pair each batch definition with its corresponding expectation suite.
- Ran individual validations and interpreted pass/fail results for each pipeline stage.
- Bundled all validations into a single Checkpoint using `context.checkpoints.add()` for repeatable, aggregated runs.
- Executed the checkpoint and identified two failures: null values in raw data and a duplicate row in transformed data.
- Fixed data issues by removing duplicates with `df.drop_duplicates(subset=["id"], keep="first")`.
- Applied the `mostly` parameter to relax null checks with tolerant thresholds (`0.95` for `name`, `0.85` for `email`).
- Revalidated the entire pipeline and confirmed all stages pass after fixes.
🔑 Key Takeaways
- Expectation Suites let you define reusable data quality rules that can be applied across multiple datasets and pipeline stages.
- Validation Definitions decouple what to check (suite) from where to check it (batch), making your validation logic modular and composable.
- Checkpoints group multiple validations into a single execution unit, enabling full-pipeline validation in one command.
- The
mostlyparameter provides a practical way to enforce quality standards while tolerating known, acceptable data gaps. - Iterative troubleshooting (run, inspect failures, fix root causes, revalidate) is the core workflow for maintaining data quality in production pipelines.
Amazing work! You have built a complete data quality validation pipeline that catches issues at every stage, from raw ingestion to final transformation. You are now ready to integrate Great Expectations into real-world data pipelines and ensure your data meets quality standards before it reaches downstream consumers. 🚀