
Model Evaluation: Implementing Validation Metrics

In this lab, you'll practice implementing validation metrics and visualizing model evaluation results. When you're finished, you'll have diagnosed model issues using validation results and plots.

Lab Info
Level: Beginner
Last updated: Feb 04, 2026
Duration: 44m

Table of Contents
  1. Challenge

    Introduction

    Welcome to the Model Evaluation: Implementing Validation Metrics Code Lab. In this hands-on lab, you'll learn to implement validation metrics, visualize model performance using diagnostic plots, and interpret results to identify issues like overfitting and bias. By the end, you'll be able to confidently evaluate and diagnose machine learning models for real-world applications.

    What is Model Evaluation?

    When you build a machine learning model, you need to answer a critical question: "How well does this model actually work?" Model evaluation is the process of measuring your model's performance to determine if it is good enough for real-world use.

    Validation metrics are the specific measurements you use to evaluate your model. Different metrics answer different questions:

    • Accuracy - What percentage of predictions were correct overall?
    • Precision - When the model predicts "yes," how often is it right?
    • Recall - Of all the actual "yes" cases, how many did the model find?
    • F1 Score - A balanced combination of precision and recall

    No single metric tells the whole story. A model might have high accuracy but completely miss rare but important cases. This lab teaches you to use multiple metrics together for a complete picture.
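
    To make these definitions concrete, here is a minimal scikit-learn sketch of computing all four metrics. The labels and variable names are illustrative placeholders, not the lab's own code; the lab's metrics_calculator.py handles this on real predictions.

    ```python
    # Minimal sketch: accuracy, precision, recall, and F1 with scikit-learn.
    # y_true and y_pred are made-up placeholder arrays, not the lab's data.
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]   # actual outcomes
    y_pred = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]   # model predictions

    print("Accuracy: ", accuracy_score(y_true, y_pred))    # overall correctness
    print("Precision:", precision_score(y_true, y_pred))   # how often "yes" predictions are right
    print("Recall:   ", recall_score(y_true, y_pred))      # how many actual "yes" cases were found
    print("F1 Score: ", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
    ```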

    Background

    You are a data scientist at Globomantics evaluating a customer conversion prediction model. The validation metrics show unexpected results, and you need to diagnose the issue. You'll compute metrics on the validation dataset, analyze learning curves to identify overfitting or underfitting, and use visualization insights to improve model performance.

    Your team has built a machine learning model using the Gradient Boosting algorithm, a powerful technique that combines many simple decision trees into one strong predictor. Now you need to evaluate how well this model performs before deploying it to production.

    The Dataset

    This lab uses synthetic data generated by scikit-learn's make_classification function. The data simulates customer behavior with the following characteristics:

    • 1,000 total samples (representing website visitors)
    • 10 features (representing visitor attributes like time on site, pages viewed, etc.)
    • Imbalanced classes - About 70% non-converters and 30% converters (typical for real conversion data)
    • Binary classification - Each visitor either converts (1) or does not convert (0)
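
    As a rough sketch, data with these characteristics could be generated as follows. The exact parameters live in data_loader.py and may differ from the assumed values shown here.

    ```python
    # Sketch only: generating an imbalanced binary classification dataset.
    # The parameter values are assumptions based on the description above.
    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=1000,        # 1,000 website visitors
        n_features=10,         # 10 visitor attributes
        weights=[0.7, 0.3],    # about 70% non-converters, 30% converters
        random_state=42,       # assumed seed for reproducibility
    )
    print(X.shape, y.mean())   # (1000, 10) and the approximate conversion rate
    ```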

    Familiarizing Yourself with the Program Structure

    The lab environment includes the following key files:

    1. data_loader.py - Loads and prepares the customer conversion dataset with train/validation splits
    2. model_trainer.py - Trains the Gradient Boosting classifier (pre-configured, no changes needed)
    3. metrics_calculator.py - Computes validation metrics (accuracy, precision, recall, F1)
    4. roc_plotter.py - Computes and plots ROC curves
    5. pr_plotter.py - Computes and plots Precision-Recall curves
    6. learning_curves.py - Generates learning curves to diagnose bias/variance issues
    7. loss_plotter.py - Plots training and validation loss over iterations
    8. diagnostics.py - Creates residuals plots and model complexity analysis
    9. evaluate_models.py - Compares baseline vs tuned model performance

    The environment uses Python 3 with scikit-learn for machine learning, NumPy for numerical operations, and Matplotlib for visualization. All dependencies are pre-installed.

    To run scripts, use the terminal with commands like python3 data_loader.py. Results are saved to the output/ directory.

    Note: Complete tasks in order. Each task builds on the previous one. Test your code frequently by running the provided scripts to catch errors early.

    Info: If you get stuck on a task, solution files are provided in the solution directory of your file tree.

  2. Challenge

    Setting Up Data and Computing Basic Metrics

    Understanding Validation Data

    Before evaluating any model, you need properly split data. The validation set must remain unseen during training to provide honest performance estimates. Using a fixed random seed ensures reproducibility, meaning you get the same split every time, making debugging easier and results comparable.

    In machine learning, you typically split data into training (80%) and validation (20%) sets. The training set teaches the model patterns, while the validation set measures how well those patterns generalize to new, unseen data. Stratification ensures both sets maintain the same class distribution.
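
    A reproducible, stratified 80/20 split looks roughly like this; data_loader.py does the equivalent for the lab's dataset, so treat the parameter values as assumptions.

    ```python
    # Sketch of a stratified train/validation split with a fixed random seed.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.7, 0.3], random_state=42)

    X_train, X_val, y_train, y_val = train_test_split(
        X, y,
        test_size=0.2,      # 20% held out for validation
        stratify=y,         # preserve the class balance in both sets
        random_state=42,    # fixed seed -> the same split every run
    )
    print(len(X_train), len(X_val))   # 800 training samples, 200 validation samples
    ```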

    Understanding Validation Metrics

    A single accuracy number hides important details. Consider a customer conversion model where only 10% of customers convert. A model predicting "no conversion" for everyone achieves 90% accuracy but catches zero actual converters!

    This is why you need multiple metrics:

    • Accuracy measures overall correctness across all predictions
    • Precision tells you how trustworthy positive predictions are (low false alarms)
    • Recall tells you how completely you catch actual positives (few missed cases)
    • F1 Score balances both precision and recall into a single score

    For customer conversion, missing a potential converter (low recall) means lost revenue, while false positives (low precision) waste marketing resources. Understanding these trade-offs is essential for choosing the right model.
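
    The accuracy trap is easy to demonstrate with made-up numbers (this sketch uses synthetic labels, not the lab's dataset):

    ```python
    # An "always predict no" model on a 90/10 imbalanced sample: high accuracy, zero recall.
    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    y_true = np.array([1] * 10 + [0] * 90)   # 10% converters, 90% non-converters
    y_pred = np.zeros(100, dtype=int)        # predict "no conversion" for everyone

    print(accuracy_score(y_true, y_pred))    # 0.9 -- looks impressive
    print(recall_score(y_true, y_pred))      # 0.0 -- finds no converters at all
    ```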

  3. Challenge

    Visualizing Classification Performance

    Understanding ROC and Precision-Recall Curves

    Classification models output probabilities, not just 0/1 predictions. By varying the threshold (e.g., "predict conversion if probability > 0.3"), you trade off between catching more converters (higher recall) and having more false alarms (lower precision).

    ROC Curve plots True Positive Rate (Recall) vs False Positive Rate:

    • A perfect classifier hugs the top-left corner
    • The diagonal line represents random guessing
    • AUC (Area Under Curve) summarizes performance: 1.0 = perfect, 0.5 = random

    Precision-Recall Curve is especially useful for imbalanced datasets:

    • Shows the trade-off between precision and recall at different thresholds
    • Better for situations where the positive class is rare (like customer conversion)
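
    The curve data itself comes from the model's predicted probabilities. The following self-contained sketch mirrors what roc_plotter.py and pr_plotter.py compute, but uses assumed synthetic data rather than the lab's files:

    ```python
    # Sketch: ROC/AUC and Precision-Recall curve data from predicted probabilities.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_curve, auc, precision_recall_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.7, 0.3], random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
    y_proba = model.predict_proba(X_val)[:, 1]       # probability of conversion

    fpr, tpr, _ = roc_curve(y_val, y_proba)
    print("ROC AUC:", auc(fpr, tpr))                 # 1.0 = perfect, 0.5 = random

    precision, recall, _ = precision_recall_curve(y_val, y_proba)
    # Plotting (fpr, tpr) and (recall, precision) with Matplotlib gives the two curves.
    ```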
  4. Challenge

    Diagnosing Model Issues with Learning Curves

    Understanding Learning Curves

    Learning curves show how model performance changes as training data increases. They are essential for diagnosing whether your model suffers from high bias (underfitting) or high variance (overfitting).

    High Bias (Underfitting):

    • Both training and validation scores are low
    • The curves converge but at a low performance level
    • Solution: Use a more complex model or add features

    High Variance (Overfitting):

    • Training score is high but validation score is much lower
    • Large gap between the two curves
    • Solution: Get more data, reduce model complexity, or add regularization
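
    A learning-curve sketch with scikit-learn's learning_curve helper looks like this; learning_curves.py produces the lab's actual plots, so the synthetic data and settings below are assumptions.

    ```python
    # Sketch: scores at increasing training-set sizes, the raw material for a learning curve.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.7, 0.3], random_state=42)

    train_sizes, train_scores, val_scores = learning_curve(
        GradientBoostingClassifier(random_state=42), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

    print(train_scores.mean(axis=1))   # both low -> high bias (underfitting)
    print(val_scores.mean(axis=1))     # large gap below training -> high variance (overfitting)
    ```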
  5. Challenge

    Analyzing Loss Curves and Residuals

    Understanding Loss Curves

    Loss curves track how the model's error changes during training. For iterative models like Gradient Boosting, watching the training and validation loss over iterations reveals overfitting early.

    Signs of Overfitting in Loss Curves:

    • Training loss keeps decreasing
    • Validation loss starts increasing after some point
    • The divergence point suggests where to stop training

    Signs of Healthy Training:

    • Both losses decrease together
    • They converge to similar low values
    • The gap between them remains small
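
    For Gradient Boosting, the per-iteration losses behind such a plot can be computed from staged predictions, as in this sketch (loss_plotter.py does this for the lab's own model and data):

    ```python
    # Sketch: training vs validation log loss at every boosting iteration.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.7, 0.3], random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    model = GradientBoostingClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

    train_loss = [log_loss(y_train, p) for p in model.staged_predict_proba(X_train)]
    val_loss = [log_loss(y_val, p) for p in model.staged_predict_proba(X_val)]
    # Overfitting shows up when val_loss bottoms out and rises while train_loss keeps falling.
    ```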

    Understanding Residuals and Model Complexity

    Residuals are the differences between predicted probabilities and actual outcomes. Analyzing residuals helps identify systematic prediction errors.

    What Residuals Reveal:

    • Symmetric distribution around zero suggests unbiased predictions
    • Skewed patterns indicate systematic over- or under-prediction
    • Large standard deviation suggests inconsistent predictions

    Model Complexity Curves show how performance changes as you vary hyperparameters like tree depth. This helps find the sweet spot that balances underfitting and overfitting.
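
    As a sketch of the complexity analysis (diagnostics.py implements the lab's version, including the residuals plot), scikit-learn's validation_curve can sweep max_depth and report training and validation scores at each setting:

    ```python
    # Sketch: model complexity curve over max_depth using validation_curve.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import validation_curve

    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.7, 0.3], random_state=42)

    depths = [1, 2, 3, 4, 5]
    train_scores, val_scores = validation_curve(
        GradientBoostingClassifier(random_state=42), X, y,
        param_name="max_depth", param_range=depths, cv=5, scoring="accuracy")

    best_depth = depths[int(np.argmax(val_scores.mean(axis=1)))]
    print("max_depth with the highest validation accuracy:", best_depth)
    ```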

  6. Challenge

    Applying Diagnostic Insights

    Understanding Hyperparameter Tuning

    Based on your diagnostic analysis, you can now make informed decisions about hyperparameters:

    • From complexity curves: The optimal max_depth is where validation accuracy peaks
    • From loss curves: The optimal n_estimators is where validation loss stops improving

    Tuning based on actual diagnostic data beats random guessing and helps you build models that generalize well to new data.
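
    As one illustrative (and assumed) way to turn the loss-curve insight into a concrete choice, you can pick n_estimators at the iteration where validation loss is lowest and retrain; evaluate_models.py compares the baseline and tuned models on the lab's actual data.

    ```python
    # Sketch: choosing n_estimators from the validation-loss curve, then retraining.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.7, 0.3], random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    baseline = GradientBoostingClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
    val_loss = [log_loss(y_val, p) for p in baseline.staged_predict_proba(X_val)]

    best_n = int(np.argmin(val_loss)) + 1     # iteration with the lowest validation loss
    tuned = GradientBoostingClassifier(n_estimators=best_n, random_state=42).fit(X_train, y_train)
    print("Chosen n_estimators:", best_n, "| validation accuracy:", tuned.score(X_val, y_val))
    ```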

  7. Challenge

    Conclusion

    Congratulations on completing the Model Evaluation lab! You have successfully learned to implement validation metrics, create diagnostic visualizations, and interpret results to identify and fix model issues.

    What You Have Accomplished

    Throughout this lab, you have:

    1. Configured Validation Splitting - Set up reproducible train/validation splits with stratification for honest model evaluation.
    2. Computed Validation Metrics - Calculated accuracy, precision, recall, and F1 score to evaluate model performance from multiple angles.
    3. Created ROC Curves - Computed ROC curve data and AUC scores to visualize model discrimination ability.
    4. Generated PR Curves - Built Precision-Recall curves for better evaluation on imbalanced datasets.
    5. Analyzed Learning Curves - Diagnosed bias and variance issues by examining how performance scales with data.
    6. Plotted Loss Curves - Tracked training progress and detected overfitting by monitoring loss over iterations.
    7. Created Residuals and Complexity Analysis - Identified systematic prediction errors and found optimal model complexity.
    8. Tuned Hyperparameters - Applied diagnostic insights to improve model performance.

    Key Takeaways

    • Accuracy Is Not Everything - High accuracy can hide poor performance on minority classes. Always use multiple metrics.
    • Learning Curves Reveal Bias/Variance - Use them to diagnose underfitting vs overfitting before tuning.
    • Loss Curves Detect Overfitting Early - Watch for divergence between training and validation loss.
    • Residuals Expose Systematic Errors - Analyze prediction patterns to find model weaknesses.
    • Complexity Curves Guide Hyperparameter Selection - Find the sweet spot that maximizes validation performance.

    Experiment Before You Go

    You still have time in the lab environment. Try these explorations:

    • Adjust classification thresholds and observe how metrics change
    • Modify the train/validation split ratio and compare results
    • Experiment with different model hyperparameters
    • Create custom diagnostic plots for specific use cases

    Take this opportunity to experiment and deepen your understanding!

About the author

Angel Sayani is a Certified Artificial Intelligence Expert®, CEO of IntellChromatics, author of two books on cybersecurity and IT certifications, world record holder, and a well-known cybersecurity and digital forensics expert.
