Model Evaluation: Implementing Validation Metrics
In this lab, you'll practice implementing validation metrics and visualizing model evaluation. When you're finished, you'll have diagnosed model issues using validation results and plots.
Challenge
Introduction
Welcome to the Model Evaluation: Implementing Validation Metrics Code Lab. In this hands-on lab, you'll learn to implement validation metrics, visualize model performance using diagnostic plots, and interpret results to identify issues like overfitting and bias. By the end, you'll be able to confidently evaluate and diagnose machine learning models for real-world applications.
What is Model Evaluation?
When you build a machine learning model, you need to answer a critical question: "How well does this model actually work?" Model evaluation is the process of measuring your model's performance to determine if it is good enough for real-world use.
Validation metrics are the specific measurements you use to evaluate your model. Different metrics answer different questions:
- Accuracy - What percentage of predictions were correct overall?
- Precision - When the model predicts "yes," how often is it right?
- Recall - Of all the actual "yes" cases, how many did the model find?
- F1 Score - A balanced combination of precision and recall
No single metric tells the whole story. A model might have high accuracy but completely miss rare but important cases. This lab teaches you to use multiple metrics together for a complete picture.
Background
You are a data scientist at Globomantics evaluating a customer conversion prediction model. The validation metrics show unexpected results, and you need to diagnose the issue. You will compute metrics on the validation dataset, analyze learning curves to identify overfitting or underfitting, and use visualization insights to improve model performance.
Your team has built a machine learning model using the Gradient Boosting algorithm, a powerful technique that combines many simple decision trees into one strong predictor. Now you need to evaluate how well this model performs before deploying it to production.
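The model is already trained for you in model_trainer.py, so you won't write this code yourself, but as a rough sketch of the kind of estimator it builds (the hyperparameter values below are illustrative assumptions, not the lab's actual settings):

```python
# Rough sketch of the kind of estimator model_trainer.py builds.
# These hyperparameter values are assumptions, not the lab's configuration.
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,   # number of boosting stages (simple trees combined)
    max_depth=3,        # each individual tree is deliberately shallow
    learning_rate=0.1,  # how much each new tree contributes
    random_state=42,    # fixed seed for reproducible results
)
```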
The Dataset
This lab uses synthetic data generated by scikit-learn's make_classification function. The data simulates customer behavior with the following characteristics (a generation sketch follows the list):
- 1,000 total samples (representing website visitors)
- 10 features (representing visitor attributes like time on site, pages viewed, etc.)
- Imbalanced classes - About 70% non-converters and 30% converters (typical for real conversion data)
- Binary classification - Each visitor either converts (1) or does not convert (0)
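As a rough illustration of how data with these characteristics can be generated (the exact arguments used in data_loader.py may differ):

```python
# Sketch of generating a comparable dataset; data_loader.py's exact
# arguments may differ from these assumptions.
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,      # 1,000 simulated website visitors
    n_features=10,       # 10 visitor attributes
    weights=[0.7, 0.3],  # ~70% non-converters, ~30% converters
    random_state=42,     # reproducible data
)
print("Class balance:", np.bincount(y) / len(y))  # roughly [0.7, 0.3]
```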
Familiarizing Yourself with the Program Structure
The lab environment includes the following key files:
- data_loader.py - Loads and prepares the customer conversion dataset with train/validation splits
- model_trainer.py - Trains the Gradient Boosting classifier (pre-configured, no changes needed)
- metrics_calculator.py - Computes validation metrics (accuracy, precision, recall, F1)
- roc_plotter.py - Computes and plots ROC curves
- pr_plotter.py - Computes and plots Precision-Recall curves
- learning_curves.py - Generates learning curves to diagnose bias/variance issues
- loss_plotter.py - Plots training and validation loss over iterations
- diagnostics.py - Creates residuals plots and model complexity analysis
- evaluate_models.py - Compares baseline vs tuned model performance
The environment uses Python 3 with scikit-learn for machine learning, NumPy for numerical operations, and Matplotlib for visualization. All dependencies are pre-installed.
To run scripts, use the terminal with commands like python3 data_loader.py. Results are saved to the output/ directory.
Note: Complete tasks in order. Each task builds on the previous one. Test your code frequently by running the provided scripts to catch errors early.
info > If you get stuck on a task, there are solution files provided for you located in the solution directory in your file tree.
Challenge
Setting Up Data and Computing Basic Metrics
Understanding Validation Data
Before evaluating any model, you need properly split data. The validation set must remain unseen during training to provide honest performance estimates. Using a fixed random seed ensures reproducibility, meaning you get the same split every time, making debugging easier and results comparable.
In machine learning, you typically split data into training (80%) and validation (20%) sets. The training set teaches the model patterns, while the validation set measures how well those patterns generalize to new, unseen data. Stratification ensures both sets maintain the same class distribution.
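A minimal sketch of such a split, reusing the X and y from the dataset sketch above (the exact call in data_loader.py may differ):

```python
# Sketch of a reproducible, stratified 80/20 split; reuses X, y from the
# make_classification sketch above. data_loader.py's exact call may differ.
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,    # 20% held out for validation
    stratify=y,       # keep the ~70/30 class ratio in both splits
    random_state=42,  # fixed seed so the split is reproducible
)
print(len(X_train), "training samples,", len(X_val), "validation samples")
```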
Understanding Validation Metrics
A single accuracy number hides important details. Consider a customer conversion model where only 10% of customers convert. A model predicting "no conversion" for everyone achieves 90% accuracy but catches zero actual converters!
This is why you need multiple metrics:
- Accuracy measures overall correctness across all predictions
- Precision tells you how trustworthy positive predictions are (low false alarms)
- Recall tells you how completely you catch actual positives (few missed cases)
- F1 Score balances both precision and recall into a single score
For customer conversion, missing a potential converter (low recall) means lost revenue, while false positives (low precision) waste marketing resources. Understanding these trade-offs is essential for choosing the right model.
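Here is a minimal sketch of computing all four metrics with scikit-learn, assuming the train/validation split from the sketch above; metrics_calculator.py may organize this differently:

```python
# Sketch of the four core metrics; assumes X_train, X_val, y_train, y_val
# from the split sketch above. metrics_calculator.py may differ in structure.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_val)

print(f"Accuracy:  {accuracy_score(y_val, y_pred):.3f}")   # overall correctness
print(f"Precision: {precision_score(y_val, y_pred):.3f}")  # trustworthiness of positive predictions
print(f"Recall:    {recall_score(y_val, y_pred):.3f}")     # share of actual converters found
print(f"F1 score:  {f1_score(y_val, y_pred):.3f}")         # balance of precision and recall
```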
Challenge
Visualizing Classification Performance
Understanding ROC and Precision-Recall Curves
Classification models output probabilities, not just 0/1 predictions. By varying the threshold (e.g., "predict conversion if probability > 0.3"), you trade off between catching more converters (higher recall) and having more false alarms (lower precision).
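A small sketch of that trade-off, reusing the fitted model and validation split from the earlier sketches (the 0.3 threshold is just an example):

```python
# Sketch of varying the decision threshold; assumes the fitted `model`
# and X_val, y_val from the earlier sketches.
from sklearn.metrics import precision_score, recall_score

proba = model.predict_proba(X_val)[:, 1]   # predicted probability of conversion
for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_val, y_pred):.3f}, "
          f"recall={recall_score(y_val, y_pred):.3f}")
```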
ROC Curve plots True Positive Rate (Recall) vs False Positive Rate:
- A perfect classifier hugs the top-left corner
- The diagonal line represents random guessing
- AUC (Area Under Curve) summarizes performance: 1.0 = perfect, 0.5 = random
Precision-Recall Curve is especially useful for imbalanced datasets (both curves are sketched after this list):
- Shows the trade-off between precision and recall at different thresholds
- Better for situations where the positive class is rare (like customer conversion)
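A minimal sketch of computing both curves with scikit-learn, reusing the fitted model from the earlier sketches (roc_plotter.py and pr_plotter.py may structure and save their plots differently):

```python
# Sketch of ROC and Precision-Recall curves; assumes the fitted `model`
# and X_val, y_val from the earlier sketches. The output filename is
# illustrative, not necessarily what the lab scripts use.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve

proba = model.predict_proba(X_val)[:, 1]
fpr, tpr, _ = roc_curve(y_val, proba)
precision, recall, _ = precision_recall_curve(y_val, proba)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_val, proba):.3f}")
ax1.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
ax1.set_xlabel("False Positive Rate"); ax1.set_ylabel("True Positive Rate")
ax1.set_title("ROC Curve"); ax1.legend()

ax2.plot(recall, precision)
ax2.set_xlabel("Recall"); ax2.set_ylabel("Precision")
ax2.set_title("Precision-Recall Curve")

fig.savefig("output/curves_sketch.png")   # hypothetical filename
```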
Challenge
Diagnosing Model Issues with Learning Curves
Understanding Learning Curves
Learning curves show how model performance changes as training data increases. They are essential for diagnosing whether your model suffers from high bias (underfitting) or high variance (overfitting). A minimal sketch for generating them follows the two lists below.
High Bias (Underfitting):
- Both training and validation scores are low
- The curves converge but at a low performance level
- Solution: Use a more complex model or add features
High Variance (Overfitting):
- Training score is high but validation score is much lower
- Large gap between the two curves
- Solution: Get more data, reduce model complexity, or add regularization
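One way to generate these curves is scikit-learn's learning_curve, sketched below with assumed settings; learning_curves.py may use different sizes, cross-validation, or scoring:

```python
# Sketch of a learning curve; assumes X, y from the dataset sketch above.
# learning_curves.py may use different sizes, cross-validation, or scoring.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

plt.plot(sizes, train_scores.mean(axis=1), marker="o", label="training score")
plt.plot(sizes, val_scores.mean(axis=1), marker="o", label="validation score")
plt.xlabel("Training set size"); plt.ylabel("Accuracy"); plt.legend()
plt.savefig("output/learning_curve_sketch.png")   # hypothetical filename
```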
Challenge
Analyzing Loss Curves and Residuals
Understanding Loss Curves
Loss curves track how the model's error changes during training. For iterative models like Gradient Boosting, watching the training and validation loss over iterations reveals overfitting early.
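For Gradient Boosting, one way to recover per-iteration losses is staged_predict_proba, sketched below; loss_plotter.py may take a different approach:

```python
# Sketch of training vs validation loss per boosting iteration; assumes the
# split from the earlier sketches. The n_estimators value is an assumption.
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

gb = GradientBoostingClassifier(n_estimators=200, random_state=42)
gb.fit(X_train, y_train)

# staged_predict_proba yields the model's predictions after each boosting stage.
train_loss = [log_loss(y_train, p) for p in gb.staged_predict_proba(X_train)]
val_loss = [log_loss(y_val, p) for p in gb.staged_predict_proba(X_val)]

plt.plot(train_loss, label="training loss")
plt.plot(val_loss, label="validation loss")
plt.xlabel("Boosting iteration"); plt.ylabel("Log loss"); plt.legend()
plt.savefig("output/loss_curves_sketch.png")   # hypothetical filename
```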
Signs of Overfitting in Loss Curves:
- Training loss keeps decreasing
- Validation loss starts increasing after some point
- The divergence point suggests where to stop training
Signs of Healthy Training:
- Both losses decrease together
- They converge to similar low values
- The gap between them remains small
Understanding Residuals and Model Complexity
Residuals are the differences between predicted probabilities and actual outcomes. Analyzing residuals helps identify systematic prediction errors.
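A minimal residuals sketch, reusing the fitted model and validation split from the earlier sketches (diagnostics.py likely goes further than this):

```python
# Sketch of residual analysis; assumes the fitted `model` and X_val, y_val
# from the earlier sketches.
import matplotlib.pyplot as plt

residuals = y_val - model.predict_proba(X_val)[:, 1]   # actual minus predicted probability
print(f"Mean residual: {residuals.mean():.3f}")        # near zero suggests little bias
print(f"Std of residuals: {residuals.std():.3f}")      # large spread = inconsistent predictions

plt.hist(residuals, bins=30)
plt.xlabel("Residual (actual - predicted probability)"); plt.ylabel("Count")
plt.savefig("output/residuals_sketch.png")   # hypothetical filename
```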
What Residuals Reveal:
- Symmetric distribution around zero suggests unbiased predictions
- Skewed patterns indicate systematic over- or under-prediction
- Large standard deviation suggests inconsistent predictions
Model Complexity Curves show how performance changes as you vary hyperparameters like tree depth. This helps find the sweet spot that balances underfitting and overfitting.
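scikit-learn's validation_curve can produce such a curve; the sketch below varies max_depth over an assumed range, which may differ from what diagnostics.py uses:

```python
# Sketch of a model complexity curve over max_depth; assumes X, y from the
# dataset sketch above. The depth range is an illustrative assumption.
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import validation_curve

depths = [1, 2, 3, 4, 5, 6]
train_scores, val_scores = validation_curve(
    GradientBoostingClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="accuracy")

plt.plot(depths, train_scores.mean(axis=1), marker="o", label="training accuracy")
plt.plot(depths, val_scores.mean(axis=1), marker="o", label="validation accuracy")
plt.xlabel("max_depth"); plt.ylabel("Accuracy"); plt.legend()
plt.savefig("output/complexity_curve_sketch.png")   # hypothetical filename
```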
Challenge
Applying Diagnostic Insights
Understanding Hyperparameter Tuning
Based on your diagnostic analysis, you can now make informed decisions about hyperparameters:
- From complexity curves: The optimal max_depth is where validation accuracy peaks
- From loss curves: The optimal n_estimators is where validation loss stops improving
Tuning based on actual diagnostic data beats random guessing and helps you build models that generalize well to new data.
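As a sketch of this final step (the tuned values below are placeholders; substitute whatever your own complexity and loss curves suggest, and note that evaluate_models.py may structure the comparison differently):

```python
# Sketch of comparing a baseline model against a tuned one; assumes the split
# from the earlier sketches. The tuned hyperparameter values are placeholders --
# substitute what your complexity and loss curves suggest.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

baseline = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
tuned = GradientBoostingClassifier(max_depth=3, n_estimators=150,
                                   random_state=42).fit(X_train, y_train)

for name, clf in [("baseline", baseline), ("tuned", tuned)]:
    print(f"{name}: F1 = {f1_score(y_val, clf.predict(X_val)):.3f}")
```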
Challenge
Conclusion
Congratulations on completing the Model Evaluation lab! You have successfully learned to implement validation metrics, create diagnostic visualizations, and interpret results to identify and fix model issues.
What You Have Accomplished
Throughout this lab, you have:
- Configured Validation Splitting - Set up reproducible train/validation splits with stratification for honest model evaluation.
- Computed Validation Metrics - Calculated accuracy, precision, recall, and F1 score to evaluate model performance from multiple angles.
- Created ROC Curves - Computed ROC curve data and AUC scores to visualize model discrimination ability.
- Generated PR Curves - Built Precision-Recall curves for better evaluation on imbalanced datasets.
- Analyzed Learning Curves - Diagnosed bias and variance issues by examining how performance scales with data.
- Plotted Loss Curves - Tracked training progress and detected overfitting by monitoring loss over iterations.
- Created Residuals and Complexity Analysis - Identified systematic prediction errors and found optimal model complexity.
- Tuned Hyperparameters - Applied diagnostic insights to improve model performance.
Key Takeaways
- Accuracy Is Not Everything - High accuracy can hide poor performance on minority classes. Always use multiple metrics.
- Learning Curves Reveal Bias/Variance - Use them to diagnose underfitting vs overfitting before tuning.
- Loss Curves Detect Overfitting Early - Watch for divergence between training and validation loss.
- Residuals Expose Systematic Errors - Analyze prediction patterns to find model weaknesses.
- Complexity Curves Guide Hyperparameter Selection - Find the sweet spot that maximizes validation performance.
Experiment Before You Go
You still have time in the lab environment. Try these explorations:
- Adjust classification thresholds and observe how metrics change
- Modify the train/validation split ratio and compare results
- Experiment with different model hyperparameters
- Create custom diagnostic plots for specific use cases
Take this opportunity to experiment and deepen your understanding!