Model Evaluation: Applying Validation Metrics
Challenge
Introduction
Welcome to the Model Evaluation Code Lab. In this hands-on lab, you'll practice selecting appropriate metrics and applying advanced evaluation techniques. When you finish, you will have compared models using statistical tests and calibration analysis.
Selecting Metrics Based on Use Cases
Not all metrics are equally important for every problem. The right metric depends on the business context and the costs of different types of errors:
- Screening tests (e.g., cancer screening) prioritize recall because missing a positive case could be life-threatening
- Confirmatory tests (e.g., pre-surgery diagnosis) prioritize precision because false positives lead to unnecessary invasive procedures
- Balanced scenarios prioritize F1 score when both error types have similar costs
Understanding how to prioritize metrics based on use cases is essential for building effective machine learning solutions.
Advanced Evaluation Considerations
Beyond basic metrics, you need to consider:
- Probability Calibration - Are the model's confidence scores reliable? A 70% prediction should be correct about 70% of the time.
- Statistical Significance - Is the performance difference between two models real, or just due to random chance?
This lab teaches you to evaluate calibration using reliability diagrams and Brier score, and to compare models fairly using statistical significance tests.
Background
You are a data scientist at CarvedRock AI working on a medical diagnosis model. Different stakeholders have different priorities: clinicians want to catch all positive cases (high recall), while administrators want to minimize unnecessary follow-up tests (high precision). Your job is to select appropriate metrics, evaluate probability calibration, and fairly compare model alternatives.
The Dataset
This lab uses synthetic medical data generated by scikit-learn's make_classification function. The data simulates diagnostic test results with the following characteristics:
- 1,000 total samples (representing patient records)
- 10 features (representing diagnostic measurements)
- Imbalanced classes - About 70% negative and 30% positive cases (typical for diagnostic screening)
- Binary classification - Each patient either has the condition (1) or does not (0)
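For reference, here is a minimal sketch of how such a dataset could be generated and split. The exact parameters used by the lab's data_loader.py (such as random_state and the test set size) are assumptions for illustration.

```python
# Sketch of generating and splitting the synthetic medical dataset;
# parameter values are assumptions, not the lab's exact configuration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000,        # 1,000 patient records
    n_features=10,         # 10 diagnostic measurements
    weights=[0.7, 0.3],    # ~70% negative / ~30% positive (imbalanced)
    random_state=42,       # reproducible data generation
)

# Stratified split keeps the 70/30 class ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(f"Positive rate in test set: {y_test.mean():.2f}")
```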
Familiarizing with the Program Structure
The lab environment includes the following key files:
- data_loader.py - Loads and prepares the medical diagnosis dataset with train/test splits
- metrics_calculator.py - Computes core classification metrics (accuracy, precision, recall, F1)
- metric_selector.py - Selects and prioritizes metrics based on use case requirements
- calibration_evaluator.py - Evaluates probability calibration using Brier score and reliability diagrams
- calibration_methods.py - Applies calibration techniques like Platt Scaling
- statistical_comparison.py - Compares models using McNemar's statistical test
- evaluation_pipeline.py - Combines all techniques into a complete evaluation workflow
The environment uses Python 3 with scikit-learn for machine learning, NumPy for numerical operations, SciPy for statistical tests, and Matplotlib for visualization. All dependencies are pre-installed.
To run scripts, use the terminal with commands like python3 data_loader.py. Results are saved to the output/ directory.
Note: Complete tasks in order. Each task builds on the previous one. Test your code frequently by running the provided scripts to catch errors early.
info> If you get stuck on a task, there are solution files provided for you, located in the solution directory in your file tree.
Challenge
Setting Up Data and Selecting Metrics
Understanding Metric Selection
Choosing the right evaluation metric depends on the costs of different types of errors in your specific application domain. In medical diagnosis, errors have real consequences for patient health and safety:
- False Negatives (Missing a disease) - Patient goes untreated, condition worsens, potentially life-threatening consequences that could have been prevented with early detection
- False Positives (False alarm) - Unnecessary follow-up tests, patient anxiety and stress, wasted healthcare resources, and potential harm from unnecessary treatments
The key insight is that these error types rarely have equal costs. For screening tests where missing a case is dangerous and follow-up testing is available, you prioritize recall (sensitivity) to catch every potential case. For confirmatory tests where false alarms lead to invasive procedures or harmful treatments, you prioritize precision (positive predictive value) to be certain before acting.
Understanding Core Metrics
Before diving into advanced techniques, you need to understand the four core classification metrics that form the foundation of model evaluation:
- Accuracy - Overall correctness: (TP + TN) / Total. Measures the proportion of all predictions that were correct, but can be misleading when classes are imbalanced.
- Precision - When predicting positive, how often correct: TP / (TP + FP). Answers the question "Of all patients I flagged as positive, how many actually have the condition?"
- Recall - Of actual positives, how many found: TP / (TP + FN). Answers the question "Of all patients who actually have the condition, how many did I successfully identify?"
- F1 Score - Harmonic mean of precision and recall. Provides a single balanced metric when you need to consider both precision and recall together.
Each metric tells a different story about model performance. High accuracy can be misleading with imbalanced data since predicting the majority class always achieves high accuracy. High precision means few false alarms but may miss cases. High recall means few missed cases but may have many false alarms.
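As a quick reference, here is a minimal sketch of computing these four metrics with scikit-learn. The labels and predictions below are placeholders, not lab data; in the lab you will work with the test set from data_loader.py.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder arrays: y_test holds true labels, y_pred holds model predictions
y_test = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")   # (TP + TN) / Total
print(f"Precision: {precision_score(y_test, y_pred):.3f}")  # TP / (TP + FP)
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")     # TP / (TP + FN)
print(f"F1 score:  {f1_score(y_test, y_pred):.3f}")         # harmonic mean of precision and recall
```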
Challenge
Metric Selection for Use Cases
Understanding Use Case Requirements
Different medical scenarios require different metric priorities based on the consequences of each error type and the workflow that follows a prediction:
Screening Tests (e.g., cancer screening, initial disease detection):
- Goal: Cast a wide net to catch all potential cases for follow-up investigation and confirmatory testing
- Priority: High Recall (sensitivity) - minimize false negatives at all costs since missing a case could be fatal
- Acceptable trade-off: Some false positives are acceptable because follow-up tests will filter them out
Confirmatory Tests (e.g., pre-surgery biopsy, treatment decisions):
- Goal: Be absolutely certain before recommending invasive, expensive, or potentially harmful procedures to the patient
- Priority: High Precision (positive predictive value) - minimize false positives to avoid unnecessary harm
- Acceptable trade-off: Some false negatives are acceptable because patients can be retested later if symptoms persist
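To make the prioritization concrete, here is an illustrative sketch of a use-case-to-metric mapping. It is not the lab's metric_selector.py, which may structure this logic differently.

```python
# Illustrative use-case-based metric prioritization (hypothetical helper,
# not the lab's metric_selector.py implementation).
def select_primary_metric(use_case: str) -> str:
    """Return the metric to prioritize for a given medical use case."""
    priorities = {
        "screening": "recall",        # minimize false negatives (missed cases)
        "confirmatory": "precision",  # minimize false positives (unnecessary procedures)
        "balanced": "f1",             # both error types cost about the same
    }
    return priorities.get(use_case, "f1")

print(select_primary_metric("screening"))     # recall
print(select_primary_metric("confirmatory"))  # precision
```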
Challenge
Evaluating Probability Calibration
Understanding Calibration
Many classifiers output probabilities, but these probability estimates are not always well-calibrated or trustworthy. A well-calibrated model means: if the model predicts 70% probability for a group of patients, approximately 70% of them should actually have the condition. This alignment between predicted confidence and actual outcomes is called calibration.
Why Calibration Matters in Practice:
- Poorly calibrated probabilities mislead clinical decisions and can cause doctors to over-treat or under-treat patients based on unreliable confidence scores
- A probability of 0.8 should genuinely mean 80% confidence, not just "likely positive" - clinicians need to trust these numbers for treatment planning
- Calibration is essential for honest risk communication to patients, informed consent discussions, and shared decision-making in healthcare
Brier Score measures overall calibration quality and is one of the most common calibration metrics:
- Ranges from 0 (perfect calibration and discrimination) to 1 (worst possible performance)
- Lower Brier scores indicate better calibrated probability estimates
- Combines both calibration error (reliability) and refinement (how spread out the predictions are)
Understanding Reliability Diagrams
A reliability diagram (also called a calibration curve or calibration plot) visually shows how well-calibrated your model's probability estimates are across the entire range of predictions:
- X-axis: Mean predicted probability within each bin (e.g., predictions between 0.2-0.3 averaged together)
- Y-axis: Actual fraction of positive outcomes observed in each bin (empirical probability)
- Perfect calibration: All points fall exactly on the diagonal line where predicted probability equals observed frequency
Interpreting the Diagram to Diagnose Calibration Problems:
- Points above the diagonal: Model is under-confident - it predicts lower probabilities than the true rate, meaning it is too cautious
- Points below the diagonal: Model is over-confident - it predicts higher probabilities than the true rate, meaning it is too aggressive
- Points scattered far from diagonal: Model has poor calibration overall and probability estimates should not be trusted for decision-making
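Putting the two ideas together, here is a minimal sketch that computes a Brier score with brier_score_loss and plots a reliability diagram with calibration_curve. The model, data, and bin count are placeholder assumptions, not the lab's calibration_evaluator.py.

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Placeholder model and data so the sketch runs end to end
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probability of the positive class
y_prob = model.predict_proba(X_test)[:, 1]

# Brier score: mean squared difference between predicted probabilities and outcomes
print(f"Brier score: {brier_score_loss(y_test, y_prob):.3f}")

# Reliability diagram: observed positive fraction vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_test, y_prob, n_bins=10)
plt.plot(mean_pred, frac_pos, "o-", label="Model")
plt.plot([0, 1], [0, 1], "--", label="Perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.savefig("reliability_diagram.png")
```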
Challenge
Applying Calibration Techniques
Understanding Calibration Techniques
Several techniques exist to improve probability calibration:
Platt Scaling (Sigmoid Method): Platt Scaling fits a logistic regression model to the classifier's output probabilities. This transforms raw scores into better-calibrated probabilities.
- How it works: Fit P(y=1|score) = 1 / (1 + exp(A*score + B))
- Best for: Models that produce scores rather than probabilities (e.g., SVMs)
- Scikit-learn: CalibratedClassifierCV with method='sigmoid'
Isotonic Regression: A non-parametric approach that fits a non-decreasing step function to map predicted probabilities to calibrated ones.
- How it works: Learns a monotonic mapping from raw scores to calibrated probabilities
- Best for: When you have sufficient calibration data (requires more samples than Platt Scaling)
- Scikit-learn: CalibratedClassifierCV with method='isotonic'
Temperature Scaling: A simple post-hoc calibration method popular for neural networks that divides logits by a learned temperature parameter T.
- How it works: Softmax(logits/T) where T > 1 softens predictions, T < 1 sharpens them
- Best for: Deep learning models with softmax outputs
- Note: Not directly available in scikit-learn; typically implemented manually for neural networks
In this lab, you implement Platt Scaling because it is robust with small to medium-sized datasets and less prone to overfitting. Isotonic Regression requires larger datasets (1000+ samples) to avoid overfitting, while Temperature Scaling is primarily designed for neural networks and requires manual implementation.
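Below is a minimal sketch of Platt Scaling with CalibratedClassifierCV(method='sigmoid'). The base classifier, cv value, and data here are assumptions; the lab's calibration_methods.py may wire this differently.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder data; in the lab you would use the data_loader.py splits
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Platt Scaling: wrap a score-producing classifier and fit a sigmoid on its outputs
base = LinearSVC(max_iter=5000)  # outputs decision scores rather than probabilities
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

y_prob = calibrated.predict_proba(X_test)[:, 1]
print(f"Brier score after Platt Scaling: {brier_score_loss(y_test, y_prob):.3f}")
```

Swapping method="sigmoid" for method="isotonic" is all it takes to try Isotonic Regression instead, which is why the experiment suggested at the end of the lab is a one-line change.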
Challenge
Statistical Model Comparison
Understanding Statistical Significance Tests
When comparing two models, you need to determine if observed performance differences are statistically significant or if they could simply be due to random chance in the test data. Without statistical testing, you might wrongly conclude one model is better when the difference is just noise. Several tests are available for different experimental setups:
McNemar's Test:
- Compares two classifiers evaluated on the exact same test set by analyzing their disagreements
- Uses a 2x2 contingency table counting cases where classifiers disagree: one correct and one wrong
- Tests the null hypothesis that both classifiers have the same error rate on this population
- Best for: Comparing two classifiers on a single held-out test set without cross-validation
- Advantage: Only requires predictions, not probability scores, and handles paired observations properly
Paired t-test:
- Compares mean performance scores across multiple cross-validation folds where both models are evaluated on identical folds
- Each fold provides a paired observation (Model A's score on fold k, Model B's score on fold k)
- Tests whether the mean difference in performance is significantly different from zero across all folds
- Best for: When you have performance scores from k-fold cross-validation or repeated random splits
- Assumption: The differences between paired observations should be approximately normally distributed
When to use which test for your experiment:
- Use McNemar's test when comparing binary predictions on a single held-out test set
- Use paired t-test when comparing average metric scores across multiple CV folds or repeated experiments
In this lab, you'll implement McNemar's test as it directly compares classifier disagreements on the test set without requiring cross-validation.
Interpretation of p-values:
- p-value < 0.05: Statistically significant difference exists between models - unlikely due to chance alone
- p-value >= 0.05: No significant difference detected - observed differences could plausibly be random variation
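Here is a minimal, self-contained sketch of a continuity-corrected McNemar's test built from a 2x2 disagreement count and scipy.stats.chi2. The lab's statistical_comparison.py may use a different formulation (for example, an exact binomial version), and the example predictions below are placeholders.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test (continuity-corrected) on paired predictions from two classifiers."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    correct_a = pred_a == y_true
    correct_b = pred_b == y_true
    # Disagreement counts: b = A correct / B wrong, c = A wrong / B correct
    b = np.sum(correct_a & ~correct_b)
    c = np.sum(~correct_a & correct_b)
    statistic = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected chi-squared statistic
    p_value = chi2.sf(statistic, df=1)           # survival function gives the p-value
    return statistic, p_value

# Placeholder predictions for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0] * 10)
pred_a = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0] * 10)
pred_b = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 0] * 10)

stat, p = mcnemar_test(y_true, pred_a, pred_b)
print(f"chi2 = {stat:.3f}, p-value = {p:.3f}")
print("Significant difference" if p < 0.05 else "No significant difference detected")
```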
Challenge
Complete Evaluation Pipeline
Bringing It All Together
Now you'll combine all the evaluation techniques you have learned into a complete, end-to-end evaluation pipeline. In real-world machine learning projects, you rarely use just one metric or technique in isolation. A comprehensive evaluation pipeline systematically:
- Loads and splits data with proper stratification to maintain class balance in both training and test sets
- Trains baseline and calibrated models to compare the effect of calibration on probability estimates
- Computes metrics based on use case priority so you focus on what matters most for your application
- Evaluates calibration quality using Brier score and reliability diagrams to assess probability trustworthiness
- Performs statistical comparison using McNemar's test to determine if model differences are real or just noise
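For orientation, here is a condensed sketch of how those steps could fit together in code. It stands in for the lab's evaluation_pipeline.py (which composes the earlier modules); the model choices and parameters are assumptions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, precision_score, recall_score
from sklearn.model_selection import train_test_split

# 1. Load and split data with stratification to preserve the class balance
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 2. Train a baseline model and a Platt-scaled (sigmoid-calibrated) version
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# 3. Report the metric matching the use case priority
y_pred = baseline.predict(X_test)
print(f"Recall (screening priority):       {recall_score(y_test, y_pred):.3f}")
print(f"Precision (confirmatory priority): {precision_score(y_test, y_pred):.3f}")

# 4. Compare calibration quality via Brier score
for name, model in [("baseline", baseline), ("calibrated", calibrated)]:
    brier = brier_score_loss(y_test, model.predict_proba(X_test)[:, 1])
    print(f"Brier score ({name}): {brier:.3f}")

# 5. Statistical comparison would follow here, e.g. McNemar's test on
#    baseline.predict(X_test) vs. calibrated.predict(X_test) (see the earlier sketch).
```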
Challenge
Conclusion
Congratulations on completing the Model Evaluation lab! You have successfully learned to select appropriate metrics, evaluate probability calibration, and compare models using statistical tests.
What You Have Accomplished
Throughout this lab, you have:
- Configured Data Loading - Set up reproducible train/test splits for honest model evaluation.
- Computed Core Metrics - Calculated accuracy, precision, recall, and F1 score to evaluate model performance.
- Implemented Metric Selection - Selected primary metrics based on screening vs confirmatory use cases.
- Evaluated Calibration - Computed Brier score to assess probability calibration quality.
- Generated Reliability Diagrams - Created calibration curves to visualize calibration issues.
- Applied Platt Scaling - Used calibration techniques to improve probability estimates.
- Performed Statistical Tests - Applied McNemar's test to fairly compare two classifiers.
- Built Evaluation Pipeline - Combined all techniques into a complete evaluation workflow.
Key Takeaways
- Metric Selection Depends on Use Case - Screening tests prioritize recall; confirmatory tests prioritize precision.
- Calibration Matters for Probabilities - Brier score and reliability diagrams reveal calibration issues.
- Statistical Tests Ensure Fair Comparison - McNemar's test determines if model differences are significant.
- No Single Metric Tells Everything - Combine multiple metrics and techniques for comprehensive evaluation.
Experiment Before You Go
You still have time in the lab environment. Try these explorations:
- Adjust the classification threshold and observe how metrics change
- Try Isotonic Regression calibration instead of Platt Scaling (change method='sigmoid' to method='isotonic' in Task 6)
- Implement a paired t-test using cross-validation scores with scipy.stats.ttest_rel
- Experiment with different test set sizes and observe calibration effects
- Compare more than two models using paired comparisons
- Research temperature scaling for neural network calibration
Take this opportunity to experiment and deepen your understanding!