Lab: Anomaly Detection Pipeline
Modern industrial systems continuously generate machine temperature data, but detecting the difference between harmless noise and the early signs of failure is far from simple. In this Hands-on Code Lab, you will build a complete anomaly detection pipeline using real-world industrial machine temperature telemetry from the Numenta Anomaly Benchmark (NAB). Starting from raw time-series sensor data, you will learn how to identify abnormal machine behavior, engineer meaningful features, and train unsupervised anomaly detection models including Isolation Forest and Local Outlier Factor (LOF). Along the way, you will compare anomaly scoring strategies, calibrate alert thresholds, simulate sensor drift, and retrain models to maintain detection performance as machine behavior evolves over time. By the end of the lab, you will have built a production-style monitoring workflow capable of detecting subtle temperature anomalies, identifying early warning signals of system failure, and transforming raw telemetry into actionable operational insights.
Introduction
Welcome to the machine temperature Anomaly Detection Pipeline lab.
You've just joined an industrial AI team as an ML Engineer responsible for monitoring the health of a critical machine in a production plant. Your mission: build an anomaly detection system that can catch early warning signs of failure before things go wrong.
The system you will work on is based on real industrial machine temperature telemetry from the Numenta Anomaly Benchmark (NAB). It contains timestamped temperature readings collected under realistic operating conditions: normal behavior, subtle degradation patterns, and known failure events defined as anomaly windows.
Your job is not just to detect "weird points in data." It's to understand the heartbeat of a machine, learn what "normal" looks like in context, and notice when that heartbeat starts to drift.
And yes… missing an anomaly here is not just a bad metric. It could mean downtime, maintenance chaos, or an angry plant manager asking why everything overheated at 3 AM.
No pressure 😅.
What you'll build
This lab takes you from raw sensor telemetry to a production-style anomaly detection pipeline.
You will progressively move from intuition (“this looks strange”) to structured machine monitoring:
- engineering features that expose hidden behavior,
- training unsupervised anomaly detection models,
- converting anomaly scores into operational alerts,
- evaluating false alarms and missed failures,
- and maintaining the system as machine behavior evolves over time.
Unlike supervised ML, anomaly detection often has limited labels, multiple definitions of “abnormal,” shifting operating conditions, and no universal alert threshold.
Why anomaly detection is different from supervised learning
No labeled training set
- The model must learn “normal” behavior without being explicitly shown failures.
No single definition of anomaly
- A sudden spike, an unusual reading for a specific hour, and a slow collective drift are all different types of abnormal behavior.
No universal threshold
- The same score distribution can produce either a calm monitoring system or constant alert fatigue depending on where the alert threshold is placed.
No static environment
- Operating conditions change, sensors drift, and models gradually become outdated.
Anomaly detection is not just a model. It is a pipeline for framing data, scoring behavior with multiple models, calibrating alerts, evaluating operational impact and maintaining stability over time.
The four stages of the lab
| Step | Theme | What you build |
|------|-------|----------------|
| 1 | Data framing and feature engineering | Transform raw temperature signals into engineered features and identify point, contextual, and collective anomalies |
| 2 | The detection duel | Train Isolation Forest and Local Outlier Factor (LOF) and compare how each model interprets abnormal behavior |
| 3 | Thresholding and evaluation | Convert scores into alerts, evaluate precision/recall using NAB labels, and analyze operational trade-offs |
| 4 | Drift simulation and maintenance | Simulate sensor drift, observe performance degradation, retrain models, and restore monitoring stability |
Core elements
In this lab, you will work with:
- Raw signal: timestamped machine temperature readings sampled every 5 minutes
- Feature engineering: cyclical time features, rolling statistics, deviation, rolling z-score, and slope
- Unsupervised models: Isolation Forest for global isolation patterns and Local Outlier Factor (LOF) for local density patterns
- Anomaly scores and thresholds: continuous scores converted into operational alerts
- Weak ground truth: NAB anomaly windows used only for evaluation; models train on timestamps and temperature values only
- Drift and retraining: synthetic drift and sliding-window retraining to maintain model performance
Three types of anomalies you will meet
Industrial systems rarely fail in only one way.
This lab introduces three important categories of anomalies:
| Type | Intuition | Example in this dataset |
|------|-----------|-------------------------|
| Point | A single reading far from the global norm | Sudden extreme temperature spike/drop |
| Contextual | A value that is unusual in its specific context | A “normal-looking” value occurring at an abnormal time |
| Collective | A sustained abnormal pattern over time | Gradual drift or prolonged deviation from baseline |
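To make the point/contextual distinction in the table concrete, here is a minimal sketch on synthetic data (not the NAB file). It assumes a DataFrame with `timestamp` and `value` columns, which is a guess at the lab's column names, and plants one reading that looks ordinary globally but is unusual for its hour of day.

```python
import numpy as np
import pandas as pd

# Synthetic week of 5-minute readings with a daily cycle (illustrative only).
ts = pd.date_range("2024-01-01", periods=7 * 288, freq="5min")
hour = (ts.hour + ts.minute / 60.0).to_numpy()
rng = np.random.default_rng(0)
value = 80 + 10 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 0.5, len(ts))
df = pd.DataFrame({"timestamp": ts, "value": value})

# Plant a contextual anomaly: a peak-level value at 18:00, the daily trough.
planted = 3 * 288 + 18 * 12
df.loc[planted, "value"] = 88.0

# Point view: distance from the global mean, in standard deviations.
global_z = (df["value"] - df["value"].mean()) / df["value"].std()

# Contextual view: distance from the mean of readings at the same hour.
by_hour = df.groupby(df["timestamp"].dt.hour)["value"]
hour_z = (df["value"] - by_hour.transform("mean")) / by_hour.transform("std")

print(f"global z: {global_z[planted]:.2f}, hour-of-day z: {hour_z[planted]:.2f}")
```

The planted reading is unremarkable in the global view but is the strongest outlier once the hour of day is taken into account. A collective view could apply the same idea to a rolling mean rather than to single readings.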
Different models react differently to these anomaly shapes.
For example:
- Isolation Forest may flag an entire abnormal descent corridor.
- LOF may only flag the single most isolated point inside that corridor.
Understanding why models disagree is one of the most important lessons in this lab.
Operational reality: false alarms vs. missed failures
Once you have scores, the hard question is not "can we detect anomalies?" but "when do we alert?"
- False positives (FP) — The system raises alarms when nothing is wrong. Operators lose trust. Alert fatigue appears.
- False negatives (FN) — Real failures occur silently without any alert. The machine may degrade or fail before intervention happens.
In industrial monitoring, missed failures are often more expensive than false alarms, but a system that alerts every hour quickly becomes unusable.
This lab therefore focuses not only on detection quality, but also on:
- alert frequency,
- alert stability,
- noisy detection behavior,
- and operational credibility.
Learning objectives
By the end of this lab, you should be able to:
- Prepare industrial time-series data for anomaly detection
- Engineer causal features that describe machine behavior over time
- Train and compare Isolation Forest and LOF models
- Convert anomaly scores into operational alerts
- Evaluate alert quality using precision, recall, and alert rate
- Simulate drift and retrain models using recent behavioral windows
- Explain why thresholds and retraining strategies must evolve together in production
How the lab is structured
You will work primarily inside the `src/` directory. Each step contains guided TODO sections that you complete progressively.
Plots, orchestration logic, and narrative outputs are already provided to help you focus on the core anomaly detection concepts.
| File | Your focus |
|------|------------|
| `src/step1_feature_engineering.py` | Data framing, anomaly exploration, feature engineering |
| `src/step2_model_training.py` | Isolation Forest, LOF, anomaly scoring |
| `src/step3_thresholding_evaluation.py` | Thresholding, metrics, operational evaluation |
| `src/step4_drift_simulation.py` | Drift simulation, retraining, maintenance |

Generated plots are automatically saved inside the `output/` folder.
You are now ready to transform raw telemetry into meaningful signals, train unsupervised anomaly detectors, and build a monitoring pipeline capable of detecting abnormal machine behavior before failure happens.
Let’s begin!
If you get stuck at any point in the lab, solution files are provided in the `solution` folder in your file tree.
Step 1 — Data framing and feature engineering
Read the Introduction first. This page is your working guide for Step 1 only.
The situation
Step 1 is where the pipeline starts: one CSV of temperature readings becomes a structured view of how the machine behaves over time.
You already know the big picture: there are no labels during training, anomalies can appear in several forms, and the models in Step 2 need meaningful features. In this step, you make that concrete.
Your focus in this step:
- Frame the data: load, sort, and visualize the signal so you can see its daily rhythm and baseline.
- Identify three anomaly patterns in the data: point, contextual, and collective (definitions are in the Introduction; you implement the logic).
- Engineer features: time cycles, rolling baselines, deviation, rolling z-score, and slope, in a causal way (only past and present, never the future).
- Scale and keep the scaler: standardize features for modeling, and return the fitted `StandardScaler` for Step 4.
Models cannot detect what they cannot see. Raw temperature alone hides important patterns; this step builds the lens 🔎 they will look through.
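The causal feature idea above can be sketched as follows. This is an illustrative version only: the column names (`timestamp`, `value`), the window length, and the exact feature set are assumptions, and the lab's `src/step1_feature_engineering.py` may define its 10 columns differently.

```python
import numpy as np
import pandas as pd

def add_causal_features(df, window=12):
    """Add time-cycle and rolling features using only past and present rows."""
    out = df.copy()
    hour = out["timestamp"].dt.hour + out["timestamp"].dt.minute / 60.0
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)

    # pandas rolling windows are trailing by default, so they never look ahead.
    # min_periods=1 avoids NaN at the start without back-filling from the future.
    roll = out["value"].rolling(window, min_periods=1)
    out["rolling_mean"] = roll.mean()
    out["rolling_std"] = roll.std().fillna(0.0)  # std of one point is undefined
    out["deviation"] = out["value"] - out["rolling_mean"]

    # Rolling z-score, guarded against division by zero on flat stretches.
    safe_std = out["rolling_std"].replace(0.0, np.nan)
    out["rolling_z"] = (out["deviation"] / safe_std).fillna(0.0)

    # One-step slope: change since the previous reading.
    out["slope"] = out["value"].diff().fillna(0.0)
    return out

ts = pd.date_range("2024-01-01", periods=100, freq="5min")
df = pd.DataFrame({"timestamp": ts, "value": np.linspace(70.0, 80.0, 100)})
feats = add_causal_features(df)
```

A quick leakage check is to overwrite values from some row onward and confirm that the features of every earlier row are unchanged.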
What you'll build
By the end of this step you will have:
- A clean, chronologically sorted DataFrame of 22,695 temperature readings (5-minute sampling)
- Plots of the full signal and zoomed regions (saved to `output/`)
- Functions that surface one example each of point, contextual, and collective anomaly behavior
- A feature-enriched table with 10 engineered columns (plus matching `scaled_*` columns)
- A fitted `StandardScaler` returned with the DataFrame for reuse in Step 4
Running `run_step1` (pre-written) ties everything together and prints validation output once your functions are complete.
Learning outcomes
After this step you should be able to:
- Load and prepare industrial time-series data for downstream modeling
- Implement the three anomaly identification strategies used in the lab
- Build rolling and time-based features without temporal leakage (`min_periods=1`, no `.bfill()` on past rows)
- Explain why `rolling_z` and `slope` add information beyond raw deviation
- Fit or reuse a scaler correctly and understand why Step 4 depends on the same object
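The "fit or reuse" pattern for the scaler might look like the sketch below. The helper name `scale_features` and the synthetic data are hypothetical; the point is only to show why reusing the fitted `StandardScaler` keeps a later shift visible, while refitting would normalize it away.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scale_features(X, scaler=None):
    """Fit a new scaler when none is given; otherwise reuse the fitted one."""
    if scaler is None:
        scaler = StandardScaler().fit(X)
    return scaler.transform(X), scaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=80.0, scale=5.0, size=(500, 1))
X_drifted = X_train + 10.0  # hypothetical upward sensor drift

X_train_scaled, scaler = scale_features(X_train)
frozen, _ = scale_features(X_drifted, scaler=scaler)  # drift stays visible
refit, _ = scale_features(X_drifted)                  # drift is normalized away

print(f"frozen scaler mean: {frozen.mean():.2f}, refit scaler mean: {refit.mean():.2f}")
```

With the frozen scaler the drifted data sits about two standard deviations above the original baseline; with a refit scaler it is centered at zero again, and a downstream model would no longer see the shift.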
Your tasks
Open `src/step1_feature_engineering.py` and fill in every `# TODO` zone.
Plots are pre-written and save automatically to `output/` when you run the file.
Run your experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root.
  - This ensures that all imports work correctly.
- Run the experiment script:
  `python3 src/step1_feature_engineering.py`

Expected output
- Shape: `(22695, 2)` with no missing values in the raw data
- Three anomaly identifications printed with timestamps and values
- First 10 rows of engineered features with 10 columns
- `(df, scaler)` returned from `run_step1`
- Five plots saved to `output/`:
  - `raw_temperature.png`
  - `zoom_point_anomaly_region.png`
  - `zoom_contextual_anomaly_region.png`
  - `zoom_collective_anomaly_region.png`
  - `temperature_vs_rolling.png`
Step 2 — The detection duel: Isolation Forest vs. Local Outlier Factor
Complete the Introduction and Step 1 before starting this step.
The situation
Step 1 gave the models something to look at: a 10-dimensional feature vector at every timestamp. Step 2 asks a harder question: which readings look abnormal in that feature space?
The lab overview introduced Isolation Forest as a global isolation approach and Local Outlier Factor (LOF) as a local density approach. Here, you train both models on the same matrix, put their scores on a common scale, and compare where they agree and where they do not.
That disagreement is intentional. Isolation Forest often flags sustained regions of unusual behavior, while LOF may focus on locally isolated points. Neither model is automatically right; understanding the difference is part of the lab.
Your focus in this step:
- Train Isolation Forest and LOF on the scaled features from Step 1.
- Score every row with a single rule: higher score means more suspicious. Negate `decision_function` where needed.
- Normalize scores to `[0, 1]` so the two models are comparable.
- Compare overlap using each model's own threshold, not one shared cutoff.
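The training and scoring steps listed above can be sketched as follows on synthetic data. The hyperparameters (`n_estimators=100`, `n_neighbors=20`) and the helper name `normalize_scores` are assumptions for illustration, not the lab's settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def normalize_scores(scores):
    """Min-max scale to [0, 1]; return zeros when all scores are equal."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))  # stand-in for the scaled feature matrix
X[0] = [8.0, 8.0]              # one obvious outlier

iso = IsolationForest(n_estimators=100, random_state=42).fit(X)
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)

# scikit-learn's decision_function is higher for inliers, so negate it
# to follow the "higher score means more suspicious" rule.
if_scores = normalize_scores(-iso.decision_function(X))
lof_scores = normalize_scores(-lof.decision_function(X))

print(if_scores.argmax(), lof_scores.argmax())
```

Without `novelty=True`, `LocalOutlierFactor` does not expose `decision_function`, which is why both models can only share one scoring call in novelty mode. Note that scikit-learn documents novelty-mode scoring as intended for unseen data, so scores computed on the training rows themselves should be read as approximate.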
What you'll build
By the end of this step you will have:
- A fitted Isolation Forest model and a fitted LOF model with `novelty=True`
- Normalized anomaly score arrays for the full dataset
- A function to list the top-N most anomalous timestamps per model
- Overlap statistics when each model uses its own 95th-percentile threshold
- Four plots in `output/`: score timelines, Isolation Forest anomalies on temperature, LOF anomalies, and model overlap

`get_feature_matrix`, all plots, and `run_step2` are pre-written. You need Step 1's engineered DataFrame with `scaled_*` columns before you start.
Learning outcomes
After this step you should be able to:
- Train and score with `IsolationForest` and `LocalOutlierFactor` in scikit-learn
- Explain why LOF needs `novelty=True` for consistent scoring with Isolation Forest
- Normalize scores with min-max scaling and handle the flat-score case
- Compare model agreement without biasing overlap toward one model's threshold
- Interpret why Isolation Forest and LOF highlight different regions of the same signal
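To illustrate why overlap is computed with each model's own threshold, the sketch below uses two synthetic score arrays that rank every row identically but have very different shapes. The function names are hypothetical and the arrays stand in for normalized Isolation Forest and LOF scores.

```python
import numpy as np

def flag_top_percent(scores, percentile=95):
    """Flag scores strictly above the array's own percentile cutoff."""
    threshold = float(np.percentile(scores, percentile))
    return scores > threshold, threshold

def overlap_stats(scores_a, scores_b, percentile=95):
    """Count rows flagged by each model and by both, using separate cutoffs."""
    flags_a, _ = flag_top_percent(scores_a, percentile)
    flags_b, _ = flag_top_percent(scores_b, percentile)
    both = int(np.sum(flags_a & flags_b))
    either = int(np.sum(flags_a | flags_b))
    return {
        "a": int(flags_a.sum()),
        "b": int(flags_b.sum()),
        "both": both,
        "jaccard": both / either if either else 0.0,
    }

rng = np.random.default_rng(1)
scores_a = rng.uniform(size=1000)
scores_b = scores_a ** 4  # same ranking, much more skewed distribution

stats = overlap_stats(scores_a, scores_b)

# A single shared cutoff (model A's) would under-flag model B.
_, threshold_a = flag_top_percent(scores_a)
shared_b = int(np.sum(scores_b > threshold_a))
print(stats, shared_b)
```

With separate thresholds both arrays flag the same 5% of rows. With one shared cutoff, the skewed array flags far fewer rows even though its ranking is identical, which would bias the overlap toward one model.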
Your tasks
Open `src/step2_model_training.py` and fill in every `# TODO` zone.
The feature matrix builder, all plots, and the orchestration are pre-written.
Run your experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root.
  - This ensures that all imports work correctly.
- Run the experiment script:
  `python3 src/step2_model_training.py`

Expected output
- Feature matrix shape: `(22695, 10)`
- Score ranges: `[0.000, 1.000]` for both models after normalization
- Top 5 anomaly timestamps printed for each model
- Overlap printed with separate IF and LOF 95th-percentile thresholds, with about 5% flagged by each model
- Four plots saved to `output/`:
  - `anomaly_scores.png`
  - `anomalies_isolation_forest.png`
  - `anomalies_lof.png`
  - `score_overlap.png`
Something to think about
When you see the results, you'll notice that Isolation Forest and LOF flag different patterns.
- Isolation Forest tends to catch sustained cluster-level shifts, such as the machine running collectively hot.
- LOF tends to catch sudden local density changes, such as a reading in a very unusual neighborhood.
Neither model is automatically more correct; they see the data differently. In production, running both and comparing their agreement is a common strategy.
Step 3 — Thresholding, evaluation, and operational impact
Complete the Introduction, Step 1, and Step 2 before starting this step.
The situation
Step 2 produced a suspicion score for every timestamp. Step 3 turns those scores into operational decisions: when should the system alert?
You have already seen the trade-off: false alarms vs. missed failures, thresholds as a business choice, and NAB labels used only for evaluation. Here, you implement that workflow in code.
You will:
- Convert score distributions into binary alerts using percentile thresholds.
- Build a ground-truth mask from four known NAB failure times using ±60-minute windows.
- Measure precision, recall, TP, FP, and FN using point-level metrics rather than event-level NAB scoring.
- Quantify alert rate per hour and review the pre-written operational analysis for burstiness, worst false-positive cluster, and hardest window.
Scores without thresholds are just rankings. Thresholds without evaluation are just guesswork. This step connects them.
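A minimal sketch of the first two tasks, percentile thresholding and the windowed ground-truth mask, is shown below. The function names match the lab's, but the signatures are assumptions, and the four failure times are placeholders on a synthetic timeline rather than the real NAB labels.

```python
import numpy as np
import pandas as pd

def apply_threshold(scores, percentile=95):
    """Turn the top (100 - percentile)% of scores into alerts."""
    return scores > np.percentile(scores, percentile)

def build_ground_truth_mask(timestamps, anomaly_times, half_window="60min"):
    """True for every reading within +/- half_window of a labeled time."""
    timestamps = pd.Series(pd.to_datetime(timestamps)).reset_index(drop=True)
    delta = pd.Timedelta(half_window)
    mask = np.zeros(len(timestamps), dtype=bool)
    for t in pd.to_datetime(list(anomaly_times)):
        inside = (timestamps >= t - delta) & (timestamps <= t + delta)
        mask |= inside.to_numpy()
    return mask

ts = pd.date_range("2024-01-01", periods=2000, freq="5min")
placeholder_failures = [ts[200], ts[700], ts[1200], ts[1700]]  # not NAB labels

truth = build_ground_truth_mask(ts, placeholder_failures)
scores = np.random.default_rng(3).random(2000)  # stand-in anomaly scores
alerts = apply_threshold(scores, percentile=95)

print(int(truth.sum()), int(alerts.sum()))
```

At 5-minute sampling a ±60-minute window covers 25 readings, so four non-overlapping windows give 100 positive points, which is consistent with the "about 100 points" figure in the expected output.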
What you'll build
By the end of this step you will have:
- `apply_threshold`: top X% of scores become alerts
- `build_ground_truth_mask`: boolean mask from NAB timestamps
- `compute_metrics` and `compare_thresholds`: precision and recall table across percentiles
- `compute_alert_rate`: alerts per hour for Isolation Forest and LOF
- Plots in `output/`: score histograms, alert timeline, and precision/recall vs. percentile
- Console output from `[9] Operational analysis`: burstiness, false-positive clusters, and missed-window recall

Label loading, confusion-matrix plots, LOF neighbor comparison, and `run_step3` orchestration are pre-written.
Learning outcomes
After this step you should be able to:
- Convert continuous anomaly scores into binary alerts with percentile thresholds
- Build evaluation masks from weak, windowed labels rather than point-perfect ground truth
- Compute and interpret TP, FP, FN, precision, and recall in a monitoring context
- Explain why point-level metrics differ from event-level NAB scoring
- Relate alert rate and burstiness to operator fatigue in production
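Point-level metrics and alert rate can be computed from two boolean arrays, as in the sketch below. The return format and the `sampling_minutes` parameter are assumptions; the lab's `compute_metrics` and `compute_alert_rate` may be shaped differently.

```python
import numpy as np

def compute_metrics(alerts, truth):
    """Point-level TP/FP/FN, precision, and recall from boolean arrays."""
    alerts = np.asarray(alerts, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(alerts & truth))
    fp = int(np.sum(alerts & ~truth))
    fn = int(np.sum(~alerts & truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall}

def compute_alert_rate(alerts, sampling_minutes=5):
    """Alerts per hour of monitored time."""
    hours = len(alerts) * sampling_minutes / 60
    return float(np.sum(alerts)) / hours if hours else 0.0

# One hour of 5-minute readings: a 3-point failure window and 3 alerts.
truth = np.array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
alerts = np.array([0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool)

metrics = compute_metrics(alerts, truth)
rate = compute_alert_rate(alerts)
print(metrics, rate)
```

Here two of the three alerts fall inside the window and one window point is missed, so precision and recall are both 2/3 and the alert rate is 3 per hour. An event-level score would instead count the whole window as caught once any alert lands inside it.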
Your tasks
Open `src/step3_thresholding_evaluation.py` and fill in every `# TODO` zone.
The score distribution histogram, alert timeline plot, and precision/recall curve are pre-written.
Run your experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root.
  - This ensures that all imports work correctly.
- Run the experiment script using the following command:
  `python3 src/step3_thresholding_evaluation.py`

Expected output
- Four anomaly timestamps loaded
- About 100 points inside ground-truth windows (0.44% of data)
- Metrics table printed for percentiles `[99, 97, 95, 90, 85]`
- Alert rate per model in alerts per hour
- Two plots saved to `output/`:
  - `alerts_over_time.png`: alerts overlaid on the temperature signal with gold anomaly windows
  - `precision_recall_vs_threshold.png`
- An operational analysis block with:
  - alert burstiness per model
  - worst false-positive cluster per model
  - hardest-to-catch anomaly window per model
  - an operational takeaway about the false-negative/false-positive trade-off
Something to think about
After running the comparison, you'll see a table like this:
| Model | Percentile | FlagRate_% | Precision | Recall |
| ----- | ---------- | ---------- | --------- | ------ |
| IF | 99 | 1.0% | ... | ... |
| IF | 95 | 5.0% | ... | ... |
| IF | 85 | 15.0% | ... | ... |
Questions to consider:
- Which threshold would you deploy? Is it better to catch every failure with high recall and more false alarms, or alert only when the model is very confident, with higher precision and some missed events?
- Which failure mode is worse in this context: a machine that breaks silently, or a maintenance team that stops trusting the alert system?
- Looking at the LOF vs. IF columns, which model is more useful here, and why?
Step 4 — Drift simulation and model maintenance
Complete the Introduction and Steps 1–3 before starting this step.
The situation
A monitoring system that only works on day one is not a monitoring system. Machines change, sensors drift, and models trained on the past slowly judge the present against a world that no longer exists.
The Introduction described drift and retraining. Step 4 makes it tangible: you inject synthetic drift into the temperature signal, rescore with old models, then retrain on a recent window — while keeping the same scaler from Step 1 so the drift is not hidden by refitting normalization.
You will also see a production lesson printed at the end: retraining the model does not automatically fix the operating threshold — especially for LOF.
Your focus in this step:
- Inject linear and noise drift (always return a copy of the DataFrame).
- Rescore drifted data with frozen scaler + original models.
- Measure how anomaly rates change when you hold the pre-drift threshold fixed.
- Retrain Isolation Forest and LOF on the last 5,000 points and compare before / after drift / after retrain.
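The two drift injections in the list above might be sketched like this. The function names match the lab's, but the parameters (`start_idx`, `total_shift`, `noise_std`, `seed`) and the `value` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def inject_linear_drift(df, start_idx, total_shift):
    """Ramp the signal from start_idx to the end; the input is not modified."""
    out = df.copy()
    values = out["value"].to_numpy(dtype=float).copy()
    values[start_idx:] += np.linspace(0.0, total_shift, len(values) - start_idx)
    out["value"] = values
    return out

def inject_noise_drift(df, start_idx, noise_std, seed=0):
    """Add Gaussian noise from start_idx onward; the input is not modified."""
    out = df.copy()
    rng = np.random.default_rng(seed)
    values = out["value"].to_numpy(dtype=float).copy()
    values[start_idx:] += rng.normal(0.0, noise_std, size=len(values) - start_idx)
    out["value"] = values
    return out

df = pd.DataFrame({"value": np.full(1000, 80.0)})  # flat synthetic signal
ramped = inject_linear_drift(df, start_idx=500, total_shift=10.0)
noisy = inject_noise_drift(df, start_idx=500, noise_std=2.0)

print(ramped["value"].iloc[-1], round(noisy["value"].iloc[500:].std(), 2))
```

Returning a copy matters because the before/after comparison later needs the original signal untouched.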
What you'll build
By the end of this step you will have:
- `inject_linear_drift` and `inject_noise_drift`: synthetic distribution shift
- `recompute_scores`: original models on drifted features (scaler from Step 1)
- `measure_anomaly_rate`: fraction flagged, with optional fixed threshold
- `retrain_on_window`: sliding-window refit; scores on full drifted series
- Three plots in `output/`: drift comparison, IF score-shift histogram, LOF score-shift histogram
- A printed before drift, after drift, and after retrain table plus the threshold-drift commentary for LOF

`recompute_features`, `compare_rates`, plots, and `run_step4` are pre-written.
Learning outcomes
After this step you should be able to:
- Simulate covariate drift in a time series without corrupting the original DataFrame
- Explain why the Step 1 scaler must stay frozen when features are recomputed on drifted data
- Compare model behavior before drift, after drift (no retrain), and after sliding-window retrain
- Articulate why alert thresholds must be recalibrated after retraining, not reused blindly
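The before / after drift / after retrain comparison can be sketched with Isolation Forest on a single synthetic feature. Everything here is illustrative: the data is a stand-in for features already scaled with the frozen Step 1 scaler, the shift size and window length are arbitrary, and the helper signatures are assumptions rather than the lab's.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def anomaly_scores(model, X):
    """Higher means more suspicious (decision_function is negated)."""
    return -model.decision_function(X)

def measure_anomaly_rate(scores, threshold=None, percentile=95):
    """Fraction of scores above a fixed threshold, or above their own percentile."""
    if threshold is None:
        threshold = float(np.percentile(scores, percentile))
    return float(np.mean(scores > threshold)), threshold

def retrain_on_window(X, window=5000, random_state=42):
    """Refit on the most recent rows only."""
    return IsolationForest(random_state=random_state).fit(X[-window:])

rng = np.random.default_rng(7)
X_before = rng.normal(size=(2000, 1))  # stand-in for scaled pre-drift features
X_after = X_before + 1.5               # same frozen scaler, shifted signal

original = IsolationForest(random_state=42).fit(X_before)
rate_before, frozen_thr = measure_anomaly_rate(anomaly_scores(original, X_before))
rate_after, _ = measure_anomaly_rate(anomaly_scores(original, X_after), frozen_thr)

retrained = retrain_on_window(X_after, window=1000)
rate_retrained, _ = measure_anomaly_rate(anomaly_scores(retrained, X_after), frozen_thr)

print(f"{rate_before:.2%}  {rate_after:.2%}  {rate_retrained:.2%}")
```

On this single raw feature even Isolation Forest reacts clearly to the shift; with the lab's engineered features, several of which are relative to a rolling baseline, the size of the reaction may differ.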
Your tasks
Open `src/step4_drift_simulation.py` and fill in every `# TODO` zone.
`recompute_features`, `compare_rates`, all plots, and the orchestration are pre-written.
Run your experiment
Once you have completed all TODOs in this step:
- Make sure your current directory is the workspace root.
  - This ensures that all imports work correctly.
- Run the experiment script using the following command:
  `python3 src/step4_drift_simulation.py`

Expected output
Final comparison table:
| Model | Before drift | After drift | After retrain |
|-------|--------------|-------------|---------------|
| Isolation Forest | 5.00% | X.XX% | Y.YY% |
| LOF | 5.00% | X.XX% | Y.YY% |

3 plots saved to `output/`:
- `drift_comparison.png`: original vs drifted signal side by side
- `score_shift_isolation_forest.png`: Isolation Forest score distribution before drift, after drift, and after retraining
- `score_shift_lof.png`: LOF score distribution before drift, after drift, and after retraining
Conclusion
Something to think about
You will notice that Isolation Forest and LOF react differently to drift:
- LOF is highly sensitive — after drift, its anomaly rate shoots up dramatically when using a fixed threshold. It sees the shifted distribution as deeply unusual compared to what it learned.
- Isolation Forest is more robust. The rate changes less, because global isolation is less affected by a uniform shift.
Neither reaction is wrong. They reflect the fundamentally different ways these algorithms define "abnormal."
Threshold drift
After retraining, you'll see something striking in the final table: Isolation Forest's "after retrain" rate snaps back to roughly its pre-drift level (~5%), but LOF's does not — it can sit well above the original alert rate even though the model itself has clearly recovered.
The orchestration print block explains why: the pre-drift threshold (`p95` of the original LOF scores) is a snapshot of the original score distribution's shape. After retraining, LOF's score distribution is reshaped: what used to be the top 5% is no longer the top 5% under the new scoring. The model is fine; the threshold isn't.
This is a real production pattern: retraining the model is only half the maintenance job. The operating threshold (and any downstream policy that depends on it, such as alert routing, paging rules, and dashboards) drifts with the model and has to be recalibrated on each retrain. Isolation Forest happens to be less affected because random-isolation scores are more scale-stable than density-based scores, but the same principle applies to both.
Treat thresholds as part of the model contract, not a separate constant.
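As a small illustration of recalibrating the threshold on each retrain, the sketch below uses two synthetic score arrays as stand-ins for a model's scores before and after retraining. The distributions and the helper name `recalibrate_threshold` are hypothetical; they are not output from the lab's LOF model.

```python
import numpy as np

def recalibrate_threshold(scores, target_rate=0.05):
    """Pick the cutoff that flags roughly target_rate of the given scores."""
    return float(np.quantile(scores, 1.0 - target_rate))

rng = np.random.default_rng(11)
old_scores = rng.normal(0.0, 1.0, 5000)  # stand-in: scores before retraining
new_scores = rng.normal(0.5, 2.0, 5000)  # stand-in: reshaped scores after retraining

stale_threshold = recalibrate_threshold(old_scores)
fresh_threshold = recalibrate_threshold(new_scores)

stale_rate = float(np.mean(new_scores > stale_threshold))
fresh_rate = float(np.mean(new_scores > fresh_threshold))
print(f"stale threshold: {stale_rate:.1%} flagged, recalibrated: {fresh_rate:.1%} flagged")
```

Reusing the stale cutoff flags several times the intended share of rows, while the recalibrated cutoff returns to the target rate. Storing the threshold alongside the model artifact is one way to keep the two in sync.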
Question to consider: After retraining on recent data, are you confident the recent data represents a new normal, or could it be a machine that is still quietly failing?
This is the challenge of unsupervised monitoring in the real world. You have now built the full pipeline to tackle it.
---
### You made it 🎉
You built more than anomaly detection models.
You built a monitoring system for a living machine — one that:
- Learns what normal looks like from raw sensor data
- Detects three distinct types of anomalies
- Translates detection into calibrated operational alerts
- Handles the reality that machines (and their data) change over time
The same pipeline structure you used here — feature engineering, unsupervised scoring, threshold evaluation, drift monitoring, sliding-window retraining — is what production industrial ML systems are built on.