
Lab: Anomaly Detection Pipeline

Modern industrial systems continuously generate machine temperature data, but detecting the difference between harmless noise and the early signs of failure is far from simple. In this Hands-on Code Lab, you will build a complete anomaly detection pipeline using real-world industrial machine temperature telemetry from the Numenta Anomaly Benchmark (NAB). Starting from raw time-series sensor data, you will learn how to identify abnormal machine behavior, engineer meaningful features, and train unsupervised anomaly detection models including Isolation Forest and Local Outlier Factor (LOF). Along the way, you will compare anomaly scoring strategies, calibrate alert thresholds, simulate sensor drift, and retrain models to maintain detection performance as machine behavior evolves over time. By the end of the lab, you will have built a production-style monitoring workflow capable of detecting subtle temperature anomalies, identifying early warning signals of system failure, and transforming raw telemetry into actionable operational insights.

Lab Info
Level: Intermediate
Last updated: Jun 12, 2026
Duration: 1h 30m

Table of Contents
  1. Challenge

    Introduction


    Welcome to the machine temperature Anomaly Detection Pipeline lab.

    You've just joined an industrial AI team as an ML Engineer responsible for monitoring the health of a critical machine in a production plant. Your mission: build an anomaly detection system that can catch early warning signs of failure before things go wrong.

    The system you will work on is based on real industrial machine temperature telemetry from the Numenta Anomaly Benchmark (NAB). It contains timestamped temperature readings collected under realistic operating conditions: normal behavior, subtle degradation patterns, and known failure events defined as anomaly windows.

    Your job is not just to detect "weird points in data." It's to understand the heartbeat of a machine, learn what "normal" looks like in context, and notice when that heartbeat starts to drift.

    And yes… missing an anomaly here is not just a bad metric. It could mean downtime, maintenance chaos, or an angry plant manager asking why everything overheated at 3 AM.

    No pressure 😅.


    What you'll build

    This lab takes you from raw sensor telemetry to a production-style anomaly detection pipeline.

    You will progressively move from intuition (“this looks strange”) to structured machine monitoring:

    • engineering features that expose hidden behavior,
    • training unsupervised anomaly detection models,
    • converting anomaly scores into operational alerts,
    • evaluating false alarms and missed failures,
    • and maintaining the system as machine behavior evolves over time.

    Unlike supervised ML, anomaly detection often has limited labels, multiple definitions of “abnormal,” shifting operating conditions, and no universal alert threshold.

    Why anomaly detection is different from supervised learning

    No labeled training set

    • The model must learn “normal” behavior without being explicitly shown failures.

    No single definition of anomaly

    • A sudden spike, an unusual reading for a specific hour, and a slow collective drift are all different types of abnormal behavior.

    No universal threshold

    • The same score distribution can produce either a calm monitoring system or constant alert fatigue depending on where the alert threshold is placed.

    No static environment

    • Operating conditions change, sensors drift, and models gradually become outdated.

    Anomaly detection is not just a model. It is a pipeline for framing data, scoring behavior with multiple models, calibrating alerts, evaluating operational impact and maintaining stability over time.


    The four stages of the lab

    | Step | Theme | What you build |
    |------|-------|----------------|
    | 1 | Data framing and feature engineering | Transform raw temperature signals into engineered features and identify point, contextual, and collective anomalies |
    | 2 | The detection duel | Train Isolation Forest and Local Outlier Factor (LOF) and compare how each model interprets abnormal behavior |
    | 3 | Thresholding and evaluation | Convert scores into alerts, evaluate precision/recall using NAB labels, and analyze operational trade-offs |
    | 4 | Drift simulation and maintenance | Simulate sensor drift, observe performance degradation, retrain models, and restore monitoring stability |


    Core elements

    In this lab, you will work with:

    • Raw signal: timestamped machine temperature readings sampled every 5 minutes
    • Feature engineering: cyclical time features, rolling statistics, deviation, rolling z-score, and slope
    • Unsupervised models: Isolation Forest for global isolation patterns and Local Outlier Factor (LOF) for local density patterns
    • Anomaly scores and thresholds: continuous scores converted into operational alerts
    • Weak ground truth: NAB anomaly windows used only for evaluation; models train on timestamps and temperature values only
    • Drift and retraining: synthetic drift and sliding-window retraining to maintain model performance
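The cyclical time features listed above can be sketched as sine/cosine encodings. This is a minimal illustration, not the lab's solution: the column names (`timestamp`, `hour_sin`, `hour_cos`, `dow_sin`, `dow_cos`) and the helper `add_cyclical_time_features` are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def add_cyclical_time_features(df: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
    """Return a copy with sin/cos encodings of hour-of-day and day-of-week.

    Column names are hypothetical; the lab's own feature names may differ.
    """
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])

    # Fractional hour so that 23:55 and 00:00 end up close together on the circle.
    hour = ts.dt.hour + ts.dt.minute / 60.0
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24.0)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24.0)

    # Day of week: Monday = 0 ... Sunday = 6.
    dow = ts.dt.dayofweek
    out["dow_sin"] = np.sin(2 * np.pi * dow / 7.0)
    out["dow_cos"] = np.cos(2 * np.pi * dow / 7.0)
    return out


# One synthetic day of 5-minute readings (288 timestamps).
demo = pd.DataFrame({"timestamp": pd.date_range("2024-01-01", periods=288, freq="5min")})
demo = add_cyclical_time_features(demo)
print(demo.head())
```

Encoding the hour as a point on a circle avoids the artificial jump between 23 and 0 that a raw integer hour would introduce.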

    Three types of anomalies you will meet

    Industrial systems rarely fail in only one way.

    This lab introduces three important categories of anomalies:

    | Type | Intuition | Example in this dataset |
    |------|-----------|-------------------------|
    | Point | A single reading far from the global norm | Sudden extreme temperature spike/drop |
    | Contextual | A value that is unusual in its specific context | A “normal-looking” value occurring at an abnormal time |
    | Collective | A sustained abnormal pattern over time | Gradual drift or prolonged deviation from baseline |

    Different models react differently to these anomaly shapes.

    For example:

    • Isolation Forest may flag an entire abnormal descent corridor.
    • LOF may only flag the single most isolated point inside that corridor.

    Understanding why models disagree is one of the most important lessons in this lab.


    Operational reality: false alarms vs. missed failures

    Once you have scores, the hard question is not "can we detect anomalies?" but "when do we alert?"

    • False positives (FP) — The system raises alarms when nothing is wrong. Operators lose trust. Alert fatigue appears.
    • False negatives (FN) — Real failures occur silently without any alert. The machine may degrade or fail before intervention happens.

    In industrial monitoring, missed failures are often more expensive than false alarms, but a system that alerts every hour quickly becomes unusable.

    This lab therefore focuses not only on detection quality, but also on:

    • alert frequency,
    • alert stability,
    • noisy detection behavior,
    • and operational credibility.

    Learning objectives

    By the end of this lab, you should be able to:

    • Prepare industrial time-series data for anomaly detection
    • Engineer causal features that describe machine behavior over time
    • Train and compare Isolation Forest and LOF models
    • Convert anomaly scores into operational alerts
    • Evaluate alert quality using precision, recall, and alert rate
    • Simulate drift and retrain models using recent behavioral windows
    • Explain why thresholds and retraining strategies must evolve together in production

    How the lab is structured

    You will work primarily inside the src/ directory.

    Each step contains guided TODO sections that you complete progressively.

    Plots, orchestration logic, and narrative outputs are already provided to help you focus on the core anomaly detection concepts.

    | File | Your focus |
    |------|------------|
    | src/step1_feature_engineering.py | Data framing, anomaly exploration, feature engineering |
    | src/step2_model_training.py | Isolation Forest, LOF, anomaly scoring |
    | src/step3_thresholding_evaluation.py | Thresholding, metrics, operational evaluation |
    | src/step4_drift_simulation.py | Drift simulation, retraining, maintenance |

    Generated plots are automatically saved inside the output/ folder.


    You are now ready to transform raw telemetry into meaningful signals, train unsupervised anomaly detectors, and build a monitoring pipeline capable of detecting abnormal machine behavior before failure happens.

    Let’s begin!

    info> If you get stuck at some point in the lab, solution files have been provided for you within the solution folder in your file tree.

  2. Challenge

    Data framing and feature engineering

    Step 1 — Data framing and feature engineering

    Read the Introduction first. This page is your working guide for Step 1 only.


    The situation

    Step 1 is where the pipeline starts: one CSV of temperature readings becomes a structured view of how the machine behaves over time.

    You already know the big picture: there are no labels during training, anomalies can appear in several forms, and the models in Step 2 need meaningful features. In this step, you make that concrete.

    Your focus in this step:

    1. Frame the data: load, sort, and visualize the signal so you can see its daily rhythm and baseline.
    2. Identify three anomaly patterns in the data: point, contextual, and collective (definitions are in the Introduction; you implement the logic).
    3. Engineer features: time cycles, rolling baselines, deviation, rolling z-score, and slope, in a causal way (only past and present, never the future).
    4. Scale and keep the scaler: standardize features for modeling, and return the fitted StandardScaler for Step 4.

    Models cannot detect what they cannot see. Raw temperature alone hides important patterns; this step builds the lens 🔎 they will look through.
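As a rough sketch of what "causal" means for the rolling features, the example below uses trailing pandas windows with `min_periods=1`, so each row only sees itself and earlier rows. The helper name `add_rolling_features`, the column names, the one-hour window of 12 samples, and the use of a first difference as the slope are all assumptions for illustration; the lab's own definitions may differ.

```python
import numpy as np
import pandas as pd


def add_rolling_features(df: pd.DataFrame, value_col: str = "value", window: int = 12) -> pd.DataFrame:
    """Add trailing-window features; every row uses only current and past values."""
    out = df.copy()
    roll = out[value_col].rolling(window=window, min_periods=1)

    out["rolling_mean"] = roll.mean()
    # Sample std is undefined for a single observation, so the first row is set to 0.
    out["rolling_std"] = roll.std().fillna(0.0)
    out["deviation"] = out[value_col] - out["rolling_mean"]

    # Guard against division by zero on flat stretches of the signal.
    safe_std = out["rolling_std"].replace(0.0, np.nan)
    out["rolling_z"] = (out["deviation"] / safe_std).fillna(0.0)

    # Slope approximated here by the first difference (an assumption).
    out["slope"] = out[value_col].diff().fillna(0.0)
    return out


rng = np.random.default_rng(0)
demo = pd.DataFrame({"value": 70.0 + rng.normal(0.0, 1.0, size=100)})
feats = add_rolling_features(demo)
print(feats.tail())
```

A quick leakage check is to alter future rows and confirm that features on earlier rows do not change; a centered window or a backfill would fail that check.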


    What you'll build

    By the end of this step you will have:

    • A clean, chronologically sorted DataFrame of 22,695 temperature readings (5-minute sampling)
    • Plots of the full signal and zoomed regions (saved to output/)
    • Functions that surface one example each of point, contextual, and collective anomaly behavior
    • A feature-enriched table with 10 engineered columns (plus matching scaled_* columns)
    • A fitted StandardScaler returned with the DataFrame for reuse in Step 4

    Running run_step1 (pre-written) ties everything together and prints validation output once your functions are complete.


    Learning outcomes

    After this step you should be able to:

    • Load and prepare industrial time-series data for downstream modeling
    • Implement the three anomaly identification strategies used in the lab
    • Build rolling and time-based features without temporal leakage (min_periods=1, no .bfill() on past rows)
    • Explain why rolling_z and slope add information beyond raw deviation
    • Fit or reuse a scaler correctly and understand why Step 4 depends on the same object
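The fit-or-reuse pattern for the scaler might look like the sketch below. The helper `scale_features` and the synthetic temperatures are hypothetical; only `StandardScaler` itself comes from the lab description.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_features(X, scaler=None):
    """Fit a new StandardScaler when none is given; otherwise reuse the fitted one."""
    if scaler is None:
        scaler = StandardScaler().fit(X)
    return scaler.transform(X), scaler


rng = np.random.default_rng(0)
X_original = rng.normal(70.0, 2.0, size=(500, 1))  # synthetic stand-in for temperature
X_drifted = X_original + 5.0                       # same readings with a +5 degree offset

X_scaled, scaler = scale_features(X_original)            # Step 1 style: fit once
X_frozen, _ = scale_features(X_drifted, scaler=scaler)   # reuse: the offset stays visible
X_refit, _ = scale_features(X_drifted)                   # refit: the offset is normalized away

print(round(float(X_frozen.mean()), 2), round(float(X_refit.mean()), 2))
```

Reusing the fitted scaler keeps the shifted data off-center in scaled space, which is what lets a model notice the change; refitting recenters the data and hides it.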

    Your tasks

    Open src/step1_feature_engineering.py and fill in every # TODO zone.

    Plots are pre-written and save automatically to output/ when you run the file.

    Run your experiment

    Once you have completed all TODOs in this step:

    1. Make sure your current directory is the workspace root.

      • This ensures that all imports work correctly.
    2. Run the experiment script:

    python3 src/step1_feature_engineering.py
    

    Expected output

    • Shape: (22695, 2) with no missing values in the raw data
    • Three anomaly identifications printed with timestamps and values
    • First 10 rows of engineered features with 10 columns
    • (df, scaler) returned from run_step1
    • Five plots saved to output/:
      • raw_temperature.png
      • zoom_point_anomaly_region.png
      • zoom_contextual_anomaly_region.png
      • zoom_collective_anomaly_region.png
      • temperature_vs_rolling.png
  3. Challenge

    The detection duel (isolation vs. density)

    Step 2 — The detection duel: Isolation Forest vs. Local Outlier Factor

    Complete the Introduction and Step 1 before starting this step.


    The situation

    Step 1 gave the models something to look at: a 10-dimensional feature vector at every timestamp. Step 2 asks a harder question: which readings look abnormal in that feature space?

    The lab overview introduced Isolation Forest as a global isolation approach and Local Outlier Factor (LOF) as a local density approach. Here, you train both models on the same matrix, put their scores on a common scale, and compare where they agree and where they do not.

    That disagreement is intentional. Isolation Forest often flags sustained regions of unusual behavior, while LOF may focus on locally isolated points. Neither model is automatically right; understanding the difference is part of the lab.

    Your focus in this step:

    1. Train Isolation Forest and LOF on the scaled features from Step 1.
    2. Score every row with a single rule: higher score means more suspicious. Negate decision_function where needed.
    3. Normalize scores to [0, 1] so the two models are comparable.
    4. Compare overlap using each model's own threshold, not one shared cutoff.
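A minimal sketch of the training and scoring steps above, run on synthetic data rather than the lab's feature matrix. The hyperparameters (`n_estimators=100`, `n_neighbors=20`) are illustrative assumptions. In scikit-learn, `decision_function` returns larger values for more normal points in both estimators, and `LocalOutlierFactor` exposes `decision_function` only when `novelty=True`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = rng.normal(0.0, 1.0, size=(500, 3))  # synthetic stand-in for scaled features
X[:5] += 8.0                             # five obvious outliers for the demo

iso = IsolationForest(n_estimators=100, random_state=42).fit(X)
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)

# Negate so that a higher score means "more suspicious" for both models.
if_raw = -iso.decision_function(X)
lof_raw = -lof.decision_function(X)

print("IF  top-5 indices:", np.argsort(if_raw)[-5:])
print("LOF top-5 indices:", np.argsort(lof_raw)[-5:])
```

The two raw score arrays live on different scales, so they are not directly comparable until they are normalized.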

    What you'll build

    By the end of this step you will have:

    • A fitted Isolation Forest model and a fitted LOF model with novelty=True
    • Normalized anomaly score arrays for the full dataset
    • A function to list the top-N most anomalous timestamps per model
    • Overlap statistics when each model uses its own 95th-percentile threshold
    • Four plots in output/: score timelines, Isolation Forest anomalies on temperature, LOF anomalies, and model overlap

    get_feature_matrix, all plots, and run_step2 are pre-written. You need Step 1's engineered DataFrame with scaled_* columns before you start.


    Learning outcomes

    After this step you should be able to:

    • Train and score with IsolationForest and LocalOutlierFactor in scikit-learn
    • Explain why LOF needs novelty=True for consistent scoring with Isolation Forest
    • Normalize scores with min-max scaling and handle the flat-score case
    • Compare model agreement without biasing overlap toward one model's threshold
    • Interpret why Isolation Forest and LOF highlight different regions of the same signal
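Min-max normalization with a guard for the flat-score case can be sketched as follows. The function name `normalize_scores` and the choice to return zeros when all scores are equal are assumptions; the lab may handle the flat case differently.

```python
import numpy as np


def normalize_scores(scores):
    """Rescale scores to [0, 1]; return all zeros if every score is identical."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # Flat scores carry no ranking information; avoid dividing by zero.
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)


print(normalize_scores([2.0, 4.0, 6.0]))
print(normalize_scores([3.0, 3.0, 3.0]))
```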

    Your tasks

    Open src/step2_model_training.py and fill in every # TODO zone.

    The feature matrix builder, all plots, and the orchestration are pre-written.


    Run your experiment

    Once you have completed all TODOs in this step:

    1. Make sure your current directory is the workspace root.

      • This ensures that all imports work correctly.
    2. Run the experiment script:

    python3 src/step2_model_training.py
    

    Expected output

    • Feature matrix shape: (22695, 10)
    • Score ranges: [0.000, 1.000] for both models after normalization
    • Top 5 anomaly timestamps printed for each model
    • Overlap printed with separate IF and LOF 95th-percentile thresholds, with about 5% flagged by each model
    • Four plots saved to output/:
      • anomaly_scores.png
      • anomalies_isolation_forest.png
      • anomalies_lof.png
      • score_overlap.png

    Something to think about

    When you see the results, you'll notice that Isolation Forest and LOF flag different patterns.

    • Isolation Forest tends to catch sustained cluster-level shifts, such as the machine running collectively hot.
    • LOF tends to catch sudden local density changes, such as a reading in a very unusual neighborhood.

    Neither model is automatically more correct; they see the data differently. In production, running both and comparing their agreement is a common strategy.
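One way to measure agreement without favoring either model is to threshold each score array at its own percentile before intersecting the flags. The sketch below uses a hypothetical `overlap_stats` helper and synthetic scores; two arrays with identical rankings but different shapes show why a shared cutoff would be misleading.

```python
import numpy as np


def overlap_stats(scores_a, scores_b, percentile=95.0):
    """Flag each model at its own percentile threshold, then compare the flags."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    flag_a = a > np.percentile(a, percentile)
    flag_b = b > np.percentile(b, percentile)
    both = int(np.sum(flag_a & flag_b))
    either = int(np.sum(flag_a | flag_b))
    return {
        "n_a": int(flag_a.sum()),
        "n_b": int(flag_b.sum()),
        "both": both,
        "jaccard": both / either if either else 0.0,
    }


# Same ranking, different score shapes (synthetic).
a = np.linspace(0.0, 1.0, 1000)
b = a ** 4

stats = overlap_stats(a, b)
shared_cutoff_flags_b = int(np.sum(b > np.percentile(a, 95.0)))
print(stats, shared_cutoff_flags_b)
```

With per-model thresholds the two arrays flag the same rows; applying the first model's cutoff to the second flags far fewer, even though the rankings agree completely.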

  4. Challenge

    Thresholding, evaluation, and operational impact

    Step 3 — Thresholding, evaluation, and operational impact

    Complete the Introduction, Step 1, and Step 2 before starting this step.


    The situation

    Step 2 produced a suspicion score for every timestamp. Step 3 turns those scores into operational decisions: when should the system alert?

    You have already seen the trade-off: false alarms vs. missed failures, thresholds as a business choice, and NAB labels used only for evaluation. Here, you implement that workflow in code.

    You will:

    1. Convert score distributions into binary alerts using percentile thresholds.

    2. Build a ground-truth mask from four known NAB failure times using ±60-minute windows.

    3. Measure precision, recall, TP, FP, and FN using point-level metrics rather than event-level NAB scoring.

    4. Quantify alert rate per hour and review the pre-written operational analysis for burstiness, worst false-positive cluster, and hardest window.

    Scores without thresholds are just rankings. Thresholds without evaluation are just guesswork. This step connects them.
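The thresholding and ground-truth steps can be sketched as below. The function names match those listed for this step, but the signatures are assumptions, and the four failure times are placeholders on a synthetic 5-minute grid, not the real NAB labels.

```python
import numpy as np
import pandas as pd


def apply_threshold(scores, percentile=95.0):
    """Flag scores strictly above the given percentile; return (alerts, threshold)."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.percentile(scores, percentile)
    return scores > threshold, threshold


def build_ground_truth_mask(timestamps, failure_times, window_minutes=60):
    """Mark every timestamp within +/- window_minutes of a failure time (inclusive)."""
    ts = pd.Series(pd.to_datetime(timestamps)).reset_index(drop=True)
    half = pd.Timedelta(minutes=window_minutes)
    mask = np.zeros(len(ts), dtype=bool)
    for t in pd.to_datetime(list(failure_times)):
        mask |= ((ts >= t - half) & (ts <= t + half)).to_numpy()
    return mask


# Synthetic grid with the same length and sampling rate as the lab data.
timestamps = pd.date_range("2024-01-01", periods=22695, freq="5min")
# Placeholder failure times (hypothetical, chosen only for this demo).
failure_times = ["2024-01-10 12:00", "2024-01-25 03:30", "2024-02-14 18:00", "2024-03-05 09:15"]

mask = build_ground_truth_mask(timestamps, failure_times)
alerts, thr = apply_threshold(np.arange(100), percentile=95.0)
print(int(mask.sum()), round(100 * float(mask.mean()), 2), int(alerts.sum()))
```

With 5-minute sampling, a ±60-minute inclusive window covers 25 readings, so four non-overlapping windows cover 100 of the 22,695 points, or about 0.44% of the data.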


    What you'll build

    By the end of this step you will have:

    • apply_threshold: top X% of scores become alerts

    • build_ground_truth_mask: boolean mask from NAB timestamps

    • compute_metrics and compare_thresholds: precision and recall table across percentiles

    • compute_alert_rate: alerts per hour for Isolation Forest and LOF

    • Plots in output/: score histograms, alert timeline, and precision/recall vs. percentile

    • Console output from [9] Operational analysis: burstiness, false-positive clusters, and missed-window recall

    Label loading, confusion-matrix plots, LOF neighbor comparison, and run_step3 orchestration are pre-written.
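Alert rate per hour is a simple ratio, sketched here under the assumption that it is defined as total alerts divided by the time span between the first and last timestamp. The signature of `compute_alert_rate` is hypothetical.

```python
import numpy as np
import pandas as pd


def compute_alert_rate(alerts, timestamps):
    """Return alerts per hour over the span from the first to the last timestamp."""
    alerts = np.asarray(alerts, dtype=bool)
    ts = pd.Series(pd.to_datetime(timestamps)).reset_index(drop=True)
    hours = (ts.iloc[-1] - ts.iloc[0]).total_seconds() / 3600.0
    if hours <= 0:
        return 0.0
    return float(alerts.sum()) / hours


# 289 readings at 5-minute spacing span exactly 24 hours.
timestamps = pd.date_range("2024-01-01", periods=289, freq="5min")
alerts = np.zeros(289, dtype=bool)
alerts[::25] = True  # 12 synthetic alerts

rate = compute_alert_rate(alerts, timestamps)
print(rate)
```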


    Learning outcomes

    After this step you should be able to:

    • Convert continuous anomaly scores into binary alerts with percentile thresholds

    • Build evaluation masks from weak, windowed labels rather than point-perfect ground truth

    • Compute and interpret TP, FP, FN, precision, and recall in a monitoring context

    • Explain why point-level metrics differ from event-level NAB scoring

    • Relate alert rate and burstiness to operator fatigue in production
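Point-level metrics compare the alert mask and the ground-truth mask row by row. The sketch below is one possible `compute_metrics`; the return format and the zero-division convention are assumptions.

```python
import numpy as np


def compute_metrics(alerts, truth):
    """Point-level TP/FP/FN with precision and recall (0.0 when undefined)."""
    alerts = np.asarray(alerts, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(alerts & truth))
    fp = int(np.sum(alerts & ~truth))
    fn = int(np.sum(~alerts & truth))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall}


# One three-point anomaly window; the detector hits two of its points plus two false alarms.
truth = np.array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0], dtype=bool)
alerts = np.array([0, 1, 1, 1, 0, 0, 0, 0, 0, 1], dtype=bool)

metrics = compute_metrics(alerts, truth)
print(metrics)
```

In this toy case point-level recall is 2/3, while an event-level view would count the window as detected because at least one alert fell inside it.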


    Your tasks

    Open src/step3_thresholding_evaluation.py and fill in every # TODO zone.

    The score distribution histogram, alert timeline plot, and precision/recall curve are pre-written.


    Run your experiment

    Once you have completed all TODOs in this step:

    1. Make sure your current directory is the workspace root.

      • This is important so that all imports work correctly.
    2. Run the experiment script using the following command:

    python3 src/step3_thresholding_evaluation.py
    

    Expected output

    • Four anomaly timestamps loaded
    • About 100 points inside ground-truth windows (0.44% of data)
    • Metrics table printed for percentiles [99, 97, 95, 90, 85]
    • Alert rate per model in alerts per hour
    • Two plots saved to output/:
      • alerts_over_time.png: alerts overlaid on the temperature signal with gold anomaly windows
      • precision_recall_vs_threshold.png
    • An operational analysis block with:
      • alert burstiness per model
      • worst false-positive cluster per model
      • hardest-to-catch anomaly window per model
      • an operational takeaway about the false-negative/false-positive trade-off

    Something to think about

    After running the comparison, you'll see a table like this:

    | Model | Percentile | FlagRate_% | Precision | Recall |
    | ----- | ---------- | ---------- | --------- | ------ |
    | IF | 99 | 1.0% | ... | ... |
    | IF | 95 | 5.0% | ... | ... |
    | IF | 85 | 15.0% | ... | ... |

    Questions to consider:

    1. Which threshold would you deploy? Is it better to catch every failure with high recall and more false alarms, or alert only when the model is very confident, with higher precision and some missed events?

    2. Which failure mode is worse in this context: a machine that breaks silently, or a maintenance team that stops trusting the alert system?

    3. Comparing the LOF and IF rows of the table, which model is more useful here, and why?

  5. Challenge

    Drift simulation and model maintenance

    Step 4 — Drift simulation and model maintenance

    Complete the Introduction and Steps 1–3 before starting this step.


    The situation

    A monitoring system that only works on day one is not a monitoring system. Machines change, sensors drift, and models trained on the past slowly judge the present against a world that no longer exists.

    The Introduction described drift and retraining. Step 4 makes it tangible: you inject synthetic drift into the temperature signal, rescore with old models, then retrain on a recent window — while keeping the same scaler from Step 1 so the drift is not hidden by refitting normalization.

    You will also see a production lesson printed at the end: retraining the model does not automatically fix the operating threshold — especially for LOF.

    Your focus in this step:

    1. Inject linear and noise drift (always return a copy of the DataFrame).
    2. Rescore drifted data with frozen scaler + original models.
    3. Measure how anomaly rates change when you hold the pre-drift threshold fixed.
    4. Retrain Isolation Forest and LOF on the last 5,000 points and compare before / after drift / after retrain.
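The two drift injections might be sketched as follows. The function names come from this step, but the parameters (`start`, `total_shift`, `noise_std`, `seed`) and the `value` column name are assumptions. Both functions work on a copy so the original DataFrame is left untouched.

```python
import numpy as np
import pandas as pd


def inject_linear_drift(df, value_col="value", start=0, total_shift=5.0):
    """Return a copy with a linear ramp from 0 to total_shift added from row `start` on."""
    out = df.copy()
    values = out[value_col].to_numpy(dtype=float).copy()
    values[start:] += np.linspace(0.0, total_shift, len(values) - start)
    out[value_col] = values
    return out


def inject_noise_drift(df, value_col="value", start=0, noise_std=1.0, seed=0):
    """Return a copy with Gaussian noise added from row `start` on."""
    out = df.copy()
    rng = np.random.default_rng(seed)
    values = out[value_col].to_numpy(dtype=float).copy()
    values[start:] += rng.normal(0.0, noise_std, len(values) - start)
    out[value_col] = values
    return out


demo = pd.DataFrame({"value": np.full(100, 70.0)})
ramped = inject_linear_drift(demo, start=50, total_shift=5.0)
noisy = inject_noise_drift(demo, start=50, noise_std=1.0)
print(demo["value"].iloc[-1], ramped["value"].iloc[-1], round(float(noisy["value"].iloc[50:].std()), 2))
```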

    What you'll build

    By the end of this step you will have:

    • inject_linear_drift and inject_noise_drift — synthetic distribution shift
    • recompute_scores — original models on drifted features (scaler from Step 1)
    • measure_anomaly_rate — fraction flagged, with optional fixed threshold
    • retrain_on_window — sliding-window refit; scores on full drifted series
    • Three plots in output/: drift comparison, IF score-shift histogram, LOF score-shift histogram
    • A printed before drift, after drift, and after retrain table plus the threshold-drift commentary for LOF

    recompute_features, compare_rates, plots, and run_step4 are pre-written.


    Learning outcomes

    After this step you should be able to:

    • Simulate covariate drift in a time series without corrupting the original DataFrame
    • Explain why the Step 1 scaler must stay frozen when features are recomputed on drifted data
    • Compare model behavior before drift, after drift (no retrain), and after sliding-window retrain
    • Articulate why alert thresholds must be recalibrated after retraining, not reused blindly
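The before drift / after drift / after retrain comparison can be sketched on synthetic data, shown here for Isolation Forest only. The signature of `retrain_on_window`, the one-dimensional feature, and the six-degree shift are assumptions made for the demo, not the lab's setup.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


def retrain_on_window(X_scaled, window=5000, random_state=42):
    """Refit on the most recent rows, then score the full series (higher = more suspicious)."""
    recent = X_scaled[-window:]
    model = IsolationForest(n_estimators=100, random_state=random_state).fit(recent)
    return model, -model.decision_function(X_scaled)


rng = np.random.default_rng(0)
X_before = rng.normal(70.0, 1.0, size=(6000, 1))   # synthetic pre-drift regime
X_drifted = rng.normal(76.0, 1.0, size=(6000, 1))  # synthetic shifted regime

# The scaler is fitted once on pre-drift data and then frozen.
scaler = StandardScaler().fit(X_before)
Z_before = scaler.transform(X_before)
Z_drifted = scaler.transform(X_drifted)

old_model = IsolationForest(n_estimators=100, random_state=42).fit(Z_before)
threshold = np.percentile(-old_model.decision_function(Z_before), 95)

rate_after_drift = float(np.mean(-old_model.decision_function(Z_drifted) > threshold))
_, new_scores = retrain_on_window(Z_drifted, window=5000)
rate_after_retrain = float(np.mean(new_scores > threshold))

print(f"after drift: {rate_after_drift:.2%}, after retrain: {rate_after_retrain:.2%}")
```

In this toy setup the old model flags most of the shifted data, and the retrained model brings the rate back down under the same fixed threshold. A density-based model such as LOF may not recover in the same way without a new threshold.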

    Your tasks

    Open src/step4_drift_simulation.py and fill in every # TODO zone.
    recompute_features, compare_rates, all plots, and the orchestration are pre-written.


    Run your experiment

    Once you have completed all TODOs in this step:

    1. Make sure your current directory is the workspace root.

      • This ensures that all imports work correctly.
    2. Run the experiment script using the following command:

    python3 src/step4_drift_simulation.py
    

    Expected output

    Final comparison table:

                         Before drift   After drift   After retrain
      ---------------------------------------------------------------
      Isolation Forest         5.00%         X.XX%          Y.YY%
      LOF                      5.00%         X.XX%          Y.YY%
    

    Three plots saved to output/:

    • drift_comparison.png: original vs drifted signal side by side
    • score_shift_isolation_forest.png: Isolation Forest score distribution before drift, after drift, and after retraining
    • score_shift_lof.png: LOF score distribution before drift, after drift, and after retraining
  6. Challenge

    Conclusion

    Something to think about

    You will notice that Isolation Forest and LOF react differently to drift:

    • LOF is highly sensitive — after drift, its anomaly rate shoots up dramatically when using a fixed threshold. It sees the shifted distribution as deeply unusual compared to what it learned.
    • Isolation Forest is more robust. The rate changes less, because global isolation is less affected by a uniform shift.

    Neither reaction is wrong. They reflect the fundamentally different ways these algorithms define "abnormal."

    Threshold drift

    After retraining, you'll see something striking in the final table: Isolation Forest's "after retrain" rate snaps back to roughly its pre-drift level (~5%), but LOF's does not — it can sit well above the original alert rate even though the model itself has clearly recovered.

    The orchestration print block explains why: the pre-drift threshold (p95 of the original LOF scores) is a snapshot of the original score distribution's shape. After retraining, LOF's score distribution is reshaped — what used to be the top 5% is no longer the top 5% under the new scoring. The model is fine; the threshold isn't.

    This is a real production pattern: retraining the model is only half the maintenance job. The operating threshold (and any downstream policy that depends on it — alert routing, paging rules, dashboards) drifts with the model and has to be re-calibrated on each retrain. Isolation Forest happens to be less affected because random-isolation scores are more scale-stable than density-based scores, but the same principle applies to both.

    Treat thresholds as part of the model contract, not a separate constant.
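A small numeric sketch of threshold drift: when the score distribution changes shape, a threshold taken from the old distribution no longer flags 5% of the data. The two beta distributions are synthetic stand-ins chosen for illustration, not outputs of the lab's models.

```python
import numpy as np


def flag_rate(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return float(np.mean(np.asarray(scores, dtype=float) > threshold))


rng = np.random.default_rng(0)
scores_before = rng.beta(2.0, 8.0, size=10_000)  # stand-in for pre-drift normalized scores
scores_after = rng.beta(2.0, 4.0, size=10_000)   # stand-in for reshaped post-retrain scores

old_threshold = np.percentile(scores_before, 95)
new_threshold = np.percentile(scores_after, 95)

stale_rate = flag_rate(scores_after, old_threshold)         # old cutoff, new scores
recalibrated_rate = flag_rate(scores_after, new_threshold)  # cutoff recomputed after retrain

print(f"stale: {stale_rate:.2%}, recalibrated: {recalibrated_rate:.2%}")
```

Recomputing the percentile on the post-retrain scores restores the intended 5% flag rate, which is why the threshold is best versioned together with the model.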

    Question to consider: After retraining on recent data, are you confident the recent data represents a new normal, or could it be a machine that is still quietly failing?

    This is the challenge of unsupervised monitoring in the real world. You have now built the full pipeline to tackle it.

    You made it 🎉

    You built more than anomaly detection models.

    You built a monitoring system for a living machine — one that:

    • Learns what normal looks like from raw sensor data
    • Detects three distinct types of anomalies
    • Translates detection into calibrated operational alerts
    • Handles the reality that machines (and their data) change over time

    The same pipeline structure you used here — feature engineering, unsupervised scoring, threshold evaluation, drift monitoring, sliding-window retraining — is what production industrial ML systems are built on.

About the author

Marc is a Senior Data Scientist with a solid foundation in Communication and Computer Engineering and holds a Master's degree in AI and Deep Learning from one of France's leading universities. His career is driven by a deep passion for data science and artificial intelligence, combining technical expertise with innovative thinking to deliver impactful solutions.
