Detecting Anomalies in Azure Machine Learning

Machine learning is particularly good at finding patterns in data. One application of this is training a model to find anomalies, which are data points that don't fit the discovered pattern. This has value across many industries, such as finance, information security, and medicine. Unfortunately, most anomalies are rare, so far fewer examples of them are available than of normal data. In this lab, you will work with credit card transactions that are labeled as either valid or fraudulent, and you will create a model to identify the fraud. You will learn how to prepare data for training an anomaly detection model, as well as how to use one common anomaly detection algorithm.

Path Info

Level: Advanced
Duration: 45m
Date: Sep 24, 2020

Table of Contents

  1. Challenge

    Set up the Workspace

    1. Log in and go to the Azure Machine Learning Studio workspace provided in the lab.
    2. Create a Training Cluster of D2 instances. This lab will require a lot of compute for the pipeline, so set the max instances to 4.
    3. Create a new blank Pipeline in the Azure Machine Learning Studio Designer.
  2. Challenge

    Prepare the Data

    We need to do a few things to make the data suitable for training and evaluating the model. First, the label column has values of 1 for the normal class and 2 for the abnormal class. However, the scoring uses 0 for abnormal and 1 for normal, so we have to change the abnormal value from 2 to 0 to evaluate our results properly. Next, we need to mark that column as the label. Finally, we need to split the data into training and test sets. Unlike most other machine learning algorithms, PCA-based anomaly detection models are trained only on the normal data, so we have to split the training data further to keep only the normal examples.

    1. Start with the German Credit Card UCI dataset. The column named 'Col21' is the label. Inspect the label values.
    2. Use an Apply Math Operation node to change the label value of 2 to 0.
    3. Use an Edit Metadata node to rename the label column to 'Label' and set it as a label column.
    4. Create training and testing datasets using Split Data. The German Credit Card dataset is fairly small, so use 80% of the data for training. Make sure the training and test datasets contain a proportional number of examples of both normal and abnormal data.
    5. Apply Split Data again to the training dataset so that it contains only the normal examples. Remember, normal values are labeled as 1 in the data.
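
    The same preparation can be sketched outside the Designer in plain Python. This is only an illustration of the steps above, not what the Designer nodes run internally: the helper names `prepare_label` and `stratified_split` and the toy rows are hypothetical, and only the column names 'Col21' and 'Label' and the 80% split come from the lab.

```python
import random
from collections import defaultdict


def prepare_label(rows, source="Col21", target="Label"):
    """Rename the label column and map the abnormal value 2 to 0 (1 stays 1)."""
    prepared = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != source}
        new_row[target] = 0 if row[source] == 2 else row[source]
        prepared.append(new_row)
    return prepared


def stratified_split(rows, label_key, train_fraction=0.8, seed=0):
    """Split rows so each label keeps the same proportion in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical toy rows standing in for the real dataset: 70 normal, 30 abnormal.
raw_rows = [{"Col1": i, "Col21": 1} for i in range(70)]
raw_rows += [{"Col1": i, "Col21": 2} for i in range(70, 100)]

rows = prepare_label(raw_rows)
train, test = stratified_split(rows, "Label", train_fraction=0.8)

# Only the normal examples (Label == 1) are passed on to training.
train_normal = [row for row in train if row["Label"] == 1]
```

    The test set keeps both classes, because the model still has to be scored against abnormal examples it never saw during training.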
  3. Challenge

    Train Anomaly Detection Models

    Principal Component Analysis (PCA) reduces our feature space as an unsupervised process to make the model more efficient. This reduction inherently loses information. To compensate for the lost information, we can oversample the data, generating examples that are statistically similar to the rest of our training data to help boost the anomalous information. See this research paper for a much more detailed explanation.

    The oversampling rate is another hyperparameter to tune in your model. I've done this for you for the lab. For this dataset, an oversampling rate of 5 (which means 500% extra data) will produce decent results. We still have to tune the number of features needed to produce a good model.

    1. Add three PCA-Based Anomaly Detection nodes and configure them to use about 1/3, 1/2, and 2/3 of the features, respectively. Set the oversampling rate to 5 on each. Do not use feature normalization for this dataset.
    2. Use a Train Anomaly Detection Model node to train each model. Make sure to only pass in the normal data to this training process.
    3. Use Score Model nodes to predict the testing data.
    4. Use Evaluate Model nodes to see prediction stats.
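
    To build intuition for why a PCA model trained only on normal data can flag anomalies, here is a minimal NumPy sketch. It assumes the anomaly score is the reconstruction error after projecting onto the leading principal components. It does not reproduce the Designer node: there is no oversampling and no score normalization, and the function names and toy data are hypothetical.

```python
import numpy as np


def fit_pca(normal_data, n_components):
    """Learn the mean and leading principal directions from normal data only."""
    mean = normal_data.mean(axis=0)
    centered = normal_data - mean
    # Rows of vt are principal directions, sorted by descending variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def anomaly_scores(data, mean, components):
    """Score each row by how poorly the retained components reconstruct it."""
    centered = data - mean
    projected = centered @ components.T   # coordinates in the reduced space
    reconstructed = projected @ components
    return np.linalg.norm(centered - reconstructed, axis=1)


rng = np.random.default_rng(0)

# Hypothetical toy data: normal points lie close to the line y = 2x.
x = rng.normal(size=200)
normal = np.column_stack([x, 2 * x + rng.normal(scale=0.05, size=200)])

mean, components = fit_pca(normal, n_components=1)

# The first point follows the normal pattern, the second does not.
test_points = np.array([[1.0, 2.0], [1.0, -2.0]])
scores = anomaly_scores(test_points, mean, components)
```

    Points that follow the pattern learned from the normal data reconstruct well and score low, while points that break the pattern score high. Keeping more components lowers the reconstruction error for every point, which is one reason the number of retained features needs tuning.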

    Submit the pipeline. Due to the large amount of processing required, this can take 10-15 minutes. Grab a coffee, read the linked research paper, or watch another lesson while you wait.

  4. Challenge

    Evaluate the Models

    Because the abnormal class is scored as 0, anomalies are the negative class here. For anomaly detection, we are therefore concerned primarily with True Negatives, which are correctly predicted anomalies. However, we must also pay attention to False Negatives, which are normal values incorrectly predicted as anomalous. A large number of False Negatives adds a lot of noise to the signal we are trying to find. The metric we want is called Negative Predictive Value, which is the ratio of True Negatives to all predicted negatives (True Negatives + False Negatives).

    1. View the results of each Evaluate Model node. Which produces the most True Negatives?
    2. Which model has the best Negative Predictive Value?
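
    The comparison can be worked through with a few lines of Python. The confusion-matrix counts below are made up for illustration and are not results from the lab; the function name is also hypothetical.

```python
def negative_predictive_value(true_negatives, false_negatives):
    """Ratio of correctly predicted anomalies to all predicted anomalies."""
    predicted_negatives = true_negatives + false_negatives
    if predicted_negatives == 0:
        # Undefined when nothing is predicted negative; treated as 0.0 here.
        return 0.0
    return true_negatives / predicted_negatives


# Hypothetical (True Negatives, False Negatives) counts for the three models.
models = {
    "1/3 features": (30, 20),
    "1/2 features": (40, 40),
    "2/3 features": (24, 8),
}

npv = {
    name: negative_predictive_value(tn, fn)
    for name, (tn, fn) in models.items()
}

most_true_negatives = max(models, key=lambda name: models[name][0])
best_npv = max(npv, key=npv.get)
```

    With these made-up counts, the model that catches the most anomalies is not the one with the best Negative Predictive Value, which is why the two questions above are asked separately.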

    There are many more combinations of features and oversampling that can be tried to produce an optimal model, but this gives you a good idea of the spectrum of possible results. You can also try with more or less data. There are plenty of options for hyperparameter tuning in this pipeline.

The Cloud Content team comprises subject matter experts hyper focused on services offered by the leading cloud vendors (AWS, GCP, and Azure), as well as cloud-related technologies such as Linux and DevOps. The team is thrilled to share their knowledge to help you build modern tech solutions from the ground up, secure and optimize your environments, and so much more!