
A Cloud Guru | Azure | Hands-on Lab

Developing with SMOTE in Azure Machine Learning

One of the hard truths of machine learning is that certain kinds of data that would be really useful are also really hard to get. Some data points are more plentiful than others, but understanding (and predicting!) the minority class of data is incredibly useful in many areas. What if you want to detect fraudulent transactions, diagnose rare medical conditions, or discover anomalous behavior in your networks? You'll easily be able to gather plenty of examples of non-fraudulent transactions, common conditions, and normal user behavior, but you may only have a small amount of data for what you want to predict. In this lab, we will explore the Synthetic Minority Oversampling Technique, better known as SMOTE, as a way of boosting the signal of the minority class.


Path Info

Level: Advanced
Duration: 1h 0m
Published: Sep 24, 2020


Table of Contents

  1. Challenge

    Set Up the Workspace

    1. Log in and go to the Azure Machine Learning Studio workspace provided in the lab.

    2. Create a training cluster of D2 instances.

    3. Create a new blank pipeline in the Azure Machine Learning Studio designer.

  2. Challenge

    Create a Baseline Model

    We need a baseline model to compare our SMOTE models against. For this, we will create a basic classification model.

    1. The data you will be working with is in the CRM Churn Labels Shared and CRM Dataset Shared nodes. Set the data in CRM Churn Labels Shared as a label, and rename the column to "Label". Join the CRM dataset with the labels.

    2. Split the data into training and test sets. Use 70% of the data for training. Set a random seed to repeatably split the data.

    3. Using a Two-Class Boosted Decision Tree algorithm, train a classification model. Make sure to use the training data for this step.

    4. Generate predictions for the testing data.

    5. Generate statistics for the predictions.

    6. Submit the pipeline. Create a new experiment to hold the results.

    7. Once the pipeline completes, view the prediction statistics and find the area under the curve (AUC). This is what we will try to beat with our other models.
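    The baseline steps above can be sketched in code. This is a minimal stand-in, assuming scikit-learn's GradientBoostingClassifier in place of the designer's Two-Class Boosted Decision Tree and synthetic toy data in place of the CRM dataset; the lab itself is built from designer nodes, not code.

```python
# Baseline sketch: 70/30 split, boosted-tree classifier, AUC on the test set.
# scikit-learn and make_classification are stand-ins for the designer nodes.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for the CRM churn dataset (~5% minority).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# 70/30 split with a fixed seed so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y
)

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# The AUC on the held-out test set is the number our SMOTE models must beat.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Baseline AUC: {auc:.3f}")
```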

  3. Challenge

    Use SMOTE to Increase Underrepresented Samples

    While we have established our baseline, we should be able to get better results if we have more examples of the underrepresented data. Let's experiment with SMOTE to increase the churn data and see what impact it has on our model.

    1. Create twice as many synthetic churn samples with SMOTE using the single nearest neighboring data point. Allow SMOTE to use all available data. Choose a non-zero random seed so we can start with the same initialization parameters for multiple experiments.

    Note: SMOTE percentage is the percentage increase in the minority examples. This value must be in multiples of 100. A value of 100 means that we will create 100% extra examples, so we'll effectively have twice as many examples. This will not affect the majority class. It will only change the proportion of data being used to train the model.

    Number of nearest neighbors, frequently referred to as K, defines how many similar examples SMOTE can draw on when creating synthetic data. More neighbors yields more varied synthetic samples, but that variety can also introduce noise. For our first pipeline, we'll only use 1 nearest neighbor.
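    The percentage and K parameters map directly onto how SMOTE builds its samples: pick a minority example, pick one of its K nearest minority neighbors, and interpolate between them. A minimal sketch of that mechanism (an illustration only, not the Azure ML module's actual implementation; the function name is hypothetical):

```python
# Minimal sketch of what SMOTE does under the hood.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, percent=100, k=1, seed=42):
    """Create (percent / 100) * len(X_min) synthetic minority samples by
    interpolating between each base sample and one of its k nearest
    minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n_new = len(X_min) * percent // 100          # percent must be a multiple of 100
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Column 0 of the neighbor table is each point itself, so skip it.
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), n_new)            # base samples
    nbr = neighbors[base, rng.integers(0, k, n_new)]     # chosen neighbors
    gap = rng.random((n_new, 1))                         # interpolation factor
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Four minority-class points; percent=100 doubles them (4 new samples),
# each interpolated toward its single (k=1) nearest neighbor.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_oversample(X_minority, percent=100, k=1)
print(synthetic.shape)  # (4, 2)
```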

    2. Split the oversampled data into training and testing data sets. Use 70% of the data for training, and set the same random seed as the baseline model's split.

    3. Train another classification model with the same architecture as before, using the synthetic training data.

    Note: The synthetic data we are creating must only be used for training. This is crucially important. The data generated by SMOTE is statistically similar to the real data, but it is not the real data. When you see "synthetic data", think "manufactured, imaginary data". We do not want to score the model with it. Instead, we will score the model using the test data split from the baseline model.

    4. Use SMOTE to triple the churn examples using the two closest data points. Train another model using this data as above.

    5. Use SMOTE to quadruple the churn examples using the three closest data points. Train another model using this data.

    6. Submit the pipeline, reusing the same experiment. This will take a few minutes since we've added so many nodes to the pipeline.

  4. Challenge

    Evaluate the SMOTE Results

    The area under the curve (AUC) is a good proxy for how well the model performs across many threshold values, so we'll use it to determine the best model. Higher values are better.

    Which model do you think will be best?

    1. Check the AUC of the baseline model so it is fresh in your mind.

    2. For each SMOTE pipeline, find the AUC. Using these, determine which model performed best. Did any of your models perform worse than you expected?

    3. Adjust the threshold value of your best model up and down to balance the model performance. For this problem, we can use the F1 score to help us determine the right threshold. Determining the optimum threshold will involve balancing how much time your Customer Relations team has for outreach against how many customers you are willing to let churn.

    Note: The F1 score tells us the balance between precision and recall, which will help us find a good ratio for our true positives (users correctly predicted to churn) and false positives (users that wouldn't churn but are predicted to). This will help balance the amount of work the Customer Relations team will have to do.
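    The threshold sweep described above can be sketched as follows, again assuming scikit-learn and toy data as stand-ins for the designer's evaluation module:

```python
# Sweep classification thresholds and pick the one with the best F1 score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y
)
model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# F1 balances precision (false positives: needless outreach work) against
# recall (false negatives: customers allowed to churn).
thresholds = np.linspace(0.1, 0.9, 17)
scores = [f1_score(y_test, probs >= t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(f"Best threshold: {best:.2f}, F1: {max(scores):.3f}")
```

In practice the "best" threshold is a business decision, not just the F1 maximum: a lower threshold flags more at-risk customers at the cost of more outreach work.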

The Cloud Content team comprises subject matter experts hyper-focused on services offered by the leading cloud vendors (AWS, GCP, and Azure), as well as cloud-related technologies such as Linux and DevOps. The team is thrilled to share their knowledge to help you build modern tech solutions from the ground up, secure and optimize your environments, and so much more!
