A Cloud Guru
Feature Selection Before Training in Azure Machine Learning
If you are presented with a large number of distinct features to use to train your model, it is rarely a good idea to throw them all at the model. Many of the features will not have any predictive power for your desired label. In the best case, using these extraneous features will only increase training times. At worst, they will increase model complexity, training time, and prediction error rates. To avoid these costly increases, we can use feature engineering to pick the most relevant features before we start training our model. In this lab, we will explore using Pearson's correlation and Chi Squared statistics to pick the best features for our model.
Set Up the Workspace
Log in and go to the Azure Machine Learning Studio workspace provided in the lab.
Create a Training Cluster.
Create a new blank Pipeline in the Azure Machine Learning Studio Designer.
Create the Baseline Model
Create a model using the Boosted Decision Tree Regression algorithm.
Note: Since we are comparing multiple models, we want them to be initialized the same way. For good science, there should be only one difference between the control and experiment groups, which, in our case, is the set of features passed to the model. To accomplish this, set the random seed to any non-zero number.
Train the model. Make sure to use the training data for this step.
Generate predictions using the testing data.
Generate statistics for the predictions.
Submit the pipeline. This will take a couple of minutes to run.
When the pipeline completes, view the prediction statistics.
Note: Since we're comparing models, we need a metric to compare against. Root Mean Square Error (RMSE) determines how far off our model is, on average, from the true price. It is measured in the same units as the label, which makes it very easy to work with. Lower values are better. This model produces the RMSE we will try to beat by engineering features.
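RMSE is simple enough to compute by hand. Below is a minimal sketch in plain NumPy (the prices are made-up numbers for illustration, not values from the lab dataset):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: how far off predictions are, on average,
    measured in the same units as the label. Lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical predicted vs. true car prices
true_prices = [13495, 16500, 16500, 13950]
predicted   = [13000, 16000, 17500, 14500]
print(rmse(true_prices, predicted))  # error in the same units as price
```

Because the error is in price units, an RMSE of, say, 700 means the model's predictions miss the true price by about 700 on average.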
Select Features with Pearson's Correlation
- Create another Boosted Decision Tree Regression model. Use the same random seed as the first model to initialize it in the same way.
- Using Pearson's correlation, rank the features in the training data based on their correlation to the price data. Pass the top 5 to the model.
Note: The algorithm for Pearson's correlation requires numbers, so only numeric columns will be considered.
- Select the same features from the testing data as you did in the training data.
Note: You cannot reuse the selection node for this, because it would run the selection algorithm again — this time on the test data instead of the training data — which can produce a different set of columns.
- Train the model using the selected features from the training data.
- Generate predictions on the testing data filtered to the same set of selected features.
- Generate statistics for the predictions.
- Submit the pipeline. This will take a couple of minutes to run, but it should be faster than before since the data does not have to be reprocessed.
- When the pipeline completes, view the chosen features for both the training and test data sets to see if they line up.
- Find Pearson's r values and see how strongly the selected features correlate to the price.
Note: The closer the value is to 1, the more strongly positively predictive the feature is, meaning the label tends to rise as the feature rises. At 0, there is no correlation (non-numeric columns are also scored 0). The closer the value is to -1, the more strongly negatively predictive the feature is, meaning the label tends to fall as the feature rises — still useful for prediction, just in the opposite direction.
- Check the RMSE (Root Mean Square Error). Did this model perform better or worse than our baseline?
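The filter-based selection above can be sketched outside the Designer. Assuming pandas is available, this toy example (synthetic data, invented column names — not the lab dataset) ranks numeric features by the absolute value of Pearson's r on the training split, then reuses the same column list on the test split, mirroring the note about not re-running the selection:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed, as in the lab
n = 200
df = pd.DataFrame({
    "engine_size": rng.normal(120, 30, n),
    "curb_weight": rng.normal(2500, 400, n),
    "height":      rng.normal(53, 2, n),
})
# Hypothetical label: price driven by engine size and weight, plus noise
df["price"] = 80 * df["engine_size"] + 4 * df["curb_weight"] + rng.normal(0, 500, n)

train, test = df.iloc[:140], df.iloc[140:]  # 70/30 split

# Rank features by |Pearson's r| against the label -- on the TRAIN split only
r = train.drop(columns="price").corrwith(train["price"])
top = r.abs().sort_values(ascending=False).head(2).index.tolist()
print("selected:", top, "r values:", r[top].round(3).to_dict())

# Reuse the SAME column list on the test split; do not re-rank on test data
train_X, test_X = train[top], test[top]
```

The key design point is that the ranking is fitted once, on training data, and only the resulting column list is applied to the test data.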
Select Features with Chi Squared
- Copy and paste each node we created in the previous step, then wire it up the same way as before.
- For this model, change the feature selection to use Chi Squared instead of Pearson's correlation. Also, since we learned that 5 features was not enough in the previous step, try increasing the number of features to 10.
Note: The Chi Squared algorithm does not require numerical data, so all columns will be considered.
- Submit the pipeline. This will again be quick since we don't have to redo any of the previous steps.
- Once the pipeline completes, view the chosen features for both the training and testing data sets to see if they line up.
- Find the Chi Squared values for the chosen columns. Note that the top 5 fields chosen by Pearson's correlation are still considered predictive of price, but they are no longer ranked in the same order.
- Lastly, check the RMSE. This model performed better than our previous experiment, so this is a better feature-selected model of this data. How does it compare to the baseline?
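For comparison, Chi Squared scoring can be sketched with scikit-learn's `SelectKBest` (an assumption for illustration — the Designer's module is not scikit-learn, and scikit-learn's `chi2` expects non-negative numeric features and a categorical label, so this toy example uses a binary label rather than a continuous price):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(7)
n = 300
# Hypothetical non-negative features; scikit-learn's chi2 requires this
doors   = rng.integers(2, 5, n)     # informative: determines the class below
noise_a = rng.integers(0, 10, n)    # uninformative
noise_b = rng.integers(0, 10, n)    # uninformative
label   = (doors >= 4).astype(int)  # toy "expensive" vs "cheap" class

X = np.column_stack([doors, noise_a, noise_b])
selector = SelectKBest(chi2, k=1).fit(X, label)  # fit on training data only
print("chi2 scores:", selector.scores_.round(1))
print("kept column index:", selector.get_support(indices=True))
```

The higher the Chi Squared score, the stronger the association between the feature and the label, so the informative `doors` column should dominate the two noise columns.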
Prepare the Data
- Use the data from the Automobile price data (Raw) dataset.
- Remove the
- Remove rows that are missing the price. We can't train the model using data missing the label.
- Replace all missing values with 0.
- Split the data into training and testing sets. Use 70% of the data for training. Be sure to set a random seed for repeatability.
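The four preparation steps above can be approximated in pandas (a sketch using a tiny, hypothetical stand-in for the Automobile price data (Raw) dataset):

```python
import pandas as pd

# Hypothetical stand-in rows; the real dataset has many more columns
df = pd.DataFrame({
    "make":  ["audi", "bmw", "audi", "volvo", "bmw", "audi"],
    "bore":  [3.19, None, 3.19, 3.78, 3.31, 3.19],
    "price": [13950, 16430, None, 12940, 16925, 15250],
})

df = df.dropna(subset=["price"])  # can't train on rows missing the label
df = df.fillna(0)                 # replace remaining missing values with 0

# 70/30 split with a fixed seed for repeatability
train = df.sample(frac=0.7, random_state=1)
test  = df.drop(train.index)
print(len(train), len(test))
```

Fixing `random_state` plays the same role as the Designer's random seed: rerunning the split produces the same training and testing sets every time.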