Preparing Data for Machine Learning

Saravanan Dhandapani

  • Oct 28, 2019
  • 14 Min read

Introduction

It's a well-known saying that the quality of output depends on the quality of input, and a machine learning model is no different. The accuracy of a machine learning model's predictions depends on the quality and the amount of data fed to it. Though we often cannot control the amount of data, we certainly can control its quality. It's not uncommon in typical machine learning projects for teams to spend 50%-60% of their time preparing data.

In this article, you will learn different techniques for using some of the standard Python libraries, such as pandas and numpy, to turn raw data into quality data.

Data Preparation Basics

Data preparation in a machine learning project can be broadly subdivided into two major categories.

1. Data wrangling. Also known as data munging, this is the process of normalizing data, identifying missing data, and performing cleanup to either remove the missing data or impute missing values using basic statistical operations such as the mean or median.

2. Feature engineering. In machine learning terminology, a column of data is often called a "feature." In many cases, we may need to combine two or more features into a single feature, or derive a new feature based on the existing feature set. For example, if the data has "employee hire date" as one of the features, we can derive the employee's tenure with the company by subtracting the hire date from the current date.
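As a sketch of that example, tenure can be derived with pandas datetime arithmetic (the column names and dates here are hypothetical):

```python
import pandas as pd

# Hypothetical employee data with a hire date feature
emp = pd.DataFrame({"employee": ["a", "b"],
                    "hire_date": pd.to_datetime(["2015-03-01", "2018-07-15"])})

# Derive tenure in whole years by subtracting the hire date from the current date
today = pd.Timestamp("2019-10-28")
emp["tenure_years"] = (today - emp["hire_date"]).dt.days // 365
```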

In this article, we are going to use a hypothetical store that has branches all over the U.S. The sample data includes columns for total inventory items and the number of items that have been sold by a specific branch on a specific date. This can be viewed as a small sample of a huge data set.

Let's import pandas and print the data set.

import pandas as pd
df = pd.read_csv('Downloads/inventory.csv')
print(df)
         Date   Amount  SKU  dept  # Sold  branch id branch loc  Tot Inv
0   9/10/2019  3231.00  232   100     125       1123         tx      225
1   9/10/2019  4455.00  345   100     343       1123         tx      400
2    9/9/2019   322.90  231   100       5       2343         il       25
3   9/16/2019    33.88  342   100       1       2341         ma       10
4   9/17/2019    34.44  456   300       2       3323         ma       25
5   9/14/2019      NaN  238   300      44       4565         ma      100
6   9/15/2019   181.90  678   300      23       1123         tx       50
7   9/13/2019  3434.77  567   300     771       2343         il     1000
8    9/9/2019    22.11  453   400       2       2343         il        5
9   9/16/2019    23.10  560   400       3       1123         ca        5
10  9/10/2019    33.56  789   200      34       3456         ca       50
11   9/9/2019    67.88  679   200      45       6655         ca       50
12  9/16/2019    23.33  134   200       2       3438         ri        5
13  9/14/2019    34.00  567   300      33       3356         ny       50
14  9/14/2019  6777.77  667   700    7788       6678         ny    10000
15  9/15/2019   233.33  889   700     233       2234         tx      500
16  9/13/2019   989.99  898   700     213       2245         sc      500
17  9/15/2019  9899.88  901   900     234       3344         nc      500
18  9/10/2019      NaN  998   900      33       2233         fl      100

Data Wrangling

Missing Data

Methods for dealing with missing data include deleting, imputing, or predicting.

1. Delete. One of the easier ways to address null values in your data is simply to drop them. This is a preferred approach if the null values are relatively few in number; if not, we risk losing meaningful information by dropping them. Pandas has a dropna() function that can be used to drop all rows containing null values. In our sample data, there are two rows with null values.

df = df.dropna()
print(df)

Once we execute the above statement, you can see the resulting data with the null rows removed.

         Date   Amount  SKU  dept  # Sold  branch id branch loc  Tot Inv
0   9/10/2019  3231.00  232   100     125       1123         tx      225
1   9/10/2019  4455.00  345   100     343       1123         tx      400
2    9/9/2019   322.90  231   100       5       2343         il       25
3   9/16/2019    33.88  342   100       1       2341         ma       10
4   9/17/2019    34.44  456   300       2       3323         ma       25
6   9/15/2019   181.90  678   300      23       1123         tx       50
7   9/13/2019  3434.77  567   300     771       2343         il     1000
8    9/9/2019    22.11  453   400       2       2343         il        5
9   9/16/2019    23.10  560   400       3       1123         ca        5
10  9/10/2019    33.56  789   200      34       3456         ca       50
11   9/9/2019    67.88  679   200      45       6655         ca       50
12  9/16/2019    23.33  134   200       2       3438         ri        5
13  9/14/2019    34.00  567   300      33       3356         ny       50
14  9/14/2019  6777.77  667   700    7788       6678         ny    10000
15  9/15/2019   233.33  889   700     233       2234         tx      500
16  9/13/2019   989.99  898   700     213       2245         sc      500
17  9/15/2019  9899.88  901   900     234       3344         nc      500

2. Impute. There are cases where we cannot afford to drop existing data, especially if the sample size is relatively small or the ratio of null values is relatively high. In these cases, we need to impute missing data, and different strategies are available for that. Some of the commonly used approaches for continuous data are the mean, median, or mode of the feature. For categorical data, the mode is usually the preferred approach. Pandas has a fillna() method to accomplish this.
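For instance, a mean imputation on the 'Amount' feature might look like this (a minimal sketch on a toy frame):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing Amount values
df = pd.DataFrame({"Amount": [3231.00, np.nan, 322.90, np.nan, 34.44]})

# Replace missing values with the mean of the non-missing values
df["Amount"] = df["Amount"].fillna(df["Amount"].mean())
```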

3. Predict. This is the scenario where we cannot afford to guess a wrong value. Instead of imputing approximate values, we predict the missing values using machine learning algorithms: a regression model for continuous data and a classification model for categorical data.
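As a minimal sketch of this idea, a simple linear fit can predict a missing 'Amount' from another feature. In practice you would use a full regression model (e.g., from scikit-learn) trained on several features; the values below are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values): Amount is missing for one row
df = pd.DataFrame({"tot_inv": [225, 400, 25, 100, 50],
                   "amount": [3231.0, 4455.0, 322.9, np.nan, 181.9]})

known = df[df["amount"].notna()]
missing_mask = df["amount"].isna()

# Fit a straight line of amount against tot_inv on the known rows,
# then predict the missing amounts from their tot_inv values
slope, intercept = np.polyfit(known["tot_inv"], known["amount"], 1)
df.loc[missing_mask, "amount"] = slope * df.loc[missing_mask, "tot_inv"] + intercept
```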

Detecting Outliers

While preparing the data, we must look out for extreme values. Some of these values may be genuine cases, but some could be erroneous. The presence of outliers would significantly affect the modeling process and hence the prediction accuracy.

For example, in the data above you can see that observation (row) 17 has an extreme price. Looking at other data, it seems there is a high possibility that this could be a user error.

The z-score of an observation is a common way to detect outliers. To calculate the z-score of the 'Amount' feature, we will use the following formula.

zscore = (Amount - Mean Amount)
         ----------------------------
         Standard Deviation of Amount

We can set a specific threshold (e.g., 2.0 or 2.5 standard deviations), and once an observation's absolute z-score exceeds this value, we can reject it as an outlier. Let's compute the z-score of the 'Amount' feature and plot a graph.

df['Amount_zscore'] = (df['Amount'] - df['Amount'].mean())/df['Amount'].std()
print(df['Amount_zscore'])
0     0.508907
1     0.930321
2    -0.492328
3    -0.591836
4    -0.591643
6    -0.540874
7     0.579064
8    -0.595888
9    -0.595547
10   -0.591946
11   -0.580130
12   -0.595468
13   -0.591794
14    1.730032
15   -0.523167
16   -0.262655
17    2.804950
df['Amount_zscore'].plot.kde()
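Once the z-scores are computed, applying the threshold is a one-line filter. A minimal sketch on illustrative z-score values:

```python
import pandas as pd

# Illustrative z-scores; 2.80 exceeds a 2.5 threshold and would be rejected
df = pd.DataFrame({"Amount_zscore": [0.51, 2.80, -0.59, 1.73]})

threshold = 2.5
df_clean = df[df["Amount_zscore"].abs() <= threshold]
```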

This is one technique used to detect and eliminate outliers. There are other techniques, such as DBSCAN and Isolation Forest, that can be used depending on the data. Their explanation is beyond the scope of this article.

Normalization

When you have data with multiple features, each with a different unit of measurement, features on larger scales can dominate the model. Because of this, it's important to convert all applicable features to the same standard scale. This technique is called normalization, or feature scaling.

For example, in our data, inventory quantities range from 1 to 1000, whereas cost ranges from 1 to 10000.

Min-Max scaling is a commonly used technique in normalizing the data. The formula is:

Feature(Normalized) = (Feature Value - Min. Feature Value)
                      -----------------------------------------
                      (Max. Feature Value - Min. Feature Value)

It's important to apply the outlier-detection technique mentioned above before normalizing the data. Otherwise, you run the risk of compressing the normal values in your data into a small interval.
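A minimal sketch of Min-Max scaling on the 'Tot Inv' feature (toy values):

```python
import pandas as pd

# Toy inventory quantities spanning a wide range
df = pd.DataFrame({"Tot Inv": [225, 400, 25, 10, 1000]})

col = df["Tot Inv"]
# Rescale to the [0, 1] interval using the Min-Max formula
df["Tot_Inv_norm"] = (col - col.min()) / (col.max() - col.min())
```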

Encoding Data

Data is not always numerical. You may have, for example, text data that is categorical. Referring to our dataset, though most of the features are numerical, "branch loc" refers to the state in which a specific branch is located, and it contains text data. As part of preparing the data, it's important to convert this feature to numerical data. There are many ways you can do this. We are going to use a "Label Encoding" technique.

df["branch loc"] = df["branch loc"].astype('category')
df["branch_loc_cat"] = df["branch loc"].cat.codes
print(df[["branch loc", "branch_loc_cat"]])
   branch loc  branch_loc_cat
0          tx               7
1          tx               7
2          il               1
3          ma               2
4          ma               2
6          tx               7
7          il               1
8          il               1
9          ca               0
10         ca               0
11         ca               0
12         ri               5
13         ny               4
14         ny               4
15         tx               7
16         sc               6
17         nc               3

You can see that a numerical value has been assigned to each state. We first converted the feature to a categorical type before applying categorical codes to it.
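One alternative to label encoding, in case the model might misread the integer codes as an ordering, is one-hot encoding with pandas' get_dummies (a sketch on a few states):

```python
import pandas as pd

# Toy branch locations
df = pd.DataFrame({"branch loc": ["tx", "il", "ma", "tx"]})

# One-hot encoding creates a separate 0/1 column per state
dummies = pd.get_dummies(df["branch loc"], prefix="branch")
```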

Feature Engineering

Feature engineering requires both domain and technical expertise. Having too few features in the data may result in a poorly performing model, and too many features may result in an overly complex model. When there are too many features in the training data, the model may overfit, performing accurately on training data but poorly on new, unseen test data. It's important to select an optimal number of features that help us design a better-performing model. Selecting an optimal set of features is a two-step process:

  1. Removing unwanted features from the feature list, also called "feature selection."
  2. Extracting new features from the existing set of features, also called "feature extraction."

Dimensionality Reduction

Having too many features in the data increases model complexity, training time, and computation cost, and it may decrease prediction accuracy. So it is advisable to reduce the features to the set that yields optimal prediction accuracy. One technique for doing this is dimensionality reduction. It is achieved in two different ways:

  1. Identifying the features that are relevant for the training process and eliminating those that are redundant or add little meaning to the data.
  2. Combining multiple features into a new, smaller set of features without losing the information conveyed in those features.

There are many techniques used to accomplish this. Some of the most common are:

  • Principal Component Analysis (PCA)

  • Random Forest

  • Low Variance Filter

  • High Correlation Filter
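As a sketch of one of these, a low variance filter can be implemented directly in pandas by dropping features whose variance falls below an illustrative threshold:

```python
import pandas as pd

# Toy frame: "a" is constant and carries no information
df = pd.DataFrame({"a": [1, 1, 1, 1],
                   "b": [10, 20, 15, 40]})

# Drop features whose variance is below the (illustrative) threshold
variances = df.var()
low_var = variances[variances < 0.1].index
df_reduced = df.drop(columns=low_var)
```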

Imbalanced Data

For business problems like credit card fraud detection, spam filtering, or medical diagnosis, the class of interest could represent less than 1% of the sample. In cases like this, we need to be careful not to reject accurate data as noise or outliers. If we use accuracy as the performance metric for such a model, a model that simply labels every credit card transaction as valid will score 99% accuracy. But businesses care most about the 1% of fraudulent transactions. So accuracy is not the best performance metric here, and we may need to rely on other metrics like precision and recall (also called sensitivity). All these metrics can be derived once we have tabulated the confusion matrix.

Some of the commonly used techniques to improve prediction score include:

  • Resampling — Undersample the majority class: When we have far more positive data (valid credit card/valid email) in our sample dataset than negative data (invalid credit card/spam email), we can remove some of the positive observations. By reducing the amount of positive data, we effectively increase the ratio of negative sample data.
  • Resampling — Oversample the minority class: In this technique, we add more copies of negative data (invalid credit card/spam email) to our sample dataset. We need to be cautious not to run the risk of overfitting the data.
  • SMOTE: The Synthetic Minority Oversampling Technique (SMOTE) generates new, synthetic minority-class samples that can be added to the data used to train the model.
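The undersampling idea above can be sketched with pandas alone (the class ratio here is illustrative; SMOTE itself is typically applied via the third-party imbalanced-learn library):

```python
import pandas as pd

# Toy imbalanced dataset: 95 valid transactions, 5 fraudulent ones
df = pd.DataFrame({"is_fraud": [0] * 95 + [1] * 5})

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Randomly downsample the majority class to the minority class size
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])
```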

Conclusion

Though this guide addresses the important aspects of preparing the data, it is just the tip of the iceberg. Multiple other approaches and techniques are available depending on the type of data. The Python ecosystem offers multiple libraries that are well equipped to address this optimally. This is also an iterative process, and you may need to go through some of these steps more than once to fine-tune your data to prepare a better-predicting model.
