Vaibhav Sharma

Handling Missing Data in Machine Learning Models

  • Jan 29, 2019
  • 6 Min read
  • 955 Views
Data
Python

Introduction

Missing data is one of the most common annoyances when dealing with data sets of any size. There are multiple reasons why data might be missing from a data set. Some of the common ones are:

  • Data is merged from various sources: one source did not capture a value that another source did, leaving gaps in the combined data.
  • Data gets richer over time: records collected earlier may lack attributes that were only added later.
  • Data is no longer collected: for some reasons (ethical, political) data may stop being collected for certain attributes, e.g. a government may decide that data related to religion, race, and ethnicity should no longer be collected.
  • During a survey process, some individuals could not answer all of the questions.

Terminology Used To Describe the Missing Data

Based on the origin of the missing data, the following terminology is used to describe it:

  1. Missing At Random (MAR): This category of missing data refers to attributes that could not be answered because of the way the survey was designed. For example, consider the following questions in a survey:

    a. Do you smoke? Yes, No

    b. If yes, how frequently? once a week, once a day, twice a day, more than 2 times in a day

You can see that question b can be answered only if the answer to question a is ‘Yes’. This kind of missing value arises from the dependency of one attribute on another.

  2. Missing Completely At Random (MCAR): This category refers to data that is truly missing, i.e. not captured due to oversight or other reasons. In a survey, a person may take a break while filling in a questionnaire and, after coming back, start from the next page, leaving a few questions on the previous page unanswered.

  3. Missing Not At Random (MNAR): This category of missing data depends on the value of the data itself. For example, suppose a survey asks people to reveal their 10th-grade marks in Chemistry. People with lower marks may choose not to reveal them, so you would see only high marks in the data sample.
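The dependency behind MAR values can be made concrete with a small, made-up pandas frame (the column names and values are purely illustrative) in which 'frequency' is recorded only for smokers:

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers: 'frequency' is asked only when 'smokes'
# is 'Yes', so its missingness depends entirely on another observed
# column -- the MAR pattern described above.
survey = pd.DataFrame({
    'smokes': ['Yes', 'No', 'Yes', 'No', 'Yes'],
    'frequency': ['once a day', np.nan, 'twice a day', np.nan, 'once a week'],
})

# Every row with a missing 'frequency' has smokes == 'No'.
print(survey.loc[survey['frequency'].isna(), 'smokes'].unique())
```

Because the missingness is fully explained by the 'smokes' column, these gaps are structural rather than accidental, which affects how (or whether) they should be filled.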

What to Do with Missing Data

There are two primary ways in which we can handle the missing data.

Deleting the Data

In this method of handling missing data, the user removes the records or columns with missing values from the data set.

Let’s consider the following data set:

import pandas as pd

df = pd.read_csv('household_data_missing.csv')
print(df)

Output:

  Item_Category  Gender  Age   Salary Purchased  satisfaction
0       Fitness    Male   20      NaN       Yes            NaN
1       Fitness  Female   50  70000.0        No            NaN
2          Food    Male   35  50000.0       Yes            NaN
3       Kitchen    Male   22      NaN        No            NaN
4       Kitchen  Female   30  35000.0       Yes            NaN
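Before deleting anything, it helps to see how much is actually missing. A quick sketch, reconstructing the same hypothetical data frame shown above (the CSV file itself is not included here):

```python
import numpy as np
import pandas as pd

# Reconstruct the data frame printed above.
df = pd.DataFrame({
    'Item_Category': ['Fitness', 'Fitness', 'Food', 'Kitchen', 'Kitchen'],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
    'Age': [20, 50, 35, 22, 30],
    'Salary': [np.nan, 70000.0, 50000.0, np.nan, 35000.0],
    'Purchased': ['Yes', 'No', 'Yes', 'No', 'Yes'],
    'satisfaction': [np.nan] * 5,
})

# Count missing values per column: Salary has 2, satisfaction has 5.
print(df.isna().sum())
```

Counting per column makes the next two choices obvious: satisfaction is entirely empty and can be dropped as a column, while Salary is only partially missing and may be worth filling instead.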

Remove all of the columns that have all values as NA.

print(df.dropna(axis='columns', how='all'))

Output:

  Item_Category  Gender  Age   Salary Purchased
0       Fitness    Male   20      NaN       Yes
1       Fitness  Female   50  70000.0        No
2          Food    Male   35  50000.0       Yes
3       Kitchen    Male   22      NaN        No
4       Kitchen  Female   30  35000.0       Yes

Retain all rows that have at least five values present.

print(df.dropna(axis='rows', thresh=5))

Output:

  Item_Category  Gender  Age   Salary Purchased  satisfaction
1       Fitness  Female   50  70000.0        No            NaN
2          Food    Male   35  50000.0       Yes            NaN
4       Kitchen  Female   30  35000.0       Yes            NaN

Interpolation

It is advisable to retain as much of the data as possible instead of deleting it. To achieve this, the user can estimate the unknown values from the available data points, a technique known as interpolation. The pandas interpolate function provides various methods for filling in these values.

print(df.interpolate(method='linear'))

Output:

  Item_Category  Gender  Age        Salary Purchased  satisfaction
0       Fitness    Male   20  25000.000000       Yes            NaN
1       Fitness  Female   50  70000.000000        No            NaN
2          Food    Male   35  58333.333333       Yes            NaN
3       Kitchen    Male   22  46666.666667        No            NaN
4       Kitchen  Female   30  35000.000000       Yes            NaN

print(df.interpolate(method='quadratic'))

Output:

  Item_Category  Gender  Age        Salary Purchased  satisfaction
0       Fitness    Male   20  25000.000000       Yes            NaN
1       Fitness  Female   50  70000.000000        No            NaN
2          Food    Male   35  86666.666667       Yes            NaN
3       Kitchen    Male   22  75000.000000        No            NaN
4       Kitchen  Female   30  35000.000000       Yes            NaN

Other Methods

There are many other interpolation methods suited to different situations:

  1. Spline: useful when the estimated values may fall outside the known minimum and maximum range.
  2. Kriging: uses the correlation of all existing data points to predict the values of the missing data.
  3. Quadratic: suited for data whose value changes at an increasing rate.
  4. Akima: use when the aim is a smooth transition from one point to the next.
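A quick sketch of how the spline and Akima methods are invoked through the pandas interpolate function, using a made-up numeric series (these methods require SciPy to be installed; kriging, by contrast, is not part of pandas and typically comes from dedicated geostatistics libraries):

```python
import numpy as np
import pandas as pd

# A hypothetical numeric series with interior gaps; order-based methods
# such as spline and akima need a numeric index and SciPy under the hood.
s = pd.Series([10.0, np.nan, 30.0, 42.0, np.nan, 61.0, 75.0, np.nan, 100.0])

# Linear draws straight segments between known neighbors; spline and
# akima fit curves, so the filled values can bend beyond those segments.
print(s.interpolate(method='linear'))
print(s.interpolate(method='spline', order=2))
print(s.interpolate(method='akima'))
```

Comparing the three outputs on the same series is a cheap way to see how strongly the choice of method shapes the filled values before committing to one.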

Conclusion

There are many ways a user can handle missing data, from deleting the data points that have missing values to interpolation. However, each strategy involves factors and risks that need to be understood before selecting a method. As seen above, the user should try to make use of the data at hand as much as possible, but using interpolation over sparsely scattered data may lead to overfitting and thus unpredictable results.
