Vaibhav Sharma

Handling Missing Data in Machine Learning Models

  • Jan 29, 2019
  • 6 Min read
  • 955 Views
Data
Python

Introduction

Missing data is one of the most common annoyances when dealing with data sets of any size. There are multiple reasons why data might be missing from a data set. Some of the common ones are:

  • Data is merged from various sources: one source did not capture a value that another source did, leaving gaps in the combined data.
  • Data gets richer over time: records collected earlier may lack attributes that were only added later.
  • Data is no longer collected: for some reasons (ethical, political) data may stop being collected for certain attributes, e.g. a government may decide that data related to religion, race, and ethnicity should no longer be collected.
  • During a survey process, some individuals could not answer all of the questions.

Terminology Used To Describe the Missing Data

Based on the origin of the missing data, the following terminology is used to describe it:

  1. Missing At Random (MAR): This category of missing data refers to attributes that could not be answered because of the way the survey was designed. For example, consider the following questions in a survey:

    a. Do you smoke? Yes, No

    b. If yes, how frequently? once a week, once a day, twice a day, more than 2 times in a day

You can see that question b can be answered only if the answer to question a is ‘Yes’. This kind of missing value arises from the dependency of one attribute on another.

  2. Missing Completely At Random (MCAR): This category refers to data that is truly missing, i.e. not captured due to oversight or other reasons. In a survey, a person may take a break while filling in a questionnaire and, after coming back, start from the next page, leaving a few questions on the previous page unanswered.

  3. Missing Not At Random (MNAR): This category of missing data depends on the value of the data itself. For example, suppose a survey asks people to reveal their 10th-grade marks in Chemistry. People with lower marks may choose not to reveal them, so you would see only high marks in the data sample.
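The dependency behind MAR values can be made concrete with a small, made-up pandas frame (the column names and values are purely illustrative) in which 'frequency' is recorded only for smokers:

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers: 'frequency' is asked only when 'smokes'
# is 'Yes', so its missingness depends entirely on another observed
# column -- the MAR pattern described above.
survey = pd.DataFrame({
    'smokes': ['Yes', 'No', 'Yes', 'No', 'Yes'],
    'frequency': ['once a day', np.nan, 'twice a day', np.nan, 'once a week'],
})

# Every row with a missing 'frequency' has smokes == 'No'.
print(survey.loc[survey['frequency'].isna(), 'smokes'].unique())
```

Because the missingness is fully explained by the 'smokes' column, these gaps are structural rather than accidental, which affects how (or whether) they should be filled.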

What to Do with Missing Data

There are two primary ways in which we can handle the missing data.

Deleting the Data

In this method of handling missing data, the user removes the records or columns with missing values from the data set.

Let’s consider the following data set:

import pandas as pd

df = pd.read_csv('household_data_missing.csv')
print(df)

Output:

  Item_Category  Gender  Age   Salary Purchased  satisfaction
0       Fitness    Male   20      NaN       Yes            NaN
1       Fitness  Female   50  70000.0        No            NaN
2          Food    Male   35  50000.0       Yes            NaN
3       Kitchen    Male   22      NaN        No            NaN
4       Kitchen  Female   30  35000.0       Yes            NaN
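Before deleting anything, it helps to see how much is actually missing. A quick sketch, reconstructing the same hypothetical data frame shown above (the CSV file itself is not included here):

```python
import numpy as np
import pandas as pd

# Reconstruct the data frame printed above.
df = pd.DataFrame({
    'Item_Category': ['Fitness', 'Fitness', 'Food', 'Kitchen', 'Kitchen'],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
    'Age': [20, 50, 35, 22, 30],
    'Salary': [np.nan, 70000.0, 50000.0, np.nan, 35000.0],
    'Purchased': ['Yes', 'No', 'Yes', 'No', 'Yes'],
    'satisfaction': [np.nan] * 5,
})

# Count missing values per column: Salary has 2, satisfaction has 5.
print(df.isna().sum())
```

Counting per column makes the next two choices obvious: satisfaction is entirely empty and can be dropped as a column, while Salary is only partially missing and may be worth filling instead.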

Remove all of the columns that have all values as NA.

print(df.dropna(axis='columns', how='all'))

Output:

  Item_Category  Gender  Age   Salary Purchased
0       Fitness    Male   20      NaN       Yes
1       Fitness  Female   50  70000.0        No
2          Food    Male   35  50000.0       Yes
3       Kitchen    Male   22      NaN        No
4       Kitchen  Female   30  35000.0       Yes

Retain all rows that have at least five values present.

print(df.dropna(axis='rows', thresh=5))

Output:

  Item_Category  Gender  Age   Salary Purchased  satisfaction
1       Fitness  Female   50  70000.0        No            NaN
2          Food    Male   35  50000.0       Yes            NaN
4       Kitchen  Female   30  35000.0       Yes            NaN

Interpolation

It is advisable to retain as much of the data as possible instead of deleting it. To achieve this, the user can estimate the unknown values from the available data points, a technique known as interpolation. The pandas interpolate function provides various methods for filling in these values.

print(df.interpolate(method='linear'))

Output:

  Item_Category  Gender  Age        Salary Purchased  satisfaction
0       Fitness    Male   20  25000.000000       Yes            NaN
1       Fitness  Female   50  70000.000000        No            NaN
2          Food    Male   35  58333.333333       Yes            NaN
3       Kitchen    Male   22  46666.666667        No            NaN
4       Kitchen  Female   30  35000.000000       Yes            NaN

print(df.interpolate(method='quadratic'))

Output:

  Item_Category  Gender  Age        Salary Purchased  satisfaction
0       Fitness    Male   20  25000.000000       Yes            NaN
1       Fitness  Female   50  70000.000000        No            NaN
2          Food    Male   35  86666.666667       Yes            NaN
3       Kitchen    Male   22  75000.000000        No            NaN
4       Kitchen  Female   30  35000.000000       Yes            NaN

Other Methods

There are many other interpolation methods suited to different situations:

  1. Spline: useful when the estimated values may fall outside the known minimum and maximum range.
  2. Kriging: uses the correlation of all existing data points to predict the values of the missing data.
  3. Quadratic: suited for data whose value changes at an increasing rate.
  4. Akima: use when the aim is a smooth transition from one point to the next.
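A quick sketch of how the spline and Akima methods are invoked through the pandas interpolate function, using a made-up numeric series (these methods require SciPy to be installed; kriging, by contrast, is not part of pandas and typically comes from dedicated geostatistics libraries):

```python
import numpy as np
import pandas as pd

# A hypothetical numeric series with interior gaps; order-based methods
# such as spline and akima need a numeric index and SciPy under the hood.
s = pd.Series([10.0, np.nan, 30.0, 42.0, np.nan, 61.0, 75.0, np.nan, 100.0])

# Linear draws straight segments between known neighbors; spline and
# akima fit curves, so the filled values can bend beyond those segments.
print(s.interpolate(method='linear'))
print(s.interpolate(method='spline', order=2))
print(s.interpolate(method='akima'))
```

Comparing the three outputs on the same series is a cheap way to see how strongly the choice of method shapes the filled values before committing to one.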

Conclusion

There are many ways a user can handle missing data, from deleting the data points that have missing values to interpolation. However, each strategy involves factors and risks that need to be understood before selecting a method. As seen above, the user should try to make use of the data at hand as much as possible, but using interpolation over sparsely scattered data may lead to overfitting and thus unpredictable results.
