Missing data is one of the annoying aspects that occur when dealing with data sets of varying sizes. There are multiple reasons due to which data might be missing in the data sets. Some of the common reasons are:
Based on the origin of the missing data, the following terminology is used to describe it:
Missing At Random (MAR): This category of missing data refers to the attributes that could not be answered due to the way the survey was designed. For example, consider the following questions in a survey:
a. Do you smoke? Yes, No
b. If yes, how frequently? once a week, once a day, twice a day, more than 2 times in a day
You can see that answer to question b can be given only if the answer to the question a is ‘Yes’. This kind of missing values in the dataset arise due to the dependency of one attribute on another attribute.
Missing Completely At Random (MCAR): This category of missing data is truly missed data or data that was not captured due to oversight or for other reasons. In a survey, a person may take a break while filling in a questionnaire and, after coming back, he may start from the next page leaving a few of the questions on the previous page unanswered.
There are two primary ways in which we can handle the missing data.
In this method of handling missing data, the user removes the record or column for which data is missing from the data set.
Let’s consider the following data set:
1import pandas as pd
2df = pd.read_csv('household_data_missing.csv')
3print(df)
Output:
1 Item_Category Gender Age Salary Purchased satisfaction
20 Fitness Male 20 NaN Yes NaN
31 Fitness Female 50 70000.0 No NaN
42 Food Male 35 50000.0 Yes NaN
53 Kitchen Male 22 NaN No NaN
64 Kitchen Female 30 35000.0 Yes NaN
Remove all of the columns that have all values as NA.
1print(df.dropna(axis='columns', how='all'))
Output:
1 Item_Category Gender Age Salary Purchased
20 Fitness Male 20 NaN Yes
31 Fitness Female 50 70000.0 No
42 Food Male 35 50000.0 Yes
53 Kitchen Male 22 NaN No
64 Kitchen Female 30 35000.0 Yes
Retain all rows that have at least five values present.
1print(df.dropna(axis='rows', thresh=5))
Output:
1 Item_Category Gender Age Salary Purchased satisfaction
21 Fitness Female 50 70000.0 No NaN
32 Food Male 35 50000.0 Yes NaN
44 Kitchen Female 30 35000.0 Yes NaN
It is advisable to retain the data as much as possible without deleting it. To achieve this, the user can utilize the available data points to estimate the values of the unknown data by using the technique known as interpolation. There are various methods provided in pandas interpolate function that can be used to obtain the data values.
1print(df.interpolate(method='linear'))
Output:
1 Item_Category Gender Age Salary Purchased satisfaction
20 Fitness Male 20 25000.000000 Yes NaN
31 Fitness Female 50 70000.000000 No NaN
42 Food Male 35 58333.333333 Yes NaN
53 Kitchen Male 22 46666.666667 No NaN
64 Kitchen Female 30 35000.000000 Yes NaN
1print(df.interpolate(method='quadratic'))
Output:
1 Item_Category Gender Age Salary Purchased satisfaction
20 Fitness Male 20 25000.000000 Yes NaN
31 Fitness Female 50 70000.000000 No NaN
42 Food Male 35 86666.666667 Yes NaN
53 Kitchen Male 22 75000.000000 No NaN
64 Kitchen Female 30 35000.000000 Yes NaN
There are many other methods that are provided that can be used in different situations:
There are many ways that a user can handle missing data, from deleting the data points having missing data to interpolation. However, there are many factors and risks involved in each of the strategies that need to be understood before making the selection of the method. As seen above user should try to make use of data available in hand as much as possible but using Interpolation over a scarcely scattered data may lead to overfitting of the data and thus resulting in unpredictable results.