Cleaning data is a critical component of data science and predictive modeling. Even the best machine learning algorithms will fail if the data is not clean. In this guide, you will learn the techniques required to perform the most widely used data cleaning tasks in Python.
We will be working with fictitious data on loan applicants, which contains 600 observations and 12 variables, as described below:
UID - Unique identifier of an applicant.
Marital_status - Whether the applicant is married ("Yes") or not ("No").
Dependents - Number of dependents of the applicant.
Is_graduate - Whether the applicant is a graduate ("Yes") or not ("No").
Income - Annual Income of the applicant (in US dollars).
Loan_amount - Loan amount (in US dollars) for which the application was submitted.
Term_months - Tenure of the loan (in months).
Credit_score - Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").
approval_status - Whether the loan application was approved (1) or not (0).
Age - The applicant's age in years.
Sex - Whether the applicant is female (F) or male (M).
Purpose - Purpose of applying for the loan.
Let's start by loading the required libraries and the data.
# Import required libraries
import pandas as pd
import numpy as np

# Reading the data
df = pd.read_csv("data_cle.csv")
print(df.shape)
print(df.info())
Output:
(600, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 12 columns):
UID                600 non-null object
Marital_status     600 non-null object
Dependents         593 non-null float64
Is_graduate        594 non-null object
Income             600 non-null int64
Loan_amount        593 non-null float64
Term_months        594 non-null float64
Credit_score       600 non-null object
approval_status    600 non-null int64
Age                598 non-null float64
Sex                600 non-null object
Purpose            600 non-null object
dtypes: float64(4), int64(2), object(6)
memory usage: 56.3+ KB
None
The above output shows that there are 600 observations of 12 variables. Some of the variables have fewer than 600 non-null records, indicating the presence of missing values.
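Before cleaning, it can also help to glance at a few raw records. The snippet below is a minimal sketch, assuming the same dataframe df loaded above.

# Preview the first five rows to get a feel for the raw data
print(df.head())

# Peek at the labels of one of the categorical columns, e.g. 'Sex'
print(df['Sex'].unique())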
One of the most common problems in real-world data is the presence of duplicate records. Such records add no information and must be removed. In our dataset, 'UID' is the unique identifier variable, so it is used to detect and drop duplicates. The first line of code below uses the drop_duplicates() method to keep the first occurrence of each 'UID' and drop the rest. The second line prints the shape of the data: 597 observations of 12 variables, which means three duplicate records have been removed.
df = df.drop_duplicates('UID', keep='first')
print(df.shape)
Output:
(597, 12)
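If you want to confirm that no duplicated identifiers remain (or, run before the drop, to see how many there were), a quick check is sketched below; it assumes the same dataframe df.

# Count rows whose 'UID' already appeared earlier in the dataframe
print(df['UID'].duplicated().sum())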
Let us now look at the summary statistics of the numerical variables, using the line of code below. The output shows that the minimum value of the 'Age' variable is -12. This is impossible and brings us to the next common problem in real-world datasets: the presence of incorrect records.
df.describe()
Output:
|       | Dependents | Income       | Loan_amount  | Term_months | approval_status | Age        |
|-------|------------|--------------|--------------|-------------|-----------------|------------|
| count | 590.000000 | 5.970000e+02 | 5.900000e+02 | 591.000000  | 597.000000      | 595.000000 |
| mean  | 0.755932   | 7.073414e+05 | 3.287864e+05 | 366.111675  | 0.681742        | 49.030252  |
| std   | 1.029987   | 7.125482e+05 | 7.293830e+05 | 59.970906   | 0.466191        | 15.698842  |
| min   | 0.000000   | 3.000000e+04 | 1.860000e+04 | 60.000000   | 0.000000        | -12.000000 |
| 25%   | 0.000000   | 3.852000e+05 | 6.150000e+04 | 384.000000  | 0.000000        | 36.500000  |
| 50%   | 0.000000   | 5.084000e+05 | 7.650000e+04 | 384.000000  | 1.000000        | 51.000000  |
| 75%   | 1.000000   | 7.661000e+05 | 1.346250e+05 | 384.000000  | 1.000000        | 61.000000  |
| max   | 6.000000   | 8.444900e+06 | 7.780000e+06 | 504.000000  | 1.000000        | 76.000000  |
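To gauge how widespread the problem is before fixing it, you can count the offending rows; a minimal sketch, assuming the same dataframe df:

# Count the records with an impossible (negative) age
print((df['Age'] < 0).sum())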
There are many techniques to handle such incorrect entries. One approach is to delete the records; instead, we will replace the negative values with 20, on the assumption that the minimum age of a loan applicant is twenty. The first line of code below locates the negative values in the 'Age' variable and replaces them with 20. The second line prints the summary statistics for 'Age', confirming that the correction has been made.
# Replace negative ages with the assumed minimum age of 20
df.loc[df['Age'] < 0, 'Age'] = 20.0

df['Age'].describe()
Output:
count    595.000000
mean      49.373109
std       14.703877
min       20.000000
25%       36.500000
50%       51.000000
75%       61.000000
max       76.000000
Name: Age, dtype: float64
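An equivalent alternative, if you prefer a one-liner, is pandas' clip(); this is a sketch of an option, not a step the rest of the guide depends on:

# Floor the 'Age' column at 20 in a single call
df['Age'] = df['Age'].clip(lower=20)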
Missing value treatment is the most common data cleaning step in a data science project. Chaining the isnull() and sum() methods gives the number of missing values in each variable.
df.isnull().sum()
Output:
UID                0
Marital_status     0
Dependents         7
Is_graduate        6
Income             0
Loan_amount        7
Term_months        6
Credit_score       0
approval_status    0
Age                2
Sex                0
Purpose            0
dtype: int64
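On larger datasets, the share of missing values per column is often more informative than absolute counts. A minimal sketch, assuming the same dataframe df:

# Percentage of missing values in each column, rounded to two decimals
print((df.isnull().mean() * 100).round(2))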
The output above shows missing values in several variables. Most of them are numerical; the exception is 'Is_graduate', which is categorical. For a categorical variable, it is useful to first look at the frequency distribution, which the line of code below produces. The output shows that the majority of the applicants are graduates, identified by the label 'Yes'.
df['Is_graduate'].value_counts()
Output:
Yes    463
No     128
Name: Is_graduate, dtype: int64
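Rather than reading the most frequent label off the table, you can also look it up programmatically; a small sketch, assuming the same dataframe:

# The mode of 'Is_graduate' is the label used for imputation below
print(df['Is_graduate'].mode()[0])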
We are now ready for missing value imputation. The most widely used technique is to replace missing values with a measure of central tendency such as the mean, median, or mode. The first three lines of code below impute the continuous variables 'Loan_amount', 'Term_months', and 'Age' with their respective means, while the fourth line imputes the categorical variable 'Is_graduate' with its mode, 'Yes'. The next line deletes any rows that still contain missing values, and the final line confirms that no missing values remain.
# Missing value treatment for continuous variables
df['Loan_amount'] = df['Loan_amount'].fillna(df['Loan_amount'].mean())
df['Term_months'] = df['Term_months'].fillna(df['Term_months'].mean())
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Missing value treatment for the categorical variable
df['Is_graduate'] = df['Is_graduate'].fillna('Yes')

# Deleting the missing values in remaining variables
df = df.dropna()

# Checking the missing values again
df.isnull().sum()
Output:
UID                0
Marital_status     0
Dependents         0
Is_graduate        0
Income             0
Loan_amount        0
Term_months        0
Credit_score       0
approval_status    0
Age                0
Sex                0
Purpose            0
dtype: int64
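If you prefer to keep imputation inside a modeling pipeline, scikit-learn's SimpleImputer performs the same job. The sketch below is an alternative to the pandas approach above, not a step in this guide, and assumes scikit-learn is installed.

from sklearn.impute import SimpleImputer

# Mean imputation for a numeric column, mode imputation for a categorical one
num_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')

df[['Loan_amount']] = num_imputer.fit_transform(df[['Loan_amount']])
df[['Is_graduate']] = cat_imputer.fit_transform(df[['Is_graduate']])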
Two types of variables are required for building machine learning models: the dependent (or target) variable and the independent (or predictor) variables. The unique identifier, 'UID', belongs to neither type, so we remove it from the data in the first line of code below. The second line prints the shape of the resulting dataframe, which now contains 590 observations of 11 variables.
df = df.drop(['UID'], axis=1)
print(df.shape)
Output:
(590, 11)
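When you move on to modeling, the target and predictors are typically separated into two objects. The sketch below assumes 'approval_status' is the target, which matches the 0/1 column in this dataset; it is shown for illustration only, and the rest of the guide keeps working with df.

# Separate the target (dependent) variable from the predictors
y = df['approval_status']
X = df.drop('approval_status', axis=1)
print(X.shape, y.shape)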
Some of the variables in the dataset, such as 'Sex', contain text labels ('M' and 'F'). For modeling with scikit-learn, all variables should be numeric, so we convert them using a technique called dummy encoding. To ensure the encoded columns do not duplicate information, one level of each variable is dropped by passing the argument drop_first=True to pd.get_dummies(), as done in the first five lines of code below. The sixth line prints the information about the data, which shows that it now has 15 variables.
df = pd.get_dummies(df, columns=['Sex'], drop_first=True, prefix='S')
df = pd.get_dummies(df, columns=['Purpose'], drop_first=True, prefix='Pur')
df = pd.get_dummies(df, columns=['Marital_status'], drop_first=True, prefix='M')
df = pd.get_dummies(df, columns=['Is_graduate'], drop_first=True, prefix='Ed')
df = pd.get_dummies(df, columns=['Credit_score'], drop_first=True, prefix='CSc')
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 590 entries, 1 to 599
Data columns (total 15 columns):
Dependents          590 non-null float64
Income              590 non-null int64
Loan_amount         590 non-null float64
Term_months         590 non-null float64
approval_status     590 non-null int64
Age                 590 non-null float64
S_F                 590 non-null uint8
Pur_Education       590 non-null uint8
Pur_Furniture       590 non-null uint8
Pur_Personal        590 non-null uint8
Pur_Travel          590 non-null uint8
Pur_Wedding         590 non-null uint8
M_Yes               590 non-null uint8
Ed_Yes              590 non-null uint8
CSc_Satisfactory    590 non-null uint8
dtypes: category(1), float64(4), int64(2), uint8(8)
memory usage: 57.5 KB
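As a side note, the five calls above can be collapsed into a single pd.get_dummies() call if you are happy with the default prefixes (the original column names). A sketch, assuming the pre-encoding dataframe:

# Encode all categorical columns in one pass
cat_cols = ['Sex', 'Purpose', 'Marital_status', 'Is_graduate', 'Credit_score']
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)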
One of the biggest problems in machine learning is the presence of outliers: extreme values that differ markedly from the other data points. For numerical variables, skewness is a useful indicator of outliers. The two lines of code below print the skewness value and the summary statistics for the 'Income' variable.
print(df['Income'].skew())
df['Income'].describe()
Output:
5.31688331023

count    5.900000e+02
mean     7.112837e+05
std      7.157367e+05
min      3.000000e+04
25%      3.861750e+05
50%      5.099500e+05
75%      7.726500e+05
max      8.444900e+06
Name: Income, dtype: float64
The output above shows that 'Income' has a right-skewed distribution, with a skewness value of 5.3; ideally, the skewness value should lie between -1 and 1. There are many techniques for handling outliers, one of which is quantile-based flooring and capping: values below a lower quantile are raised to that quantile (flooring), and values above an upper quantile are lowered to it (capping).
The two lines of code below print the 10th and 90th percentiles of 'Income', which will be used as the flooring and capping thresholds.
print(df['Income'].quantile(0.10))
print(df['Income'].quantile(0.90))
Output:
303010.0
1250010.0
Now we perform the flooring and capping in the first two lines of code below. The third line prints the skewness value again, which is now much closer to the acceptable range.
1df["Income"] = np.where(df["Income"] <303010.0, 303010.0,df['Income'])
2df["Income"] = np.where(df["Income"] >1250010.0, 1250010.0,df['Income'])
3print(df['Income'].skew())
Output:
1.00971894475
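If you need to apply the same treatment to several numerical variables, it can be wrapped in a small helper built on quantile() and clip(). This is a sketch under the assumption that the 10th and 90th percentiles are sensible defaults; adjust them to your data.

def floor_and_cap(series, lower_q=0.10, upper_q=0.90):
    """Floor values below the lower quantile and cap values above the upper quantile."""
    lower, upper = series.quantile(lower_q), series.quantile(upper_q)
    return series.clip(lower=lower, upper=upper)

# Example: the same treatment applied to 'Loan_amount'
df['Loan_amount'] = floor_and_cap(df['Loan_amount'])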
In this guide, you have learned about the most common problems encountered in real-world data and the techniques for cleaning such messy data. The use of these techniques varies with the data, the problem statement, and the machine learning algorithm selected for building the model.
To learn more about data preparation and building machine learning models using Python's 'scikit-learn' library, please refer to the following guides: