
Cleaning Data Using Python

Deepika Singh

  • Sep 16, 2019
  • 13 Min read
  • 2,958 Views

Introduction

Cleaning data is a critical component of data science and predictive modeling. Even the best of machine learning algorithms will fail if the data is not clean. In this guide, you will learn about the techniques required to perform the most widely used data cleaning tasks in Python.

Data

In this guide, we will use fictitious data on loan applicants, containing 600 observations of 12 variables, as described below:

  1. UID - Unique identifier of an applicant.

  2. Marital_status - Whether the applicant is married ("Yes") or not ("No").

  3. Dependents - Number of dependents of the applicant.

  4. Is_graduate - Whether the applicant is a graduate ("Yes") or not ("No").

  5. Income - Annual Income of the applicant (in US dollars).

  6. Loan_amount - Loan amount (in US dollars) for which the application was submitted.

  7. Term_months - Tenure of the loan (in months).

  8. Credit_score - Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").

  9. Age - The applicant’s age in years.

  10. Sex - Whether the applicant is female (F) or male (M).

  11. Purpose - Purpose of applying for the loan.

  12. approval_status - Whether the loan application was approved ("1") or not ("0"). This is the dependent variable.

Let's start by loading the required libraries and the data.

# Import required libraries
import pandas as pd
import numpy as np

# Reading the data
df = pd.read_csv("data_cle.csv")
print(df.shape)
print(df.info())
python

Output:

(600, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 12 columns):
UID                600 non-null object
Marital_status     600 non-null object
Dependents         593 non-null float64
Is_graduate        594 non-null object
Income             600 non-null int64
Loan_amount        593 non-null float64
Term_months        594 non-null float64
Credit_score       600 non-null object
approval_status    600 non-null int64
Age                598 non-null float64
Sex                600 non-null object
Purpose            600 non-null object
dtypes: float64(4), int64(2), object(6)
memory usage: 56.3+ KB
None

The above output shows that there are 600 observations of 12 variables. Some variables have fewer than 600 non-null records, indicating the presence of missing values.

Dropping Duplicates

One of the most common errors in real-world data is the presence of duplicate records. Such records are of no use and must be removed. In our dataset, 'UID' is the unique identifier variable and will be used to drop the duplicate records. The first line of code below uses the drop_duplicates() function to find and drop the duplicates, keeping the first occurrence of each 'UID'. The second line prints the shape of the data - 597 observations of 12 variables - showing that three duplicate records have been removed.

df = df.drop_duplicates('UID', keep='first')
print(df.shape)
python

Output:

(597, 12)
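As a side note, before dropping you could inspect which 'UID' values are duplicated with the duplicated() function. A minimal optional check (a sketch, meant to be run before the drop_duplicates() call above):

# Flag every row whose 'UID' appears more than once (keep=False marks all copies)
dupes = df[df.duplicated('UID', keep=False)]
print(dupes.sort_values('UID'))
python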

Handling Incorrect Records

Let us look at the data using the line of code below. The output shows that the minimum value of the variable 'Age' is -12. This is impossible and brings us to the next common problem in real-world datasets: the presence of incorrect records.

df.describe()
python

Output:

|       | Dependents | Income       | Loan_amount  | Term_months | approval_status | Age        |
|-------|------------|--------------|--------------|-------------|-----------------|------------|
| count | 590.000000 | 5.970000e+02 | 5.900000e+02 | 591.000000  | 597.000000      | 595.000000 |
| mean  | 0.755932   | 7.073414e+05 | 3.287864e+05 | 366.111675  | 0.681742        | 49.030252  |
| std   | 1.029987   | 7.125482e+05 | 7.293830e+05 | 59.970906   | 0.466191        | 15.698842  |
| min   | 0.000000   | 3.000000e+04 | 1.860000e+04 | 60.000000   | 0.000000        | -12.000000 |
| 25%   | 0.000000   | 3.852000e+05 | 6.150000e+04 | 384.000000  | 0.000000        | 36.500000  |
| 50%   | 0.000000   | 5.084000e+05 | 7.650000e+04 | 384.000000  | 1.000000        | 51.000000  |
| 75%   | 1.000000   | 7.661000e+05 | 1.346250e+05 | 384.000000  | 1.000000        | 61.000000  |
| max   | 6.000000   | 8.444900e+06 | 7.780000e+06 | 504.000000  | 1.000000        | 76.000000  |

There are many techniques to handle such incorrect entries. One approach is to delete the records, but instead we will replace them with the value of 20, on the assumption that the minimum age of a loan applicant is twenty. The code below loops over the rows, locates the negative values in the 'Age' variable, and replaces them with twenty. The describe() call then prints the summary statistics for the 'Age' variable, confirming that the correction has been made.

# Replace impossible negative ages with the assumed minimum of 20
for i in range(df.shape[0]):
    if df['Age'].iloc[i] < 0.0:
        df.loc[df.index[i], 'Age'] = 20.0

df['Age'].describe()
python

Output:

count    595.000000
mean      49.373109
std       14.703877
min       20.000000
25%       36.500000
50%       51.000000
75%       61.000000
max       76.000000
Name: Age, dtype: float64
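The same correction can also be written without an explicit loop, using boolean indexing. A brief sketch of the equivalent vectorized form (the threshold and replacement value are the same assumptions as above):

# Vectorized equivalent: set all negative ages to 20 in one assignment
df.loc[df['Age'] < 0, 'Age'] = 20.0
python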

Missing Values

Missing value treatment is the most common data cleaning step in a data science project. Chaining isnull() with sum() gives the number of missing values in each variable.

df.isnull().sum()
python

Output:

UID                0
Marital_status     0
Dependents         7
Is_graduate        6
Income             0
Loan_amount        7
Term_months        6
Credit_score       0
approval_status    0
Age                2
Sex                0
Purpose            0
dtype: int64

The output above shows missing values in several variables, most of which are numerical; the exception is 'Is_graduate', which is categorical. For categorical variables, it is important to understand the frequency distribution, which is computed in the line of code below. The output shows that the majority of the applicants were graduates, identified with the label 'Yes'.

df['Is_graduate'].value_counts()
python

Output:

Yes    463
No     128
Name: Is_graduate, dtype: int64
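If proportions are easier to read than raw counts, value_counts() also accepts normalize=True. A small optional check:

# Share of graduates vs. non-graduates among the applicants
df['Is_graduate'].value_counts(normalize=True)
python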

We are now ready for missing value imputation. The most widely used technique is to replace the missing values with a measure of central tendency - mean, median, or mode. In the code below, the first three fillna() calls impute the continuous variables with their mean, the next fills the categorical variable 'Is_graduate' with its most common label, 'Yes', and the dropna() call then deletes any rows that still contain missing values.

# Missing value treatment for continuous variables
df['Loan_amount'].fillna(df['Loan_amount'].mean(), inplace=True)
df['Term_months'].fillna(df['Term_months'].mean(), inplace=True)
df['Age'].fillna(df['Age'].mean(), inplace=True)

# Missing value treatment for the categorical variable
df['Is_graduate'].fillna('Yes', inplace=True)

# Deleting the rows that still have missing values
df = df.dropna()

# Checking the missing values again
df.isnull().sum()
python

Output:

UID                0
Marital_status     0
Dependents         0
Is_graduate        0
Income             0
Loan_amount        0
Term_months        0
Credit_score       0
approval_status    0
Age                0
Sex                0
Purpose            0
dtype: int64
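Mean imputation is only one option; the same fillna() pattern works with the median (more robust for skewed variables) or with the mode for categorical columns. A hedged sketch of those alternatives, which would replace the corresponding calls above rather than run in addition to them (the choice of statistic is a modeling decision, not part of the original workflow):

# Median imputation for a skewed continuous variable
df['Loan_amount'].fillna(df['Loan_amount'].median(), inplace=True)

# Mode imputation for a categorical variable (mode() returns a Series, so take the first value)
df['Is_graduate'].fillna(df['Is_graduate'].mode()[0], inplace=True)
python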

Dropping Irrelevant Columns

There are two types of variables required for building machine learning models: the dependent (or target) variable and the independent (or predictor) variables. The unique identifier, 'UID', is neither, so we remove it from the data in the first line of code below. The second line prints the shape of the resulting dataframe, which now contains 590 observations of 11 variables.

df = df.drop(['UID'], axis=1)
print(df.shape)
python

Output:

(590, 11)

Dummy Encoding

Some of the variables in the dataset, such as 'Sex', contain text labels ('M' and 'F'). For modeling with scikit-learn, all the variables should be numeric, so we will convert them using a technique called dummy encoding. In this technique, the features are encoded so that there is no duplication of information, which is achieved by passing the argument drop_first=True to the pd.get_dummies() function, as is done in the first five lines of code below. The sixth line prints information about the data, which shows that the dataset now has 15 variables.

df = pd.get_dummies(df, columns=['Sex'], drop_first=True, prefix='S')
df = pd.get_dummies(df, columns=['Purpose'], drop_first=True, prefix='Pur')
df = pd.get_dummies(df, columns=['Marital_status'], drop_first=True, prefix='M')
df = pd.get_dummies(df, columns=['Is_graduate'], drop_first=True, prefix='Ed')
df = pd.get_dummies(df, columns=['Credit_score'], drop_first=True, prefix='CSc')
df.info()
python

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 590 entries, 1 to 599
Data columns (total 15 columns):
Dependents          590 non-null float64
Income              590 non-null int64
Loan_amount         590 non-null float64
Term_months         590 non-null float64
approval_status     590 non-null int64
Age                 590 non-null float64
S_F                 590 non-null uint8
Pur_Education       590 non-null uint8
Pur_Furniture       590 non-null uint8
Pur_Personal        590 non-null uint8
Pur_Travel          590 non-null uint8
Pur_Wedding         590 non-null uint8
M_Yes               590 non-null uint8
Ed_Yes              590 non-null uint8
CSc_Satisfactory    590 non-null uint8
dtypes: category(1), float64(4), int64(2), uint8(8)
memory usage: 57.5 KB
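For reference, the five separate calls above can usually be collapsed into a single pd.get_dummies() call by passing all the categorical columns at once; the sketch below would replace them, not run in addition to them (the prefix mapping simply mirrors the prefixes used above):

# Encode all categorical columns in one call, dropping the first level of each
df = pd.get_dummies(
    df,
    columns=['Sex', 'Purpose', 'Marital_status', 'Is_graduate', 'Credit_score'],
    prefix={'Sex': 'S', 'Purpose': 'Pur', 'Marital_status': 'M',
            'Is_graduate': 'Ed', 'Credit_score': 'CSc'},
    drop_first=True,
)
python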

Outlier Treatment

One of the biggest problems in machine learning is the presence of outliers, which are extreme values that differ markedly from the other data points. For numerical variables, we can identify outliers through the skewness value. The two lines of code below print the skewness value along with the summary statistics for the 'Income' variable.

print(df['Income'].skew())
df['Income'].describe()
python

Output:

5.31688331023

count    5.900000e+02
mean     7.112837e+05
std      7.157367e+05
min      3.000000e+04
25%      3.861750e+05
50%      5.099500e+05
75%      7.726500e+05
max      8.444900e+06
Name: Income, dtype: float64

The output above shows that the variable 'Income' has a right-skewed distribution with a skewness value of 5.3. Ideally, the skewness value should be between -1 and 1. There are many techniques for handling outliers, one of which is quantile-based flooring and capping: flooring the lower values and capping the higher ones.

The two lines of code below print the 10th and 90th percentiles of the variable 'Income', which will be used for flooring and capping.

print(df['Income'].quantile(0.10))
print(df['Income'].quantile(0.90))
python

Output:

303010.0
1250010.0

Now we will apply the flooring and capping, as done in the lines of code below. The skewness value is now much better.

1df["Income"] = np.where(df["Income"] <303010.0, 303010.0,df['Income'])
2df["Income"] = np.where(df["Income"] >1250010.0, 1250010.0,df['Income'])
3print(df['Income'].skew())
python

Output:

1.00971894475
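As an aside, the same flooring and capping can be expressed with the clip() method. A brief sketch, assuming it is applied to the original, uncapped 'Income' column in place of the two np.where() calls above:

# Floor at the 10th percentile and cap at the 90th percentile in one step
lower, upper = df['Income'].quantile(0.10), df['Income'].quantile(0.90)
df['Income'] = df['Income'].clip(lower=lower, upper=upper)
print(df['Income'].skew())
python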

Conclusion

In this guide, you have learned about the most common problems faced while dealing with real-world data and the techniques for cleaning such messy data. The use of these techniques varies with the data, the problem statement, and the machine learning algorithm selected for building the model.

To learn more about data preparation and building machine learning models using Python's 'scikit-learn' library, please refer to the following guides: