Data preparation is often said to consume as much as eighty percent of a data scientist's time on a project, which underlines its importance in the machine learning life cycle.
In this guide, you will learn the basics and implementation of several data preparation techniques:
Dealing with Missing Values
Identifying and Treating Outliers
Data Normalization and Transformation
Converting the Column Types
We will be using fictitious data on loan applicants, which contains 600 observations and 10 variables. The target variable is described below:
approval_status - Whether the loan application was approved ("1") or not ("0"). This is the dependent variable.
Let's start by loading the required libraries and modules.
1# Import required libraries
2import pandas as pd
3import numpy as np
4import matplotlib.pyplot as plt
5import seaborn as sns
6%matplotlib inline
7
8# Import necessary modules
9from sklearn.linear_model import LogisticRegression
10from sklearn.model_selection import train_test_split
11from sklearn.metrics import confusion_matrix, classification_report
The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 600 observations of 10 variables. The third line gives the summary statistics of the numerical variables.
1# Load data
2dat2 = pd.read_csv("data_prep.csv")
3print(dat2.shape)
4dat2.describe()
Output:
1(600, 10)
2
3| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age |
4|------- |---------------- |------------ |------------- |--------------- |------------- |------------- |-------------- |----------------- |------------ |
5| count | 600.000000 | 598.000000 | 599.000000 | 600.000000 | 600.000000 | 600.00000 | 600.000000 | 600.000000 | 600.000000 |
6| mean | 0.651667 | 0.730769 | 2.449082 | 7210.720000 | 161.571667 | 367.10000 | 0.788333 | 0.686667 | 51.766667 |
7| std | 0.476840 | 0.997194 | 40.788143 | 8224.445086 | 93.467598 | 63.40892 | 0.408831 | 0.464236 | 21.240704 |
8| min | 0.000000 | 0.000000 | 0.000000 | 200.000000 | 10.000000 | 36.00000 | 0.000000 | 0.000000 | 0.000000 |
9| 25% | 0.000000 | 0.000000 | 1.000000 | 3832.500000 | 111.000000 | 384.00000 | 1.000000 | 0.000000 | 36.000000 |
10| 50% | 1.000000 | 0.000000 | 1.000000 | 5075.000000 | 140.000000 | 384.00000 | 1.000000 | 1.000000 | 51.000000 |
11| 75% | 1.000000 | 1.000000 | 1.000000 | 7641.500000 | 180.500000 | 384.00000 | 1.000000 | 1.000000 | 64.000000 |
12| max | 1.000000 | 3.000000 | 999.000000 | 108000.000000 | 778.000000 | 504.00000 | 1.000000 | 1.000000 | 200.000000 |
13
The above output shows that the variable 'Age' has minimum and maximum values of 0 and 200, respectively. Also, the variable 'Is_graduate' has a maximum value of 999 instead of the binary values '0' and '1'. These entries are incorrect and need correction. One approach would be to delete these records, but instead we will treat them as missing values and replace them with a measure of central tendency, i.e., the mean, median, or mode.
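For comparison, the deletion approach mentioned above would simply filter out the offending rows. The minimal sketch below illustrates that alternative; it is not used in this guide, and 'dat2_dropped' is just an illustrative name.
1# Alternative (not used here): drop rows with invalid 'Age' or 'Is_graduate' values
2valid_age = dat2['Age'].between(1, 199)        # excludes the invalid 0 and 200 entries
3valid_grad = dat2['Is_graduate'].isin([0, 1])  # keeps only the binary education flags
4dat2_dropped = dat2[valid_age & valid_grad]
5print(dat2_dropped.shape)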
Starting with the 'Age' variable, the first two lines of code below replace the incorrect values '0' and '200' with 'NaN', an indicator of missing values. We repeat the same process for the variable 'Is_graduate' in the third line of code. The fourth line prints the information about the variables.
1dat2.Age.replace(0, np.nan, inplace=True)
2dat2.Age.replace(200, np.nan, inplace=True)
3dat2.Is_graduate.replace(999, np.nan, inplace=True)
4dat2.info()
Output:
1 <class 'pandas.core.frame.DataFrame'>
2 RangeIndex: 600 entries, 0 to 599
3 Data columns (total 10 columns):
4 Marital_status 600 non-null int64
5 Dependents 598 non-null float64
6 Is_graduate 598 non-null float64
7 Income 600 non-null int64
8 Loan_amount 600 non-null int64
9 Term_months 600 non-null int64
10 Credit_score 600 non-null int64
11 approval_status 600 non-null int64
12 Age 594 non-null float64
13 Sex 595 non-null object
14 dtypes: float64(3), int64(6), object(1)
15 memory usage: 47.0+ KB
Now the variables 'Age' and 'Is_graduate' have 594 and 598 non-null records, respectively. The remaining entries have been marked as missing values, which we will handle in the next section.
There are various techniques for handling missing values; the most widely used is replacing them with a measure of central tendency. The first line of code below replaces the missing values of the 'Age' variable with the mean of the remaining values. The second line replaces the missing values of the 'Is_graduate' variable with '1', the most frequent value, which indicates that the applicant is a graduate. The third line gives the summary statistics of the variables.
1dat2['Age'].fillna(dat2['Age'].mean(), inplace=True)
2dat2['Is_graduate'].fillna(1,inplace=True)
3dat2.describe()
Output:
1| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age |
2|------- |---------------- |------------ |------------- |--------------- |------------- |------------- |-------------- |----------------- |------------ |
3| count | 600.000000 | 598.000000 | 600.000000 | 600.000000 | 600.000000 | 600.00000 | 600.000000 | 600.000000 | 600.000000 |
4| mean | 0.651667 | 0.730769 | 0.783333 | 7210.720000 | 161.571667 | 367.10000 | 0.788333 | 0.686667 | 50.606061 |
5| std | 0.476840 | 0.997194 | 0.412317 | 8224.445086 | 93.467598 | 63.40892 | 0.408831 | 0.464236 | 16.184651 |
6| min | 0.000000 | 0.000000 | 0.000000 | 200.000000 | 10.000000 | 36.00000 | 0.000000 | 0.000000 | 22.000000 |
7| 25% | 0.000000 | 0.000000 | 1.000000 | 3832.500000 | 111.000000 | 384.00000 | 1.000000 | 0.000000 | 36.000000 |
8| 50% | 1.000000 | 0.000000 | 1.000000 | 5075.000000 | 140.000000 | 384.00000 | 1.000000 | 1.000000 | 50.606061 |
9| 75% | 1.000000 | 1.000000 | 1.000000 | 7641.500000 | 180.500000 | 384.00000 | 1.000000 | 1.000000 | 64.000000 |
10| max | 1.000000 | 3.000000 | 1.000000 | 108000.000000 | 778.000000 | 504.00000 | 1.000000 | 1.000000 | 80.000000 |
11
Both variables have now been corrected. The data also contains a categorical variable, 'Sex', with five missing values. Since it is categorical, we will check the distribution of its labels, which is done in the line of code below.
1dat2['Sex'].value_counts()
Output:
1 M 484
2 F 111
3 Name: Sex, dtype: int64
The output shows that 484 out of 595 applicants are male, so we will replace the missing values with label 'M'. The first line of code below performs this task, while the second line prints the distribution of the variable. The output shows 600 records for the 'Sex' variable, which means the missing values have been accounted for.
1dat2['Sex'].fillna('M',inplace=True)
2dat2['Sex'].value_counts()
Output:
1 M 489
2 F 111
3 Name: Sex, dtype: int64
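As an aside, the most frequent label does not have to be hard-coded; it can be derived with pandas' mode(). A minimal sketch of that equivalent variant ('most_frequent' is just an illustrative name) is shown below.
1# Equivalent approach: fill with the most frequent label instead of hard-coding 'M'
2most_frequent = dat2['Sex'].mode()[0]
3dat2['Sex'] = dat2['Sex'].fillna(most_frequent)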
We will now check if any more variables have missing values, which is done in the line of code below. The output shows that we still have two missing values in the variable 'Dependents'.
1dat2.isnull().sum()
Output:
1 Marital_status 0
2 Dependents 2
3 Is_graduate 0
4 Income 0
5 Loan_amount 0
6 Term_months 0
7 Credit_score 0
8 approval_status 0
9 Age 0
10 Sex 0
11 dtype: int64
Since only two missing values remain in the dataset, we will use this as an opportunity to learn another approach: dropping records with missing values. The first line of code below uses the 'dropna()' function to drop all rows that contain any missing values, while the second line prints the updated information about the dataset.
1dat2 = dat2.dropna()
2dat2.info()
Output:
1 <class 'pandas.core.frame.DataFrame'>
2 Int64Index: 598 entries, 0 to 599
3 Data columns (total 10 columns):
4 Marital_status 598 non-null int64
5 Dependents 598 non-null float64
6 Is_graduate 598 non-null float64
7 Income 598 non-null int64
8 Loan_amount 598 non-null int64
9 Term_months 598 non-null int64
10 Credit_score 598 non-null int64
11 approval_status 598 non-null int64
12 Age 598 non-null float64
13 Sex 598 non-null object
14 dtypes: float64(3), int64(6), object(1)
15 memory usage: 51.4+ KB
The missing values have now been treated, but the labels in the variable 'Sex' are letters ('M' and 'F'). For modeling with scikit-learn, all the variables should be numeric, so we will have to encode these labels. Since there are only two labels, we can use binary encoding, which is done in the first line of code below. The output from the second line shows that the encoding was performed successfully.
1dat2["Sex"] = dat2["Sex"].map({"M": 0, "F":1})
2dat2['Sex'].value_counts()
Output:
1 0 487
2 1 111
3 Name: Sex, dtype: int64
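One caveat with this approach: map() returns NaN for any value that is not a key in the mapping dictionary, so it is worth confirming that the encoding did not introduce new missing values. A quick check is sketched below.
1# map() turns unmapped values into NaN; verify no new missing values appeared
2print(dat2['Sex'].isnull().sum())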
One of the biggest obstacles in predictive modeling is the presence of outliers, extreme values that differ markedly from the other data points. Outliers are a problem because they can mislead the training process and result in inaccurate models.
For numerical variables, we can identify outliers visually through a histogram or numerically through the skewness value. The two lines of code below plot the histogram along with the skewness value for the 'Income' variable.
1plot1 = sns.distplot(dat2["Income"], color="b", label="Skewness : %.1f"%(dat2["Income"].skew()))
2plot1 = plot1.legend(loc="best")
Output: [histogram of 'Income' with the skewness value shown in the legend]
The histogram shows that the variable 'Income' has a right-skewed distribution, with a skewness value of 6.5. Ideally, the skewness value should be between -1 and 1.
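The same numeric check can be applied to every column at once; a minimal sketch (the exact values will depend on the data) is shown below.
1# Skewness of all numeric columns; values far outside [-1, 1] suggest skewed distributions
2print(dat2.select_dtypes(include=[np.number]).skew())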
Apart from 'Income', other variables ('Loan_amount' and 'Age') are on noticeably different scales and will require normalization. We will learn a couple of techniques in the subsequent sections to deal with these preprocessing problems.
The previous chart showed that the variable 'Income' is skewed. One way to make its distribution closer to normal is a logarithmic transformation. The first line of code below creates a new variable, 'LogIncome', while the second and third lines plot the histogram and skewness value of this new variable.
1dat2["LogIncome"] = dat2["Income"].map(lambda i: np.log(i) if i > 0 else 0)
2plot2 = sns.distplot(dat2["LogIncome"], color="m", label="Skewness : %.1f"%(dat2["LogIncome"].skew()))
3plot2 = plot2.legend(loc="best")
Output: [histogram of 'LogIncome' with the skewness value shown in the legend]
The above chart shows that taking the log of the 'Income' variable makes the distribution roughly normal and greatly reduces the skewness. We could use the same transformation for other numerical variables, but instead we will learn another technique, called standardization.
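If the same idea were applied to another positive-valued column, numpy's log1p (which maps 0 to 0 without a special case) is a convenient choice. The sketch below is purely illustrative and not part of the pipeline used in this guide; the column name 'LogLoan_amount' is hypothetical.
1# Illustrative only: the same transformation applied to another positive-valued column
2dat2['LogLoan_amount'] = np.log1p(dat2['Loan_amount'])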
Several machine learning algorithms use some form of distance metric to learn from the data. However, when features are on different scales, such as 'Age' in years and 'Income' in hundreds of dollars, the features on the larger scale can unduly influence the model. As a result, we want the features on a similar scale, which can be achieved through scaling techniques.
One such technique is standardization, in which all the features are centered around zero and have, roughly, unit variance. The first line of code below imports 'StandardScaler' from the 'sklearn.preprocessing' module. The second line standardizes the three variables 'Income', 'Loan_amount', and 'Age'. Finally, the third line prints the variance of the scaled variables.
1from sklearn.preprocessing import StandardScaler
2dat2[['Income','Loan_amount', 'Age']] = StandardScaler().fit_transform(dat2[['Income','Loan_amount', 'Age']])
3print(dat2['Income'].var()); print(dat2['Loan_amount'].var()); print(dat2['Age'].var())
Output:
11.0016750418760463
21.0016750418760472
31.001675041876044
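The printed values are close to, but not exactly, 1. StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' var() reports the sample variance (ddof=1), so with 598 rows the result is 598/597 ≈ 1.0017. Passing ddof=0 to var() confirms the scaling, as sketched below.
1# StandardScaler uses the population variance; ddof=0 makes var() match it exactly
2print(dat2['Income'].var(ddof=0))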
All three standardized variables are now on the same scale. Let us now look at the variables after all the preprocessing done so far.
1print(dat2.info())
Output:
1 <class 'pandas.core.frame.DataFrame'>
2 Int64Index: 598 entries, 0 to 599
3 Data columns (total 11 columns):
4 Marital_status 598 non-null int64
5 Dependents 598 non-null float64
6 Is_graduate 598 non-null float64
7 Income 598 non-null float64
8 Loan_amount 598 non-null float64
9 Term_months 598 non-null int64
10 Credit_score 598 non-null int64
11 approval_status 598 non-null int64
12 Age 598 non-null float64
13 Sex 598 non-null int64
14 LogIncome 598 non-null float64
15 dtypes: float64(6), int64(5)
16 memory usage: 56.1 KB
17 None
The two variables 'Dependents' and 'Is_graduate' are stored as 'float64', i.e., as numbers with a decimal part; this is a side effect of the earlier missing-value handling, since the standard integer dtype cannot hold NaN values. It is not what we want, as both variables take integer values. For carrying out mathematical operations on the variables during the modeling process, it is important that they have the correct data types.
The first two lines of code below convert these variables to the integer data type, while the third line prints the data types of all the variables.
1dat2["Dependents"] = dat2["Dependents"].astype("int")
2dat2["Is_graduate"] = dat2["Is_graduate"].astype("int")
3print(dat2.dtypes)
Output:
1 Marital_status int64
2 Dependents int32
3 Is_graduate int32
4 Income float64
5 Loan_amount float64
6 Term_months int64
7 Credit_score int64
8 approval_status int64
9 Age float64
10 Sex int64
11 LogIncome float64
12 dtype: object
The data types of the variables 'Dependents' and 'Is_graduate' have been corrected. We also created an additional variable, 'LogIncome', to demonstrate the logarithmic transformation; the same transformation could have been applied to the 'Income' variable directly, without creating a new one.
All the variables now appear to be in the right form, and we could proceed to build models to predict the 'approval_status' of the loan applications. However, modeling is not within the scope of this guide; you can learn about it in the other Pluralsight guides on scikit-learn linked at the end.
In this guide, you have learned about the fundamental techniques of data preprocessing for machine learning. You learned about dealing with missing values, identifying and treating outliers, normalizing and transforming data, and converting the data types.
To learn more about building machine learning models using scikit-learn, please refer to the following guides:
To learn more about building deep learning models using Keras, please refer to the following guides: