Data preparation is often said to consume as much as eighty percent of a data scientist's time on a project, which underlines its importance in the machine learning life cycle.
In this guide, you will learn the basics and implementation of several data preparation techniques:
Dealing with Missing Values
Identifying and Treating Outliers
Data Normalization and Transformation
Converting the Column Types
We will be using fictitious data on loan applicants, which contains 600 observations and 10 variables. The target variable is described below:
approval_status - Whether the loan application was approved ("1") or not ("0"). This is the dependent variable.
Let's start by loading the required libraries and modules.
1# Import required libraries
2import pandas as pd
3import numpy as np
4import matplotlib.pyplot as plt
5import seaborn as sns
6%matplotlib inline
7
8# Import necessary modules
9from sklearn.linear_model import LogisticRegression
10from sklearn.model_selection import train_test_split
11from sklearn.metrics import confusion_matrix, classification_report
The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 600 observations of 10 variables. The third line gives the summary statistics of the numerical variables.
1# Load data
2dat2 = pd.read_csv("data_prep.csv")
3print(dat2.shape)
4dat2.describe()
Output:
1(600, 10)
2
3| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age |
4|------- |---------------- |------------ |------------- |--------------- |------------- |------------- |-------------- |----------------- |------------ |
5| count | 600.000000 | 598.000000 | 599.000000 | 600.000000 | 600.000000 | 600.00000 | 600.000000 | 600.000000 | 600.000000 |
6| mean | 0.651667 | 0.730769 | 2.449082 | 7210.720000 | 161.571667 | 367.10000 | 0.788333 | 0.686667 | 51.766667 |
7| std | 0.476840 | 0.997194 | 40.788143 | 8224.445086 | 93.467598 | 63.40892 | 0.408831 | 0.464236 | 21.240704 |
8| min | 0.000000 | 0.000000 | 0.000000 | 200.000000 | 10.000000 | 36.00000 | 0.000000 | 0.000000 | 0.000000 |
9| 25% | 0.000000 | 0.000000 | 1.000000 | 3832.500000 | 111.000000 | 384.00000 | 1.000000 | 0.000000 | 36.000000 |
10| 50% | 1.000000 | 0.000000 | 1.000000 | 5075.000000 | 140.000000 | 384.00000 | 1.000000 | 1.000000 | 51.000000 |
11| 75% | 1.000000 | 1.000000 | 1.000000 | 7641.500000 | 180.500000 | 384.00000 | 1.000000 | 1.000000 | 64.000000 |
12| max | 1.000000 | 3.000000 | 999.000000 | 108000.000000 | 778.000000 | 504.00000 | 1.000000 | 1.000000 | 200.000000 |
13
The above output shows that the variable 'Age' has minimum and maximum values of 0 and 200, respectively. Also, the variable 'Is_graduate' has a maximum value of 999 instead of the binary values '0' and '1'. These entries are incorrect and need correction. One approach would be to delete these records, but instead we will treat them as missing values and replace them with a measure of central tendency, i.e., the mean, median, or mode.
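For comparison, the deletion approach mentioned above would simply filter out the offending rows. The minimal sketch below illustrates that alternative; it is not used in this guide, and 'dat2_dropped' is just an illustrative name.
1# Alternative (not used here): drop rows with invalid 'Age' or 'Is_graduate' values
2valid_age = dat2['Age'].between(1, 199)        # excludes the invalid 0 and 200 entries
3valid_grad = dat2['Is_graduate'].isin([0, 1])  # keeps only the binary education flags
4dat2_dropped = dat2[valid_age & valid_grad]
5print(dat2_dropped.shape)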
Starting with the 'Age' variable, the first two lines of code below replace the incorrect values '0' and '200' with 'NaN', an indicator of missing values. We repeat the same process for the variable 'Is_graduate' in the third line of code. The fourth line prints the information about the variables.
1dat2.Age.replace(0, np.nan, inplace=True)
2dat2.Age.replace(200, np.nan, inplace=True)
3dat2.Is_graduate.replace(999, np.nan, inplace=True)
4dat2.info()
Output:
1 <class 'pandas.core.frame.DataFrame'>
2 RangeIndex: 600 entries, 0 to 599
3 Data columns (total 10 columns):
4 Marital_status 600 non-null int64
5 Dependents 598 non-null float64
6 Is_graduate 598 non-null float64
7 Income 600 non-null int64
8 Loan_amount 600 non-null int64
9 Term_months 600 non-null int64
10 Credit_score 600 non-null int64
11 approval_status 600 non-null int64
12 Age 594 non-null float64
13 Sex 595 non-null object
14 dtypes: float64(3), int64(6), object(1)
15 memory usage: 47.0+ KB
Now the variables 'Age' and 'Is_graduate' have 594 and 598 non-null records, respectively. The remaining entries have been marked as missing values, which we will handle in the next section.
There are various techniques for handling missing values; the most widely used is replacing them with a measure of central tendency. The first line of code below replaces the missing values of the 'Age' variable with the mean of the remaining values. The second line replaces the missing values of the 'Is_graduate' variable with '1', the most frequent value, which indicates that the applicant is a graduate. The third line gives the summary statistics of the variables.
1dat2['Age'].fillna(dat2['Age'].mean(), inplace=True)
2dat2['Is_graduate'].fillna(1,inplace=True)
3dat2.describe()
Output:
1| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age |
2|------- |---------------- |------------ |------------- |--------------- |------------- |------------- |-------------- |----------------- |------------ |
3| count | 600.000000 | 598.000000 | 600.000000 | 600.000000 | 600.000000 | 600.00000 | 600.000000 | 600.000000 | 600.000000 |
4| mean | 0.651667 | 0.730769 | 0.783333 | 7210.720000 | 161.571667 | 367.10000 | 0.788333 | 0.686667 | 50.606061 |
5| std | 0.476840 | 0.997194 | 0.412317 | 8224.445086 | 93.467598 | 63.40892 | 0.408831 | 0.464236 | 16.184651 |
6| min | 0.000000 | 0.000000 | 0.000000 | 200.000000 | 10.000000 | 36.00000 | 0.000000 | 0.000000 | 22.000000 |
7| 25% | 0.000000 | 0.000000 | 1.000000 | 3832.500000 | 111.000000 | 384.00000 | 1.000000 | 0.000000 | 36.000000 |
8| 50% | 1.000000 | 0.000000 | 1.000000 | 5075.000000 | 140.000000 | 384.00000 | 1.000000 | 1.000000 | 50.606061 |
9| 75% | 1.000000 | 1.000000 | 1.000000 | 7641.500000 | 180.500000 | 384.00000 | 1.000000 | 1.000000 | 64.000000 |
10| max | 1.000000 | 3.000000 | 1.000000 | 108000.000000 | 778.000000 | 504.00000 | 1.000000 | 1.000000 | 80.000000 |
11
Both variables have now been corrected. The data also contains a categorical variable, 'Sex', with five missing values. Since it is categorical, we will check the distribution of its labels, which is done in the line of code below.
1dat2['Sex'].value_counts()
Output:
1 M 484
2 F 111
3 Name: Sex, dtype: int64
The output shows that 484 out of 595 applicants are male, so we will replace the missing values with label 'M'. The first line of code below performs this task, while the second line prints the distribution of the variable. The output shows 600 records for the 'Sex' variable, which means the missing values have been accounted for.
1dat2['Sex'].fillna('M',inplace=True)
2dat2['Sex'].value_counts()
Output:
1 M 489
2 F 111
3 Name: Sex, dtype: int64
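As an aside, the most frequent label does not have to be hard-coded; it can be derived with pandas' mode(). A minimal sketch of that equivalent variant ('most_frequent' is just an illustrative name) is shown below.
1# Equivalent approach: fill with the most frequent label instead of hard-coding 'M'
2most_frequent = dat2['Sex'].mode()[0]
3dat2['Sex'] = dat2['Sex'].fillna(most_frequent)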
We will now check if any more variables have missing values, which is done in the line of code below. The output shows that we still have two missing values in the variable 'Dependents'.
1dat2.isnull().sum()
Output:
1 Marital_status 0
2 Dependents 2
3 Is_graduate 0
4 Income 0
5 Loan_amount 0
6 Term_months 0
7 Credit_score 0
8 approval_status 0
9 Age 0
10 Sex 0
11 dtype: int64
Since only two missing values remain in the dataset, we will use this as an opportunity to learn another approach: dropping records with missing values. The first line of code below uses the 'dropna()' function to drop all rows that contain any missing values, while the second line prints the updated information about the dataset.
1dat2 = dat2.dropna()
2dat2.info()
Output:
1 <class 'pandas.core.frame.DataFrame'>
2 Int64Index: 598 entries, 0 to 599
3 Data columns (total 10 columns):
4 Marital_status 598 non-null int64
5 Dependents 598 non-null float64
6 Is_graduate 598 non-null float64
7 Income 598 non-null int64
8 Loan_amount 598 non-null int64
9 Term_months 598 non-null int64
10 Credit_score 598 non-null int64
11 approval_status 598 non-null int64
12 Age 598 non-null float64
13 Sex 598 non-null object
14 dtypes: float64(3), int64(6), object(1)
15 memory usage: 51.4+ KB
The missing values have now been treated, but the labels in the variable 'Sex' are letters ('M' and 'F'). For modeling with scikit-learn, all the variables should be numeric, so we will have to encode these labels. Since there are only two labels, we can use binary encoding, which is done in the first line of code below. The output from the second line shows that the encoding was performed successfully.
1dat2["Sex"] = dat2["Sex"].map({"M": 0, "F":1})
2dat2['Sex'].value_counts()
Output:
1 0 487
2 1 111
3 Name: Sex, dtype: int64
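One caveat with this approach: map() returns NaN for any value that is not a key in the mapping dictionary, so it is worth confirming that the encoding did not introduce new missing values. A quick check is sketched below.
1# map() turns unmapped values into NaN; verify no new missing values appeared
2print(dat2['Sex'].isnull().sum())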
One of the biggest obstacles in predictive modeling is the presence of outliers, extreme values that differ markedly from the other data points. Outliers are a problem because they can mislead the training process and result in inaccurate models.
For numerical variables, we can identify outliers visually through a histogram or numerically through the skewness value. The two lines of code below plot the histogram along with the skewness value for the 'Income' variable.
1plot1 = sns.distplot(dat2["Income"], color="b", label="Skewness : %.1f"%(dat2["Income"].skew()))
2plot1 = plot1.legend(loc="best")
Output: [histogram of 'Income' with the skewness value shown in the legend]
The histogram shows that the variable 'Income' has a right-skewed distribution, with a skewness value of 6.5. Ideally, the skewness value should be between -1 and 1.
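The same numeric check can be applied to every column at once; a minimal sketch (the exact values will depend on the data) is shown below.
1# Skewness of all numeric columns; values far outside [-1, 1] suggest skewed distributions
2print(dat2.select_dtypes(include=[np.number]).skew())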
Apart from 'Income', other variables ('Loan_amount' and 'Age') are on noticeably different scales and will require normalization. We will learn a couple of techniques in the subsequent sections to deal with these preprocessing problems.
The previous chart showed that the variable 'Income' is skewed. One way to make its distribution closer to normal is a logarithmic transformation. The first line of code below creates a new variable, 'LogIncome', while the second and third lines plot the histogram and skewness value of this new variable.
1dat2["LogIncome"] = dat2["Income"].map(lambda i: np.log(i) if i > 0 else 0)
2plot2 = sns.distplot(dat2["LogIncome"], color="m", label="Skewness : %.1f"%(dat2["LogIncome"].skew()))
3plot2 = plot2.legend(loc="best")
Output: [histogram of 'LogIncome' with the skewness value shown in the legend]
The above chart shows that taking the log of the 'Income' variable makes the distribution roughly normal and greatly reduces the skewness. We could use the same transformation for other numerical variables, but instead we will learn another technique, called standardization.
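If the same idea were applied to another positive-valued column, numpy's log1p (which maps 0 to 0 without a special case) is a convenient choice. The sketch below is purely illustrative and not part of the pipeline used in this guide; the column name 'LogLoan_amount' is hypothetical.
1# Illustrative only: the same transformation applied to another positive-valued column
2dat2['LogLoan_amount'] = np.log1p(dat2['Loan_amount'])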
Several machine learning algorithms use some form of distance metric to learn from the data. However, when features are on different scales, such as 'Age' in years and 'Income' in hundreds of dollars, the features on the larger scale can unduly influence the model. As a result, we want the features on a similar scale, which can be achieved through scaling techniques.
One such technique is standardization, in which all the features are centered around zero and have, roughly, unit variance. The first line of code below imports 'StandardScaler' from the 'sklearn.preprocessing' module. The second line standardizes the three variables 'Income', 'Loan_amount', and 'Age'. Finally, the third line prints the variance of the scaled variables.
1from sklearn.preprocessing import StandardScaler
2dat2[['Income','Loan_amount', 'Age']] = StandardScaler().fit_transform(dat2[['Income','Loan_amount', 'Age']])
3print(dat2['Income'].var()); print(dat2['Loan_amount'].var()); print(dat2['Age'].var())
Output:
11.0016750418760463
21.0016750418760472
31.001675041876044
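The printed values are close to, but not exactly, 1. StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' var() reports the sample variance (ddof=1), so with 598 rows the result is 598/597 ≈ 1.0017. Passing ddof=0 to var() confirms the scaling, as sketched below.
1# StandardScaler uses the population variance; ddof=0 makes var() match it exactly
2print(dat2['Income'].var(ddof=0))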
All three standardized variables are now on the same scale. Let us now look at the variables after all the preprocessing done so far.
1print(dat2.info())
Output:
1 <class 'pandas.core.frame.DataFrame'>
2 Int64Index: 598 entries, 0 to 599
3 Data columns (total 11 columns):
4 Marital_status 598 non-null int64
5 Dependents 598 non-null float64
6 Is_graduate 598 non-null float64
7 Income 598 non-null float64
8 Loan_amount 598 non-null float64
9 Term_months 598 non-null int64
10 Credit_score 598 non-null int64
11 approval_status 598 non-null int64
12 Age 598 non-null float64
13 Sex 598 non-null int64
14 LogIncome 598 non-null float64
15 dtypes: float64(6), int64(5)
16 memory usage: 56.1 KB
17 None
The two variables 'Dependents' and 'Is_graduate' are stored as 'float64', i.e., as numbers with a decimal part; this is a side effect of the earlier missing-value handling, since the standard integer dtype cannot hold NaN values. It is not what we want, as both variables take integer values. For carrying out mathematical operations on the variables during the modeling process, it is important that they have the correct data types.
The first two lines of code below convert these variables to the integer data type, while the third line prints the data types of all the variables.
1dat2["Dependents"] = dat2["Dependents"].astype("int")
2dat2["Is_graduate"] = dat2["Is_graduate"].astype("int")
3print(dat2.dtypes)
Output:
1 Marital_status int64
2 Dependents int32
3 Is_graduate int32
4 Income float64
5 Loan_amount float64
6 Term_months int64
7 Credit_score int64
8 approval_status int64
9 Age float64
10 Sex int64
11 LogIncome float64
12 dtype: object
The data types of the variables 'Dependents' and 'Is_graduate' have been corrected. We also created an additional variable, 'LogIncome', to demonstrate the logarithmic transformation; the same transformation could have been applied to the 'Income' variable directly, without creating a new one.
All the variables now appear to be in the right form, and we could proceed to build models to predict the 'approval_status' of the loan applications. However, modeling is not within the scope of this guide; you can learn about it in the other Pluralsight guides on scikit-learn linked at the end.
In this guide, you have learned about the fundamental techniques of data preprocessing for machine learning. You learned about dealing with missing values, identifying and treating outliers, normalizing and transforming data, and converting the data types.
To learn more about building machine learning models using scikit-learn, please refer to the following guides:
To learn more about building deep learning models using Keras, please refer to the following guides: