 Deepika Singh

# Preparing Data for Modeling with scikit-learn

• Jun 24, 2019
• 460 Views
Data
scikit-learn

## Introduction

Data preparation is often said to take as much as eighty percent of a data scientist's time in a data science project, which underlines its importance in the machine learning life cycle.

In this guide, you will learn the basics and implementation of several data preparation techniques, mentioned below:

1. Dealing with Incorrect Entries
2. Missing Value Treatment
3. Encoding Categorical Labels
4. Handling Outliers
5. Logarithmic Transformation
6. Standardization
7. Converting the Column Types

## Data

In this guide, we will be using fictitious data of loan applicants which contains 600 observations and 10 variables, as described below:

1. Marital_status - Whether the applicant is married ("1") or not ("0").
2. Dependents - Number of dependents claimed by the applicant.
3. Is_graduate - Whether the applicant is a graduate ("1") or not ("0").
4. Income - Annual Income of the applicant (in hundreds of dollars).
5. Loan_amount - Loan amount (in hundreds of dollars) for which the application was submitted.
6. Term_months - Tenure of the loan (in months).
7. Credit_score - Whether the applicant's credit score was good ("1") or not ("0").
8. Age - The applicant’s age in years.
9. Sex - Whether the applicant is female (F) or male (M).
10. approval_status - Whether the loan application was approved ("1") or not ("0"). This is the dependent variable.

```python
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic for inline plots; remove it when running as a plain script
%matplotlib inline

# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
```

### Reading the Data and Performing Basic Data Checks

The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 600 observations of 10 variables. The third line gives the summary statistics of the variables.

```python
# Load data ("data.csv" is a placeholder file name; the original path is not shown)
dat2 = pd.read_csv("data.csv")
print(dat2.shape)
dat2.describe()
```

Output:

```
(600, 10)
```

|       | Marital_status | Dependents | Is_graduate | Income        | Loan_amount | Term_months | Credit_score | approval_status | Age        |
|-------|----------------|------------|-------------|---------------|-------------|-------------|--------------|-----------------|------------|
| count | 600.000000     | 598.000000 | 599.000000  | 600.000000    | 600.000000  | 600.00000   | 600.000000   | 600.000000      | 600.000000 |
| mean  | 0.651667       | 0.730769   | 2.449082    | 7210.720000   | 161.571667  | 367.10000   | 0.788333     | 0.686667        | 51.766667  |
| std   | 0.476840       | 0.997194   | 40.788143   | 8224.445086   | 93.467598   | 63.40892    | 0.408831     | 0.464236        | 21.240704  |
| min   | 0.000000       | 0.000000   | 0.000000    | 200.000000    | 10.000000   | 36.00000    | 0.000000     | 0.000000        | 0.000000   |
| 25%   | 0.000000       | 0.000000   | 1.000000    | 3832.500000   | 111.000000  | 384.00000   | 1.000000     | 0.000000        | 36.000000  |
| 50%   | 1.000000       | 0.000000   | 1.000000    | 5075.000000   | 140.000000  | 384.00000   | 1.000000     | 1.000000        | 51.000000  |
| 75%   | 1.000000       | 1.000000   | 1.000000    | 7641.500000   | 180.500000  | 384.00000   | 1.000000     | 1.000000        | 64.000000  |
| max   | 1.000000       | 3.000000   | 999.000000  | 108000.000000 | 778.000000  | 504.00000   | 1.000000     | 1.000000        | 200.000000 |

## Dealing with Incorrect Entries

The above output shows that the variable 'Age' has minimum and maximum values of 0 and 200, respectively. Also, the variable 'Is_graduate' has a maximum value of 999, instead of the binary values '0' and '1'. These entries are incorrect and need correction. One approach would be to delete these records, but instead we will treat them as missing values and replace them with a measure of central tendency, i.e., the mean, median, or mode.

Starting with the 'Age' variable, the first two lines of code below replace the incorrect values '0' and '200' with 'NaN', an indicator of missing values. We repeat the same process for the variable 'Is_graduate' in the third line of code. The fourth line prints the information about the variables.

```python
dat2["Age"] = dat2["Age"].replace(0, np.nan)      # assign back instead of chained inplace=True
dat2["Age"] = dat2["Age"].replace(200, np.nan)
dat2["Is_graduate"] = dat2["Is_graduate"].replace(999, np.nan)
dat2.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 10 columns):
Marital_status     600 non-null int64
Dependents         598 non-null float64
Is_graduate        598 non-null float64
Income             600 non-null int64
Loan_amount        600 non-null int64
Term_months        600 non-null int64
Credit_score       600 non-null int64
approval_status    600 non-null int64
Age                594 non-null float64
Sex                595 non-null object
dtypes: float64(3), int64(6), object(1)
memory usage: 47.0+ KB
```

Now, the variables 'Age' and 'Is_graduate' have 594 and 598 non-null records, respectively. The remaining entries have been tagged as missing, and the next section shows how to treat them.
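The same idea can be expressed as validity rules rather than a list of specific bad values. The sketch below uses a small made-up frame, and the accepted age range (18 to 100) is an assumption chosen for illustration, not a rule taken from the loan data.

```python
import pandas as pd

# Toy data (hypothetical values) containing the same kinds of bad entries
toy = pd.DataFrame({
    "Age": [34, 0, 51, 200, 47],
    "Is_graduate": [1, 0, 999, 1, 1],
})

# Assumed validity rules: ages between 18 and 100, binary flag in {0, 1}
valid_age = toy["Age"].between(18, 100)
valid_grad = toy["Is_graduate"].isin([0, 1])

# Series.where keeps values where the condition holds and inserts NaN elsewhere
toy["Age"] = toy["Age"].where(valid_age)
toy["Is_graduate"] = toy["Is_graduate"].where(valid_grad)

# Count how many entries were flagged as missing in each column
n_missing = toy.isnull().sum()
print(n_missing)
```

A rule-based check like this also catches bad values that were not spotted in the summary statistics, such as an age of 150.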

## Missing Value Treatment

There are various techniques for handling missing values. The most widely used one is replacing the values with a measure of central tendency. The first line of code below replaces the missing values of the 'Age' variable with the mean of the remaining values. The second line replaces the missing values of the 'Is_graduate' variable with '1', the most frequent value (mode), which indicates that the applicant's education status is 'graduate'. The third line gives the summary statistics of the variables.

```python
dat2["Age"] = dat2["Age"].fillna(dat2["Age"].mean())
dat2["Is_graduate"] = dat2["Is_graduate"].fillna(1)
dat2.describe()
```

Output:

|       | Marital_status | Dependents | Is_graduate | Income        | Loan_amount | Term_months | Credit_score | approval_status | Age        |
|-------|----------------|------------|-------------|---------------|-------------|-------------|--------------|-----------------|------------|
| count | 600.000000     | 598.000000 | 600.000000  | 600.000000    | 600.000000  | 600.00000   | 600.000000   | 600.000000      | 600.000000 |
| mean  | 0.651667       | 0.730769   | 0.783333    | 7210.720000   | 161.571667  | 367.10000   | 0.788333     | 0.686667        | 50.606061  |
| std   | 0.476840       | 0.997194   | 0.412317    | 8224.445086   | 93.467598   | 63.40892    | 0.408831     | 0.464236        | 16.184651  |
| min   | 0.000000       | 0.000000   | 0.000000    | 200.000000    | 10.000000   | 36.00000    | 0.000000     | 0.000000        | 22.000000  |
| 25%   | 0.000000       | 0.000000   | 1.000000    | 3832.500000   | 111.000000  | 384.00000   | 1.000000     | 0.000000        | 36.000000  |
| 50%   | 1.000000       | 0.000000   | 1.000000    | 5075.000000   | 140.000000  | 384.00000   | 1.000000     | 1.000000        | 50.606061  |
| 75%   | 1.000000       | 1.000000   | 1.000000    | 7641.500000   | 180.500000  | 384.00000   | 1.000000     | 1.000000        | 64.000000  |
| max   | 1.000000       | 3.000000   | 1.000000    | 108000.000000 | 778.000000  | 504.00000   | 1.000000     | 1.000000        | 80.000000  |

The corrections have now been made in both of the variables. The data also has a variable, 'Sex', with five missing values. Since this is a categorical variable, we will check the distribution of labels, which is done in the line of code below.

```python
dat2['Sex'].value_counts()
```

Output:

```
M    484
F    111
Name: Sex, dtype: int64
```

The output shows that 484 out of 595 applicants are male, so we will replace the missing values with label 'M'. The first line of code below performs this task, while the second line prints the distribution of the variable. The output shows 600 records for the 'Sex' variable, which means the missing values have been accounted for.

```python
dat2['Sex'] = dat2['Sex'].fillna('M')
dat2['Sex'].value_counts()
```

Output:

```
M    489
F    111
Name: Sex, dtype: int64
```
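Since this guide targets scikit-learn, the mean and mode replacements shown so far can also be written with `SimpleImputer` from `sklearn.impute`. The following is a minimal sketch on a made-up frame, not the approach used in the rest of this guide; the values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 50.0, 40.0],
    "Sex": ["M", "F", np.nan, "M"],
})

# Numeric column: replace missing values with the column mean
mean_imputer = SimpleImputer(strategy="mean")
toy["Age"] = mean_imputer.fit_transform(toy[["Age"]]).ravel()

# Categorical column: replace missing values with the most frequent label
mode_imputer = SimpleImputer(strategy="most_frequent")
toy["Sex"] = mode_imputer.fit_transform(toy[["Sex"]]).ravel()

print(toy)
```

One reason to prefer an imputer object is that the fitted statistics can be reused: calling `transform` on new data fills its gaps with the values learned from the data the imputer was fitted on.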

We will now check if any more variables have missing values, which is done in the line of code below. The output shows that we still have two missing values in the variable 'Dependents'.

```python
dat2.isnull().sum()
```

Output:

```
Marital_status     0
Dependents         2
Is_graduate        0
Income             0
Loan_amount        0
Term_months        0
Credit_score       0
approval_status    0
Age                0
Sex                0
dtype: int64
```

Since only two records in the dataset still have missing values, we will use another approach: dropping the records with missing values. The first line of code below uses the 'dropna()' function to drop every row that contains a missing value, while the second line checks the information about the dataset.

```python
dat2 = dat2.dropna()
dat2.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 598 entries, 0 to 599
Data columns (total 10 columns):
Marital_status     598 non-null int64
Dependents         598 non-null float64
Is_graduate        598 non-null float64
Income             598 non-null int64
Loan_amount        598 non-null int64
Term_months        598 non-null int64
Credit_score       598 non-null int64
approval_status    598 non-null int64
Age                598 non-null float64
Sex                598 non-null object
dtypes: float64(3), int64(6), object(1)
memory usage: 51.4+ KB
```
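Calling `dropna()` without arguments removes a row if any column is missing, which can discard more data than intended when an unimportant column has gaps. The sketch below, on a made-up frame with a hypothetical 'Notes' column, shows how the `subset` parameter limits the check to chosen columns and how `reset_index` restores a contiguous index afterwards.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); 'Notes' is an invented, mostly empty column
toy = pd.DataFrame({
    "Dependents": [0.0, np.nan, 2.0, 1.0],
    "Income": [5000, 4200, 6100, 3900],
    "Notes": [np.nan, "call back", np.nan, np.nan],
})

# No arguments: a row is dropped if ANY of its columns is missing
dropped_any = toy.dropna()

# subset=: only missing values in the listed columns cause a row to be dropped
dropped_subset = toy.dropna(subset=["Dependents"])

# Dropping rows leaves gaps in the index; reset it to 0..n-1 if that matters
dropped_subset = dropped_subset.reset_index(drop=True)

print(len(toy), len(dropped_any), len(dropped_subset))
```

In the loan data the index still runs from 0 to 599 after two rows are dropped, which is why the output above reports 598 entries with labels up to 599.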

## Encoding Categorical Labels

The missing values have been treated in the data, but the labels in the variable 'Sex' use letters ('M' and 'F'). For modeling using scikit-learn, all the variables should be numeric, so we will have to change the labels. Since there are two labels, we can do binary encoding which is done in the first line of code below. The output from the second line shows that we have successfully performed the encoding.

```python
dat2["Sex"] = dat2["Sex"].map({"M": 0, "F": 1})
dat2["Sex"].value_counts()
```

Output:

```
0    487
1    111
Name: Sex, dtype: int64
```
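Binary mapping only works for two labels. For a variable with more labels, one common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a hypothetical 'Property_area' column that is not part of the loan data, and also shows a pitfall of `map`: labels that are absent from the mapping silently become missing values.

```python
import pandas as pd

# Hypothetical column with three labels (not part of the loan data)
toy = pd.DataFrame({"Property_area": ["Urban", "Rural", "Semiurban", "Urban"]})

# One 0/1 column per label; drop_first=True removes the first label (alphabetically)
# because it is implied when all remaining columns are 0
encoded = pd.get_dummies(toy, columns=["Property_area"], drop_first=True, dtype=int)
print(encoded)

# Series.map returns NaN for labels missing from the mapping, so check for them
sex = pd.Series(["M", "F", "X"]).map({"M": 0, "F": 1})
n_unmapped = int(sex.isnull().sum())
print("unmapped labels:", n_unmapped)
```

Checking `isnull().sum()` right after a `map` call is a cheap way to confirm that every label was covered by the mapping.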

## Handling Outliers

One of the biggest obstacles in predictive modeling can be the presence of outliers which are extreme values that are different from the other data points. Outliers are often a problem because they mislead the training process and lead to inaccurate models.

For numerical variables, we can identify outliers visually through a histogram or numerically through the skewness value. The two lines of code below plot the histogram along with the skewness value for the 'Income' variable.

```python
# Note: sns.distplot is deprecated since seaborn 0.11 (histplot/displot replace it)
plot1 = sns.distplot(dat2["Income"], color="b", label="Skewness : %.1f" % (dat2["Income"].skew()))
plot1 = plot1.legend(loc="best")
```

Output: The histogram shows that the variable 'Income' has a right-skewed distribution with the skewness value of 6.5. Ideally, the skewness value should be between -1 and 1.

Apart from the variable 'Income', we also have other variables ('Loan_amount' and 'Age') that have differences in scale which require normalization. We will learn a couple of techniques in the subsequent sections to deal with these preprocessing problems.
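Besides the histogram and the skewness value, outliers can be flagged numerically. The sketch below applies the widely used 1.5 × IQR rule to a handful of made-up income values; both the data and the choice of this rule are assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd

# Toy income values (hypothetical) with one extreme entry
income = pd.Series([3800, 4200, 5100, 4900, 5600, 6100, 7600, 108000], name="Income")

# Interquartile range: the spread of the middle 50% of the values
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (income < lower) | (income > upper)

print(income[is_outlier])
print("skewness:", round(income.skew(), 2))
```

Flagged rows can then be inspected, capped, or transformed, depending on whether the extreme values are errors or genuine observations.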

## Logarithmic Transformation of Numerical Variables

The previous chart showed that the variable 'Income' is skewed. One of the ways to make its distribution normal is by logarithmic transformation. The first line of code below creates a new variable, 'LogIncome', while the second and third lines of code plot the histogram and skewness value of this new variable.

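```python
dat2["LogIncome"] = dat2["Income"].map(lambda i: np.log(i) if i > 0 else 0)
# Histogram with skewness label, mirroring the 'Income' plot above
plot2 = sns.distplot(dat2["LogIncome"], label="Skewness : %.1f" % (dat2["LogIncome"].skew()))
plot2 = plot2.legend(loc="best")
```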

Output: The above chart shows that taking the log of the 'Income' variable makes the distribution roughly normal and reduces the skewness. We can use the same transformation for other numerical variables, but, instead, we will learn another transformation technique called Standardization.
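The effect of the transformation on skewness can also be checked without a plot. The sketch below uses synthetic, right-skewed data generated from a log-normal distribution (an assumption standing in for the 'Income' variable) and `np.log1p`, which computes log(1 + x) and therefore handles zero values without a special case.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (assumed stand-in for an income-like variable)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=8.5, sigma=1.0, size=1000), name="Income")

# log1p(x) = log(1 + x); it is defined at x = 0, unlike a plain logarithm
log_income = np.log1p(income)

# Compare skewness before and after the transformation
skew_before = income.skew()
skew_after = log_income.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Note that `log1p` shifts every value by one before taking the log, so its output differs slightly from `np.log` for small inputs; for incomes in the thousands the difference is negligible.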

## Standardization

Several machine learning algorithms rely on some form of distance metric to learn from the data. However, when the features use different scales, such as 'Age' in years and 'Income' in hundreds of dollars, the features with larger scales can unduly influence the model. We therefore want the features to be on a similar scale, which can be achieved through scaling techniques.

One such technique is standardization, in which all the features are centered around zero and have, roughly, unit variance. The first line of code below imports the 'StandardScaler' from the 'sklearn.preprocessing' module. The second line standardizes the three variables, 'Income', 'Loan_amount', and 'Age'. Finally, the third line prints the variance of the scaled variables.

```python
from sklearn.preprocessing import StandardScaler
dat2[['Income', 'Loan_amount', 'Age']] = StandardScaler().fit_transform(dat2[['Income', 'Loan_amount', 'Age']])
print(dat2['Income'].var()); print(dat2['Loan_amount'].var()); print(dat2['Age'].var())
```

Output:

```
1.0016750418760463
1.0016750418760472
1.001675041876044
```

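All three standardized variables now have a variance of approximately one. The values are slightly above one because 'StandardScaler' divides by the population standard deviation, while the pandas 'var()' method uses the sample (n - 1) denominator, giving 598/597 ≈ 1.0017. Let us now look at the variables after all the preprocessing so far.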
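To make the computation concrete, the sketch below compares `StandardScaler` with a manual z-score on a small made-up frame (the values are hypothetical). It assumes the default scaler settings, which center each column on its mean and divide by the population standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Income": [3800.0, 5100.0, 7600.0, 12000.0],
    "Age": [25.0, 36.0, 51.0, 64.0],
})

# scikit-learn standardization: (x - mean) / population standard deviation
scaled = StandardScaler().fit_transform(toy)

# The same computation by hand; ddof=0 selects the population standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)

# pandas var() defaults to ddof=1, so the result is n / (n - 1) rather than 1
n = len(toy)
sample_var = pd.DataFrame(scaled, columns=toy.columns).var()
print(sample_var)
```

With four rows the sample variance is 4/3; with the 598 rows of the loan data it is 598/597, which matches the printed values above.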

```python
print(dat2.info())
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 598 entries, 0 to 599
Data columns (total 11 columns):
Marital_status     598 non-null int64
Dependents         598 non-null float64
Is_graduate        598 non-null float64
Income             598 non-null float64
Loan_amount        598 non-null float64
Term_months        598 non-null int64
Credit_score       598 non-null int64
approval_status    598 non-null int64
Age                598 non-null float64
Sex                598 non-null int64
LogIncome          598 non-null float64
dtypes: float64(6), int64(5)
memory usage: 56.1 KB
None
```

## Converting the Column Types

The two variables, 'Dependents' and 'Is_graduate', have been read as 'float64' which indicates numeric variables with a decimal value. This is not correct, as both of these variables are taking integer values. For carrying out any mathematical operations on the variables during the modeling process, it is important that the variables have the correct data types.

The first two lines of code below convert these variables to the integer data type, while the third line prints the data types of the variables.

```python
dat2["Dependents"] = dat2["Dependents"].astype("int")
dat2["Is_graduate"] = dat2["Is_graduate"].astype("int")
print(dat2.dtypes)
```

Output:

```
Marital_status       int64
Dependents           int32
Is_graduate          int32
Income             float64
Loan_amount        float64
Term_months          int64
Credit_score         int64
approval_status      int64
Age                float64
Sex                  int64
LogIncome          float64
dtype: object
```

The data types of the variables 'Dependents' and 'Is_graduate' have been corrected. We created an additional variable, 'LogIncome', to demonstrate logarithmic transformation; however, the same transformation could have been applied to the 'Income' variable directly, without creating a new one.
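Casting a float column to an integer type truncates any fractional part and fails if missing values remain, so it can help to check before converting. The sketch below is one possible guard, shown on made-up values. It uses the explicit "int64" type, since the size of plain "int" can depend on the platform (the output above shows 'int32').

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) stored as float but holding whole numbers
dependents = pd.Series([0.0, 2.0, 1.0, 3.0], name="Dependents")

# Cast only if every value is a whole number, so nothing is truncated silently
if (dependents % 1 == 0).all():
    dependents = dependents.astype("int64")

# A column with missing values cannot be cast to a NumPy integer type;
# the pandas nullable "Int64" dtype (capital I) stores integers alongside <NA>
with_gaps = pd.Series([0.0, np.nan, 2.0]).astype("Int64")

print(dependents.dtype, with_gaps.dtype)
```

The nullable dtype is mainly useful when the conversion has to happen before missing values are treated; in this guide the missing values were handled first, so a plain integer type is sufficient.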

All the variables now seem to be in the right form, and we could proceed to modeling to predict the 'approval_status' of the loan applications. However, that is not within the scope of this guide; you can learn about it through other Pluralsight guides on scikit-learn, whose links are given at the end.

## Conclusion

In this guide, you have learned about the fundamental techniques of data preprocessing for machine learning. You learned about dealing with missing values, identifying and treating outliers, normalizing and transforming data, and converting the data types.