Great machine learning models are usually built on high quality features, which makes data preparation and feature engineering imperative. In this guide, you will learn how to prepare data and create powerful features for predictive modeling. We will begin by exploring the data.
We will use a fictitious dataset of loan applicants that contains 600 observations and 11 variables.
Let's start by loading the required libraries and the data.
1# Import required libraries
2import pandas as pd
3import numpy as np
4import matplotlib.pyplot as plt
5import seaborn as sns
6%matplotlib inline
The first line of code below reads in the data as a pandas dataframe, while the second line prints the shape - 600 observations of 11 variables. The third line prints information about the variables and their types.
1df = pd.read_csv("data_eng.csv")
2print(df.shape)
3print(df.info())
Output:
1 (600, 11)
2
3 <class 'pandas.core.frame.DataFrame'>
4 RangeIndex: 600 entries, 0 to 599
5 Data columns (total 11 columns):
6 Marital_status 600 non-null object
7 Dependents 593 non-null float64
8 Is_graduate 594 non-null object
9 Income 600 non-null int64
10 Loan_amount 592 non-null float64
11 Term_months 593 non-null float64
12 Credit_score 600 non-null object
13 approval_status 600 non-null int64
14 Age 597 non-null float64
15 Sex 599 non-null object
16 Purpose 600 non-null object
17 dtypes: float64(4), int64(2), object(5)
18 memory usage: 51.6+ KB
19 None
The above output shows that there are five categorical (object) and six numerical variables - four float and two integer. It also shows that a few of the variables have fewer than 600 non-null observations, suggesting the presence of missing values.
Most machine learning algorithms have difficulty working with data that contains missing values. The line of code below prints the number of missing values in each variable.
1df.apply(lambda x: sum(x.isnull()),axis=0)
Output:
1 Marital_status 0
2 Dependents 7
3 Is_graduate 6
4 Income 0
5 Loan_amount 8
6 Term_months 7
7 Credit_score 0
8 approval_status 0
9 Age 3
10 Sex 1
11 Purpose 0
12 dtype: int64
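The same counts can also be obtained more directly with pandas' 'isnull()' and 'sum()' functions; the line below is equivalent to the 'apply()' call above.
1df.isnull().sum()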
We will treat the missing values according to the variable type.
The most widely used method for treating missing values in numeric variables is to replace them with a measure of central tendency. The three lines of code below replace the missing values in the variables 'Loan_amount', 'Term_months', and 'Age', respectively, with the mean of each variable. The fourth line prints the number of missing values in the data, and the output confirms the corrections made to these three variables.
1df['Loan_amount'].fillna(df['Loan_amount'].mean(), inplace=True)
2df['Term_months'].fillna(df['Term_months'].mean(), inplace=True)
3df['Age'].fillna(df['Age'].mean(), inplace=True)
4df.apply(lambda x: sum(x.isnull()),axis=0)
Output:
1 Marital_status 0
2 Dependents 7
3 Is_graduate 6
4 Income 0
5 Loan_amount 0
6 Term_months 0
7 Credit_score 0
8 approval_status 0
9 Age 0
10 Sex 1
11 Purpose 0
12 dtype: int64
One of the categorical variables, 'Is_graduate', has six missing values. Since this is a categorical variable, we will replace the missing values with the mode of the distribution. The line of code below displays the frequency count of the variable.
1df['Is_graduate'].value_counts()
Output:
1 Yes 466
2 No 128
3 Name: Is_graduate, dtype: int64
The label 'Yes' has the highest frequency (i.e., it is the mode) for the variable 'Is_graduate', so we will replace the missing values with the label 'Yes'. The first line of code below performs this task, while the second line prints the distribution of the variable. The output shows 600 records for the variable, which means the missing values have been replaced.
1df['Is_graduate'].fillna('Yes',inplace=True)
2df['Is_graduate'].value_counts()
Output:
1 Yes 472
2 No 128
3 Name: Is_graduate, dtype: int64
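Note that instead of hard-coding the label, the mode can also be computed programmatically, which is safer if the data changes. A minimal sketch that is equivalent to the fill above:
1# Compute the most frequent label and use it to fill the missing values
2mode_value = df['Is_graduate'].mode()[0]
3df['Is_graduate'].fillna(mode_value, inplace=True)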
The missing values of the variable 'Is_graduate' have now been replaced. However, the data still has seven missing values in the variable 'Dependents' and one missing value in the variable 'Sex'.
Since the number of missing values is small, we will use another approach: dropping the records that contain them. The first line of code below uses the 'dropna()' function to drop every row with any missing value, while the second line prints information about the dataset. The output shows that all the variables now have the same number of observations and that the missing value treatment is complete.
1df = df.dropna()
2df.info()
Output:
1 <class 'pandas.core.frame.DataFrame'>
2 Int64Index: 592 entries, 0 to 598
3 Data columns (total 11 columns):
4 Marital_status 592 non-null object
5 Dependents 592 non-null float64
6 Is_graduate 592 non-null object
7 Income 592 non-null int64
8 Loan_amount 592 non-null float64
9 Term_months 592 non-null float64
10 Credit_score 592 non-null object
11 approval_status 592 non-null int64
12 Age 592 non-null float64
13 Sex 592 non-null object
14 Purpose 592 non-null object
15 dtypes: float64(4), int64(2), object(5)
16 memory usage: 55.5+ KB
The data is now free of missing values, so we can move on to the feature engineering part. We will start with the categorical variables.
When a categorical variable has too many unique categories (or labels), it may limit the predictive power of the machine learning model. In such a scenario, reducing the number of labels can be a solution. In our data, the variable 'Purpose' has several categories, which can be displayed using the line of code below.
1df['Purpose'].value_counts()
Output:
1 Education 185
2 Personal 164
3 Travel 123
4 Business 43
5 Wedding 39
6 Furniture 38
7 Name: Purpose, dtype: int64
Notice that the categories 'Business', 'Wedding', and 'Furniture' have significantly lower counts than the other three categories. These sparse categories are unlikely to show a statistically significant difference in predicting 'approval_status', so we will aggregate them into a new category called 'other'.
The first three lines of code below define a Python dictionary that re-codes the 'Purpose' categories into a smaller set, the fourth line applies the mapping with a list comprehension, and the fifth line prints the resulting frequency table.
1purpose_categories = {'Education':'Education', 'Personal':'Personal',
2 'Travel':'Travel', 'Business':'other',
3 'Wedding':'other', 'Furniture':'other'}
4df['Purpose'] = [purpose_categories[x] for x in df['Purpose']]
5df['Purpose'].value_counts()
Output:
1 Education 185
2 Personal 164
3 Travel 123
4 other 120
5 Name: Purpose, dtype: int64
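As a side note, if the category list were longer or likely to change, the same grouping could be derived from a frequency threshold instead of a hand-written dictionary. The sketch below is illustrative only and is not part of the workflow; the cutoff of 100 observations and the 'purpose_grouped' name are assumptions.
1# Hypothetical alternative: collapse any label with fewer than 100 rows into 'other'
2counts = df['Purpose'].value_counts()
3rare_labels = list(counts[counts < 100].index)
4purpose_grouped = df['Purpose'].replace(rare_labels, 'other')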
For modeling with scikit-learn, all the variables should be numeric, so we will have to encode the categorical labels. There are two common techniques for doing this.
In the first technique, each category of a feature is encoded as a separate binary ('dummy') column. scikit-learn's 'OneHotEncoder' implements this and returns a sparse matrix or dense array; here we use the pandas 'get_dummies()' function, which returns a dataframe.
The first line of code below performs the one-hot encoding for the 'Sex' variable, while the second line prints the first five observations of the data.
1df = pd.get_dummies(df, columns=['Sex'],prefix='S')
2df.head(5)
Output:
1| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age | Purpose | S_F | S_M |
2|--- |---------------- |------------ |------------- |-------- |------------- |------------- |------------------- |----------------- |------ |----------- |----- |----- |
3| 0 | Yes | 2.0 | No | 816700 | 104000.0 | 365.079258 | Satisfactory | 0 | 55.0 | Education | 0 | 1 |
4| 1 | No | 0.0 | Yes | 722300 | 79500.0 | 365.079258 | Not_satisfactory | 0 | 32.0 | Education | 1 | 0 |
5| 2 | No | 0.0 | Yes | 666700 | 84000.0 | 365.079258 | Satisfactory | 0 | 59.0 | Education | 1 | 0 |
6| 3 | No | 0.0 | Yes | 500000 | 63000.0 | 365.079258 | Satisfactory | 0 | 36.0 | Education | 0 | 1 |
7| 4 | No | 0.0 | Yes | 409200 | 39500.0 | 504.000000 | Satisfactory | 0 | 42.0 | Education | 0 | 1 |
The output shows that the variable 'Sex' has been replaced by two binary variables, 'S_F' and 'S_M'. However, the information is duplicated: the presence of the 'Female' category automatically implies the absence of the 'Male' category. This redundancy can be removed with a technique called dummy encoding.
In this technique, the features are encoded so that there is no duplication of information: a variable with n categories is represented by only n-1 binary columns. This is achieved by passing the argument drop_first=True to the get_dummies() function, as done in the first four lines of code below. The fifth line drops the redundant 'S_M' column, while the sixth line prints information about the data.
1df = pd.get_dummies(df, columns=['Purpose'], drop_first=True, prefix='Pur')
2df = pd.get_dummies(df, columns=['Marital_status'], drop_first=True, prefix='M')
3df = pd.get_dummies(df, columns=['Is_graduate'], drop_first=True, prefix='Ed')
4df = pd.get_dummies(df, columns=['Credit_score'], drop_first=True, prefix='CSc')
5df = df.drop(['S_M'], axis=1)
6df.info()
Output:
1 <class 'pandas.core.frame.DataFrame'>
2 Int64Index: 592 entries, 0 to 598
3 Data columns (total 13 columns):
4 Dependents 592 non-null float64
5 Income 592 non-null int64
6 Loan_amount 592 non-null float64
7 Term_months 592 non-null float64
8 approval_status 592 non-null int64
9 Age 592 non-null float64
10 S_F 592 non-null uint8
11 Pur_Personal 592 non-null uint8
12 Pur_Travel 592 non-null uint8
13 Pur_other 592 non-null uint8
14 M_Yes 592 non-null uint8
15 Ed_Yes 592 non-null uint8
16 CSc_Satisfactory 592 non-null uint8
17 dtypes: float64(4), int64(2), uint8(7)
18 memory usage: 36.4 KB
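As a side note, the four separate 'get_dummies()' calls above could also have been written as a single call, since both the 'columns' and 'prefix' arguments accept lists. The sketch below is an alternative to, not a continuation of, the code above, and assumes the raw categorical columns are still present in the dataframe.
1# One-step alternative to the four separate calls (columns and prefixes passed as lists)
2encoded = pd.get_dummies(df, columns=['Purpose', 'Marital_status', 'Is_graduate', 'Credit_score'],
3                         drop_first=True, prefix=['Pur', 'M', 'Ed', 'CSc'])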
Feature engineering for numerical variables requires a different strategy than for categorical features. The data has five numerical features - Dependents, Income, Loan_amount, Term_months, and Age. In the subsequent sections, we will learn about various techniques for handling numerical variables.
Modeling numerical features can be difficult in the presence of extreme values, or outliers, which are observations that are significantly different from the other data points. For numerical variables, we can quantify the presence of outliers through the skewness value. The lines of code below print the skewness value along with the summary statistics for the 'Income' variable.
1print(df['Income'].skew())
2df['Income'].describe()
Output:
1 5.32574034245
2
3 count 5.920000e+02
4 mean 7.105902e+05
5 std 7.146531e+05
6 min 3.000000e+04
7 25% 3.865250e+05
8 50% 5.099500e+05
9 75% 7.713500e+05
10 max 8.444900e+06
11 Name: Income, dtype: float64
The output above shows that the variable 'Income' has a right-skewed distribution, with a skewness value of 5.3. Ideally, the skewness value should be between -1 and 1. There are many techniques for handling these extreme values, one of which is quantile-based capping or flooring. Since the variable 'Income' is right-skewed, we will cap the higher values.
The first line of code below prints the 90th percentile of the variable, which comes out to be approximately USD 1,247,210. The second line of code is where the capping takes place, while the third line prints the skewness value after the capping.
1print(df['Income'].quantile(0.90))
2df["Income"] = np.where(df["Income"] >1247209, 1247209,df['Income'])
3print(df['Income'].skew())
Output:
1 1247209.9999999998
2
3 0.894679717523
The skewness value is now 0.89, which is much closer to the acceptable range.
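Capping and flooring can also be applied to both tails at once with pandas' 'clip()' function. The sketch below is illustrative and is not applied to the data here; the 1st and 99th percentile cutoffs are assumptions that should be tuned to the data.
1# Hypothetical two-sided capping/flooring at the 1st and 99th percentiles
2lower, upper = df['Income'].quantile([0.01, 0.99])
3income_clipped = df['Income'].clip(lower=lower, upper=upper)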
Creating new features from two or more existing features can lead to an improvement in the predictive power of a machine learning algorithm. These new features are often referred to as interaction terms.
In the loan application data, the loan to income ratio can be an important metric for approving loan applicants. Another example could be the duration of the loan with respect to the loan amount. The lines of code below create two new features, 'loan_income_ratio' and 'loan_term_ratio', that take the interaction between the loan amount, the applicant's income and the loan term into consideration.
1### Creating new features
2df['loan_income_ratio'] = df['Loan_amount']/df['Income']
3df['loan_term_ratio'] = df['Loan_amount']/df['Term_months']
Let us now look at the summary statistics of the variables, which can be obtained with the line of code below.
1df.describe()
Output:
1| | Dependents | Income | Loan_amount | Term_months | approval_status | Age | S_F | Pur_Personal | Pur_Travel | Pur_other | M_Yes | Ed_Yes | CSc_Satisfactory | loan_income_ratio | loan_term_ratio |
2|------- |------------ |-------------- |-------------- |------------- |----------------- |------------ |------------ |-------------- |------------ |------------ |------------ |------------ |------------------ |------------------- |----------------- |
3| count | 592.000000 | 5.920000e+02 | 5.920000e+02 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 | 592.000000 |
4| mean | 0.753378 | 6.076007e+05 | 3.311038e+05 | 364.956884 | 0.690878 | 49.677365 | 0.182432 | 0.277027 | 0.207770 | 0.202703 | 0.653716 | 0.787162 | 0.788851 | 0.471830 | 1029.732920 |
5| std | 1.029179 | 3.067529e+05 | 7.277030e+05 | 61.877355 | 0.462522 | 14.427056 | 0.386527 | 0.447908 | 0.406055 | 0.402353 | 0.476187 | 0.409660 | 0.408469 | 0.996969 | 2647.188471 |
6| min | 0.000000 | 3.000000e+04 | 1.860000e+04 | 36.000000 | 0.000000 | 22.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.086564 | 48.437500 |
7| 25% | 0.000000 | 3.865250e+05 | 6.200000e+04 | 384.000000 | 0.000000 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.122285 | 166.341146 |
8| 50% | 0.000000 | 5.099500e+05 | 7.800000e+04 | 384.000000 | 1.000000 | 51.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.162164 | 217.604425 |
9| 75% | 1.000000 | 7.713500e+05 | 1.415000e+05 | 384.000000 | 1.000000 | 61.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.243491 | 423.023897 |
10| max | 6.000000 | 1.247209e+06 | 7.780000e+06 | 504.000000 | 1.000000 | 76.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 10.930828 | 32696.078431 |
We now have seven numerical features - Dependents, Income, Loan_amount, Term_months, Age, loan_income_ratio, and loan_term_ratio. However, most of them have different units and scales, such as 'Age' in years and 'Income' in dollars. These differences can unduly influence the model, so we need to scale or transform the features. There are several techniques to achieve this, and some of the popular ones are discussed in the subsequent sections.
In this approach, the data is scaled to a fixed range - usually 0 to 1. The impact is that we end up with smaller standard deviations, which can suppress the effect of outliers.
The first line of code below imports the 'MinMaxScaler' from the 'sklearn.preprocessing' module. The second line instantiates the 'MinMaxScaler()', while the third line fits the scaler on the 'Age' column. The fourth line transforms the 'Age' column with the scaler we just fitted, while the fifth line prints the summary statistics of the 'scaled_age' variable. The output shows that the variable has been scaled to between zero and one.
1from sklearn.preprocessing import MinMaxScaler
2minmaxscaler = MinMaxScaler()
3minmaxscaler.fit(df[['Age']])
4df['scaled_age'] = minmaxscaler.transform(df[['Age']])
5df['scaled_age'].describe()
Output:
1 count 592.000000
2 mean 0.512544
3 std 0.267168
4 min 0.000000
5 25% 0.277778
6 50% 0.537037
7 75% 0.722222
8 max 1.000000
9 Name: scaled_age, dtype: float64
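For reference, the same transformation can be written out by hand, which makes the underlying formula explicit: scaled = (x - min) / (max - min). The one-line sketch below is equivalent to the 'MinMaxScaler' transform above; the 'scaled_age_manual' name is illustrative and the result is not added to the dataframe.
1# Manual min-max scaling: (x - min) / (max - min)
2scaled_age_manual = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())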
Another technique is standardization, in which all the features are centered around zero and have, roughly, unit variance.
The code below imports the 'StandardScaler' from the 'sklearn.preprocessing' module and standardizes the four variables 'Loan_amount', 'Term_months', 'loan_income_ratio', and 'loan_term_ratio'. The final line prints the variance of the standardized variables.
1### Standardization
2from sklearn.preprocessing import StandardScaler
3df[['Loan_amount', 'Term_months', 'loan_income_ratio', 'loan_term_ratio']] = StandardScaler().fit_transform(df[['Loan_amount', 'Term_months','loan_income_ratio', 'loan_term_ratio']])
4print(df['Loan_amount'].var()); print(df['Term_months'].var()); print(df['loan_income_ratio'].var()); print(df['loan_term_ratio'].var())
Output:
1 1.0016920473773263
2 1.0016920473773332
3 1.0016920473773272
4 1.0016920473773252
The output shows that all the standardized variables now have approximately unit variance.
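For reference, standardization follows the formula z = (x - mean) / std. A hand-written equivalent for a single column might look like the sketch below; note that 'StandardScaler' uses the population standard deviation (ddof=0), which is why the variances above are close to, but not exactly, one when computed with pandas' default sample variance. The 'age_zscore' name is illustrative.
1# Manual z-score; ddof=0 matches StandardScaler's population standard deviation
2age_zscore = (df['Age'] - df['Age'].mean()) / df['Age'].std(ddof=0)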
Earlier in the guide, we applied quantile capping to the income variable. A logarithmic transformation is another way of treating such extreme values. The line of code below does this and creates a new feature, 'LogIncome'.
1df["LogIncome"] = df["Income"].map(lambda i: np.log(i) if i > 0 else 0)
One more feature engineering technique is binning, in which a numeric variable is grouped into categories according to the distribution of the data. The line of code below prints the summary statistics of the variable 'Dependents'.
1df['Dependents'].describe()
Output:
1 count 592.000000
2 mean 0.753378
3 std 1.029179
4 min 0.000000
5 25% 0.000000
6 50% 0.000000
7 75% 1.000000
8 max 6.000000
9 Name: Dependents, dtype: float64
The 'Dependents' variable ranges from zero to six, but the summary statistics show that the majority of applicants have no dependents. We will group this variable into three bins, which is done in the line of code below.
1df['Binned_Dependents'] = pd.cut(df['Dependents'], bins=[-np.inf, 0, 2, np.inf], labels=[0, '1-2', '>2'])
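To verify the grouping, the frequency of each bin can be checked with a value count (output not shown here):
1df['Binned_Dependents'].value_counts()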
We have now done a substantial amount of data preparation and feature engineering, so certain variables have become redundant and should be dropped from the data. The first line of code below drops three such variables, while the second line prints the shape of the new data - 592 observations and 15 variables. The third line displays the first five observations.
1df_new = df.drop(['Dependents', 'Income', 'Age'], axis=1)
2print(df_new.shape)
3df_new.head(5)
Output:
1 (592, 15)
2
3
4| | Loan_amount | Term_months | approval_status | S_F | Pur_Personal | Pur_Travel | Pur_other | M_Yes | Ed_Yes | CSc_Satisfactory | loan_income_ratio | loan_term_ratio | scaled_age | LogIncome | Binned_Dependents |
5|--- |------------- |------------- |----------------- |----- |-------------- |------------ |----------- |------- |-------- |------------------ |------------------- |----------------- |------------ |----------- |------------------- |
6| 0 | -0.312347 | 0.001979 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | -0.345828 | -0.281617 | 0.611111 | 13.613027 | 1-2 |
7| 1 | -0.346043 | 0.001979 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | -0.363171 | -0.306989 | 0.185185 | 13.490196 | 0 |
8| 2 | -0.339854 | 0.001979 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | -0.347181 | -0.302329 | 0.685185 | 13.410095 | 0 |
9| 3 | -0.368736 | 0.001979 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | -0.347174 | -0.324077 | 0.259259 | 13.122363 | 0 |
10| 4 | -0.401057 | 2.248976 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | -0.376759 | -0.359689 | 0.370370 | 12.921959 | 0 |
In this guide, you have learned about the most popular data preparation and feature engineering techniques. Which of these techniques to use depends on the data, the problem statement, and the machine learning algorithm selected for building the model.
To learn more about data preparation and building machine learning models using Python's 'scikit-learn' library, please refer to the following guides: