
Preparing Data for Feature Engineering and Machine Learning

Jul 29, 2019 • 25 Minute Read

Introduction

Great machine learning models are often built on high-quality features, which makes data preparation and feature engineering imperative. In this guide, you will learn how to prepare data and create powerful features for predictive modeling. We will begin by exploring the data.

Data

In this guide, we will be using a fictitious dataset of loan applicants that contains 600 observations and 11 variables, as described below:

  1. Marital_status - Whether the applicant is married ("Yes") or not ("No").

  2. Dependents - Number of dependents claimed by the applicant.

  3. Is_graduate - Whether the applicant is a graduate ("Yes") or not ("No").

  4. Income - Annual Income of the applicant (in US dollars).

  5. Loan_amount - Loan amount (in US dollars) for which the application was submitted.

  6. Term_months - Tenure of the loan (in months).

  7. Credit_score - Whether the applicant's credit score was good ("Satisfactory") or not ("Not _satisfactory").

  8. Age - The applicant’s age in years.

  9. Sex - Whether the applicant is female (F) or male (M).

  10. Purpose - Purpose of taking the loan as stated by the applicant (Education, Personal, Travel, Business, etc.).

  11. approval_status - Whether the loan application was approved ("1") or not ("0"). This is the dependent variable.

Let's start by loading the required libraries and the data.

      # Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
    

Load the Data

The first line of code below reads in the data as a pandas dataframe, while the second line prints its shape - 600 observations of 11 variables. The third line prints information about the variables and their types.

      df = pd.read_csv("data_eng.csv")
print(df.shape)
print(df.info())
    

Output:

      (600, 11)

 <class 'pandas.core.frame.DataFrame'>
 RangeIndex: 600 entries, 0 to 599
 Data columns (total 11 columns):
 Marital_status     600 non-null object
 Dependents         593 non-null float64
 Is_graduate        594 non-null object
 Income             600 non-null int64
 Loan_amount        592 non-null float64
 Term_months        593 non-null float64
 Credit_score       600 non-null object
 approval_status    600 non-null int64
 Age                597 non-null float64
 Sex                599 non-null object
 Purpose            600 non-null object
 dtypes: float64(4), int64(2), object(5)
 memory usage: 51.6+ KB
 None
    

The above output shows that there are five categorical (object) and six numerical variables - four floats and two integers. It also shows that a few of the variables have fewer than 600 non-null observations, indicating the presence of missing values.

Missing Values

Most machine learning algorithms struggle with data that contains missing values. The line of code below prints the number of missing values in each variable.

      df.apply(lambda x: sum(x.isnull()),axis=0)
    

Output:

      Marital_status     0
 Dependents         7
 Is_graduate        6
 Income             0
 Loan_amount        8
 Term_months        7
 Credit_score       0
 approval_status    0
 Age                3
 Sex                1
 Purpose            0
 dtype: int64
    
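
As an aside, an equivalent and slightly more idiomatic way to get the same counts is to chain isnull() with sum() over the columns:

      # Same per-column missing-value counts, written as a method chain
      print(df.isnull().sum())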

We will treat the missing values as per the variable type.

Missing Values for Continuous Variables

The most widely used method for treating missing values in numeric variables is to replace them with a measure of central tendency, such as the mean or the median. The three lines of code below replace the missing values in the variables 'Loan_amount', 'Term_months', and 'Age', respectively, with the mean of those variables. The fourth line prints the number of missing values remaining in the data. The output confirms that the corrections have been made in these three variables.

      df['Loan_amount'].fillna(df['Loan_amount'].mean(), inplace=True)
df['Term_months'].fillna(df['Term_months'].mean(), inplace=True)
df['Age'].fillna(df['Age'].mean(), inplace=True)
df.apply(lambda x: sum(x.isnull()),axis=0)
    

Output:

      Marital_status     0
 Dependents         7
 Is_graduate        6
 Income             0
 Loan_amount        0
 Term_months        0
 Credit_score       0
 approval_status    0
 Age                0
 Sex                1
 Purpose            0
 dtype: int64
    
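
Note that the mean is sensitive to outliers, so for a heavily skewed variable the median is often the safer choice. A minimal sketch of the same step using the median (not applied here, since the mean has already been used above):

      # Alternative: impute with the median, which is more robust to extreme values
      median_loan = df['Loan_amount'].median()
      df['Loan_amount'] = df['Loan_amount'].fillna(median_loan)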

Missing Values for Categorical Variables

One of the categorical variables, 'Is_graduate', has six missing values. Since this is a categorical variable, we will replace the missing values with the mode of the distribution. The line of code below displays the frequency count of the variable.

      df['Is_graduate'].value_counts()
    

Output:

      Yes    466
 No     128
 Name: Is_graduate, dtype: int64
    

The label 'Yes' has the highest frequency (i.e., it is the mode) for the variable 'Is_graduate', so we will replace the missing values with the label 'Yes'. The first line of code below performs this task, while the second line prints the distribution of the variable. The output shows 600 records for the variable, which means the missing values have been replaced.

      df['Is_graduate'].fillna('Yes',inplace=True)
df['Is_graduate'].value_counts()
    

Output:

      Yes    472
 No     128
 Name: Is_graduate, dtype: int64
    
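
Instead of hard-coding the label, the mode can also be looked up programmatically, which generalizes to any categorical column. A minimal sketch, equivalent to the step above:

      # Generic version: fill a categorical column with its most frequent label
      mode_value = df['Is_graduate'].mode()[0]   # 'Yes' in this data
      df['Is_graduate'] = df['Is_graduate'].fillna(mode_value)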

Deleting the Missing Records

The missing values of the variable 'Is_graduate' have been replaced. However, the data still has seven missing values in the variable 'Dependents' and one missing value in the variable 'Sex'.

Since the number of missing values is small, we will use another approach: dropping the records that contain them. The first line of code below uses the dropna() function to drop rows with any missing values, while the second line prints information about the dataset. The output shows that all the variables now have the same number of observations and that the missing value treatment is complete.

      df = df.dropna()
df.info()
    

Output:

      <class 'pandas.core.frame.DataFrame'>
 Int64Index: 592 entries, 0 to 598
 Data columns (total 11 columns):
 Marital_status     592 non-null object
 Dependents         592 non-null float64
 Is_graduate        592 non-null object
 Income             592 non-null int64
 Loan_amount        592 non-null float64
 Term_months        592 non-null float64
 Credit_score       592 non-null object
 approval_status    592 non-null int64
 Age                592 non-null float64
 Sex                592 non-null object
 Purpose            592 non-null object
 dtypes: float64(4), int64(2), object(5)
 memory usage: 55.5+ KB
    
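
Note that dropna() removes a row if any of its columns is missing. When only specific columns should trigger the drop, the subset argument restricts the check; a minimal sketch:

      # Drop rows only when 'Dependents' or 'Sex' is missing, leaving other gaps untouched
      df = df.dropna(subset=['Dependents', 'Sex'])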

The data is now free from missing values, so we can move on to feature engineering. We will start with the categorical variables.

Feature Engineering for the Categorical Variables

Aggregating Categories

When a categorical variable has too many unique categories (or labels), it may limit the predictive power of the machine learning model. In such a scenario, reducing the number of labels can be a solution. In our data, the variable 'Purpose' has several categories, which can be displayed using the line of code below.

      df['Purpose'].value_counts()
    

Output:

      Education    185
 Personal     164
 Travel       123
 Business      43
 Wedding       39
 Furniture     38
 Name: Purpose, dtype: int64
    

Notice that the categories 'Business', 'Wedding', and 'Furniture' have significantly lower counts than the other three categories. Individually, these small categories are unlikely to add much power for predicting 'approval_status'. Therefore, we will aggregate them into a new category called 'other'.

The first two lines of code below use a Python dictionary to re-code the 'Purpose' categories into a smaller set of labels, while the third line prints the resulting frequency table.

      purpose_categories = {'Education':'Education', 'Personal':'Personal', 
                    'Travel':'Travel', 'Business':'other',
                    'Wedding':'other', 'Furniture':'other'}
df['Purpose'] = [purpose_categories[x] for x in df['Purpose']]
df['Purpose'].value_counts()
    

Output:

      Education    185
 Personal     164
 Travel       123
 other        120
 Name: Purpose, dtype: int64
    
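
When there are many rare labels, listing them by hand becomes tedious. A more general (hypothetical) approach is to collapse every label below a frequency threshold into 'other'; a minimal sketch, with the threshold of 50 chosen arbitrarily:

      # Collapse labels that occur fewer than 50 times into a single 'other' category
      counts = df['Purpose'].value_counts()
      rare_labels = counts[counts < 50].index.tolist()
      df['Purpose'] = df['Purpose'].replace(rare_labels, 'other')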

Encoding Categorical Variables

For modeling with scikit-learn, all the variables should be numeric, so we will have to encode the categorical labels. There are two common techniques for doing this.

One Hot Encoding

In this technique, the features are encoded using a one-hot ('dummy') encoding scheme, which creates a separate binary column for each category.

The first line of code below performs the one-hot encoding for the 'Sex' variable, while the second line prints the first five observations of the data.

      df = pd.get_dummies(df, columns=['Sex'],prefix='S')
df.head(5)
    

Output:

      |   	| Marital_status 	| Dependents 	| Is_graduate 	| Income 	| Loan_amount 	| Term_months 	| Credit_score      	| approval_status 	| Age  	| Purpose   	| S_F 	| S_M 	|
|---	|----------------	|------------	|-------------	|--------	|-------------	|-------------	|-------------------	|-----------------	|------	|-----------	|-----	|-----	|
| 0 	| Yes            	| 2.0        	| No          	| 816700 	| 104000.0    	| 365.079258  	| Satisfactory      	| 0               	| 55.0 	| Education 	| 0   	| 1   	|
| 1 	| No             	| 0.0        	| Yes         	| 722300 	| 79500.0     	| 365.079258  	| Not _satisfactory 	| 0               	| 32.0 	| Education 	| 1   	| 0   	|
| 2 	| No             	| 0.0        	| Yes         	| 666700 	| 84000.0     	| 365.079258  	| Satisfactory      	| 0               	| 59.0 	| Education 	| 1   	| 0   	|
| 3 	| No             	| 0.0        	| Yes         	| 500000 	| 63000.0     	| 365.079258  	| Satisfactory      	| 0               	| 36.0 	| Education 	| 0   	| 1   	|
| 4 	| No             	| 0.0        	| Yes         	| 409200 	| 39500.0     	| 504.000000  	| Satisfactory      	| 0               	| 42.0 	| Education 	| 0   	| 1   	|
    

The output shows that the variable 'Sex' has been replaced by two binary variables, 'S_F' and 'S_M'. However, the information is duplicated: the presence of the 'Female' category automatically implies the absence of the 'Male' category. This redundancy can be removed with a technique called dummy encoding.

Dummy Encoding

In this technique, the features are encoded so that there is no duplication of information. This is achieved by passing the argument drop_first=True to pd.get_dummies, as done in the first four lines of code below. The fifth line drops the redundant column 'S_M', while the sixth line prints information about the data.

      df = pd.get_dummies(df, columns=['Purpose'], drop_first=True, prefix='Pur')
df = pd.get_dummies(df, columns=['Marital_status'], drop_first=True, prefix='M')
df = pd.get_dummies(df, columns=['Is_graduate'], drop_first=True, prefix='Ed')
df = pd.get_dummies(df, columns=['Credit_score'], drop_first=True, prefix='CSc')
df = df.drop(['S_M'], axis=1)
df.info()
    

Output:

      <class 'pandas.core.frame.DataFrame'>
 Int64Index: 592 entries, 0 to 598
 Data columns (total 13 columns):
 Dependents          592 non-null float64
 Income              592 non-null int64
 Loan_amount         592 non-null float64
 Term_months         592 non-null float64
 approval_status     592 non-null int64
 Age                 592 non-null float64
 S_F                 592 non-null uint8
 Pur_Personal        592 non-null uint8
 Pur_Travel          592 non-null uint8
 Pur_other           592 non-null uint8
 M_Yes               592 non-null uint8
 Ed_Yes              592 non-null uint8
 CSc_Satisfactory    592 non-null uint8
 dtypes: float64(4), int64(2), uint8(7)
 memory usage: 36.4 KB
    

Feature Engineering for the Numerical Variables

Feature engineering for numerical variables requires a different strategy than for categorical features. The data has five numerical features - Dependents, Income, Loan_amount, Term_months, and Age. In the subsequent sections, we will learn about various techniques for handling numerical variables.

Handling Extreme Values

Modeling numerical features can be difficult in the presence of extreme values, or outliers, which are observations that are significantly different from the other data points. For numerical variables, we can quantify the presence of outliers through the skewness value. The lines of code below print the skewness value along with the summary statistics for the 'Income' variable.

      print(df['Income'].skew())
df['Income'].describe()
    

Output:

      5.32574034245

 count    5.920000e+02
 mean     7.105902e+05
 std      7.146531e+05
 min      3.000000e+04
 25%      3.865250e+05
 50%      5.099500e+05
 75%      7.713500e+05
 max      8.444900e+06
 Name: Income, dtype: float64
    

The output above shows that the variable 'Income' has a right-skewed distribution with a skewness value of 5.3. Ideally, the skewness value should be between -1 and 1. There are many techniques for handling such extreme values, one of which is quantile-based capping or flooring. Since the variable 'Income' is right skewed, we will cap the higher values.

The first line of code below prints the 90th percentile of the variable, which comes out to be approximately USD 1,247,210. The second line of code performs the capping, while the third line prints the skewness value after the capping.

      print(df['Income'].quantile(0.90))
df["Income"] = np.where(df["Income"] >1247209, 1247209,df['Income'])
print(df['Income'].skew())
    

Output:

      1247209.9999999998
 
 0.894679717523
    

The skewness value is now 0.89, which is much better.
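
If several variables needed the same treatment, it can be convenient to wrap the capping step in a small helper function. A minimal sketch, with the 0.90 quantile as an arbitrary default:

      # Reusable helper: cap values above a chosen quantile at that quantile
      def cap_upper(series, q=0.90):
          cap_value = series.quantile(q)
          return np.where(series > cap_value, cap_value, series)

      # Example usage (equivalent to the capping above): df['Income'] = cap_upper(df['Income'])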

Creating New Features

Creating new features from two or more existing features can lead to an improvement in the predictive power of a machine learning algorithm. These new features are often referred to as interaction terms.

In the loan application data, the loan-to-income ratio can be an important metric for approving applicants. Another example is the loan amount relative to the loan tenure. The lines of code below create two new features, 'loan_income_ratio' and 'loan_term_ratio', which capture the interaction between the loan amount, the applicant's income, and the loan term.

      ### Creating new features
df['loan_income_ratio'] = df['Loan_amount']/df['Income']
df['loan_term_ratio'] = df['Loan_amount']/df['Term_months']
    
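
One caveat with ratio features is division by zero. Income and Term_months are strictly positive in this data, but if a denominator could contain zeros, it is safer to guard the division; a minimal sketch:

      # Defensive version: a zero denominator becomes NaN instead of producing inf
      df['loan_income_ratio'] = df['Loan_amount'] / df['Income'].replace(0, np.nan)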

Data Scaling and Transformations

Let us now look at the summary statistics of the variables which can be achieved with the line of code below.

      df.describe()
    

Output:

      |       	| Dependents 	| Income       	| Loan_amount  	| Term_months 	| approval_status 	| Age        	| S_F        	| Pur_Personal 	| Pur_Travel 	| Pur_other  	| M_Yes      	| Ed_Yes     	| CSc_Satisfactory 	| loan_income_ratio 	| loan_term_ratio 	|
|-------	|------------	|--------------	|--------------	|-------------	|-----------------	|------------	|------------	|--------------	|------------	|------------	|------------	|------------	|------------------	|-------------------	|-----------------	|
| count 	| 592.000000 	| 5.920000e+02 	| 5.920000e+02 	| 592.000000  	| 592.000000      	| 592.000000 	| 592.000000 	| 592.000000   	| 592.000000 	| 592.000000 	| 592.000000 	| 592.000000 	| 592.000000       	| 592.000000        	| 592.000000      	|
| mean  	| 0.753378   	| 6.076007e+05 	| 3.311038e+05 	| 364.956884  	| 0.690878        	| 49.677365  	| 0.182432   	| 0.277027     	| 0.207770   	| 0.202703   	| 0.653716   	| 0.787162   	| 0.788851         	| 0.471830          	| 1029.732920     	|
| std   	| 1.029179   	| 3.067529e+05 	| 7.277030e+05 	| 61.877355   	| 0.462522        	| 14.427056  	| 0.386527   	| 0.447908     	| 0.406055   	| 0.402353   	| 0.476187   	| 0.409660   	| 0.408469         	| 0.996969          	| 2647.188471     	|
| min   	| 0.000000   	| 3.000000e+04 	| 1.860000e+04 	| 36.000000   	| 0.000000        	| 22.000000  	| 0.000000   	| 0.000000     	| 0.000000   	| 0.000000   	| 0.000000   	| 0.000000   	| 0.000000         	| 0.086564          	| 48.437500       	|
| 25%   	| 0.000000   	| 3.865250e+05 	| 6.200000e+04 	| 384.000000  	| 0.000000        	| 37.000000  	| 0.000000   	| 0.000000     	| 0.000000   	| 0.000000   	| 0.000000   	| 1.000000   	| 1.000000         	| 0.122285          	| 166.341146      	|
| 50%   	| 0.000000   	| 5.099500e+05 	| 7.800000e+04 	| 384.000000  	| 1.000000        	| 51.000000  	| 0.000000   	| 0.000000     	| 0.000000   	| 0.000000   	| 1.000000   	| 1.000000   	| 1.000000         	| 0.162164          	| 217.604425      	|
| 75%   	| 1.000000   	| 7.713500e+05 	| 1.415000e+05 	| 384.000000  	| 1.000000        	| 61.000000  	| 0.000000   	| 1.000000     	| 0.000000   	| 0.000000   	| 1.000000   	| 1.000000   	| 1.000000         	| 0.243491          	| 423.023897      	|
| max   	| 6.000000   	| 1.247209e+06 	| 7.780000e+06 	| 504.000000  	| 1.000000        	| 76.000000  	| 1.000000   	| 1.000000     	| 1.000000   	| 1.000000   	| 1.000000   	| 1.000000   	| 1.000000         	| 10.930828         	| 32696.078431    	|
    

We have seven numerical features - Dependents, Income, Loan_amount, Term_months, Age, loan_income_ratio, and loan_term_ratio. However, most of them have different units and scales, such as 'Age' in years and 'Income' in dollars. These differences can unduly influence the model, and therefore we need to scale or transform the features. There are several techniques to achieve this, and some of the popular ones are discussed in the subsequent sections.

Min-max Scaling

In this approach, the data is scaled to a fixed range - usually 0 to 1. The impact is that we end up with smaller standard deviations, which can suppress the effect of outliers.

The first line of code below imports 'MinMaxScaler' from the 'sklearn.preprocessing' module. The second line instantiates 'MinMaxScaler()', the third line fits the scaler on the 'Age' column, and the fourth line transforms the 'Age' column with the fitted scaler. The fifth line prints the summary statistics of the new 'scaled_age' variable. The output shows that the variable has been scaled to between zero and one.

      from sklearn.preprocessing import MinMaxScaler 
minmaxscaler = MinMaxScaler() 
minmaxscaler.fit(df[['Age']]) 
df['scaled_age'] = minmaxscaler.transform(df[['Age']]) 
df['scaled_age'].describe()
    

Output:

      count    592.000000
 mean       0.512544
 std        0.267168
 min        0.000000
 25%        0.277778
 50%        0.537037
 75%        0.722222
 max        1.000000
 Name: scaled_age, dtype: float64
    
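
Under the hood, min-max scaling simply applies (x - min) / (max - min). The same result can be reproduced by hand as a quick sanity check:

      # Manual min-max scaling, for comparison with the fitted scaler
      age_min, age_max = df['Age'].min(), df['Age'].max()
      manual_scaled_age = (df['Age'] - age_min) / (age_max - age_min)
      print(manual_scaled_age.describe())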

Standardization

Another technique is standardization, in which all the features are centered around zero and have, roughly, unit variance.

The first line of code below imports 'StandardScaler' from the 'sklearn.preprocessing' module. The second line standardizes the four variables 'Loan_amount', 'Term_months', 'loan_income_ratio', and 'loan_term_ratio'. Finally, the third line prints the variance of the standardized variables.

      ### Standardization
from sklearn.preprocessing import StandardScaler
df[['Loan_amount', 'Term_months', 'loan_income_ratio', 'loan_term_ratio']] = StandardScaler().fit_transform(df[['Loan_amount', 'Term_months','loan_income_ratio', 'loan_term_ratio']])
print(df['Loan_amount'].var()); print(df['Term_months'].var()); print(df['loan_income_ratio'].var()); print(df['loan_term_ratio'].var())
    

Output:

      1.0016920473773263
 1.0016920473773332
 1.0016920473773272
 1.0016920473773252
    

The output shows that all the standardized variables now have (approximately) unit variance. The values are 1.0017 rather than exactly 1.0 because StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' .var() reports the sample variance (ddof=1); the small difference is the factor n/(n-1) = 592/591 ≈ 1.0017.
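
A quick check makes this explicit: StandardScaler normalizes by the population standard deviation, so the population variance of each scaled column is 1 up to floating-point precision.

      # Population variance (ddof=0), which StandardScaler targets, is 1
      print(df['Loan_amount'].var(ddof=0))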

Log Transformation

Earlier in the guide, we did quantile capping of the income variable. We could also have applied a logarithmic transformation to treat the extreme values. This is done with the line of code below, which creates a new feature, 'LogIncome'.

      df["LogIncome"] = df["Income"].map(lambda i: np.log(i) if i > 0 else 0)
    
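
A closely related option is np.log1p, which computes log(1 + x) and therefore handles zero values without the explicit guard. Either way, it is worth checking how much the transformation reduces the skew:

      # Alternative transformation that copes with zero values directly
      log1p_income = np.log1p(df['Income'])

      # Compare the skew of the two transformed versions
      print(df['LogIncome'].skew())
      print(log1p_income.skew())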

Binning Numeric Variables

Another feature engineering technique is binning, in which a numeric variable is grouped into categories according to the distribution of the data. The line of code below prints the summary statistics of the variable 'Dependents'.

      df['Dependents'].describe()
    

Output:

      count    592.000000
 mean       0.753378
 std        1.029179
 min        0.000000
 25%        0.000000
 50%        0.000000
 75%        1.000000
 max        6.000000
 Name: Dependents, dtype: float64
    

The 'Dependents' variable ranges from zero to six, but most applicants have zero or one dependent. We will group this variable into three bins, which is done in the line of code below.

      df['Binned_Dependents'] = pd.cut(df['Dependents'], bins=[-np.inf, 0, 2, np.inf], labels=[0, '1-2', '>2'])
    
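
It is worth checking how the observations are distributed across the new bins (the exact counts depend on the data):

      # Frequency of each dependent bin
      print(df['Binned_Dependents'].value_counts())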

We have done a lot of data preparation and feature engineering. As a result, certain variables are now redundant and should be dropped from the data. The first line of code below drops three such variables, while the second line prints the shape of the new data. The third line displays the first five observations of the data, which has 592 observations and 15 variables.

      df_new = df.drop(['Dependents', 'Income', 'Age'], axis=1)
print(df_new.shape)
df_new.head(5)
    

Output:

      (592, 15)


|   	| Loan_amount 	| Term_months 	| approval_status 	| S_F 	| Pur_Personal 	| Pur_Travel 	| Pur_other 	| M_Yes 	| Ed_Yes 	| CSc_Satisfactory 	| loan_income_ratio 	| loan_term_ratio 	| scaled_age 	| LogIncome 	| Binned_Dependents 	|
|---	|-------------	|-------------	|-----------------	|-----	|--------------	|------------	|-----------	|-------	|--------	|------------------	|-------------------	|-----------------	|------------	|-----------	|-------------------	|
| 0 	| -0.312347   	| 0.001979    	| 0               	| 0   	| 0            	| 0          	| 0         	| 1     	| 0      	| 1                	| -0.345828         	| -0.281617       	| 0.611111   	| 13.613027 	| 1-2               	|
| 1 	| -0.346043   	| 0.001979    	| 0               	| 1   	| 0            	| 0          	| 0         	| 0     	| 1      	| 0                	| -0.363171         	| -0.306989       	| 0.185185   	| 13.490196 	| 0                 	|
| 2 	| -0.339854   	| 0.001979    	| 0               	| 1   	| 0            	| 0          	| 0         	| 0     	| 1      	| 1                	| -0.347181         	| -0.302329       	| 0.685185   	| 13.410095 	| 0                 	|
| 3 	| -0.368736   	| 0.001979    	| 0               	| 0   	| 0            	| 0          	| 0         	| 0     	| 1      	| 1                	| -0.347174         	| -0.324077       	| 0.259259   	| 13.122363 	| 0                 	|
| 4 	| -0.401057   	| 2.248976    	| 0               	| 0   	| 0            	| 0          	| 0         	| 0     	| 1      	| 1                	| -0.376759         	| -0.359689       	| 0.370370   	| 12.921959 	| 0                 	|
    

Conclusion

In this guide, you have learned about the most popular data preparation and feature engineering techniques. The choice of technique depends on the data, the problem statement, and the machine learning algorithm selected for building the model.

To learn more about data preparation and building machine learning models using Python's 'scikit-learn' library, please refer to the following guides:

  1. Scikit Machine Learning
  2. Linear, Lasso, and Ridge Regression with scikit-learn
  3. Non-Linear Regression Trees with scikit-learn
  4. Machine Learning with Neural Networks Using scikit-learn
  5. Validating Machine Learning Models with scikit-learn
  6. Ensemble Modeling with scikit-learn
  7. Preparing Data for Modeling with scikit-learn
  8. Interpreting Data Using Descriptive Statistics with Python
Deepika Singh