Introduction


Statistics help uncover many important underlying patterns in data. Statistical models concisely summarize the relationships between variables and support inferences about them. Predictive modeling is often incomplete without an understanding of these relationships.

In this guide, you will learn how to fit and analyze statistical models on quantitative (linear regression) and qualitative (logistic regression) target variables. We will use the Statsmodels library for statistical modeling. We begin by importing the libraries that we will need.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
```

For building linear regression models, we will be using the fictitious data of loan applicants containing 600 observations and 10 variables. Out of the ten variables, we will be using the following six variables:

Dependents: Number of dependents of the applicant.

Is_graduate: Whether the applicant is a graduate ("1") or not ("0").

Loan_amount: Loan amount (in USD) for which the application was submitted.

Term_months: Tenure of the loan (in months).

Age: The applicant’s age in years.

Income: Annual Income of the applicant (in USD). This is the dependent variable.

For building logistic regression models, we will be using the diabetes dataset which contains 768 observations and 9 variables, as described below:

pregnancies: Number of times pregnant.

glucose: Plasma glucose concentration.

diastolic: Diastolic blood pressure (mm Hg).

triceps: Skinfold thickness (mm).

insulin: 2-hour serum insulin (mu U/ml).

bmi: Body mass index (weight in kg/(height in m)^2).

dpf: Diabetes pedigree function.

age: Age in years.

diabetes: '1' represents the presence of diabetes while '0' represents the absence of it. This is the target variable.

Linear regression models predict a continuous label. The goal is to produce a model that represents the 'best fit' to some observed data, according to an evaluation criterion we choose. Good examples are predicting the price of a house, the sales of a retail store, or the life expectancy of an individual. Linear regression models assume a linear relationship between the independent variables and the dependent variable.

Let us start by loading the data. The *first line of code* reads in the data as a pandas DataFrame, while the *second line* prints the shape of the data. The *third line* prints the first five observations of the data. We will try to predict 'Income' based on the other variables.

```python
# Load data
df = pd.read_csv("data_smodel.csv")
print(df.shape)
df.head(5)
```

Output:

```
(600, 10)

|   | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age | Sex |
|---|----------------|------------|-------------|--------|-------------|-------------|--------------|-----------------|-----|-----|
| 0 | 0              | 0          | 0           | 362700 | 44500       | 384         | 0            | 0               | 55  | 1   |
| 1 | 0              | 3          | 0           | 244000 | 70000       | 384         | 0            | 0               | 30  | 1   |
| 2 | 1              | 0          | 0           | 286500 | 99000       | 384         | 0            | 0               | 32  | 1   |
| 3 | 0              | 0          | 1           | 285100 | 55000       | 384         | 0            | 0               | 68  | 1   |
| 4 | 0              | 0          | 1           | 320000 | 58000       | 384         | 0            | 0               | 53  | 1   |
```

We will start with a simple linear regression model with only one covariate, 'Loan_amount', predicting 'Income'. The lines of code below fit the univariate linear regression model and print a summary of the result.

```python
model_lin = sm.OLS.from_formula("Income ~ Loan_amount", data=df)
result_lin = model_lin.fit()
result_lin.summary()
```

Output:

```
| Dep. Variable:    | Income           | R-squared:          | 0.587     |
|-------------------|------------------|---------------------|-----------|
| Model:            | OLS              | Adj. R-squared:     | 0.587     |
| Method:           | Least Squares    | F-statistic:        | 851.4     |
| Date:             | Fri, 26 Jul 2019 | Prob (F-statistic): | 4.60e-117 |
| Time:             | 22:02:50         | Log-Likelihood:     | -8670.3   |
| No. Observations: | 600              | AIC:                | 1.734e+04 |
| Df Residuals:     | 598              | BIC:                | 1.735e+04 |
| Df Model:         | 1                |                     |           |
| Covariance Type:  | nonrobust        |                     |           |

|             | coef      | std err  | t      | P>|t| | [0.025   | 0.975]   |
|-------------|-----------|----------|--------|-------|----------|----------|
| Intercept   | 4.618e+05 | 2.05e+04 | 22.576 | 0.000 | 4.22e+05 | 5.02e+05 |
| Loan_amount | 0.7528    | 0.026    | 29.180 | 0.000 | 0.702    | 0.803    |

| Omnibus:       | 459.463 | Durbin-Watson:    | 1.955     |
|----------------|---------|-------------------|-----------|
| Prob(Omnibus): | 0.000   | Jarque-Bera (JB): | 10171.070 |
| Skew:          | 3.186   | Prob(JB):         | 0.00      |
| Kurtosis:      | 22.137  | Cond. No.         | 8.69e+05  |
```

The central section of the output, where the header begins with **coef**, is important for model interpretation. The fitted model implies that, when comparing two applicants whose 'Loan_amount' differs by one unit, the applicant with the higher 'Loan_amount' will, on average, have an 'Income' that is 0.75 units higher. This difference is statistically significant, because the p-value, shown under the column labeled **P>|t|**, is reported as 0.000, which is below the conventional 0.05 threshold.

Another measure of how well the model fits is the **R-squared** value, which represents the proportion of the variation in the dependent variable (Income) that is explained by the independent variable (Loan_amount). The higher the value, the more of the variation the model explains, with the highest possible value being one. In our case, the R-squared value of 0.587 means that about 59% of the variation in 'Income' is explained by 'Loan_amount'.
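To make the definition concrete, here is a minimal pure-Python sketch that fits a least-squares line and computes R-squared as one minus the ratio of the residual sum of squares to the total sum of squares. The toy `x` and `y` values and the helper names `fit_simple_ols` and `r_squared` are assumptions made for illustration; they are not part of the loan dataset or of Statsmodels.

```python
# Toy data (hypothetical values, not taken from the loan dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    intercept, slope = fit_simple_ols(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


print(round(r_squared(x, y), 4))
```

A value close to one means the fitted line leaves very little of the variation in `y` unexplained.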

The Pearson correlation coefficient is another indicator of the direction and strength of the linear relationship between two variables. The lines of code below calculate and print the correlation coefficient, which comes out to be 0.766. This indicates a strong positive correlation between the two variables, the maximum possible value being one.

```python
cc = df[["Income", "Loan_amount"]].corr()
print(cc)
```

Output:

```
              Income  Loan_amount
Income       1.00000      0.76644
Loan_amount  0.76644      1.00000
```
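In a simple linear regression with one predictor and an intercept, R-squared equals the square of the Pearson correlation between the predictor and the response. The short check below uses only the two numbers reported in this guide; the variable names are illustrative.

```python
r = 0.76644                  # Pearson correlation from the output above
r_squared_reported = 0.587   # R-squared from the simple regression summary

r_squared_from_r = r ** 2
print(round(r_squared_from_r, 3))  # 0.587
```

The two numbers agree to three decimal places, which is why the correlation and the univariate R-squared tell the same story here.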

In the previous section, we covered simple linear regression using one variable. However, in real-world cases, we usually deal with multiple variables. A model with several predictors is called multiple linear regression (it is often loosely referred to as multivariate regression). In our case, we will build the model using five independent variables.

The lines of code below fit the multiple linear regression model and print the result summary. Note that the syntax `Income ~ Loan_amount + Age + Term_months + Dependents + Is_graduate` does not mean that these five variables are literally added together. Instead, it only means that these variables are included in the model as predictors of the variable 'Income'.

```python
model_lin = sm.OLS.from_formula("Income ~ Loan_amount + Age + Term_months + Dependents + Is_graduate", data=df)
result_lin = model_lin.fit()
result_lin.summary()
```

Output:

```
| Dep. Variable:    | Income           | R-squared:          | 0.595     |
|-------------------|------------------|---------------------|-----------|
| Model:            | OLS              | Adj. R-squared:     | 0.592     |
| Method:           | Least Squares    | F-statistic:        | 174.7     |
| Date:             | Fri, 26 Jul 2019 | Prob (F-statistic): | 3.95e-114 |
| Time:             | 22:04:27         | Log-Likelihood:     | -8664.6   |
| No. Observations: | 600              | AIC:                | 1.734e+04 |
| Df Residuals:     | 594              | BIC:                | 1.737e+04 |
| Df Model:         | 5                |                     |           |
| Covariance Type:  | nonrobust        |                     |           |

|             | coef      | std err  | t      | P>|t| | [0.025    | 0.975]   |
|-------------|-----------|----------|--------|-------|-----------|----------|
| Intercept   | 2.68e+05  | 1.32e+05 | 2.029  | 0.043 | 8575.090  | 5.27e+05 |
| Loan_amount | 0.7489    | 0.026    | 28.567 | 0.000 | 0.697     | 0.800    |
| Age         | -856.1704 | 1265.989 | -0.676 | 0.499 | -3342.530 | 1630.189 |
| Term_months | 338.6069  | 295.449  | 1.146  | 0.252 | -241.644  | 918.858  |
| Dependents  | 8437.9050 | 1.84e+04 | 0.460  | 0.646 | -2.76e+04 | 4.45e+04 |
| Is_graduate | 1.365e+05 | 4.56e+04 | 2.995  | 0.003 | 4.7e+04   | 2.26e+05 |

| Omnibus:       | 460.035 | Durbin-Watson:    | 1.998     |
|----------------|---------|-------------------|-----------|
| Prob(Omnibus): | 0.000   | Jarque-Bera (JB): | 10641.667 |
| Skew:          | 3.173   | Prob(JB):         | 0.00      |
| Kurtosis:      | 22.631  | Cond. No.         | 5.66e+06  |
```

The output above shows that, when the other variables remain constant, if we compare two applicants whose 'Loan_amount' differ by one unit, the applicant with higher 'Loan_amount' will, on average, have 0.75 units higher 'Income'.
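The "other variables remain constant" reading can be checked by hand. The sketch below plugs the rounded coefficients from the summary into the linear predictor and compares two hypothetical applicants who differ only in 'Loan_amount'. The applicant values and the helper `predict_income` are assumptions for illustration, not output from the fitted Statsmodels result.

```python
# Coefficients copied (rounded) from the summary table above.
COEF = {
    "Intercept": 2.68e5,
    "Loan_amount": 0.7489,
    "Age": -856.1704,
    "Term_months": 338.6069,
    "Dependents": 8437.9050,
    "Is_graduate": 1.365e5,
}


def predict_income(applicant):
    """Linear predictor: intercept plus coefficient * value for each variable."""
    return COEF["Intercept"] + sum(
        COEF[name] * value for name, value in applicant.items()
    )


# Two hypothetical applicants who differ only in Loan_amount (by 10,000 USD).
a = {"Loan_amount": 100000, "Age": 40, "Term_months": 384,
     "Dependents": 1, "Is_graduate": 1}
b = dict(a, Loan_amount=110000)

difference = predict_income(b) - predict_income(a)
print(round(difference, 2))
```

Because every other term cancels, the difference depends only on the 'Loan_amount' coefficient times the 10,000 USD gap.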

Using the **P>|t|** column, we can infer that 'Loan_amount' and 'Is_graduate' are the two statistically significant variables, as their p-values are less than 0.05.

Whenever a categorical variable is used as a covariate in a regression model, one level of the variable is omitted and is automatically given a coefficient of zero. This level is called the reference level of the covariate. In the model above, 'Is_graduate' is a categorical variable, and only the coefficient for 'Graduate' applicants is included in the regression output, while 'Not Graduate' is the reference level.
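The reference-level idea can be sketched with simple indicator coding. The snippet below assumes the labels 'Graduate' and 'Not Graduate' map to 1 and 0, matching the description of 'Is_graduate' in this guide; the helper names are hypothetical.

```python
IS_GRADUATE_COEF = 1.365e5  # coefficient from the summary table above


def encode_graduate(label):
    """Indicator coding: reference level 'Not Graduate' -> 0, 'Graduate' -> 1."""
    levels = {"Not Graduate": 0, "Graduate": 1}
    return levels[label]


def graduate_contribution(label):
    """Amount the Is_graduate term adds to the predicted Income."""
    return IS_GRADUATE_COEF * encode_graduate(label)


print(graduate_contribution("Not Graduate"))
print(graduate_contribution("Graduate"))
```

The reference level contributes nothing, so the reported coefficient is read as the average difference in 'Income' between graduates and non-graduates, with the other predictors held fixed.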

The **R-squared** value increased marginally from 0.587 to 0.595, which means that 59.5% of the variation in 'Income' is now explained by the five independent variables, compared to 58.7% earlier. R-squared never decreases when predictors are added, so the adjusted R-squared, which rose from 0.587 to 0.592, is the fairer comparison. The small gain is plausibly due to the inclusion of the 'Is_graduate' variable, which is also statistically significant.
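The adjusted R-squared penalizes R-squared for the number of predictors, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below applies it to the rounded R-squared values from the two summaries; because the inputs are rounded, the last digit may differ slightly from what Statsmodels reports. The function name is an illustrative assumption.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors covariates."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Rounded R-squared values and sample size taken from the two summaries.
adj_one = adjusted_r_squared(0.587, n_obs=600, n_predictors=1)
adj_five = adjusted_r_squared(0.595, n_obs=600, n_predictors=5)

print(round(adj_one, 4), round(adj_five, 4))
```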

Logistic Regression is a type of generalized linear model which is used for classification problems. The goal is to predict a categorical outcome, such as predicting whether a customer will churn or not, or whether a bank loan will default or not.

In this guide, we will build statistical models for predicting a binary outcome, meaning an outcome that can take only two distinct values. Let us start by loading the data. The *first line of code* reads in the data as a pandas DataFrame, while the *second line* prints the shape of the data. The *third line* prints the first five observations. We will try to predict 'diabetes' based on the other variables.

```python
df2 = pd.read_csv("diabetes.csv")
print(df2.shape)
df2.head(5)
```

Output:

```
(768, 9)

|   | pregnancies | glucose | diastolic | triceps | insulin | bmi  | dpf   | age | diabetes |
|---|-------------|---------|-----------|---------|---------|------|-------|-----|----------|
| 0 | 6           | 148     | 72        | 35      | 0       | 33.6 | 0.627 | 50  | 1        |
| 1 | 1           | 85      | 66        | 29      | 0       | 26.6 | 0.351 | 31  | 0        |
| 2 | 8           | 183     | 64        | 0       | 0       | 23.3 | 0.672 | 32  | 1        |
| 3 | 1           | 89      | 66        | 23      | 94      | 28.1 | 0.167 | 21  | 0        |
| 4 | 0           | 137     | 40        | 35      | 168     | 43.1 | 2.288 | 33  | 1        |
```

We will start with a basic logistic regression model with only one covariate, 'age', predicting 'diabetes'. The lines of code below fit the univariate logistic regression model and print the result summary.

```python
model = sm.GLM.from_formula("diabetes ~ age", family=sm.families.Binomial(), data=df2)
result = model.fit()
result.summary()
```

Output:

```
| Dep. Variable:  | diabetes         | No. Observations: | 768     |
|-----------------|------------------|-------------------|---------|
| Model:          | GLM              | Df Residuals:     | 766     |
| Model Family:   | Binomial         | Df Model:         | 1       |
| Link Function:  | logit            | Scale:            | 1.0     |
| Method:         | IRLS             | Log-Likelihood:   | -475.36 |
| Date:           | Fri, 26 Jul 2019 | Deviance:         | 950.72  |
| Time:           | 22:08:35         | Pearson chi2:     | 761.    |
| No. Iterations: | 4                |                   |         |

|           | coef    | std err | z      | P>|z| | [0.025 | 0.975] |
|-----------|---------|---------|--------|-------|--------|--------|
| Intercept | -2.0475 | 0.239   | -8.572 | 0.000 | -2.516 | -1.579 |
| age       | 0.0420  | 0.007   | 6.380  | 0.000 | 0.029  | 0.055  |
```

Using the **P>|z|** column above, we can conclude that the variable 'age' is a statistically significant predictor of 'diabetes', as its p-value is less than 0.05.
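To see what these coefficients imply, the log odds can be converted into a probability with the logistic function. The sketch below uses the rounded intercept and 'age' coefficient from the summary; the helper `predicted_probability` and the example ages are assumptions for illustration rather than calls into the fitted Statsmodels result.

```python
import math

# Rounded coefficients from the univariate logistic regression summary.
INTERCEPT = -2.0475
AGE_COEF = 0.0420


def predicted_probability(age):
    """Convert the linear predictor (log odds) into a probability."""
    log_odds = INTERCEPT + AGE_COEF * age
    return 1.0 / (1.0 + math.exp(-log_odds))


# Multiplicative change in the odds of diabetes for each additional year of age.
odds_ratio_per_year = math.exp(AGE_COEF)

for age in (25, 50, 75):
    print(age, round(predicted_probability(age), 3))
print(round(odds_ratio_per_year, 4))
```

Under this univariate model, each additional year of age multiplies the odds of diabetes by roughly 1.04.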

As with linear regression, we can also include multiple variables in a logistic regression model. Below we fit a logistic regression for 'diabetes' using all the other variables.

```python
model = sm.GLM.from_formula("diabetes ~ age + pregnancies + glucose + triceps + diastolic + insulin + bmi + dpf", family=sm.families.Binomial(), data=df2)
result = model.fit()
result.summary()
```

Output:

```
| Dep. Variable:  | diabetes         | No. Observations: | 768     |
|-----------------|------------------|-------------------|---------|
| Model:          | GLM              | Df Residuals:     | 759     |
| Model Family:   | Binomial         | Df Model:         | 8       |
| Link Function:  | logit            | Scale:            | 1.0     |
| Method:         | IRLS             | Log-Likelihood:   | -361.72 |
| Date:           | Fri, 26 Jul 2019 | Deviance:         | 723.45  |
| Time:           | 22:08:49         | Pearson chi2:     | 836.    |
| No. Iterations: | 5                |                   |         |

|             | coef    | std err | z       | P>|z| | [0.025 | 0.975] |
|-------------|---------|---------|---------|-------|--------|--------|
| Intercept   | -8.4047 | 0.717   | -11.728 | 0.000 | -9.809 | -7.000 |
| age         | 0.0149  | 0.009   | 1.593   | 0.111 | -0.003 | 0.033  |
| pregnancies | 0.1232  | 0.032   | 3.840   | 0.000 | 0.060  | 0.186  |
| glucose     | 0.0352  | 0.004   | 9.481   | 0.000 | 0.028  | 0.042  |
| triceps     | 0.0006  | 0.007   | 0.090   | 0.929 | -0.013 | 0.014  |
| diastolic   | -0.0133 | 0.005   | -2.540  | 0.011 | -0.024 | -0.003 |
| insulin     | -0.0012 | 0.001   | -1.322  | 0.186 | -0.003 | 0.001  |
| bmi         | 0.0897  | 0.015   | 5.945   | 0.000 | 0.060  | 0.119  |
| dpf         | 0.9452  | 0.299   | 3.160   | 0.002 | 0.359  | 1.531  |
```

The above output shows that adding other variables to the model leads to a big shift in the 'age' parameter: its coefficient dropped from 0.0420 to 0.0149, and its p-value rose from 0.000 to 0.111, above the significance level of 0.05. Shifts like this are common when variables are added to or removed from a model, because predictors are often correlated with each other.

Looking at the p-values, the variables 'age', 'triceps', and 'insulin' are not statistically significant at the 0.05 level. All the other variables have p-values smaller than 0.05 and are, therefore, significant.
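The p-values in the **P>|z|** column can be approximately reproduced from the z statistics, assuming a two-sided test against the standard normal distribution. The sketch below uses only the standard library; the helper name `two_sided_p_value` is hypothetical, and the z values are the rounded ones from the table, so small differences in the last digit are expected.

```python
import math


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Rounded z statistics copied from the summary table above.
z_scores = {"age": 1.593, "triceps": 0.090, "insulin": -1.322, "glucose": 9.481}

for name, z in z_scores.items():
    p = two_sided_p_value(z)
    verdict = "significant" if p < 0.05 else "not significant"
    print(name, round(p, 3), verdict)
```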

Logistic models are interpreted differently, because the coefficients are expressed on the logit (log odds) scale. In simple terms, for the output above, the log odds of 'diabetes' increase by about 0.09 for each one-unit increase in 'bmi', by about 0.04 for each one-unit increase in 'glucose', and so on, holding the other variables constant.

As with linear regression, the roles of 'bmi' and 'glucose' in the logistic regression model are additive, but here the additivity is on the scale of log odds, not odds or probabilities.
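Additivity on the log odds scale becomes multiplication on the odds scale. The short sketch below exponentiates the rounded 'bmi' and 'glucose' coefficients from the summary to illustrate this; the variable names are illustrative assumptions.

```python
import math

# Rounded coefficients from the multiple logistic regression summary.
BMI_COEF = 0.0897
GLUCOSE_COEF = 0.0352

# Odds ratios: multiplicative change in the odds for a one-unit increase.
bmi_odds_ratio = math.exp(BMI_COEF)
glucose_odds_ratio = math.exp(GLUCOSE_COEF)

# Raising both by one unit adds the coefficients on the log odds scale...
log_odds_change = BMI_COEF + GLUCOSE_COEF
# ...which is the same as multiplying the two odds ratios.
combined_from_sum = math.exp(log_odds_change)
combined_from_product = bmi_odds_ratio * glucose_odds_ratio

print(round(bmi_odds_ratio, 4), round(glucose_odds_ratio, 4))
print(round(combined_from_sum, 4), round(combined_from_product, 4))
```

The effect on the predicted probability is not constant, since it depends on where on the logistic curve an individual starts.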

In this guide, you have learned about interpreting data using statistical models. You also learned how to use the Statsmodels library to build linear and logistic regression models, with one predictor as well as with several. Finally, you learned how to interpret the model output to infer relationships and determine the significant predictor variables.

To learn more about data preparation and building machine learning models using Python's 'scikit-learn' library, please refer to the following guides:
