Introduction


Statistical models are useful not only in machine learning, but also in interpreting data and understanding the relationships between the variables. In this guide, the reader will learn how to fit and analyze statistical models on the quantitative (linear regression) and qualitative (logistic regression) target variables. The reader will also learn how to create and interpret the correlation matrix of the numerical variables.

We will begin by understanding the data.

In this guide, we will be using the fictitious data of loan applicants containing 600 observations and nine variables, as described below:

Marital_status: Whether the applicant is married ("Yes") or not ("No").

Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").

Income: Annual Income of the applicant (in USD).

Loan_amount: Loan amount (in USD) for which the application was submitted.

Credit_score: Whether the applicant's credit score is good ("Good") or not ("Bad").

Age: The applicant’s age in years.

Sex: Whether the applicant is female (F) or male (M).

approval_status: Whether the loan application was approved ("Yes") or not ("No").

Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.

Let us start by loading the required libraries and the data.

```r
# Packages used in this guide
library(readr)
library(dplyr)
library(mlbench)

# Read the loan data and inspect its structure
dat <- read_csv("data_r.csv")
glimpse(dat)
```

Output:

```
Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes...
$ Is_graduate     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No...
$ Income          <int> 586700, 426700, 735500, 327200, 240000, 683200, 800000, 4...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61000...
$ Credit_score    <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "...
$ approval_status <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No...
$ Age             <int> 76, 76, 75, 75, 75, 74, 72, 72, 71, 71, 71, 70, 70, 69, 6...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9568...
```

The output shows that the dataset has five categorical variables (labelled as 'chr') while the remaining four are numerical variables (labelled as 'int').

Regression models predict a continuous target. Linear regression is a type of regression model that assumes a linear relationship between the target and the predictor variables.

Simple linear regression is the simplest form of regression, using only one covariate to predict the target variable. In our case, 'Investment' is the covariate, while 'Income' is the target variable.

The *first line of code* below fits the univariate linear regression model, while the *second line* prints the summary of the fitted model. Note that we are using the **lm** command, which is used for fitting linear models in R.

```r
fit_lin <- lm(Income ~ Investment, data = dat)
summary(fit_lin)
```

Output:

```
Call:
lm(formula = Income ~ Investment, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-4940996   -93314   -33441    78990  3316423

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.393e+05  2.091e+04   11.45   <2e-16 ***
Investment  2.895e+00  8.071e-02   35.87   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 401100 on 598 degrees of freedom
Multiple R-squared:  0.6827,	Adjusted R-squared:  0.6821
F-statistic:  1286 on 1 and 598 DF,  p-value: < 2.2e-16
```

- Investment is a significant variable for predicting Income, as is evident from the significance code '***' printed next to the p-value of the variable.

- The p-value, shown under the column *Pr(>|t|)*, is less than the significance level of 0.05, which also suggests that there is a statistically significant relationship between 'Investment' and 'Income'.

- The coefficient estimate indicates that for every one-dollar increase in 'Investment', the predicted 'Income' goes up by about 2.895 dollars.

R-squared value: This represents the percentage of variation in the dependent variable (Income) that is explained by the independent variable (Investment). In our case, the R-squared value of 0.68 means that 68 percent of the variation in the variable 'Income' is explained by the variable 'Investment'.

All the above factors indicate that there is a strong linear relationship between the two variables.
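The quantities discussed above can also be extracted from a fitted model programmatically instead of being read off the printed summary. The following is a minimal sketch on simulated data: the data frame `sim` and the numbers used to generate it are made up for illustration and are not the guide's loan data. It relies only on the base R functions `coef`, `summary`, and `confint`.

```r
# Simulated stand-in for the loan data (values are invented for illustration)
set.seed(42)
n <- 600
sim <- data.frame(Investment = runif(n, 10000, 200000))
sim$Income <- 240000 + 2.9 * sim$Investment + rnorm(n, sd = 100000)

fit_sim <- lm(Income ~ Investment, data = sim)
fit_summary <- summary(fit_sim)

# Coefficient table: estimate, standard error, t value, and p-value per term
coef_table <- coef(fit_summary)
slope <- coef_table["Investment", "Estimate"]
slope_p <- coef_table["Investment", "Pr(>|t|)"]

# Share of the variance in Income explained by Investment
r_squared <- fit_summary$r.squared

# 95% confidence interval for the slope
slope_ci <- confint(fit_sim, "Investment", level = 0.95)

print(slope)
print(slope_p)
print(r_squared)
print(slope_ci)
```

A confidence interval for the slope is often more informative than the significance stars alone, because it shows the range of effect sizes that are compatible with the data.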

For numerical attributes, an excellent way to think about relationships is to calculate the correlation.

The Pearson correlation coefficient, calculated using the **cor** function, indicates the direction and strength of the linear relationship between two variables. The line of code below prints the correlation coefficient, which comes out to be about 0.83. This is a strong positive correlation, given that the maximum possible value is one.

```r
cor(dat$Income, dat$Investment)
```

Output:

```
[1] 0.8262401
```
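For a simple linear regression with a single predictor, the R-squared value equals the square of the Pearson correlation between the two variables. As a quick check, squaring the correlation reported above should reproduce the Multiple R-squared of 0.6827 printed in the summary of `fit_lin`. The second half of the sketch repeats the check on simulated values; `x` and `y` are hypothetical and not part of the loan data.

```r
# Values copied from the outputs shown in this guide
r_reported <- 0.8262401    # cor(dat$Income, dat$Investment)
r2_reported <- 0.6827      # Multiple R-squared from summary(fit_lin)

r_squared_from_cor <- r_reported^2
print(round(r_squared_from_cor, 4))

# The same identity on simulated data (x and y are made up for illustration)
set.seed(123)
x <- rnorm(200)
y <- 3 + 2 * x + rnorm(200)
r2_from_lm <- summary(lm(y ~ x))$r.squared
print(all.equal(cor(x, y)^2, r2_from_lm))
```

This identity holds only when there is one predictor; with several predictors, R-squared is the squared correlation between the observed and the fitted values instead.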

It is also possible to create a correlation matrix for multiple variables, which is a symmetrical table of all pairs of attribute correlations for numerical variables. The *first line of code* below calculates the correlation between the numerical variables, while the *second line* displays the correlation matrix.

```r
correl_dat <- cor(dat[,c(3,4,7,9)])
print(correl_dat)
```

Output:

```
                Income Loan_amount        Age Investment
Income      1.00000000  0.76643958 0.02787282  0.8262401
Loan_amount 0.76643958  1.00000000 0.05791348  0.7202692
Age         0.02787282  0.05791348 1.00000000  0.1075841
Investment  0.82624011  0.72026924 0.10758414  1.0000000
```

The matrix above shows that Income has a high positive correlation with 'Loan_amount' and 'Investment'.
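Selecting the numerical columns by position, as in `c(3,4,7,9)`, depends on the column order staying fixed. One alternative is to select them by type. The sketch below assumes a small simulated data frame (`sim` and its values are invented for illustration) and uses `sapply` with `is.numeric` to pick the columns before calling `cor`.

```r
# Simulated mixed-type data (invented values, not the guide's dataset)
set.seed(1)
sim <- data.frame(
  Sex = sample(c("F", "M"), 100, replace = TRUE),
  Income = rnorm(100, mean = 500000, sd = 100000),
  Age = sample(21:76, 100, replace = TRUE),
  stringsAsFactors = FALSE
)
sim$Investment <- 0.2 * sim$Income + rnorm(100, sd = 5000)

# Logical vector marking which columns are numeric (integer or double)
numeric_cols <- sapply(sim, is.numeric)

# Correlation matrix of the numeric columns only, rounded for readability
correl_sim <- cor(sim[, numeric_cols])
print(round(correl_sim, 2))
```

The character column `Sex` is skipped automatically, so the same code keeps working if columns are added or reordered.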

As the name suggests, multiple linear regression tries to predict the target variable using multiple predictors. In our case, we will build the multiple regression model for 'Income' using all the other variables except 'approval_status'. Before modelling, it is better to convert the character variables into the factor type.

The *first line of code* below creates a vector holding the column positions of the character variables in the dataset. The *second line* uses the **lapply** function to convert these columns, stored in 'names', into factor variables. The *third line* prints the structure of the data again, so the conversion can be checked.

```r
names <- c(1,2,5,6,8)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)
```

Output:

```
Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, ...
$ Is_graduate     <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Yes,...
$ Income          <int> 586700, 426700, 735500, 327200, 240000, 683200, 800000, 4...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61000...
$ Credit_score    <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Go...
$ approval_status <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, N...
$ Age             <int> 76, 76, 75, 75, 75, 74, 72, 72, 71, 71, 71, 70, 70, 69, 6...
$ Sex             <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, F, M, M, M, M, ...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9568...
```
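Hard-coded column positions such as `c(1,2,5,6,8)` stop working if the column order changes. One alternative, sketched here on a toy data frame (`sim` is hypothetical and holds only three rows), is to pick the character columns by type and convert just those.

```r
# Toy data frame for illustration only
sim <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Income = c(586700L, 426700L, 735500L),
  Sex = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert them to factors
char_cols <- sapply(sim, is.character)
sim[char_cols] <- lapply(sim[char_cols], factor)

str(sim)

# Factor levels are sorted alphabetically; the first level is the baseline
print(levels(sim$Marital_status))
```

The level order matters for interpretation: because "No" sorts before "Yes", R treats "No" as the reference category, which is why model summaries report a coefficient named `Marital_statusYes`.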

Now we are ready to fit the multiple linear regression. The lines of code below fit the model and print the result summary.

```r
fit_mlr <- lm(Income ~ Marital_status + Is_graduate + Loan_amount + Credit_score + Age + Sex + Investment, data = dat)
summary(fit_mlr)
```

Output:

```
Call:
lm(formula = Income ~ Marital_status + Is_graduate + Loan_amount +
    Credit_score + Age + Sex + Investment, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-4184641  -133867   -37001    92469  2852369

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        3.055e+05  6.802e+04   4.491 8.53e-06 ***
Marital_statusYes  2.341e+04  3.299e+04   0.710   0.4782
Is_graduateYes     8.032e+04  3.671e+04   2.188   0.0291 *
Loan_amount        3.419e-01  2.925e-02  11.688  < 2e-16 ***
Credit_scoreGood  -5.012e+04  3.196e+04  -1.568   0.1174
Age               -2.426e+03  1.006e+03  -2.412   0.0162 *
SexM               4.793e+04  4.048e+04   1.184   0.2370
Investment         2.021e+00  1.043e-01  19.379  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 357700 on 592 degrees of freedom
Multiple R-squared:  0.7502,	Adjusted R-squared:  0.7473
F-statistic:   254 on 7 and 592 DF,  p-value: < 2.2e-16
```

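- The R-squared value increased from 0.68 to 0.75, and the adjusted R-squared, which penalizes additional predictors, rose from 0.6821 to 0.7473. This shows that the additional variables have improved the explanatory power of the model.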

- ‘Investment’ and ‘Loan_amount’ are the highly significant predictors, while 'Age' and 'Is_graduate' are the moderately significant variables. The degree of significance can also be understood from the number of stars, if any, printed next to the p-value of the variable.

The p-value for all four variables discussed above is less than the significance level of 0.05, as shown under the column labeled *Pr(>|t|)*. This reinforces our inference that these variables have a statistically significant relationship with the 'Income' variable.
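Plain R-squared never decreases when predictors are added, so a formal comparison of the smaller and the larger model is a useful complement to reading the two summaries side by side. The sketch below is one possible way to do this, using adjusted R-squared and a partial F-test via `anova`. It runs on simulated data: `sim`, `fit_small`, and `fit_large` are hypothetical names, and the generating coefficients are invented.

```r
# Simulated data (invented values, not the guide's dataset)
set.seed(7)
n <- 600
sim <- data.frame(
  Investment = runif(n, 10000, 200000),
  Loan_amount = runif(n, 20000, 300000)
)
sim$Income <- 240000 + 2 * sim$Investment + 0.35 * sim$Loan_amount +
  rnorm(n, sd = 100000)

# Nested models: the larger one adds Loan_amount to the smaller one
fit_small <- lm(Income ~ Investment, data = sim)
fit_large <- lm(Income ~ Investment + Loan_amount, data = sim)

# Adjusted R-squared penalizes the extra predictor
adj_small <- summary(fit_small)$adj.r.squared
adj_large <- summary(fit_large)$adj.r.squared
print(c(small = adj_small, large = adj_large))

# Partial F-test: does the added predictor reduce the residual sum of squares
# by more than chance alone would explain?
nested_test <- anova(fit_small, fit_large)
f_p_value <- nested_test[2, "Pr(>F)"]
print(nested_test)
```

A small p-value in the second row of the `anova` table indicates that the extra predictor improves the fit beyond what would be expected by chance.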

Logistic Regression is a type of generalized linear model which is used for classification problems. While a linear regression model predicts a continuous outcome, the idea of a logistic regression model is to extend it to situations where the outcome variable is categorical. In this guide, we will perform two-class classification using logistic regression. We will be using the same dataset, but this time, the target variable will be 'approval_status', which indicates whether the loan application was approved ("Yes") or not ("No").

We will start with only one covariate, 'Credit_score', to predict 'approval_status'. The function being used is the **glm** command, which is used for fitting generalized linear models in R. The lines of code below fit the univariate logistic regression model and print the model summary. The argument *family = "binomial"* tells glm that the response is binary, so a logistic regression is fitted.

```r
mod_log <- glm(approval_status ~ Credit_score, data = dat, family = "binomial")
summary(mod_log)
```

Output:

```
Call:
glm(formula = approval_status ~ Credit_score, family = "binomial",
    data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3197  -0.6550   0.3748   0.3748   1.8137

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.4302     0.1783  -8.023 1.03e-15 ***
Credit_scoreGood   4.0506     0.2674  15.147  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 749.20  on 599  degrees of freedom
Residual deviance: 395.64  on 598  degrees of freedom
AIC: 399.64

Number of Fisher Scoring iterations: 5
```

Using the **Pr(>|z|)** result above, we can conclude that the variable 'Credit_score' is an important predictor for 'approval_status', as the p-value is less than 0.05. The significance code also supports this inference. It is also intuitive that applicants with a good credit score are more likely to get their loan applications approved, and vice versa.
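Logistic regression coefficients are on the log-odds scale, so they are easier to interpret after a transformation. The sketch below takes the two estimates printed in the summary above and converts them into an odds ratio and into approval probabilities. It assumes the factor levels of 'approval_status' are ordered "No", "Yes" (the alphabetical default), in which case glm models the probability of "Yes". The variable names are hypothetical.

```r
# Estimates copied from the model summary above (log-odds scale)
intercept <- -1.4302   # log-odds of approval with a "Bad" credit score
coef_good <- 4.0506    # change in log-odds for a "Good" credit score

# Odds ratio: how many times higher the odds of approval are for "Good"
odds_ratio_good <- exp(coef_good)

# plogis() is the inverse logit, mapping log-odds to probabilities
p_bad <- plogis(intercept)
p_good <- plogis(intercept + coef_good)

print(round(c(odds_ratio = odds_ratio_good, p_bad = p_bad, p_good = p_good), 3))
```

Under these assumptions, the fitted approval probability is roughly 19 percent for applicants with a bad credit score and roughly 93 percent for those with a good one, which matches the intuition described above.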

We can also include multiple variables in a logistic regression model by using the formula **approval_status ~ .**, where the dot stands for all the other columns in the data. Below we fit a multiple logistic regression model for 'approval_status' using all the other variables.

```r
mod_log2 <- glm(approval_status ~ ., data = dat, family = "binomial")
summary(mod_log2)
```

Output:

```
Call:
glm(formula = approval_status ~ ., family = "binomial", data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7715  -0.2600   0.1995   0.2778   1.8321

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)       -3.160e+00  7.062e-01  -4.474 7.67e-06 ***
Marital_statusYes  7.360e-01  3.265e-01   2.254  0.02418 *
Is_graduateYes     2.469e+00  3.809e-01   6.484 8.95e-11 ***
Income             1.949e-07  5.013e-07   0.389  0.69746
Loan_amount       -9.635e-07  3.128e-07  -3.080  0.00207 **
Credit_scoreGood   4.649e+00  3.612e-01  12.869  < 2e-16 ***
Age               -1.379e-02  1.002e-02  -1.377  0.16841
SexM              -3.306e-01  3.941e-01  -0.839  0.40158
Investment         1.784e-06  1.923e-06   0.928  0.35360
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 749.20  on 599  degrees of freedom
Residual deviance: 327.75  on 591  degrees of freedom
AIC: 345.75

Number of Fisher Scoring iterations: 6
```

- The variables 'Is_graduate', with label "Yes", and 'Credit_score', with label "Good", are the two most significant variables. This is indicated by their lower p-values and the higher significance code. 'Loan_amount' and 'Marital_status' are the next two important variables for predicting 'approval_status'.

The Akaike information criterion (AIC) value also decreased from 399.64 in the univariate model to 345.75 in the multivariate model. In simple terms, the AIC is an estimator of the relative quality of statistical models for a given set of data, balancing goodness of fit against the number of parameters; lower values are better. The decrease in AIC suggests that adding more variables has improved the model, even after accounting for the added complexity.
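The AIC values can be reproduced by hand from the two summaries. For a logistic regression on individual yes/no outcomes, the residual deviance equals minus twice the log-likelihood, so the AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below uses only numbers printed in the outputs of this guide; the variable names are hypothetical. It also computes a likelihood-ratio test, which applies here because the univariate model is nested in the multivariate one.

```r
# Residual deviances and coefficient counts read from the two summaries
dev_uni <- 395.64     # univariate model
k_uni <- 2            # intercept + Credit_scoreGood
dev_multi <- 327.75   # multivariate model
k_multi <- 9          # intercept + 8 predictor terms

# AIC = residual deviance + 2 * number of coefficients
aic_uni <- dev_uni + 2 * k_uni
aic_multi <- dev_multi + 2 * k_multi
print(c(aic_uni = aic_uni, aic_multi = aic_multi))

# Likelihood-ratio test: the drop in deviance is compared with a
# chi-squared distribution whose degrees of freedom equal the number
# of added coefficients (598 - 591 residual degrees of freedom)
lr_stat <- dev_uni - dev_multi
lr_df <- 598 - 591
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p))
```

With fitted model objects at hand, the same test would normally be run with `anova(mod_log, mod_log2, test = "Chisq")`; the arithmetic above only makes the relationship between deviance, AIC, and the test explicit.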

In this guide, you have learned about interpreting data using statistical models. You also learned about building the correlation matrix for numerical variables and interpreting the output to identify statistically significant variables.
