Introduction

Statistical models are useful not only in machine learning, but also in interpreting data and understanding the relationships between variables. In this guide, the reader will learn how to fit and analyze statistical models on quantitative (linear regression) and qualitative (logistic regression) target variables. The reader will also learn how to create and interpret a correlation matrix of the numerical variables.

We will begin by understanding the data.

In this guide, we will be using the fictitious data of loan applicants containing 600 observations and nine variables, as described below:

Marital_status: Whether the applicant is married ("Yes") or not ("No").

Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").

Income: Annual Income of the applicant (in USD).

Loan_amount: Loan amount (in USD) for which the application was submitted.

Credit_score: Whether the applicant's credit score is good ("Good") or not ("Bad").

Age: The applicant’s age in years.

Sex: Whether the applicant is female (F) or male (M).

approval_status: Whether the loan application was approved ("Yes") or not ("No").

Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.

Let us start by loading the required libraries and the data.

```
library(readr)
library(dplyr)
library(mlbench)
dat <- read_csv("data_r.csv")
glimpse(dat)
```


Output:

```
Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes...
$ Is_graduate     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No...
$ Income          <int> 586700, 426700, 735500, 327200, 240000, 683200, 800000, 4...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61000...
$ Credit_score    <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "...
$ approval_status <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No...
$ Age             <int> 76, 76, 75, 75, 75, 74, 72, 72, 71, 71, 71, 70, 70, 69, 6...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9568...
```

The output shows that the dataset has five categorical variables (labelled 'chr'), while the remaining four are numerical variables (labelled 'int').

Regression models are algorithms that predict a continuous label. Linear regression is a type of regression model that assumes a linear relationship between the target and the predictor variables.

Simple linear regression is the simplest form of regression, using only one covariate to predict the target variable. In our case, 'Investment' is the covariate, while 'Income' is the target variable.

The *first line of code* below fits the univariate linear regression model, while the *second line* prints the summary of the fitted model. Note that we are using the **lm** command, which is used for fitting linear models in R.

```
fit_lin <- lm(Income ~ Investment, data = dat)
summary(fit_lin)
```


Output:

```
Call:
lm(formula = Income ~ Investment, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-4940996   -93314   -33441    78990  3316423

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.393e+05  2.091e+04   11.45   <2e-16 ***
Investment  2.895e+00  8.071e-02   35.87   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 401100 on 598 degrees of freedom
Multiple R-squared:  0.6827,  Adjusted R-squared:  0.6821
F-statistic:  1286 on 1 and 598 DF,  p-value: < 2.2e-16
```

- Investment is a significant variable for predicting Income, as is evident from the significance code '***', printed next to the p-value of the variable.

- The p-value, shown under the *Pr(>|t|)* column, is less than the significance level of 0.05, which also suggests that there is a statistically significant relationship between the variables 'Investment' and 'Income'.

- The coefficients of the output indicate that for every one-dollar increase in 'Investment', 'Income' goes up by about 2.895 dollars.

The R-squared value represents the proportion of variation in the dependent variable (Income) that is explained by the independent variable (Investment). In our case, the R-squared value of 0.68 means that 68 percent of the variation in 'Income' is explained by 'Investment'.

All the above factors indicate that there is a strong linear relationship between the two variables.
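Once a linear model is fitted, it can also be used to generate predictions with the **predict** function. The sketch below illustrates this on synthetic data that mimics the Income–Investment relationship; the variable names and coefficients here are illustrative assumptions, not values from the loan dataset.

```r
# Synthetic stand-in for the loan data: income roughly linear in investment.
set.seed(42)
investment <- runif(100, 10000, 200000)
income <- 240000 + 2.9 * investment + rnorm(100, sd = 50000)
toy <- data.frame(investment, income)

# Fit the univariate linear model, as in the guide.
fit <- lm(income ~ investment, data = toy)

# predict() applies the fitted equation (intercept + slope * investment)
# to new covariate values supplied via `newdata`.
new_obs <- data.frame(investment = c(50000, 150000))
predict(fit, newdata = new_obs)
```

The column names in `newdata` must match the predictor names used in the model formula.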

For numerical attributes, an excellent way to think about relationships is to calculate the correlation.

The Pearson correlation coefficient, calculated using the **cor** function, is an indicator of the direction and strength of the linear relationship between two variables. The line of code below prints the correlation coefficient, which comes out to be 0.82. This indicates a strong positive correlation between the two variables, the maximum possible value being one.

`cor(dat$Income, dat$Investment)`


Output:

```
[1] 0.8262401
```
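Note that 0.8262401 squared is 0.6827, exactly the Multiple R-squared reported for the univariate model above: in simple linear regression, R-squared is the square of the Pearson correlation between the two variables. A minimal sketch on synthetic data (since the loan CSV is not bundled here) verifies the identity:

```r
# For a one-predictor linear model, R-squared equals cor(x, y)^2.
set.seed(1)
x <- rnorm(200)
y <- 3 + 2 * x + rnorm(200)

r  <- cor(x, y)                       # Pearson correlation
r2 <- summary(lm(y ~ x))$r.squared    # R-squared from the fitted model

all.equal(r^2, r2)                    # the two quantities coincide
```

This identity only holds in the univariate case; with multiple predictors, R-squared summarizes the joint fit instead.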

It is also possible to create a correlation matrix for multiple variables, which is a symmetrical table of all pairs of attribute correlations for numerical variables. The *first line of code* below calculates the correlation between the numerical variables, while the *second line* displays the correlation matrix.

```
correl_dat <- cor(dat[, c(3, 4, 7, 9)])
print(correl_dat)
```


Output:

```
                Income Loan_amount        Age Investment
Income      1.00000000  0.76643958 0.02787282  0.8262401
Loan_amount 0.76643958  1.00000000 0.05791348  0.7202692
Age         0.02787282  0.05791348 1.00000000  0.1075841
Investment  0.82624011  0.72026924 0.10758414  1.0000000
```

The matrix above shows that Income has a high positive correlation with 'Loan_amount' and 'Investment'.

As the name suggests, multiple linear regression tries to predict the target variable using multiple predictors. In our case, we will build the multivariate statistical model using all the other variables. But before doing the modelling, it is better to convert the character variables into the factor type.
The *first line of code* below creates a vector of the positions of the character variables in the dataset. The *second line* uses the **lapply** function to convert these variables, indexed by 'names', into factor variables.

```
names <- c(1, 2, 5, 6, 8)
dat[, names] <- lapply(dat[, names], factor)
glimpse(dat)
```


Output:

```
Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, ...
$ Is_graduate     <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Yes,...
$ Income          <int> 586700, 426700, 735500, 327200, 240000, 683200, 800000, 4...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61000...
$ Credit_score    <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Go...
$ approval_status <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, N...
$ Age             <int> 76, 76, 75, 75, 75, 74, 72, 72, 71, 71, 71, 70, 70, 69, 6...
$ Sex             <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, F, M, M, M, M, ...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9568...
```
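Indexing by column position works, but it silently breaks if the column order ever changes. A position-free sketch in base R converts columns by type instead; the `toy` data frame here is a hypothetical stand-in for the loan data.

```r
# Small stand-in data frame with a mix of character and numeric columns.
toy <- data.frame(status = c("Yes", "No"), income = c(100, 200),
                  stringsAsFactors = FALSE)

# Identify character columns by type, then convert only those to factors.
char_cols <- sapply(toy, is.character)
toy[char_cols] <- lapply(toy[char_cols], factor)

str(toy)   # status is now a factor; income is untouched
```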

Now we are ready to fit the multiple linear regression. The lines of code below fit the model and print the result summary.

```
fit_mlr <- lm(Income ~ Marital_status + Is_graduate + Loan_amount + Credit_score + Age + Sex + Investment, data = dat)
summary(fit_mlr)
```


Output:

```
Call:
lm(formula = Income ~ Marital_status + Is_graduate + Loan_amount +
    Credit_score + Age + Sex + Investment, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-4184641  -133867   -37001    92469  2852369

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        3.055e+05  6.802e+04   4.491 8.53e-06 ***
Marital_statusYes  2.341e+04  3.299e+04   0.710   0.4782
Is_graduateYes     8.032e+04  3.671e+04   2.188   0.0291 *
Loan_amount        3.419e-01  2.925e-02  11.688  < 2e-16 ***
Credit_scoreGood  -5.012e+04  3.196e+04  -1.568   0.1174
Age               -2.426e+03  1.006e+03  -2.412   0.0162 *
SexM               4.793e+04  4.048e+04   1.184   0.2370
Investment         2.021e+00  1.043e-01  19.379  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 357700 on 592 degrees of freedom
Multiple R-squared:  0.7502,  Adjusted R-squared:  0.7473
F-statistic:   254 on 7 and 592 DF,  p-value: < 2.2e-16
```

- The R-squared value increased from 0.68 to 0.75, which shows that the addition of variables has improved the predictive power of the model.

- 'Investment' and 'Loan_amount' are highly significant predictors, while 'Age' and 'Is_graduate' are moderately significant. The degree of significance can also be read from the number of stars, if any, printed next to the p-value of the variable.

- The p-values for all four variables discussed above are less than the significance level of 0.05, as shown under the *Pr(>|t|)* column. This reinforces our inference that these variables have a statistically significant relationship with 'Income'.
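Rather than reading the p-values off the printed table, they can be extracted programmatically from the model summary. A minimal sketch on synthetic data (the variable names here are illustrative):

```r
# Synthetic data: y depends on x1; x2 is pure noise.
set.seed(7)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)

fit <- lm(y ~ x1 + x2, data = d)

# coef(summary(fit)) is the coefficient matrix; its last column
# holds the Pr(>|t|) values as a named numeric vector.
pvals <- coef(summary(fit))[, "Pr(>|t|)"]
pvals[pvals < 0.05]   # keep only the significant terms
```

This pattern is handy when screening many predictors, since the significance filter becomes a one-liner.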

Logistic Regression is a type of generalized linear model which is used for classification problems. While a linear regression model predicts a continuous outcome, the idea of a logistic regression model is to extend it to situations where the outcome variable is categorical. In this guide, we will perform two-class classification using logistic regression. We will be using the same dataset, but this time, the target variable will be 'approval_status', which indicates whether the loan application was approved ("Yes") or not ("No").

We will start with only one covariate, 'Credit_score', to predict 'approval_status'. The function being used is the **glm** command, which is used for fitting generalized linear models in R. The lines of code below fit the univariate logistic regression model and print the model summary. The argument `family="binomial"` specifies that the target variable is binary.

```
mod_log = glm(approval_status ~ Credit_score, data = dat, family = "binomial")
summary(mod_log)
```


Output:

```
Call:
glm(formula = approval_status ~ Credit_score, family = "binomial",
    data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3197  -0.6550   0.3748   0.3748   1.8137

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.4302     0.1783  -8.023 1.03e-15 ***
Credit_scoreGood   4.0506     0.2674  15.147  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 749.20  on 599  degrees of freedom
Residual deviance: 395.64  on 598  degrees of freedom
AIC: 399.64

Number of Fisher Scoring iterations: 5
```

Using the **Pr(>|z|)** result above, we can conclude that the variable 'Credit_score' is an important predictor for 'approval_status', as the p-value is less than 0.05. The significance code also supports this inference. It is also intuitive that applicants with a good credit score are more likely to get their loan applications approved, and vice versa.
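Logistic regression coefficients are on the log-odds scale, so they are easier to interpret after transformation. Plugging in the estimates printed above, the **plogis** function (the inverse logit) turns the linear predictor into an approval probability for each credit-score group:

```r
# Coefficient estimates from the model summary above.
b0 <- -1.4302   # intercept (baseline: Credit_score = "Bad")
b1 <-  4.0506   # Credit_scoreGood

p_bad  <- plogis(b0)        # approval probability, bad credit  (~0.19)
p_good <- plogis(b0 + b1)   # approval probability, good credit (~0.93)
exp(b1)                     # odds ratio of approval, good vs bad (~57)

c(p_bad = p_bad, p_good = p_good)
```

The gap between roughly 0.19 and 0.93 makes concrete why 'Credit_score' is such a strong predictor in this model.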

We can also include multiple variables in a logistic regression model, using the **approval_status ~ .** formula, where the dot stands for all the other variables. Below we fit a multivariate logistic regression model for 'approval_status' using all the other variables.

```
mod_log2 = glm(approval_status ~ ., data = dat, family = "binomial")
summary(mod_log2)
```


Output:

```
Call:
glm(formula = approval_status ~ ., family = "binomial", data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7715  -0.2600   0.1995   0.2778   1.8321

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)       -3.160e+00  7.062e-01  -4.474 7.67e-06 ***
Marital_statusYes  7.360e-01  3.265e-01   2.254  0.02418 *
Is_graduateYes     2.469e+00  3.809e-01   6.484 8.95e-11 ***
Income             1.949e-07  5.013e-07   0.389  0.69746
Loan_amount       -9.635e-07  3.128e-07  -3.080  0.00207 **
Credit_scoreGood   4.649e+00  3.612e-01  12.869  < 2e-16 ***
Age               -1.379e-02  1.002e-02  -1.377  0.16841
SexM              -3.306e-01  3.941e-01  -0.839  0.40158
Investment         1.784e-06  1.923e-06   0.928  0.35360
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 749.20  on 599  degrees of freedom
Residual deviance: 327.75  on 591  degrees of freedom
AIC: 345.75

Number of Fisher Scoring iterations: 6
```

- The variables 'Is_graduate', with label "Yes", and 'Credit_score', with label "Good", are the two most significant variables. This is indicated by their lower p-values and the higher significance code. 'Loan_amount' and 'Marital_status' are the next two important variables for predicting 'approval_status'.

The Akaike information criterion (AIC) value also decreased, from 399.64 in the univariate model to 345.75 in the multivariate model. In simple terms, the AIC value is an estimator of the relative quality of statistical models for a given set of data. The decrease in AIC suggests that adding more variables has strengthened the predictive power of the statistical model.
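Rather than reading the AIC values off two separate summaries, the **AIC** helper compares any number of fitted models side by side. A self-contained sketch on synthetic binary data (the loan dataset is not bundled here; only `x1` actually drives the outcome):

```r
# Synthetic binary outcome driven by x1 alone; x2 is noise.
set.seed(123)
x1 <- rnorm(300)
x2 <- rnorm(300)
y  <- rbinom(300, 1, plogis(-1 + 2 * x1))

# Fit a smaller and a larger logistic model on the same data.
m_small <- glm(y ~ x1,      family = "binomial")
m_big   <- glm(y ~ x1 + x2, family = "binomial")

# One table with degrees of freedom and AIC per model;
# a lower AIC means a better fit/complexity trade-off.
AIC(m_small, m_big)
```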

In this guide, you have learned about interpreting data using statistical models. You also learned about building the correlation matrix for numerical variables and interpreting the output to identify statistically significant variables.