Interpreting Data Using Statistical Models with R
Aug 7, 2019 • 16 Minute Read
Introduction
Statistical models are useful not only in machine learning, but also in interpreting data and understanding the relationships between variables. In this guide, the reader will learn how to fit and analyze statistical models for quantitative (linear regression) and qualitative (logistic regression) target variables. The reader will also learn how to create and interpret the correlation matrix of the numerical variables.
We will begin by understanding the data.
Data
In this guide, we will be using the fictitious data of loan applicants containing 600 observations and nine variables, as described below:

Marital_status: Whether the applicant is married ("Yes") or not ("No").

Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").

Income: Annual Income of the applicant (in USD).

Loan_amount: Loan amount (in USD) for which the application was submitted.

Credit_score: Whether the applicant's credit score is good ("Good") or not ("Bad").

Age: The applicant’s age in years.

Sex: Whether the applicant is female (F) or male (M).

approval_status: Whether the loan application was approved ("Yes") or not ("No").

Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.
Let us start by loading the required libraries and the data.
library(readr)
library(dplyr)
dat <- read_csv("data_r.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status <chr> "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes...
$ Is_graduate <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No...
$ Income <int> 586700, 426700, 735500, 327200, 240000, 683200, 800000, 4...
$ Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61000...
$ Credit_score <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "...
$ approval_status <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No...
$ Age <int> 76, 76, 75, 75, 75, 74, 72, 72, 71, 71, 71, 70, 70, 69, 6...
$ Sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M...
$ Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9568...
The output shows that the dataset has five categorical variables (labelled as 'chr') while the remaining four are numerical variables (labelled as 'int').
Linear Regression
Regression models are algorithms that predict a continuous label. Linear regression is a type of regression model that assumes a linear relationship between the target and the predictor variables.
Simple Linear Regression
Simple linear regression is the simplest form of regression, using only one covariate to predict the target variable. In our case, 'Investment' is the covariate, while 'Income' is the target variable.
The first line of code below fits the univariate linear regression model, while the second line prints the summary of the fitted model. Note that we are using the lm command, which is used for fitting linear models in R.
fit_lin <- lm(Income ~ Investment, data = dat)
summary(fit_lin)
Output:
Call:
lm(formula = Income ~ Investment, data = dat)
Residuals:
     Min       1Q   Median       3Q      Max
-4940996   -93314    33441    78990  3316423
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.393e+05  2.091e+04   11.45   <2e-16 ***
Investment  2.895e+00  8.071e-02   35.87   <2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 401100 on 598 degrees of freedom
Multiple R-squared: 0.6827,	Adjusted R-squared: 0.6821
F-statistic: 1286 on 1 and 598 DF, p-value: < 2.2e-16
Interpretation of the Output

Investment is a significant variable for predicting Income, as is evident from the significance code '***' printed next to the variable's p-value.

The p-value, shown under the column ***Pr(>|t|)***, is less than the significance level of 0.05, which also suggests a statistically significant relationship between the variables 'Investment' and 'Income'.

The coefficients in the output indicate that for every one-dollar increase in 'Investment', 'Income' goes up by about 2.895 dollars.

R-squared value: represents the proportion of variation in the dependent variable (Income) that is explained by the independent variable (Investment). In our case, the R-squared value of 0.68 means that 68 percent of the variation in 'Income' is explained by 'Investment'.
All the above factors indicate that there is a strong linear relationship between the two variables.
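To see what these coefficients mean in practice, we can plug them into the fitted equation by hand. The snippet below uses the estimates printed above for a hypothetical applicant with an investment of 100,000 USD; that figure is illustrative only.

```r
# Estimates from the summary above
intercept <- 2.393e+05
slope     <- 2.895

# Predicted annual income for a hypothetical applicant
# who declared an investment of 100,000 USD (an illustrative value)
predicted_income <- intercept + slope * 100000
predicted_income  # 528800
```

This is equivalent to calling predict(fit_lin, newdata = data.frame(Investment = 100000)) on the fitted model, and gives a predicted income of roughly 529,000 USD.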
Correlation
For numerical attributes, an excellent way to think about relationships is to calculate the correlation.
Correlation Coefficient Between Two Variables
The Pearson correlation coefficient, calculated with the cor function, measures the strength and direction of the linear relationship between two variables. The line of code below prints the correlation coefficient, which comes out to be 0.82. This indicates a strong positive correlation between the two variables, the highest possible value being one.
cor(dat$Income, dat$Investment)
Output:
[1] 0.8262401
Correlation Between Multiple Variables
It is also possible to create a correlation matrix for multiple variables, which is a symmetrical table of all pairs of attribute correlations for numerical variables. The first line of code below calculates the correlation between the numerical variables, while the second line displays the correlation matrix.
correl_dat <- cor(dat[, c(3, 4, 7, 9)])
print(correl_dat)
Output:
Income Loan_amount Age Investment
Income 1.00000000 0.76643958 0.02787282 0.8262401
Loan_amount 0.76643958 1.00000000 0.05791348 0.7202692
Age 0.02787282 0.05791348 1.00000000 0.1075841
Investment 0.82624011 0.72026924 0.10758414 1.0000000
The matrix above shows that Income has a high positive correlation with 'Loan_amount' and 'Investment'.
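Note how the correlation matrix ties back to the simple regression: for a single-predictor model, R-squared is simply the square of the Pearson correlation coefficient. The quick check below uses the coefficient printed above.

```r
# Pearson correlation between Income and Investment, from the matrix above
r <- 0.8262401

# Squaring it recovers the Multiple R-squared of the simple linear model
r^2  # ~0.6827
```

This matches the R-squared value of 0.6827 reported by summary(fit_lin).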
Multiple Linear Regression
As the name suggests, multiple linear regression predicts the target variable using two or more predictors. In our case, we will build the multivariate statistical model using all the other variables. Before modeling, it is better to convert the character variables to the factor type. The first line of code below creates a vector of the column positions of the character variables in the dataset. The second line uses the lapply function to convert the variables at those positions, stored in 'names', into factors. The third line displays the data, showing that the categorical variables have been converted to the 'factor' type.
names <- c(1, 2, 5, 6, 8)
dat[, names] <- lapply(dat[, names], factor)
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status <fct> Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, ...
$ Is_graduate <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Yes,...
$ Income <int> 586700, 426700, 735500, 327200, 240000, 683200, 800000, 4...
$ Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61000...
$ Credit_score <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Go...
$ approval_status <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, N...
$ Age <int> 76, 76, 75, 75, 75, 74, 72, 72, 71, 71, 71, 70, 70, 69, 6...
$ Sex <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, F, M, M, M, M, ...
$ Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9568...
Now we are ready to fit the multiple linear regression. The lines of code below fit the model and print the result summary.
fit_mlr <- lm(Income ~ Marital_status + Is_graduate + Loan_amount + Credit_score + Age + Sex + Investment, data = dat)
summary(fit_mlr)
Output:
Call:
lm(formula = Income ~ Marital_status + Is_graduate + Loan_amount +
Credit_score + Age + Sex + Investment, data = dat)
Residuals:
     Min       1Q   Median       3Q      Max
-4184641  -133867    37001    92469  2852369
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.055e+05  6.802e+04   4.491 8.53e-06 ***
Marital_statusYes 2.341e+04  3.299e+04   0.710   0.4782
Is_graduateYes    8.032e+04  3.671e+04   2.188   0.0291 *
Loan_amount       3.419e-01  2.925e-02  11.688  < 2e-16 ***
Credit_scoreGood  5.012e+04  3.196e+04   1.568   0.1174
Age               2.426e+03  1.006e+03   2.412   0.0162 *
SexM              4.793e+04  4.048e+04   1.184   0.2370
Investment        2.021e+00  1.043e-01  19.379  < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 357700 on 592 degrees of freedom
Multiple R-squared: 0.7502,	Adjusted R-squared: 0.7473
F-statistic: 254 on 7 and 592 DF, p-value: < 2.2e-16
Interpretation of the Output

The R-squared value increased from 0.68 to 0.75, which shows that the added variables have improved the model's explanatory power.

'Investment' and 'Loan_amount' are highly significant predictors, while 'Age' and 'Is_graduate' are moderately significant. The degree of significance can also be read from the number of stars, if any, printed next to the variable's p-value.

The p-values for all four variables discussed above are less than the significance level of 0.05, as shown under the column labeled ***Pr(>|t|)***. This reinforces our inference that these variables have a statistically significant relationship with the 'Income' variable.
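Once the model is fitted, it can also be used to score new applicants with predict(). The sketch below assumes the fit_mlr object and the factor-converted dat from above are in memory; the applicant's values are made up for illustration.

```r
# A hypothetical new applicant (all values are illustrative)
new_applicant <- data.frame(
  Marital_status = factor("Yes", levels = levels(dat$Marital_status)),
  Is_graduate    = factor("Yes", levels = levels(dat$Is_graduate)),
  Loan_amount    = 100000L,
  Credit_score   = factor("Good", levels = levels(dat$Credit_score)),
  Age            = 35L,
  Sex            = factor("M", levels = levels(dat$Sex)),
  Investment     = 100000L
)

# Point prediction of Income, with a 95% prediction interval
predict(fit_mlr, newdata = new_applicant, interval = "prediction")
```

Matching the factor levels to those in the training data, as done above, ensures that the dummy coding used by predict() is consistent with the fitted model.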
Logistic Regression
Logistic regression is a type of generalized linear model which is used for classification problems. While a linear regression model predicts a continuous outcome, the idea of a logistic regression model is to extend it to situations where the outcome variable is categorical. In this guide, we will perform two-class classification using logistic regression. We will be using the same dataset, but this time, the target variable will be 'approval_status', which indicates whether the loan application was approved ("Yes") or not ("No").
Univariate Logistic Regression
We will start with only one covariate, 'Credit_score', to predict 'approval_status'. The function being used is the glm command, which fits generalized linear models in R. The lines of code below fit the univariate logistic regression model and print the model summary. The argument ***family = "binomial"*** specifies that we are building a logistic regression model for a binary outcome.
mod_log <- glm(approval_status ~ Credit_score, data = dat, family = "binomial")
summary(mod_log)
Output:
Call:
glm(formula = approval_status ~ Credit_score, family = "binomial",
data = dat)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3197  -0.6550   0.3748   0.3748   1.8137
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.4302     0.1783  -8.023 1.03e-15 ***
Credit_scoreGood   4.0506     0.2674  15.147  < 2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 749.20 on 599 degrees of freedom
Residual deviance: 395.64 on 598 degrees of freedom
AIC: 399.64
Number of Fisher Scoring iterations: 5
Using the Pr(>|z|) result above, we can conclude that the variable 'Credit_score' is an important predictor of 'approval_status', as the p-value is less than 0.05. The significance code also supports this inference. It is also intuitive that applicants with a good credit score are more likely to get their loan applications approved, and vice versa.
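Because logistic regression coefficients are on the log-odds scale, exponentiating them gives an odds ratio, which is often easier to communicate. The snippet below applies this to the 'Credit_scoreGood' estimate printed above.

```r
# Log-odds coefficient for Credit_scoreGood from the summary above
beta_credit <- 4.0506

# Odds ratio: the odds of approval for an applicant with a "Good"
# credit score are roughly 57 times the odds for a "Bad" one
exp(beta_credit)  # ~57.4
```

The same number can be obtained directly from the fitted model with exp(coef(mod_log)).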
Multivariate Logistic Regression
We can also include multiple variables in a logistic regression model by using the formula approval_status ~ ., in which the period tells glm to use all the other variables as predictors. Below we fit a multivariate logistic regression model for 'approval_status'.
mod_log2 <- glm(approval_status ~ ., data = dat, family = "binomial")
summary(mod_log2)
Output:
Call:
glm(formula = approval_status ~ ., family = "binomial", data = dat)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7715  -0.2600   0.1995   0.2778   1.8321
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       3.160e+00  7.062e-01   4.474 7.67e-06 ***
Marital_statusYes 7.360e-01  3.265e-01   2.254  0.02418 *
Is_graduateYes    2.469e+00  3.809e-01   6.484 8.95e-11 ***
Income            1.949e-07  5.013e-07   0.389  0.69746
Loan_amount       9.635e-07  3.128e-07   3.080  0.00207 **
Credit_scoreGood  4.649e+00  3.612e-01  12.869  < 2e-16 ***
Age               1.379e-02  1.002e-02   1.377  0.16841
SexM              3.306e-01  3.941e-01   0.839  0.40158
Investment        1.784e-06  1.923e-06   0.928  0.35360

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 749.20 on 599 degrees of freedom
Residual deviance: 327.75 on 591 degrees of freedom
AIC: 345.75
Number of Fisher Scoring iterations: 6
Interpretation of the Output

The variables 'Is_graduate', with label "Yes", and 'Credit_score', with label "Good", are the two most significant variables. This is indicated by their lower p-values and higher significance codes. 'Loan_amount' and 'Marital_status' are the next two important variables for predicting 'approval_status'.

The Akaike information criterion (AIC) also decreased, from 399.64 in the univariate model to 345.75 in the multivariate model. In simple terms, the AIC estimates the relative quality of statistical models for a given dataset, with lower values indicating a better trade-off between goodness of fit and model complexity. The decrease in AIC suggests that adding more variables has strengthened the predictive power of the statistical model.
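For a GLM, the AIC reported by summary is simply the residual deviance plus twice the number of estimated parameters. The arithmetic below reproduces the two AIC values from the model summaries above.

```r
# AIC = residual deviance + 2 * (number of estimated coefficients)
aic_uni   <- 395.64 + 2 * 2  # univariate model: intercept + Credit_score
aic_multi <- 327.75 + 2 * 9  # multivariate model: intercept + 8 predictors

c(aic_uni, aic_multi)  # 399.64 and 345.75, matching the summaries
```

The same values can be extracted from the fitted models with AIC(mod_log) and AIC(mod_log2).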
Conclusion
In this guide, you have learned about interpreting data using statistical models. You learned how to fit and interpret simple and multiple linear regression models, as well as univariate and multivariate logistic regression models. You also learned how to build the correlation matrix for numerical variables and how to interpret the model output to identify statistically significant variables.