Deepika Singh

Interpreting Data Using Statistical Models with R

  • Aug 7, 2019
  • 16 Min read

Introduction

Statistical models are useful not only in machine learning, but also in interpreting data and understanding the relationships between variables. In this guide, the reader will learn how to fit and analyze statistical models for quantitative (linear regression) and qualitative (logistic regression) target variables. The reader will also learn how to create and interpret the correlation matrix of the numerical variables.

We will begin by understanding the data.

Data

In this guide, we will be using the fictitious data of loan applicants containing 600 observations and nine variables, as described below:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No").

  2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").

  3. Income: Annual income of the applicant (in USD).

  4. Loan_amount: Loan amount (in USD) for which the application was submitted.

  5. Credit_score: Whether the applicant's credit score is good ("Good") or not ("Bad").

  6. Age: The applicant's age in years.

  7. Sex: Whether the applicant is female ("F") or male ("M").

  8. approval_status: Whether the loan application was approved ("Yes") or not ("No").

  9. Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.

Let us start by loading the required libraries and the data.

# readr provides read_csv(); dplyr provides glimpse()
library(readr)
library(dplyr)
library(mlbench)
# Load the loan applicant data and inspect its structure
dat <- read_csv("data_r.csv")
glimpse(dat)

Output:

Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes...
$ Is_graduate     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No...
$ Income          <int> 586700, 426700, 735500, 327200, 240000, 683200, 800000, 4...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61000...
$ Credit_score    <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "...
$ approval_status <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No...
$ Age             <int> 76, 76, 75, 75, 75, 74, 72, 72, 71, 71, 71, 70, 70, 69, 6...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9568...
 

The output shows that the dataset has five categorical variables (labelled as 'chr') while the remaining four are numerical variables (labelled as 'int').
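The type breakdown can also be tallied programmatically instead of being read off the printout. The sketch below is a minimal illustration on a small made-up data frame (`toy` and its values are hypothetical stand-ins for `dat`, since the CSV file is not reproduced here).

```r
# Hypothetical stand-in for `dat`: two character and two integer columns
toy <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Sex            = c("M", "F", "M"),
  Income         = c(586700L, 426700L, 735500L),
  Age            = c(76L, 75L, 74L),
  stringsAsFactors = FALSE
)

# Class of each column, then a count of columns per class
col_classes <- sapply(toy, class)
type_counts <- table(col_classes)
print(type_counts)
```

Running the same two lines on `dat` would be expected to report five character and four integer columns, in line with the output above.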

Linear Regression

Regression models are algorithms that predict a continuous label. Linear regression is a type of regression model that assumes a linear relationship between the target and the predictor variables.

Simple Linear Regression

Simple linear regression is the simplest form of regression, using only one covariate to predict the target variable. In our case, 'Investment' is the covariate, while 'Income' is the target variable.

The first line of code below fits the simple linear regression model, while the second line prints the summary of the fitted model. Note that we are using the lm function, which fits linear models in R.

fit_lin <- lm(Income ~ Investment, data = dat)
summary(fit_lin)

Output:

Call:
lm(formula = Income ~ Investment, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-4940996   -93314   -33441    78990  3316423

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.393e+05  2.091e+04   11.45   <2e-16 ***
Investment  2.895e+00  8.071e-02   35.87   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 401100 on 598 degrees of freedom
Multiple R-squared:  0.6827,    Adjusted R-squared:  0.6821
F-statistic:  1286 on 1 and 598 DF,  p-value: < 2.2e-16
 

Interpretation of the Output

  1. Investment is a significant predictor of Income, as indicated by the significance code '***' printed next to its p-value.
  2. The p-value, shown under the column Pr(>|t|), is less than the significance level of 0.05, which suggests a statistically significant relationship between 'Investment' and 'Income'.
  3. The coefficient estimate indicates that each one-dollar increase in 'Investment' is associated with an average increase of about 2.895 dollars in 'Income'.
  4. The R-squared value represents the proportion of the variation in the dependent variable (Income) that is explained by the independent variable (Investment). In our case, the R-squared value of 0.68 means that 68 percent of the variation in 'Income' is explained by 'Investment'.

All the above factors indicate that there is a strong linear relationship between the two variables.
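To make the coefficient interpretation concrete, the fitted line can be used to compute a predicted income by hand. The sketch below copies the two estimates from the summary output; the investment figure of 100,000 USD is a hypothetical value chosen only for illustration.

```r
# Coefficients copied from the summary output above
intercept <- 2.393e+05
slope     <- 2.895

# Hypothetical applicant with 100,000 USD of declared investments
investment <- 100000

# Fitted line: Income = intercept + slope * Investment
predicted_income <- intercept + slope * investment
print(predicted_income)  # 528800
```

With the fitted model in memory, `predict(fit_lin, newdata = data.frame(Investment = 100000))` should give the same figure up to the rounding of the printed coefficients.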

Correlation

For numerical attributes, an excellent way to think about relationships is to calculate the correlation.

Correlation Coefficient Between Two Variables

The Pearson correlation coefficient, calculated using the cor function, indicates the direction and strength of the linear relationship between two variables. The line of code below prints the correlation coefficient, which comes out to be 0.83. This is a strong positive correlation between the two variables, with the highest possible value being one.

cor(dat$Income, dat$Investment)

Output:

[1] 0.8262401
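In a simple linear regression with one predictor, the R-squared value equals the square of the Pearson correlation between the predictor and the target. The sketch below first checks this against the figures reported in this guide, then confirms the identity on simulated data (the variables `x` and `y` and their values are hypothetical).

```r
# Part 1: figures reported in this guide
r <- 0.8262401
print(round(r^2, 4))  # 0.6827, the Multiple R-squared of the simple regression

# Part 2: the same identity on simulated data (hypothetical values)
set.seed(1)
x <- rnorm(200)
y <- 3 + 2 * x + rnorm(200)
fit <- lm(y ~ x)

# R-squared from the model versus the squared correlation
identity_holds <- isTRUE(all.equal(summary(fit)$r.squared, cor(x, y)^2))
print(identity_holds)
```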
 

Correlation Between Multiple Variables

It is also possible to create a correlation matrix for multiple variables, which is a symmetrical table of all pairs of attribute correlations for numerical variables. The first line of code below calculates the correlation between the numerical variables, while the second line displays the correlation matrix.

# Columns 3, 4, 7 and 9 are Income, Loan_amount, Age and Investment
correl_dat <- cor(dat[, c(3, 4, 7, 9)])
print(correl_dat)

Output:

                Income Loan_amount        Age Investment
Income      1.00000000  0.76643958 0.02787282  0.8262401
Loan_amount 0.76643958  1.00000000 0.05791348  0.7202692
Age         0.02787282  0.05791348 1.00000000  0.1075841
Investment  0.82624011  0.72026924 0.10758414  1.0000000
 

The matrix above shows that Income has a high positive correlation with 'Loan_amount' and 'Investment'.
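Selecting columns by position is fragile if the column order ever changes. As an alternative sketch, the numeric columns can be picked by type before calling cor. The data frame `toy` below is a hypothetical stand-in for `dat`, with illustrative values.

```r
# Hypothetical stand-in for `dat` with one character and three numeric columns
toy <- data.frame(
  Sex         = c("M", "F", "M", "F", "M"),
  Income      = c(586700, 426700, 735500, 327200, 240000),
  Loan_amount = c(70500, 70000, 275000, 100500, 51500),
  Age         = c(76, 71, 75, 68, 74),
  stringsAsFactors = FALSE
)

# Logical vector marking the numeric columns
num_cols <- sapply(toy, is.numeric)

# Correlation matrix of the numeric columns only
correl_toy <- cor(toy[, num_cols])
print(round(correl_toy, 2))
```

The result is symmetric with ones on the diagonal, since each variable is perfectly correlated with itself.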

Multiple Linear Regression

As the name suggests, multiple linear regression tries to predict the target variable using multiple predictors. In our case, we will build the multivariate statistical model using all the other variables. Before modelling, it is better to convert the character variables to the factor type. The first line of code below creates a vector holding the column positions of the character variables. The second line uses the lapply function to convert the columns indexed by 'names' into factors. The third line prints the structure of the data, in which the categorical variables now have the 'factor' type.

# Column positions of the character variables
names <- c(1, 2, 5, 6, 8)
# Convert each of those columns to a factor
dat[, names] <- lapply(dat[, names], factor)
glimpse(dat)

Output:

Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, ...
$ Is_graduate     <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Yes,...
$ Income          <int> 586700, 426700, 735500, 327200, 240000, 683200, 800000, 4...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61000...
$ Credit_score    <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Go...
$ approval_status <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, N...
$ Age             <int> 76, 76, 75, 75, 75, 74, 72, 72, 71, 71, 71, 70, 70, 69, 6...
$ Sex             <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, F, M, M, M, M, ...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9568...
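The conversion can also be written without hard-coded column positions by detecting the character columns by type. The sketch below uses a hypothetical data frame `toy` in place of `dat`. It also prints the factor levels, because the first level (alphabetically, by default) becomes the reference category in the models that follow, which is why a coefficient would be labelled 'Credit_scoreGood' rather than 'Credit_scoreBad'.

```r
# Hypothetical stand-in for `dat`
toy <- data.frame(
  Credit_score = c("Good", "Bad", "Good"),
  Sex          = c("M", "F", "M"),
  Income       = c(586700, 426700, 735500),
  stringsAsFactors = FALSE
)

# Detect character columns by type, then convert them to factors
chr_cols <- sapply(toy, is.character)
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
print(levels(toy$Credit_score))  # "Bad" "Good"
```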
 

Now we are ready to fit the multiple linear regression model. The lines of code below fit the model and print the result summary.

fit_mlr <- lm(Income ~ Marital_status + Is_graduate + Loan_amount + Credit_score + Age + Sex + Investment, data = dat)
summary(fit_mlr)

Output:

Call:
lm(formula = Income ~ Marital_status + Is_graduate + Loan_amount +
    Credit_score + Age + Sex + Investment, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-4184641  -133867   -37001    92469  2852369

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        3.055e+05  6.802e+04   4.491 8.53e-06 ***
Marital_statusYes  2.341e+04  3.299e+04   0.710   0.4782
Is_graduateYes     8.032e+04  3.671e+04   2.188   0.0291 *
Loan_amount        3.419e-01  2.925e-02  11.688  < 2e-16 ***
Credit_scoreGood  -5.012e+04  3.196e+04  -1.568   0.1174
Age               -2.426e+03  1.006e+03  -2.412   0.0162 *
SexM               4.793e+04  4.048e+04   1.184   0.2370
Investment         2.021e+00  1.043e-01  19.379  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 357700 on 592 degrees of freedom
Multiple R-squared:  0.7502,    Adjusted R-squared:  0.7473
F-statistic:   254 on 7 and 592 DF,  p-value: < 2.2e-16
 

Interpretation of the Output

  1. The R-squared value increased from 0.68 to 0.75. Because R-squared never decreases when predictors are added, the adjusted R-squared is the fairer comparison; it rose from 0.6821 to 0.7473, which shows that the additional variables have improved the explanatory power of the model.
  2. 'Investment' and 'Loan_amount' are the highly significant predictors, while 'Age' and 'Is_graduate' are moderately significant. The degree of significance can also be read from the number of stars, if any, printed next to the p-value of the variable.
  3. The p-values for all four variables discussed above are less than the significance level of 0.05, as shown under the column labeled Pr(>|t|). This reinforces our inference that these variables have a statistically significant relationship with the 'Income' variable.
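Plain R-squared cannot decrease when predictors are added, even useless ones, so a rise in R-squared alone is weak evidence of a better model. The sketch below illustrates this on simulated data (all variables and values are hypothetical), recomputes the adjusted R-squared by hand, and compares the two nested models with an F-test via anova.

```r
# Simulated data (hypothetical): y depends on x1 and x2 but not on `noise`
set.seed(42)
n     <- 300
x1    <- rnorm(n)
x2    <- rnorm(n)
noise <- rnorm(n)
y     <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

fit_small <- lm(y ~ x1)
fit_big   <- lm(y ~ x1 + x2 + noise)

r2_small <- summary(fit_small)$r.squared
r2_big   <- summary(fit_big)$r.squared

# Adjusted R-squared by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where p is the number of predictors in the larger model
p <- 3
adj_manual <- 1 - (1 - r2_big) * (n - 1) / (n - p - 1)

print(c(r2_small = r2_small, r2_big = r2_big))
print(c(manual = adj_manual, reported = summary(fit_big)$adj.r.squared))

# F-test of whether the extra predictors jointly improve the fit
print(anova(fit_small, fit_big))
```

The same comparison for the models in this guide would be `anova(fit_lin, fit_mlr)`, assuming both were fitted on the same rows of `dat`.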

Logistic Regression

Logistic Regression is a type of generalized linear model which is used for classification problems. While a linear regression model predicts a continuous outcome, the idea of a logistic regression model is to extend it to situations where the outcome variable is categorical. In this guide, we will perform two-class classification using logistic regression. We will be using the same dataset, but this time, the target variable will be 'approval_status', which indicates whether the loan application was approved ("Yes") or not ("No").

Univariate Logistic Regression

We will start with only one covariate, 'Credit_score', to predict 'approval_status'. The function being used is glm, which fits generalized linear models in R. The lines of code below fit the univariate logistic regression model and print the model summary. The argument family="binomial" specifies that we are building a logistic regression model for predicting binary outcomes.

mod_log <- glm(approval_status ~ Credit_score, data = dat, family = "binomial")
summary(mod_log)

Output:

Call:
glm(formula = approval_status ~ Credit_score, family = "binomial",
    data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3197  -0.6550   0.3748   0.3748   1.8137

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.4302     0.1783  -8.023 1.03e-15 ***
Credit_scoreGood   4.0506     0.2674  15.147  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 749.20  on 599  degrees of freedom
Residual deviance: 395.64  on 598  degrees of freedom
AIC: 399.64

Number of Fisher Scoring iterations: 5
 

Using the Pr(>|z|) result above, we can conclude that the variable 'Credit_score' is an important predictor for 'approval_status', as the p-value is less than 0.05. The significance code also supports this inference. It is also intuitive that applicants with a good credit score are more likely to get their loan applications approved, and vice versa.
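Logistic regression coefficients are on the log-odds scale, so they are easier to read after transformation. The sketch below copies the two estimates from the summary output and converts them to an odds ratio and to predicted probabilities. It assumes the model describes the probability of approval_status being "Yes", which is how glm treats the second level of a two-level factor response.

```r
# Coefficients copied from the summary output above
b0 <- -1.4302  # log-odds of approval for a "Bad" credit score
b1 <-  4.0506  # change in log-odds for a "Good" credit score

# Odds ratio: how many times higher the odds of approval are for "Good"
odds_ratio <- exp(b1)

# plogis() is the inverse logit: it maps log-odds to a probability
p_bad  <- plogis(b0)
p_good <- plogis(b0 + b1)

print(round(c(odds_ratio = odds_ratio, p_bad = p_bad, p_good = p_good), 3))
```

Under these assumptions, the odds of approval are roughly 57 times higher with a good credit score, with predicted approval probabilities of about 0.19 and 0.93 for the two groups.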

Multivariate Logistic Regression

We can also include multiple variables in a logistic regression model. In the formula approval_status ~ ., the dot stands for all the other columns in the data. Below we will fit a multivariate logistic regression model for 'approval_status' using all the other variables.

mod_log2 <- glm(approval_status ~ ., data = dat, family = "binomial")
summary(mod_log2)

Output:

Call:
glm(formula = approval_status ~ ., family = "binomial", data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7715  -0.2600   0.1995   0.2778   1.8321

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)       -3.160e+00  7.062e-01  -4.474 7.67e-06 ***
Marital_statusYes  7.360e-01  3.265e-01   2.254  0.02418 *
Is_graduateYes     2.469e+00  3.809e-01   6.484 8.95e-11 ***
Income             1.949e-07  5.013e-07   0.389  0.69746
Loan_amount       -9.635e-07  3.128e-07  -3.080  0.00207 **
Credit_scoreGood   4.649e+00  3.612e-01  12.869  < 2e-16 ***
Age               -1.379e-02  1.002e-02  -1.377  0.16841
SexM              -3.306e-01  3.941e-01  -0.839  0.40158
Investment         1.784e-06  1.923e-06   0.928  0.35360
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 749.20  on 599  degrees of freedom
Residual deviance: 327.75  on 591  degrees of freedom
AIC: 345.75

Number of Fisher Scoring iterations: 6
 

Interpretation of the Output

  1. The variables 'Is_graduate' (level "Yes") and 'Credit_score' (level "Good") are the two most significant predictors, as indicated by their very low p-values and the '***' significance code. 'Loan_amount' and 'Marital_status' are the next two most important predictors of 'approval_status'.
  2. The Akaike information criterion (AIC) decreased from 399.64 in the univariate model to 345.75 in the multivariate model. In simple terms, AIC estimates the relative quality of statistical models for a given set of data, balancing goodness of fit against the number of parameters; lower values are better. The decrease suggests that adding more variables has improved the model.
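For a logistic regression on a binary response, the residual deviance equals minus twice the log-likelihood, so the AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below reproduces both AIC values from the figures in the two summaries and adds a likelihood-ratio test of the deviance drop, which is valid here because the univariate model is nested in the multivariate one.

```r
# Figures copied from the two summary outputs above
dev_uni   <- 395.64; k_uni   <- 2  # intercept + Credit_scoreGood
dev_multi <- 327.75; k_multi <- 9  # intercept + eight predictor terms

# AIC = residual deviance + 2 * number of coefficients
aic_uni   <- dev_uni   + 2 * k_uni
aic_multi <- dev_multi + 2 * k_multi
print(c(aic_uni = aic_uni, aic_multi = aic_multi))  # 399.64 and 345.75

# Likelihood-ratio test: deviance drop against a chi-squared distribution
lr_stat <- dev_uni - dev_multi
lr_df   <- k_multi - k_uni
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(p_value)
```

With both fitted models in memory, `anova(mod_log, mod_log2, test = "Chisq")` should carry out the same test directly.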

Conclusion

In this guide, you have learned about interpreting data using statistical models. You also learned about building the correlation matrix for numerical variables and interpreting the output to identify statistically significant variables.
