Interpreting Data Using Descriptive Statistics with R

Deepika Singh

  • Aug 2, 2019
  • 15 Min read
  • 123 Views
Data
R

Introduction

Descriptive statistics are the foundation of summarizing data. They are commonly divided into measures of central tendency and measures of dispersion. Measures of central tendency include the mean, median, and mode, while measures of dispersion (also called variability) include the standard deviation, variance, and interquartile range. In this guide, you will learn how to compute these measures of descriptive statistics and use them to interpret the data.

We will begin by loading the data to be used in this guide.

Data

In this guide, we will be using the fictitious data of loan applicants containing 600 observations and 9 variables, as described below:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No").

  2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").

  3. Income: Annual income of the applicant (in USD).

  4. Loan_amount: Loan amount (in USD) for which the application was submitted.

  5. Credit_score: Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").

  6. Age: The applicant’s age in years.

  7. Sex: Whether the applicant is female (F) or male (M).

  8. approval_status: Whether the loan application was approved ("Yes") or not ("No").

  9. Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.

Let us start by loading the required libraries and the data.

library(readr)   # read_csv() for loading the data
library(dplyr)   # glimpse() for inspecting the data
library(e1071)   # skewness(), used later in this guide
dat <- read_csv("data_de.csv")
glimpse(dat)

Output:

Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "...
$ Income          <int> 306800, 702100, 558800, 534500, 468000, 412700, 257100,...
$ Loan_amount     <int> 43500, 104000, 66500, 64500, 135000, 63000, 55500, 2500...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satisf...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Age             <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74,...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", "M", "M", ...
$ Investment      <int> 199420, 456365, 363220, 347425, 304200, 268255, 167115,...
 

Five of the variables are categorical (labelled as 'chr') while the remaining four are numerical (labelled as 'int').

Measures of Central Tendency

Measures of central tendency describe the center of the data and are often represented by the mean, median, and mode.

Mean

The mean is the arithmetic average of the data: the sum of the values divided by the number of observations. In R, it is calculated with the mean() function. If the variable contains missing values, mean() returns NA unless the argument na.rm = TRUE is added, which makes the function ignore the missing values while computing the mean.
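To make the effect of na.rm concrete, here is a minimal sketch on a small made-up vector. The values and the names income, mean_with_na, and mean_without_na are illustrative assumptions and are not taken from data_de.csv.

```r
# Hypothetical incomes with one missing value (not taken from data_de.csv)
income <- c(306800, 702100, NA, 534500)

mean_with_na    <- mean(income)                # NA: the missing value propagates
mean_without_na <- mean(income, na.rm = TRUE)  # average of the three observed values

print(mean_with_na)
print(mean_without_na)
```

Dropping missing values silently changes the number of observations behind the average, so it can be worth checking how many values are missing before relying on na.rm = TRUE.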

The line of code below uses the sapply() function to calculate the mean of the numerical variables in the data. The argument c(3,4,7,9) selects the numerical variables by their position in the data.

The output below shows that the average age of the applicants is 49.5 years, the average annual income is USD 705,541, the average loan amount applied for is USD 323,794, and the average investment is USD 161,067.

sapply(dat[,c(3,4,7,9)], mean)

Output:

     Income Loan_amount         Age  Investment
  705541.33   323793.67       49.45   161066.97
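Selecting columns by position, as in c(3,4,7,9), stops working as expected if the column order changes. As a hedged alternative, the sketch below selects the numeric columns by type instead. The data frame toy and its values are made up for illustration and are not part of the original data.

```r
# Hypothetical miniature of the loan data (values are illustrative only)
toy <- data.frame(
  Sex    = c("M", "F", "M"),
  Income = c(306800, 702100, 558800),
  Age    = c(76, 75, 74),
  stringsAsFactors = FALSE
)

num_cols  <- sapply(toy, is.numeric)     # logical vector: TRUE for numeric columns
toy_means <- sapply(toy[num_cols], mean) # named vector of column means
print(toy_means)
```

The same pattern should carry over to the other statistics in this guide by swapping mean for median, sd, var, or IQR.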
 

It is also possible to calculate the mean of a single variable in the data, as shown below.

print(mean(dat$Income))
print(mean(dat$Loan_amount))

Output:

[1] 705541.3
[1] 323793.7
 

Median

The median is the middle value of a variable when its values are sorted. The line of code below uses the median() function to print the median of the numerical variables in the data.

sapply(dat[,c(3,4,7,9)], median)

Output:

     Income Loan_amount         Age  Investment
     508350       76000          51      106740
 

From the output, we can infer that the median age of the applicants is 51 years, the median annual income is USD 508,350, and the median loan applied for is USD 76,000.
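The median loan amount (USD 76,000) is far below the mean (USD 323,794), which suggests that a few very large loans pull the mean upward. The sketch below reproduces that effect on a small made-up vector. The values and the names loans, loan_mean, and loan_median are assumptions for illustration only.

```r
# Hypothetical loan amounts: four typical values plus one very large one
loans <- c(43500, 64500, 66500, 104000, 7780000)

loan_mean   <- mean(loans)    # pulled upward by the single large value
loan_median <- median(loans)  # middle value of the sorted data

print(c(mean = loan_mean, median = loan_median))
```

When the two measures differ this much, the median is usually the more representative "typical" value.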

It is also possible to calculate the median of a single variable in the data, as shown in the lines of code below.

print(median(dat$Income))
print(median(dat$Loan_amount))

Output:

[1] 508350
[1] 76000
 

Mode

Mode represents the most frequent value of a variable in the data and is the only central tendency measure that can be used with both numeric and categorical variables.

To find the mode of the categorical variables with summary(), we first convert the five 'chr' variables into 'factor' variables. These five variables are 'Marital_status', 'Is_graduate', 'Credit_score', 'approval_status', and 'Sex'.

The first line of code below creates a vector of the column positions of these variables in the dataset. The second line uses the lapply() function to convert these columns, whose positions are stored in 'names', into factors. The third line prints the structure of the data.

names <- c(1,2,5,6,8)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)

Output:

Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, No, No,...
$ Is_graduate     <fct> Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes...
$ Income          <int> 306800, 702100, 558800, 534500, 468000, 412700...
$ Loan_amount     <int> 43500, 104000, 66500, 64500, 135000, 63000, 55...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Sati...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
$ Age             <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74...
$ Sex             <fct> M, M, M, M, M, M, F, M, F, M, M, M, F, F, F, M...
$ Investment      <int> 199420, 456365, 363220, 347425, 304200, 268255...
 

The output shows that all five variables have been converted into 'factor' variables. Now, we can print the label-wise frequency of each variable with the line of code below.

summary(dat[,c(1,2,5,6,8)])

Output:

 Marital_status Is_graduate            Credit_score approval_status Sex
 No :209        No :130     Not _satisfactory:128   No :190         F:111
 Yes:391        Yes:470     Satisfactory     :472   Yes:410         M:489
 

The mode for the variable 'Marital_status' is the label 'Yes' (391 of 600 applicants), which means the majority of the applicants were married. Similarly, the mode for the variable 'Sex' is the label 'M' (489 of 600 applicants), indicating that the majority of the applicants were male.

It is also possible to find the mode of a single variable by tabulating its values with the table() function, as shown in the line of code below.

table(dat$Credit_score)

Output:

Not _satisfactory      Satisfactory
              128               472
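Base R has no built-in function for the statistical mode (the base mode() function reports an object's storage mode instead), which is why the frequency tables above are read by eye. The sketch below wraps that step in a small helper. The name stat_mode and the example vector are hypothetical and not part of the original guide.

```r
# stat_mode() is a hypothetical helper name, not a base R function
stat_mode <- function(x) {
  counts <- table(x)                                   # frequency of each distinct value
  names(counts)[as.vector(counts) == max(counts)]      # all values tied for the highest count
}

# Illustrative vector, not taken from data_de.csv
credit <- c("Satisfactory", "Not_satisfactory", "Satisfactory", "Satisfactory")
print(stat_mode(credit))
```

Because the result comes from the names of the table, it is always returned as character strings, even for numeric input, and ties return every tied value.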
 

Measures of Dispersion

The extent to which a distribution is stretched or squeezed is measured by dispersion, also referred to as variability, scatter, or spread. The most popular measures of dispersion are standard deviation, variance, and the interquartile range.

Standard Deviation

Standard deviation quantifies how much the values of a variable vary around its mean. A low standard deviation indicates that the data points tend to be close to the mean, while a high standard deviation indicates that they are spread over a wider range. Together with the mean, it is also used when checking whether the data have a normal (or nearly normal) distribution. The line of code below prints the standard deviation of all the numerical variables in the data.

sapply(dat[,c(3,4,7,9)], sd)

Output:

      Income  Loan_amount          Age   Investment
711421.81415 724293.48078     14.72851 203058.62713
 

Standard deviation values are expressed in the units of the variable, so they should be interpreted in conjunction with the mean. For example, 'Income' is measured in USD and 'Age' in years, so comparing the dispersion of these two variables based on standard deviation alone would be misleading.
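One common way to read the standard deviation in conjunction with the mean is the coefficient of variation, the standard deviation divided by the mean. It is not part of the original walkthrough and is offered here as an assumption about what such a comparison could look like. The sketch uses the means and standard deviations reported in the outputs above, and the vector names means, sds, and cv are illustrative.

```r
# Means and standard deviations as reported in the outputs above
means <- c(Income = 705541.33, Loan_amount = 323793.67,
           Age = 49.45, Investment = 161066.97)
sds   <- c(Income = 711421.81, Loan_amount = 724293.48,
           Age = 14.72851, Investment = 203058.63)

cv <- sds / means   # coefficient of variation: unitless, so comparable across variables
print(round(cv, 2))
```

On this scale, the standard deviation of 'Age' is roughly 30% of its mean, while for 'Income' it is about as large as the mean itself, so income is relatively far more dispersed.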

It is also possible to calculate the standard deviation of a single variable, as shown in the lines of code below.

print(sd(dat$Income))
print(sd(dat$Loan_amount))

Output:

[1] 711421.8
[1] 724293.5
 

Variance

Variance is the square of the standard deviation and, equivalently, the covariance of a variable with itself. The line of code below prints the variance of all the numerical variables in the dataset. The variance is interpreted like the standard deviation, except that it is expressed in squared units.

sapply(dat[,c(3,4,7,9)], var)

Output:

      Income  Loan_amount          Age   Investment
5.061210e+11 5.246010e+11 2.169290e+02 4.123281e+10
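The two descriptions of variance given above (the squared standard deviation and the covariance of a variable with itself) can be checked numerically. The sketch below does so on a small made-up vector. The values and the names age and v are illustrative assumptions.

```r
# Illustrative ages, not taken from data_de.csv
age <- c(22, 36, 51, 61, 76)

v <- var(age)          # sample variance (divides by n - 1)
print(v)
print(sd(age)^2)       # same value: variance is the squared standard deviation
print(cov(age, age))   # same value: covariance of the variable with itself
```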
 

IQR

The Interquartile Range (IQR) is calculated as the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile). The IQR can be calculated using the IQR() function, as shown in the line of code below.

sapply(dat[,c(3,4,7,9)], IQR)

Output:

     Income Loan_amount         Age  Investment
     381125       69250          25       89315
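As a check on the definition, the sketch below computes the two quartiles with quantile() and subtracts them, then compares the result with IQR(). The vector and the names age, q, and iqr_manual are made up for illustration.

```r
# Illustrative ages, not taken from data_de.csv
age <- c(22, 30, 36, 45, 51, 58, 61, 70, 76)

q <- quantile(age, probs = c(0.25, 0.75))  # lower and upper quartiles
iqr_manual <- unname(q[2] - q[1])          # Q3 - Q1

print(iqr_manual)
print(IQR(age))                            # matches the manual calculation
```

Both calls rely on R's default quantile method, so they agree as long as the same method is used in each.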
 

Skewness

Skewness is a measure of symmetry, or the lack of it, of a distribution about its mean. The skewness value can be positive, negative, zero, or undefined. In a perfectly symmetrical unimodal distribution, the mean, median, and mode all have the same value. Most of the variables in our data are not symmetrical, which is why their mean and median differ.

The lines of code below use the skewness() function from the e1071 package, loaded earlier, to compute and print the skewness value for all the numerical variables.

skew_val <- apply(dat[,c(3,4,7,9)], 2, skewness)
print(skew_val)

Output:

     Income Loan_amount         Age  Investment
 5.31789378  4.98136968 -0.05525976  8.99320361
 

The skewness values can be interpreted in the following manner:

  1. Highly skewed distribution: If the skewness value is less than −1 or greater than +1.

  2. Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½ and +1.

  3. Approximately symmetric distribution: If the skewness value is between −½ and +½.
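The rules above can be applied to the reported skewness values with a small helper. This is a sketch under assumptions: classify_skew is a hypothetical name, and treating a value of exactly ½ or 1 as moderately skewed is a choice made here, since the list does not say which side the boundaries fall on.

```r
# classify_skew() is a hypothetical helper; thresholds follow the list above.
# Assumption: absolute values of exactly 0.5 and 1 count as "moderately skewed".
classify_skew <- function(s) {
  a <- abs(s)
  ifelse(a > 1, "highly skewed",
         ifelse(a >= 0.5, "moderately skewed", "approximately symmetric"))
}

# Skewness values as reported in the output above
skew_val <- c(Income = 5.31789378, Loan_amount = 4.98136968,
              Age = -0.05525976, Investment = 8.99320361)
print(classify_skew(skew_val))
```

By these rules, 'Income', 'Loan_amount', and 'Investment' are highly (positively) skewed, while 'Age' is approximately symmetric.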

Putting Everything Together

In the previous sections, we learned how to calculate the measures of central tendency and dispersion, individually. However, many of these measures can be calculated simultaneously, using the summary() function, which will print the summary statistics of all the variables. The line of code below performs this operation on the data.

summary(dat)

Output:

 Marital_status Is_graduate     Income         Loan_amount
 No :209        No :130     Min.   :  30000   Min.   :  10900
 Yes:391        Yes:470     1st Qu.: 384975   1st Qu.:  61000
                            Median : 508350   Median :  76000
                            Mean   : 705541   Mean   : 323794
                            3rd Qu.: 766100   3rd Qu.: 130250
                            Max.   :8444900   Max.   :7780000

            Credit_score approval_status      Age        Sex       Investment
 Not _satisfactory:128   No :190         Min.   :22.00   F:111   Min.   :   6000
 Satisfactory     :472   Yes:410         1st Qu.:36.00   M:489   1st Qu.:  79400
                                         Median :51.00           Median : 106740
                                         Mean   :49.45           Mean   : 161067
                                         3rd Qu.:61.00           3rd Qu.: 168715
                                         Max.   :76.00           Max.   :3466580
 
 

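The output above prints the key summary statistics for every variable: the minimum, first quartile, median (50th percentile), mean, third quartile, and maximum for the numerical variables, and the label-wise counts for the factor variables. The IQR can be calculated as the difference between the third and first quartile values.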
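As a worked check, the quartiles printed by summary() can be subtracted by hand. The sketch below copies them from the output above into two vectors whose names (q1, q3, iqr_from_summary) are illustrative.

```r
# Quartiles copied from the summary() output above
q1 <- c(Income = 384975, Loan_amount = 61000, Age = 36, Investment = 79400)
q3 <- c(Income = 766100, Loan_amount = 130250, Age = 61, Investment = 168715)

iqr_from_summary <- q3 - q1   # interquartile range per variable
print(iqr_from_summary)
```

The differences (381125, 69250, 25, and 89315) agree with the values reported in the IQR section.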

Summary Statistics using Multiple Variables

Sometimes we may want to understand a statistic across a combination of two or more categories, for example the mean of the numerical variables for each combination of two categorical variables.

The first line of code below uses the aggregate() function to create a table of means for all the numerical variables, across the two categorical variables 'Sex' and 'approval_status'. The second line of code prints the output.

agg <- aggregate(dat[,c(3,4,7,9)], by = list(dat$Sex, dat$approval_status), FUN = mean)
agg

Output:

  Group.1 Group.2   Income Loan_amount      Age Investment
1       F      No 544824.3    228027.0 44.16216   132583.8
2       M      No 734543.1    353334.0 50.32026   158825.1
3       F     Yes 646274.3    256114.9 51.55405   157135.4
4       M     Yes 723086.0    335793.5 49.17262   166090.2
 

An interesting observation from the output above is that the female applicants whose loan application was approved had noticeably higher average income, age, and investment values than the female applicants whose application was not approved. Group differences like this can be useful for feature engineering, although the table of means alone does not establish that they are statistically significant.
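The grouping columns in the output above are labelled Group.1 and Group.2. As a hedged alternative, aggregate() also accepts a formula, which keeps the original variable names in the result. The sketch below shows this on a made-up data frame. The name toy and all of its values are illustrative and not taken from data_de.csv.

```r
# Hypothetical miniature of the loan data (values are illustrative only)
toy <- data.frame(
  Sex             = c("F", "F", "M", "M", "F", "M"),
  approval_status = c("No", "Yes", "No", "Yes", "Yes", "No"),
  Income          = c(300000, 500000, 400000, 600000, 700000, 200000),
  Age             = c(40, 50, 45, 55, 52, 35)
)

# Group means with readable column names instead of Group.1 / Group.2
toy_agg <- aggregate(cbind(Income, Age) ~ Sex + approval_status,
                     data = toy, FUN = mean)
print(toy_agg)
```

Note that the formula interface drops rows with missing values by default, so its results can differ from the by = list(...) form when the data contain NAs.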

Conclusion

In this guide, you have learned about the fundamentals of the most widely used descriptive statistics and their calculations with R, an extremely powerful statistical programming language. You have learned about the following topics in this guide:

  1. Mean

  2. Median

  3. Mode

  4. Standard Deviation

  5. Variance

  6. Interquartile Range

  7. Skewness
