Introduction

Descriptive statistics are the foundation for summarizing data. They are commonly divided into measures of central tendency and measures of dispersion. Measures of central tendency include the mean, median, and mode, while measures of dispersion (also called variability) include the standard deviation, variance, and interquartile range. In this guide, you will learn how to compute these measures in R and use them to interpret data.

This guide uses fictitious data on loan applicants, with 600 observations and 9 variables, as described below:

Marital_status: Whether the applicant is married ("Yes") or not ("No").

Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").

Income: Annual Income of the applicant (in USD).

Loan_amount: Loan amount (in USD) for which the application was submitted.

Credit_score: Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").

Age: The applicant’s age in years.

Sex: Whether the applicant is female (F) or male (M).

approval_status: Whether the loan application was approved ("Yes") or not ("No").

Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.

Let us start by loading the required libraries and the data.

`library(readr)`

`library(dplyr)`

`library(e1071)`

`dat <- read_csv("data_de.csv")`

`glimpse(dat)`

Output:

`Observations: 600`

`Variables: 9`

`$ Marital_status <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", ...`

`$ Is_graduate <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "...`

`$ Income <int> 306800, 702100, 558800, 534500, 468000, 412700, 257100,...`

`$ Loan_amount <int> 43500, 104000, 66500, 64500, 135000, 63000, 55500, 2500...`

`$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satisf...`

`$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...`

`$ Age <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74,...`

`$ Sex <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", "M", "M", ...`

`$ Investment <int> 199420, 456365, 363220, 347425, 304200, 268255, 167115,...`

Five of the variables are categorical (labelled as 'chr') while the remaining four are numerical (labelled as 'int').

Measures of central tendency describe the center of the data and are often represented by the mean, median, and mode.

The mean is the arithmetic average of the data. It is calculated by taking the sum of the values and dividing by the number of observations. The **mean()** function is used to calculate this in R. If the variable contains missing values, the argument 'na.rm = TRUE' tells **mean()** to ignore them; otherwise the result is 'NA'.
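A minimal sketch of handling missing values with the 'na.rm = TRUE' argument of **mean()**, using a short made-up vector rather than the guide's data:

```r
# Illustrative values only; 'income' is a hypothetical vector, not the guide's data.
income <- c(306800, 702100, NA, 534500)

# With a missing value present, mean() returns NA by default.
mean_default <- mean(income)

# na.rm = TRUE drops the missing value and averages the remaining three.
mean_no_na <- mean(income, na.rm = TRUE)

print(mean_default)
print(mean_no_na)
```

The same 'na.rm' argument is accepted by **median()**, **sd()**, **var()**, and **IQR()**.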

The line of code below uses the **sapply()** function to calculate the mean of the numerical variables in the data. The index 'c(3,4,7,9)' selects the four numerical columns: 'Income', 'Loan_amount', 'Age', and 'Investment'.

From the output below, we can infer that the average age of the applicants is 49.45 years, the average annual income is USD 705,541, and the average investment is USD 161,067. The output also shows that the average loan applied for is USD 323,794.

`sapply(dat[,c(3,4,7,9)], mean)`

Output:

`Income Loan_amount Age Investment`

`705541.33 323793.67 49.45 161066.97`
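The column positions 'c(3,4,7,9)' are hard-coded and would need updating if the column order changed. As a hedged alternative, the sketch below selects numeric columns by type with **is.numeric()**. The 'toy' data frame is a made-up stand-in for 'dat'.

```r
# 'toy' is a small hypothetical stand-in for 'dat'.
toy <- data.frame(
  Sex    = c("M", "F", "M"),
  Income = c(306800L, 702100L, 558800L),
  Age    = c(76L, 75L, 75L),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric (integer columns count too).
num_cols <- sapply(toy, is.numeric)

# drop = FALSE keeps a data frame even if only one column is selected.
col_means <- sapply(toy[, num_cols, drop = FALSE], mean)
print(col_means)
```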

It is also possible to calculate the mean of a single variable, as shown below.

`print(mean(dat$Income))`

`print(mean(dat$Loan_amount))`

Output:

`[1] 705541.3`

`[1] 323793.7`

The median is the middle value of a variable when its observations are sorted. The line of code below uses the **median()** function to print the median of the numerical variables in the data.

`sapply(dat[,c(3,4,7,9)], median)`

Output:

`Income Loan_amount Age Investment`

`508350 76000 51 106740`

From the output, we can infer that the median age of the applicants is 51 years, the median annual income is USD 508,350, and the median loan applied for is USD 76,000.
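The gap between the mean and the median of 'Income' and 'Loan_amount' hints at how differently the two measures react to extreme values. A small sketch with made-up numbers (not the guide's data):

```r
# Hypothetical loan amounts; the last value is an extreme outlier.
loans <- c(40, 50, 60, 70, 1000)

loans_mean   <- mean(loans)    # 244: pulled upward by the outlier
loans_median <- median(loans)  # 60: the middle of the sorted values

print(loans_mean)
print(loans_median)
```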

It is also possible to calculate the median of a single variable, as shown below.

`print(median(dat$Income))`

`print(median(dat$Loan_amount))`

Output:

`[1] 508350`

`[1] 76000`

The mode is the most frequent value of a variable and is the only measure of central tendency that can be used with both numeric and categorical variables.

To find the mode of the categorical variables in R, we first convert the five 'chr' variables into factors, so that **summary()** reports the frequency of each label. These five variables are 'Marital_status', 'Is_graduate', 'Credit_score', 'approval_status', and 'Sex'.

The *first line of code* below creates a vector of the column positions of the above variables in the dataset. The *second line* uses the **lapply()** function to convert these columns, stored in 'names', into factors. The *third line* prints the structure of the updated data.

`names <- c(1,2,5,6,8)`

`dat[,names] <- lapply(dat[,names] , factor)`

`glimpse(dat)`

Output:

`Observations: 600`

`Variables: 9`

`$ Marital_status <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, No, No,...`

`$ Is_graduate <fct> Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes...`

`$ Income <int> 306800, 702100, 558800, 534500, 468000, 412700...`

`$ Loan_amount <int> 43500, 104000, 66500, 64500, 135000, 63000, 55...`

`$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Sati...`

`$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...`

`$ Age <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74...`

`$ Sex <fct> M, M, M, M, M, M, F, M, F, M, M, M, F, F, F, M...`

`$ Investment <int> 199420, 456365, 363220, 347425, 304200, 268255...`

The output shows that all five variables have been converted into factors. Now, we can print the label-wise frequency of each variable with the line of code below.

`summary(dat[,c(1,2,5,6,8)])`

Output:

`Marital_status Is_graduate Credit_score approval_status Sex`

`No :209 No :130 Not _satisfactory:128 No :190 F:111`

`Yes:391 Yes:470 Satisfactory :472 Yes:410 M:489`

The mode for the variable 'Marital_status' is the label 'Yes', which means the majority of the applicants were married. Similarly, the mode for the variable 'Sex' is the label 'M', indicating that the majority of the applicants were male.
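The factor conversion used earlier picks the character columns by position. As a sketch under the assumption that every character column should become a factor, the columns can instead be detected by type. The 'toy' data frame below is made up for illustration.

```r
# 'toy' is a small hypothetical stand-in for 'dat'.
toy <- data.frame(
  Sex          = c("M", "F", "M", "M"),
  Credit_score = c("Satisfactory", "Satisfactory", "Not_satisfactory", "Satisfactory"),
  Age          = c(76L, 75L, 75L, 74L),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each of them to a factor.
chr_cols <- sapply(toy, is.character)
toy[chr_cols] <- lapply(toy[chr_cols], factor)

# summary() on a factor returns the count of each label.
sex_counts <- summary(toy$Sex)
print(sex_counts)
```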

It is also possible to find the mode of a single variable by tabulating its values, as shown in the line of code below.

`table(dat$Credit_score)`

Output:

`Not _satisfactory Satisfactory`

`128 472`
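Base R has no built-in function for the statistical mode (its **mode()** function reports an object's storage type instead). A minimal sketch of a helper built on **table()** follows; the name 'stat_mode' is hypothetical and not part of any package.

```r
# stat_mode() is a hypothetical helper, not part of base R or this guide's code.
# It returns the most frequent value(s) of x as character strings.
# Ties yield more than one value; NA entries are ignored by table().
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

# Illustrative values only.
credit <- c("Satisfactory", "Satisfactory", "Not_satisfactory", "Satisfactory")
print(stat_mode(credit))            # "Satisfactory"
print(stat_mode(c(1, 2, 2, 3, 3)))  # "2" "3" (a tie)
```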

The extent to which a distribution is stretched or squeezed is measured by dispersion, also referred to as variability, scatter, or spread. The most popular measures of dispersion are standard deviation, variance, and the interquartile range.

Standard deviation is a measure used to quantify the amount of variation of a set of data values from its mean. A low standard deviation for a variable indicates that the data points tend to be close to its mean, and vice versa. It is also used to examine if the data has a normal (or nearly normal) distribution. The line of code below prints the standard deviation of all the numerical variables in the data.

`sapply(dat[,c(3,4,7,9)], sd)`

Output:

`Income Loan_amount Age Investment`

`711421.81415 724293.48078 14.72851 203058.62713`

When interpreting standard deviation values, it is important to read them in conjunction with the mean. For example, 'Income' and 'Age' are measured in different units (USD and years), so comparing the dispersion of these two variables on the basis of standard deviation alone would be misleading.
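One common way to relate the standard deviation to the mean, which this guide does not otherwise use, is the coefficient of variation (standard deviation divided by mean). A sketch using the means and standard deviations printed earlier in this guide:

```r
# Means and standard deviations copied from the outputs printed in this guide.
means <- c(Income = 705541.33, Loan_amount = 323793.67,
           Age = 49.45, Investment = 161066.97)
sds   <- c(Income = 711421.81415, Loan_amount = 724293.48078,
           Age = 14.72851, Investment = 203058.62713)

# Coefficient of variation: a unit-free ratio, so variables can be compared directly.
cv <- sds / means
print(round(cv, 2))
```

On this measure, 'Age' varies far less relative to its mean than the three monetary variables do.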

It is also possible to calculate the standard deviation of a single variable, as shown in the lines of code below.

`print(sd(dat$Income))`

`print(sd(dat$Loan_amount))`

Output:

`[1] 711421.8`

`[1] 724293.5`

Variance is the square of the standard deviation, and equals the covariance of a variable with itself. The line of code below prints the variance of all the numerical variables in the dataset. The variance is interpreted like the standard deviation, except that it is expressed in squared units.

`sapply(dat[,c(3,4,7,9)], var)`

Output:

`Income Loan_amount Age Investment`

`5.061210e+11 5.246010e+11 2.169290e+02 4.123281e+10`
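Both statements about variance (square of the standard deviation, and covariance of a variable with itself) can be checked numerically. The sketch below uses the first five 'Loan_amount' values shown in the **glimpse()** output as a small illustrative sample.

```r
# First five Loan_amount values from the glimpse() output, used as a small sample.
x <- c(43500, 104000, 66500, 64500, 135000)

# var() and sd() in R use the sample formula with an n - 1 denominator.
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

print(all.equal(var(x), manual_var))  # TRUE
print(all.equal(var(x), sd(x)^2))     # TRUE
print(all.equal(var(x), cov(x, x)))   # TRUE
```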

The Interquartile Range (IQR) is calculated as the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile). The IQR can be calculated using the **IQR()** function, as shown in the line of code below.

`sapply(dat[,c(3,4,7,9)], IQR)`

Output:

`Income Loan_amount Age Investment`

`381125 69250 25 89315`
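As a sketch of the definition, the IQR can also be computed by hand from the 25th and 75th percentiles returned by **quantile()**. The 'age' vector below is made up for illustration.

```r
# Illustrative values only, not the guide's data.
age <- c(22, 36, 51, 61, 76)

# Lower and upper quartiles (R's default quantile method).
q <- quantile(age, probs = c(0.25, 0.75))

iqr_manual  <- unname(q[2] - q[1])  # 61 - 36 = 25
iqr_builtin <- IQR(age)             # same value from the built-in function

print(iqr_manual)
print(iqr_builtin)
```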

Skewness is a measure of symmetry, or the lack of it, for a real-valued random variable about its mean. The skewness value can be positive, negative, zero, or undefined. In a perfectly symmetrical unimodal distribution, the mean, median, and mode all have the same value. However, the variables in our data are not perfectly symmetrical, so their measures of central tendency differ.
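To make the idea concrete, the sketch below computes a moment-based skewness (the third central moment divided by the second central moment raised to the power 1.5) in base R. The helper name 'moment_skew' is hypothetical, and this formula is only one of several variants, so its values may differ slightly from those returned by the e1071 function used in this guide.

```r
# moment_skew() is a hypothetical helper illustrating one skewness formula.
moment_skew <- function(x) {
  d <- x - mean(x)           # deviations from the mean
  mean(d^3) / mean(d^2)^1.5  # third moment scaled by the second
}

# Illustrative values only.
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 2, 3, 4, 50)

print(moment_skew(symmetric))      # 0 for a symmetric sample
print(moment_skew(right_skewed))   # positive: long tail to the right
print(moment_skew(-right_skewed))  # negative: long tail to the left
```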

The line of code below prints the skewness value for all the numerical variables.

`skew_val <- apply(dat[,c(3,4,7,9)], 2, skewness)`

`print(skew_val)`

Output:

`Income Loan_amount Age Investment`

`5.31789378 4.98136968 -0.05525976 8.99320361`

The skewness values can be interpreted in the following manner:

Highly skewed distribution: If the skewness value is less than −1 or greater than +1.

Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½ and +1.

Approximately symmetric distribution: If the skewness value is between −½ and +½.
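The rules above can be sketched as a small helper applied to the skewness values printed earlier. The function name 'classify_skew' is hypothetical, and treating values of exactly ½ or 1 as "moderately skewed" is an assumption, since the rules do not say which side the boundaries belong to.

```r
# classify_skew() is a hypothetical helper; boundary handling is an assumption.
classify_skew <- function(s) {
  a <- abs(s)
  ifelse(a > 1, "highly skewed",
         ifelse(a >= 0.5, "moderately skewed", "approximately symmetric"))
}

# Skewness values copied from the output printed in this guide.
skew_val <- c(Income = 5.31789378, Loan_amount = 4.98136968,
              Age = -0.05525976, Investment = 8.99320361)

skew_labels <- classify_skew(skew_val)
print(skew_labels)
```

By these rules, 'Age' is approximately symmetric, while 'Income', 'Loan_amount', and 'Investment' are highly skewed.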

In the previous sections, we learned how to calculate the measures of central tendency and dispersion individually. However, many of these measures can be calculated simultaneously using the **summary()** function, which will print the summary statistics of all the variables. The line of code below performs this operation on the data.

`summary(dat)`

Output:

`Marital_status Is_graduate Income Loan_amount`

`No :209 No :130 Min. : 30000 Min. : 10900`

`Yes:391 Yes:470 1st Qu.: 384975 1st Qu.: 61000`

`Median : 508350 Median : 76000`

`Mean : 705541 Mean : 323794`

`3rd Qu.: 766100 3rd Qu.: 130250`

`Max. :8444900 Max. :7780000`

`Credit_score approval_status Age Sex Investment`

`Not _satisfactory:128 No :190 Min. :22.00 F:111 Min. : 6000`

`Satisfactory :472 Yes:410 1st Qu.:36.00 M:489 1st Qu.: 79400`

`Median :51.00 Median : 106740`

`Mean :49.45 Mean : 161067`

`3rd Qu.:61.00 3rd Qu.: 168715`

`Max. :76.00 Max. :3466580`

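The above output prints the important summary statistics of all the variables, such as the mean, median (50th percentile), minimum, and maximum values of the numerical variables and the label counts of the factors. We can calculate the IQR from the first and third quartile values. For 'Income', for example, 766,100 − 384,975 = 381,125, which matches the IQR() output above.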

Sometimes we may want to understand a statistic across a combination of two or more categories, for example, the mean of the numerical variables across two or more categorical variables.

The *first line of code* below uses the **aggregate()** function to create a table of means for all the numerical variables, across the two categorical variables 'Sex' and 'approval_status'. The *second line of code* prints the output.

`agg <- aggregate(dat[,c(3,4,7,9)], by = list(dat$Sex, dat$approval_status), FUN = mean)`

`agg`

Output:

`Group.1 Group.2 Income Loan_amount Age Investment`

`1 F No 544824.3 228027.0 44.16216 132583.8`

`2 M No 734543.1 353334.0 50.32026 158825.1`

`3 F Yes 646274.3 256114.9 51.55405 157135.4`

`4 M Yes 723086.0 335793.5 49.17262 166090.2`

An interesting inference from the output above is that female applicants whose loan application was approved had noticeably higher mean income, age, and investment values than female applicants whose application was not approved. This inference can be useful for feature engineering.
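The default column names 'Group.1' and 'Group.2' in the output above are not very descriptive. As a hedged alternative, the formula interface of **aggregate()** labels the grouping columns by name. The 'toy' data frame below is made up and stands in for 'dat'.

```r
# 'toy' is a small hypothetical stand-in for 'dat'.
toy <- data.frame(
  Sex             = c("F", "F", "M", "M", "F", "M"),
  approval_status = c("No", "Yes", "No", "Yes", "Yes", "No"),
  Income          = c(300000, 500000, 400000, 700000, 600000, 450000),
  Age             = c(30, 45, 50, 40, 55, 60)
)

# Left of ~ : variables to summarize; right of ~ : grouping variables.
agg_toy <- aggregate(cbind(Income, Age) ~ Sex + approval_status,
                     data = toy, FUN = mean)
print(agg_toy)
```

The result has one row per combination of 'Sex' and 'approval_status', with the grouping columns carrying their original names.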

In this guide, you have learned about the fundamentals of the most widely used descriptive statistics and their calculation in R, an extremely powerful statistical programming language. You have learned about the following topics: mean, median, mode, standard deviation, variance, interquartile range, and skewness.
