Introduction


Summarizing data is undoubtedly one of the most common data science and analytics tasks. For predictive modeling, you also need to understand the concept of probability, which forms the basis of many machine learning algorithms like logistic regression. In this guide, you will learn the techniques of summarizing data and deducing probabilities in R.

You'll use a fictitious dataset of loan applications containing 600 observations and nine variables, as described below:

`Marital_status`: Whether the applicant is married ("Yes") or not ("No").

`Is_graduate`: Whether the applicant is a graduate ("Yes") or not ("No").

`Income`: Annual income of the applicant (in USD).

`Loan_amount`: Loan amount (in USD) for which the application was submitted.

`Credit_score`: Whether the applicant's credit score is good ("Satisfactory") or not ("Not_satisfactory").

`Age`: The applicant's age in years.

`Sex`: Whether the applicant is female (F) or male (M).

`approval_status`: Whether the loan application was approved ("Yes") or not ("No").

`Investment`: Investments in stocks and mutual funds (in USD) declared by the applicant.

The lines of code below load the required libraries and the data.

```r
library(tidyverse)
library(readr)
library(dplyr)
library(e1071)
library("ggplot2")
library("reshape2")
library("knitr")

dat <- read_csv("data.csv")
glimpse(dat)
```

Output:

```
Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Is_graduate     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", ...
$ Income          <int> 30680, 70210, 55880, 53450, 46800, 41270, 25710, 15223...
$ Loan_amount     <int> 4350, 10400, 6650, 6450, 13500, 6300, 5550, 250000, 76...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Age             <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", "M", "M",...
$ Investment      <dbl> 19942.0, 45636.5, 36322.0, 34742.5, 30420.0, 26825.5, ...
```

The above output shows that five variables are categorical (labeled as `chr`), while the remaining four are numerical (labeled as `int` or `dbl`). You need to convert the character variables to factor variables with the code below.

```r
dat$Marital_status = as.factor(dat$Marital_status)
dat$Is_graduate = as.factor(dat$Is_graduate)
dat$Credit_score = as.factor(dat$Credit_score)
dat$approval_status = as.factor(dat$approval_status)
dat$Sex = as.factor(dat$Sex)
glimpse(dat)
```

Output:

```
Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Ye...
$ Is_graduate     <fct> Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, N...
$ Income          <int> 30680, 70210, 55880, 53450, 46800, 41270, 25710, 15223...
$ Loan_amount     <int> 4350, 10400, 6650, 6450, 13500, 6300, 5550, 250000, 76...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Age             <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74...
$ Sex             <fct> M, M, M, M, M, M, F, M, F, M, M, M, F, F, F, M, M, M, ...
$ Investment      <dbl> 19942.0, 45636.5, 36322.0, 34742.5, 30420.0, 26825.5, ...
```

The changes have been made and you are ready to summarize and analyze the data.

As a data scientist, you'll often be required to summarize individual variables in data. One of the most powerful ways to do this is through descriptive statistics, which include measures of central tendency and measures of dispersion. Measures of central tendency include the mean, median, and mode, while measures of dispersion include the standard deviation, variance, and interquartile range (IQR). Some of these measures are briefly explained below.

Mean: the arithmetic average of the data

Median: the middle value of a variable once its values are sorted, which divides the data into two equal halves

Mode: the most frequent value of a variable and the only central tendency measure that can be used with both numeric and categorical variables

Standard deviation: quantifies the amount of variation from the mean of a set of data values
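To make these definitions concrete, here is a minimal sketch that computes the mean and the sample standard deviation by hand and compares the latter with R's built-in `sd()`. The vector `x` holds made-up values for illustration and is not part of the loan data.

```r
# Toy values, invented for illustration only (not from the loan data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean_x   <- sum(x) / length(x)   # arithmetic average
median_x <- median(x)            # middle value of the sorted data

# Sample standard deviation: note the n - 1 denominator that sd() also uses
sd_x <- sqrt(sum((x - mean_x)^2) / (length(x) - 1))

print(mean_x)
print(median_x)
print(all.equal(sd_x, sd(x)))    # checks the manual result against sd()
```

The `n - 1` denominator is worth remembering: `sd()` returns the sample standard deviation, not the population one.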

The lines of code below calculate the mean, median, and standard deviation of the `Income` and `Loan_amount` variables, respectively.

```r
# Income
print(mean(dat$Income))
print(median(dat$Income))
print(sd(dat$Income))

# Loan_amount
print(mean(dat$Loan_amount))
print(median(dat$Loan_amount))
print(sd(dat$Loan_amount))
```

Output:

```
[1] 70554.13
[1] 50835
[1] 71142.18
[1] 32379.37
[1] 7600
[1] 72429.35
```

The above code calculates the mean, median, and standard deviation. To find the mode, create the frequency table of the categorical variable, as shown in the code below.

```r
table(dat$Credit_score)
```

Output:

```
Not _satisfactory      Satisfactory 
              128               472 
```

The output shows that the mode of the variable `Credit_score` is `Satisfactory`, the most frequent label, which occurs 472 times.
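Reading the mode off a frequency table works for a handful of labels, but it can also be extracted programmatically. The sketch below uses a hypothetical helper, `get_mode()`, which is not part of base R or of the original guide, and a made-up vector of labels.

```r
# Hypothetical helper: returns the most frequent value of a vector as a string.
# With ties, which.max() keeps only the first label in table() order.
get_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Made-up labels for illustration
scores <- c("Satisfactory", "Not_satisfactory", "Satisfactory", "Satisfactory")
get_mode(scores)
```

The same helper could be applied to a column such as `dat$Credit_score`, assuming the data has been loaded as shown earlier.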

In the previous section, you used descriptive statistics to summarize one variable at a time. Often, however, you will want to summarize multiple variables together. For example, you might want to compute the mean of all the numerical variables in one line of code. This can be done with the `sapply()` function as shown below.

```r
sapply(dat[,c(3,4,7,9)], mean)
```

Output:

```
     Income Loan_amount         Age  Investment 
   70554.13    32379.37       49.45    16106.70 
```

The other method is to use the `summary()` function, which prints summary statistics for all the variables. The line of code below performs this operation.

```r
summary(dat)
```

Output:

```
 Marital_status Is_graduate     Income        Loan_amount    
 No :209        No :130     Min.   :  3000   Min.   :  1090  
 Yes:391        Yes:470     1st Qu.: 38498   1st Qu.:  6100  
                            Median : 50835   Median :  7600  
                            Mean   : 70554   Mean   : 32379  
                            3rd Qu.: 76610   3rd Qu.: 13025  
                            Max.   :844490   Max.   :778000  
            Credit_score approval_status      Age        Sex       Investment    
 Not _satisfactory:128   No :190         Min.   :22.00   F:111   Min.   :   600  
 Satisfactory     :472   Yes:410         1st Qu.:36.00   M:489   1st Qu.:  7940  
                                         Median :51.00           Median : 10674  
                                         Mean   :49.45           Mean   : 16107  
                                         3rd Qu.:61.00           3rd Qu.: 16872  
                                         Max.   :76.00           Max.   :346658  
```

The above output prints the key summary statistics of the numerical variables, including the mean, median, quartiles, minimum, and maximum values, along with the label counts of the factor variables. You can calculate the IQR as the difference between the third and first quartile values.
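For `Income`, for instance, the quartiles in the summary give an IQR of 76610 - 38498 = 38112. The sketch below shows the same calculation in code on a made-up vector, first from the quartiles and then with the built-in `IQR()` function.

```r
# Made-up values for illustration (not from the loan data)
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)

# First and third quartiles, using R's default quantile method
q <- quantile(x, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

print(iqr_manual)
print(IQR(x))   # built-in equivalent
```

On the loan data, `IQR(dat$Income)` would be the corresponding call, assuming `dat` is loaded as above.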

Sometimes you'll want to understand a statistic across a combination of two or more categories. For example, you might want the mean of the numerical variables for each combination of applicant sex and approval status. This can be done using the code below. The first line of code uses the `aggregate()` function to create a table of the means of all the numerical variables across the two categorical variables, `Sex` and `approval_status`. The second line of code prints the output.

```r
agg = aggregate(dat[,c(3,4,7,9)], by = list(dat$Sex, dat$approval_status), FUN = mean)
agg
```

Output:

```
  Group.1 Group.2 Income Loan_amount   Age Investment
1       F      No 544824      228027 44.16   132583.8
2       M      No 734543      353334 50.32   158825.1
3       F     Yes 646274      256114 51.55   157135.4
4       M     Yes 723086      335793 49.17   166090.2
```

An interesting observation from the above table is that the female applicants whose loan applications were approved had, on average, higher incomes, ages, and investment values than the female applicants whose applications were not approved. Observations like this can be useful when building machine learning models.
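The `Group.1` and `Group.2` column names in that table are generic. As a sketch of one alternative, the formula interface of `aggregate()` keeps the original variable names in the result. The `toy` data frame below is invented for illustration; only its column names mirror the loan data.

```r
# Toy data, invented for illustration; column names mirror the loan data
toy <- data.frame(
  Sex             = c("F", "F", "M", "M", "F", "M"),
  approval_status = c("Yes", "No", "Yes", "No", "Yes", "Yes"),
  Income          = c(50000, 30000, 60000, 40000, 70000, 80000),
  Age             = c(45, 30, 50, 35, 55, 40)
)

# Formula interface: response columns on the left, grouping columns on the right
agg_toy <- aggregate(cbind(Income, Age) ~ Sex + approval_status,
                     data = toy, FUN = mean)
agg_toy
```

The result has one row per combination of `Sex` and `approval_status`, with the grouping columns labeled by name.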

In simple terms, probability can be defined as the extent to which an event is likely to occur, and it is measured by the ratio of the number of favorable cases to the total number of possible cases. For example, the probability of randomly picking a red ball from a box containing three red and seven blue balls is 0.3. This is arrived at by dividing the number of favorable cases, which is three in this example, by the total number of possible cases, which is ten.
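The ball example can be written out in a few lines of R. This is a minimal sketch; the vector name `box` is chosen here for illustration.

```r
# The box from the example: three red and seven blue balls
box <- c(rep("red", 3), rep("blue", 7))

# Favorable cases divided by total cases
p_red <- sum(box == "red") / length(box)
print(p_red)

# Equivalent shortcut: the mean of a logical vector is the share of TRUE values
print(mean(box == "red"))
```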

You can apply this simple logic to calculate the probability of loan approval in the data. The `table()` function in the first line of code below gives the frequency distribution of approved (denoted by the label "Yes") and rejected (denoted by the label "No") applications. The second line of code uses the logic explained above to calculate the probability of a loan application getting approved.

```r
table(dat$approval_status)
410/(410+190)
```

Output:

```
[1] 0.6833333
```

You can also perform the above step by using the code below.

```r
prop.table(table(dat$approval_status))
```

Output:

```
       No       Yes 
0.3166667 0.6833333 
```

An important application of probability in data science is computing conditional probability. A conditional probability is the probability of an event A occurring given that another event B has already occurred. Mathematically, it is written as P(A | B), read as "the probability of A given B," and is computed as P(A and B) / P(B).

In this dataset, you may want to estimate the probability that a randomly selected application was approved given that the applicant was at least 40 years old. This is an example of conditional probability and can be calculated using the code below.

```r
dat %>%
  summarize(prob = sum(Age >= 40 & approval_status == "Yes", na.rm = TRUE) /
              sum(Age >= 40, na.rm = TRUE))
```

Output:

```
   prob
  <dbl>
1 0.684
```

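You can see that the probability comes out to about 0.68. This means that if you randomly select an applicant who was at least 40 years old, the probability that the application was approved is about 68 percent. Because this is almost identical to the overall approval rate of 0.683, being at least 40 appears to have little bearing on approval in this dataset.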
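It is easy to confuse the conditional probability P(approved | Age >= 40) with the joint probability P(Age >= 40 and approved). The sketch below computes both on a small made-up data frame (`toy`, not the loan data) to show how they differ.

```r
# Toy data, invented for illustration
toy <- data.frame(
  Age             = c(25, 45, 52, 38, 61, 47, 33, 58),
  approval_status = c("No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes")
)

older    <- toy$Age >= 40
approved <- toy$approval_status == "Yes"

p_joint <- mean(older & approved)   # P(Age >= 40 and approved)
p_older <- mean(older)              # P(Age >= 40)
p_cond  <- p_joint / p_older        # P(approved | Age >= 40)

print(p_joint)
print(p_cond)
print(mean(approved[older]))        # same conditional probability via subsetting
```

The joint probability divides by all records, while the conditional probability divides only by the records that satisfy the condition.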

You can repeat this for two categorical variables as well. For example, you may want to estimate the probability that a randomly selected application was approved given that the applicant's credit score was not satisfactory. The lines of code below will compute this probability.

```r
dat %>%
  summarize(prob = sum(Credit_score == "Not _satisfactory" & approval_status == "Yes", na.rm = TRUE) /
              sum(Credit_score == "Not _satisfactory", na.rm = TRUE))
```

Output:

```
      prob
     <dbl>
1 0.296875
```

The output above shows that the conditional probability that a loan application will be approved even if the credit score is not satisfactory is 29.7 percent. This insight can be useful to inform a risk management policy.
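When both variables are categorical, all the conditional probabilities can be read from a single table. The following sketch uses `prop.table()` with `margin = 1` on a two-way frequency table. The `credit` and `approval` vectors and their labels are made up for illustration.

```r
# Toy data, invented for illustration
credit   <- c("Good", "Good", "Bad", "Bad", "Good", "Bad", "Good", "Good")
approval <- c("Yes",  "Yes",  "No",  "Yes", "No",   "No",  "Yes",  "Yes")

counts <- table(credit, approval)

# margin = 1 normalizes within each row, so each row sums to 1
# and holds P(approval | credit)
cond <- prop.table(counts, margin = 1)
cond
```

Each row of `cond` is the distribution of approval outcomes for one credit label, so `cond["Bad", "Yes"]` is the probability of approval given a "Bad" label in this toy data.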

In this guide, you learned about the fundamentals of summarizing data for univariate and multivariate analysis. You also learned how to compute probabilities and conditional probabilities that'll help in understanding the data and generating meaningful insights.
