Summarizing data is one of the most common data science and analytics tasks. For predictive modeling, you also need to understand probability, which underpins many machine learning algorithms such as logistic regression. In this guide, you will learn techniques for summarizing data and computing probabilities in R, using a fictitious dataset of loan applications containing 600 observations and nine variables, described below:
Marital_status
: Whether the applicant is married ("Yes") or not ("No").
Is_graduate
: Whether the applicant is a graduate ("Yes") or not ("No").
Income
: Annual income of the applicant (in USD).
Loan_amount
: Loan amount (in USD) for which the application was submitted.
Credit_score
: Whether the applicant's credit score is good ("Satisfactory") or not ("Not_satisfactory").
Age
: The applicant's age in years.
Sex
: Whether the applicant is female (F) or male (M).
approval_status
: Whether the loan application was approved ("Yes") or not ("No").
Investment
: Investments in stocks and mutual funds (in USD) declared by the applicant.
The lines of code below load the required libraries and the data.
library(tidyverse)
library(readr)
library(dplyr)
library(e1071)
library(ggplot2)
library(reshape2)
library(knitr)

dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Is_graduate     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", ...
$ Income          <int> 30680, 70210, 55880, 53450, 46800, 41270, 25710, 15223...
$ Loan_amount     <int> 4350, 10400, 6650, 6450, 13500, 6300, 5550, 250000, 76...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Age             <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", "M", "M",...
$ Investment      <dbl> 19942.0, 45636.5, 36322.0, 34742.5, 30420.0, 26825.5, ...
The above output shows that five variables are categorical (labeled chr
) while the remaining four are numerical (labeled int
 or dbl
). You need to convert the character variables to factor variables with the code below.
dat$Marital_status = as.factor(dat$Marital_status)
dat$Is_graduate = as.factor(dat$Is_graduate)
dat$Credit_score = as.factor(dat$Credit_score)
dat$approval_status = as.factor(dat$approval_status)
dat$Sex = as.factor(dat$Sex)
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Ye...
$ Is_graduate     <fct> Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, N...
$ Income          <int> 30680, 70210, 55880, 53450, 46800, 41270, 25710, 15223...
$ Loan_amount     <int> 4350, 10400, 6650, 6450, 13500, 6300, 5550, 250000, 76...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Age             <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74...
$ Sex             <fct> M, M, M, M, M, M, F, M, F, M, M, M, F, F, F, M, M, M, ...
$ Investment      <dbl> 19942.0, 45636.5, 36322.0, 34742.5, 30420.0, 26825.5, ...
The changes have been made and you are ready to summarize and analyze the data.
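As an aside, when a data frame has many character columns, the repeated as.factor() calls above can be collapsed into a single step with dplyr's across() helper (available in dplyr 1.0 and later). The sketch below uses a small toy data frame rather than dat for illustration:

```r
library(dplyr)

# Toy data frame standing in for dat (illustrative values only)
toy <- data.frame(
  Sex    = c("M", "F", "M"),
  Income = c(30680, 70210, 55880),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor in one step
toy <- toy %>%
  mutate(across(where(is.character), as.factor))

str(toy)  # Sex is now a factor; Income stays numeric
```

On dat, the same call would convert Marital_status, Is_graduate, Credit_score, approval_status, and Sex at once.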
As a data scientist, you'll often be required to summarize individual variables in data. One of the most powerful ways to do this is through descriptive statistics, which includes measures of central tendency and measures of dispersion. Measures of central tendency include mean, median, and mode, while the measures of variability include standard deviation, variance, and the interquartile range (IQR). Some of these measures are briefly explained below.
Mean: the arithmetic average of the data
Median: the middle value of a variable, which divides the data into two equal halves
Mode: the most frequent value of a variable and the only central tendency measure that can be used with both numeric and categorical variables
The lines of code below calculate the mean, median, and standard deviation of the Income
and Loan_amount
variables.
# Income
print(mean(dat$Income))
print(median(dat$Income))
print(sd(dat$Income))

# Loan_amount
print(mean(dat$Loan_amount))
print(median(dat$Loan_amount))
print(sd(dat$Loan_amount))
Output:
[1] 70554.13
[1] 50835
[1] 71142.18

[1] 32379.37
[1] 7600
[1] 72429.35
The above code calculates the mean, median, and standard deviation. To find the mode, create the frequency table of the categorical variable, as shown in the code below.
table(dat$Credit_score)
Output:
Not _satisfactory      Satisfactory
              128               472
The output shows that the mode of the variable Credit_score
is Satisfactory
, the label that occurs most frequently (472 times).
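Base R has no built-in function for the statistical mode (the base mode() function reports an object's storage type instead). A small helper along the lines below — a sketch, not a standard library function — returns the most frequent value directly, for categorical and numeric vectors alike:

```r
# Return the most frequent value of a vector (first one wins on ties).
# Note: because table() names are character strings, the result is
# returned as a character string even for numeric input.
get_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

get_mode(c("Yes", "No", "Yes", "Yes"))  # "Yes"
get_mode(c(1, 2, 2, 3))                 # "2"
```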
In the previous section, you used descriptive statistics to summarize univariate variables. However, often you will want to summarize multiple variables together. For example, you might want to compute the mean of all the numerical variables in one line of code. This can be done with the sapply()
function as shown below.
sapply(dat[,c(3,4,7,9)], mean)
Output:
     Income Loan_amount         Age  Investment
   70554.13    32379.37       49.45    16106.70
The other method is to use the summary()
function, which will print the summary statistic of all the variables. The line of code below performs this operation.
summary(dat)
Output:
Marital_status Is_graduate     Income        Loan_amount
 No :209       No :130     Min.   :  3000   Min.   :  1090
 Yes:391       Yes:470     1st Qu.: 38498   1st Qu.:  6100
                           Median : 50835   Median :  7600
                           Mean   : 70554   Mean   : 32379
                           3rd Qu.: 76610   3rd Qu.: 13025
                           Max.   :844490   Max.   :778000
            Credit_score approval_status      Age        Sex       Investment
 Not _satisfactory:128   No :190         Min.   :22.00   F:111   Min.   :   600
 Satisfactory     :472   Yes:410         1st Qu.:36.00   M:489   1st Qu.:  7940
                                         Median :51.00           Median : 10674
                                         Mean   :49.45           Mean   : 16107
                                         3rd Qu.:61.00           3rd Qu.: 16872
                                         Max.   :76.00           Max.   :346658
The above output prints the important summary statistics of all the variables, including the mean, median, minimum, and maximum values, as well as the first and third quartiles, from which you can calculate the IQR.
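For example, from the summary() output above, the IQR of Income is 76610 − 38498 = 38112. On a raw numeric vector, the same figure comes from quantile() or the built-in IQR() helper; the snippet below uses a small hypothetical vector whose quartiles happen to match those values:

```r
# Hypothetical income values chosen so the quartiles match the summary above
x <- c(3000, 38498, 50835, 76610, 844490)

quantile(x, 0.75) - quantile(x, 0.25)  # 38112
IQR(x)                                 # same result: Q3 minus Q1
```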
Sometimes you'll want to understand a statistic across a combination of two or more categories. For example, you might want the means of the numerical variables grouped by applicant gender and approval status. This can be done using the code below. The first line of code uses the aggregate()
function to create a table of the means of all the numerical variables across the two categorical variables, Sex
and
approval_status
. The second line of code prints the output.
agg = aggregate(dat[,c(3,4,7,9)], by = list(dat$Sex, dat$approval_status), FUN = mean)
agg
Output:
  Group.1 Group.2  Income Loan_amount   Age Investment
1       F      No 54482.4     22802.7 44.16   13258.38
2       M      No 73454.3     35333.4 50.32   15882.51
3       F     Yes 64627.4     25611.4 51.55   15713.54
4       M     Yes 72308.6     33579.3 49.17   16609.02
The interesting inference from the above table is that the female applicants whose loan application was approved had significantly higher incomes, ages, and investment values compared to the female applicants whose applications were not approved. This inference can be useful in building machine learning models.
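The same grouped summary can be written in the tidyverse style with group_by() and summarise(), which many readers find easier to follow than aggregate()'s column indices. The sketch below runs on a small toy tibble; on dat you would group by Sex and approval_status and average each numeric column:

```r
library(dplyr)

# Toy stand-in for dat (illustrative values only)
toy <- tibble(
  Sex             = c("F", "F", "M", "M"),
  approval_status = c("No", "Yes", "No", "Yes"),
  Income          = c(40000, 60000, 70000, 75000)
)

grouped_means <- toy %>%
  group_by(Sex, approval_status) %>%
  summarise(mean_income = mean(Income), .groups = "drop")

grouped_means  # one row per Sex x approval_status combination
```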
In simple terms, probability is the extent to which an event is likely to occur, measured as the ratio of favorable cases to the total number of possible cases. For example, the probability of randomly picking a red ball from a box containing three red and seven blue balls is 0.3. This is arrived at by dividing the number of favorable cases (three) by the total number of possible cases (ten).
You can apply this simple logic to calculate the probability of loan approval in the data. The table()
function in the first line of code below gives the frequency distribution of approved (denoted by the label "Yes") and rejected (denoted by the label "No") applications. The second line of code uses the logic explained above to calculate the probability of a loan application getting approved.
table(dat$approval_status)
410/(410+190)
Output:
[1] 0.6833333
You can also perform the above step by using the code below.
prop.table(table(dat$approval_status))
Output:
       No       Yes
0.3166667 0.6833333
An important probability application in data science is to compute conditional probability. A conditional probability is the probability of an event A occurring when a secondary event B has already occurred. Mathematically, it is represented as P(A | B), and is read as "the probability of A given B."
In this dataset, you may want to estimate the probability that a randomly selected application was approved given that the applicant was at least 40 years old. This is an example of conditional probability and can be calculated using the code below.
dat %>%
  summarize(prob = sum(Age >= 40 & approval_status == "Yes", na.rm = TRUE)/sum(Age >= 40, na.rm = TRUE))
Output:
   prob
  <dbl>
1 0.684
You can see that the probability comes out to be 0.68. This means that, among applicants who were at least 40 years old, roughly 68 percent of the applications were approved; equivalently, if you randomly select a record with Age of at least 40, the probability of approval is 0.68.
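An equivalent way to get this conditional probability is through a two-way table: prop.table() with margin = 1 turns each row into proportions, so every row conditions on one value of the grouping variable. On dat the call would be prop.table(table(dat$Age >= 40, dat$approval_status), margin = 1); the toy vectors below just illustrate the mechanics:

```r
# Toy condition and outcome vectors (illustrative values only)
age_40   <- c(TRUE, TRUE, FALSE, TRUE)
approved <- c("Yes", "No", "Yes", "Yes")

# margin = 1 normalizes each row, so the TRUE row gives
# P(approved | age_40 is TRUE) in its "Yes" column
cond_tab <- prop.table(table(age_40, approved), margin = 1)
cond_tab
```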
You can repeat this for two categorical variables as well. For example, you may want to estimate the probability that a randomly selected application was approved given that the applicant's credit score was not satisfactory. The lines of code below will compute this probability.
dat %>%
  summarize(prob = sum(Credit_score == "Not _satisfactory" & approval_status == "Yes", na.rm = TRUE)/sum(Credit_score == "Not _satisfactory", na.rm = TRUE))
Output:
      prob
     <dbl>
1 0.296875
The output above shows that the conditional probability of a loan application being approved, given that the credit score is not satisfactory, is about 29.7 percent. This insight can be useful to inform a risk management policy.
In this guide, you learned about the fundamentals of summarizing data for univariate and multivariate analysis. You also learned how to compute probabilities and conditional probabilities that'll help in understanding the data and generating meaningful insights.
To learn more about data science with R, please refer to the following guides: