Summarizing data is one of the most common data science and analytics tasks. For predictive modeling, you also need to understand probability, which underpins many machine learning algorithms such as logistic regression. In this guide, you will learn techniques for summarizing data and computing probabilities in R, using a fictitious dataset of loan applications containing 600 observations and nine variables, described below:
Marital_status
: Whether the applicant is married ("Yes") or not ("No").
Is_graduate
: Whether the applicant is a graduate ("Yes") or not ("No").
Income
: Annual income of the applicant (in USD).
Loan_amount
: Loan amount (in USD) for which the application was submitted.
Credit_score
: Whether the applicant's credit score is good ("Satisfactory") or not ("Not_satisfactory").
Age
: The applicant's age in years.
Sex
: Whether the applicant is female (F) or male (M).
approval_status
: Whether the loan application was approved ("Yes") or not ("No").
Investment
: Investments in stocks and mutual funds (in USD) declared by the applicant.
The lines of code below load the required libraries and the data.
library(tidyverse)
library(readr)
library(dplyr)
library(e1071)
library(ggplot2)
library(reshape2)
library(knitr)

dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Is_graduate     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", ...
$ Income          <int> 30680, 70210, 55880, 53450, 46800, 41270, 25710, 15223...
$ Loan_amount     <int> 4350, 10400, 6650, 6450, 13500, 6300, 5550, 250000, 76...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Age             <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", "M", "M",...
$ Investment      <dbl> 19942.0, 45636.5, 36322.0, 34742.5, 30420.0, 26825.5, ...
The above output shows that five variables are categorical (labeled chr
) while the remaining four are numerical (labeled int
 or dbl
). You need to convert the character variables to factor variables with the code below.
dat$Marital_status = as.factor(dat$Marital_status)
dat$Is_graduate = as.factor(dat$Is_graduate)
dat$Credit_score = as.factor(dat$Credit_score)
dat$approval_status = as.factor(dat$approval_status)
dat$Sex = as.factor(dat$Sex)
glimpse(dat)
Output:
Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Ye...
$ Is_graduate     <fct> Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, N...
$ Income          <int> 30680, 70210, 55880, 53450, 46800, 41270, 25710, 15223...
$ Loan_amount     <int> 4350, 10400, 6650, 6450, 13500, 6300, 5550, 250000, 76...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Age             <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74...
$ Sex             <fct> M, M, M, M, M, M, F, M, F, M, M, M, F, F, F, M, M, M, ...
$ Investment      <dbl> 19942.0, 45636.5, 36322.0, 34742.5, 30420.0, 26825.5, ...
The changes have been made and you are ready to summarize and analyze the data.
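As an aside, when a data frame has many character columns, the repeated as.factor() calls above can be collapsed into a single step with dplyr's across() helper (available in dplyr 1.0 and later). The sketch below uses a small toy data frame rather than dat for illustration:

```r
library(dplyr)

# Toy data frame standing in for dat (illustrative values only)
toy <- data.frame(
  Sex    = c("M", "F", "M"),
  Income = c(30680, 70210, 55880),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor in one step
toy <- toy %>%
  mutate(across(where(is.character), as.factor))

str(toy)  # Sex is now a factor; Income stays numeric
```

On dat, the same call would convert Marital_status, Is_graduate, Credit_score, approval_status, and Sex at once.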
As a data scientist, you'll often be required to summarize individual variables in data. One of the most powerful ways to do this is through descriptive statistics, which includes measures of central tendency and measures of dispersion. Measures of central tendency include mean, median, and mode, while the measures of variability include standard deviation, variance, and the interquartile range (IQR). Some of these measures are briefly explained below.
Mean: the arithmetic average of the data
Median: the middle value of a variable, which divides the data into two equal halves
Mode: the most frequent value of a variable and the only central tendency measure that can be used with both numeric and categorical variables
The lines of code below calculate the mean, median, and standard deviation of the Income
and Loan_amount
variables.
# Income
print(mean(dat$Income))
print(median(dat$Income))
print(sd(dat$Income))

# Loan_amount
print(mean(dat$Loan_amount))
print(median(dat$Loan_amount))
print(sd(dat$Loan_amount))
Output:
[1] 70554.13
[1] 50835
[1] 71142.18

[1] 32379.37
[1] 7600
[1] 72429.35
The above code calculates the mean, median, and standard deviation. To find the mode, create the frequency table of the categorical variable, as shown in the code below.
table(dat$Credit_score)
Output:
Not _satisfactory      Satisfactory
              128               472
The output shows that the mode of the variable Credit_score
is Satisfactory
, the label that occurs most frequently (472 times).
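Base R has no built-in function for the statistical mode (the base mode() function reports an object's storage type instead). A small helper along the lines below — a sketch, not a standard library function — returns the most frequent value directly, for categorical and numeric vectors alike:

```r
# Return the most frequent value of a vector (first one wins on ties).
# Note: because table() names are character strings, the result is
# returned as a character string even for numeric input.
get_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

get_mode(c("Yes", "No", "Yes", "Yes"))  # "Yes"
get_mode(c(1, 2, 2, 3))                 # "2"
```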
In the previous section, you used descriptive statistics to summarize univariate variables. However, often you will want to summarize multiple variables together. For example, you might want to compute the mean of all the numerical variables in one line of code. This can be done with the sapply()
function as shown below.
sapply(dat[,c(3,4,7,9)], mean)
Output:
     Income Loan_amount         Age  Investment
   70554.13    32379.37       49.45    16106.70
The other method is to use the summary()
function, which will print the summary statistic of all the variables. The line of code below performs this operation.
summary(dat)
Output:
Marital_status Is_graduate     Income        Loan_amount
 No :209       No :130     Min.   :  3000   Min.   :  1090
 Yes:391       Yes:470     1st Qu.: 38498   1st Qu.:  6100
                           Median : 50835   Median :  7600
                           Mean   : 70554   Mean   : 32379
                           3rd Qu.: 76610   3rd Qu.: 13025
                           Max.   :844490   Max.   :778000
            Credit_score approval_status      Age        Sex       Investment
 Not _satisfactory:128   No :190         Min.   :22.00   F:111   Min.   :   600
 Satisfactory     :472   Yes:410         1st Qu.:36.00   M:489   1st Qu.:  7940
                                         Median :51.00           Median : 10674
                                         Mean   :49.45           Mean   : 16107
                                         3rd Qu.:61.00           3rd Qu.: 16872
                                         Max.   :76.00           Max.   :346658
The above output prints the important summary statistics of all the variables, including the mean, median, minimum, and maximum values, as well as the first and third quartiles, from which you can calculate the IQR.
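For example, from the summary() output above, the IQR of Income is 76610 − 38498 = 38112. On a raw numeric vector, the same figure comes from quantile() or the built-in IQR() helper; the snippet below uses a small hypothetical vector whose quartiles happen to match those values:

```r
# Hypothetical income values chosen so the quartiles match the summary above
x <- c(3000, 38498, 50835, 76610, 844490)

quantile(x, 0.75) - quantile(x, 0.25)  # 38112
IQR(x)                                 # same result: Q3 minus Q1
```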
Sometimes you'll want to understand a statistic across a combination of two or more categories. For example, you might want the means of the numerical variables grouped by applicant gender and approval status. This can be done using the code below. The first line of code uses the aggregate()
function to create a table of the means of all the numerical variables across the two categorical variables, Sex
and
approval_status
. The second line of code prints the output.
agg = aggregate(dat[,c(3,4,7,9)], by = list(dat$Sex, dat$approval_status), FUN = mean)
agg
Output:
  Group.1 Group.2  Income Loan_amount   Age Investment
1       F      No 54482.4     22802.7 44.16   13258.38
2       M      No 73454.3     35333.4 50.32   15882.51
3       F     Yes 64627.4     25611.4 51.55   15713.54
4       M     Yes 72308.6     33579.3 49.17   16609.02
The interesting inference from the above table is that the female applicants whose loan application was approved had significantly higher incomes, ages, and investment values compared to the female applicants whose applications were not approved. This inference can be useful in building machine learning models.
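The same grouped summary can be written in the tidyverse style with group_by() and summarise(), which many readers find easier to follow than aggregate()'s column indices. The sketch below runs on a small toy tibble; on dat you would group by Sex and approval_status and average each numeric column:

```r
library(dplyr)

# Toy stand-in for dat (illustrative values only)
toy <- tibble(
  Sex             = c("F", "F", "M", "M"),
  approval_status = c("No", "Yes", "No", "Yes"),
  Income          = c(40000, 60000, 70000, 75000)
)

grouped_means <- toy %>%
  group_by(Sex, approval_status) %>%
  summarise(mean_income = mean(Income), .groups = "drop")

grouped_means  # one row per Sex x approval_status combination
```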
In simple terms, probability is the extent to which an event is likely to occur, measured as the ratio of favorable cases to the total number of possible cases. For example, the probability of randomly picking a red ball from a box containing three red and seven blue balls is 0.3. This is arrived at by dividing the number of favorable cases (three) by the total number of possible cases (ten).
You can apply this simple logic to calculate the probability of loan approval in the data. The table()
function in the first line of code below gives the frequency distribution of approved (denoted by the label "Yes") and rejected (denoted by the label "No") applications. The second line of code uses the logic explained above to calculate the probability of a loan application getting approved.
table(dat$approval_status)
410/(410+190)
Output:
[1] 0.6833333
You can also perform the above step by using the code below.
prop.table(table(dat$approval_status))
Output:
       No       Yes
0.3166667 0.6833333
An important probability application in data science is to compute conditional probability. A conditional probability is the probability of an event A occurring when a secondary event B has already occurred. Mathematically, it is represented as P(A | B), and is read as "the probability of A given B."
In this dataset, you may want to estimate the probability that a randomly selected application was approved given that the applicant was at least 40 years old. This is an example of conditional probability and can be calculated using the code below.
dat %>%
  summarize(prob = sum(Age >= 40 & approval_status == "Yes", na.rm = TRUE)/sum(Age >= 40, na.rm = TRUE))
Output:
   prob
  <dbl>
1 0.684
You can see that the probability comes out to be 0.68. This means that, among applicants who were at least 40 years old, roughly 68 percent of the applications were approved; equivalently, if you randomly select a record with Age of at least 40, the probability of approval is 0.68.
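An equivalent way to get this conditional probability is through a two-way table: prop.table() with margin = 1 turns each row into proportions, so every row conditions on one value of the grouping variable. On dat the call would be prop.table(table(dat$Age >= 40, dat$approval_status), margin = 1); the toy vectors below just illustrate the mechanics:

```r
# Toy condition and outcome vectors (illustrative values only)
age_40   <- c(TRUE, TRUE, FALSE, TRUE)
approved <- c("Yes", "No", "Yes", "Yes")

# margin = 1 normalizes each row, so the TRUE row gives
# P(approved | age_40 is TRUE) in its "Yes" column
cond_tab <- prop.table(table(age_40, approved), margin = 1)
cond_tab
```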
You can repeat this for two categorical variables as well. For example, you may want to estimate the probability that a randomly selected application was approved given that the applicant's credit score was not satisfactory. The lines of code below will compute this probability.
dat %>%
  summarize(prob = sum(Credit_score == "Not _satisfactory" & approval_status == "Yes", na.rm = TRUE)/sum(Credit_score == "Not _satisfactory", na.rm = TRUE))
Output:
      prob
     <dbl>
1 0.296875
The output above shows that the conditional probability of a loan application being approved, given that the credit score is not satisfactory, is about 29.7 percent. This insight can be useful to inform a risk management policy.
In this guide, you learned about the fundamentals of summarizing data for univariate and multivariate analysis. You also learned how to compute probabilities and conditional probabilities that'll help in understanding the data and generating meaningful insights.
To learn more about data science with R, please refer to the following guides: