 Deepika Singh

# Summarizing Data and Deducing Probabilities

• Apr 13, 2020
• 581 Views
Data
Probability

## Introduction

Summarizing data is undoubtedly one of the most common data science and analytics tasks. For predictive modeling, you also need to understand the concept of probability, which forms the basis of many machine learning algorithms like logistic regression. In this guide, you will learn the techniques of summarizing data and deducing probabilities in R.

## Data

In this guide, you'll use a fictitious dataset of loan applications containing 600 observations and nine variables, as described below:

1. `Marital_status`: Whether the applicant is married ("Yes") or not ("No").

2. `Is_graduate`: Whether the applicant is a graduate ("Yes") or not ("No").

3. `Income`: Annual income of the applicant (in USD).

4. `Loan_amount`: Loan amount (in USD) for which the application was submitted.

5. `Credit_score`: Whether the applicant's credit score is good ("Satisfactory") or not ("Not_satisfactory").

6. `Age`: The applicant's age in years.

7. `Sex`: Whether the applicant is female (F) or male (M).

8. `approval_status`: Whether the loan application was approved ("Yes") or not ("No").

9. `Investment`: Investments in stocks and mutual funds (in USD) declared by the applicant.

The lines of code below load the required libraries and the data.

```r
library(tidyverse)
library(dplyr)
library(e1071)
library("ggplot2")
library("reshape2")
library("knitr")

# dat is the data frame holding the 600 loan applications
glimpse(dat)
```

Output:

```
Observations: 600
Variables: 9
$ Marital_status  <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Is_graduate     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", ...
$ Income          <int> 30680, 70210, 55880, 53450, 46800, 41270, 25710, 15223...
$ Loan_amount     <int> 4350, 10400, 6650, 6450, 13500, 6300, 5550, 250000, 76...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Age             <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "F", "M", "F", "M", "M",...
$ Investment      <dbl> 19942.0, 45636.5, 36322.0, 34742.5, 30420.0, 26825.5, ...
```

The above output shows that five variables are categorical (labeled as `chr`), while the remaining four are numerical (labeled as `int` or `dbl`). You need to convert the character variables to factor variables with the code below.

```r
# Convert each of the five character columns to a factor
dat$Marital_status = as.factor(dat$Marital_status)
dat$Is_graduate = as.factor(dat$Is_graduate)
dat$Credit_score = as.factor(dat$Credit_score)
dat$approval_status = as.factor(dat$approval_status)
dat$Sex = as.factor(dat$Sex)
glimpse(dat)
```

Output:

```
Observations: 600
Variables: 9
$ Marital_status  <fct> Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Ye...
$ Is_graduate     <fct> Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, N...
$ Income          <int> 30680, 70210, 55880, 53450, 46800, 41270, 25710, 15223...
$ Loan_amount     <int> 4350, 10400, 6650, 6450, 13500, 6300, 5550, 250000, 76...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Age             <int> 76, 75, 75, 75, 75, 75, 75, 75, 75, 74, 74, 74, 74, 74...
$ Sex             <fct> M, M, M, M, M, M, F, M, F, M, M, M, F, F, F, M, M, M, ...
$ Investment      <dbl> 19942.0, 45636.5, 36322.0, 34742.5, 30420.0, 26825.5, ...
```

The changes have been made and you are ready to summarize and analyze the data.
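If a dataset has many character columns, converting them one at a time gets tedious. The sketch below shows one possible base R shortcut on a small made-up data frame (`toy` and its values are illustrative assumptions, not the loan data): it detects the character columns by type and converts them all at once.

```r
# Toy data frame with hypothetical values (not the loan dataset)
toy <- data.frame(
  Marital_status = c("Yes", "No", "Yes"),
  Sex            = c("M", "F", "M"),
  Income         = c(30680L, 70210L, 55880L),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert only those to factors
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

With dplyr 1.0 or later, `mutate(across(where(is.character), as.factor))` should achieve the same result inside a pipeline.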

## Summarizing Univariate Data

As a data scientist, you'll often need to summarize individual variables in a dataset. One of the most powerful ways to do this is through descriptive statistics, which include measures of central tendency and measures of dispersion. Measures of central tendency include the mean, median, and mode, while measures of dispersion include the standard deviation, variance, and interquartile range (IQR). Some of these measures are briefly explained below.

1. Mean: the arithmetic average of the data

2. Median: the middle value of a sorted variable, which divides the data into two equal halves

3. Mode: the most frequent value of a variable and the only central tendency measure that can be used with both numeric and categorical variables

4. Standard deviation: quantifies the amount of variation from the mean of a set of data values
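To make these definitions concrete, the minimal sketch below computes the mean and the sample standard deviation by hand on a small made-up vector (`x` is an illustrative assumption, not part of the loan data) and compares the results with R's built-in functions.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # toy values, not from the loan data
n <- length(x)

# Mean: sum of the values divided by how many there are
x_mean <- sum(x) / n

# Sample standard deviation: square root of the average squared
# deviation from the mean, using n - 1 in the denominator (as sd() does)
x_sd <- sqrt(sum((x - x_mean)^2) / (n - 1))

# The built-in functions should agree with the manual calculations
all.equal(x_mean, mean(x))
all.equal(x_sd, sd(x))

# Median: average of the two middle values here, since n is even
median(x)
```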

The lines of code below calculate the mean, median, and standard deviation of the `Income` and `Loan_amount` variables.

```r
# Income
print(mean(dat$Income))
print(median(dat$Income))
print(sd(dat$Income))

# Loan_amount
print(mean(dat$Loan_amount))
print(median(dat$Loan_amount))
print(sd(dat$Loan_amount))
```

Output:

```
[1] 70554.13
[1] 50835
[1] 71142.18

[1] 32379.37
[1] 7600
[1] 72429.35
```

The above code calculates the mean, median, and standard deviation. To find the mode, create the frequency table of the categorical variable, as shown in the code below.

```r
table(dat$Credit_score)
```

Output:

```
Not _satisfactory      Satisfactory
              128               472
```

The output shows that the mode of the variable `Credit_score` is `Satisfactory`, the most frequent label, which occurs 472 times.
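R has no built-in function for the statistical mode (the base `mode()` function reports an object's storage type instead). The sketch below shows one way to extract it programmatically; `stat_mode` is a hypothetical helper name, and the vector is rebuilt from the counts in the frequency table above rather than read from the data.

```r
# Hypothetical helper: returns the most frequent value of x as a string
stat_mode <- function(x) {
  freq <- table(x)                # frequency of each distinct value
  names(freq)[which.max(freq)]    # label with the highest count (first one on ties)
}

# Toy vector rebuilt from the counts shown in the frequency table
credit <- c(rep("Not _satisfactory", 128), rep("Satisfactory", 472))

stat_mode(credit)       # the mode: the most frequent label
max(table(credit))      # the frequency of that label
```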

## Summarizing Multiple Variables

In the previous section, you used descriptive statistics to summarize individual variables. Often, however, you will want to summarize multiple variables together. For example, you might want to compute the mean of all the numerical variables in one line of code. This can be done with the `sapply()` function as shown below.

```r
sapply(dat[,c(3,4,7,9)], mean)
```

Output:

```
     Income Loan_amount         Age  Investment
   70554.13    32379.37       49.45    16106.70
```
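Hard-coded column positions such as `c(3,4,7,9)` break silently if the column order changes. As a hedged alternative, the sketch below selects the numeric columns by type instead; `toy` is a made-up data frame used only for illustration.

```r
# Toy data frame with hypothetical values (not the loan dataset)
toy <- data.frame(
  Sex    = c("M", "F", "M", "F"),
  Income = c(30000, 50000, 40000, 60000),
  Age    = c(30, 40, 50, 60)
)

# Logical vector marking which columns are numeric
num_cols <- sapply(toy, is.numeric)

# Mean of every numeric column, regardless of its position
means <- sapply(toy[num_cols], mean)
means
```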

Another method is to use the `summary()` function, which prints the summary statistics of all the variables. The line of code below performs this operation.

```r
summary(dat)
```

Output:

```
 Marital_status Is_graduate     Income        Loan_amount
 No :209        No :130     Min.   :  3000   Min.   :  1090
 Yes:391        Yes:470     1st Qu.: 38498   1st Qu.:  6100
                            Median : 50835   Median :  7600
                            Mean   : 70554   Mean   : 32379
                            3rd Qu.: 76610   3rd Qu.: 13025
                            Max.   :844490   Max.   :778000
            Credit_score approval_status      Age        Sex       Investment
 Not _satisfactory:128   No :190         Min.   :22.00   F:111   Min.   :   600
 Satisfactory     :472   Yes:410         1st Qu.:36.00   M:489   1st Qu.:  7940
                                         Median :51.00           Median : 10674
                                         Mean   :49.45           Mean   : 16107
                                         3rd Qu.:61.00           3rd Qu.: 16872
                                         Max.   :76.00           Max.   :346658
```

The above output prints the key summary statistics of all the variables, including the mean, median (the 50th percentile), minimum, and maximum values. You can calculate the IQR by subtracting the first quartile from the third quartile.
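As a quick illustration, the sketch below derives the IQR of `Income` from the quartiles printed in the `summary()` output above (these printed values may be rounded), and then shows that `IQR()` and `quantile()` agree on a small made-up vector (`x` is an illustrative assumption).

```r
# Quartiles of Income copied from the summary() output
q1 <- 38498
q3 <- 76610
iqr_income <- q3 - q1    # 38112

# On raw data, quantile() gives the quartiles and IQR() their difference
x <- c(2, 4, 4, 5, 7, 9, 10, 12)   # toy values, not from the loan data
q <- quantile(x, probs = c(0.25, 0.75))
iqr_x <- unname(q[2] - q[1])

all.equal(iqr_x, IQR(x))
```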

Sometimes you'll want to understand a statistic across a combination of two or more categories. For example, you might want the mean of the numerical variables broken down by the sex of the applicant and the approval status. This can be done using the code below. The first line of code uses the `aggregate()` function to create a table of the means of all the numerical variables across the two categorical variables, `Sex` and `approval_status`. The second line of code prints the output.

```r
agg = aggregate(dat[,c(3,4,7,9)], by = list(dat$Sex, dat$approval_status), FUN = mean)
agg
```

Output:

```
  Group.1 Group.2 Income Loan_amount   Age Investment
1       F      No 544824      228027 44.16   132583.8
2       M      No 734543      353334 50.32   158825.1
3       F     Yes 646274      256114 51.55   157135.4
4       M     Yes 723086      335793 49.17   166090.2
```

An interesting observation from the above table is that the female applicants whose loan applications were approved had, on average, higher incomes, ages, and investment values than the female applicants whose applications were not approved. This observation can be useful in building machine learning models.
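The generic `Group.1` and `Group.2` headers can be hard to read. One possible alternative is the formula interface of `aggregate()`, which labels the grouping columns with the variable names. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration only).

```r
# Toy data frame with hypothetical values (not the loan dataset)
toy <- data.frame(
  Sex             = c("F", "F", "M", "M", "F", "M"),
  approval_status = c("No", "Yes", "No", "Yes", "Yes", "No"),
  Income          = c(40000, 60000, 50000, 70000, 80000, 30000),
  Age             = c(30, 50, 40, 45, 60, 35)
)

# Left of ~ : variables to summarize; right of ~ : grouping variables
agg_named <- aggregate(cbind(Income, Age) ~ Sex + approval_status,
                       data = toy, FUN = mean)
agg_named
```

The result has one row per combination of `Sex` and `approval_status`, with the grouping columns named after the variables themselves.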

## Probability

In simple terms, probability can be defined as the extent to which an event is likely to occur, and it is measured by the ratio of the favorable cases to the total number of possible cases. For example, the probability of randomly picking a red ball from a box containing three red and seven blue balls is 0.3. This is arrived at by dividing the number of favorable cases, which is three in this example, by the total number of possible cases, which is ten.
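The ball example can be reproduced in a few lines of R. This is only a sketch; the vector `balls` is a made-up stand-in for the box.

```r
# A box with three red and seven blue balls
balls <- c(rep("red", 3), rep("blue", 7))

# Favorable cases divided by the total number of possible cases
p_red <- sum(balls == "red") / length(balls)
p_red

# Equivalent shortcut: the mean of a logical vector is the share of TRUE values
mean(balls == "red")
```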

You can apply this simple logic to calculate the probability of loan approval in the data. The `table()` function in the first line of code below gives the frequency distribution of approved (denoted by the label "Yes") and rejected (denoted by the label "No") applications. The second line of code uses the logic explained above to calculate the probability of a loan application getting approved.

```r
table(dat$approval_status)
410/(410+190)
```

Output:

```
[1] 0.6833333
```

You can also perform the above step by using the code below.

```r
prop.table(table(dat$approval_status))
```

Output:

```
       No       Yes
0.3166667 0.6833333
```

## Conditional Probability

An important probability application in data science is to compute conditional probability. A conditional probability is the probability of an event A occurring when a secondary event B has already occurred. Mathematically, it is represented as P(A | B), and is read as "the probability of A given B."
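Written as a formula, and assuming that P(B) is greater than zero, this definition is commonly stated as follows. The count form on the right applies when every record is equally likely to be selected; the notation n(·), meaning "number of records satisfying the condition," is introduced here only for illustration.

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{n(A \cap B)}{n(B)}
$$

In other words, you count the records where both A and B hold and divide by the number of records where B holds.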

In this dataset, you may want to estimate the probability that a randomly selected application was approved given that the applicant was at least 40 years old. This is an example of conditional probability and can be calculated using the code below.

```r
dat %>%
  summarize(prob = sum(Age >= 40 & approval_status == "Yes", na.rm = TRUE) /
              sum(Age >= 40, na.rm = TRUE))
```

Output:

```
   prob
  <dbl>
1 0.684
```

You can see that the probability comes out to be 0.68. This means that if you randomly select an applicant who was at least 40 years old, the probability that the application was approved is about 68 percent.

You can repeat this for two categorical variables as well. For example, you may want to estimate the probability that a randomly selected application was approved given that the applicant's credit score was not satisfactory. The lines of code below will compute this probability.

```r
dat %>%
  summarize(prob = sum(Credit_score == "Not _satisfactory" & approval_status == "Yes", na.rm = TRUE) /
              sum(Credit_score == "Not _satisfactory", na.rm = TRUE))
```

Output:

```
    prob
   <dbl>
1 0.296875
```

The output above shows that the conditional probability of a loan application being approved, given that the credit score is not satisfactory, is 29.7 percent. This insight can be useful to inform a risk management policy.
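If you want the conditional probabilities for every level of the conditioning variable at once, one option is `prop.table()` with `margin = 1`, which turns each row of a two-way frequency table into proportions. The sketch below uses a small made-up data frame (`toy` and its values are assumptions, not the loan data).

```r
# Toy data frame with hypothetical values (not the loan dataset)
toy <- data.frame(
  Credit_score    = c("Satisfactory", "Satisfactory", "Satisfactory",
                      "Not _satisfactory", "Not _satisfactory",
                      "Not _satisfactory", "Not _satisfactory", "Satisfactory"),
  approval_status = c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes")
)

# Two-way frequency table: rows are credit scores, columns are approval outcomes
counts <- table(toy$Credit_score, toy$approval_status)

# margin = 1 divides each cell by its row total, so each row
# holds P(approval_status | Credit_score) and sums to 1
cond <- prop.table(counts, margin = 1)
cond

# P(approved | credit score not satisfactory) in the toy data
cond["Not _satisfactory", "Yes"]
```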

## Conclusion

In this guide, you learned about the fundamentals of summarizing data for univariate and multivariate analysis. You also learned how to compute probabilities and conditional probabilities that'll help in understanding the data and generating meaningful insights.