Author avatar

Deepika Singh

Exploring Data with Quantitative Techniques Using R

Deepika Singh

  • Nov 7, 2019
  • 13 Min read
  • 7 Views
  • Nov 7, 2019
  • 13 Min read
  • 7 Views
Data
R

Introduction

Exploratory data analysis is one of the most time consuming but important parts of the data science cycle. In this guide, you will learn techniques available in R for performing exploratory data analysis using quantitative methods.

Let’s start with some data.

Data

In this guide, we will use a fictitious dataset of loan applicants containing 200 observations and 10 variables, as described below:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No")

  2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

  3. Income: Annual Income of the applicant (in USD)

  4. Loan_amount: Loan amount (in USD) for which the application was submitted

  5. Credit_score: Whether the applicants credit score is good ("Good") or not ("Bad")

  6. Approval_status: Whether the loan application was approved ("Yes") or not ("No")

  7. Age: The applicant's age in years

  8. Gender: Whether the applicant is a male ("M") or a female ("F")

  9. Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant

  10. work_exp: Work experience in years

Let's start by loading the required libraries and the data.

1
2
3
4
5
6
7
8
9
library(plyr)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(mlbench)

dat <- read_csv("data_test.csv")
glimpse(dat)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
Observations: 200
Variables: 10
$ Marital_status  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Is_graduate     <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ Credit_score    <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender          <fct> Female, Female, Female, Female, Female, Female, Female...
$ age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp        <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....

The output shows that the dataset has five categorical variables (labeled as 'chr'), four integer variables (labeled as 'int'), and one numeric variable, 'work_exp'. We will convert the categorical variables into 'factor' type using the line of code below.

1
2
3
4
names <- c(1,2,5,6,8)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)
 
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
Observations: 200
Variables: 10
$ Marital_status  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Is_graduate     <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ Credit_score    <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender          <fct> Female, Female, Female, Female, Female, Female, Female...
$ age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp        <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....

Let’s now perform the exploratory analysis using the various quantitative techniques.

Count, Frequency, and Proportions

The basic analysis would be to compute simple calculations such as count, frequency, and proportions. The summary() function can be used to perform this task.

1
summary(dat)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
Marital_status Is_graduate     Income       Loan_amount      Credit_score
 Divorced:60     No : 84     Min.   :32000   Min.   :  12200   Bad :163    
 No      :76     Yes:116     1st Qu.:56000   1st Qu.:  61375   Good: 37    
 Yes     :64                 Median :64000   Median :  77250               
                             Mean   :62750   Mean   : 349278               
                             3rd Qu.:72000   3rd Qu.: 144750               
                             Max.   :88000   Max.   :6670000               
 approval_status   Investment         gender         age           work_exp     
 No :149         Min.   :   6000   Female:178   Min.   :32.00   Min.   : 4.000  
 Yes: 51         1st Qu.:  81945   Male  : 22   1st Qu.:33.00   1st Qu.: 7.000  
                 Median : 114800                Median :34.00   Median : 8.000  
                 Mean   : 169694                Mean   :33.98   Mean   : 7.844  
                 3rd Qu.: 173492                3rd Qu.:34.00   3rd Qu.: 9.000  
                 Max.   :3466580                Max.   :49.00   Max.   :11.000  

We can infer the following from the above output:

1: There are no 'NA', or missing values, in any of the variables. This confirms that the count of observations for all the variables is 200.

  1. The summary statistics of the numeric variables are displayed. For example, the age of the applicants ranges from 32 to 49 years.

  2. For qualitative variables, the frequency of each label in the respective variable is displayed. For example, the frequency table of the target variable 'approval_status' shows that out of 200 applicants, 149 had their loan applications rejected, while the remaining were accepted.

Proportions

When dealing with a frequency table, it is often necessary to compute proportions. In our case, we might want to calculate proportions of a categorical variable, such as 'approval_status'. The lines of code below create a frequency table and the proportions of the label for the variable 'approval_status'.

1
2
3
table1 = table(dat$approval_status)

prop.table(table1) 
{r}

Output:

1
2
   No   Yes 
0.745 0.255

The output shows that around 75 percent of the applicants were rejected, while the acceptance rate was around 25 percent. This can be extended for two or more variables as well. For example, if we want to analyze the percentage of approved applications across credit scores, that is also possible.

The first line of code below creates the two-way frequency table, while the second line prints the cell percentages, which means the total sum of all cells will be equal to one. As expected, the majority of applicants with poor credit scores had their applications rejected.

The prop.table function can also be used to generate percentages for rows or columns. The third and fourth lines of code below create a proportion table, which gives row and column percentages, respectively.

1
2
3
4
table2 = table(dat$approval_status, dat$Credit_score)
prop.table(table2) 
prop.table(table2, 1) 
prop.table(table2, 2)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
       Bad  Good
  No  0.700 0.045
  Yes 0.115 0.140


            Bad       Good
  No  0.93959732 0.06040268
  Yes 0.45098039 0.54901961

    
            Bad      Good
  No  0.8588957 0.2432432
  Yes 0.1411043 0.7567568

Descriptive Statistics

In the previous section, we carried out preliminary data analysis. However, any quantitative analysis is incomplete without descriptive statistics. They are the foundation of data science and are defined as the measures that summarize given data. They include measures of central tendency (such as mean, median, and mode) and measures of dispersion (such as standard deviation, variance, and range).

The summary() function provides some of these statistics but not all of them. There are many other functions in R, that can be used to identify these measures. The fivenum() function is one; it returns the min, lower-hinge, median, upper-hinge, and max values. However, this function has a limitation in that it is not meaningful for factors.

1
fivenum(dat$Income)
{r}

Output:

1
[1] 32000 56000 64000 72000 88000

A better alternative to both the summary and the fivenum functions is the psych package, which provides several quantitative measures, including the standard deviation, skewness, and range.

The first line of code below loads the library, while the second line uses the 'describe' function in the library to print quantitative measures of all the variables.

1
2
library(psych)
describe(dat)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
                 vars   n      mean        sd median   trimmed      mad   min
Marital_status*     1 200      2.02      0.79      2      2.02     1.48     1
Is_graduate*        2 200      1.58      0.49      2      1.60     0.00     1
Income              3 200  62750.00  10638.77  64000  63062.50 11860.80 32000
Loan_amount         4 200 349278.00 741366.91  77250 170134.38 35211.75 12200
Credit_score*       5 200      1.19      0.39      1      1.11     0.00     1
approval_status*    6 200      1.25      0.44      1      1.19     0.00     1
Investment          7 200 169693.55 267134.29 114800 130429.69 65590.22  6000
gender*             8 200      1.11      0.31      1      1.01     0.00     1
age                 9 200     33.98      1.76     34     33.71     1.48    32
work_exp           10 200      7.84      1.33      8      7.88     1.48     4
                     max   range  skew kurtosis       se
Marital_status*        3       2 -0.03    -1.40     0.06
Is_graduate*           2       1 -0.32    -1.91     0.03
Income             88000   56000 -0.25     0.08   752.27
Loan_amount      6670000 6657800  4.53    28.55 52422.56
Credit_score*          2       1  1.61     0.60     0.03
approval_status*       2       1  1.12    -0.76     0.03
Investment       3466580 3460580  9.71   113.87 18889.25
gender*                2       1  2.47     4.14     0.02
age                   49      17  4.63    31.04     0.12
work_exp              11       7 -0.25     0.08     0.09

The high skewness value for the variables 'Loan_amount' and 'Investment’ indicate that these variables are not normally distributed.

Correlation

We can use the syntax cor(X, Y) or rcorr(X, Y) to generate correlations between the two numerical columns.

1
cor(dat$Income, dat$Investment)
{r}

Output:

1
[1] 0.06168653

The value of 0.06 indicates weak correlation between the two variables.

Summarizing Data

In quantitative analysis, we are often required to summarize the data based on a few variables. This can easily be done with the summarize() function in the dplyr package. The code below summarizes and prints the average values of the three numerical variables.

1
2
3
4
dat %>%
  summarize(avg_income = mean(Income,na.rm=TRUE), 
            avg_loan = mean(Loan_amount,na.rm=TRUE), 
            avg_age = mean(age,na.rm=TRUE))
{r}

Output:

1
2
3
4
A tibble: 1 x 3
  avg_income avg_loan avg_age
       <dbl>    <dbl>   <dbl>
1      62750   349278    34.0

The above output shows that the average age of the applicants is 34 years, and the average income is $62,750. The average loan amount applied for is $34,9278.

Grouping Variables

Another useful way of analyzing data is by using the group_by() function, which groups the input data based on a single column or multiple columns, then manipulates each such group of data. The line of code below groups the data frame by the variable ‘approval_status’, then calculates the average income for each label.

1
2
3
dat %>%
    group_by(approval_status) %>%
    summarise(mean_inc = mean(Income))
{r}

Output:

1
2
3
4
5
A tibble: 2 x 2
approval_status mean_inc
 <fct>              <dbl>
No                62174.
Yes               64431.

The output above shows that the mean income of applicants whose loans were approved is slightly higher than that of the rejected applicants.

Inferential Statistics

We have covered several techniques of quantitative analysis. There is also a field in statistics called Inferential Statistics that can be used for quantitative exploratory analysis. This is an extremely broad topic, and it is covered at length in another guide:

Conclusion

In this guide, you have learned techniques for performing exploratory data analysis using quantitative methods. You have learned how to use in-built 'R' functions and the analysis techniques of the powerful 'dplyr' package.

0