Exploratory data analysis is one of the most time consuming but important parts of the data science cycle. In this guide, you will learn techniques available in R for performing exploratory data analysis using quantitative methods.
Let’s start with some data.
In this guide, we will use a fictitious dataset of loan applicants containing 200 observations and 10 variables, as described below:
Marital_status: Whether the applicant is married ("Yes") or not ("No")
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
Income: Annual Income of the applicant (in USD)
Loan_amount: Loan amount (in USD) for which the application was submitted
Credit_score: Whether the applicants credit score is good ("Good") or not ("Bad")
Approval_status: Whether the loan application was approved ("Yes") or not ("No")
Age: The applicant's age in years
Gender: Whether the applicant is a male ("M") or a female ("F")
Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant
Let's start by loading the required libraries and the data.
1library(plyr)
2library(readr)
3library(ggplot2)
4library(GGally)
5library(dplyr)
6library(mlbench)
7
8dat <- read_csv("data_test.csv")
9glimpse(dat)
Output:
1Observations: 200
2Variables: 10
3$ Marital_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
4$ Is_graduate <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
5$ Income <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
6$ Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
7$ Credit_score <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,...
8$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
9$ Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
10$ gender <fct> Female, Female, Female, Female, Female, Female, Female...
11$ age <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
12$ work_exp <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....
The output shows that the dataset has five categorical variables (labeled as 'chr'), four integer variables (labeled as 'int'), and one numeric variable, 'work_exp'. We will convert the categorical variables into 'factor' type using the line of code below.
1names <- c(1,2,5,6,8)
2dat[,names] <- lapply(dat[,names] , factor)
3glimpse(dat)
4
Output:
1Observations: 200
2Variables: 10
3$ Marital_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
4$ Is_graduate <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
5$ Income <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
6$ Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
7$ Credit_score <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,...
8$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
9$ Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
10$ gender <fct> Female, Female, Female, Female, Female, Female, Female...
11$ age <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
12$ work_exp <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....
Let’s now perform the exploratory analysis using the various quantitative techniques.
The basic analysis would be to compute simple calculations such as count, frequency, and proportions. The summary() function can be used to perform this task.
1summary(dat)
Output:
1Marital_status Is_graduate Income Loan_amount Credit_score
2 Divorced:60 No : 84 Min. :32000 Min. : 12200 Bad :163
3 No :76 Yes:116 1st Qu.:56000 1st Qu.: 61375 Good: 37
4 Yes :64 Median :64000 Median : 77250
5 Mean :62750 Mean : 349278
6 3rd Qu.:72000 3rd Qu.: 144750
7 Max. :88000 Max. :6670000
8 approval_status Investment gender age work_exp
9 No :149 Min. : 6000 Female:178 Min. :32.00 Min. : 4.000
10 Yes: 51 1st Qu.: 81945 Male : 22 1st Qu.:33.00 1st Qu.: 7.000
11 Median : 114800 Median :34.00 Median : 8.000
12 Mean : 169694 Mean :33.98 Mean : 7.844
13 3rd Qu.: 173492 3rd Qu.:34.00 3rd Qu.: 9.000
14 Max. :3466580 Max. :49.00 Max. :11.000
We can infer the following from the above output:
1: There are no 'NA', or missing values, in any of the variables. This confirms that the count of observations for all the variables is 200.
The summary statistics of the numeric variables are displayed. For example, the age of the applicants ranges from 32 to 49 years.
When dealing with a frequency table, it is often necessary to compute proportions. In our case, we might want to calculate proportions of a categorical variable, such as 'approval_status'. The lines of code below create a frequency table and the proportions of the label for the variable 'approval_status'.
1table1 = table(dat$approval_status)
2
3prop.table(table1)
Output:
1 No Yes
20.745 0.255
The output shows that around 75 percent of the applicants were rejected, while the acceptance rate was around 25 percent. This can be extended for two or more variables as well. For example, if we want to analyze the percentage of approved applications across credit scores, that is also possible.
The first line of code below creates the two-way frequency table, while the second line prints the cell percentages, which means the total sum of all cells will be equal to one. As expected, the majority of applicants with poor credit scores had their applications rejected.
The prop.table function can also be used to generate percentages for rows or columns. The third and fourth lines of code below create a proportion table, which gives row and column percentages, respectively.
1table2 = table(dat$approval_status, dat$Credit_score)
2prop.table(table2)
3prop.table(table2, 1)
4prop.table(table2, 2)
Output:
1 Bad Good
2 No 0.700 0.045
3 Yes 0.115 0.140
4
5
6 Bad Good
7 No 0.93959732 0.06040268
8 Yes 0.45098039 0.54901961
9
10
11 Bad Good
12 No 0.8588957 0.2432432
13 Yes 0.1411043 0.7567568
In the previous section, we carried out preliminary data analysis. However, any quantitative analysis is incomplete without descriptive statistics. They are the foundation of data science and are defined as the measures that summarize given data. They include measures of central tendency (such as mean, median, and mode) and measures of dispersion (such as standard deviation, variance, and range).
The summary() function provides some of these statistics but not all of them. There are many other functions in R, that can be used to identify these measures. The fivenum() function is one; it returns the min, lower-hinge, median, upper-hinge, and max values. However, this function has a limitation in that it is not meaningful for factors.
1fivenum(dat$Income)
Output:
1[1] 32000 56000 64000 72000 88000
A better alternative to both the summary and the fivenum functions is the psych package, which provides several quantitative measures, including the standard deviation, skewness, and range.
The first line of code below loads the library, while the second line uses the 'describe' function in the library to print quantitative measures of all the variables.
1library(psych)
2describe(dat)
Output:
1 vars n mean sd median trimmed mad min
2Marital_status* 1 200 2.02 0.79 2 2.02 1.48 1
3Is_graduate* 2 200 1.58 0.49 2 1.60 0.00 1
4Income 3 200 62750.00 10638.77 64000 63062.50 11860.80 32000
5Loan_amount 4 200 349278.00 741366.91 77250 170134.38 35211.75 12200
6Credit_score* 5 200 1.19 0.39 1 1.11 0.00 1
7approval_status* 6 200 1.25 0.44 1 1.19 0.00 1
8Investment 7 200 169693.55 267134.29 114800 130429.69 65590.22 6000
9gender* 8 200 1.11 0.31 1 1.01 0.00 1
10age 9 200 33.98 1.76 34 33.71 1.48 32
11work_exp 10 200 7.84 1.33 8 7.88 1.48 4
12 max range skew kurtosis se
13Marital_status* 3 2 -0.03 -1.40 0.06
14Is_graduate* 2 1 -0.32 -1.91 0.03
15Income 88000 56000 -0.25 0.08 752.27
16Loan_amount 6670000 6657800 4.53 28.55 52422.56
17Credit_score* 2 1 1.61 0.60 0.03
18approval_status* 2 1 1.12 -0.76 0.03
19Investment 3466580 3460580 9.71 113.87 18889.25
20gender* 2 1 2.47 4.14 0.02
21age 49 17 4.63 31.04 0.12
22work_exp 11 7 -0.25 0.08 0.09
The high skewness value for the variables 'Loan_amount' and 'Investment’ indicate that these variables are not normally distributed.
We can use the syntax cor(X, Y) or rcorr(X, Y) to generate correlations between the two numerical columns.
1cor(dat$Income, dat$Investment)
Output:
1[1] 0.06168653
The value of 0.06 indicates weak correlation between the two variables.
In quantitative analysis, we are often required to summarize the data based on a few variables. This can easily be done with the summarize() function in the dplyr package. The code below summarizes and prints the average values of the three numerical variables.
1dat %>%
2 summarize(avg_income = mean(Income,na.rm=TRUE),
3 avg_loan = mean(Loan_amount,na.rm=TRUE),
4 avg_age = mean(age,na.rm=TRUE))
Output:
1A tibble: 1 x 3
2 avg_income avg_loan avg_age
3 <dbl> <dbl> <dbl>
41 62750 349278 34.0
The above output shows that the average age of the applicants is 34 years, and the average income is $62,750. The average loan amount applied for is $34,9278.
Another useful way of analyzing data is by using the group_by() function, which groups the input data based on a single column or multiple columns, then manipulates each such group of data. The line of code below groups the data frame by the variable ‘approval_status’, then calculates the average income for each label.
1dat %>%
2 group_by(approval_status) %>%
3 summarise(mean_inc = mean(Income))
Output:
1A tibble: 2 x 2
2approval_status mean_inc
3 <fct> <dbl>
4No 62174.
5Yes 64431.
The output above shows that the mean income of applicants whose loans were approved is slightly higher than that of the rejected applicants.
We have covered several techniques of quantitative analysis. There is also a field in statistics called Inferential Statistics that can be used for quantitative exploratory analysis. This is an extremely broad topic, and it is covered at length in another guide:
In this guide, you have learned techniques for performing exploratory data analysis using quantitative methods. You have learned how to use in-built 'R' functions and the analysis techniques of the powerful 'dplyr' package.
To learn more about data science in R, please refer to the following guides: 1. Interpreting Data Using Descriptive Statistics with R 2. Interpreting Data Using Statistical Models with R 3. Time Series Forecasting Using R 4. Hypothesis Testing - Interpreting Data with Statistical Models 5. Machine Learning with Text Data Using R 6. Visualization of Text Data Using Word Cloud in R