Introduction

2

Exploratory data analysis is one of the most time consuming but important parts of the data science cycle. In this guide, you will learn techniques available in R for performing exploratory data analysis using quantitative methods.

Let’s start with some data.

In this guide, we will use a fictitious dataset of loan applicants containing 200 observations and 10 variables, as described below:

Marital_status: Whether the applicant is married ("Yes") or not ("No")

Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

Income: Annual Income of the applicant (in USD)

Loan_amount: Loan amount (in USD) for which the application was submitted

Credit_score: Whether the applicants credit score is good ("Good") or not ("Bad")

Approval_status: Whether the loan application was approved ("Yes") or not ("No")

Age: The applicant's age in years

Gender: Whether the applicant is a male ("M") or a female ("F")

Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant

work_exp: Work experience in years

Let's start by loading the required libraries and the data.

`1 2 3 4 5 6 7 8 9`

`library(plyr) library(readr) library(ggplot2) library(GGally) library(dplyr) library(mlbench) dat <- read_csv("data_test.csv") glimpse(dat)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10 11 12`

`Observations: 200 Variables: 10 $ Marital_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,... $ Is_graduate <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y... $ Income <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000... $ Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61... $ Credit_score <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,... $ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,... $ Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9... $ gender <fct> Female, Female, Female, Female, Female, Female, Female... $ age <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33... $ work_exp <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....`

The output shows that the dataset has five categorical variables (labeled as 'chr'), four integer variables (labeled as 'int'), and one numeric variable, 'work_exp'. We will convert the categorical variables into 'factor' type using the line of code below.

`1 2 3 4`

`names <- c(1,2,5,6,8) dat[,names] <- lapply(dat[,names] , factor) glimpse(dat)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10 11 12`

`Observations: 200 Variables: 10 $ Marital_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,... $ Is_graduate <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y... $ Income <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000... $ Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61... $ Credit_score <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,... $ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,... $ Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9... $ gender <fct> Female, Female, Female, Female, Female, Female, Female... $ age <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33... $ work_exp <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....`

Let’s now perform the exploratory analysis using the various quantitative techniques.

The basic analysis would be to compute simple calculations such as count, frequency, and proportions. The *summary()* function can be used to perform this task.

`1`

`summary(dat)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10 11 12 13 14`

`Marital_status Is_graduate Income Loan_amount Credit_score Divorced:60 No : 84 Min. :32000 Min. : 12200 Bad :163 No :76 Yes:116 1st Qu.:56000 1st Qu.: 61375 Good: 37 Yes :64 Median :64000 Median : 77250 Mean :62750 Mean : 349278 3rd Qu.:72000 3rd Qu.: 144750 Max. :88000 Max. :6670000 approval_status Investment gender age work_exp No :149 Min. : 6000 Female:178 Min. :32.00 Min. : 4.000 Yes: 51 1st Qu.: 81945 Male : 22 1st Qu.:33.00 1st Qu.: 7.000 Median : 114800 Median :34.00 Median : 8.000 Mean : 169694 Mean :33.98 Mean : 7.844 3rd Qu.: 173492 3rd Qu.:34.00 3rd Qu.: 9.000 Max. :3466580 Max. :49.00 Max. :11.000`

We can infer the following from the above output:

1: There are no 'NA', or missing values, in any of the variables. This confirms that the count of observations for all the variables is 200.

The summary statistics of the numeric variables are displayed. For example, the age of the applicants ranges from 32 to 49 years.

For qualitative variables, the frequency of each label in the respective variable is displayed. For example, the frequency table of the target variable 'approval_status' shows that out of 200 applicants, 149 had their loan applications rejected, while the remaining were accepted.

When dealing with a frequency table, it is often necessary to compute proportions. In our case, we might want to calculate proportions of a categorical variable, such as 'approval_status'. The lines of code below create a frequency table and the proportions of the label for the variable 'approval_status'.

`1 2 3`

`table1 = table(dat$approval_status) prop.table(table1)`

{r}

Output:

`1 2`

`No Yes 0.745 0.255`

The output shows that around 75 percent of the applicants were rejected, while the acceptance rate was around 25 percent. This can be extended for two or more variables as well. For example, if we want to analyze the percentage of approved applications across credit scores, that is also possible.

The *first line of code* below creates the two-way frequency table, while the *second line* prints the cell percentages, which means the total sum of all cells will be equal to one. As expected, the majority of applicants with poor credit scores had their applications rejected.

The *prop.table* function can also be used to generate percentages for rows or columns. The *third and fourth lines of code* below create a proportion table, which gives row and column percentages, respectively.

`1 2 3 4`

`table2 = table(dat$approval_status, dat$Credit_score) prop.table(table2) prop.table(table2, 1) prop.table(table2, 2)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10 11 12 13`

`Bad Good No 0.700 0.045 Yes 0.115 0.140 Bad Good No 0.93959732 0.06040268 Yes 0.45098039 0.54901961 Bad Good No 0.8588957 0.2432432 Yes 0.1411043 0.7567568`

In the previous section, we carried out preliminary data analysis. However, any quantitative analysis is incomplete without descriptive statistics. They are the foundation of data science and are defined as the measures that summarize given data. They include measures of central tendency (such as mean, median, and mode) and measures of dispersion (such as standard deviation, variance, and range).

The *summary()* function provides some of these statistics but not all of them. There are many other functions in R, that can be used to identify these measures. The *fivenum()* function is one; it returns the min, lower-hinge, median, upper-hinge, and max values. However, this function has a limitation in that it is not meaningful for factors.

`1`

`fivenum(dat$Income)`

{r}

Output:

`1`

`[1] 32000 56000 64000 72000 88000`

A better alternative to both the *summary* and the *fivenum* functions is the *psych* package, which provides several quantitative measures, including the standard deviation, skewness, and range.

The *first line of code* below loads the library, while the *second line* uses the 'describe' function in the library to print quantitative measures of all the variables.

`1 2`

`library(psych) describe(dat)`

{r}

Output:

`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22`

`vars n mean sd median trimmed mad min Marital_status* 1 200 2.02 0.79 2 2.02 1.48 1 Is_graduate* 2 200 1.58 0.49 2 1.60 0.00 1 Income 3 200 62750.00 10638.77 64000 63062.50 11860.80 32000 Loan_amount 4 200 349278.00 741366.91 77250 170134.38 35211.75 12200 Credit_score* 5 200 1.19 0.39 1 1.11 0.00 1 approval_status* 6 200 1.25 0.44 1 1.19 0.00 1 Investment 7 200 169693.55 267134.29 114800 130429.69 65590.22 6000 gender* 8 200 1.11 0.31 1 1.01 0.00 1 age 9 200 33.98 1.76 34 33.71 1.48 32 work_exp 10 200 7.84 1.33 8 7.88 1.48 4 max range skew kurtosis se Marital_status* 3 2 -0.03 -1.40 0.06 Is_graduate* 2 1 -0.32 -1.91 0.03 Income 88000 56000 -0.25 0.08 752.27 Loan_amount 6670000 6657800 4.53 28.55 52422.56 Credit_score* 2 1 1.61 0.60 0.03 approval_status* 2 1 1.12 -0.76 0.03 Investment 3466580 3460580 9.71 113.87 18889.25 gender* 2 1 2.47 4.14 0.02 age 49 17 4.63 31.04 0.12 work_exp 11 7 -0.25 0.08 0.09`

The high skewness value for the variables 'Loan_amount' and 'Investment’ indicate that these variables are not normally distributed.

We can use the syntax *cor(X, Y) or rcorr(X, Y)* to generate correlations between the two numerical columns.

`1`

`cor(dat$Income, dat$Investment)`

{r}

Output:

`1`

`[1] 0.06168653`

The value of 0.06 indicates weak correlation between the two variables.

In quantitative analysis, we are often required to summarize the data based on a few variables. This can easily be done with the *summarize()* function in the *dplyr* package. The code below summarizes and prints the average values of the three numerical variables.

`1 2 3 4`

`dat %>% summarize(avg_income = mean(Income,na.rm=TRUE), avg_loan = mean(Loan_amount,na.rm=TRUE), avg_age = mean(age,na.rm=TRUE))`

{r}

Output:

`1 2 3 4`

`A tibble: 1 x 3 avg_income avg_loan avg_age <dbl> <dbl> <dbl> 1 62750 349278 34.0`

The above output shows that the average age of the applicants is 34 years, and the average income is $62,750. The average loan amount applied for is $34,9278.

Another useful way of analyzing data is by using the *group_by()* function, which groups the input data based on a single column or multiple columns, then manipulates each such group of data. The line of code below groups the data frame by the variable ‘approval_status’, then calculates the average income for each label.

`1 2 3`

`dat %>% group_by(approval_status) %>% summarise(mean_inc = mean(Income))`

{r}

Output:

`1 2 3 4 5`

`A tibble: 2 x 2 approval_status mean_inc <fct> <dbl> No 62174. Yes 64431.`

The output above shows that the mean income of applicants whose loans were approved is slightly higher than that of the rejected applicants.

We have covered several techniques of quantitative analysis. There is also a field in statistics called *Inferential Statistics* that can be used for quantitative exploratory analysis. This is an extremely broad topic, and it is covered at length in another guide:

In this guide, you have learned techniques for performing exploratory data analysis using quantitative methods. You have learned how to use in-built 'R' functions and the analysis techniques of the powerful 'dplyr' package.

To learn more about data science in R, please refer to the following guides: 1. Interpreting Data Using Descriptive Statistics with R 2. Interpreting Data Using Statistical Models with R 3. Time Series Forecasting Using R 4. Hypothesis Testing - Interpreting Data with Statistical Models 5. Machine Learning with Text Data Using R 6. Visualization of Text Data Using Word Cloud in R

2