 Deepika Singh

Exploring Data Visually with R

• Sep 16, 2019
• 2,410 Views
Data
R

Introduction

Visualization is an important part of exploratory data analysis: it helps in understanding the data, building hypotheses, and engineering features. In this guide, you will learn techniques for visualizing data using the powerful libraries available in the statistical programming language R.

In this guide, you will learn how to build the following visualization plots:

1. Scatter Plot

2. Histogram

3. Bar Chart

4. Box Plot

5. Pie Chart

6. Correlogram

7. Multivariate Plots

Data

In this guide, we will be using the fictitious data of loan applicants containing 600 observations and 10 variables, as described below:

1. Marital_status: Whether the applicant is married ("Yes") or not ("No").

2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").

3. Income: Annual Income of the applicant (in USD).

4. Loan_amount: Loan amount (in USD) for which the application was submitted.

5. Credit_score: Whether the applicant's credit score is good or not. In the data preview below, a good score is recorded as "Satisfactory".

6. approval_status: Whether the loan application was approved ("Yes") or not ("No").

7. Age: The applicant's age in years.

8. Sex: Whether the applicant was a male ("M") or a female ("F").

9. Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant.

10. Purpose: Purpose of applying for the loan.
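The data file itself is not included here, so the sketch below builds a stand-in data frame with the same ten column names and types, which can be handy for trying out the plotting code. Everything in it is an assumption: the values are randomly generated, the name `dat_demo` is hypothetical, the second `Credit_score` label is invented, and only the three `Purpose` categories visible in this guide are used.

```r
# Stand-in data with the column names and types described above.
# All values are randomly generated; they are NOT the guide's data.
set.seed(42)
n <- 600
income <- sample(seq(30000, 150000, by = 100), n, replace = TRUE)

dat_demo <- data.frame(
  Marital_status  = sample(c("Yes", "No"), n, replace = TRUE),
  Is_graduate     = sample(c("Yes", "No"), n, replace = TRUE),
  Income          = as.integer(income),
  Loan_amount     = as.integer(round(income * runif(n, 0.5, 3))),
  # "Not_satisfactory" is a hypothetical label for the second category
  Credit_score    = sample(c("Satisfactory", "Not_satisfactory"), n, replace = TRUE),
  approval_status = sample(c("Yes", "No"), n, replace = TRUE),
  Age             = sample(21:65, n, replace = TRUE),
  Sex             = sample(c("M", "F"), n, replace = TRUE),
  Investment      = as.integer(round(income * 0.7)),
  Purpose         = sample(c("Education", "Travel", "Others"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Inspect the structure: 600 observations of 10 variables
str(dat_demo)
```

Assigning `dat_demo` to `dat` would let the remaining code in this guide run, although the resulting plots will not match the ones described in the text.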

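```r
# Load the plotting and data-manipulation packages
library(ggplot2)
library(GGally)
library(dplyr)

# 'dat' is assumed to already hold the loan applicant data
glimpse(dat)
```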

Output:

```
Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose         <chr> "Education", "Travel", "Others", "Others", "Travel", "...
```

The output shows that the data has six categorical and four numerical variables. However, for visualizing the categorical variables, it is better to convert them into 'factor' variables.

The first line of code below specifies the position of the categorical variables, while the second line converts these into factor variables. The third line prints the information about the data, which confirms that the conversion to factor variables is complete.

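```r
# Column positions of the categorical variables
cat_cols <- c(1, 2, 5, 6, 8, 10)
# Convert those columns to factors
dat[, cat_cols] <- lapply(dat[, cat_cols], factor)
# Confirm the new column types
glimpse(dat)
```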

Output:

```
Observations: 600
Variables: 10
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
$ Purpose         <fct> Education, Travel, Others, Others, Travel, Travel, Tra...
```
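Selecting columns by position breaks silently if the column order ever changes. As a possible alternative, the sketch below converts every character column to a factor by checking its type instead. It uses base R only, and the small data frame `demo` is a hypothetical stand-in filled with the first three values shown in the preview above.

```r
# Hypothetical three-row stand-in using values from the preview
demo <- data.frame(
  Sex     = c("F", "F", "M"),
  Age     = c(25L, 29L, 27L),
  Purpose = c("Education", "Travel", "Others"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert only those to factors
is_chr <- vapply(demo, is.character, logical(1))
demo[is_chr] <- lapply(demo[is_chr], factor)

# Sex and Purpose are now factors; Age is still an integer column
str(demo)
```

The numeric columns are left untouched, so the same two lines should work on the full data regardless of column order.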

Now, the data is ready for visualization, which will be done using the powerful ggplot2 package.

Scatter Plot

A scatter plot is a popular visualization technique for examining the relationship between two quantitative variables. It can be plotted using the geom_point() function. We will visualize the relationship between the age and investment level of the applicant. The code below maps 'Age' to the x axis and 'Investment' to the y axis through the essential aesthetic mappings, and then calls the 'geom_point' function with the additional arguments 'size' and 'color'.

```r
ggplot(dat, aes(x = Age, y = Investment)) +
  geom_point(size = 2, color = "blue") +
  xlab("Age in years") +
  ylab("Investment in USD") +
  ggtitle("Age vs Investment Levels")
```

The resulting plot suggests that there is little or no linear relationship between the age and investment level of the applicant. The 'geom_point' function accepts many more arguments, which can be listed by running the ?geom_point command in the R console.
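A visual impression of "little or no linear relationship" can be backed up with a number. The sketch below computes the Pearson correlation coefficient with base R's `cor()`. It uses only the first eight 'Age' and 'Investment' values visible in the data preview, so the result says nothing reliable about the full 600 observations; it only illustrates the call.

```r
# First eight values of Age and Investment as shown in the data preview
age        <- c(25, 29, 27, 33, 29, 25, 29, 27)
investment <- c(21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690)

# Pearson correlation: values near 0 indicate a weak linear relationship,
# values near -1 or 1 a strong one
r <- cor(age, investment, method = "pearson")
print(r)
```

On the full data, the equivalent call would be `cor(dat$Age, dat$Investment)`.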

We can extend the scatter plot to include other variables as well. For example, it is useful to read the above scatter plot in conjunction with the target variable. This can be done by adding col = approval_status to the aesthetic mapping, as shown below.

```r
ggplot(dat, aes(x = Age, y = Investment, col = approval_status)) +
  geom_point()
```

The points are now displayed in two colors: blue for approval_status = "Yes" and red for approval_status = "No".

Histogram

The histogram is used to visualize the distribution of the numerical variables. A histogram shows the number of data values within a bin for a numerical variable, with the bins dividing the values into equal segments. The vertical axis of the histogram shows the count of data values within each bin.
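To make the binning step concrete, the sketch below counts values per equal-width bin by hand using base R's `cut()` and `table()`. The income values and the bin width of 50,000 are made up for illustration. ggplot2 chooses its own bin boundaries, so counts in a plotted histogram may differ.

```r
# Made-up income values for illustration
income <- c(30000, 30000, 45000, 52000, 61000, 75000, 89900, 133300)

# Three equal-width bins: [0, 50000], (50000, 100000], (100000, 150000]
breaks <- seq(0, 150000, by = 50000)
bins <- cut(income, breaks = breaks, include.lowest = TRUE, dig.lab = 6)

# Number of values falling into each bin; these are the bar heights
counts <- table(bins)
print(counts)
```

Here three values fall into the first bin, four into the second, and one into the third.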

The code below plots a histogram of the 'Income' variable, using the geom_histogram() function.

```r
ggplot(dat, aes(x = Income)) +
  geom_histogram(col = "black") +
  ggtitle("Annual Income Distribution")
```

A histogram can be extended by mapping a categorical variable to the 'fill' aesthetic. The bars can be positioned either stacked or dodged, as shown in the lines of code below for the 'Age' variable.

```r
# Stacked histogram
ggplot(dat, aes(x = Age, fill = approval_status)) +
  geom_histogram(position = "stack") +
  ggtitle("Stacked Histogram")

# Dodged histogram
ggplot(dat, aes(x = Age, fill = approval_status)) +
  geom_histogram(position = "dodge") +
  ggtitle("Dodged Histogram")
```

Bar Chart

The bar chart is used to plot a categorical variable or a combination of continuous and categorical variables. It is built using the geom_bar() function. The code below plots the bar chart for the variable 'Purpose', where the vertical height represents the count of the categories.

```r
ggplot(dat, aes(Purpose)) +
  geom_bar()
```

The bar chart above plots the frequency distribution of the 'Purpose' variable. However, we may want to visualize the relationship between a summary measure, such as mean income, and the purpose for which the loan was applied.

Since 'geom_bar' counts observations by default, we first aggregate the 'Income' variable by the 'Purpose' variable. The first line of code below uses the FUN=mean argument to calculate the mean income for each class of the 'Purpose' variable. The second line renames the columns, while the third line creates the required plot, using stat = "identity" so that the bar heights are taken directly from the aggregated values.

```r
# Mean income for each loan purpose
aggdata <- aggregate(dat$Income, by = list(dat$Purpose), FUN = mean, na.rm = TRUE)

# Give the aggregated columns descriptive names
names(aggdata) <- c("Purpose", "Mean_Income")

# Bar heights come directly from Mean_Income
ggplot(aggdata, aes(Purpose, Mean_Income)) +
  geom_bar(stat = "identity", fill = "blue") +
  ggtitle("Mean Income across the Purpose Categories")
```
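As a possible alternative to the `by = list(...)` form, base R's `aggregate()` also accepts a formula, which keeps the original column names so that only the value column needs renaming. The sketch below uses a made-up four-row data frame named `demo`; both the name and the values are assumptions for illustration.

```r
# Made-up stand-in data for illustration
demo <- data.frame(
  Purpose = c("Education", "Travel", "Education", "Travel"),
  Income  = c(30000, 40000, 50000, 60000)
)

# Formula interface: mean of Income within each level of Purpose
aggdata_demo <- aggregate(Income ~ Purpose, data = demo, FUN = mean)

# Only the value column needs a new name
names(aggdata_demo)[2] <- "Mean_Income"
print(aggdata_demo)
```

With these values, the mean income is 40000 for "Education" and 50000 for "Travel".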

It is also possible to create a stacked bar chart, an advanced version of a bar chart, which is used to visualize a combination of categorical variables, as shown below.

```r
ggplot(dat, aes(Purpose, fill = approval_status)) +
  geom_bar() +
  labs(title = "Stacked Bar Chart", x = "Purpose", y = "Count")
```
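The segments of a stacked bar chart correspond to the cells of a two-way contingency table. The sketch below builds that table with base R's `table()` and converts it to row proportions with `prop.table()`. It uses only the first five 'Purpose' and 'approval_status' values visible in the data preview, stored in a hypothetical data frame `demo`.

```r
# First five values of Purpose and approval_status from the data preview
demo <- data.frame(
  Purpose         = c("Education", "Travel", "Others", "Others", "Travel"),
  approval_status = c("Yes", "Yes", "No", "No", "Yes")
)

# Cross-tabulate: rows are purposes, columns are approval outcomes
counts <- table(demo$Purpose, demo$approval_status)
print(counts)

# Share of each approval outcome within a purpose (each row sums to 1)
shares <- prop.table(counts, margin = 1)
print(shares)
```

To plot such proportions rather than counts, `geom_bar(position = "fill")` can be used in the chart above.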

Box Plot

The box plot is a standardized way of displaying the distribution of data based on a five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum). It is often used to identify data distribution and detect outliers. The line of code below plots the distribution of the numeric variable 'Age' against the categorical variable 'Purpose'.

```r
ggplot(dat, aes(Purpose, Age)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Box Plot")
```

The outliers are shown as black circles in the above chart. It is also possible to visualize the above distribution with respect to the target variable, to see if there are any insights. This can be done by mapping fill = factor(approval_status) inside aes() in the geom_boxplot() function, as shown in the line of code below.

```r
ggplot(dat, aes(Purpose, Age)) +
  geom_boxplot(aes(fill = factor(approval_status))) +
  labs(title = "Box Plot")
```
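The five-number summary behind a box plot can be computed directly. The sketch below uses base R's `fivenum()` for the summary and `boxplot.stats()` to list the points lying more than 1.5 times the interquartile range beyond the box, which is the usual outlier rule. The age values are made up, and ggplot2 computes its quartiles slightly differently, so results on real data may differ marginally from the plotted boxes.

```r
# Made-up ages with one unusually high value
age <- c(21, 25, 27, 29, 33, 35, 60)

# Minimum, lower hinge (Q1), median, upper hinge (Q3), maximum
five <- fivenum(age)
names(five) <- c("min", "Q1", "median", "Q3", "max")
print(five)

# Points beyond 1.5 * IQR from the box are reported as outliers
out <- boxplot.stats(age)$out
print(out)
```

Here the summary is 21, 26, 29, 34, 60, and the value 60 is flagged as an outlier because it lies above 34 + 1.5 × 8 = 46.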

Pie Chart

The pie chart is a technique used to show the composition of a categorical variable. In ggplot2, it can be implemented by drawing a single stacked bar and transforming it with the coord_polar() function. The lines of code below will create the pie chart of the 'Purpose' variable.

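```r
# Frequency of each loan purpose
table(dat$Purpose)

# Store the frequencies in a data frame with descriptive column names
p <- as.data.frame(table(dat$Purpose))
colnames(p) <- c("purpose", "freq")

# A single stacked bar with one segment per purpose
pie <- ggplot(p, aes(x = "", y = freq, fill = factor(purpose))) +
  geom_bar(width = 1, stat = "identity") +
  labs(fill = "purpose", x = NULL, y = NULL, title = "Pie Chart of Purpose")

# Polar coordinates turn the stacked bar into a pie
pie + coord_polar(theta = "y", start = 0)
```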
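Pie slices are easier to read when the underlying shares are known. The sketch below extends the frequency table with a percentage column using base R. It uses only the first six 'Purpose' values visible in the data preview, and the names `purpose` and `p_demo` are hypothetical.

```r
# First six Purpose values from the data preview
purpose <- c("Education", "Travel", "Others", "Others", "Travel", "Travel")

# Frequency table as a data frame, as in the pie chart code
p_demo <- as.data.frame(table(purpose))
colnames(p_demo) <- c("purpose", "freq")

# Share of each category in percent; this is the size of its slice
p_demo$percent <- round(100 * p_demo$freq / sum(p_demo$freq), 1)
print(p_demo)
```

For these six values, "Travel" accounts for half of the pie.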

Correlogram

The correlogram is an important technique for identifying multicollinearity in the data, because it displays the pairwise correlations between the numerical variables in a single chart. It is implemented using the ggcorrplot package, loaded in the first line of code below.

The second line creates the dataframe containing the numerical variables, while the third line computes the correlation. The fourth line creates the correlogram plot, where the arguments like 'colors', 'outline.color', and 'show.legend' are used to control the display of the chart.

```r
library(ggcorrplot)

# Correlation matrix of the numerical variables
cordata <- dat[, c(3, 4, 7, 9)]
corr <- round(cor(cordata), 1)

# Plot
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, lab_size = 3,
           method = "circle", colors = c("blue", "white", "red"),
           outline.color = "gray", show.legend = TRUE, show.diag = FALSE,
           title = "Correlogram of loan variables")
```
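Strongly correlated pairs can also be listed programmatically instead of being read off the chart. The sketch below computes the correlation matrix with base R's `cor()` and extracts the pairs whose absolute correlation exceeds a threshold. It uses only the first seven rows visible in the data preview, and the 0.8 cut-off is an arbitrary assumption, so the result illustrates the technique rather than describing the full data.

```r
# First seven rows of the numerical variables from the data preview
cordata <- data.frame(
  Income      = c(30000, 30000, 30000, 30000, 89900, 133300, 136700),
  Loan_amount = c(60000, 90000, 90000, 90000, 80910, 119970, 123030),
  Age         = c(25, 29, 27, 33, 29, 25, 29),
  Investment  = c(21000, 21000, 21000, 21000, 62930, 93310, 95690)
)

# Pairwise Pearson correlations
corr <- cor(cordata)

# Keep each pair once (upper triangle) where |r| exceeds the cut-off
high <- which(abs(corr) > 0.8 & upper.tri(corr), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
print(pairs_df)
```

In these rows, 'Investment' is exactly 0.7 times 'Income', so that pair shows a correlation of 1, which is the kind of redundancy a correlogram is meant to expose.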

Multivariate Plots

In the previous sections, we have discussed techniques to visualize the distribution of one, two, or three variables. However, if the number of variables is high, it may become cumbersome to visualize them separately. To make this task easier, we can use the GGally package in R, which provides the ggpairs() function for visualizing pairwise relationships.

The lines of code below load the 'GGally' library and create the pairwise plot for the continuous variables.

```r
library(GGally)

# Keep only the continuous variables
num_df <- dat[, c(3, 4, 7, 9)]
ggpairs(num_df)
```

It is also possible to visualize the pairwise plots for a combination of categorical and continuous variables. The first line of code below creates a dataframe consisting of two continuous and three categorical variables, while the second line creates the plot.

```r
mixed_df <- dat[, c(1, 4, 6, 7, 8)]
ggpairs(mixed_df)
```

Conclusion

In this guide, you have learned about the different visualization techniques using the popular 'ggplot2' package. You also learned about the 'GGally' package for pairwise visualization of multiple variables at a time.