R for business users: data visualization

This is Part 3 of a three-part series on the R programming language. Part 1 explained how to import data into R, Part 2 focused on data cleaning (how to write R code that will perform basic data cleansing tasks), and Part 3 takes an in-depth look at data visualization.

As we learned earlier, R continues to draw new users in business settings, thanks to both its growing popularity and its pragmatic approach to data analysis. We've seen how data can be imported into R from a delimited text file. And we’ve looked at how to write R code that performs basic data cleansing tasks in a way that is much less error-prone than manually manipulating cells in a spreadsheet. Now, we’re going to dive right into data visualization with some useful examples.

Demo data

The examples below require a data frame populated and cleaned, as described in previous posts. These describe how data can be imported from a text file and preprocessed to get it into a form suitable for analysis and further processing. If you don’t have time to review these in detail, run the following commands to create the data frame and execute functions to format the data:

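library(tidyr)    # fill(), replace_na(), and the %>% pipe
library(dplyr)    # filter(), select(), arrange() used in later examples
library(ggplot2)  # ggplot() and related plotting functions

# Build the sample data, including the missing values (NA)
df <- data.frame(Year       = c(2000, rep(NA, 11)),
                 Month      = as.factor(1:12),
                 Quarter    = c('Q1', rep(NA, 2),
                                'Q2', rep(NA, 2),
                                'Q3', rep(NA, 2),
                                'Q4', rep(NA, 2)),
                 Balance    = c(10000, rep(NA, 2), 6000,
                                rep(NA, 3), 3000,
                                2000, rep(NA, 2), 1000),
                 Withdrawal = c(rep(NA, 3), 4000,
                                rep(NA, 3), 3000,
                                1000, rep(NA, 2), 1000))

# Carry the last known value forward, then treat missing withdrawals as 0
df <- df %>%
   fill(Year, Quarter, Balance) %>%
   replace_na(list(Withdrawal = 0))

View(df)  # opens the data viewer in RStudio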

These steps create the sample data and populate missing values. They do not include additional code related to data filtering and summarization using the dplyr package discussed earlier. These steps were examples of independent data exploration activities, but the outputted results were not saved for future reference. Any filtering and summarization required for visualizations below will be performed in conjunction with the steps needed to produce the plot itself.
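To make the effect of the two cleaning calls concrete, here is a small sketch on a made-up data frame (the name demo and its values are hypothetical and used only for illustration). It shows fill() carrying the last non-missing value down a column and replace_na() substituting a fixed value for whatever is still missing:

```r
library(tidyr)

# Hypothetical mini data set, used only for illustration
demo <- data.frame(Quarter    = c("Q1", NA, NA, "Q2"),
                   Balance    = c(10000, NA, NA, 6000),
                   Withdrawal = c(NA, NA, NA, 4000))

cleaned <- demo %>%
  fill(Quarter, Balance) %>%          # carry the last seen value downward
  replace_na(list(Withdrawal = 0))    # any remaining NA becomes 0

print(cleaned)
```

After these calls, the first three rows all belong to Q1 with a balance of 10000, and the withdrawal column contains 0 wherever no withdrawal was recorded.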

R base graphics

Out of the box, R includes a graphics system that can quickly create graphical output; in many cases, simply passing data to it provides an interesting result. Other graphics packages can produce nicer-looking graphics, but they often require longer, more involved function calls, even for simple charts. Consider the data in the Balance column of the data frame (df). The syntax used below displays the column by:

  1. Entering the variable that references the data frame (df).
  2. Appending the dollar sign indicating a column in the data frame is about to be referenced.
  3. Appending the name of the column itself; in this case, Balance.

The line beginning with the R prompt (>) is the function call entered by the user; the lines that follow are the output of the command:

> df$Balance
 [1] 10000 10000 10000  6000  6000  6000  6000  3000  2000
[10]  2000  2000  1000

The 1 and 10 in brackets indicate the index of the first element displayed on each line: 10000 is the first element, and 2000 (at the start of the second line) is the tenth. The df$Balance variable is known as a vector in R; it’s analogous to an array or list in other programming languages (an ordered sequence of values that can be referenced by their position, or index). The generic way to plot R objects is simply to call plot and pass it the data to be plotted:

plot(df$Balance)

In RStudio, the plot is displayed in the Plots panel (positioned in the lower right-hand corner by default). You can click the Zoom button to view the plot in a separate resizable window.
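Because the plot places each value at its position in the vector, it can help to see how positions work in R. The following is a minimal sketch; the balance vector simply re-creates the cleaned Balance column so the snippet runs on its own. Note that R counts positions from 1, not 0:

```r
# The cleaned Balance values, re-created here for a standalone example
balance <- c(10000, 10000, 10000, 6000, 6000, 6000,
             6000, 3000, 2000, 2000, 2000, 1000)

balance[1]           # the first element
balance[10]          # the tenth element
length(balance)      # how many elements the vector holds
seq_along(balance)   # the positions 1 through 12
```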

The plot function faithfully represents the data passed to it as a scatter plot. It uses the position of the entry or index as the x axis, the value contained in Balance as the y axis, and sets up the axis markings and labels. Although this is correct, it’s not the way you might expect to see this data. If you prefer a bar chart, you can call the barplot function instead:

barplot(df$Balance)

A bar chart is presented, but the x and y axes aren’t labeled. Thankfully, you can create labels by regenerating the plot and adding additional arguments:

barplot(df$Balance, xlab = 'Index', ylab='Balance')
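The individual bars can be labeled as well. The sketch below is one possible way to do it, using barplot's names.arg and main arguments; the balance and month vectors re-create the demo values inline, and the title text is only illustrative:

```r
# Demo values re-created inline so the example is self-contained
balance <- c(10000, 10000, 10000, 6000, 6000, 6000,
             6000, 3000, 2000, 2000, 2000, 1000)
month <- 1:12

# barplot() draws the chart and returns the x positions of the bar midpoints
mids <- barplot(balance,
                names.arg = month,            # one label under each bar
                xlab = "Month",
                ylab = "Balance",
                main = "Balance by month")    # illustrative title
```

Keeping the returned midpoints (mids here) is optional, but they are handy if you later want to add text or points above specific bars.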

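These examples all represent a single variable: Balance. In many cases, you’ll want to display two or more. The following example creates a line chart with the month on the x-axis and the balance on the y-axis. Because Month is stored as a factor, it is converted to a number first; passing a factor as the x value would cause plot to draw box plots rather than a line:

plot(as.integer(as.character(df$Month)), df$Balance, type="l", xlab="Month", ylab="Balance")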

As you can see, R’s base graphics make it easy to create a plot, but it takes human judgment to choose relevant variables and select the correct function and related options. Base graphics functions make good guesses for default values, but more advanced users prefer to explicitly choose options that produce more polished results.
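As an illustration of choosing options explicitly, here is a sketch of the same line chart with several defaults overridden. The specific color, line width, and title are arbitrary choices made for this example, and the data is re-created inline:

```r
# Demo values re-created inline so the example is self-contained
balance <- c(10000, 10000, 10000, 6000, 6000, 6000,
             6000, 3000, 2000, 2000, 2000, 1000)
month <- 1:12

plot(month, balance,
     type = "b",                  # draw both points and connecting lines
     col  = "steelblue",          # color of the points and lines
     lwd  = 2,                    # line width
     ylim = c(0, max(balance)),   # start the y-axis at zero
     xaxt = "n",                  # suppress the default x-axis
     xlab = "Month",
     ylab = "Balance",
     main = "Account balance by month")

# Draw a custom x-axis with one tick per month
axis(side = 1, at = month)
```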

The ggplot2 package

The ggplot2 package introduces a “Grammar of Graphics” that essentially deconstructs the process of producing plots in a systematic way. It makes the common rules and features that underlie the display of quantitative information explicit. By way of comparison, we’ll recreate the plots already presented using base graphics above:

ggplot(df, aes(x=seq_along(Balance), y=Balance)) + geom_point()

The ggplot function takes a data frame as its input and lets you reference columns by name, without the dollar sign syntax seen earlier. The aes function defines the aesthetics, which map components of the chart to particular variables. The seq_along function supplies the index of each Balance value, since the data frame has no column that holds it.

Unlike the plot function, which creates a scatter plot by default in the earlier example, the ggplot function has no default way of drawing the data. You need the geom_point() function to indicate that the data should be represented as points. The geom_bar() function is used in the next example to create a bar chart, along with additional function calls to format the x-axis and its label. You use the xlab function to label the x-axis, and the scale_x_discrete() function to indicate that the axis shows discrete categories rather than continuous values, which can include decimals. A discrete scale expects categorical data, so the index is wrapped in factor():

ggplot(df, aes(x=factor(seq_along(Balance)), y=Balance)) +
  geom_bar(stat="identity") +
  xlab('Index') +
  scale_x_discrete()

Patterns become more obvious the more you work with ggplot2; the underlying “grammar” can help suggest the changes required to modify a plot. Based on what we’ve seen in the previous two examples, we can reason that the line chart with Month on the x-axis and Balance on the y-axis can be produced by setting Month as the x variable and using a geom function for creating line charts. One extra detail is needed: because Month is a factor, geom_line() would otherwise treat each month as its own group and draw nothing, so group=1 tells it to connect all the points as a single series:

ggplot(df, aes(x=Month, y=Balance, group=1)) +
  geom_line() +
  scale_x_discrete()
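One consequence of this grammar is that a plot is an object you can store and extend. The sketch below, which assumes only the Month and Balance columns (re-created inline), defines the data and aesthetic mappings once and then adds different layers to them. The variable names p, line_chart, and combo_chart are arbitrary, and group = 1 is included so that geom_line() connects the points even though Month is a factor:

```r
library(ggplot2)

# Only the two columns needed here, re-created for a standalone example
df <- data.frame(Month   = factor(1:12),
                 Balance = c(10000, 10000, 10000, 6000, 6000, 6000,
                             6000, 3000, 2000, 2000, 2000, 1000))

# The data and the aesthetic mappings are defined once ...
p <- ggplot(df, aes(x = Month, y = Balance, group = 1))

# ... and each chart is that starting point plus one or more layers
line_chart  <- p + geom_line()
combo_chart <- p + geom_line() + geom_point()

print(combo_chart)
```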

Hadley Wickham, the creator of ggplot2, also created the dplyr and tidyr packages, and as a result they integrate smoothly. A series of piped statements can end with a call to the ggplot function, which receives the piped data frame instead of having data passed explicitly as the first parameter. This example uses a different approach to setting the x-axis values. It specifies continuous (rather than discrete) values but explicitly sets the breaks to whole numbers. Since a continuous scale requires numeric data, Month is first converted from a factor to an integer:

df %>%
  filter(Withdrawal > 0) %>%
  select(Month, Balance) %>%
  mutate(Month = as.integer(as.character(Month))) %>%
  arrange(desc(Month)) %>%
  ggplot(aes(Month, Balance)) +
    geom_bar(stat="identity") +
    scale_x_continuous(breaks=c(1:12))
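The same pattern works for summarization as well as filtering. As a sketch, the following totals the withdrawals per quarter and passes the summary straight to ggplot. The by_quarter and Total names are made up for this example, and the two columns it needs are re-created inline from the cleaned demo data:

```r
library(dplyr)
library(ggplot2)

# The two cleaned columns needed here, re-created for a standalone example
df <- data.frame(
  Quarter    = rep(c("Q1", "Q2", "Q3", "Q4"), each = 3),
  Withdrawal = c(0, 0, 0, 4000, 0, 0, 0, 3000, 1000, 0, 0, 1000)
)

# Summarize first: one row per quarter with the total withdrawn
by_quarter <- df %>%
  group_by(Quarter) %>%
  summarise(Total = sum(Withdrawal))

print(by_quarter)

# Then plot the summary, again without passing the data explicitly
by_quarter %>%
  ggplot(aes(Quarter, Total)) +
    geom_bar(stat = "identity")
```

Storing the summary in a variable before plotting is a design choice: it lets you inspect the numbers behind the chart, which is useful when a bar looks surprising.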

Charts created by ggplot2 tend to be visually appealing, but experts like Edward Tufte promote the practice of “ruthlessly removing unnecessary ink.” The previous call can have additional functions added to create a minimal version of the data displayed:

df %>%
  filter(Withdrawal > 0) %>%
  select(Month, Balance) %>%
  # scale_x_continuous() needs numeric data, so convert the Month factor
  mutate(Month = as.integer(as.character(Month))) %>%
  arrange(desc(Month)) %>%
  ggplot(aes(Month, Balance)) +
    geom_bar(stat="identity") +
    scale_x_continuous(breaks=c(1:12)) +
    theme_bw() +
    theme(panel.border = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.line = element_blank())

Reports and presentations


Graphics produced in R can be exported in standard formats and added to the document or presentation of your choice. You can also create documents and presentations directly in RStudio from its menu options; R Markdown is one such format. R code embedded in these documents is executed when the document is compiled, so the final result displays rendered plots and derived values.

Producing documents of this kind has immense value. Properly constructed, a report or presentation authored in this manner will banish errors normally caused by data from one part of a document being out of sync with a visualization generated from it later in the document. Besides eliminating this class of errors, you can use the same document to generate different results, based on a different or updated data set. This is yet another way that RStudio is bridging the gap between the advanced statistical capabilities of the R language and the most popular business software in use today.
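For the simpler route of exporting a graphic and placing it in another document, ggplot2 provides the ggsave function. The sketch below is one way to use it; the file name, dimensions, and resolution are arbitrary choices for illustration, the data is re-created inline, and the file is written to a temporary directory so the example doesn't touch your working folder:

```r
library(ggplot2)

# Month and Balance re-created inline; Month is numeric in this example
df <- data.frame(Month   = 1:12,
                 Balance = c(10000, 10000, 10000, 6000, 6000, 6000,
                             6000, 3000, 2000, 2000, 2000, 1000))

p <- ggplot(df, aes(Month, Balance)) +
  geom_line() +
  scale_x_continuous(breaks = 1:12)

# Illustrative output location; in practice you would choose your own path
out_file <- file.path(tempdir(), "balance.png")

# The file extension determines the output format (PNG here)
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 150)
```

The resulting image file can then be inserted into a word processing document or slide deck like any other picture.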

Takeaway


If you’re not already taking full advantage of R’s many features, it’s time to start. As we’ve seen throughout this three-part series, business users have seemingly limitless options with the R programming language. Not only can R make your workload more efficient, it can help reduce errors and create reports and presentations that include visualizations far more sophisticated than what's possible using standard word processing and presentation software.

 

Contributor

Casimir Saternos is a software developer and architect who has presented R-related topics in the RStudio: Get Started screencast available from Pluralsight, as well as on the Simple-Talk site and the Oracle Technology Network. He is the author of Client Server Web Apps with JavaScript and Java, published by O’Reilly, and has written articles that have appeared in Java Magazine.