R for business users: data cleaning
- select the contributor at the end of the page -
This is Part 2 of a three-part series on the R programming language. Part 1 showed you how to import data into R, Part 2 focuses on data cleaning (how to write R code that will perform basic data cleansing tasks), and Part 3 takes an in-depth look at data visualization.
As discussed in Part 1, R’s applicability in business settings continues to draw new users both because of its growing popularity and pragmatic approach to data analysis. We have seen how data can be imported into R from a delimited text file. In this post, we will show how to write R code that will perform basic data cleansing tasks using the tidyr package in a way that is much less error-prone than manually manipulating cells in a spreadsheet.
Speaking of spreadsheets, a readable one often includes or excludes features that are essential to accurate data processing. Like a spreadsheet, R can also be used to process data and to display it. However, R largely separates these two concerns. R provides interactive commands and scripts for data processing. Such scripts can display data as it’s processed, but are often geared towards processing the data. R and RStudio also include features for producing output specifically rendered for human consumption like images, reports and presentations.
In order to process data in a script, it’s necessary to ensure the data is in a consistent, unambiguous form. We’ll take a look at present packages that were designed specifically for cleaning and reformatting data. These provide a little language consisting of a few key ideas that can be used to perform the majority of common data cleansing activities.
Review: data Import
The same comma-separated file introduced in Part 1 represents the data to be processed.
"","Year","Month","Quarter","Balance","Withdrawal" "1",2000,"1","Q1",10000,NA "2",NA,"2",NA,NA,NA "3",NA,"3",NA,NA,NA "4",NA,"4","Q2",6000,4000 "5",NA,"5",NA,NA,NA "6",NA,"6",NA,NA,NA "7",NA,"7","Q3",NA,NA "8",NA,"8",NA,3000,3000 "9",NA,"9",NA,2000,1000 "10",NA,"10","Q4",NA,NA "11",NA,"11",NA,NA,NA "12",NA,"12",NA,1000,1000 |
If you performed the steps listed in Part 1, you can re-run the command that was generated by RStudio when you imported this data. There are two arguments to the function call, the first is the path to the csv. This path might vary depending on the name of the file and where it resides on your system. The type of slashes used are operating system dependent, and certain symbols (like ~ indicating the user’s home directory) are also operating system dependent. If you’re running R on a non-Windows environment like OSX or Unix and the csv was saved to a directory named Desktop in the user’s home directory, the command will appear like this:
df <- read.csv("~/Desktop/test.csv", stringsAsFactors=FALSE)
The data is presented in a form common to spreadsheets; it’s broken up in a way that’s readable to people. Certain values are omitted since it’s assumed that a previously listed value or zero should be used. Processing a file in such a format brings in unanticipated results since most data analysis workflows are designed for rigidly formatted data imports that conform to a stricter, less ambiguous set of rules.
Data cleaning
In this step, the data is formatted and restructured in a way that’s better suited for later analysis.
Missing data, represented by NA (yes, that’s not available), can introduce difficulties. Such a designation can be used to describe missing, irrelevant, meaningless or unknowable data. It’s clear to most people that the missing values under Year, Quarter, and Balance are simply excluded for redundancy. However, the NAs appearing under the withdrawal column indicate that zero dollars were withdrawn. Functionality in the tidyr package replaces the NA values.
df <- df %>%
fill(Year, Quarter, Balance) %>%
replace_na(list(Withdrawal = 0))
This code snippet performs the following operations:
- Start with the data frame containing data imported from the spreadsheet; this data is stored in the df variable.
- Replace NA values in the Year, Quarter and Balance columns of the df data frame using the value from the previous record.
- Replace NA values in the Withdrawal column with zero.
- Assign this new dataset to the variable (df) that was holding the original data set as imported; the original dataset is overwritten with the new values.
There are a couple of R operators worth mentioning since they appear in other code here. The <- operator is made up of two characters meant to look like an arrow. It indicates that the value derived from the expression on the right of the arrow should be assigned to the variable on the left. This operator is so powerful that the expression above could be rewritten using only this operator.
tmp1 <- fill(df, Year, Quarter, Balance)
tmp2 <- replace_na(tmp1, list(Withdrawal = 0))
df <- tmp2
We avoided the extra variable assignments in the original code snippet by hiding these them. We did this by representing a series of operations in sequence that passed the data from one stage in the process to another.
The three-character operator %>% is a “pipe” operator that serves as shorthand, eliminating the need for the additional variable assignments. It sends output from the expression on the left as input to the expression on the right. With this operator in use, the original example dictates that the dataframe referenced by the df variable is sent as input to the fill function. Next, the data frame that was returned from the fill function call is sent as input to the replace_na function call. Finally, this outputted data frame is assigned to the df variable, overwriting its original contents.
This pipe operator is used in a number of other R packages to create code that requires fewer variable definitions and can be read from left to right; almost like reading English.
The data frame when viewed now contains no NA values.
Data summarization
Many statistical functions are available in the R language and use familiar names; these functions can be run at the console. In the following examples, a data frame column is referenced by name on the data frame (df). A $ is used to separate the data frame name and the column name. The prompt symbol > appears at the beginning of the line. The function call entered by the user is shown in blue beside the prompt, and the output of the command is shown in black:
> mean(df$Withdrawal)
[1] 750
> min(df$Withdrawal)
[1] 0
> max(df$Withdrawal)
[1] 4000
These particular functions are run so often in statistical analysis that there is a single function that returns all of these results immediately:
> summary(df$Withdrawal)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 0 750 1000 4000
Of course, in financial analysis, it’s far more common to calculate sums, which follow the same function call pattern shown above:
> sum(df$Withdrawal)
[1] 9000
If it’s easier to visualize the data in a chart, a line plot can be created with a call to the plot function specifying type="l" for line:
plot(df$Withdrawal, type="l")
These are just a few common examples; the R base language includes all the functionality needed to handle most data manipulation and plotting. However, certain packages provide this type of functionality using syntax that’s much easier to read and remember.
Filtering and ordering data
The dplyr package was written to filter and order data stored in data frames. Based on the data introduced earlier, a listing of months and their corresponding balances (where there was a withdrawal amount greater than zero) looks like this:
Month Balance
1 4 6000
2 8 3000
3 9 2000
4 12 1000
The filter function limits the number of rows displayed, while the select function limits which columns are displayed. The output above can be created using the following expression:
df %>% filter(Withdrawal > 0) %>% select(Month, Balance)
Changing the arrangement of this report to descending order (by month) involves simply adding an additional pipe and call to the arrange function to the snippet listed above:
%>% arrange(desc(Month))
Although the pipe operator isn’t necessary, it makes it simple to add an additional processing step to the end of a previous sequence. This allows you to perform data manipulation and construct a final pipeline that performs all steps in a controlled predictable manner, free of the errors that often occur when manually manipulating a spreadsheet.
Grouping data
The dplyr package also includes capabilities to group data, along with the ability to create summaries at the group levels and derive new columns. Suppose we wanted to produce a report that listed each quarter, along with the total and average withdrawal amount per month:
Quarter Withdrawals AverageMonthlyWithdrawal
1 Quarter 1 0 0.00
2 Quarter 2 4000 1333.33
3 Quarter 3 4000 1333.33
4 Quarter 4 1000 333.33
The dplyr package includes the mutate function which can be used to create derived columns (in this case replacing each Q with the word Quarter spelled out). It also includes the group_by function which assigns the level used for subsequent summarization, and the summarize function which is used to create the sum, mean or other calculation at the level previously specified in the group_by call:
df %>% mutate(Quarter=gsub('Q', 'Quarter ', Quarter)) %>%
group_by(Quarter) %>%
summarize(Withdrawals = sum(Withdrawal), AverageMonthlyWithdrawal=round(mean(Withdrawal),2))
Unlike spreadsheets, these functions can also be used to perform elaborate manipulation of data frames. Likewise, R and dplyr can script these types of manipulations to be performed almost instantly and without error.
Takeaway
You should now have a better handle on data cleansing and data manipulation using two R packages tidyr and dplyr. Stay tuned for Part 3 when we look at how to create aesthetically pleasing and informative charts and plots using the ggplot2 package.