R for business users: importing data
- select the contributor at the end of the page -
This is Part 1 of a three-part series on the R programming language. Part 1 shows you how to import data into R, Part 2 focuses on data cleaning (how to write R code that will perform basic data cleansing tasks), and Part 3 takes an in-depth look at data visualization.
It’s difficult to imagine a modern business analyst who never creates, views or manipulates spreadsheets. The R programming language, like a spreadsheet program, is designed for data analysis. Its applicability in business settings continues to increase both because of its growing popularity and pragmatic approach to data analysis. Business professionals have found R as essential to their daily work as the venerable spreadsheet program.
Unlike many programming languages, the R community boasts a membership from a surprisingly wide range of backgrounds. For years, statisticians and scientists have used R and its predecessor, S, for data analysis. The ever-increasing store of valuable data available to businesses has caused R to be adopted by professionals in a variety of other disciplines. Business professionals who’ve pushed spreadsheets to their limits are joining the ranks of R users who have already been dealing with large, complex data sets for years.
Though sophisticated and full-featured, R is flexible and simple to use for many common data operations. Let’s take a look at how to create and import data into R, with a focus on replicating common spreadsheet functions in a way that is repeatable and less error-prone.
Spreadsheets are excellent interactive tools, but lack the controls needed for repeatable processes. Business users caught in a cycle of performing complex tasks with a spreadsheet can find themselves thinking that there must be a better way. A single click or keyboard control sequence can cause a significant error that is not apparent immediately. R can be used interactively, but being command-based, it can be scripted to ensure the exact same actions are performed each time a process is repeated. Scripted processes lend themselves to greater discipline and structure. This opens the possibility of data processing that reaches a much higher degree of formality, control and optimization.
R and Spreadsheet Functions
The examples that follow are demonstrated using R in the context of RStudio Desktop. This software, as well as all third party packages, are freely available for download. See the RStudio Desktop page for instructions on installing RStudio Desktop and the R programming language itself. The package installation is done once these tools are installed locally.
R packages provide much of the functionality that makes R so extensible and pertinent to a wide range of applications. If this is your first time using the packages that follow, they can be installed using the install.packages command or within RStudio by choosing the Packages tab and clicking the install button which brings up a dialog to perform the installation.
Installed packages are available for use, but must be loaded into the current R session when you intend to actually use them. Three packages including tidyr, dplyr and ggplot2 will be loaded using the statements below. These packages, all written by prolific R programmer Hadley Wickham, share a number of commonalities and work well together. The tidyr package is used for addressing missing values; dplyr for data filtering, ordering and summarization; and ggplot2 for creating graphical plots and charts.
The comma-separated file shown below represents the data to be processed. It consists of a series of entries that look a bit like a checkbook. NA indicates a missing value.
This data is easier to view when opened in a spreadsheet program.
But how can we get this data into R? It’s possible to add it by writing a script that programmatically creates the data. It’s not the most efficient way, but if you’re following along, it’s easy to copy this code chunk, paste it in at the R console and instantly have the data available.
df <- data.frame(Year = c(2000, rep(NA, 11)),
Month = as.factor(1:12),
Quarter = c('Q1', rep(NA, 2),
'Q2', rep(NA, 2),
'Q3', rep(NA, 2),
Balance = c(10000,rep(NA, 2), 6000,
rep(NA, 3), 3000,
2000, rep(NA, 2), 1000),
Withdrawal = c(rep(NA, 3), 4000,
rep(NA, 3), 3000,
1000, rep(NA, 2), 1000)
Again, this is not easiest way to get data in, and demonstrates why R is often seen as obscure or impractical when first encountered. For folks curious about the details of the R code listed, its general purpose is to create a data frame and assign it to a variable named df. Each column (Year, Month, Quarter, Balance and Withdrawal) is described by its name, followed by an equal sign and an expression that creates data for a given row.
The c function “combines” its elements into a vector; the series of values that comprises the column. The rep function specifies that the item listed as the first argument be “repeated” the number of times indicated by the second argument. The as.factor(1:12) expression takes a sequence of numbers from one through twelve and casts them as factor variables, and a variety of string variables (enclosed by quotes), numbers and NAs (indicating not available) are used to populate individual values. There is no need to worry about the details of this code, as it’s simply included to illustrate that data can be created using code, and that it’s generally not the best way to construct a data frame. It’s far easier to import data from a structured format.
R provides many different functions for importing and exporting data in a variety of formats, which can be a bit daunting to sift through initially. RStudio provides a simple way to import the data, and also produces a line of code required to perform an import.
- Open a text editor
- Copy the data shown in the comma-separated file previously
- Paste it in a new text file
- Save it as a file named test.csv on your local file system
The data can then be imported using RStudio by choosing the Environment Tab and clicking the Import Data Set button.
This will bring a file selection dialog where you can navigate to the directory of the test.csv file and open it. Opening the file causes a dialog to appear where import options can be configured.
The name for the object that will hold the data can be entered in the name field (by default, the name of the file is used to populate this field). Make sure to uncheck the Strings as factors box (factors are a way that statisticians represent string data for efficiency and organizational purposes). This will allow non-numeric string data to be treated in a manner familiar to spreadsheet users.
When the import button is clicked, the data is actually imported and presented for viewing.
The Environment tab displays all objects currently available in the R session environment. In this case, the only object shown is the data frame named df that contains the imported data. The R console contains the commands that were constructed and executed by clicking through the import dialog. These can be set aside and later used in place of clicking through the dialog options.
df <- read.csv("~/Desktop/test.csv", stringsAsFactors=FALSE)
The second command (View(df)) isn’t necessary but is essential for seeing the current state of the data. It opened the upper left pane and populated it with a spreadsheet-like view of the imported data.
Importing data is a prerequisite to all other data analysis tasks. Thankfully, RStudio makes importing CSV files straightforward, and R supports many different file formats and methods of importing. You should now have a handle on R and R Studio, so go ahead and get started -- and stay tuned for more posts in this series.