Data frames are a practical way to manipulate data; they show up everywhere in the data world. Whether you're importing data, doing descriptive statistics, creating visualizations, or doing machine learning - data frames will be at the heart of your work. R is a language that makes it relatively easy to get into these data worlds. Before jumping into data frame specifics, I should note that here we'll not only focus on base R but also tidyverse solutions via its series of packages. Whether you're a newbie or a vet, the tidyverse makes your R work more efficient.
For those who are curious, we'll build this via RStudio and R Markdown.
The great thing about data frames is that everyone is likely familiar with the concept. Data frames are essentially tables. If you think about data formally, it consists of attributes and records. Here's an example from a popular dataset on aircraft movements (displayed in RStudio).
Notice that whether this is displayed in an R data frame, an Excel table, or a SQL table, each row (record) represents a flight and each column (attribute) describes that flight. Each cell holds the value of one attribute for one record.
Once you have created this data frame, with the right skills, you could riff across most of the data science and analytics world; much of the work in these fields can start with data frames.
It's easiest to start working with data by creating a reproducible example (this is also a great way to get help on Stack Overflow). I should first note that in R there are multiple ways of doing things. Base R (i.e., the functionality that comes out-of-the-box) provides hundreds of functions. On top of that, the community has built many other wonderful packages that can do things in a slightly more elegant or intuitive way. In that vein, I'll first show how to create a data frame in base R and then will focus on tidyverse solutions.
Here's how to create a simple data frame that records the names and jobs of various fictional people. First, in base R.
    # We assign the data frame to variable df_base
    df_base <- data.frame(name = c("Jane","Sri","Eliza","Joe"),
                          occupation = c("engineer","designer","architect","engineer"))
Note that you simply use the data.frame function, provide names for your columns, and populate the contents of the columns (using the c() vector functionality). If that doesn't make sense yet, just copy and paste this into RStudio and follow along.
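Now, we can look at this data frame by simply typing the variable name, df_base.
Fairly straightforward - it looks like a table. Note that a data frame can hold a number of data types (i.e., the columns can be characters, integers, dates, factors, etc.). In RStudio's display of the data frame, the data types appear below the column names; in this case, our two columns were coded as factors. (That is the default in R versions before 4.0.0. From 4.0.0 on, data.frame keeps strings as characters unless you pass stringsAsFactors = TRUE.)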
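If you'd rather check the column types directly than read them off the display, a minimal sketch along these lines may help. It rebuilds the same data frame. The stringsAsFactors = TRUE argument is an assumption added here (it is not in the original call) so that the factor behaviour is reproduced regardless of R version.

```r
# Rebuild the example data frame. stringsAsFactors = TRUE is an added
# assumption: it forces strings to be stored as factors on any R version.
df_base <- data.frame(name = c("Jane", "Sri", "Eliza", "Joe"),
                      occupation = c("engineer", "designer", "architect", "engineer"),
                      stringsAsFactors = TRUE)

# Report the class of every column
col_types <- sapply(df_base, class)
print(col_types)

# str() gives a compact overview of the dimensions and column types
str(df_base)
```

Dropping the stringsAsFactors argument and rerunning shows what your own R version does by default.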
While it's good to know how base R works, much of the data science community has embraced the tidyverse and works with a slightly updated version of the data frame, called a tibble. It's quite similar (and actually is a data frame). Here's how we build a tibble-style data frame:
    # Load the tidyverse package
    library(tidyverse)
    # Assign the data frame to variable df_tidy
    df_tidy <- tibble(name = c("Jane","Sri","Eliza","Joe"),
                      occupation = c("engineer","designer","architect","engineer"))
And we can look at this tibble similarly (i.e., by typing the variable name):
You'll notice that it looks quite similar, but the columns were coded as characters instead of factors (note the chr). This shift from data frames to the modern tibble is described in Hadley Wickham's impressive (and free) R for Data Science. We'll show both methods in a couple of examples and then just focus on the tidyverse.
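To see for yourself that a tibble really is a data frame with character columns, here is a small sketch. It loads only the tibble package (one of the tidyverse packages) rather than the whole tidyverse, which is a choice made for this example.

```r
# tibble is one of the tidyverse packages; loading it alone is enough here
library(tibble)

df_tidy <- tibble(name = c("Jane", "Sri", "Eliza", "Joe"),
                  occupation = c("engineer", "designer", "architect", "engineer"))

# A tibble passes the data frame check
is_df <- is.data.frame(df_tidy)
print(is_df)

# Its class vector lists the tibble classes alongside "data.frame"
print(class(df_tidy))

# Both columns are stored as characters, not factors
tidy_types <- sapply(df_tidy, class)
print(tidy_types)
```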
Let's now walk through how to actually access parts of the data frame. This constantly comes in handy, as for any particular data task you'll likely be working with just a subset of your rows or columns.
Let's say you want to access a particular column by using the column number. Perhaps you want to understand the occupations in our dataset. Here's how that access works:
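A minimal sketch of column access by number on the base R data frame, assuming the same example data. The stringsAsFactors = TRUE argument is added here as an assumption so the column is a factor on any R version.

```r
# Same example data; stringsAsFactors = TRUE (an added assumption)
# makes the occupation column a factor on any R version.
df_base <- data.frame(name = c("Jane", "Sri", "Eliza", "Joe"),
                      occupation = c("engineer", "designer", "architect", "engineer"),
                      stringsAsFactors = TRUE)

# Leave the row position empty and give the column number
occupations <- df_base[, 2]
print(occupations)
```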
We not only see the values of each row in the second column printed but also the corresponding levels. Levels are the set of distinct values a factor can take. The syntax is the same when selecting a column from a tibble, except the levels aren't included because columns with characters aren't automatically coded as factors and only factors have levels (don't get hung up if you don't understand levels for now). Note that the tibble column prints a little nicer (and is a character, but not in a jokey way).
Note that in R, when locating a cell, [1,2] refers to the first row and second column, so that [,2] grabs the entire second column.
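The [row, column] convention can be sketched as follows on the example data frame. The variable names cell, column and row are just illustrative.

```r
df_base <- data.frame(name = c("Jane", "Sri", "Eliza", "Joe"),
                      occupation = c("engineer", "designer", "architect", "engineer"))

# Row 1, column 2: a single cell
cell <- df_base[1, 2]

# Empty row position: every row of column 2
column <- df_base[, 2]

# Empty column position: every column of row 1
row <- df_base[1, ]

print(cell)
print(column)
print(row)
```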
To actually do something more interesting with this, and count the number of unique jobs, you use the same syntax inside a function:
Note that we keep the , syntax since we want the entire column, and select the column by name in quotes.
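One way this could look is sketched below. The choice of unique() and length() is an assumption made for this example; other functions (such as table()) would also work.

```r
# stringsAsFactors = TRUE is an added assumption so levels are shown
df_base <- data.frame(name = c("Jane", "Sri", "Eliza", "Joe"),
                      occupation = c("engineer", "designer", "architect", "engineer"),
                      stringsAsFactors = TRUE)

# Select the whole column by name, then keep only the distinct values
unique_jobs <- unique(df_base[, "occupation"])
print(unique_jobs)

# Count how many distinct jobs there are
n_jobs <- length(unique_jobs)
print(n_jobs)
```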
This is the same for a tibble (this is the last time we'll make the comparison). Again, notice the levels are gone because the tidyverse defaults to characters instead of factors.
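A sketch of the tibble version, again using unique() as an assumed choice of function. One difference worth knowing: single-bracket selection on a tibble returns a (smaller) tibble rather than a bare vector, so the count comes from nrow() here, and [[ ]] is shown as one way to get the plain vector.

```r
library(tibble)

df_tidy <- tibble(name = c("Jane", "Sri", "Eliza", "Joe"),
                  occupation = c("engineer", "designer", "architect", "engineer"))

# Same call as in base R; the result is a one-column tibble of distinct jobs
unique_jobs_tbl <- unique(df_tidy[, "occupation"])
print(unique_jobs_tbl)

# Count the distinct jobs by counting rows
n_jobs_tbl <- nrow(unique_jobs_tbl)

# Double brackets pull the column out as a plain character vector
jobs_vec <- df_tidy[["occupation"]]
print(unique(jobs_vec))
```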
Since you've probably picked up on the pattern, I'll now focus solely on the tidyverse.
Computations usually happen on columns or parts of columns; when you're looking at an entire row or a few rows it's usually more for inspection and sanity-checking. Now let's say you ran into an unexpected result of a computation and wanted to examine an entire row in a dataset.
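Grabbing a single row might look like the sketch below. Picking row 3 is arbitrary, just for illustration.

```r
library(tibble)

df_tidy <- tibble(name = c("Jane", "Sri", "Eliza", "Joe"),
                  occupation = c("engineer", "designer", "architect", "engineer"))

# Give the row number and leave the column position empty
third_row <- df_tidy[3, ]
print(third_row)
```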
And here's the syntax to grab multiple rows. Note that you span inclusively from the first to last row of interest.
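A sketch of the multi-row case, using the colon operator to build the inclusive range. Rows 2 through 4 are an arbitrary choice for this example.

```r
library(tibble)

df_tidy <- tibble(name = c("Jane", "Sri", "Eliza", "Joe"),
                  occupation = c("engineer", "designer", "architect", "engineer"))

# 2:4 expands to c(2, 3, 4), so both end points are included
some_rows <- df_tidy[2:4, ]
print(some_rows)
```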
While we typically access columns by index (i.e., column number) or column name, when it comes to rows we typically access by index or by row content. While you can name rows in a data frame, this isn't the typical workflow. More often, you'll identify rows by what their cells contain.
Let's say we want to grab all rows for engineers. Here we'll actually bring in other parts of the tidyverse which will help you on your data journey. Note we'll call the tidyverse package (which is the bee’s knees) and we'll "pipe" data from our tibble to the filter function using %>%. See much more on pipes in Hadley Wickham's (free) R for Data Science.
    library('tidyverse')
    df_tidy %>%
      filter(occupation == 'engineer')
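For comparison, here is a rough base R sketch of the same row selection, using only the bracket syntax covered earlier. It is meant as an illustration of what filter is doing conceptually, not as a description of how filter is implemented.

```r
df_base <- data.frame(name = c("Jane", "Sri", "Eliza", "Joe"),
                      occupation = c("engineer", "designer", "architect", "engineer"))

# Build a TRUE/FALSE vector marking the rows whose occupation is "engineer"
is_engineer <- df_base[, "occupation"] == "engineer"

# Use that logical vector in the row position to keep only matching rows
engineers <- df_base[is_engineer, ]
print(engineers)
```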
These types of operations are at the foundation of data work in R. To make progress, grab a dataset of great interest and practice manipulating data frames until the commands above are automatic. Building correct habits here will make everything slightly easier going forward. Happy computing!