When working with data, it's important to know what you want to do with the substantial amount of information you have. To figure out the facts from the data, some level of manipulation is necessary, as it is rare to get the data in exactly the right form.
For performing manipulations in R, the dplyr package comes to the rescue. It provides a set of functions for data manipulation activities.
1 2 3 4 5
# The easiest way to install dplyr is to install the whole tidyverse install.packages("tidyverse") # Alternate way to install only dplyr install.packages("dplyr")
dplyr provides a consistent set of functions to solve data manipulation problems. Some of these include:
filter(): to select records based on their values
arrange(): to reorder
select(): to select variables from the dataset
mutate(): to add new variables
summarize(): to condense multiple values into one
group_by(): to break down the dataset into specified groups of rows
Every data manipulation activity would not simple that will use one or two functions. To make the code easy to read, dplyr uses
%>% (the piping operator) from magrittr, which turns
We'll be using this operator to help explain dplyr.
To explore the functions of dplyr, we need a dataset. We will use the flights dataset from the nycflights13 package, which contains several useful datasets.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# Installing nycflights13 package install.packages("nycflights13") library(nycflights13) data(flights) # looking into sample data head(flights) # A tibble: 6 x 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD # ... with 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
To get a subset of rows from the main dataset, use the
filter() function. This function takes
dataframe as the first argument, and subsequent arguments are the expression that filters the dataframe.
1 2 3 4 5 6 7
# loading dplyr library library(dplyr) # filtering flights dataframe on year filter(flights, year ==2013) # adding more expression for year and month filter(flights, year = 2013, month = 1)
The filter expressions are applied as an
arrange() function works similarly to
filter() except it arranges the rows of the dataframe. The first argument of the function is the dataframe name and the subsequent arguments are the column names.
1 2 3 4 5 6 7 8
# arranging the flights dataframe on year arrange(flights, year) # arranging dataframe on multiple columns arrange(flights, year, month, day) # The default order of arrange() function is ascending if we want, we can arrange in descending order arrange(flights, desc(year))
When providing multiple columns, each column breaks the arrangement of the rows of the preceding one.
If you are interested only in a few columns of the dataset, pull them using the
select() function. The first argument ise the name of the dataframe and subsequent arguments are the names of the columns or expressions.
1 2 3 4 5
# selecting columns through their names select(flights, year, month, day) # selecting columns with expressions select(flights, starts_with("arr"))
To add a new column that is a function of the existing columns in the dataframe, use the
mutate() function. The first argument of the mutate function is the name of the dataframe and the subsequent arguments are the formula for the new columns.
# The flights dataset has distance and air_time, so we will add speed in the dataframe mutate(flights, speed = distance/air_time)
This will create a new column,
speed, in the flights dataframe.
Summarized data helps make decisions that are difficult to decide properly going through a huge amount of information. The summarize function helps in this scenario.
# lets get mean of the delay time in arrival from flights dataframe. summarize(flights, delay = mean(arr_delay, na.rm = TRUE)
summarize() function is used a lot with the
group_by() function as it gives more detailed information when used along with
In grouped operations, the dataset breaks down into specified groups of rows. In dplyr, this is done with the
group_by() function. The first argument of the function is the data frame name and subsequent arguments are those columns that take part in grouping the rows.
We generally use the
group_by() function along with some aggregate functions.
1 2 3 4
# Grouping dataset in years group_by(flights, year) # Using along with summarize() function summarize(group_by(flight, year), delay = mean(arr_delay, na.rm = TRUE)
When used in the
summarize() function, the result is the mean of
arr_delay for each year.
So far, we have applyied the data manipulation function alone, but in this section, we will take on a task in which we have to use multiple functions and club them with the help of the
Say you need to calculate the mean delay in arrival and departure for every month of the year 2013 from the flights dataset. If you read the statement, it looks complicated. So you will break it in the code section.
1 2 3 4 5 6 7 8
# Using functions along with %>% operator flight%>% filter(year = 2013)%>% # First filter the required rows select(year, month, arr_delay, dep_delay)%>% # Second selecting the necessary columns group_by(year, month)%>% # Third grouping the rows summarise( arrival = mean(arr_delay, na.rm = TRUE), departure = mean(dep_delay, na.rm = TRUE)) # Fourth now calculating the columns
The code above uses the
%>% operator to give you a better understanding of the code.
Whenever we are working with data, we need to do some manipulations to get the most valuable information. In real-life scenarios, datasets are more complicated and contain a lot of errors, so we have to write code that can manipulates data efficiently and tackles the errors.
Also, the volume of data in real-life cases is much higher. We can handle this if we understand the task and break it into small functions.
You can get more information about dplyr functions here.