Author avatar

Nishant Kumar Singh

Data Manipulation with Dplyr

Nishant Kumar Singh

  • Oct 23, 2020
  • 8 Min read
  • 51 Views
  • Oct 23, 2020
  • 8 Min read
  • 51 Views
Data
Dplyr
Data Query Languages
Data Development

Introduction

When working with data, it's important to know what you want to do with the substantial amount of information you have. To figure out the facts from the data, some level of manipulation is necessary, as it is rare to get the data in exactly the right form.

For performing manipulations in R, the dplyr package comes to the rescue. It provides a set of functions for data manipulation activities.

1
2
3
4
5
# The easiest way to install dplyr is to install the whole tidyverse
install.packages("tidyverse")

# Alternate way to install only dplyr
install.packages("dplyr")
html

Overview of dplyr

dplyr provides a consistent set of functions to solve data manipulation problems. Some of these include:

  1. filter(): to select records based on their values
  2. arrange(): to reorder
  3. select(): to select variables from the dataset
  4. mutate(): to add new variables
  5. summarize(): to condense multiple values into one
  6. group_by(): to break down the dataset into specified groups of rows

Piping Operator

Every data manipulation activity would not simple that will use one or two functions. To make the code easy to read, dplyr uses %>% (the piping operator) from magrittr, which turns x%>%f(x) into f(x, y). We'll be using this operator to help explain dplyr.

Manipulating Data

To explore the functions of dplyr, we need a dataset. We will use the flights dataset from the nycflights13 package, which contains several useful datasets.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Installing nycflights13 package
install.packages("nycflights13")
library(nycflights13)
data(flights)
# looking into sample data
head(flights)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest 
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH  
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH  
3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA  
4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN  
5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL  
6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD  
# ... with 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
html

Filtering rows

To get a subset of rows from the main dataset, use the filter() function. This function takes dataframe as the first argument, and subsequent arguments are the expression that filters the dataframe.

1
2
3
4
5
6
7
# loading dplyr library
library(dplyr)

# filtering flights dataframe on year 
filter(flights, year ==2013)
# adding more expression for year and month
filter(flights, year = 2013, month = 1)
html

The filter expressions are applied as an AND operation.

Arranging Rows

The arrange() function works similarly to filter() except it arranges the rows of the dataframe. The first argument of the function is the dataframe name and the subsequent arguments are the column names.

1
2
3
4
5
6
7
8
# arranging the flights dataframe on year
arrange(flights, year)

# arranging dataframe on multiple columns 
arrange(flights, year, month, day)

# The default order of arrange() function is ascending if we want, we can arrange in descending order
arrange(flights, desc(year))
html

When providing multiple columns, each column breaks the arrangement of the rows of the preceding one.

Selecting Columns

If you are interested only in a few columns of the dataset, pull them using the select() function. The first argument ise the name of the dataframe and subsequent arguments are the names of the columns or expressions.

1
2
3
4
5
# selecting columns through their names
select(flights, year, month, day)

# selecting columns with expressions
select(flights, starts_with("arr"))
html

Adding New Columns

To add a new column that is a function of the existing columns in the dataframe, use the mutate() function. The first argument of the mutate function is the name of the dataframe and the subsequent arguments are the formula for the new columns.

1
2
# The flights dataset has distance and air_time, so we will add speed in the dataframe
mutate(flights, speed = distance/air_time)
html

This will create a new column, speed, in the flights dataframe.

Summarizing Values

Summarized data helps make decisions that are difficult to decide properly going through a huge amount of information. The summarize function helps in this scenario.

1
2
# lets get mean of the delay time in arrival from flights dataframe.
summarize(flights, delay = mean(arr_delay, na.rm = TRUE)
html

The summarize() function is used a lot with the group_by() function as it gives more detailed information when used along with group_by().

Grouped Operations

In grouped operations, the dataset breaks down into specified groups of rows. In dplyr, this is done with the group_by() function. The first argument of the function is the data frame name and subsequent arguments are those columns that take part in grouping the rows. We generally use the group_by() function along with some aggregate functions.

1
2
3
4
# Grouping dataset in years
group_by(flights, year)
# Using along with summarize() function
summarize(group_by(flight, year), delay = mean(arr_delay, na.rm = TRUE)
html

When used in the summarize() function, the result is the mean of arr_delay for each year.

Piping

So far, we have applyied the data manipulation function alone, but in this section, we will take on a task in which we have to use multiple functions and club them with the help of the piping(%>%) operator.

Say you need to calculate the mean delay in arrival and departure for every month of the year 2013 from the flights dataset. If you read the statement, it looks complicated. So you will break it in the code section.

1
2
3
4
5
6
7
8
# Using functions along with %>% operator

flight%>%
 filter(year = 2013)%>%                                        # First filter the required rows
 select(year, month, arr_delay, dep_delay)%>%                  # Second selecting the necessary columns
 group_by(year, month)%>%                                      # Third grouping the rows
 summarise( arrival = mean(arr_delay, na.rm = TRUE),
            departure = mean(dep_delay, na.rm = TRUE))         # Fourth now calculating the columns
html

The code above uses the %>% operator to give you a better understanding of the code.

Conclusion

Whenever we are working with data, we need to do some manipulations to get the most valuable information. In real-life scenarios, datasets are more complicated and contain a lot of errors, so we have to write code that can manipulates data efficiently and tackles the errors.

Also, the volume of data in real-life cases is much higher. We can handle this if we understand the task and break it into small functions.

You can get more information about dplyr functions here.

0