Author avatar

Nishant Kumar Singh

Data Manipulation with Dplyr

Nishant Kumar Singh

  • Oct 23, 2020
  • 8 Min read
  • Oct 23, 2020
  • 8 Min read
Data Query Languages
Data Development


When working with data, it's important to know what you want to do with the substantial amount of information you have. To figure out the facts from the data, some level of manipulation is necessary, as it is rare to get the data in exactly the right form.

For performing manipulations in R, the dplyr package comes to the rescue. It provides a set of functions for data manipulation activities.

1# The easiest way to install dplyr is to install the whole tidyverse
4# Alternate way to install only dplyr

Overview of dplyr

dplyr provides a consistent set of functions to solve data manipulation problems. Some of these include:

  1. filter(): to select records based on their values
  2. arrange(): to reorder
  3. select(): to select variables from the dataset
  4. mutate(): to add new variables
  5. summarize(): to condense multiple values into one
  6. group_by(): to break down the dataset into specified groups of rows

Piping Operator

Every data manipulation activity would not simple that will use one or two functions. To make the code easy to read, dplyr uses %>% (the piping operator) from magrittr, which turns x%>%f(x) into f(x, y). We'll be using this operator to help explain dplyr.

Manipulating Data

To explore the functions of dplyr, we need a dataset. We will use the flights dataset from the nycflights13 package, which contains several useful datasets.

1# Installing nycflights13 package
5# looking into sample data
7# A tibble: 6 x 19
8   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest 
9  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>
101  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH  
112  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH  
123  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA  
134  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN  
145  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL  
156  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD  
16# ... with 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Filtering rows

To get a subset of rows from the main dataset, use the filter() function. This function takes dataframe as the first argument, and subsequent arguments are the expression that filters the dataframe.

1# loading dplyr library
4# filtering flights dataframe on year 
5filter(flights, year ==2013)
6# adding more expression for year and month
7filter(flights, year = 2013, month = 1)

The filter expressions are applied as an AND operation.

Arranging Rows

The arrange() function works similarly to filter() except it arranges the rows of the dataframe. The first argument of the function is the dataframe name and the subsequent arguments are the column names.

1# arranging the flights dataframe on year
2arrange(flights, year)
4# arranging dataframe on multiple columns 
5arrange(flights, year, month, day)
7# The default order of arrange() function is ascending if we want, we can arrange in descending order
8arrange(flights, desc(year))

When providing multiple columns, each column breaks the arrangement of the rows of the preceding one.

Selecting Columns

If you are interested only in a few columns of the dataset, pull them using the select() function. The first argument ise the name of the dataframe and subsequent arguments are the names of the columns or expressions.

1# selecting columns through their names
2select(flights, year, month, day)
4# selecting columns with expressions
5select(flights, starts_with("arr"))

Adding New Columns

To add a new column that is a function of the existing columns in the dataframe, use the mutate() function. The first argument of the mutate function is the name of the dataframe and the subsequent arguments are the formula for the new columns.

1# The flights dataset has distance and air_time, so we will add speed in the dataframe
2mutate(flights, speed = distance/air_time)

This will create a new column, speed, in the flights dataframe.

Summarizing Values

Summarized data helps make decisions that are difficult to decide properly going through a huge amount of information. The summarize function helps in this scenario.

1# lets get mean of the delay time in arrival from flights dataframe.
2summarize(flights, delay = mean(arr_delay, na.rm = TRUE)

The summarize() function is used a lot with the group_by() function as it gives more detailed information when used along with group_by().

Grouped Operations

In grouped operations, the dataset breaks down into specified groups of rows. In dplyr, this is done with the group_by() function. The first argument of the function is the data frame name and subsequent arguments are those columns that take part in grouping the rows. We generally use the group_by() function along with some aggregate functions.

1# Grouping dataset in years
2group_by(flights, year)
3# Using along with summarize() function
4summarize(group_by(flight, year), delay = mean(arr_delay, na.rm = TRUE)

When used in the summarize() function, the result is the mean of arr_delay for each year.


So far, we have applyied the data manipulation function alone, but in this section, we will take on a task in which we have to use multiple functions and club them with the help of the piping(%>%) operator.

Say you need to calculate the mean delay in arrival and departure for every month of the year 2013 from the flights dataset. If you read the statement, it looks complicated. So you will break it in the code section.

1# Using functions along with %>% operator
4 filter(year = 2013)%>%                                        # First filter the required rows
5 select(year, month, arr_delay, dep_delay)%>%                  # Second selecting the necessary columns
6 group_by(year, month)%>%                                      # Third grouping the rows
7 summarise( arrival = mean(arr_delay, na.rm = TRUE),
8            departure = mean(dep_delay, na.rm = TRUE))         # Fourth now calculating the columns

The code above uses the %>% operator to give you a better understanding of the code.


Whenever we are working with data, we need to do some manipulations to get the most valuable information. In real-life scenarios, datasets are more complicated and contain a lot of errors, so we have to write code that can manipulates data efficiently and tackles the errors.

Also, the volume of data in real-life cases is much higher. We can handle this if we understand the task and break it into small functions.

You can get more information about dplyr functions here.