Author avatar

Nishant Kumar Singh

Merging DataFrames in R

Nishant Kumar Singh

  • Oct 27, 2020
  • 6 Min read
  • 380 Views
  • Oct 27, 2020
  • 6 Min read
  • 380 Views
Data
Data Analytics
Languages and Libraries
R

Introduction

Merging or joining data frames is the process of combining columns from two or more dataframes. It is a well-known operation in programming. In R we can perform join with two functions: merge() of the base package and join() of a dplyr package. Before getting into that, this guide will go through the types of joins.

Types of Joins

There are four primary types of joins:

Left Outer Join

Suppose you are joining two tables, A and B, where A is the left table and B is the right table. When you perform a left outer join on A and B, it will return all rows from A and rows that are matching in B. All columns from A and B are returned, but the rows that do not match in B will have NA values for B columns.

Right Outer Join

A right outer join works similarly to the left outer join. It will return all matching rows from the right table in the left table. All columns from both tables are returned, and the rows that do not match in the left table will have NA values.

Inner Join

An inner join will return all the matching rows from both tables. If there are multiple matches between both tables, all combinations will be returned.

Full Join

A full join will return all values of rows and columns from both tables whether they are matching or not.

Joining Dataframes with merge()

The merge() function belongs to the base package of R. You don't need to install any additional packages to use the merge() function. The arguments of the merge() function, along with the default values that are passed in those arguments, are given below.

1# Syntax for merge function
2merge(x, y, by = intersect(names(x), names(y)),
3      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all)
4	  
html
  • The first two arguments, x and y, are the name of the dataframes that need to be joined.
  • The next three arguments, by, by.x, and by.y, decide the column used for joining the dataframes. If the name of the column that is needed for joining is the same then you don't need to pass any names. If they are different then you have to pass the names in by.x and by.y.
  • The next three arguments decide the type of join performed by the merge(). The default values will perform an inner join. If all.x is set to TRUE then it will perform a left outer join. If all.y is set to TRUE then it will perform a right outer join. If both are set to TRUE then it will perform a full outer join.

The dataframes used in this example are band_members and band_instruments. The column details are shared below.

1# Loading dplyr function to use the datasets present in the package
2library(dplyr)
3data(band_members)
4data(band_instruments)
5
6# Columns in band_instruments
7colnames(band_instruments)
8[1] "name"  "plays"
9
10#Columns in band_members
11colnames(band_members)
12[1] "name" "band"
13
14# Lets look at the data
15view(band_instruments)
16  name  plays 
17  <chr> <chr> 
181 John  guitar
192 Paul  bass  
203 Keith guitar
21
22view(band_members)
23  name  band   
24  <chr> <chr>  
251 Mick  Stones 
262 John  Beatles
273 Paul  Beatles
html

The code in the next example will perform all four types of joins using the dataframes above and the merge() function.

1# Performing Left outer join
2merge(band_members, band_instruments,  all.x = TRUE)
3  name    band  plays
41 John Beatles guitar
52 Mick  Stones   <NA>
63 Paul Beatles   bass
7
8# Performing Right outer join
9merge(band_members, band_instruments,  all.y = TRUE)
10   name    band  plays
111  John Beatles guitar
122 Keith    <NA> guitar
133  Paul Beatles   bass
14
15# Performing Inner join
16merge(band_members, band_instruments,  all.y = TRUE, all.x = TRUE)
17   name    band  plays
181  John Beatles guitar
192 Keith    <NA> guitar
203  Mick  Stones   <NA>
214  Paul Beatles   bass
22
23# Performing Full outer join
24merge(band_members, band_instruments)
25  name    band  plays
261 John Beatles guitar
272 Paul Beatles   bass
html

In the output of joins you can see that if the matching values are not there they are assigned as <NA>. In the case of an inner join, it is only showing the matching values from both dataframes.

Joining Dataframes with dplyr:: join Function

In comparison to the merge() function, dplyr has four different functions for different types of joins. It avoids confusion because you don't have to set values of the arguments. The join functions are given below:

  • inner_join()
  • left_join()
  • right_join()
  • full_join()

This example will perform all four types of joins using the above functions.

1# Performing Inner join
2inner_join(band_members, band_instruments, by = "name")
3
4  name  band    plays 
5  <chr> <chr>   <chr> 
61 John  Beatles guitar
72 Paul  Beatles bass
8
9# Performing Left outer join  
10left_join(band_members, band_instruments, by = "name")
11
12  name  band    plays 
13  <chr> <chr>   <chr> 
141 Mick  Stones  NA    
152 John  Beatles guitar
163 Paul  Beatles bass 
17
18# Performing Right outer join 
19 right_join(band_members, band_instruments, by = "name")
20
21  name  band    plays 
22  <chr> <chr>   <chr> 
231 John  Beatles guitar
242 Paul  Beatles bass  
253 Keith NA      guitar
26
27# Performing Full outer join
28full_join(band_members, band_instruments, by = "name")
29
30  name  band    plays 
31  <chr> <chr>   <chr> 
321 Mick  Stones  NA    
332 John  Beatles guitar
343 Paul  Beatles bass  
354 Keith NA      guitar
html

Conclusion

When we start working with data stored in different tables or sources then we will start exploring the relationship between them. In this process, we join datasets to get a clear view. This operation happens in every project that works around data.