Author avatar

Deepika Singh

Splitting and Combining Data with R

Deepika Singh

  • Nov 6, 2019
  • 8 Min read
  • 14 Views
  • Nov 6, 2019
  • 8 Min read
  • 14 Views
Data
R

Introduction

In real-world data science projects, it is often necessary to divide data into two or more subsets or to combine multiple sets into one. This is an integral part of the data wrangling process for predictive modeling.

In this guide, you will learn techniques for splitting and combining data using the statistical programming language R.

Combining Data: Columns

Combining, or joining, data is a common data preparation task. The merge function can be used to perform all four standard join functions:

  1. Inner join

  2. Full join

  3. Left join

  4. Right join

Let’s start by creating sample datasets for carrying out these operations.

1
2
3
4
5
6
7
8
9
 
per_data <- data.frame(cust_id   = c("ID1", "ID2", "ID3"), married = c("Yes", "No", "Yes"), Age = c(44, 23, 35))
 
inc_data <- data.frame(cust_id   = c("ID2", "ID1", "ID3"), income = c(60000, 80000, 45000))
 
head(per_data)
 
head(inc_data)
 
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
 
cust_id   married   Age
ID1 	     Yes        44
ID2  	    No          23
ID3 	     Yes        35 
 
 
cust_id income
ID2       60000
ID1  	 80000
ID3  	 45000
 

Inner Join

The two datasets can be combined horizontally using the merge function. In our case, we will inner join the two datasets using the common key variable 'UID'.

The first line of code below merges the two data frames, while the second line displays the resultant dataset, 'merge1'.

1
2
3
4
 
merge1 = merge(per_data,inc_data,by="cust_id")
head(merge1)
 
{r}

Output:

1
2
3
4
5
6
7
 
|           	| cust_id  | married  | Age   	| income  |
|---       	|---------   |---------   |-----    	|--------	|
| 1        	| ID1 	   | Yes 	   | 44      	| 80000    |
| 2        	| ID2 	   | No  	   | 23      	| 60000    |
| 3        	| ID3 	   | Yes 	   | 35      	| 45000    |
 

Full Join

The 'full join' results in keeping all the observations in either of the datasets. The first line of code below creates a data frame 'df3', while the second line performs the full join and saves the object as data frame, 'df4'.

The output will create a data frame where all the observations are maintained in the resultant data. The missing cells will be marked as 'NA'.

1
2
3
4
5
6
 
df3 =  data.frame(cust_id   = c("ID4", "ID1", "ID5"), approval = c("Yes", "No", "Yes"))
 
df4 = merge(x = merge1, y = df3,  by = "cust_id", all = TRUE)
head(df4)
 
{r}

Output:

1
2
3
4
5
6
7
8
|   	| cust_id   	| married   	| Age   | income    	| approval  	|
|---	|---------  	|---------  	|-----  |--------   	|---------- 	|
| 1 	| ID1 	  	| Yes 	  	| 44    | 80000     	| No   	 	|
| 2 	| ID2 	  	| No  	  	| 23    | 60000     	| NA   	 	|
| 3 	| ID3 	  	| Yes       	| 35    | 45000     	| NA   	 	|
| 4 	| ID4 	  	| NA  	  	| NA    | NA 	   	| Yes  	 	|
| 5 	| ID5 	  	| NA  	  	| NA    | NA 	   	| Yes  	 	|
 

Left Join

The 'left join' function adds information from the right data frame to the left data frame. Let’s perform the left join on the data frames 'merge1' and 'df3', as shown below.

1
2
3
4
5
 
ljoin = merge(x = merge1, y = df3, by = "cust_id", all.x = TRUE)
 
head(ljoin)
 
{r}

Output:

1
2
3
4
5
6
7
|   	| cust_id   	| married   	| Age   | income    	| approval  	|
|---	|---------  	|---------  	|-----  |--------   	|---------- 	|
| 1 	| ID1 	  	| Yes 	  	| 44    | 80000     	| No   	 	|
| 2 	| ID2 	  	| No  	  	| 23    | 60000     	| NA   	 	|
| 3 	| ID3 	  	| Yes 	  	| 35    | 45000     	| NA   	 	|
 
 

The output shows that all the records in the left dataset, 'merge1', are maintained while the right dataset, 'df3', is mapped to it.

Right Join

The 'right join' function adds information from the left data frame to the right data frame. Let’s perform the right join on the data frames 'merge1' and 'df3', as shown below.

1
2
3
4
 
rjoin = merge(x = merge1, y = df3, by = "cust_id", all.y = TRUE)
head(rjoin)
 
{r}

Output:

1
2
3
4
5
6
7
|   	| cust_id   	| married   	| Age   | income    	| approval  	|
|---	|---------  	|---------  	|-----  |--------   	|---------- 	|
| 1 	| ID1 	  	| Yes 	  	| 44    | 80000     	| No   	 	|
| 2 	| ID4 	  	| NA  	  	| NA    | NA 	   	| Yes  	 	|
| 3 	| ID5 	  	| NA  	  	| NA    | NA 	   	| Yes  	 	|
 
 

Combining Data: Rows

It’s possible to combine the two data frames based on their rows using the rbind function. Let’s create two data frames and join them vertically using the lines of code below.

1
2
3
4
5
6
7
8
 
r1 <- data.frame(cust_id   = c("ID11", "ID12", "ID13"), married = c("Yes", "No", "Yes"), Age = c(44, 23, 35))
 
r2 <- data.frame(cust_id   = c("ID14", "ID15"), married = c("Yes", "Yes"), Age = c(40, 35))
 
r_tot <- rbind(r1, r2)
glimpse(r_tot)
 
{r}

Output:

1
2
3
4
5
6
7
 
Observations: 5
Variables: 3
$ cust_id <fct> ID11, ID12, ID13, ID14, ID15
$ married <fct> Yes, No, Yes, Yes, Yes
$ Age 	<dbl> 44, 23, 35, 40, 35
 

The resultant data now has five observations and three variables. To carry out this operation, the two data frames must have the same variables, but they do not need to be in the same order.

Splitting Data: Rows

One of the most common data partitioning steps in machine learning is dividing the data into training and test sets for evaluating model performance. This is called the holdout approach to model validation. There are many techniques to perform this task depending upon the target variable.

We will use fictitious loan application data containing 600 observations and 4 variables—'UID', 'Income', 'Credit_score', and 'approval_status'. The lines of code below load the libraries and the data.

1
2
3
4
5
6
7
8
9
 
library(readr)
library(dplyr)
library(caret)
library(caTools)
 
d1 <- read_csv("data_1.csv")
glimpse(d1)
 
{r}

Output:

1
2
3
4
5
6
7
Observations: 600
Variables: 4
$ UID         	<chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U...
$ Income 	     <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700...
$ Credit_score	<chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ...
 

Since the target variable, 'approval_status', is categorical, we will convert it into a factor variable.

1
2
3
4
 
d1$approval_status = as.factor(d1$approval_status)
str(d1$approval_status)
 
{r}

Output:

1
2
Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 

We can now divide the dataset into training and test datasets using the 'caTools' package.

The first line of code below loads the 'caTools' library, while the second line sets the random seed for reproducibility of the results. The third line uses the sample.split function to divide the data in the ratio of 70 to 30. This ensures that 70 percent of the data is allocated to the training set, while the remaining 30 percent gets allocated to the test set. This is done in the fourth and fifth lines of code.

1
2
3
4
5
6
7
8
9
10
11
 
library(caTools)
set.seed(150)
split = sample.split(d1$approval_status, SplitRatio = 0.70)
 
# Create training and testing sets
train = subset(d1, split == TRUE)
test = subset(d1, split == FALSE)
 
dim(train); dim(test)
 
{r}

Output:

1
2
3
[1] 420   4
[1] 180   4
 

The output confirms that the data has been divided into training data and test data, containing 420 and 180 observations, respectively.

Conclusion

In this guide, you have learned techniques for combining and splitting data frames using the base R packages. These techniques allow you to join and divide data in real-world datasets.

To learn more about data science using R, please refer to the following guides:

0