Author avatar

Deepika Singh

Manipulating Dataframes in R

Deepika Singh

  • Nov 7, 2019
  • 12 Min read
  • 13,509 Views
  • Nov 7, 2019
  • 12 Min read
  • 13,509 Views
Data
R

Introduction

It is said that eighty percent of the time in a data science project is spent on data preparation and data cleaning. In this guide, you will learn about the tricks and techniques of manipulating dataframes in R using the popular package dplyr.

The 'dplyr' library offers several powerful functions to manipulate the dataframe, which is a two-dimensional data structure containing rows and columns.

In particular, you will learn the following data manipulation techniques:

  1. Filter
  2. Select
  3. Mutate
  4. Arrange
  5. Summarize
  6. Group_by
  7. Count 8: Rename

Let’s begin by loading the data.

Data

In this guide, we will be using fictitious data of loan applicants containing 600 observations and 10 variables, as described below:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No")

  2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

  3. Income: Annual Income of the applicant (in USD)

  4. Loan_amount: Loan amount (in USD) for which the application was submitted

  5. Credit_score: Whether the applicants credit score is good ("Good") or not ("Bad")

  6. Approval_status: Whether the loan application was approved ("Yes") or not ("No")

  7. Age: The applicant's age in years

  8. Sex: Whether the applicant is a male ("M") or a female ("F")

  9. Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant

  10. Purpose: Purpose of applying for the loan

Let's load the required libraries and the data.

1 
2library(plyr)
3library(readr)
4library(ggplot2)
5library(GGally)
6library(dplyr)
7library(mlbench)
8 
9dat <- read_csv("data_r2.csv")
10glimpse(dat)
11 
{r}

Output:

1Observations: 600
2Variables: 10
3$ Marital_status   <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
4$ Is_graduate 	 <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
5$ Income      	    <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
6$ Loan_amount 	<int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
7$ Credit_score	  <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
8$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
9$ Age         	       <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
10$ Sex         	       <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
11$ Investment  	   <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
12$ Purpose     	    <chr> "Education", "Travel", "Others", "Others", "Travel", "...
13 

The output shows that the data has six categorical variables (labeled as 'chr'), and four numerical variables (labeled as 'int'). We will convert the categorical variables into the 'factor' type, using the lines of code below.

1 
2names <- c(1,2,5,6,8,10)
3dat[,names] <- lapply(dat[,names] , factor)
4glimpse(dat)
5 
{r}

Output:

1 
2Observations: 600
3Variables: 10
4$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
5$ Is_graduate 	<fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
6$ Income    	  <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
7$ Loan_amount 	<int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
8$ Credit_score	<fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
9$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
10$ Age         	<int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
11$ Sex         	<fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
12$ Investment  	<int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
13$ Purpose     	<fct> Education, Travel, Others, Others, Travel, Travel, Tra...
14 

The data is now ready for carrying out the various data manipulation steps.

Filter

The filter command selects rows based on the specified condition. We will start by filtering the data for applicants whose loan is approved. The first line of code below prints the table of the variable 'approval_status'. The output shows that there are 410 applicants whose loan was approved.

The second line creates a new dataframe, 'approved_loan', while the third line prints the dimension, which is 410 rows and 10 variables.

1 
2table(dat$approval_status)
3approved_loan = dat %>% filter(approval_status == "Yes")
4print(dim(approved_loan))
5 
{r}

Output:

1[1] 410  10
2 

One of the things to notice in the code above is the use of the pipe operator, written as %>%. This pipe operator enables us to chain multiple 'dplyr' commands together, such that it takes output from one command and feeds it as input to the next command.

The 'filter' command can also be used to include multiple conditions. The code below filters the data using different conditions on two variables, ‘approval_status’ and ‘credit_score’.

1 
2aproved_satis <- dat %>% filter(approval_status == "Yes", Credit_score == "Satisfactory")
3 
4dim(aproved_satis)
5 
{r}

Output:

1[1] 372  10
2 

The above output shows that the resultant data now has 372 records instead of the original 600 because of the conditions we used.

The filter command can also be used with numerical variables, as shown in the lines of code below. The output confirms that the operation is completed.

1 
2income_age <- dat %>% filter(Income > 600000, Age >= 30)
3 
4dim(income_age)
5 
6summary(income_age$Income)
7 
8summary(income_age$Age)
9 
{r}

Output:

1[1] 205  10
2 
3  Min. 1st Qu.  Median	Mean 3rd Qu.	Max.
4 606300  711100  843300 1100888 1274700 3173700
5  
6Min. 1st Qu.  Median    Mean 3rd Qu.	Max.
7 30.00   43.00   55.00   52.81   62.00   75.00
8 

Select

The select() command selects columns in a data. The lines of code below provides the entire data as input to the ‘select’ function using the '%>%' operator. Inside the select function, the columns to be selected are specified.

The output shows that the resultant data has 600 observations and 3 variables.

1 
2dat_3 = dat %>% select(Marital_status, Age, Sex)
3glimpse(dat_3)
4 
{r}

Output:

1Observations: 600
2Variables: 3
3$ Marital_status <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No,...
4$ Age        	<int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33,...
5$ Sex        	<fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, M...
6 

Mutate

The mutate() function helps in feature engineering by allowing us to create new variables. For example, we can use two variables, 'Investment' and 'Income', to create a new variable, 'Inv_inc_ratio'. The mutate verb adds the new column to the dataframe, as shown in the syntax below.

1 
2dat_ratio = dat %>% mutate(Inv_inc_ratio = Investment / Income * 100)
3 
4dim(dat_ratio)
5 
6summary(dat_ratio$Inv_inc_ratio)
7 
{r}

Output:

1Min.      1st Qu.  Median	Mean     3rd Qu.    Max.
2 20           20           70       55.69        70          70
3 

Arrange

The arrange() function arranges the rows based upon the specified condition. For example, if we want to arrange the 'Inv_inc_ratio' variable in an order, we can do that using the lines of code below.

1 
2dat_ratio_2 = dat_ratio %>% arrange(Inv_inc_ratio)
3head(dat_ratio_2$Inv_inc_ratio)
4 
{r}

Output:

1[1] 20 20 20 20 20 20
2 

By default, the 'arrange()' function orders the variable in ascending order. In order to sort and arrange in descending order, we add the function desc() to the code.

1 
2dat_ratio_2 = dat_ratio %>% arrange(desc(Inv_inc_ratio))
3head(dat_ratio_2$Inv_inc_ratio)
4 
{r}

Output:

1[1] 70 70 70 70 70 70
2 

Multiple columns can also be included in the 'arrange()' function. In such a case, each column is used to break ties in the values of the preceding columns.

1 
2dat_4 = dat_ratio %>% select(approval_status, Age, Inv_inc_ratio)
3dim(dat_4)
4 
5dat_ratio_4 = dat_4 %>% arrange(approval_status, Age, Inv_inc_ratio)
6head(dat_ratio_4)
7 
{r}

Output:

1[1] 600   3
2 
3A tibble: 6 x 3
4approval_status      	Age  	    Inv_inc_ratio
5   <fct>       	          	<int>     	<dbl>
6   No             	       	         22        	70
7   No             	       	         23        	50
8   No             	       	         23        	70
9   No             	       	         23        	70
10   No             	       	         24        	70
11   No            	        	 24            70
12 

Summarize

The summarize() function summarizes the variables in the dataframe. For example, if we want to get the mean value of variables like income, loan amount, or age, the line of code below will produce and display the desired output.

1 
2dat_ratio %>%
3  summarize(avg_income = mean(Income,na.rm=TRUE),
4        	avg_loan = mean(Loan_amount,na.rm=TRUE),
5        	avg_ratio = mean(Age,na.rm=TRUE))
{r}

Output:

1A tibble: 1 x 3
2  avg_income    avg_loan     avg_ratio
3   	<dbl>	     <dbl> 	        <dbl>
4	658615         1455120         48.7	
5 

Group_by

The group_by() function groups the data based on one or more columns and then manipulates the grouped dataframe. The 'group_by()' function is often used along with the other five 'dplyr' commands discussed in the previous sections.

We will take one case of using the 'group_by()' and 'summarize()' commands together. The line of code below groups the data by the variable 'Purpose', and then calculates the summary of average income for each 'Purpose' group.

1 
2data %>%
3        	group_by(Purpose) %>%
4        	summarise(mean_inc = mean(Income))
5 
{r}

Output:

1A tibble: 5 x 2
2 
3Purpose   mean_inc
4<chr>    	<dbl>
5Education  677996.
6Home   	 581834.
7Others 	  749483.
8Personal   682385.
9Travel 	  640854.
10 

Count

The count() function counts the number of observations in a variable or the dataset. The first line of code below prints the count of the entire dataset, 'dat'. The second line extends this functionality further and provides the count of respective categories within the 'Purpose' variable.

1 
2dat %>% count()
3 
4dat %>% count(Purpose)
5 
{r}

Output:

11   600
2 
3A tibble: 5 x 2
4Purpose   	n
5<fct> 	       <int>
6Education	99
7Home    	  134
8Others   	  69
9Personal	 178
10Travel  	   120
11 

We can also sort the above output in descending order, as shown in the line of code below.

1 
2dat %>% count(Purpose, sort = TRUE)
3 
{r}

Output:

1A tibble: 5 x 2
2Purpose   	n
3<fct> 	     <int>
4Personal	178
5Home    	134
6Travel  	  120
7Education   99
8Others   	 69
9 

Rename

The rename() function is used to rename one or more columns. If we want to change the name of the variable 'Inv_inc_ratio' and rename it 'investment_ratio', we can do that using the line of code below. The general syntax is rename(new name = old name).

1 
2newdat = dat_ratio %>%  rename(investment_ratio = Inv_inc_ratio)
3names(newdat)
4 
{r}

Output:

1[1] "Marital_status"   "Is_graduate"      "Income"           "Loan_amount" 	
2 [5] "Credit_score" 	"approval_status"  "Age"          	"Sex"     	    
3 [9] "Investment"   	"Purpose"      	"investment_ratio"
4 

Conclusion

In this guide, you have learned about data manipulation techniques using the popular 'dplyr' library. These techniques will help you handle complex, real-world data sets.

To learn more about data science using 'R', please refer to the following guides: