It is said that eighty percent of the time in a data science project is spent on data preparation and data cleaning. In this guide, you will learn about the tricks and techniques of manipulating dataframes in R using the popular package dplyr.
The 'dplyr' library offers several powerful functions to manipulate the dataframe, which is a two-dimensional data structure containing rows and columns.
In particular, you will learn the following data manipulation techniques:
Let’s begin by loading the data.
In this guide, we will be using fictitious data of loan applicants containing 600 observations and 10 variables, as described below:
Marital_status: Whether the applicant is married ("Yes") or not ("No")
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")
Income: Annual Income of the applicant (in USD)
Loan_amount: Loan amount (in USD) for which the application was submitted
Credit_score: Whether the applicants credit score is good ("Good") or not ("Bad")
Approval_status: Whether the loan application was approved ("Yes") or not ("No")
Age: The applicant's age in years
Sex: Whether the applicant is a male ("M") or a female ("F")
Investment: Total investment in stocks and mutual funds (in USD) as declared by the applicant
Purpose: Purpose of applying for the loan
Let's load the required libraries and the data.
1
2library(plyr)
3library(readr)
4library(ggplot2)
5library(GGally)
6library(dplyr)
7library(mlbench)
8
9dat <- read_csv("data_r2.csv")
10glimpse(dat)
11
Output:
1Observations: 600
2Variables: 10
3$ Marital_status <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
4$ Is_graduate <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
5$ Income <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
6$ Loan_amount <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
7$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
8$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
9$ Age <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
10$ Sex <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
11$ Investment <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
12$ Purpose <chr> "Education", "Travel", "Others", "Others", "Travel", "...
13
The output shows that the data has six categorical variables (labeled as 'chr'), and four numerical variables (labeled as 'int'). We will convert the categorical variables into the 'factor' type, using the lines of code below.
1
2names <- c(1,2,5,6,8,10)
3dat[,names] <- lapply(dat[,names] , factor)
4glimpse(dat)
5
Output:
1
2Observations: 600
3Variables: 10
4$ Marital_status <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
5$ Is_graduate <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
6$ Income <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
7$ Loan_amount <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
8$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
9$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
10$ Age <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
11$ Sex <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
12$ Investment <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
13$ Purpose <fct> Education, Travel, Others, Others, Travel, Travel, Tra...
14
The data is now ready for carrying out the various data manipulation steps.
The filter command selects rows based on the specified condition. We will start by filtering the data for applicants whose loan is approved. The first line of code below prints the table of the variable 'approval_status'. The output shows that there are 410 applicants whose loan was approved.
The second line creates a new dataframe, 'approved_loan', while the third line prints the dimension, which is 410 rows and 10 variables.
1
2table(dat$approval_status)
3approved_loan = dat %>% filter(approval_status == "Yes")
4print(dim(approved_loan))
5
Output:
1[1] 410 10
2
One of the things to notice in the code above is the use of the pipe operator, written as %>%. This pipe operator enables us to chain multiple 'dplyr' commands together, such that it takes output from one command and feeds it as input to the next command.
The 'filter' command can also be used to include multiple conditions. The code below filters the data using different conditions on two variables, ‘approval_status’ and ‘credit_score’.
1
2aproved_satis <- dat %>% filter(approval_status == "Yes", Credit_score == "Satisfactory")
3
4dim(aproved_satis)
5
Output:
1[1] 372 10
2
The above output shows that the resultant data now has 372 records instead of the original 600 because of the conditions we used.
The filter command can also be used with numerical variables, as shown in the lines of code below. The output confirms that the operation is completed.
1
2income_age <- dat %>% filter(Income > 600000, Age >= 30)
3
4dim(income_age)
5
6summary(income_age$Income)
7
8summary(income_age$Age)
9
Output:
1[1] 205 10
2
3 Min. 1st Qu. Median Mean 3rd Qu. Max.
4 606300 711100 843300 1100888 1274700 3173700
5
6Min. 1st Qu. Median Mean 3rd Qu. Max.
7 30.00 43.00 55.00 52.81 62.00 75.00
8
The select() command selects columns in a data. The lines of code below provides the entire data as input to the ‘select’ function using the '%>%' operator. Inside the select function, the columns to be selected are specified.
The output shows that the resultant data has 600 observations and 3 variables.
1
2dat_3 = dat %>% select(Marital_status, Age, Sex)
3glimpse(dat_3)
4
Output:
1Observations: 600
2Variables: 3
3$ Marital_status <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No,...
4$ Age <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33,...
5$ Sex <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, M...
6
The mutate() function helps in feature engineering by allowing us to create new variables. For example, we can use two variables, 'Investment' and 'Income', to create a new variable, 'Inv_inc_ratio'. The mutate verb adds the new column to the dataframe, as shown in the syntax below.
1
2dat_ratio = dat %>% mutate(Inv_inc_ratio = Investment / Income * 100)
3
4dim(dat_ratio)
5
6summary(dat_ratio$Inv_inc_ratio)
7
Output:
1Min. 1st Qu. Median Mean 3rd Qu. Max.
2 20 20 70 55.69 70 70
3
The arrange() function arranges the rows based upon the specified condition. For example, if we want to arrange the 'Inv_inc_ratio' variable in an order, we can do that using the lines of code below.
1
2dat_ratio_2 = dat_ratio %>% arrange(Inv_inc_ratio)
3head(dat_ratio_2$Inv_inc_ratio)
4
Output:
1[1] 20 20 20 20 20 20
2
By default, the 'arrange()' function orders the variable in ascending order. In order to sort and arrange in descending order, we add the function desc() to the code.
1
2dat_ratio_2 = dat_ratio %>% arrange(desc(Inv_inc_ratio))
3head(dat_ratio_2$Inv_inc_ratio)
4
Output:
1[1] 70 70 70 70 70 70
2
Multiple columns can also be included in the 'arrange()' function. In such a case, each column is used to break ties in the values of the preceding columns.
1
2dat_4 = dat_ratio %>% select(approval_status, Age, Inv_inc_ratio)
3dim(dat_4)
4
5dat_ratio_4 = dat_4 %>% arrange(approval_status, Age, Inv_inc_ratio)
6head(dat_ratio_4)
7
Output:
1[1] 600 3
2
3A tibble: 6 x 3
4approval_status Age Inv_inc_ratio
5 <fct> <int> <dbl>
6 No 22 70
7 No 23 50
8 No 23 70
9 No 23 70
10 No 24 70
11 No 24 70
12
The summarize() function summarizes the variables in the dataframe. For example, if we want to get the mean value of variables like income, loan amount, or age, the line of code below will produce and display the desired output.
1
2dat_ratio %>%
3 summarize(avg_income = mean(Income,na.rm=TRUE),
4 avg_loan = mean(Loan_amount,na.rm=TRUE),
5 avg_ratio = mean(Age,na.rm=TRUE))
Output:
1A tibble: 1 x 3
2 avg_income avg_loan avg_ratio
3 <dbl> <dbl> <dbl>
4 658615 1455120 48.7
5
The group_by() function groups the data based on one or more columns and then manipulates the grouped dataframe. The 'group_by()' function is often used along with the other five 'dplyr' commands discussed in the previous sections.
We will take one case of using the 'group_by()' and 'summarize()' commands together. The line of code below groups the data by the variable 'Purpose', and then calculates the summary of average income for each 'Purpose' group.
1
2data %>%
3 group_by(Purpose) %>%
4 summarise(mean_inc = mean(Income))
5
Output:
1A tibble: 5 x 2
2
3Purpose mean_inc
4<chr> <dbl>
5Education 677996.
6Home 581834.
7Others 749483.
8Personal 682385.
9Travel 640854.
10
The count() function counts the number of observations in a variable or the dataset. The first line of code below prints the count of the entire dataset, 'dat'. The second line extends this functionality further and provides the count of respective categories within the 'Purpose' variable.
1
2dat %>% count()
3
4dat %>% count(Purpose)
5
Output:
11 600
2
3A tibble: 5 x 2
4Purpose n
5<fct> <int>
6Education 99
7Home 134
8Others 69
9Personal 178
10Travel 120
11
We can also sort the above output in descending order, as shown in the line of code below.
1
2dat %>% count(Purpose, sort = TRUE)
3
Output:
1A tibble: 5 x 2
2Purpose n
3<fct> <int>
4Personal 178
5Home 134
6Travel 120
7Education 99
8Others 69
9
The rename() function is used to rename one or more columns. If we want to change the name of the variable 'Inv_inc_ratio' and rename it 'investment_ratio', we can do that using the line of code below. The general syntax is rename(new name = old name).
1
2newdat = dat_ratio %>% rename(investment_ratio = Inv_inc_ratio)
3names(newdat)
4
Output:
1[1] "Marital_status" "Is_graduate" "Income" "Loan_amount"
2 [5] "Credit_score" "approval_status" "Age" "Sex"
3 [9] "Investment" "Purpose" "investment_ratio"
4
In this guide, you have learned about data manipulation techniques using the popular 'dplyr' library. These techniques will help you handle complex, real-world data sets.
To learn more about data science using 'R', please refer to the following guides: