Author avatar

Deepika Singh

Creating a data.table with R

Deepika Singh

  • Nov 13, 2019
  • 17 Min read
  • 19 Views
  • Nov 13, 2019
  • 17 Min read
  • 19 Views
Data
R

Introduction

The data.table package is an enhanced version of the data.frame, which is the defacto structure for working with R. Dataframes are extremely useful, providing the user an intuitive way to organize, view, and access data. Their limitation is that it becomes trickier to perform fast data manipulation for large datasets. Also, codes can become complex and inconsistent with dataframes. This is where data.table is a better alternative because it has consistent syntax, efficient memory, and parallelization.

In this guide, you will learn about the basics of data.table and how to apply it for data manipulation and aggregation tasks. Let’s start by understanding the data.

Data

In this guide, we will use a fictitious dataset of loan applications containing 600 observations and 10 variables:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No")

  2. Dependents: Number of dependents of the applicant

  3. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No")

  4. Income: Annual Income of the applicant (in USD)

  5. Loan_amount: Loan amount (in USD) for which the application was submitted

  6. Credit_score: Whether the applicant’s credit score is good ("Satisfactory") or not ("Not Satisfactory")

  7. Approval_status: Whether the loan application was approved ("1") or not ("0")

  8. Age: The applicant's age in years

  9. Sex: Whether the applicant is male ("M") or female ("F")

  10. Purpose: Purpose of applying for the loan

Let's start by loading the required libraries and the data.

Fast Data Reading with fread

The common way of loading data in R is through the read_csv() function of the readr library. This is effective for smaller sized datasets, but not efficient when dealing with big data. The data.table package provides a faster alternative for reading data with the fread() function, which is a fast and parallel file reader that can read local files, files from the web, and even string files.

The lines of code below load the required libraries, read the data using the fread function, and print the view of the data.

1
2
3
4
5
library(data.table)
library(dplyr)

dat <- fread("data_eng.csv")
glimpse(dat)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
Observations: 600
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Dependents      <int> 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 298500, 315500, 295100, 319300, 333300, 277700, 332100...
$ Loan_amount     <int> 71000, 75500, 70000, 70000, 98000, 71000, 58000, 64000...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ Age             <int> 74, 71, 71, 68, 64, 64, 63, 61, 60, 59, 56, 55, 54, 54...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",...
$ Purpose         <chr> "Wedding", "Wedding", "Wedding", "Wedding", "Wedding",... 

The output shows that the dataset has five numerical (labeled as 'int) and five qualitative (labeled as chr) variables. To work with the data.table library, it's necessary to convert the data.frame into a data.table, using the line of code below. The structure of the resulting data shows that it is both a data.table and a data.frame.

1
2
3
dat = as.data.table(dat)
str(dat)
 
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
Classes ‘data.table’ and 'data.frame':	600 obs. of  10 variables:
 $ Marital_status : chr  "Yes" "No" "Yes" "Yes" ...
 $ Dependents     : int  1 0 0 1 2 0 1 0 0 0 ...
 $ Is_graduate    : chr  "No" "Yes" "Yes" "Yes" ...
 $ Income         : int  298500 315500 295100 319300 333300 277700 332100 320000 300400 311100 ...
 $ Loan_amount    : int  71000 75500 70000 70000 98000 71000 58000 64000 61000 75500 ...
 $ Credit_score   : chr  "Satisfactory" "Satisfactory" "Satisfactory" "Satisfactory" ...
 $ approval_status: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Age            : int  74 71 71 68 64 64 63 61 60 59 ...
 $ Sex            : chr  "M" "M" "M" "M" ...
 $ Purpose        : chr  "Wedding" "Wedding" "Wedding" "Wedding" ...
 - attr(*, ".internal.selfref")=<externalptr>

We are now ready to carry out the data processing and aggregation tasks common in data science.

Selecting Rows and Columns

A common data manipulation task is data slicing based on specific rows and columns.

Selecting Rows

To extract a single row of the data, we can use the syntax dataset[rownumber, ]. For example, the line of code below prints the third row.

1
2
d1 = dat[3,] 
str(d1)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
Classes ‘data.table’ and 'data.frame':	1 obs. of  10 variables:
 $ Marital_status : chr "Yes"
 $ Dependents     : int 0
 $ Is_graduate    : chr "Yes"
 $ Income         : int 295100
 $ Loan_amount    : int 70000
 $ Credit_score   : chr "Satisfactory"
 $ approval_status: int 1
 $ Age            : int 71
 $ Sex            : chr "M"
 $ Purpose        : chr "Wedding"
 - attr(*, ".internal.selfref")=<externalptr>

The output shows the resulting data has one row and ten observations. It is also possible to extract a range of rows. The code below extracts the third through fifth rows and all the columns, then stores them in the new object d'. The second line prints the structure of the new data: 3 observations of 10 variables.

1
2
d1 = dat[3:5,]
str(d1)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
Classes ‘data.table’ and 'data.frame':	3 obs. of  10 variables:
 $ Marital_status : chr  "Yes" "Yes" "Yes"
 $ Dependents     : int  0 1 2
 $ Is_graduate    : chr  "Yes" "Yes" "Yes"
 $ Income         : int  295100 319300 333300
 $ Loan_amount    : int  70000 70000 98000
 $ Credit_score   : chr  "Satisfactory" "Satisfactory" "Satisfactory"
 $ approval_status: int  1 1 1
 $ Age            : int  71 68 64
 $ Sex            : chr  "M" "M" "M"
 $ Purpose        : chr  "Wedding" "Wedding" "Wedding"
 - attr(*, ".internal.selfref")=<externalptr>

It is also possible to perform advanced filtering of rows. If we want to examine records where the applicant's credit record was not satisfactory but the loan was still approved, we can do that using the first line of code below.

The second line prints the dimension of the resulting data: 38 rows and 10 variables. The third line prints the table of the two variables. The output shows that there were 38 applicants whose credit score was not satisfactory, but their loan application was approved.

1
2
3
4
5
d1 = dat[approval_status == 1 & Credit_score != "Satisfactory"] 

dim(d1)

table(d1$approval_status, d1$Credit_score)  
{r}

Output:

1
2
         Not _satisfactory
  1                38  

Helper Functions for Data Sub-setting

The advantage of using data.table is that it provides a lot of helper functions for efficient data manipulation.

  1. %like%: This function allows us to search for a pattern in a character or a factor variable. This is illustrated in the first to third lines of code below, which subset all the rows where the variable Purpose starts with the string Personal.
  1. %between%: This function allows us to search for values within the closed interval. This is illustrated in the fourth to sixth lines of code below, which subset all the rows where the applicant's income is between $384,975 (first quartile) and $766,100 (third quartile).

  2. %chin%: This function is only for character vectors. This is illustrated in the seventh to tenth lines of code below, which subset all the rows where the purpose for loan application is "Furniture," "Business," or "Wedding."

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Lines 1 - 3: helper - %like% 
d1 = dat[Purpose %like% "^Personal"] 
dim(d1) 
table(d1$Purpose)

# Lines 4 - 6: helper - %between% 
d1 = dat[Income %between% c(384975, 766100)] 
dim(d1) 
summary(d1$Income)
 
#  Lines 7 - 10: helper - %chin% 
table(dat$Purpose)
d1 = dat[Purpose %chin% c("Furniture", "Business", "Wedding")]
dim(d1) 
table(d1$Purpose) 
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Output - Lines 1 – 3
 
[1] 166  10
 
Personal 
     166 
 
 
# Output - Lines 4 – 6
 
[1] 301  10
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 385200  444400  508400  529978  611100  766100 
 
 
# Output - Lines 7 – 10
 
 Business Education Furniture  Personal    Travel   Wedding 
       43       191        38       166       123        39 
 
[1] 120  10
 
 Business Furniture   Wedding 
       43        38        39 

Selecting Columns

It’s easy to select columns in data.table using the respective names. For example, the line of code below will extract the entire vector of values for the Purpose variable. The summary function confirms that the resulting vector, d1, has a length of 600 values.

1
2
d1 = dat[, Purpose]
str(d1)
{r}

Output:

1
chr [1:600] "Wedding" "Wedding" "Wedding" "Wedding" "Wedding" "Wedding" ..  

It is also possible to select multiple columns using the list() function. The first line of code below selects all the rows and two specified columns, while the second line prints the structure of the resulting data.

1
2
3
d1 = dat[, list(approval_status, Income)]

str(d1)
{r}

Output:

1
2
3
4
5
 
Classes ‘data.table’ and 'data.frame':	600 obs. of  2 variables:
 $ approval_status: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Income         : int  298500 315500 295100 319300 333300 277700 332100 320000 300400 311100 ...
 - attr(*, ".internal.selfref")=<externalptr> 

We have learned how to select columns, but sometimes we might want to deselect certain columns. This can be done using the convenience function that takes the form -c("col1", "col2", ...). As an example, the lines of code below create a subset of data that excludes the variables Sex and Dependents. The output of the code confirms this exclusion.

1
2
3
d2 <- dat[, -c("Sex", "Dependents")] 
dim(d2)
names(d2)
{r}

Output:

1
2
3
4
5
[1] 600   8


[1] "Marital_status"  "Is_graduate"     "Income"          "Loan_amount"    
[5] "Credit_score"    "approval_status" "Age"             "Purpose"    

Filtering Data

It’s easy to filter the data based on row and column conditions. The line of code below creates a data table where age is more than 60 years.

1
2
d1 = dat[Age > 60, ]
glimpse(d1)
{r}

Output:

1
2
3
4
5
6
7
8
9
10
11
12
Observations: 159
Variables: 10
$ Marital_status  <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Dependents      <int> 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 2, 3, 0, 2, 0, 2, 2, 0, ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 298500, 315500, 295100, 319300, 333300, 277700, 332100...
$ Loan_amount     <int> 71000, 75500, 70000, 70000, 98000, 71000, 58000, 64000...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ Age             <int> 74, 71, 71, 68, 64, 64, 63, 61, 75, 72, 75, 75, 75, 75...
$ Sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "F", "F", "M",...
$ Purpose         <chr> "Wedding", "Wedding", "Wedding", "Wedding", "Wedding",...

We can also subset data on various combinations of conditions. The first subset, s1, contains data for the maximum age of 76 years. The output shows there are only two such records.

The second subset, s2, contains data with records where the applicant is greater than fifty years and the purpose for loan application is “Education”. The resulting data has 97 observations of 10 variables.

The third subset, s3, contains data with records where the purpose of the loan application was either “Education” or "Personal." The resulting data has 357 observations of 10 variables.

1
2
3
4
5
6
7
8
9
10
11
12
#Subset 1 
s1 = dat[Age == 76, ]
print(dim(s1))

#Subset 2
s2 = dat[Purpose == "Education" & Age >= 50, ]
print(dim(s2))


# #Subset 3 - Combine conditions with or (|). 
s3 = dat[Purpose == "Education" | Purpose == "Personal", ]
print(dim(s3))
{r}

Output:

1
2
3
4
5
6
7
8
9
 
#Output – subset 1
[1]  2 10

#Output – subset 2
[1] 97 10

#Output – subset 3
[1] 357  10

Summarizing Data

We have so far learned how to perform data processing tasks such as filtering and subsetting. This section covers more sophisticated data aggregation techniques, such as performing summary operations within groups. This is a crucial task in descriptive and diagnostic analytics.

We start with the simple task of finding the mean of a single column, Age. The output confirms that the average age of the applicants in the data is 49 years.

1
2
n1 = dat[, mean(Age)]
n1
{r}

Output:

1
2
[1] 49.31167
 

The above operation on a single column can be extended to combine rows and columns. Using the data.table syntax format of [i, j, by], we can calculate the average income by approval status of the application. In this case, i represents all rows, j represents the computation on the column (mean of income), and by represents the grouping operation (approval status in this case).

1
dat[, mean(Income), by="approval_status"]
{r}

Output:

1
2
3
#    approval_status       V1
# 1:               1             709832.2
# 2:               0             698685.3

The output above created a variable V1 that contains the average income value for approved and rejected loan applications. It is possible to give this variable a unique name by using the list function, as in the line of code below.

1
dat[, list(Avg_income = mean(Income)), by="approval_status"]
{r}

Output:

1
2
3
#    approval_status Avg_income
# 1:               1   	709832.2
# 2:               0   	698685.3

The output confirms that the column name has been changed to Avg_income, which is more intuitive. The list() function can be used to create and name multiple variables as shown in the code below.

1
dat[, list(Avg_income = mean(Income), Max_income = max(Income), Min_income = min(Income)), by="Purpose"]
{r}

Output:

1
2
3
4
5
6
7
#      Purpose Avg_income Max_income Min_income
# 1:   Wedding   313294.9     333300     274700
# 2:    Travel   596590.2     776100     466300
# 3:  Personal   884995.2    5219600     133300
# 4: Furniture   961673.7    8444900     133300
# 5: Education   727725.7    5029200     133300
# 6:  Business   365904.7     397300     333300

Finally, we can perform the aggregation on more than one column. In the example below, we want to compute the average income and loan amount grouped by two columns, Purpose and approval_status. The grouping at the column level is done by changing the by argument, of the [i,j,by] syntax, as is done in the code below.

1
2
3
d1 = dat[, list(Income=mean(Income), Loan_amount=mean(Loan_amount), Age = mean(Age)), by=c("Purpose", "approval_status")]

d1
{r}

Output:

1
2
3
4
5
6
7
8
9
#      Purpose approval_status   Income 		Loan_amount      Age
# 1:   Wedding               1 		313294.9    	75884.62 	50.43590
# 2:    Travel               1 		596590.2  	 146926.83 	49.11382
# 3:  Personal               0 		417825.0   	213618.75 	49.28125
# 4:  Personal               1 		996558.2   	508768.66 	49.61940
# 5: Furniture               1 		961673.7   	736642.11 	49.76316
# 6: Education               0 		755568.4   	352221.52 	49.08861
# 7: Education               1 		594418.2   	361351.52 	47.66667
# 8:  Business               1		 365904.7    	62093.02 	49.60465

Conclusion

In this guide, you have learned how to use the powerful data.table package for data manipulation and aggregation. Understanding of these techniques will enable you to perform faster descriptive and diagnostic analytics on the data. To learn more about data science using R, please refer to the following guides:

0