Querying and Converting Data Types in R

Jul 3, 2020 • 11 Minute Read

Introduction

Working with data is a core requirement for data science professionals. The foundation of that work is understanding the most common data types and knowing how to process, query, and convert them. In this guide, you will learn techniques for querying and converting data types in R.

Data Types

There are several data types in R, and the most fundamental ones are listed below:

Characters: Text (or string) values are called characters. Assigning a text value to a variable, 't', will make it a character, as is shown below. You can confirm its type with the class() or typeof() function.

      t = "pluralsight"
      class(t)
      typeof(t)
    

Output:

      [1] "character"
      [1] "character"
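
Converting between characters and numbers works in both directions. The following is a minimal sketch using only base R functions; the values are illustrative.

```r
s <- "pluralsight"
is.character(s)      # TRUE
nchar(s)             # 11 characters

as.numeric("3.5")    # 3.5
as.character(3.5)    # "3.5"

# Text that is not a number becomes NA (R also issues a warning,
# which is suppressed here to keep the output clean).
suppressWarnings(as.numeric("abc"))   # NA
```

Checking for NA values after a conversion is a simple way to spot entries that could not be parsed.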

Numerics: Decimal values like 3.5 are called numerics in R. It is the default computational data type.

      N = 3.5
      class(N)
    

Output:

      [1] "numeric"

The variable N is stored as a numeric value, and not an integer. This can be checked using the is.integer() function.

      is.integer(N)

Output:

      [1] FALSE
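
The class() and typeof() functions can give different answers for the same numeric value, which is worth knowing when you query types. A short sketch with illustrative variable names:

```r
# Compare class() and typeof() for a decimal value.
n <- 3.5
class(n)        # "numeric"
typeof(n)       # "double"
is.numeric(n)   # TRUE
is.integer(n)   # FALSE

# A whole number typed without a suffix is still stored as a double.
w <- 3
typeof(w)       # "double"
is.integer(w)   # FALSE
```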

Integers: If you want to create an integer variable, you can use the as.integer() function, which drops the fractional part of a decimal value. All integers are numeric, but the reverse is not true.

      i = as.integer(3.1)
      print(i)
    

Output:

      [1] 3
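
As a brief illustration of how integer conversion behaves, the sketch below uses the L suffix for integer literals and compares truncation with rounding. The values are illustrative.

```r
# The L suffix creates an integer literal directly.
j <- 3L
class(j)            # "integer"
is.numeric(j)       # TRUE: integers count as numeric

# as.integer() truncates toward zero rather than rounding.
as.integer(3.9)     # 3
as.integer(-3.9)    # -3
round(3.9)          # 4, but the result is still a double
```

If you need a rounded integer, one option is to combine the two calls, as in as.integer(round(x)).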

Logical: Logical values are often created by comparing two or more variables. These are denoted by boolean values, TRUE or FALSE.

      x = 100
      y = 56
      x < y
    

Output:

      [1] FALSE
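
Comparisons also work on whole vectors, and the resulting logical values can be counted. A minimal sketch, assuming a hypothetical vector named scores:

```r
# Comparisons are vectorised, so they return one logical per element.
scores <- c(100, 56, 73)   # hypothetical values
above_60 <- scores > 60
above_60          # TRUE FALSE TRUE
class(above_60)   # "logical"

# TRUE counts as 1 and FALSE as 0 in arithmetic.
sum(above_60)     # 2 values exceed 60
mean(above_60)    # proportion of values above 60
```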

Those are the most common basic data types. For data analysis, though, the most important structure is the data frame.

Data Frame

The data frame is the de facto data structure for most data science projects because it organizes data in a tabular format. In simple terms, a data frame is a special type of list in which all the elements are of equal length.

Data frames are normally created by functions such as read_csv() (from the readr package) or read.table() when importing data into R. You can also create a new data frame with the data.frame() function.

      df <- data.frame(rollnum = 1:10, h1 = 15:24, h2 = 81:90)
      df
    

Output:

         rollnum h1 h2
      1        1 15 81
      2        2 16 82
      3        3 17 83
      4        4 18 84
      5        5 19 85
      6        6 20 86
      7        7 21 87
      8        8 22 88
      9        9 23 89
      10      10 24 90
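
The description of a data frame as a list of equal-length elements can be checked directly. The sketch below rebuilds the same small data frame and queries its structure with base R functions.

```r
df <- data.frame(rollnum = 1:10, h1 = 15:24, h2 = 81:90)

is.list(df)          # TRUE: a data frame is built on a list
length(df)           # 3, the number of columns
dim(df)              # 10 rows, 3 columns
sapply(df, class)    # class of each column, here all "integer"
sapply(df, length)   # every column has length 10
```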
    

The most common way to obtain a data frame is to import a flat file, such as a CSV or Excel file, into the R environment. The code below performs this task and loads the data that will be used in the subsequent sections.

      library(readr)   # provides read_csv()
      library(dplyr)   # provides glimpse()
      dat <- read_csv("data.csv")
      glimpse(dat)
    

Output:

      Observations: 585
      Variables: 6
      $ UID             <chr> "UIDA467", "UIDA402", "UIDA354", "UIDA209", "UIDA256",...
      $ Income          <dbl> 36850.4, 45470.2, 53240.2, 198400.2, 83410.2, 42110.2,...
      $ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
      $ approval_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, ...
      $ Age             <int> -12, -10, -3, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, ...
      $ Purpose         <chr> "Business", "Personal", "Travel", "Personal", "Persona...
    

The output shows there are 585 observations of 6 variables, described below.

UID: Unique identifier tag of the loan applicant.
Income: Annual income of the applicant (in US dollars).
Credit_score: Whether the applicant's credit score was satisfactory or not.
approval_status: Whether the loan application was approved ("1") or not ("0").
Age: The applicant’s age in years.
Purpose: The reason for the loan application.

Inspecting and Converting Data Types

For data science and machine learning, it's important for the variables to be in the right data type. To begin, you will use the str() function that prints the structure of the data.

      str(dat)

Output:

      Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	585 obs. of  6 variables:
 $ UID            : chr  "UIDA467" "UIDA402" "UIDA354" "UIDA209" ...
 $ Income         : num  36850 45470 53240 198400 83410 ...
 $ Credit_score   : Factor w/ 2 levels "Not _satisfactory",..: 2 2 2 2 1 2 2 2 2 2 ...
 $ approval_status: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age            : int  -12 -10 -3 23 23 23 23 23 23 24 ...
 $ Purpose        : Factor w/ 6 levels "Business","Education",..: 1 4 5 4 4 4 4 4 5 4
    

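The data has six variables. As the earlier glimpse() output showed, three were imported as numeric types (Income, approval_status, and Age) and three as character (UID, Credit_score, and Purpose). You will start by examining the levels of the character variables that hold categories.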

      table(dat$Credit_score)

Output:

      Not _satisfactory      Satisfactory 
                    124               461 
    

The variable Credit_score has only two levels, so it can be converted to a factor variable with the as.factor() function.

      dat$Credit_score = as.factor(dat$Credit_score)
      class(dat$Credit_score)
    

Output:

      [1] "factor"
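
Once a variable is a factor, you can query its levels and the integer codes stored behind them. The sketch below uses a small hypothetical vector that mirrors the two categories of Credit_score rather than the real data.

```r
# Hypothetical values mirroring the two categories in Credit_score.
credit <- c("Satisfactory", "Not _satisfactory", "Satisfactory")
credit_f <- as.factor(credit)

levels(credit_f)       # levels are sorted alphabetically by default
nlevels(credit_f)      # 2
as.integer(credit_f)   # the integer codes stored behind the labels

# factor() lets you set the level order explicitly.
credit_o <- factor(credit, levels = c("Satisfactory", "Not _satisfactory"))
levels(credit_o)
```

Setting the order explicitly can matter later, because many modeling functions treat the first level as the reference category.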

Next, inspect the number of levels for the variable Purpose.

      table(dat$Purpose)

Output:

       Business Education Furniture  Personal    Travel   Wedding 
             43       184        37       161       122        38 
    

The variable Purpose has six levels, and the code below converts it to the factor data type.

      dat$Purpose = as.factor(dat$Purpose)
      class(dat$Purpose)
    

Output:

      [1] "factor"

The last conversion to make is for the variable approval_status. Start by examining the class of the variable.

      class(dat$approval_status)
      table(dat$approval_status)

Output:

      [1] "integer"

        0   1 
      186 399 
    

The class of approval_status is integer, but it takes only two values, zero and one. It is in fact a categorical variable and needs to be converted to a factor.

      dat$approval_status = as.factor(dat$approval_status)
      class(dat$approval_status)
    

Output:

      [1] "factor"
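
If you later need the original 0 and 1 values back as numbers, note that converting a factor straight to an integer returns its internal level codes. A short sketch with hypothetical approval flags:

```r
status <- as.factor(c(1L, 0L, 1L, 1L))   # hypothetical approval flags

as.integer(status)                 # 2 1 2 2: the internal level codes
as.integer(as.character(status))   # 1 0 1 1: the original values
```

Going through as.character() first is a common way to recover the values that the labels represent.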

The required conversions have been made, and this can be verified with the code below.

      glimpse(dat)

Output:

      Observations: 585
      Variables: 6
      $ UID             <chr> "UIDA467", "UIDA402", "UIDA354", "UIDA209", "UIDA256",...
      $ Income          <dbl> 36850.4, 45470.2, 53240.2, 198400.2, 83410.2, 42110.2,...
      $ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
      $ approval_status <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, ...
      $ Age             <int> -12, -10, -3, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, ...
      $ Purpose         <fct> Business, Personal, Travel, Personal, Personal, Person...
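
An alternative to converting columns one at a time is to declare the types at import. The sketch below uses base R's read.csv() with the colClasses argument rather than read_csv(), and reads a small inline CSV in place of data.csv. The three rows, csv_text, and toy are hypothetical stand-ins.

```r
# Hypothetical inline CSV standing in for data.csv, which is not available here.
csv_text <- "UID,Income,Credit_score,approval_status,Age,Purpose
UIDA467,36850.4,Satisfactory,1,-12,Business
UIDA402,45470.2,Satisfactory,1,-10,Personal
UIDA256,83410.2,Not _satisfactory,1,23,Personal"

# Named colClasses entries apply to the matching columns only;
# the remaining columns keep their automatically detected types.
toy <- read.csv(
  text = csv_text,
  colClasses = c(Credit_score = "factor",
                 approval_status = "factor",
                 Purpose = "factor"),
  stringsAsFactors = FALSE
)

# Base R check of the class of every column.
sapply(toy, class)
```

The final sapply() call is also a quick base R way to verify column types when glimpse() is not available.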
    

You have inspected and converted the variables in the section above. Next, you will learn how to query some of the numerical variables. The summary() function provides key statistics about each variable.

      summary(dat)

Output:

      UID                Income                  Credit_score approval_status
 Length:585         Min.   :  3000   Not _satisfactory:124   0:186          
 Class :character   1st Qu.: 38890   Satisfactory     :461   1:399          
 Mode  :character   Median : 51440                                          
                    Mean   : 71655                                          
                    3rd Qu.: 77570                                          
                    Max.   :844490                                          


      Age              Purpose   
 Min.   :-12.00   Business : 43  
 1st Qu.: 37.00   Education:184  
 Median : 51.00   Furniture: 37  
 Mean   : 49.39   Personal :161  
 3rd Qu.: 61.00   Travel   :122  
 Max.   : 76.00   Wedding  : 38
    

From the output above, you can see that the variable Age has negative values. These are invalid and need further querying. One way to start is to count how many such values there are.

      neg_age = dat[dat$Age < 0, ]
      nrow(neg_age)
    

Output:

      [1] 3
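
The same count can be obtained without building a subset first. The sketch below uses a hypothetical vector that includes a missing value, since a comparison against NA yields NA and can add unwanted rows when subsetting with a logical condition.

```r
ages <- c(-12, -10, -3, 23, 23, NA)   # hypothetical values, including a missing one

sum(ages < 0, na.rm = TRUE)   # 3 negative values
which(ages < 0)               # positions 1 2 3

# which() drops NA comparisons, so no all-NA rows appear in the subset.
toy <- data.frame(Age = ages)
toy[which(toy$Age < 0), , drop = FALSE]
```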

There are only three such records out of 585, so removing them would have little effect on the analysis. Another technique is to create a new variable that flags whether the age is negative.

The first line uses the ifelse() function to create a new variable, AgeNegative, that holds the string "TRUE" when Age is below zero and "FALSE" otherwise. The second line prints the first five values of the variable.

      dat$AgeNegative <- ifelse(dat$Age < 0, "TRUE", "FALSE")
      dat$AgeNegative[1:5]
    

Output:

      [1] "TRUE"  "TRUE"  "TRUE"  "FALSE" "FALSE"

The output above shows that the first three values are TRUE, which correspond to the three negative age values in the data. In a similar manner, you can inspect other variables in the data.
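
Because the ifelse() call above returns the strings "TRUE" and "FALSE", the flag is stored as character data. If you prefer a true logical column, the comparison on its own is enough. The sketch below shows this on a hypothetical five-row data frame, along with one possible way to handle the invalid ages.

```r
toy <- data.frame(Age = c(-12L, -10L, -3L, 23L, 23L))   # hypothetical values

# A comparison already returns logical values, so ifelse() is optional.
toy$AgeNegative <- toy$Age < 0
class(toy$AgeNegative)   # "logical"
sum(toy$AgeNegative)     # 3

# One option: mark impossible ages as missing instead of deleting rows.
toy$Age[toy$AgeNegative] <- NA
summary(toy$Age)
```

Whether to drop, flag, or blank out such records depends on the analysis; this is only one of several reasonable choices.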

Conclusion

In this guide, you learned about the most common data types, and acquired the knowledge of querying and converting them. This will help you understand and transform data better to perform complex data science tasks.

To learn more about Data Science with R, please refer to the following guides: