 Deepika Singh

# Querying and Converting Data Types in R

• Jul 3, 2020
• 1,859 Views
Data
R

## Introduction

Working with data is a core requirement for data science professionals. The foundation of that work is understanding the most common data types and knowing how to process, query, and convert them. In this guide, you will learn the techniques of querying and converting data types in R.

## Data Types

There are several data types in R, and the most integral ones are listed below:

1. Characters: Text (or string) values are called characters. Assigning a text value to a variable, `t`, makes it a character, as shown below. You can confirm its type with the `class()` or `typeof()` function.

```r
t = "pluralsight"
class(t)
typeof(t)
```

Output:

```
[1] "character"

[1] "character"
```
2. Numerics: Decimal values like 3.5 are called numerics in R. Numeric is the default computational data type.

```r
N = 3.5
class(N)
```

Output:

```
[1] "numeric"
```

The variable `N` is stored as a numeric value, and not an integer. This can be checked using the `is.integer()` function.

```r
is.integer(N)
```

Output:

```
[1] FALSE
```
3. Integers: If you want to create an integer variable, you can use the `as.integer()` function, which drops the fractional part of a decimal value. All integers are numeric, but the reverse is not true.

```r
i = as.integer(3.1)
print(i)
```

Output:

```
[1] 3
```
4. Logical: Logical values are often created by comparing two or more variables. They take one of the boolean values `TRUE` or `FALSE`.

```r
x = 100
y = 56
x < y
```

Output:

```
[1] FALSE
```
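The distinctions drawn in the list above (class versus storage type, numeric versus integer, and logical values) can be checked with a short sketch. The variable names `n`, `k`, and `flags` are illustrative and not part of the guide's data.

```r
# class() reports the high-level type; typeof() reports the internal storage type
n <- 3.5
class(n)          # "numeric"
typeof(n)         # "double"

# The L suffix creates an integer directly, without as.integer()
k <- 3L
class(k)          # "integer"
is.numeric(k)     # TRUE: integers count as numeric

# as.integer() truncates toward zero rather than rounding
as.integer(3.9)   # 3
as.integer(-3.9)  # -3

# Logical values behave as 1 and 0 in arithmetic
flags <- c(TRUE, FALSE, TRUE)
sum(flags)        # 2
```

Note that `class()` and `typeof()` agree for characters but differ for decimal values, which is why both are worth checking when a conversion behaves unexpectedly.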

The data types above are the basic building blocks, but the structure you will work with most often in practice is the data frame.

## Data Frame

The data frame is the de facto data structure for most data science projects because it organizes data in a tabular format. In simple terms, a data frame is a special type of list in which all the elements (the columns) are of equal length.

Data frames are normally created by functions such as `read_csv()` (from the `readr` package) and `read.table()` when importing data into R. You can also create a new data frame with the `data.frame()` function.

```r
# 1:10 already produces the sequence, so wrapping it in seq() is unnecessary
df <- data.frame(rollnum = 1:10, h1 = 15:24, h2 = 81:90)
df
```

Output:

```
   rollnum h1 h2
1        1 15 81
2        2 16 82
3        3 17 83
4        4 18 84
5        5 19 85
6        6 20 86
7        7 21 87
8        8 22 88
9        9 23 89
10      10 24 90
```
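The description of a data frame as a list of equal-length columns can be verified directly. The following is a minimal sketch that rebuilds the same `df` and queries its structure with base R functions.

```r
# Rebuild the small data frame from above
df <- data.frame(rollnum = 1:10, h1 = 15:24, h2 = 81:90)

is.list(df)         # TRUE: a data frame is a list of columns
sapply(df, class)   # class of every column; all "integer" here
sapply(df, length)  # every column has 10 elements
dim(df)             # 10 rows, 3 columns

# A single column can be queried by name and converted in place
df$h1 <- as.numeric(df$h1)
class(df$h1)        # "numeric"
```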

The most common way of obtaining a data frame is to import a flat file (CSV or Excel) into the R environment. The code below performs this task and loads the data that will be used in the subsequent sections.

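```r
library(readr)
library(dplyr)  # provides glimpse()
# The line that reads the file into `dat` is missing from the original listing
glimpse(dat)
```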

Output:

```
Observations: 585
Variables: 6
$ UID             <chr> "UIDA467", "UIDA402", "UIDA354", "UIDA209", "UIDA256",...
$ Income          <dbl> 36850.4, 45470.2, 53240.2, 198400.2, 83410.2, 42110.2,...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, ...
$ Age             <int> -12, -10, -3, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, ...
$ Purpose         <chr> "Business", "Personal", "Travel", "Personal", "Persona...
```

The output shows there are 585 observations of 6 variables, described below.

1. `UID`: Unique identifier tag of the loan applicant.

2. `Income`: Annual income of the applicant (in US dollars).

3. `Credit_score`: Whether the applicant's credit score was satisfactory or not.

4. `approval_status`: Whether the loan application was approved ("1") or not ("0").

5. `Age`: The applicant’s age in years.

6. `Purpose`: The reason for the loan application.
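The file behind `dat` is not included in this guide, so the import step cannot be reproduced exactly. As a rough stand-in, the sketch below builds a hypothetical `dat_demo` data frame from the first five rows visible in the guide's output, writes it to a temporary CSV, and reads it back with base R's `read.csv()` to show which types R assigns on import. The names `dat_demo`, `path`, and `imported` are assumptions made for illustration.

```r
# Toy stand-in for the loan data: the first five rows as shown in the guide's output
dat_demo <- data.frame(
  UID = c("UIDA467", "UIDA402", "UIDA354", "UIDA209", "UIDA256"),
  Income = c(36850.4, 45470.2, 53240.2, 198400.2, 83410.2),
  Credit_score = c(rep("Satisfactory", 4), "Not _satisfactory"),
  approval_status = c(1L, 1L, 1L, 1L, 1L),
  Age = c(-12L, -10L, -3L, 23L, 23L),
  Purpose = c("Business", "Personal", "Travel", "Personal", "Personal"),
  stringsAsFactors = FALSE
)

# Round-trip through a temporary CSV to mimic importing a flat file
path <- tempfile(fileext = ".csv")
write.csv(dat_demo, path, row.names = FALSE)
imported <- read.csv(path, stringsAsFactors = FALSE)

# Column types assigned on import: text stays character, whole numbers become integer
sapply(imported, class)
```

Setting `stringsAsFactors = FALSE` explicitly keeps the behaviour the same across R versions, since the default changed in R 4.0.0.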

## Inspecting and Converting Data Types

For data science and machine learning, each variable needs to be stored with the appropriate data type. To begin, use the `str()` function, which prints the structure of the data.

```r
str(dat)
```

Output:

```
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	585 obs. of  6 variables:
 $ UID            : chr  "UIDA467" "UIDA402" "UIDA354" "UIDA209" ...
 $ Income         : num  36850 45470 53240 198400 83410 ...
 $ Credit_score   : Factor w/ 2 levels "Not _satisfactory",..: 2 2 2 2 1 2 2 2 2 2 ...
 $ approval_status: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age            : int  -12 -10 -3 23 23 23 23 23 23 24 ...
 $ Purpose        : Factor w/ 6 levels "Business","Education",..: 1 4 5 4 4 4 4 4 5 4
```

From the output above, you can see that the data has six variables, three numerical and three categorical. Start by examining the distinct values of the character variables.

```r
table(dat$Credit_score)
```

Output:

```
Not _satisfactory      Satisfactory 
              124               461 
```

The variable `Credit_score` has only two levels, so it can be converted to a factor variable with the `as.factor()` function.

```r
dat$Credit_score = as.factor(dat$Credit_score)
class(dat$Credit_score)
```

Output:

```
[1] "factor"
```
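To see what the conversion actually stores, it can help to look inside a factor. The sketch below uses a small illustrative vector, `score`, with the two credit score labels from the data rather than the full `dat` column.

```r
# Illustrative vector using the two credit score labels from the data
score <- c("Satisfactory", "Not _satisfactory", "Satisfactory")
score_f <- as.factor(score)

levels(score_f)      # distinct categories, sorted alphabetically
nlevels(score_f)     # 2
as.integer(score_f)  # underlying integer codes: 2 1 2
table(score_f)       # counts per level
```

A factor is stored as integer codes plus a set of level labels, which is why `str()` shows values such as `2 2 2 2 1` for a factor column.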

Next, inspect the number of levels for the variable `Purpose`.

```r
table(dat$Purpose)
```

Output:

```
 Business Education Furniture  Personal    Travel   Wedding 
       43       184        37       161       122        38 
```

The variable `Purpose` has six levels, so it is also converted to the factor data type with the code below.

```r
dat$Purpose = as.factor(dat$Purpose)
class(dat$Purpose)
```

Output:

```
[1] "factor"
```

The last conversion to make is for the variable `approval_status`. Start by examining the class of the variable.

```r
class(dat$approval_status)

table(dat$approval_status)
```

Output:

```
[1] "integer"

  0   1 
186 399 
```

The class of the variable `approval_status` is shown as `integer`, but it takes only two values, zero and one. It is really a categorical variable and needs to be converted to a factor.

```r
dat$approval_status = as.factor(dat$approval_status)
class(dat$approval_status)
```

Output:

```
[1] "factor"
```
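One caution when a numeric-looking variable becomes a factor: converting it back with `as.numeric()` returns the internal level codes, not the original values. The sketch below illustrates this with a small hypothetical vector, `status`, of 0/1 values.

```r
# Illustrative 0/1 values, as in approval_status
status <- as.factor(c(1, 1, 0, 1, 0))

as.numeric(status)                  # 2 2 1 2 1: the level codes, not the values
as.numeric(as.character(status))    # 1 1 0 1 0: the original values
as.numeric(levels(status))[status]  # same result, converting each level only once
```

Going through `as.character()` first is the usual way to recover the original numbers from a factor.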

The required conversions have been made, and this can be verified with the code below.

```r
glimpse(dat)
```

Output:

```
Observations: 585
Variables: 6
$ UID             <chr> "UIDA467", "UIDA402", "UIDA354", "UIDA209", "UIDA256",...
$ Income          <dbl> 36850.4, 45470.2, 53240.2, 198400.2, 83410.2, 42110.2,...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, ...
$ Age             <int> -12, -10, -3, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, ...
$ Purpose         <fct> Business, Personal, Travel, Personal, Personal, Person...
```
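The three conversions above were done one column at a time. When several columns need the same treatment, they can be converted in a single step. The following is a minimal sketch using a hypothetical three-row `dat_demo` data frame with the same column names; `cols` is an illustrative helper variable.

```r
# Small stand-in with the three categorical columns from the guide
dat_demo <- data.frame(
  Credit_score = c("Satisfactory", "Not _satisfactory", "Satisfactory"),
  approval_status = c(1L, 0L, 1L),
  Purpose = c("Business", "Travel", "Personal"),
  stringsAsFactors = FALSE
)

# Apply as.factor() to each listed column and assign the results back
cols <- c("Credit_score", "approval_status", "Purpose")
dat_demo[cols] <- lapply(dat_demo[cols], as.factor)

sapply(dat_demo, class)  # all three columns are now "factor"
```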

You have inspected and converted the variables in the section above. Next, you will learn how to query some of the numerical variables. The `summary()` function provides key statistics about the variables.

```r
summary(dat)
```

Output:

```
     UID                Income                  Credit_score approval_status
 Length:585         Min.   :  3000   Not _satisfactory:124   0:186
 Class :character   1st Qu.: 38890   Satisfactory     :461   1:399
 Mode  :character   Median : 51440
                    Mean   : 71655
                    3rd Qu.: 77570
                    Max.   :844490

      Age              Purpose
 Min.   :-12.00   Business : 43
 1st Qu.: 37.00   Education:184
 Median : 51.00   Furniture: 37
 Mean   : 49.39   Personal :161
 3rd Qu.: 61.00   Travel   :122
 Max.   : 76.00   Wedding  : 38
```
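The same statistics can be queried for a single numeric variable. The sketch below uses an illustrative vector, `age`, holding the first ten `Age` values shown in the earlier output, so its results describe only those ten values and not the full data.

```r
# First ten Age values as shown in the guide's output
age <- c(-12L, -10L, -3L, 23L, 23L, 23L, 23L, 23L, 23L, 24L)

range(age)                          # smallest and largest value: -12 24
mean(age)                           # arithmetic mean
quantile(age, c(0.25, 0.5, 0.75))   # quartiles
summary(age)                        # the statistics above in one call
which(age < 0)                      # positions of suspicious values: 1 2 3
```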

From the output above, you can see that the variable `Age` has negative values. This is incorrect data and needs further querying. There are various ways to do this, one of which is to find out how many such values there are.

```r
neg_age = dat[dat$Age < 0, ]
nrow(neg_age)
```

Output:

```
[1] 3
```
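Counting the invalid rows is usually followed by a decision about what to do with them. The sketch below shows two common options, dropping the rows or marking the values as missing, on a hypothetical five-row `dat_demo` built from the first values shown in the guide's output. The names `clean` and `marked` are illustrative.

```r
# Five-row stand-in using the first UID and Age values from the guide's output
dat_demo <- data.frame(
  UID = c("UIDA467", "UIDA402", "UIDA354", "UIDA209", "UIDA256"),
  Age = c(-12L, -10L, -3L, 23L, 23L),
  stringsAsFactors = FALSE
)

# Count the negative ages without creating a subset first
sum(dat_demo$Age < 0)   # 3

# Option 1: keep only rows with a valid age
clean <- dat_demo[dat_demo$Age >= 0, ]
nrow(clean)             # 2

# Option 2: keep every row but mark the invalid ages as missing
marked <- dat_demo
marked$Age[marked$Age < 0] <- NA
sum(is.na(marked$Age))  # 3
```

Marking values as `NA` keeps the other columns of those records available for analysis, at the cost of having to handle missing values later.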

There are only three such records out of 585, so removing them would have little effect on the data. Another technique is to create a new variable that flags whether the age is negative.

The first line uses the `ifelse()` function to create a new variable, `AgeNegative`, which takes the value `"TRUE"` if the condition `dat$Age < 0` holds and `"FALSE"` otherwise. Because the two outcomes are supplied as quoted strings, the result is stored as a character vector rather than a true logical one. The second line prints the first five values of the variable.

```r
dat$AgeNegative <- ifelse(dat$Age < 0, "TRUE", "FALSE")
dat$AgeNegative[1:5]
```

Output:

```
[1] "TRUE"  "TRUE"  "TRUE"  "FALSE" "FALSE"
```

The output above shows that the first three values are `"TRUE"`, which correspond to the three negative age values in the data. In a similar manner, you can inspect other variables in the data.
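If a genuine logical flag is preferred over text labels, the comparison itself already produces one. The sketch below contrasts the two forms on an illustrative vector, `age`, holding the first five `Age` values from the data; `age_negative` and `as_text` are hypothetical names.

```r
# First five Age values as shown in the guide's output
age <- c(-12L, -10L, -3L, 23L, 23L)

# The comparison alone yields a logical vector
age_negative <- age < 0
class(age_negative)   # "logical"
sum(age_negative)     # 3: number of negative ages
mean(age_negative)    # 0.6: share of negative ages

# ifelse() with quoted outcomes yields text instead
as_text <- ifelse(age < 0, "TRUE", "FALSE")
class(as_text)        # "character"
as.logical(as_text)   # converts the text back to logical values
```

A logical vector can be summed, averaged, or used directly for subsetting, which the character version cannot do without conversion.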

## Conclusion

In this guide, you learned about the most common data types in R and how to query and convert them. This will help you understand and transform data more effectively when performing complex data science tasks.