Deepika Singh

Validating Data Using Asserts in R

  • Mar 2, 2020
  • 13 Min read
  • 3,837 Views
Data
R

Introduction

The quality of data plays a crucial role in machine learning. Poor-quality data introduces errors that adversely affect data analysis and model performance. Often, these errors are difficult to detect and surface late in the analysis. Worse still, some errors go undetected and flow into the results, making them inaccurate. The solution to this problem lies in data validation. Enter asserts: debugging aids that test a condition and can be used to check data programmatically.

In this guide, you will learn to validate data using asserts in R. Specifically, we'll be using the assertr package, which provides a variety of functions designed to verify assumptions about data early in a data analysis pipeline.

Data

In this guide, we'll be using a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No").
  2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
  3. Income: Annual income of the applicant (in USD).
  4. Loan_amount: Loan amount (in USD) for which the application was submitted.
  5. Credit_score: Whether the applicant's credit score is satisfactory or not.
  6. approval_status: Whether the loan application was approved or not, coded as a binary value (0 or 1).
  7. Age: The applicant's age in years.
  8. Sex: Whether the applicant is male ("M") or female ("F").
  9. Dependents: Number of dependents in the applicant's family.
  10. Purpose: Purpose of applying for the loan.

Let's start by loading the required libraries and the data.

library(readr)
library(assertr)
library(assertive)
library(magrittr)
library(dplyr)

dat <- read_csv("dataset.csv")

dim(dat)

Output:

[1] 600  10

Importance of Asserts

The example below demonstrates the importance of asserts. We summarize the average age of the applicants, grouped by their approval status. The first line of code below converts the approval_status variable into a factor, while the remaining lines perform the required computation.

dat$approval_status <- as.factor(dat$approval_status)

dat %>%
  group_by(approval_status) %>%
  summarise(avg_age = mean(Age))

Output:

approval_status    avg_age
<fctr>               <dbl>
0                 47.40000
1                 48.61463

Nothing seems wrong with the output above, but let's look at the output of the summary() function for the Age variable.

summary(dat$Age)

Output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -10.00   36.00   50.00   48.23   61.00   76.00 

From the output above, we can see that some of the applicants' ages are negative (the minimum is -10), which is not possible. This is incorrect data, but the error went undetected in the previous code, where we performed the group-by operation. This is where assertr's verify() function can be used to ensure that such mistakes don't go unidentified.

The verify() function takes a data frame (dat) and a logical expression (Age >= 0) and evaluates that expression against the data. If the expression is not true for every row, verify() raises an error and stops the rest of the pipeline. In this example, the lines of code below perform this check.

dat %>%
  verify(Age >= 0) %>%
  group_by(approval_status) %>%
  summarise(avg_age = mean(Age))

Output:

verification [Age >= 0] failed! (10 failures)

     verb redux_fn predicate column index value
1  verify       NA  Age >= 0     NA     1    NA
2  verify       NA  Age >= 0     NA     2    NA
3  verify       NA  Age >= 0     NA     3    NA
4  verify       NA  Age >= 0     NA     4    NA
5  verify       NA  Age >= 0     NA   193    NA
6  verify       NA  Age >= 0     NA   194    NA
7  verify       NA  Age >= 0     NA   195    NA
8  verify       NA  Age >= 0     NA   199    NA
9  verify       NA  Age >= 0     NA   209    NA
10 verify       NA  Age >= 0     NA   600    NA

Error: assertr stopped execution

The output lists ten rows where the age is negative, each identified by its row number in the index column. The final message, Error: assertr stopped execution, confirms that the pipeline was halted, which is why the grouped summary was not displayed.
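The pass-or-stop behaviour of verify() can be sketched on a small scale. The snippet below is a minimal illustration, assuming the assertr package is installed. The data frames toy_ok and toy_bad are made up for this sketch and are not part of the loan dataset. The failing call is wrapped in tryCatch() so that the error is captured instead of ending the script.

```r
library(assertr)

# Hypothetical toy data, not the loan dataset used in this guide
toy_ok  <- data.frame(Age = c(25, 40, 61))
toy_bad <- data.frame(Age = c(25, -3, 61))

# When every row satisfies the expression, verify() hands the data back,
# so the next step of a pipeline can use it
passed <- verify(toy_ok, Age >= 0)

# When any row fails, verify() raises an error; tryCatch() captures it here
failed <- tryCatch(
  verify(toy_bad, Age >= 0),
  error = function(e) e
)

nrow(passed)               # number of rows that were passed through
inherits(failed, "error")  # whether the second call raised an error
```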

The same task can be performed using assertr's assert() function. In the code below, assert() takes the data, dat, and a predicate function, within_bounds(0, Inf). This range accepts only non-negative values, but the bounds can be altered as necessary. The predicate is then applied to the column of interest, Age. The code below raises an error when the condition is not met.

dat %>%
  assert(within_bounds(0, Inf), Age) %>%
  group_by(approval_status) %>%
  summarise(avg_age = mean(Age))

Output:

Column 'Age' violates assertion 'within_bounds(0, Inf)' 10 times
    verb redux_fn             predicate column index value
1 assert       NA within_bounds(0, Inf)    Age     1    -2
2 assert       NA within_bounds(0, Inf)    Age     2    -3
3 assert       NA within_bounds(0, Inf)    Age     3    -4
4 assert       NA within_bounds(0, Inf)    Age     4    -5
5 assert       NA within_bounds(0, Inf)    Age   193    -5
  [omitted 5 rows]

Error: assertr stopped execution

The first line of the output, Column 'Age' violates assertion 'within_bounds(0, Inf)' 10 times, indicates that there are ten rows with negative age values.
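Predicates such as within_bounds() and in_set() are function factories: each returns a function that can also be called on a single value. The short sketch below, which assumes assertr is installed, shows this on a few hand-picked values. Both bounds of within_bounds() are inclusive by default, so within_bounds(0, Inf) also accepts an age of exactly zero. The variable names non_negative, strictly_positive, and binary are illustrative only.

```r
library(assertr)

# Each call returns a predicate function that tests one value
non_negative      <- within_bounds(0, Inf)
strictly_positive <- within_bounds(0, Inf, include.lower = FALSE)
binary            <- in_set(0, 1)

non_negative(0)        # zero lies on the inclusive lower bound
strictly_positive(0)   # zero is rejected once the lower bound is exclusive
non_negative(-5)       # negative values are rejected by both predicates
binary(1)              # 1 is a member of the set {0, 1}
binary(2)              # 2 is not
```

If zero should count as a violation, passing include.lower = FALSE to within_bounds() inside assert() is one way to express that.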

Combining Several Asserts

It can be time-consuming and inefficient to validate data one condition at a time. A more efficient approach is to chain several functions from the assert family into a single validation pipeline, as shown in the example below.

Let's assume we want to validate the following conditions in our data.

  1. The data has all ten variables described in the initial section of the guide. This is achieved with the verify(has_all_names()) command in the code below.
  2. The dataset contains more than 120 observations, which represents twenty percent of the initial data. This is achieved with the verify(nrow(.) > 120) command below.
  3. The variable Age only takes positive values. This is achieved with the verify(Age > 0) command below.
  4. The variables Income and Loan_amount should have values within three standard deviations of their respective means. This is achieved with the insist(within_n_sds(3), Income) and insist(within_n_sds(3), Loan_amount) commands in the code below.
  5. The target variable, approval_status, contains only the binary values zero and one. This is achieved with the assert(in_set(0,1), approval_status) command in the code below.
  6. Each row in the data contains at most six missing values. This is achieved with the assert_rows(num_row_NAs, within_bounds(0,6), everything()) command below.
  7. Each row is unique jointly between the Income, Dependents, approval_status, Age, Sex, Purpose, Loan_amount, and Credit_score variables. This is achieved with the assert_rows(col_concat, is_uniq,...) command below.

dat %>%
  verify(has_all_names("Loan_amount", "Income", "Marital_status", "Dependents", "Is_graduate", "Credit_score", "approval_status", "Age", "Sex", "Purpose")) %>%
  verify(nrow(.) > 120) %>%
  verify(Age > 0) %>%
  insist(within_n_sds(3), Income) %>%
  insist(within_n_sds(3), Loan_amount) %>%
  assert(in_set(0,1), approval_status) %>%
  assert_rows(num_row_NAs, within_bounds(0,6), everything()) %>%
  assert_rows(col_concat, is_uniq, Income, Dependents, approval_status, Age, Sex, Purpose, Loan_amount, Credit_score) %>%
  group_by(approval_status) %>%
  summarise(avg.Age = mean(Age))

Output:

verification [Age > 0] failed! (10 failures)

     verb redux_fn predicate column index value
1  verify       NA   Age > 0     NA     1    NA
2  verify       NA   Age > 0     NA     2    NA
3  verify       NA   Age > 0     NA     3    NA
4  verify       NA   Age > 0     NA     4    NA
5  verify       NA   Age > 0     NA   193    NA
6  verify       NA   Age > 0     NA   194    NA
7  verify       NA   Age > 0     NA   195    NA
8  verify       NA   Age > 0     NA   199    NA
9  verify       NA   Age > 0     NA   209    NA
10 verify       NA   Age > 0     NA   600    NA

Error: assertr stopped execution

The output shows that the first two requirements are met, but execution halted at the third condition because the Age variable takes negative values. Let's correct this by creating a new data frame, dat2, that keeps only the rows with positive age values. This is done using the code below.

dat2 <- dat %>%
  filter(Age > 0)

dim(dat2)

Output:

[1] 590  10

The resulting data has 590 observations because ten rows containing negative values of age were removed. We'll recheck the combination of the data conditions, specified above, using the code below.

dat2 %>%
  verify(has_all_names("Loan_amount", "Income", "Marital_status", "Dependents", "Is_graduate", "Credit_score", "approval_status", "Age", "Sex", "Purpose")) %>%
  verify(nrow(.) > 120) %>%
  verify(Age > 0) %>%
  insist(within_n_sds(3), Income) %>%
  insist(within_n_sds(3), Loan_amount) %>%
  assert(in_set(0,1), approval_status) %>%
  assert_rows(num_row_NAs, within_bounds(0,6), everything()) %>%
  assert_rows(col_concat, is_uniq, Income, Dependents, approval_status, Age, Sex, Purpose, Loan_amount, Credit_score) %>%
  group_by(approval_status) %>%
  summarise(avg.Age = mean(Age))

Output:

Column 'Income' violates assertion 'within_n_sds(3)' 7 times
    verb redux_fn       predicate column index   value
1 insist       NA within_n_sds(3) Income   190 3173700
2 insist       NA within_n_sds(3) Income   255 5219600
3 insist       NA within_n_sds(3) Income   321 5333200
4 insist       NA within_n_sds(3) Income   324 6901700
5 insist       NA within_n_sds(3) Income   344 8444900
  [omitted 2 rows]

The output shows that there is no longer an error for negative age values, since those rows were dropped. Instead, the insist() function found seven records where the Income variable was not within three standard deviations of the mean. The output also prints the index of these records, making it easier to inspect them and treat them as outliers. In this way, we can keep validating our assumptions about the data and make corrections as needed.
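To see what "within three standard deviations of the mean" amounts to, the bounds can be worked out by hand in base R. The sketch below uses a made-up income vector (50 evenly spaced values plus one extreme value), not the loan dataset. It mirrors the idea behind insist(within_n_sds(3), Income) rather than assertr's internal code.

```r
# Hypothetical income values, made up for illustration only:
# 50 evenly spaced incomes plus one extreme value at position 51
income <- c(seq(50000, 90000, length.out = 50), 900000)

# Bounds implied by "within three standard deviations of the mean",
# computed from the column itself
lower <- mean(income) - 3 * sd(income)
upper <- mean(income) + 3 * sd(income)

# Positions of the values that fall outside those bounds
outlier_index <- which(income < lower | income > upper)

outlier_index          # positions to inspect
income[outlier_index]  # the offending values
```

Note that the mean and standard deviation are computed from data that still contains the extreme values, so a very large outlier widens the bounds. Re-running the check after treating the flagged rows may therefore flag additional records.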