The quality of data plays a crucial role in machine learning. Poor data introduces errors that degrade both the analysis and the performance of the resulting model. Often, these errors are difficult to detect and surface late in the analysis. Worse still, some errors are never detected and carry through the analysis, producing inaccurate results. The solution to this problem lies in data validation. This is where asserts come in: debugging aids that test a condition and can be used to check data programmatically.
In this guide, you will learn to validate data using asserts in R. Specifically, we'll be using the assertr package, which provides a variety of functions designed to verify assumptions about data early in a data analysis pipeline.
In this guide, we'll be using a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:
Marital_status: Whether the applicant is married ("Yes") or not ("No").
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
Income: Annual income of the applicant (in USD).
Loan_amount: Loan amount (in USD) for which the application was submitted.
Credit_score: Whether the applicant's credit score is satisfactory or not.
approval_status: Whether the loan application was approved or not, stored as the binary values 0 and 1.
Age: The applicant's age in years.
Sex: Whether the applicant was a male ("M") or a female ("F").
Dependents: Number of dependents in the applicant's family.
Purpose: Purpose of applying for the loan.
Let's start by loading the required libraries and the data.
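# readr reads the CSV file, assertr supplies the assertion functions,
# magrittr provides the %>% pipe, and dplyr handles grouping and summaries
library(readr)
library(assertr)
library(assertive)
library(magrittr)
library(dplyr)

# Load the loan applicant data
dat <- read_csv("dataset.csv")

# Number of rows and columns
dim(dat)
Output:
[1] 600 10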
The example below shows why asserts matter. We summarize the average age of the applicants, grouped by their approval status. The first line of code below converts the approval_status variable into a factor, while the second performs the required computation.
# Treat the approval status as a categorical variable
dat$approval_status <- as.factor(dat$approval_status)

# Average age of the applicants per approval status
dat %>%
  group_by(approval_status) %>%
  summarise(avg_age = mean(Age))
Output:
approval_status  avg_age
<fctr>             <dbl>

0               47.40000
1               48.61463
Nothing seems to be wrong with the output above, but let's look at the output of the summary() function for the Age variable.
summary(dat$Age)
Output:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -10.00   36.00   50.00   48.23   61.00   76.00
From the output above, we can see that some of the applicants' ages are negative, which is not possible. This is incorrect data, but the error was not detected in the previous code, where we performed the group-by operation. This is where assertr's verify() function can be used to ensure that such mistakes don't go unidentified.
The verify() function takes a data frame (dat) and a logical expression (Age >= 0). It then evaluates that expression against the provided data. If the expression does not hold for every row, verify() raises an error and stops the rest of the pipeline. In this example, the lines of code below perform this task.
dat %>%
  verify(Age >= 0) %>%
  group_by(approval_status) %>%
  summarise(avg_age = mean(Age))
Output:
verification [Age >= 0] failed! (10 failures)

     verb redux_fn predicate column index value
1  verify       NA  Age >= 0     NA     1    NA
2  verify       NA  Age >= 0     NA     2    NA
3  verify       NA  Age >= 0     NA     3    NA
4  verify       NA  Age >= 0     NA     4    NA
5  verify       NA  Age >= 0     NA   193    NA
6  verify       NA  Age >= 0     NA   194    NA
7  verify       NA  Age >= 0     NA   195    NA
8  verify       NA  Age >= 0     NA   199    NA
9  verify       NA  Age >= 0     NA   209    NA
10 verify       NA  Age >= 0     NA   600    NA

Error: assertr stopped execution
The output shows ten rows in which the age is negative, each identified by its index. Finally, the message Error: assertr stopped execution shows that the execution was stopped, which is why the desired summary was not displayed.
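The stopping behaviour can be reproduced without the loan dataset. The sketch below is only an illustration: `toy` is a made-up three-row data frame, and the example assumes assertr's `success_logical` and `error_logical` handlers, which make `verify()` return `TRUE` or `FALSE` instead of returning the data or stopping.

```r
library(assertr)

# Hypothetical toy data, not the loan dataset: one age is negative.
toy <- data.frame(Age = c(25, -3, 40))

# Default behaviour: a failed check raises an error, so nothing after it runs.
stopped <- tryCatch(
  {
    verify(toy, Age >= 0)
    FALSE                      # reached only if the check passes
  },
  error = function(e) TRUE     # reached when verify() stops execution
)

# With logical handlers, verify() reports the outcome as TRUE or FALSE.
check_failed <- verify(toy, Age >= 0,
                       success_fun = success_logical,
                       error_fun = error_logical)
check_passed <- verify(toy, Age >= -10,
                       success_fun = success_logical,
                       error_fun = error_logical)
```

The logical form can be convenient inside an `if` statement or a test, while the default stopping form suits pipelines where bad data should never reach the next step.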
The same task can be performed using assertr's assert() function. In the code below, assert() takes the data, dat, and applies a predicate function, within_bounds(0, Inf), to the column of interest, Age. Because within_bounds() includes its endpoints by default, this range accepts zero and all positive values; it can be altered as necessary. The code below raises an error when the condition is not met.
dat %>%
  assert(within_bounds(0, Inf), Age) %>%
  group_by(approval_status) %>%
  summarise(avg_age = mean(Age))
Output:
Column 'Age' violates assertion 'within_bounds(0, Inf)' 10 times
    verb redux_fn             predicate column index value
1 assert       NA within_bounds(0, Inf)    Age     1    -2
2 assert       NA within_bounds(0, Inf)    Age     2    -3
3 assert       NA within_bounds(0, Inf)    Age     3    -4
4 assert       NA within_bounds(0, Inf)    Age     4    -5
5 assert       NA within_bounds(0, Inf)    Age   193    -5
  [omitted 5 rows]

Error: assertr stopped execution
The first line of the output, Column 'Age' violates assertion 'within_bounds(0, Inf)' 10 times, indicates that there are ten rows with negative age values.
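How the bounds are treated matters when a column must be strictly positive. The sketch below uses a made-up data frame, `toy`, containing an age of exactly zero, and compares the default bounds with `include.lower = FALSE`. It also shows a custom predicate; `is_plausible_age` is a hypothetical helper, and its 120-year ceiling is an arbitrary choice for illustration.

```r
library(assertr)

# Hypothetical toy data, not the loan dataset: one age is exactly zero.
toy <- data.frame(Age = c(0, 31, 58))

# The default bounds include the endpoints, so the zero is accepted.
inclusive_ok <- assert(toy, within_bounds(0, Inf), Age,
                       success_fun = success_logical,
                       error_fun = error_logical)

# Excluding the lower bound rejects the zero.
strict_ok <- assert(toy, within_bounds(0, Inf, include.lower = FALSE), Age,
                    success_fun = success_logical,
                    error_fun = error_logical)

# A function that returns TRUE or FALSE for a single value can act as a predicate.
is_plausible_age <- function(x) x > 0 && x <= 120
custom_ok <- assert(toy, is_plausible_age, Age,
                    success_fun = success_logical,
                    error_fun = error_logical)
```

A custom predicate is a reasonable choice when a rule is specific to the dataset and no built-in predicate expresses it directly.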
Validating data one condition at a time can be time-consuming and inefficient. A more efficient way is to use the family of assert functions and chain the checks together in a single pipeline, as shown in the example below.
Let's assume we want to validate the following conditions in our data.
1. The data contains all ten variables described earlier. This is achieved with the verify(has_all_names()) command in the code below.
2. The data contains more than 120 observations. This is achieved with the verify(nrow(.) > 120) command below.
3. Age only takes positive values. This is achieved with the verify(Age > 0) command below.
4. Income and Loan_amount should have values within three standard deviations of their respective means. This is achieved with the insist(within_n_sds(3), Income) and insist(within_n_sds(3), Loan_amount) commands in the code below.
5. approval_status contains only the binary values zero and one. This is achieved with the assert(in_set(0,1), approval_status) command in the code below.
6. Each row contains at most six missing values. This is achieved with the assert_rows(num_row_NAs, within_bounds(0,6), everything()) command below.
7. Each row is unique with respect to the combination of the Income, Dependents, approval_status, Age, Sex, Purpose, Loan_amount, and Credit_score variables. This is achieved with the assert_rows(col_concat, is_uniq,...) command below.
dat %>%
  verify(has_all_names("Loan_amount", "Income", "Marital_status", "Dependents",
                       "Is_graduate", "Credit_score", "approval_status", "Age",
                       "Sex", "Purpose")) %>%
  verify(nrow(.) > 120) %>%
  verify(Age > 0) %>%
  insist(within_n_sds(3), Income) %>%
  insist(within_n_sds(3), Loan_amount) %>%
  assert(in_set(0,1), approval_status) %>%
  assert_rows(num_row_NAs, within_bounds(0,6), everything()) %>%
  assert_rows(col_concat, is_uniq, Income, Dependents, approval_status, Age,
              Sex, Purpose, Loan_amount, Credit_score) %>%
  group_by(approval_status) %>%
  summarise(avg.Age = mean(Age))
Output:
verification [Age > 0] failed! (10 failures)

     verb redux_fn predicate column index value
1  verify       NA   Age > 0     NA     1    NA
2  verify       NA   Age > 0     NA     2    NA
3  verify       NA   Age > 0     NA     3    NA
4  verify       NA   Age > 0     NA     4    NA
5  verify       NA   Age > 0     NA   193    NA
6  verify       NA   Age > 0     NA   194    NA
7  verify       NA   Age > 0     NA   195    NA
8  verify       NA   Age > 0     NA   199    NA
9  verify       NA   Age > 0     NA   209    NA
10 verify       NA   Age > 0     NA   600    NA

Error: assertr stopped execution
The output shows that the first two requirements are met, but the execution was halted at the third condition because the Age variable takes negative values. Let's correct this by creating a new data frame, dat2, which keeps only the rows with positive age values. This is done using the code below.
dat2 <- dat %>%
  filter(Age > 0)

dim(dat2)
Output:
[1] 590 10
The resulting data has 590 observations because the ten rows containing negative age values were removed. We'll recheck the full set of conditions specified above using the code below.
dat2 %>%
  verify(has_all_names("Loan_amount", "Income", "Marital_status", "Dependents",
                       "Is_graduate", "Credit_score", "approval_status", "Age",
                       "Sex", "Purpose")) %>%
  verify(nrow(.) > 120) %>%
  verify(Age > 0) %>%
  insist(within_n_sds(3), Income) %>%
  insist(within_n_sds(3), Loan_amount) %>%
  assert(in_set(0,1), approval_status) %>%
  assert_rows(num_row_NAs, within_bounds(0,6), everything()) %>%
  assert_rows(col_concat, is_uniq, Income, Dependents, approval_status, Age,
              Sex, Purpose, Loan_amount, Credit_score) %>%
  group_by(approval_status) %>%
  summarise(avg.Age = mean(Age))
Output:
Column 'Income' violates assertion 'within_n_sds(3)' 7 times
    verb redux_fn       predicate column index   value
1 insist       NA within_n_sds(3) Income   190 3173700
2 insist       NA within_n_sds(3) Income   255 5219600
3 insist       NA within_n_sds(3) Income   321 5333200
4 insist       NA within_n_sds(3) Income   324 6901700
5 insist       NA within_n_sds(3) Income   344 8444900
  [omitted 2 rows]
The output shows that there is no longer an error for negative age values, since those rows were dropped. Instead, the insist() function found seven records where the Income variable was not within three standard deviations of the mean. The output also prints the index of these records, making it easier for us to treat them as outliers. In this way, we can go on validating the data assumptions and incorporating corrections as needed.
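Fixing one failure per run can be slow when several assumptions are violated at once. assertr also provides chain_start() and chain_end(), which let every check in between run and hand the collected failures to a single error function at the end. The sketch below is an illustration on a made-up 21-row data frame, `toy`, with one negative age and one extreme income; it assumes the `error_append` handler, which returns the data with the failures attached as the `assertr_errors` attribute.

```r
library(assertr)
library(magrittr)  # provides the %>% pipe

# Hypothetical toy data, not the loan dataset: 21 applicants,
# one with a negative age and one with an extreme income.
toy <- data.frame(
  Age    = c(-5, 30:49),
  Income = c(seq(40000, 59000, by = 1000), 5000000)
)

# Between chain_start() and chain_end(), a failed check is recorded
# instead of stopping the pipeline, so all three checks run.
checked <- toy %>%
  chain_start() %>%
  verify(Age > 0) %>%                  # fails: one negative age
  insist(within_n_sds(3), Income) %>%  # fails: one income beyond 3 SDs
  assert(not_na, Age, Income) %>%      # passes: no missing values
  chain_end(error_fun = error_append)

# Each failed check contributes one entry to the attached list.
failures <- attr(checked, "assertr_errors")
length(failures)
```

Leaving `error_fun` at its default in chain_end() would instead print all collected failures and then stop, which may be preferable in a production pipeline.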
In this guide, you have learned methods of validating data using asserts in R. You have applied these assertions using assertr functions such as verify(), assert(), insist(), and assert_rows(). This knowledge will help you perform proper data validation, resulting in more reliable data science and analytics results.
To learn more about Data Science with R, please refer to the following guides: