
Validating Data Using Asserts in R Hands-on Practice
In this lab, Validating Data Using Asserts in R: Hands-on Practice, you'll dive into the essentials of data validation in R. Learn how to use assertions to clean and prepare datasets, employing the assertr package to ensure data integrity and accuracy. By the end, you'll be adept at performing quality checks and ready to tackle data validation challenges in your projects.

RStudio Guide
To get started, click on the 'workspace' folder in the bottom right pane of RStudio. Click on the file entitled "Step 1...". You may want to make the console pane smaller so that you have more room to work. You'll complete each task for Step 1 in that R Markdown file. Remember, you must run the cells for a task, using the play button at the top right of each cell, before moving on to the next task in the R Markdown file. Continue until you have completed all tasks in this step. Then, when you are ready to move on to the next step, come back and click on the file for the next step, and repeat until you have completed all tasks in all steps of the lab.
Exploring and Defining Asserts
To review the concepts covered in this step, please refer to the Introducing Asserts module of the Validating Data Using Asserts in R course.
Understanding and defining asserts is important because they are foundational to identifying and preparing problem data for analysis. This step will help learners grasp the concept of data cleaning and the role of assertions in this process.
Begin your journey into data validation with R by exploring the concept of assertions. In this step, you will:
- Write a simple R script to define a dataset with potential data quality issues.
- Use basic assert statements to identify rows with missing values.
Goal: Understand the basic concept of assertions and their role in data cleaning. Tools: R script, basic assert statements.
Task 1.1: Examine a Dataset with Quality Issues
The provided code creates a simple dataset named data_quality_issues that contains some rows with missing values. Print the dataset and examine its structure.
Hint: After defining the dataset, print it using the print function, and examine its structure using the str function.
Solution:
    data_quality_issues <- data.frame(
      Name = c('Alice', 'Bob', 'John', 'Diana', 'Evan'),
      Age = c(25, NA, 30, 22, 28),
      Score = c(85, 90, 88, NA, 95)
    )
    print(data_quality_issues)
    str(data_quality_issues)
Task 1.2: Verifying Data Integrity with assertr
Utilize the assertr package to verify that the Name column does not contain missing values.
Hint: Load the assertr package with the library() function. Use the assert function to check for NA values. The first argument should be the data frame, the second argument should check for NAs with not_na, and the third argument should be the Name column.
Solution:
    library(assertr)
    assert(data_quality_issues, not_na, Name)
Task 1.3: Detecting a Data Problem using assertr
Now use assert to check for missing values in the Age column. Notice the difference in output when assert encounters data that does not satisfy the predicate.
Hint: Use the assert function to check for NA values. The first argument should be the data frame, the second argument should check for NAs with not_na, and the third argument should be the Age column.
Solution:
    assert(data_quality_issues, not_na, Age)
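The contrast between a passing and a failing check can be sketched on a small stand-in data frame. This is a minimal illustration rather than part of the lab files: the toy data frame and its id and value columns are hypothetical, and tryCatch is used only so that the script keeps running after the failure.

```r
library(assertr)

# Hypothetical toy data (not the lab's dataset)
toy <- data.frame(
  id = c(1, 2, 3),
  value = c(10, NA, 30)
)

# Passing check: assert() hands the data frame back
checked <- assert(toy, not_na, id)

# Failing check: assert() stops with an error, captured here with tryCatch()
failed <- tryCatch(
  {
    assert(toy, not_na, value)
    FALSE
  },
  error = function(e) TRUE
)

print(failed)
```

Because a passing assert returns the data, checks can be chained with a pipe; a failing one halts execution and reports the violations it found.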
Validating Column Elements with assert
To review the concepts covered in this step, please refer to the Validating Elements in a Column module of the Validating Data Using Asserts in R course.
Validating elements in a column is crucial because it ensures data integrity and accuracy within individual data points. This step focuses on using the assert function from the assertr package to perform column-wise validation.
Dive deeper into column-wise data validation by:
- Loading the assertr package in R.
- Creating a dataset with specific column data types and values.
- Using the assert function to validate data elements in a column against predefined criteria.
Goal: Practice column-wise data validation using the assert function. Tools: R, assertr package, assert function.
Task 2.1: Load the assertr Package and Create a Dataset
Start by loading the assertr package to use its functions for data validation. Use the provided code to create a simple dataset that contains some data quality issues. Print the dataset and examine its structure.
Hint: Use the library function to load a package. The package you need to load is assertr. After defining the dataset, print it using the print function, and examine its structure using the str function.
Solution:
    # Load the assertr package
    library(assertr)
    # Provided code to create a simple dataset
    data_quality_issues <- data.frame(
      Name = c('Alice', 'Bob', 'John', 'Diana', 'Evan'),
      Age = c(25, NA, 30, 22, 28),
      Score = c(85, 90, 88, NA, 105)
    )
    # Print the dataset and examine its structure
    print(data_quality_issues)
    str(data_quality_issues)
Task 2.2: Validate Column Types
Use the assert function from the assertr package to validate that all Age values in the data frame are numeric.
Hint: To check if the data are numeric, use the is.numeric function as the predicate. Make sure to pass in the Age column as an argument to assert.
Solution:
    assert(data_quality_issues, is.numeric, Age)
Task 2.3: Validate Column Elements
Use the assert function from the assertr package to validate that all Score values in the data_quality_issues data frame are between 0 and 100.
Hint: Use within_bounds(0, 100) as the predicate. Remember that in assert function calls the predicate is the second argument and the column names come after it.
Solution:
    assert(data_quality_issues, within_bounds(0, 100), Score)
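It can help to see what within_bounds produces on its own. The sketch below calls the generated predicate directly on a hypothetical vector of scores (the scores and the in_range name are illustrative, not from the lab). It relies on the default allow.na = TRUE setting of within_bounds, under which missing values are not reported as violations.

```r
library(assertr)

# within_bounds() returns a predicate function for the given limits
in_range <- within_bounds(0, 100)

# Hypothetical scores, chosen to include an out-of-range value and an NA
scores <- c(85, 105, NA)

# Apply the generated predicate directly to the vector
result <- in_range(scores)
print(result)
```

If missing values should also count as failures, one option is a separate not_na check on the same column.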
Using insist for Column-Wide Validation
To review the concepts covered in this step, please refer to the Validating Elements Using the Column as a Whole module of the Validating Data Using Asserts in R course.
Validating data using information about the column as a whole is important for understanding the broader context of data points. This step introduces the insist function for such validations.
Expand your validation skills by:
- Exploring the insist function from the assertr package.
- Applying the insist function to ensure all values in a column meet the generated criteria.
Goal: Learn to use the insist function for column-wide validation based on aggregate data. Tools: R, assertr package, insist function.
Task 3.1: Loading the Required Packages
Before you can use the insist function for column-wide validation, you need to load the assertr package. In the following steps, we will also use the pipe operator %>%. Load the magrittr package to gain access to this operator.
Hint: Use the library function and pass the name of the package as the argument.
Solution:
    # Load the required packages
    library(assertr)
    library(magrittr)
Task 3.2: Applying the insist Function
Load the iris dataset using data(iris). Use the insist function to check whether the Sepal.Length column values in the iris dataset fall within 2 standard deviations of the mean.
Hint: Pipe the iris dataset to the insist() function. Use within_n_sds(2) as the first argument and the column name Sepal.Length as the second argument.
Solution:
    data(iris)
    iris %>% insist(within_n_sds(2), Sepal.Length)
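To make the criterion generated by within_n_sds(2) more concrete, the bounds can be computed by hand in base R. This is a rough sketch of the idea, not assertr's implementation; the variable names are illustrative.

```r
data(iris)

x <- iris$Sepal.Length

# Bounds at 2 standard deviations either side of the column mean
lower <- mean(x) - 2 * sd(x)
upper <- mean(x) + 2 * sd(x)

# Row indices of values outside the bounds (these would violate the check)
outside <- which(x < lower | x > upper)

print(c(lower = lower, upper = upper))
print(iris[outside, "Sepal.Length"])
```

Unlike assert, where the limits are fixed in advance, the limits here depend on the data in the column itself, which is why insist takes a predicate generator rather than a ready-made predicate.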
Row-wise Data Validation
To review the concepts covered in this step, please refer to the Validating Rows in a Dataset module of the Validating Data Using Asserts in R course.
Row-wise validation is essential for ensuring that data across multiple columns meets certain criteria. This step covers the use of the assert_rows and insist_rows functions for comprehensive row-wise checks.
In this step, you'll learn about row-wise data validation by:
- Understanding the difference between the assert_rows and insist_rows functions.
- Applying row-wise functions to validate data across multiple columns in a dataset.
- Exploring row reduction functions and predicate generators for advanced validation scenarios.
Goal: Gain proficiency in row-wise data validation techniques. Tools: R, assertr package, assert_rows, insist_rows functions.
Task 4.1: Loading the Required Packages
First, load the assertr package. In the following steps, we will also use the pipe operator %>%. Load the magrittr package to gain access to this operator.
Hint: Use the library function and pass the name of the package as the argument.
Solution:
    # Load the required packages
    library(assertr)
    library(magrittr)
Task 4.2: Understanding assert_rows and insist_rows
To effectively use assert_rows and insist_rows, it's important to understand their differences and how they work. Use the ? operator to view the documentation for each function.
Hint: Use ?assert_rows to view the documentation for assert_rows, and ?insist_rows for insist_rows.
Solution:
    # View documentation for assert_rows
    ?assert_rows
    # View documentation for insist_rows
    ?insist_rows
Task 4.3: Checking Missing Values with Row Reduction Functions
Load the airquality dataset using the data() function. Check for any rows that have more than one NA value.
Hint: Use the assert_rows function with the num_row_NAs row reduction function. Check that the number of NAs is within_bounds 0 and 1. Run over all columns with everything().
Solution:
    data(airquality)
    airquality %>% assert_rows(num_row_NAs, within_bounds(0, 1), everything())
    # 2 rows have more than one missing value
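The row reduction step can be pictured as collapsing each row to a single number before the predicate is applied. A base R sketch of the same count, offered as an approximation of what num_row_NAs computes rather than its actual source, looks like this (variable names are illustrative):

```r
data(airquality)

# Number of missing values in each row
na_per_row <- rowSums(is.na(airquality))

# Row positions with more than one NA
rows_over_limit <- unname(which(na_per_row > 1))

print(airquality[rows_over_limit, ])
```

The rows printed here are the ones the assert_rows call reports as violations.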
Task 4.4: Validating Unique Data with Row Reduction Functions
With the airquality dataset, use assert_rows to ensure that there are no duplicate dates in the Month and Day columns (that is, that all combinations of Month and Day are unique).
Hint: Use the assert_rows function. Combine the Month and Day columns using the col_concat reduction function and verify with the is_uniq predicate function.
Solution:
    airquality %>% assert_rows(col_concat, is_uniq, c(Month, Day))
    # all Month/Day combinations are unique
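As a cross-check, the same uniqueness question can be asked in base R by building one key per row and looking for repeats. This is a sketch of the idea only; the date_key name and the "-" separator are arbitrary choices and do not reflect how col_concat joins values internally.

```r
data(airquality)

# Build one key per row from Month and Day (the separator is an arbitrary choice)
date_key <- paste(airquality$Month, airquality$Day, sep = "-")

# anyDuplicated() returns 0 when no key is repeated
first_duplicate <- anyDuplicated(date_key)
print(first_duplicate)
```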
Validating Data Frames and Quality Checks
To review the concepts covered in this step, please refer to the Validating a Data Frame module of the Validating Data Using Asserts in R course.
Validating entire data frames is crucial for ensuring the overall integrity and quality of datasets. This step focuses on using assertions to check properties of data frames and perform comprehensive quality checks.
Achieve a higher level of data validation by:
- Learning to use the verify function from the assertr package to validate properties of entire datasets.
- Exploring the use of the all.equal and identical functions for checking equality between objects.
Goal: Master validating entire data frames and performing quality checks. Tools: R, assertr package, verify, all.equal, identical functions.
Task 5.1: Loading the Required Packages
First, load the assertr package. In the following steps, we will also use the pipe operator %>% and the mutate function. Load the magrittr and dplyr packages to gain access to these utilities.
Hint: Use the library function and pass the name of the package as the argument.
Solution:
    # Load the required packages
    library(assertr)
    library(magrittr)
    library(dplyr)
Task 5.2: Using verify to Validate Data Frame Properties
Load the airquality dataset with the data function. Use the verify function from the assertr package to check that the average of the column Temp in the data frame airquality is greater than 70. This is a basic validation to ensure the data meets a specific condition.
Hint: Pipe the airquality dataset into the verify function with the condition mean(Temp) > 70 as the argument.
Solution:
    data(airquality)
    airquality %>% verify(mean(Temp) > 70)
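Because verify returns the data when its condition holds, several checks can be chained in one pipeline. The sketch below is one possible arrangement: only the mean(Temp) > 70 condition comes from this task, while the column-name check with assertr's has_all_names helper and the Temp > 0 condition are illustrative assumptions.

```r
library(assertr)
library(magrittr)

data(airquality)

checked <- airquality %>%
  verify(has_all_names("Temp", "Month", "Day")) %>%  # required columns exist
  verify(Temp > 0) %>%                               # every Temp value is positive
  verify(mean(Temp) > 70)                            # the column average exceeds 70

print(nrow(checked))
```

If any condition fails, the pipeline stops at that point, so later steps never see data that did not pass the earlier checks.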
Task 5.3: Comparing Data Frames with all.equal
The provided code loads the iris dataset, then creates a copy and modifies one column. Compare iris with this new data frame to check if the values are considered equal using the all.equal function.
Hint: Use the all.equal function with iris and iris2 as arguments to check for equality between the two data frames.
Solution:
    # Provided code to copy and modify a data frame
    data(iris)
    iris2 <- iris %>% mutate(Sepal.Length = Sepal.Length + 1e-8)
    # Compare the data frames
    all.equal(iris, iris2)
    # The tiny float difference is ignored by all.equal
Task 5.4: Identifying Exact Matches with identical
Use the identical function to check if the two data frames iris and iris2 are exactly the same. This is a stricter comparison than all.equal and useful for ensuring data integrity.
Hint: Use the identical function with iris and iris2 as arguments. This function returns TRUE if the two objects are exactly the same, and FALSE otherwise.
Solution:
    identical(iris, iris2)
    # The data frames are not exactly identical
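The difference between the two comparisons comes down to numeric tolerance, which a pair of small vectors can illustrate. The vectors a and b below are hypothetical stand-ins for the two data frames.

```r
# Hypothetical vectors that differ by a tiny amount
a <- c(1, 2, 3)
b <- a + 1e-10

# all.equal() tolerates small numeric differences by default
approx_equal <- isTRUE(all.equal(a, b))

# identical() requires an exact match
exact_equal <- identical(a, b)

# Setting tolerance = 0 makes all.equal() report the difference too
strict_equal <- isTRUE(all.equal(a, b, tolerance = 0))

print(c(approx = approx_equal, exact = exact_equal, strict = strict_equal))
```

When objects differ, all.equal returns a character description of the difference rather than FALSE, so wrapping it in isTRUE is the usual way to use it inside a condition.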