Validating Data Using Asserts in R: Hands-on Practice
In this lab, Validating Data Using Asserts in R: Hands-on Practice, you'll dive into the essentials of data validation in R. Learn how to use assertions to clean and prepare datasets, employing the assertr package to ensure data integrity and accuracy. By the end, you'll be adept at performing quality checks and ready to tackle data validation challenges in your projects.
Exploring and Defining Asserts
RStudio Guide
To get started, click on the 'workspace' folder in the bottom right pane of RStudio. Click on the file entitled "Step 1...". You may want to drag the console pane to be smaller so that you have more room to work. You'll complete each task for Step 1 in that R Markdown file. Remember, you must run the cells for a task, using the play button at the top right of each cell, before moving on to the next task in the R Markdown file. Continue until you have completed all tasks in this step. When you are ready to move on to the next step, come back and click on the file for that step, and repeat until you have completed all tasks in all steps of the lab.
To review the concepts covered in this step, please refer to the Introducing Asserts module of the Validating Data Using Asserts in R course.
Understanding and defining asserts is important because they are foundational to identifying and preparing problem data for analysis. This step will help learners grasp the concept of data cleaning and the role of assertions in this process.
Begin your journey into data validation with R by exploring the concept of assertions. In this step, you will:
- Write a simple R script to define a dataset with potential data quality issues.
- Use basic assert statements to identify rows with missing values.
Goal: Understand the basic concept of assertions and their role in data cleaning. Tools: R script, basic assert statements.
Task 1.1: Examine a Dataset with Quality Issues
The provided code creates a simple dataset named `data_quality_issues` that contains some rows with missing values. Print the dataset and examine its structure.
🔍 Hint
After defining the dataset, print it using the `print` function, and examine its structure using the `str` function.
🔑 Solution
    data_quality_issues <- data.frame(
      Name = c('Alice', 'Bob', NA, 'Diana', 'Evan'),
      Age = c(25, NA, 30, 22, 28),
      Score = c(85, 90, 88, NA, 95)
    )
    print(data_quality_issues)
    str(data_quality_issues)
Task 1.2: Verifying Data Integrity with assertr
Use the `assertr` package to verify that the `Name` column does not contain missing values.
🔍 Hint
Load the `assertr` package with the `library()` function. Use the `assert` function to check for NA values. The first argument should be the data frame, the second argument should check for NAs with `not_na`, and the third argument should be the `Name` column.
🔑 Solution
    library(assertr)
    assert(data_quality_issues, not_na, Name)
Task 1.3: Detecting a Data Problem using assertr
Now use `assert` to check for missing values in the `Age` column. Notice the difference in output when `assert` encounters data that does not satisfy the predicate.
🔍 Hint
Use the `assert` function to check for NA values. The first argument should be the data frame, the second argument should check for NAs with `not_na`, and the third argument should be the `Age` column.
🔑 Solution
    assert(data_quality_issues, not_na, Age)
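The contrast between a passing and a failing assertion can be sketched as follows. This is a minimal illustration rather than part of the lab files: the `people` data frame and its columns are hypothetical, and `tryCatch()` is used only so the failing call does not halt the script. Note that in the Task 1.1 dataset exactly as printed above, `Name` also contains an NA, so the Task 1.2 check would stop with an error as well.

```r
library(assertr)

# Hypothetical example data: `id` is complete, `age` has one missing value
people <- data.frame(
  id  = c(1, 2, 3),
  age = c(25, NA, 30)
)

# Passing check: assert() hands the data frame back so it can be used further
checked <- assert(people, not_na, id)

# Failing check: assert() reports the violations and stops with an error;
# tryCatch() captures that error so the script can continue
result <- tryCatch(
  assert(people, not_na, age),
  error = function(e) "assertion failed"
)
print(result)
```

Because a passing `assert` returns its input, the call can sit in the middle of a data pipeline without changing what flows through it.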
Validating Column Elements with assert
To review the concepts covered in this step, please refer to the Validating Elements in a Column module of the Validating Data Using Asserts in R course.
Validating elements in a column is crucial because it ensures data integrity and accuracy within individual data points. This step focuses on using the `assert` function from the `assertr` package to perform column-wise validation.
Dive deeper into column-wise data validation by:
- Loading the `assertr` package in R.
- Creating a dataset with specific column data types and values.
- Using the `assert` function to validate data elements in a column against predefined criteria.
Goal: Practice column-wise data validation using the `assert` function. Tools: R, `assertr` package, `assert` function.
Task 2.1: Load the `assertr` Package and Create a Dataset
Start by loading the `assertr` package to use its functions for data validation. Use the provided code to create a simple dataset that contains some data quality issues. Print the dataset and examine its structure.
🔍 Hint
Use the `library` function to load a package. The package you need to load is `assertr`. After defining the dataset, print it using the `print` function, and examine its structure using the `str` function.
🔑 Solution
    # Load the assertr package
    library(assertr)
    # Provided code to create a simple dataset
    data_quality_issues <- data.frame(
      Name = c('Alice', 'Bob', 'John', 'Diana', 'Evan'),
      Age = c(25, NA, 30, 22, 28),
      Score = c(85, 90, 88, NA, 105)
    )
    # Print the dataset and examine its structure
    print(data_quality_issues)
    str(data_quality_issues)
Task 2.2: Validate Column Types
Use the `assert` function from the `assertr` package to validate that all `Age` values in the data frame are numeric.
🔍 Hint
To check if the data are numeric, use the `is.numeric` function as the predicate. Make sure to pass in the `Age` column as an argument to `assert`.
🔑 Solution
    assert(data_quality_issues, is.numeric, Age)
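A type check and a missing-value check answer different questions, which a base R cross-check makes visible. The sketch below rebuilds the Task 2.1 dataset and uses only base functions; it is an illustration alongside the lab, not part of the lab files.

```r
# Same columns and values as the Task 2.1 dataset
data_quality_issues <- data.frame(
  Name  = c('Alice', 'Bob', 'John', 'Diana', 'Evan'),
  Age   = c(25, NA, 30, 22, 28),
  Score = c(85, 90, 88, NA, 105)
)

# One TRUE/FALSE per column: is the column stored as a numeric vector?
numeric_cols <- vapply(data_quality_issues, is.numeric, logical(1))
print(numeric_cols)

# The missing entry in a numeric column is still numeric, so a type
# check passes even though the value itself is NA
age_na_is_numeric <- is.numeric(data_quality_issues$Age[2])
print(age_na_is_numeric)
```

This is why a column can satisfy `is.numeric` and still need a separate `not_na` check.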
Task 2.3: Validate Column Elements
Use the `assert` function from the `assertr` package to validate that all `Score` values in the `data_quality_issues` data frame are between 0 and 100.
🔍 Hint
Use `within_bounds(0, 100)` as the predicate. Remember to specify the column name after the predicate in `assert` function calls.
🔑 Solution
    assert(data_quality_issues, within_bounds(0, 100), Score)
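`within_bounds(0, 100)` does not check anything by itself; it generates a predicate function that `assert` then applies to the column. The sketch below calls the generated predicate directly and then uses it inside `assert`. The `scores` data frame is hypothetical, and `success_logical` and `error_logical` are used here only so the outcome comes back as `TRUE` or `FALSE` instead of stopping the script.

```r
library(assertr)

# within_bounds() is a predicate generator: it returns a function
in_range <- within_bounds(0, 100)

# Apply the generated predicate directly to a few hypothetical scores
flags <- in_range(c(85, 90, 105))
print(flags)

# Use the same kind of predicate inside assert(); the *_logical handlers
# turn the outcome into TRUE (all rows pass) or FALSE (some row fails)
scores <- data.frame(Score = c(85, 90, 105))
all_valid <- assert(scores, within_bounds(0, 100), Score,
                    success_fun = success_logical,
                    error_fun = error_logical)
print(all_valid)
```

With the default handlers, the same failing check would instead stop with an error that lists the offending row and value.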
Using insist for Column-Wide Validation
To review the concepts covered in this step, please refer to the Validating Elements Using the Column as a Whole module of the Validating Data Using Asserts in R course.
Validating data using information about the column as a whole is important for understanding the broader context of data points. This step introduces the `insist` function for such validations.
Expand your validation skills by:
- Exploring the `insist` function from the `assertr` package.
- Applying the `insist` function to ensure all values in a column meet the generated criteria.
Goal: Learn to use the `insist` function for column-wide validation based on aggregate data. Tools: R, `assertr` package, `insist` function.
Task 3.1: Loading the Required Packages
Before you can use the `insist` function for column-wide validation, you need to load the `assertr` package. In the following steps, we will also use the pipe operator `%>%`. Load the `magrittr` package to gain access to this operator.
🔍 Hint
Use the `library` function and pass the name of the package as the argument.
🔑 Solution
    # Load the required packages
    library(assertr)
    library(magrittr)
Task 3.2: Applying the `insist` Function
Load the iris dataset using `data(iris)`. Use the `insist` function to check whether the `Sepal.Length` column values in the iris dataset fall within 2 standard deviations from the mean.
🔍 Hint
Pipe the iris dataset to the `insist()` function. Use `within_n_sds(2)` as the first argument and the column name `Sepal.Length` as the second argument.
🔑 Solution
    iris %>% insist(within_n_sds(2), Sepal.Length)
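To see what the check above is testing, the bounds can be recomputed by hand. The sketch below uses base R only and assumes that `within_n_sds(2)` derives its limits as the column mean plus or minus two sample standard deviations; the variable names are illustrative.

```r
data(iris)
x <- iris$Sepal.Length

# Bounds derived from the column as a whole
lower <- mean(x) - 2 * sd(x)
upper <- mean(x) + 2 * sd(x)

# Values that fall outside the bounds
outliers <- x[x < lower | x > upper]
print(c(lower = lower, upper = upper))
print(outliers)
```

This is the key difference from `assert`: the criterion is not fixed in advance but generated from the data in the column being checked.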
Row-wise Data Validation
To review the concepts covered in this step, please refer to the Validating Rows in a Dataset module of the Validating Data Using Asserts in R course.
Row-wise validation is essential for ensuring that data across multiple columns meets certain criteria. This step covers the use of
assert_rowsandinsist_rowsfunctions for comprehensive row-wise checks.In this task, you'll learn about row-wise data validation by:
- Understanding the difference between
assert_rowsandinsist_rowsfunctions. - Applying row-wise functions to validate data across multiple columns in a dataset.
- Exploring row reduction functions and predicate generators for advanced validation scenarios.
Goal: Gain proficiency in row-wise data validation techniques. Tools: R,
assertrpackage,assert_rows,insist_rowsfunctions.
Task 4.1: Loading the Required Packages
First, load the `assertr` package. In the following steps, we will also use the pipe operator `%>%`. Load the `magrittr` package to gain access to this operator.
🔍 Hint
Use the `library` function and pass the name of the package as the argument.
🔑 Solution
    # Load the required packages
    library(assertr)
    library(magrittr)
Task 4.2: Understanding assert_rows and insist_rows
To effectively use `assert_rows` and `insist_rows`, it's important to understand their differences and how they work. Use the `?` operator to view the documentation for each function.
🔍 Hint
Use `?assert_rows` to view the documentation for `assert_rows`, and `?insist_rows` for `insist_rows`.
🔑 Solution
    # View documentation for assert_rows
    ?assert_rows
    # View documentation for insist_rows
    ?insist_rows
Task 4.3: Checking Missing Values with Row Reduction Functions
Load the `airquality` dataset using the `data()` function. Check for any rows that have more than one NA value.
🔍 Hint
Use the `assert_rows` function with the `num_row_NAs` row reduction function. Check that the number of NAs per row is between 0 and 1 with `within_bounds(0, 1)`. Run over all columns with `everything()`.
🔑 Solution
    data(airquality)
    airquality %>%
      assert_rows(num_row_NAs, within_bounds(0, 1), everything())
    # The assertion fails: 2 rows have more than one missing value
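The row reduction step can be reproduced in base R to confirm what the assertion reports. The sketch below assumes that `num_row_NAs` reduces each row to its count of missing values; the variable names are illustrative.

```r
data(airquality)

# Reduce each row to a single number: how many of its values are NA
na_per_row <- rowSums(is.na(airquality))

# Rows that violate the "at most one NA" rule
rows_over_limit <- which(na_per_row > 1)
print(airquality[rows_over_limit, ])
```

Seen this way, `assert_rows` is a two-stage check: a reduction function turns every row into one value, and an ordinary predicate is then applied to those values.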
Task 4.4: Validating Unique Data with Row Reduction Functions
With the `airquality` dataset, use `assert_rows` to ensure that there are no duplicate dates in the columns `Month` and `Day` (that is, that all combinations of `Month` and `Day` are unique).
🔍 Hint
Use the `assert_rows` function. Combine the `Month` and `Day` columns using the `col_concat` reduction function and verify with the `is_uniq` predicate function.
🔑 Solution
    airquality %>%
      assert_rows(col_concat, is_uniq, c(Month, Day))
    # All Month/Day combinations are unique
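The same uniqueness check can be sketched in base R. It assumes that `col_concat` joins the selected values of each row into one string and that `is_uniq` flags any string that occurs more than once; the space separator and variable names are choices made for this illustration.

```r
data(airquality)

# Reduce each row to one key built from Month and Day
month_day <- paste(airquality$Month, airquality$Day)

# anyDuplicated() returns 0 when every key occurs exactly once
has_duplicates <- anyDuplicated(month_day) > 0
print(has_duplicates)
```

A separator matters when concatenating keys by hand: without one, month 1 and day 11 would collide with month 11 and day 1.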
Validating Data Frames and Quality Checks
To review the concepts covered in this step, please refer to the Validating a Data Frame module of the Validating Data Using Asserts in R course.
Validating entire data frames is crucial for ensuring the overall integrity and quality of datasets. This step focuses on using assertions to check properties of data frames and perform comprehensive quality checks.
Achieve a higher level of data validation by:
- Learning to use the `verify` function from the `assertr` package to validate properties of entire datasets.
- Exploring the use of the `all.equal` and `identical` functions for checking equality between objects.
Goal: Master the validation of entire data frames and performing quality checks. Tools: R, `assertr` package, `verify`, `all.equal`, `identical` functions.
Task 5.1: Loading the Required Packages
First, load the `assertr` package. In the following steps, we will also use the pipe operator `%>%` and the `mutate` function. Load the `magrittr` and `dplyr` packages to gain access to these utilities.
🔍 Hint
Use the `library` function and pass the name of the package as the argument.
🔑 Solution
    # Load the required packages
    library(assertr)
    library(magrittr)
    library(dplyr)
Task 5.2: Using verify to Validate Data Frame Properties
Load the airquality dataset with the `data` function. Use the `verify` function from the `assertr` package to check that the average of the column `Temp` in the data frame `airquality` is greater than 70. This is a basic validation to ensure the data meets a specific condition.
🔍 Hint
Pipe the airquality dataset into the `verify` function with the condition `mean(Temp) > 70` as the argument.
🔑 Solution
    data(airquality)
    airquality %>% verify(mean(Temp) > 70)
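Unlike `assert`, `verify` takes a plain logical expression that is evaluated with the data frame's columns in scope. The sketch below shows an aggregate condition and an element-wise one. The second condition (months 5 through 9) is an extra example chosen for this illustration, and the `success_logical` and `error_logical` handlers are used only so that each check returns `TRUE` or `FALSE` rather than the data frame or an error.

```r
library(assertr)
data(airquality)

# Aggregate condition: a single TRUE/FALSE for the whole data frame
temp_ok <- verify(airquality, mean(Temp) > 70,
                  success_fun = success_logical,
                  error_fun = error_logical)

# Element-wise condition: one TRUE/FALSE per row, all of which must hold
months_ok <- verify(airquality, Month >= 5 & Month <= 9,
                    success_fun = success_logical,
                    error_fun = error_logical)

print(c(temp_ok = temp_ok, months_ok = months_ok))
```

With the default handlers, a passing `verify` returns the data frame, so several checks can be chained with `%>%` ahead of the analysis code.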
Task 5.3: Comparing Data Frames with all.equal
The provided code loads the iris dataset, then creates a copy and modifies one column. Compare `iris` with this new data frame to check if the values are considered equal using the `all.equal` function.
🔍 Hint
Use the `all.equal` function with `iris` and `iris2` as arguments to check for equality between the two data frames.
🔑 Solution
    # Provided code to copy and modify a data frame
    data(iris)
    iris2 <- iris %>% mutate(Sepal.Length = Sepal.Length + 1e-8)
    # Compare the data frames
    all.equal(iris, iris2)
    # The tiny floating-point difference is ignored by all.equal
Task 5.4: Identifying Exact Matches with identical
Use the `identical` function to check if the two data frames `iris` and `iris2` are exactly the same. This is a stricter comparison than `all.equal` and useful for ensuring data integrity.
🔍 Hint
Use the `identical` function with `iris` and `iris2` as arguments. This function returns `TRUE` if the two objects are exactly the same, and `FALSE` otherwise.
🔑 Solution
    identical(iris, iris2)
    # The data frames are not exactly identical
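The difference between the two comparisons is easiest to see on small vectors. The sketch below uses base R only, with illustrative values. One practical detail: when objects differ by more than its tolerance, `all.equal` returns a character description rather than `FALSE`, so it is usually wrapped in `isTRUE()` when used in a condition.

```r
a <- c(5.1, 4.9, 4.7)
b <- a + 1e-8          # differs from `a` only by a tiny amount

# Tolerant comparison: small numeric differences are ignored
nearly_equal <- isTRUE(all.equal(a, b))

# Exact comparison: any difference at all makes the objects non-identical
exactly_equal <- identical(a, b)

# With a large difference, all.equal() returns a description, not FALSE
difference_report <- all.equal(a, a + 1)

print(nearly_equal)
print(exactly_equal)
print(difference_report)
```

In a validation pipeline, `isTRUE(all.equal(...))` suits computed numeric results, while `identical` suits cases where the data should not have changed at all.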