
Validating Data Using Asserts in R: Hands-on Practice

In this lab, Validating Data Using Asserts in R: Hands-on Practice, you'll dive into the essentials of data validation in R. Learn how to use assertions to clean and prepare datasets, employing the assertr package to ensure data integrity and accuracy. By the end, you'll be adept at performing quality checks and ready to tackle data validation challenges in your projects.

Path Info

Level: Advanced
Duration: 1h 0m
Published: Mar 26, 2024

Table of Contents

  1. Challenge

    Exploring and Defining Asserts

    RStudio Guide

    To get started, click the 'workspace' folder in the bottom-right pane of RStudio, then open the file whose name begins with "Step 1...". You may want to shrink the console pane so that you have more room to work. Complete each task for Step 1 in that R Markdown file. For each task, run its cells with the play button at the top right of the cell before moving on to the next task. When you have completed every task in the step, come back and open the file for the next step, and repeat until you have completed all steps of the lab.


    Exploring and Defining Asserts

    To review the concepts covered in this step, please refer to the Introducing Asserts module of the Validating Data Using Asserts in R course.

    Understanding and defining asserts is important because they are the foundation for identifying problem data and preparing it for analysis. This step will help you grasp the concept of data cleaning and the role assertions play in it.

    Begin your journey into data validation with R by exploring the concept of assertions. In this step, you will:

    1. Write a simple R script to define a dataset with potential data quality issues.
    2. Use basic assert statements to identify rows with missing values.

    Goal: Understand the basic concept of assertions and their role in data cleaning. Tools: R script, basic assert statements.


    Task 1.1: Examine a Dataset with Quality Issues

    The provided code creates a simple dataset named data_quality_issues that contains some rows with missing values. Print the dataset and examine its structure.

    🔍 Hint

    After defining the dataset, print it using the print function, and examine its structure using the str function.

    🔑 Solution
    data_quality_issues <- data.frame(
      Name = c('Alice', 'Bob', 'John', 'Diana', 'Evan'),
      Age = c(25, NA, 30, 22, 28),
      Score = c(85, 90, 88, NA, 95)
    )
    print(data_quality_issues)
    str(data_quality_issues)
    

    Task 1.2: Verifying Data Integrity with assertr

    Utilize the assertr package to verify that the Name column does not contain missing values.

    🔍 Hint

    Load the assertr package with the library() function. Use the assert function to check for NA values. The first argument should be the data frame, the second argument should check for NAs with not_na, and the third argument should be the Name column.

    🔑 Solution
    library(assertr)
    
    assert(data_quality_issues, not_na, Name)
    

    Task 1.3: Detecting a Data Problem using assertr

    Now use assert to check for missing values in the Age column. Notice the difference in output when assert encounters data that does not satisfy the predicate.

    🔍 Hint

    Use the assert function to check for NA values. The first argument should be the data frame, the second argument should check for NAs with not_na, and the third argument should be the Age column.

    🔑 Solution
    assert(data_quality_issues, not_na, Age)
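
    By default, a failed assert halts with an error. As a minimal sketch of how to see the pass/fail difference without stopping a script, the example below swaps in assertr's success_logical and error_logical callbacks so that each check returns TRUE or FALSE. The data frame name people is hypothetical and exists only for this illustration.

    ```r
    library(assertr)

    # Hypothetical data for illustration: Name is complete, Age has one NA
    people <- data.frame(
      Name = c("Alice", "Bob", "Diana"),
      Age  = c(25, NA, 22)
    )

    # Return TRUE/FALSE instead of the data frame or an error
    name_ok <- assert(people, not_na, Name,
                      success_fun = success_logical,
                      error_fun = error_logical)
    age_ok <- assert(people, not_na, Age,
                     success_fun = success_logical,
                     error_fun = error_logical)

    print(name_ok)
    print(age_ok)
    ```

    Logical results like these are convenient inside an if statement, while the default behavior is usually preferable when bad data should stop a pipeline.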
    

  2. Challenge

    Validating Column Elements with assert

    To review the concepts covered in this step, please refer to the Validating Elements in a Column module of the Validating Data Using Asserts in R course.

    Validating elements in a column is crucial because it ensures data integrity and accuracy within individual data points. This step focuses on using the assert function from the assertr package to perform column-wise validation.

    Dive deeper into column-wise data validation by:

    1. Loading the assertr package in R.
    2. Creating a dataset with specific column data types and values.
    3. Using the assert function to validate data elements in a column against predefined criteria.

    Goal: Practice column-wise data validation using the assert function. Tools: R, assertr package, assert function.


    Task 2.1: Load the assertr Package and Create a Dataset

    Start by loading the assertr package to use its functions for data validation. Use the provided code to create a simple dataset that contains some data quality issues. Print the dataset and examine its structure.

    🔍 Hint

    Use the library function to load a package. The package you need to load is assertr. After defining the dataset, print it using the print function, and examine its structure using the str function.

    🔑 Solution
    # Load the assertr Package
    library(assertr)
    
    # Provided code to create a simple dataset
    data_quality_issues <- data.frame(
      Name = c('Alice', 'Bob', 'John', 'Diana', 'Evan'),
      Age = c(25, NA, 30, 22, 28),
      Score = c(85, 90, 88, NA, 105)
    )
    
    # Print the dataset and examine its structure
    print(data_quality_issues)
    str(data_quality_issues)
    

    Task 2.2: Validate Column Types

    Use the assert function from the assertr package to validate that all Age values in the data frame are numeric.

    🔍 Hint

    To check whether the data are numeric, use the is.numeric function as the predicate. Make sure to pass the Age column as the final argument to assert.

    🔑 Solution
    assert(data_quality_issues, is.numeric, Age)
    

    Task 2.3: Validate Column Elements

    Use the assert function from the assertr package to validate that all Score values in the data frame are between 0 and 100.

    🔍 Hint

    Use within_bounds(0, 100) as the predicate. Remember that the predicate is the second argument to assert and the column names follow it.

    🔑 Solution
    assert(data_quality_issues, within_bounds(0, 100), Score)
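
    Note that within_bounds returns a predicate function, and that this predicate lets missing values pass by default (allow.na = TRUE). The sketch below calls the predicate directly on a small, made-up vector named scores to show the difference; it is an illustration rather than part of the lab solution.

    ```r
    library(assertr)

    # Hypothetical values: one in range, one missing, one too large
    scores <- c(85, NA, 105)

    # Default: missing values are allowed, so only 105 is flagged
    lenient <- within_bounds(0, 100)
    print(lenient(scores))

    # allow.na = FALSE also flags the missing value
    strict <- within_bounds(0, 100, allow.na = FALSE)
    print(strict(scores))
    ```

    If missing scores should count as violations, passing the stricter predicate to assert is one option; another is a separate not_na check on the same column.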
    
  3. Challenge

    Using insist for Column-Wide Validation

    To review the concepts covered in this step, please refer to the Validating Elements Using the Column as a Whole module of the Validating Data Using Asserts in R course.

    Validating data using information about the column as a whole is important for understanding the broader context of data points. This step introduces the insist function for such validations.

    Expand your validation skills by:

    1. Exploring the insist function from the assertr package.
    2. Applying the insist function to ensure all values in a column meet the generated criteria.

    Goal: Learn to use the insist function for column-wide validation based on aggregate data. Tools: R, assertr package, insist function.


    Task 3.1: Loading the Required Packages

    Before you can use the insist function for column-wide validation, you need to load the assertr package. The following steps also use the pipe operator %>%, so load the magrittr package to gain access to it.

    🔍 Hint

    Use the library function and pass the name of the package as the argument.

    🔑 Solution
    # Load the required packages
    library(assertr)
    library(magrittr)
    

    Task 3.2: Applying the insist Function

    Load the iris dataset using data(iris). Use the insist function to check whether the Sepal.Length column values in the iris dataset fall within 2 standard deviations from the mean.

    🔍 Hint

    Pipe the iris dataset to the insist() function. Use within_n_sds(2) as the first argument after the piped data and the column name Sepal.Length as the second.

    🔑 Solution
    data(iris)
    iris %>%
      insist(within_n_sds(2), Sepal.Length)
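
    Unlike assert, insist derives its bounds from the column itself before checking each element. A base R sketch of the equivalent computation, assuming the bounds are the mean plus or minus 2 standard deviations as the task describes (the variable names are illustrative):

    ```r
    x <- iris$Sepal.Length

    # Bounds derived from the column as a whole
    lower <- mean(x) - 2 * sd(x)
    upper <- mean(x) + 2 * sd(x)

    # Element-wise check against those bounds
    in_range <- x >= lower & x <= upper

    # Values that a check of this kind would flag
    print(x[!in_range])
    ```

    Seeing the flagged values next to the computed bounds makes it easier to judge whether they are genuine outliers or simply the tail of the distribution.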
    
  4. Challenge

    Row-wise Data Validation

    To review the concepts covered in this step, please refer to the Validating Rows in a Dataset module of the Validating Data Using Asserts in R course.

    Row-wise validation is essential for ensuring that data across multiple columns meets certain criteria. This step covers the use of assert_rows and insist_rows functions for comprehensive row-wise checks.

    In this step, you'll learn about row-wise data validation by:

    1. Understanding the difference between assert_rows and insist_rows functions.
    2. Applying row-wise functions to validate data across multiple columns in a dataset.
    3. Exploring row reduction functions and predicate generators for advanced validation scenarios.

    Goal: Gain proficiency in row-wise data validation techniques. Tools: R, assertr package, assert_rows, insist_rows functions.


    Task 4.1: Loading the Required Packages

    First, load the assertr package. The following steps also use the pipe operator %>%, so load the magrittr package to gain access to it.

    🔍 Hint

    Use the library function and pass the name of the package as the argument.

    🔑 Solution
    # Load the required packages
    library(assertr)
    library(magrittr)
    

    Task 4.2: Understanding assert_rows and insist_rows

    To effectively use assert_rows and insist_rows, it's important to understand their differences and how they work. Use the ? operator to view the documentation for each function.

    🔍 Hint

    Use ?assert_rows to view the documentation for assert_rows, and ?insist_rows for insist_rows.

    🔑 Solution
    # View documentation for assert_rows
    ?assert_rows
    
    # View documentation for insist_rows
    ?insist_rows
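
    In short, assert_rows takes a predicate that is fixed in advance, whereas insist_rows takes a predicate generator that derives the predicate from the reduced row values themselves. The sketch below contrasts the two on the built-in mtcars data, chosen here only because it is fully numeric. The threshold of 10 MADs is an arbitrary illustration value, and the logical callbacks are used so that each call returns TRUE or FALSE.

    ```r
    library(assertr)

    # assert_rows: the predicate is fixed up front
    # (each row may contain at most one missing value)
    few_nas <- assert_rows(
      mtcars, num_row_NAs, within_bounds(0, 1), mpg:carb,
      success_fun = success_logical, error_fun = error_logical
    )

    # insist_rows: the bounds are derived from the data
    # (each row's Mahalanobis distance must lie within 10 MADs of the median)
    typical <- insist_rows(
      mtcars, maha_dist, within_n_mads(10), mpg:carb,
      success_fun = success_logical, error_fun = error_logical
    )

    print(few_nas)
    print(typical)
    ```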
    

    Task 4.3: Checking Missing Values with Row Reduction Functions

    Load the airquality dataset using the data() function. Check for any rows that have more than one NA value.

    🔍 Hint

    Use the assert_rows function with the num_row_NAs row reduction function. Use within_bounds(0, 1) as the predicate to check that each row has at most one NA, and select all columns with everything().

    🔑 Solution
    data(airquality)
    airquality %>%
      assert_rows(num_row_NAs, within_bounds(0, 1), everything())
    
    # Fails: 2 rows contain more than one missing value
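
    When the assertion fails, it can help to look at the offending rows directly. The following base R cross-check is not part of the lab solution; it counts the missing values in each row and lists the rows with more than one (the variable names are illustrative).

    ```r
    data(airquality)

    # Count missing values in each row
    na_per_row <- rowSums(is.na(airquality))

    # Row numbers that contain more than one NA
    bad_rows <- unname(which(na_per_row > 1))
    print(bad_rows)

    # Inspect the offending rows
    print(airquality[bad_rows, ])
    ```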
    

    Task 4.4: Validating Unique Data with Row Reduction Functions

    With the airquality dataset, use assert_rows to ensure that there are no duplicate dates, that is, that every combination of Month and Day is unique.

    🔍 Hint

    Use the assert_rows function. Combine the Month and Day columns using the col_concat reduction function and verify with the is_uniq predicate function.

    🔑 Solution
    airquality %>%
      assert_rows(col_concat, is_uniq, c(Month, Day))
    
    # all Month/Day combinations are unique
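
    The same idea can be sketched in base R: build one key per row from the two columns, then look for repeated keys. This is an illustrative cross-check rather than the lab's method, and the explicit separator is a choice made for this sketch.

    ```r
    data(airquality)

    # Build one key per row; the explicit separator keeps
    # e.g. month 1 / day 11 distinct from month 11 / day 1
    keys <- paste(airquality$Month, airquality$Day, sep = "-")

    # anyDuplicated returns 0 when every key is unique
    print(anyDuplicated(keys))
    ```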
    

  5. Challenge

    Validating Data Frames and Quality Checks

    To review the concepts covered in this step, please refer to the Validating a Data Frame module of the Validating Data Using Asserts in R course.

    Validating entire data frames is crucial for ensuring the overall integrity and quality of datasets. This step focuses on using assertions to check properties of data frames and perform comprehensive quality checks.

    Achieve a higher level of data validation by:

    1. Learning to use the verify function from the assertr package to validate properties of entire datasets.
    2. Exploring the use of all.equal and identical functions for checking equality between objects.

    Goal: Master the validation of entire data frames and performing quality checks. Tools: R, assertr package, verify, all.equal, identical functions.


    Task 5.1: Loading the Required Packages

    First, load the assertr package. In the following steps, we will also use the pipe operator %>% and the mutate function. Load the magrittr and dplyr packages to gain access to these utilities.

    🔍 Hint

    Use the library function and pass the name of the package as the argument.

    🔑 Solution
    # Load the required packages
    library(assertr)
    library(magrittr)
    library(dplyr)
    

    Task 5.2: Using verify to Validate Data Frame Properties

    Load the airquality dataset with the data function. Use the verify function from the assertr package to check that the average of the column Temp in the data frame airquality is greater than 70. This is a basic validation to ensure the data meets a specific condition.

    🔍 Hint

    Pipe the airquality dataset into the verify function with the argument as the condition mean(Temp) > 70.

    🔑 Solution
    data(airquality)
    airquality %>%
      verify(mean(Temp) > 70)
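
    Because verify evaluates a logical expression in the context of the data frame, column names can be used directly inside it. A small sketch comparing a condition that holds with one that does not, using logical callbacks so that nothing halts; the threshold of 90 is an arbitrary value picked for illustration.

    ```r
    library(assertr)
    data(airquality)

    # Holds: the mean temperature is above 70
    warm <- verify(airquality, mean(Temp) > 70,
                   success_fun = success_logical,
                   error_fun = error_logical)

    # Does not hold: the mean temperature is not above 90
    hot <- verify(airquality, mean(Temp) > 90,
                  success_fun = success_logical,
                  error_fun = error_logical)

    print(c(warm = warm, hot = hot))
    ```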
    

    Task 5.3: Comparing Data Frames with all.equal

    The provided code loads the iris dataset, then creates a copy and modifies one column. Compare iris with this new data frame to check if the values are considered equal using the all.equal function.

    🔍 Hint

    Use the all.equal function with iris and iris2 as arguments to check for equality between the two data frames.

    🔑 Solution
    # Provided code to copy and modify a dataframe
    data(iris)
    iris2 <- iris %>%
      mutate(Sepal.Length = Sepal.Length + 1e-8)
    
    # Compare the data frames
    all.equal(iris, iris2)
    
    # The tiny float difference is ignored by all.equal
    

    Task 5.4: Identifying Exact Matches with identical

    Use the identical function to check if the two data frames iris and iris2 are exactly the same. This is a stricter comparison than all.equal and useful for ensuring data integrity.

    🔍 Hint

    Use the identical function with iris and iris2 as arguments. This function returns TRUE if the two objects are exactly the same, and FALSE otherwise.

    🔑 Solution
    identical(iris, iris2)
    
    # The data frames are not exactly identical
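
    One practical detail when using these two functions in conditions: identical always returns a single TRUE or FALSE, but all.equal returns a description of the differences when the objects are not equal. A minimal base R sketch on small made-up vectors:

    ```r
    x <- c(1, 2, 3)
    y <- x + 1e-10   # difference far below all.equal's default tolerance

    # all.equal treats the vectors as equal; identical does not
    print(isTRUE(all.equal(x, y)))
    print(identical(x, y))

    # When objects differ, all.equal returns a description, not FALSE,
    # so wrap it in isTRUE() before using it in a condition
    z <- x + 1
    print(all.equal(x, z))
    print(isTRUE(all.equal(x, z)))
    ```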
    
