Validate Data Cleanliness Using Asserts in Python Hands-on Practice

This lab is the companion hands-on experience to the Pluralsight course Validate Data Cleanliness Using Asserts in Python. In this lab you'll practice using pandas and assert statements to reveal issues in your data.

Level: Beginner
Duration: 46m
Published: Oct 26, 2023


Table of Contents

  1. Challenge

    Asserts for Data Validation

    To review the concepts covered in this step, please refer to the Validating and Verifying Data Using Asserts module of the Validate Data Cleanliness Using Asserts in Python course.

    Assert statements help catch mistakes and validate data cleanliness. An assert evaluates a Boolean condition: if the condition is True, execution continues silently; if it is False, Python raises an AssertionError. That makes asserts a compact way to state what clean data must look like.
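
As a minimal sketch of that behavior, the example below uses a small, hypothetical DataFrame (not the lab's CSV file) and an optional failure message, which Python attaches to the AssertionError when the condition is False.

```python
import pandas as pd

# Hypothetical sample data standing in for the lab's dataset.
df = pd.DataFrame({"likes": [12, 7, 3]})

# The condition is True, so this assert passes and prints nothing.
assert (df["likes"] > 0).all(), "likes must all be greater than 0"

# The condition is False, so an AssertionError carrying the message is raised.
try:
    assert (df["likes"] > 10).all(), "likes must all be greater than 10"
except AssertionError as err:
    message = str(err)

print(message)
```

The message after the comma is optional, but it can make a failing check easier to trace in a longer notebook.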

    Let's put our knowledge into practice! Your task is to write an assert statement that checks whether all values in the likes column of our dataset are greater than 0. This helps ensure that our data doesn't contain erroneous entries where the number of likes is 0 or less. We'll also see what it looks like when an assertion fails.

    To get started, open the file on the right entitled "Step 1...". You'll complete each task for Step 1 in that Jupyter Notebook file. Remember, you must run the cells for each task before moving on to the next task in the notebook. Continue until you have completed all tasks in this step. When you are ready for the next step, come back and click on the file for that step, and repeat until you have completed all tasks in all steps of the lab.


    Task 1.1: Importing Necessary Libraries

    Before we can start working with our data, we need to import the necessary libraries. For this task, import the pandas library as pd.

    🔍 Hint

    Use the import keyword to import pandas. To import it as pd, use the as keyword.

    🔑 Solution
    import pandas as pd
    

    Task 1.2: Loading the Dataset

    Now that we have our necessary library, let's load our dataset. The dataset is stored in a CSV file named 'time_series_data.csv'. Load this dataset into a DataFrame and assign it to the variable df.

    🔍 Hint

    Use the pd.read_csv() function to read the CSV file. Pass the file name as a string argument to this function.

    🔑 Solution
    df = pd.read_csv('time_series_data.csv')
    

    Task 1.3: Checking the Data

    Before we write our assert statement, let's first check our data. Display the first 5 rows of the DataFrame using the head() function.

    🔍 Hint

    Call the head() method on the DataFrame df.

    🔑 Solution
    df.head()
    

    Task 1.4: Writing the Assert Statement

    Now that we have our data, let's write our assert statement. Write an assert statement that checks whether all values in the likes column of our dataset are greater than 0.

    🔍 Hint

    Use the assert keyword followed by the condition to check. To check if all values in the likes column are greater than 0, use the all() method on the condition df['likes'] > 0. Remember that if an assert runs and has no error, no text will print.

    🔑 Solution
    assert (df['likes'] > 0).all()
    

    Task 1.5: Failed Assertions

    Now that you have an assertion that's passing, let's see what it looks like when an assert statement doesn't pass. Rewrite your assert from the previous task, but this time change the condition so that an AssertionError is raised.

    🔍 Hint

    Use the assert keyword followed by the condition to check. To raise an AssertionError, the condition we evaluate must result in False. To do this, consider altering > 0 to < 0.

    🔑 Solution
    assert (df['likes'] < 0).all()
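
A failed assert only reports that the condition was False. One way to see which rows are responsible is to filter on the negated condition first. The sketch below assumes a small, hypothetical DataFrame with one invalid row; the name bad_rows is illustrative and not part of the lab.

```python
import pandas as pd

# Hypothetical data containing one invalid row (likes == 0).
df = pd.DataFrame({"likes": [12, 0, 3]})

# Select the rows that violate the "greater than 0" rule.
bad_rows = df[~(df["likes"] > 0)]

# Assert that no violating rows exist, reporting how many were found.
try:
    assert bad_rows.empty, f"{len(bad_rows)} row(s) have likes <= 0"
except AssertionError as err:
    failure = str(err)

print(failure)
print(bad_rows)
```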
    
  2. Challenge

    Verifying Index Equality

    To review the concepts covered in this step, please refer to the Validating and Verifying Data Using Asserts module of the Validate Data Cleanliness Using Asserts in Python course.

    Verifying index equality is important because it helps ensure that the data we are comparing is aligned correctly. The assert_index_equal function from the pandas.testing module can be used to assert that both indexes are equal.

    Now, let's create two index objects based on the timestamp column of our dataset and use the assert_index_equal function to verify their equality. Remember, the indexes must be equal for our data to be considered clean.
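
To illustrate what a mismatch looks like, here is a short sketch using hypothetical timestamps rather than the lab's dataset. The names left, right, and indexes_match are assumptions made for the example.

```python
import pandas as pd
from pandas.testing import assert_index_equal

# Hypothetical timestamps standing in for the dataset's timestamp column.
left = pd.Index(pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]))
right = pd.Index(pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04"]))

# Identical indexes pass silently.
assert_index_equal(left, left.copy())

# Indexes that differ in one value raise an AssertionError describing the mismatch.
try:
    assert_index_equal(left, right)
    indexes_match = True
except AssertionError as err:
    indexes_match = False
    print(err)
```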


    Task 2.1: Import Necessary Libraries

    Import the necessary libraries for this step. You will need pandas for data manipulation and assert_index_equal from the pandas.testing module for verifying index equality.

    🔍 Hint

    Use the import keyword with the as keyword to import pandas under the alias pd. Use the from keyword to import assert_index_equal from pandas.testing.

    🔑 Solution
    import pandas as pd
    from pandas.testing import assert_index_equal
    

    Task 2.2: Load the Dataset

    Load the time_series_data.csv file into a pandas DataFrame. Remember to parse the timestamp column as datetime.

    🔍 Hint

    Use the pd.read_csv() function to load the CSV file. To parse the 'timestamp' column as datetime, use the parse_dates parameter and pass it a list of column names to parse as datetime.

    🔑 Solution
    df = pd.read_csv('time_series_data.csv', parse_dates=['timestamp'])
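
If you want to confirm that parse_dates took effect, an assert on the column's dtype is one option. This sketch reads a hypothetical two-row CSV from an in-memory string instead of the lab's file, and uses is_datetime64_any_dtype from pandas.api.types.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical CSV content standing in for time_series_data.csv.
csv_text = "timestamp,likes\n2023-01-01 00:00:00,10\n2023-01-02 00:00:00,20\n"

# parse_dates converts the listed columns to datetime while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

# Passes only if the column was parsed as a datetime dtype.
assert is_datetime64_any_dtype(df["timestamp"])
print(df.dtypes)
```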
    

    Task 2.3: Create Index Objects

    Create two index objects based on the timestamp column of the DataFrame. Print one of the index objects.

    🔍 Hint

    Use the pd.Index() function to create an index object. Pass the timestamp column of the DataFrame to this function.

    🔑 Solution
    index1 = pd.Index(df['timestamp'])
    index2 = pd.Index(df['timestamp'])
    print(index1)
    

    Task 2.4: Verify Index Equality

    Use the assert_index_equal function to verify the equality of the two index objects you created. Save the return value to a variable named results and print it. It should print None, because assert_index_equal returns nothing when the indexes match.

    🔍 Hint

    Use the assert_index_equal() function from the pandas.testing module. Pass the two index objects as arguments to this function. The result of assert_index_equal() should print as None.

    🔑 Solution
    results = assert_index_equal(index1, index2)
    print(results)
    
  3. Challenge

    Verifying Series Equality

    To review the concepts covered in this step, please refer to the Validating and Verifying Data Using Asserts module of the Validate Data Cleanliness Using Asserts in Python course.

    Verifying series equality is important because it allows us to compare two series in pandas and ensure that they contain the same data. The assert_series_equal function can be used to compare the values of the series.

    For this step, we'll create two series from the likes column of our dataset. Then, use the assert_series_equal function to verify that they are equal. We'll also see what it looks like when the series are not equal.
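
By default, assert_series_equal compares more than the values: it also checks attributes such as the series name. The sketch below uses two hypothetical series with identical values but different names, and shows the check_names parameter relaxing that check.

```python
import pandas as pd
from pandas.testing import assert_series_equal

# Hypothetical series with the same values but different names.
a = pd.Series([1, 2, 3], name="likes")
b = pd.Series([1, 2, 3], name="shares")

# With the defaults, the differing names cause an AssertionError.
try:
    assert_series_equal(a, b)
    passed_with_defaults = True
except AssertionError:
    passed_with_defaults = False

# check_names=False skips the name comparison, so this call passes.
assert_series_equal(a, b, check_names=False)
print(passed_with_defaults)
```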


    Task 3.1: Import Necessary Libraries

    Import the necessary libraries for this step. You will need pandas for data manipulation and assert_series_equal from the pandas.testing module for verifying series equality.

    🔍 Hint

    Use the import keyword to import pandas and the from keyword to import assert_series_equal from pandas.testing. You can use the as keyword to give them an alias if you want.

    🔑 Solution
    import pandas as pd
    from pandas.testing import assert_series_equal
    

    Task 3.2: Load the Dataset

    Now that we have our libraries, we can load our dataset. Load the 'time_series_data.csv' file into a pandas DataFrame.

    🔍 Hint

    Use the pd.read_csv() function to read the CSV file. The file path is 'time_series_data.csv'.

    🔑 Solution
    df = pd.read_csv('time_series_data.csv')
    

    Task 3.3: Create Two Series from the 'likes' Column

    Next, we need to create two series from the likes column of our DataFrame. Let's call them series1 and series2.

    🔍 Hint

    You can create a series from a DataFrame column by using the syntax df['column_name']. Do this twice to create series1 and series2.

    🔑 Solution
    series1 = df['likes']
    series2 = df['likes']
    

    Task 3.4: Verify Series Equality

    Next, we need to verify that our two series are equal. Use the assert_series_equal function from pandas.testing to do this.

    🔍 Hint

    The assert_series_equal function can be used like this: assert_series_equal(series1, series2). Replace 'series1' and 'series2' with the names of your series.

    🔑 Solution
    assert_series_equal(series1, series2)
    

    Task 3.5: Raising an AssertionError

    Finally, let's see what happens when two series are not equal! Change your second series so that it is different from the first series and use assert_series_equal again.

    🔍 Hint

    The assert_series_equal function can be used like this: assert_series_equal(series1, series2). To modify the second series, you can change it to be the shares column.

    🔑 Solution
    series2 = df['shares']
    assert_series_equal(series1, series2)
    
  4. Challenge

    Verifying DataFrame Equality

    To review the concepts covered in this step, please refer to the Validating and Verifying Data Using Asserts module of the Validate Data Cleanliness Using Asserts in Python course.

    Verifying DataFrame equality is important because it allows us to compare two DataFrames and ensure that they contain the same data. The assert_frame_equal function can be used for this purpose.

    In this step, create two DataFrames from our dataset and use the assert_frame_equal function to verify their equality. Remember to use the optional parameter check_like=True, which ignores the order of the index and columns as long as the same labels hold the same data.
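
As a rough illustration of check_like, the sketch below builds a small, hypothetical DataFrame and a copy whose rows and columns are in a different order. The names shuffled and order_matters are assumptions made for the example.

```python
import pandas as pd
import pandas.testing as tm

# Hypothetical data standing in for the lab's dataset.
df1 = pd.DataFrame({"likes": [10, 20, 30], "shares": [1, 2, 3]})

# Same labels and data, but rows and columns appear in a different order.
shuffled = df1.loc[[2, 0, 1], ["shares", "likes"]]

# check_like=True ignores the ordering, so this call passes.
tm.assert_frame_equal(df1, shuffled, check_like=True)

# With the default check_like=False, the ordering difference is reported.
try:
    tm.assert_frame_equal(df1, shuffled)
    order_matters = False
except AssertionError:
    order_matters = True

print(order_matters)
```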


    Task 4.1: Import Required Libraries

    Import the pandas library as pd and the pandas.testing module as tm.

    🔍 Hint

    Use the import keyword to import pandas and pandas.testing. Remember to use the as keyword to give them an alias.

    🔑 Solution
    import pandas as pd
    import pandas.testing as tm
    

    Task 4.2: Load the Dataset

    Load the time_series_data.csv file into two separate pandas DataFrames.

    🔍 Hint

    Use the pd.read_csv() function to load the CSV file into a DataFrame. Do this twice to create two separate DataFrames.

    🔑 Solution
    df1 = pd.read_csv('time_series_data.csv')
    df2 = pd.read_csv('time_series_data.csv')
    

    Task 4.3: Verify DataFrame Equality

    Use the assert_frame_equal function to verify the equality of the two DataFrames.

    🔍 Hint

    Use the tm.assert_frame_equal() function and pass in the two DataFrames as arguments. Also, set the check_like parameter to True.

    🔑 Solution
    tm.assert_frame_equal(df1, df2, check_like=True)
    

    Task 4.4: Test the check_like=True Behavior

    Use the code in the notebook to switch the order of the columns in the second DataFrame, and then use the assert_frame_equal function to verify the equality of the two DataFrames. Run it once with check_like=True and once with check_like=False.

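    🔍 Hint

    Call tm.assert_frame_equal() twice with the two DataFrames as arguments: once with check_like=True and once with check_like=False. The first call should pass. The second is expected to raise an AssertionError because the column order differs.

    🔑 Solution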
    reversed_columns = df1.columns[::-1]
    df2 = df1[reversed_columns]
    
    tm.assert_frame_equal(df1, df2, check_like=True)
    
    tm.assert_frame_equal(df1, df2, check_like=False)
    
  5. Challenge

    Quantitative Test for Clean Data Using Asserts

    To review the concepts covered in this step, please refer to the Using Assert-based Tests for Data Cleaning module of the Validate Data Cleanliness Using Asserts in Python course.

    Quantitative tests using asserts are important because they help check for missing data, out-of-range values, and incorrect data types. These tests can save time and resources in data analysis.

    Now, let's write a quantitative test using asserts to check for missing data in our DataFrame. Additionally, write tests to check for out-of-range values in the likes, shares, and comments columns, and to check that these columns contain numerical data.


    Task 5.1: Import Necessary Libraries

    Before we start, we need to import the necessary libraries. Import pandas as pd.

    🔍 Hint

    Use the import keyword to import pandas. Remember to import it as pd.

    🔑 Solution
    import pandas as pd
    

    Task 5.2: Load the Dataset

    Load the dataset 'time_series_data.csv' into a DataFrame named df.

    🔍 Hint

    Use the pd.read_csv() function to read the CSV file. Pass the file path as a string to this function.

    🔑 Solution
    df = pd.read_csv('time_series_data.csv')
    

    Task 5.3: Check for Missing Data

    Write an assert statement to check if there are any missing values in the DataFrame.

    🔍 Hint

    Use the DataFrame's isnull() method to flag missing values. Then call any() twice: the first call reduces each column to a single Boolean, and the second reduces those to one Boolean for the whole DataFrame. Finally, use assert not to confirm that the result is False.

    🔑 Solution
    assert not df.isnull().any().any()
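
When this assert fails, it helps to know which columns contain the gaps. One possible follow-up, sketched here on a hypothetical DataFrame with a single missing value, is to count missing values per column with isnull().sum().

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the likes column.
df = pd.DataFrame({"likes": [10, np.nan, 30], "shares": [1, 2, 3]})

# Count missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# The same double any() used in the solution, stored instead of asserted.
has_missing = bool(df.isnull().any().any())
print(has_missing)
```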
    

    Task 5.4: Check for Out-of-Range Values

    Write assert statements to check if there are any out-of-range values in the likes, shares, and comments columns. The range for likes is 0-3000, for shares is 0-500, and for comments is 0-200.

    🔍 Hint

    Use the assert keyword along with the all() method to check if all values in the specified columns are within the given range. Use the pd.Series.between method, which takes a lower bound and an upper bound. Here's an example:

    df[col].between(lower, upper)
    
    🔑 Solution
    assert df['likes'].between(0, 3000).all()
    assert df['shares'].between(0, 500).all()
    assert df['comments'].between(0, 200).all()
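
Note that between() includes both endpoints by default, so a value of exactly 0 or exactly 3000 counts as in range. The sketch below demonstrates this on hypothetical values chosen to sit on and just outside the boundaries.

```python
import pandas as pd

# Hypothetical values on and just outside the 0-3000 range.
likes = pd.Series([0, 3000, 3001, -1])

# Both endpoints are included by default.
in_range = likes.between(0, 3000)
print(in_range.tolist())
```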
    

    Task 5.5: Check for Correct Data Types

    Write assert statements to check if the likes, shares, and comments columns contain numerical data. In this dataset, that means each column should have the int64 dtype.

    🔍 Hint

    Use the assert keyword along with each column's dtype attribute to check if the data type of the specified columns is int64.

    🔑 Solution
    assert df['likes'].dtype == 'int64'
    assert df['shares'].dtype == 'int64'
    assert df['comments'].dtype == 'int64'
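
Comparing against 'int64' is strict: a float column would fail even though it is numeric. If any numeric dtype is acceptable, is_numeric_dtype from pandas.api.types is a looser alternative. The sketch below applies it to a hypothetical DataFrame with an integer, a float, and a string column.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical data: integers, floats, and numbers stored as strings.
df = pd.DataFrame({
    "likes": [10, 20],
    "shares": [1.5, 2.0],
    "comments": ["3", "4"],
})

# True for integer and float columns, False for the string column.
numeric_flags = {col: is_numeric_dtype(df[col]) for col in df.columns}
print(numeric_flags)
```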
    
  6. Challenge

    Logical Test for Clean Data Using Asserts

    To review the concepts covered in this step, please refer to the Using Assert-based Tests for Data Cleaning module of the Validate Data Cleanliness Using Asserts in Python course.

    Logical tests using asserts are important because they help detect data entry errors, inconsistencies, and unreasonable values, ensuring data coherence and consistency.

    Finally, let's write a logical test using asserts to check for data coherence in our DataFrame. For example, you could write a test to check that the number of likes is always greater than or equal to the number of shares and comments.


    Task 6.1: Import Required Libraries

    Before we can start working with our data, we need to import the necessary libraries. For this task, import the pandas library as pd.

    🔍 Hint

    Use the import keyword followed by the library name and the as keyword to give it a short alias. For example, import _________ as pd.

    🔑 Solution
    import pandas as pd
    

    Task 6.2: Load the Dataset

    Now that we have our libraries imported, let's load our dataset. The dataset is stored in a CSV file named 'time_series_data.csv'. Load this file into a DataFrame and display the first few rows of the DataFrame.

    🔍 Hint

    Use the pd.read_csv() function to read the CSV file and store it in a variable. Then, use the .head() method on the DataFrame to display the first few rows.

    🔑 Solution
    df = pd.read_csv('time_series_data.csv')
    print(df.head())
    

    Task 6.3: Write a Logical Test Using Assert

    Now, let's write a logical test using assert to check that the number of likes is always greater than or equal to the number of shares and comments. If the condition is not met, the assert statement will raise an AssertionError.

    🔍 Hint

    Use the assert keyword followed by the condition you want to check. For example:

    assert (df[col1] >= df[col2]).all()
    
    🔑 Solution
    assert (df['likes'] >= df['shares']).all() and (df['likes'] >= df['comments']).all()
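
To find the rows that break a logical rule like this, the two comparisons can be combined element-wise with & and used as a filter. This is a sketch on hypothetical data; the names coherent and violations are illustrative.

```python
import pandas as pd

# Hypothetical data in which the second and third rows break the rule.
df = pd.DataFrame({
    "likes": [100, 5, 40],
    "shares": [10, 8, 4],
    "comments": [20, 1, 50],
})

# Row-wise check: likes must be at least as large as shares and comments.
coherent = (df["likes"] >= df["shares"]) & (df["likes"] >= df["comments"])
violations = df[~coherent]

# Assert the rule and report the number of violating rows on failure.
try:
    assert coherent.all(), f"{len(violations)} row(s) break the likes rule"
except AssertionError as err:
    print(err)

print(violations)
```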
    

What's a lab?

Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.

Provided environment for hands-on practice

We will provide the credentials and environment necessary for you to practice right within your browser.

Guided walkthrough

Follow along with the author's guided walkthrough and build something new in your provided environment!

Did you know?

On average, you retain 75% more of your learning if you get time for practice.