Validate Data Cleanliness Using Asserts in Python Hands-on Practice
This lab is the companion hands-on experience to the Pluralsight course Validate Data Cleanliness Using Asserts in Python. In this lab you'll practice using pandas and assert statements to reveal issues in your data.

Asserts for Data Validation
To review the concepts covered in this step, please refer to the Validating and Verifying Data Using Asserts module of the Validate Data Cleanliness Using Asserts in Python course.
Understanding how to use assert statements in Python is important because they help catch mistakes and validate data cleanliness. An assert evaluates a Boolean condition: if the condition is True, execution continues silently; if it is False, Python raises an AssertionError. That behavior is what makes asserts useful for data validation.
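As a minimal sketch of the pass and fail behavior described above, the snippet below uses a small made-up DataFrame (the name sample and its values are assumptions for illustration, not the lab's dataset). The failing assert is wrapped in try/except only so the outcome can be printed.

```python
import pandas as pd

# Hypothetical sample data; the lab's real dataset is loaded from a CSV file.
sample = pd.DataFrame({"likes": [120, 45, 300]})

# Condition is True for every row, so the assert passes silently.
assert (sample["likes"] > 0).all()

# Condition is False for at least one row, so an AssertionError is raised.
# The optional text after the comma becomes the error message.
try:
    assert (sample["likes"] > 100).all(), "found rows with 100 likes or fewer"
    outcome = "passed"
except AssertionError as err:
    outcome = f"failed: {err}"

print(outcome)
```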
Let's put our knowledge into practice! Your task is to write an assert statement that checks whether all values in the likes column of our dataset are greater than 0. This helps catch erroneous entries such as a negative number of likes (as written, the check also rejects rows with exactly 0 likes). We'll also see what it looks like when an assertion fails.
To get started, open the file on the right entitled "Step 1...". You'll complete each task for Step 1 in that Jupyter Notebook file. Remember, you must run the cells for each task before moving on to the next task in the notebook. Continue until you have completed all tasks in this step. When you are ready for the next step, come back and click on the file for that step, and repeat until you have completed all tasks in all steps of the lab.
Task 1.1: Importing Necessary Libraries
Before we can start working with our data, we need to import the necessary libraries. For this task, import the pandas library as pd.
Hint: Use the import keyword to import pandas. To import it as pd, use the as keyword.
Solution:
import pandas as pd
Task 1.2: Loading the Dataset
Now that we have our necessary library, let's load our dataset. The dataset is stored in a CSV file named 'time_series_data.csv'. Load this dataset into a DataFrame and assign it to the variable df.
Hint: Use the pd.read_csv() function to read the CSV file. Pass the file name as a string argument to this function.
Solution:
df = pd.read_csv('time_series_data.csv')
Task 1.3: Checking the Data
Before we write our assert statement, let's first check our data. Display the first 5 rows of the DataFrame using the head() method.
Hint: Call the head() method on the DataFrame df.
Solution:
df.head()
Task 1.4: Writing the Assert Statement
Now that we have our data, let's write our assert statement. Write an assert statement that checks whether all values in the likes column of our dataset are greater than 0.
Hint: Use the assert keyword followed by the condition to check. To check whether all values in the likes column are greater than 0, call the all() method on the condition df['likes'] > 0. Remember that if an assert runs without error, nothing is printed.
Solution:
assert (df['likes'] > 0).all()
Task 1.5: Failed Assertions
Now that you have a passing assertion, let's see what it looks like when an assert statement doesn't pass. Rewrite your assert from the previous task, but this time change the condition so that an AssertionError is raised.
Hint: Use the assert keyword followed by the condition to check. To raise an AssertionError, the condition must evaluate to False. To do this, consider altering > 0 to < 0.
Solution:
assert (df['likes'] < 0).all()
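A bare AssertionError says that a check failed but not which rows caused it. One possible way to make a failure more informative, sketched here on made-up data (sample, bad_rows, and the message text are all assumptions for illustration), is to select the offending rows first and describe them in the assert message.

```python
import pandas as pd

# Hypothetical sample data with one invalid (negative) value.
sample = pd.DataFrame({"likes": [120, -5, 300]})

# Select the rows that break the rule so the message can describe them.
bad_rows = sample[sample["likes"] <= 0]
bad_index = bad_rows.index.tolist()

try:
    assert bad_rows.empty, f"{len(bad_rows)} row(s) with non-positive likes at index {bad_index}"
    failure_message = ""
except AssertionError as err:
    failure_message = str(err)

print(failure_message)
```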
Verifying Index Equality
To review the concepts covered in this step, please refer to the Validating and Verifying Data Using Asserts module of the Validate Data Cleanliness Using Asserts in Python course.
Verifying index equality is important because it helps ensure that the data we are comparing is aligned correctly. The assert_index_equal function from the pandas.testing module asserts that two indexes are equal.
Now, let's create two index objects based on the timestamp column of our dataset and use the assert_index_equal function to verify their equality. Remember, the indexes must be equal for our data to be considered clean.
Task 2.1: Import Necessary Libraries
Import the necessary libraries for this step. You will need pandas for data manipulation and assert_index_equal from the pandas.testing module for verifying index equality.
Hint: Use the import keyword to import pandas, with the as keyword to give it the alias pd. Use the from keyword to import assert_index_equal from pandas.testing.
Solution:
import pandas as pd
from pandas.testing import assert_index_equal
Task 2.2: Load the Dataset
Load the time_series_data.csv file into a pandas DataFrame. Remember to parse the timestamp column as datetime.
Hint: Use the pd.read_csv() function to load the CSV file. To parse the 'timestamp' column as datetime, use the parse_dates parameter and pass it a list of column names to parse.
Solution:
df = pd.read_csv('time_series_data.csv', parse_dates=['timestamp'])
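To see what parse_dates changes, the sketch below reads the same CSV text twice, once without and once with the parameter. The CSV content is invented for illustration and is read from an in-memory string rather than the lab's file.

```python
import io
import pandas as pd

# Hypothetical CSV content; column names mirror the ones used in this lab.
csv_text = "timestamp,likes\n2023-01-01 00:00:00,10\n2023-01-01 01:00:00,12\n"

as_text = pd.read_csv(io.StringIO(csv_text))
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

# Without parse_dates the column holds strings; with it, datetime values.
assert not pd.api.types.is_datetime64_any_dtype(as_text["timestamp"])
assert pd.api.types.is_datetime64_any_dtype(as_dates["timestamp"])

# Datetime columns expose the .dt accessor, for example to extract the hour.
hours = as_dates["timestamp"].dt.hour.tolist()
```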
Task 2.3: Create Index Objects
Create two index objects based on the timestamp column of the DataFrame. Print one of the index objects.
Hint: Use the pd.Index() constructor to create an index object. Pass the timestamp column of the DataFrame to it.
Solution:
index1 = pd.Index(df['timestamp'])
index2 = pd.Index(df['timestamp'])
print(index1)
Task 2.4: Verify Index Equality
Use the assert_index_equal function to verify the equality of the two index objects you created. Save the result to results and print it. Because the function returns nothing when the indexes match, it should print None.
Hint: Use the assert_index_equal() function from the pandas.testing module. Pass the two index objects as arguments to this function. The result of assert_index_equal() should print as None.
Solution:
results = assert_index_equal(index1, index2)
print(results)
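The following sketch contrasts the passing case with a failing one. The timestamps are made up for illustration, and the try/except exists only to record that the mismatch raised an AssertionError.

```python
import pandas as pd
from pandas.testing import assert_index_equal

# Hypothetical hourly timestamps standing in for the lab's timestamp column.
left = pd.Index(pd.to_datetime(["2023-01-01 00:00", "2023-01-01 01:00"]))
same = pd.Index(pd.to_datetime(["2023-01-01 00:00", "2023-01-01 01:00"]))
shifted = pd.Index(pd.to_datetime(["2023-01-01 00:00", "2023-01-01 02:00"]))

# Equal indexes: the function returns None and raises nothing.
result = assert_index_equal(left, same)

# Unequal indexes: an AssertionError describes the mismatch.
try:
    assert_index_equal(left, shifted)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```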
Verifying Series Equality
To review the concepts covered in this step, please refer to the Validating and Verifying Data Using Asserts module of the Validate Data Cleanliness Using Asserts in Python course.
Verifying series equality is important because it allows us to compare two series in pandas and ensure that they contain the same data. The assert_series_equal function can be used to compare the values of the series.
For this step, we'll create two series from the likes column of our dataset. Then, use the assert_series_equal function to verify that they are equal. We'll also see what it looks like when the series are not equal.
Task 3.1: Import Necessary Libraries
Import the necessary libraries for this step. You will need pandas for data manipulation and assert_series_equal from the pandas.testing module for verifying series equality.
Hint: Use the import keyword to import pandas and the from keyword to import assert_series_equal from pandas.testing. You can use the as keyword to give them an alias if you want.
Solution:
import pandas as pd
from pandas.testing import assert_series_equal
Task 3.2: Load the Dataset
Now that we have our libraries, we can load our dataset. Load the 'time_series_data.csv' file into a pandas DataFrame.
Hint: Use the pd.read_csv() function to read the CSV file. The file path is 'time_series_data.csv'.
Solution:
df = pd.read_csv('time_series_data.csv')
Task 3.3: Create Two Series from the 'likes' Column
Next, we need to create two series from the likes column of our DataFrame. Let's call them series1 and series2.
Hint: You can create a series from a DataFrame column by using the syntax df['column_name']. Do this twice to create series1 and series2.
Solution:
series1 = df['likes']
series2 = df['likes']
Task 3.4: Verify Series Equality
Now we need to verify that our two series are equal. Use the assert_series_equal function from pandas.testing to do this.
Hint: The assert_series_equal function can be used like this: assert_series_equal(series1, series2). Replace 'series1' and 'series2' with the names of your series.
Solution:
assert_series_equal(series1, series2)
Task 3.5: Raising an AssertionError
Finally, let's see what happens when two series are not equal! Change your second series so that it differs from the first series and use assert_series_equal again.
Hint: The assert_series_equal function can be used like this: assert_series_equal(series1, series2). To modify the second series, you can change it to be the shares column.
Solution:
series2 = df['shares']
assert_series_equal(series1, series2)
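By default, assert_series_equal compares more than the values: the series name is checked as well. The sketch below uses made-up data in which two columns hold identical values, to show that the default comparison still fails on the name, and that check_names=False relaxes that one check.

```python
import pandas as pd
from pandas.testing import assert_series_equal

# Hypothetical data in which two columns happen to hold identical values.
sample = pd.DataFrame({"likes": [10, 20, 30], "shares": [10, 20, 30]})

# Same values, different names: the default comparison fails on the name.
try:
    assert_series_equal(sample["likes"], sample["shares"])
    names_matter = False
except AssertionError:
    names_matter = True

# check_names=False still compares values, dtype, and index, but not the name.
assert_series_equal(sample["likes"], sample["shares"], check_names=False)
```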
Verifying DataFrame Equality
To review the concepts covered in this step, please refer to the Validating and Verifying Data Using Asserts module of the Validate Data Cleanliness Using Asserts in Python course.
Verifying DataFrame equality is important because it allows us to compare two DataFrames and ensure that they contain the same data. The assert_frame_equal function can be used for this purpose.
In this step, create two DataFrames from our dataset and use the assert_frame_equal function to verify their equality. Remember to use the optional parameter check_like=True, which ignores the order of the index and columns and compares the data by label.
Task 4.1: Import Required Libraries
Import the pandas library as pd and the pandas.testing module as tm.
Hint: Use the import keyword to import pandas and pandas.testing. Remember to use the as keyword to give them an alias.
Solution:
import pandas as pd
import pandas.testing as tm
Task 4.2: Load the Dataset
Load the time_series_data.csv file into two separate pandas DataFrames.
Hint: Use the pd.read_csv() function to load the CSV file into a DataFrame. Do this twice to create two separate DataFrames.
Solution:
df1 = pd.read_csv('time_series_data.csv')
df2 = pd.read_csv('time_series_data.csv')
Task 4.3: Verify DataFrame Equality
Use the assert_frame_equal function to verify the equality of the two DataFrames.
Hint: Use the tm.assert_frame_equal() function and pass in the two DataFrames as arguments. Also, set the check_like parameter to True.
Solution:
tm.assert_frame_equal(df1, df2, check_like=True)
Task 4.4: Test the check_like=True Behavior
Use the code in the notebook to switch the order of the columns in the second DataFrame, and then use the assert_frame_equal function to verify the equality of the two DataFrames. Use both check_like=True and check_like=False.
Hint: Call tm.assert_frame_equal() twice with the two DataFrames as arguments, once with check_like=True and once with check_like=False. With the columns reordered, the first call passes and the second raises an AssertionError.
Solution:
reversed_columns = df1.columns[::-1]
df2 = df1[reversed_columns]
tm.assert_frame_equal(df1, df2, check_like=True)
tm.assert_frame_equal(df1, df2, check_like=False)
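The same contrast can be reproduced without the CSV file. This sketch uses a small invented frame (original and reordered are hypothetical names) and catches the expected failure so that both outcomes are visible in one run.

```python
import pandas as pd
import pandas.testing as tm

# Hypothetical two-column frame and a copy with the columns reversed.
original = pd.DataFrame({"likes": [10, 20], "shares": [1, 2]})
reordered = original[["shares", "likes"]]

# check_like=True ignores column order, so this passes.
tm.assert_frame_equal(original, reordered, check_like=True)

# check_like=False (the default) treats column order as significant.
try:
    tm.assert_frame_equal(original, reordered, check_like=False)
    order_sensitive = False
except AssertionError:
    order_sensitive = True
```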
Quantitative Test for Clean Data Using Asserts
To review the concepts covered in this step, please refer to the Using Assert-based Tests for Data Cleaning module of the Validate Data Cleanliness Using Asserts in Python course.
Quantitative tests using asserts are important because they help check for missing data, out-of-range values, and incorrect data types. These tests can save time and resources in data analysis.
Now, let's write a quantitative test using asserts to check for missing data in our DataFrame. Additionally, write tests to check for out-of-range values in the likes, shares, and comments columns, and to check that these columns contain numerical data.
Task 5.1: Import Necessary Libraries
Before we start, we need to import the necessary libraries. Import pandas as pd.
Hint: Use the import keyword to import pandas. Remember to import it as pd.
Solution:
import pandas as pd
Task 5.2: Load the Dataset
Load the dataset 'time_series_data.csv' into a DataFrame named df.
Hint: Use the pd.read_csv() function to read the CSV file. Pass the file path as a string to this function.
Solution:
df = pd.read_csv('time_series_data.csv')
Task 5.3: Check for Missing Data
Write an assert statement that verifies the DataFrame contains no missing values.
Hint: Use the isnull() method on the DataFrame to flag missing values. Then call any() to check each column for True values, and call any() a second time to reduce the per-column results to a single Boolean. Finally, use the assert keyword with not to make sure that result is False.
Solution:
assert not df.isnull().any().any()
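To show why any() is called twice, the sketch below breaks the expression into its intermediate steps on made-up data containing one missing value. The per-column count at the end is an optional extra, not part of the lab's solution.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in the shares column.
sample = pd.DataFrame({"likes": [10, 20, 30], "shares": [1.0, np.nan, 3.0]})

# First any(): one Boolean per column. Second any(): one Boolean overall.
missing_by_column = sample.isnull().any()
has_missing = missing_by_column.any()

# Per-column counts make a failed check easier to act on.
missing_counts = sample.isnull().sum()
print(missing_counts)
```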
Task 5.4: Check for Out-of-Range Values
Write assert statements to check if there are any out-of-range values in the likes, shares, and comments columns. The range for likes is 0-3000, for shares is 0-500, and for comments is 0-200.
Hint: Use the assert keyword along with the all() method to check that all values in the specified columns are within the given range. Use the pd.Series.between method, which takes a lower bound and an upper bound. Here's an example: df[col].between(lower, upper)
Solution:
assert df['likes'].between(0, 3000).all()
assert df['shares'].between(0, 500).all()
assert df['comments'].between(0, 200).all()
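When several columns share the same kind of check, the bounds can be kept in one place. The sketch below is one possible arrangement, using made-up data and a hypothetical valid_ranges dictionary that holds the ranges stated in the task. It relies on between() including both endpoints by default.

```python
import pandas as pd

# Hypothetical sample data; one comments value (201) is out of range.
sample = pd.DataFrame({
    "likes": [0, 1500, 3000],
    "shares": [5, 250, 500],
    "comments": [0, 40, 201],
})

# Bounds taken from the task description above.
valid_ranges = {"likes": (0, 3000), "shares": (0, 500), "comments": (0, 200)}

# between() is inclusive on both ends by default, so 0 and 3000 are valid likes.
in_range = {
    col: bool(sample[col].between(low, high).all())
    for col, (low, high) in valid_ranges.items()
}
print(in_range)
```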
Task 5.5: Check for Correct Data Types
Write assert statements to check if the likes, shares, and comments columns contain numerical data.
Hint: Use the assert keyword along with the dtype attribute to check that the data type of each specified column is int64.
Solution:
assert df['likes'].dtype == 'int64'
assert df['shares'].dtype == 'int64'
assert df['comments'].dtype == 'int64'
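Comparing against int64 is an exact check, so a column of floats would fail it even though it is numerical. If any numeric type is acceptable, pandas also provides is_numeric_dtype in pandas.api.types. The sketch below compares the two checks on made-up data; whether the looser check suits the lab's dataset is an assumption.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sample: likes are integers, shares are floats, comments are text.
sample = pd.DataFrame({
    "likes": np.array([10, 20], dtype="int64"),
    "shares": [1.5, 2.0],
    "comments": ["3", "4"],
})

# Exact check: True only for 64-bit integers.
exact = {col: bool(sample[col].dtype == "int64") for col in sample.columns}

# Looser check: True for any numeric dtype, including floats.
numeric = {col: bool(is_numeric_dtype(sample[col])) for col in sample.columns}
```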
Logical Test for Clean Data Using Asserts
To review the concepts covered in this step, please refer to the Using Assert-based Tests for Data Cleaning module of the Validate Data Cleanliness Using Asserts in Python course.
Logical tests using asserts are important because they help detect data entry errors, inconsistencies, and unreasonable values, ensuring data coherence and consistency.
Finally, let's write a logical test using asserts to check for data coherence in our DataFrame. For example, you could write a test to check that the number of likes is always greater than or equal to the number of shares and comments.
Task 6.1: Import Required Libraries
Before we can start working with our data, we need to import the necessary libraries. For this task, import the pandas library as pd.
Hint: Use the import keyword followed by the library name and the as keyword to give it a short alias. For example, import _________ as pd.
Solution:
import pandas as pd
Task 6.2: Load the Dataset
Now that we have our libraries imported, let's load our dataset. The dataset is stored in a CSV file named 'time_series_data.csv'. Load this file into a DataFrame and display the first few rows of the DataFrame.
Hint: Use the pd.read_csv() function to read the CSV file and store it in a variable. Then, use the .head() method on the DataFrame to display the first few rows.
Solution:
df = pd.read_csv('time_series_data.csv')
print(df.head())
Task 6.3: Write a Logical Test Using Assert
Now, let's write a logical test using assert to check that the number of likes is always greater than or equal to the number of shares and comments. If the condition is not met, the assert statement will raise an AssertionError.
Hint: Use the assert keyword followed by the condition you want to check. For example: assert (df[col1] >= df[col2]).all()
Solution:
assert (df['likes'] >= df['shares']).all() and (df['likes'] >= df['comments']).all()
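A single combined assert reports only that the rule was broken somewhere. As an alternative sketch on made-up data, DataFrame.le can compare several columns against likes at once and report a result per column. The name within_likes is hypothetical.

```python
import pandas as pd

# Hypothetical sample in which the second row has more comments than likes.
sample = pd.DataFrame({"likes": [100, 50], "shares": [10, 5], "comments": [20, 60]})

# Compare each column to likes row by row; axis=0 aligns likes on the row index.
within_likes = sample[["shares", "comments"]].le(sample["likes"], axis=0).all()
print(within_likes)
```

Here shares satisfies the rule in every row while comments does not, so asserting within_likes.all() would fail and the printed series shows which column is responsible.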