
Manage Invalid, Duplicate, and Missing Data in Python Hands-on Practice
In this lab, you will tackle key data cleaning challenges using Python and pandas. You'll identify and handle missing values, remove duplicates, and correct invalid data. Along the way, you will use functions like isna(), dropna(), fillna(), and drop_duplicates() to ensure data accuracy, a vital skill for effective data analysis.

Challenge: Identifying and Handling Missing Values
Jupyter Guide
To get started, open the file on the right entitled "Step 1...". You'll complete each task for Step 1 in that Jupyter Notebook file. Remember, you must run the cells (Ctrl/Cmd + Enter) for each task before moving on to the next task in the Jupyter Notebook. Continue until you have completed all tasks in this step. When you are ready to move on, come back and click on the file for the next step, and repeat until you have completed all tasks in all steps of the lab.
Identifying and Handling Missing Values
To review the concepts covered in this step, please refer to the Handling Missing Values in Your Data module of the Manage Invalid, Duplicate, and Missing Data in Python course.
Let's get our hands dirty with some real data cleaning! In this step, we will be practicing how to identify and handle missing values in our dataset. Our goal is to ensure that our dataset is free from any NaN values that could potentially skew our analysis. We will be using the isna(), dropna(), and fillna() functions from the pandas library to achieve this. Remember, clean data is happy data!
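As a warm-up, here's a minimal sketch of these three functions on a tiny toy DataFrame (not the lab's dataset):

```python
import pandas as pd
import numpy as np

# A tiny toy frame (not the lab's dataset) with one missing price
toy = pd.DataFrame({'item': ['a', 'b', 'c'], 'price': [1.0, np.nan, 3.0]})

# isna() marks each missing cell as True
print(toy['price'].isna().tolist())       # [False, True, False]

# dropna() removes the row containing the NaN
print(len(toy.dropna()))                  # 2

# fillna() replaces the NaN with a value instead of dropping the row
print(toy['price'].fillna(0.0).tolist())  # [1.0, 0.0, 3.0]
```

In the lab you'll apply the same calls to the real transactions file.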
Task 1.1: Importing Libraries and Loading the Dataset
First up, let's get our environment ready. Import pandas and load 'Modified_Transactions.csv' into a DataFrame. Take a moment to display the first few rows with df.head(). This will give you a glimpse of the dataset, especially the missing values we've introduced for this exercise.
Hint
Start by importing pandas with import pandas as pd. Then load the dataset using pd.read_csv(filepath) and showcase the first few rows with df.head().
Solution
import pandas as pd

# Load the dataset
df = pd.read_csv('Modified_Transactions.csv')
df.head()
Task 1.2: Identifying Missing Values
Now, let's find out where these missing values are. By calculating missing values per column using isna(), you can gauge the extent of missing data. Display the number of NaN values in each column.
Hint
Use isna() combined with sum(axis=0) on df to count missing values in each column. With axis=0, the sum is computed down each column, along the first (0th) axis, producing one count per column.
Solution
# Identify missing values per column
df.isna().sum(axis=0)
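If the axis argument is confusing, this small sketch (on a toy frame, not the lab's data) contrasts the two directions:

```python
import pandas as pd
import numpy as np

# Toy frame (not the lab's dataset) with NaNs in different spots
toy = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})

# axis=0: sum down each column -> one count per column
print(toy.isna().sum(axis=0).to_dict())  # {'a': 1, 'b': 2}

# axis=1: sum across each row -> one count per row
print(toy.isna().sum(axis=1).tolist())   # [1, 2]
```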
Task 1.3: Removing Rows with Missing Values
Let's practice cleaning the data by removing rows with any missing values. Use df.dropna() and assign the result to a new variable. Compare the original and the new DataFrame sizes.
Hint
Apply dropna() to df to remove rows with missing values. Store the cleaned data in df_dropped. Compare the shapes of both DataFrames to see how many rows were affected.
Solution
# Remove rows with missing values
df_dropped = df.dropna()
print("Original shape:", df.shape)
print("New shape:", df_dropped.shape)
Task 1.4: Filling Missing Values
Finally, let's fill in some missing values instead of removing them. Use df.fillna() to fill the missing values in the price column. Fill the missing values with the mean of the price column, using the Series.mean() method to compute the mean. Make a copy of the original DataFrame using the provided code and make your edits to this copy. Display the first few rows to verify the results.
Hint
First, calculate the mean of the price column with df['price'].mean(). Then use fillna() on the price column, passing in the mean value as the argument.
Solution
# Provided Code
df_filled = df.copy()

# Calculate the mean of the price column
price_mean = df['price'].mean()

# Fill missing values in the price column with the mean
df_filled['price'] = df['price'].fillna(price_mean)

# Display the first few rows to verify the changes
df_filled.head()
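As an aside, the mean is not the only reasonable fill value. This toy sketch (not the lab's data) shows the median as an alternative, which is less sensitive to outliers:

```python
import pandas as pd
import numpy as np

# Toy price series (not the lab's data) with a missing value
prices = pd.Series([10.0, np.nan, 30.0, 100.0])

# mean() and median() both skip NaN by default
filled_mean = prices.fillna(prices.mean())      # mean of [10, 30, 100] is about 46.67
filled_median = prices.fillna(prices.median())  # median of [10, 30, 100] is 30.0

print(filled_median.tolist())  # [10.0, 30.0, 30.0, 100.0]
```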
Challenge: Identifying and Removing Duplicate Rows
Identifying and Removing Duplicate Rows
To review the concepts covered in this step, please refer to the Handling Duplicate Values in Your Data module of the Manage Invalid, Duplicate, and Missing Data in Python course.
Now that we've dealt with missing values, let's move on to another common data cleaning task: dealing with duplicate rows. In this step, we will be practicing how to identify and remove duplicate rows from our dataset using the duplicated() and drop_duplicates() functions in pandas. By the end of this step, our dataset should be free from any redundant information.
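Before working on the real file, here's a minimal sketch of both functions on a toy DataFrame (not the lab's dataset):

```python
import pandas as pd

# Toy frame (not the lab's dataset) with one exact duplicate row
toy = pd.DataFrame({'id': [1, 2, 2], 'amount': [5, 7, 7]})

# duplicated() marks repeats of earlier rows; the first occurrence stays False
print(toy.duplicated().tolist())   # [False, False, True]

# drop_duplicates() keeps the first occurrence of each row
print(len(toy.drop_duplicates()))  # 2
```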
Task 2.1: Importing Necessary Libraries
Before we start working with our data, we need to import the necessary libraries. In this task, you will import the pandas library which we will use for data manipulation and analysis.
Hint
Use the import keyword to import the pandas library. It is common practice to import pandas as pd.
Solution
import pandas as pd
Task 2.2: Loading the Dataset
Now that we have imported the necessary libraries, the next step is to load our dataset. In this task, you will load the dataset into a pandas DataFrame. The data is in Duplicated_Transactions.csv. Once the data is loaded, save it as df and display the first few rows. Notice that there should be some duplicated rows within the first few rows.
Hint
Use the pd.read_csv() function to read the CSV file. The file path is 'Duplicated_Transactions.csv'. Assign the result to a variable named df. Display the first few rows with the DataFrame.head() method.
Solution
df = pd.read_csv('Duplicated_Transactions.csv')
df.head()
Task 2.3: Identifying Duplicate Rows
With our data loaded, we can now start identifying duplicate rows. In this task, you will use the DataFrame.duplicated() method to identify duplicate rows in our DataFrame. Add the results to a new column, Duplicate. Display the head of df after adding the column. Notice that the first occurrence of each duplicate set is not marked as a duplicate.
Hint
Use the duplicated() method on the DataFrame df. This will return a Series of booleans where True indicates a duplicate row. Assign the result to a new column, df['Duplicate'].
Solution
duplicates = df.duplicated()
df['Duplicate'] = duplicates
df.head()
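The duplicated() method also accepts a keep parameter that controls which copy is spared. This toy sketch (not the lab's data) contrasts the options:

```python
import pandas as pd

# Toy frame (not the lab's dataset) with a repeated row
toy = pd.DataFrame({'x': [1, 1, 2]})

# Default keep='first': the first copy is not flagged
print(toy.duplicated().tolist())             # [False, True, False]

# keep='last': the last copy is the one spared instead
print(toy.duplicated(keep='last').tolist())  # [True, False, False]

# keep=False: every member of a duplicate set is flagged
print(toy.duplicated(keep=False).tolist())   # [True, True, False]
```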
Task 2.4: Counting Duplicate Rows
After identifying the duplicate rows, it would be useful to know how many there are. In this task, you will print the number of duplicate rows in our DataFrame by summing the Duplicate column we created in Task 2.3.
Hint
Use the sum() method on the Duplicate column to count the number of True values (i.e., the number of duplicate rows).
Solution
num_duplicates = duplicates.sum()
print(num_duplicates)
Task 2.5: Removing Duplicate Rows
Now that we know how many duplicate rows there are, the next step is to remove them. In this task, you will first remove the Duplicate column using the provided code, and then use the drop_duplicates() function to remove duplicate rows from our DataFrame. Assign the result to a new DataFrame. Display the shape of the original df and the new DataFrame.
Hint
Use the drop_duplicates() function on the DataFrame df. This will return a new DataFrame with the duplicate rows removed. Use the DataFrame.shape attribute to access the shape of the DataFrame.
Solution
# Provided code to remove the 'Duplicate' column
df.drop('Duplicate', axis=1, inplace=True)

clean_df = df.drop_duplicates()
print("Original:", df.shape)
print("Deduplicated:", clean_df.shape)
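A related option worth knowing: drop_duplicates() accepts a subset parameter, so rows can be compared on selected columns only. A toy sketch (not the lab's data):

```python
import pandas as pd

# Toy frame (not the lab's dataset): repeated id, differing amounts
toy = pd.DataFrame({'id': [1, 1, 2], 'amount': [5, 9, 5]})

# Full-row comparison: no two rows are identical, nothing is dropped
print(len(toy.drop_duplicates()))               # 3

# subset=['id']: rows count as duplicates when only 'id' matches
print(len(toy.drop_duplicates(subset=['id'])))  # 2
```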
Challenge: Identifying and Handling Invalid Values
Identifying and Handling Invalid Values
To review the concepts covered in this step, please refer to the Handling Invalid Values in Your Data module of the Manage Invalid, Duplicate, and Missing Data in Python course.
We're almost there! The final step in our data cleaning journey involves dealing with invalid values. These could be values that are out of range, of the wrong data type, or simply erroneous. In this step, we will be practicing how to identify and handle such values using pandas. By the end of this step, our dataset should be clean, accurate, and ready for analysis!
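The core pattern for this step is a boolean mask plus .loc assignment. Here's a minimal sketch on a toy frame (not the lab's dataset):

```python
import pandas as pd
import numpy as np

# Toy frame (not the lab's dataset) with a negative, invalid price
toy = pd.DataFrame({'price': [4.5, -1.0, 9.9]})

# A boolean mask picks out the out-of-range values
invalid = toy['price'] < 0
print(invalid.tolist())           # [False, True, False]

# .loc with the mask replaces only the flagged cells with NaN
toy.loc[invalid, 'price'] = np.nan
print(toy['price'].isna().sum())  # 1
```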
Task 3.1: Load the Dataset
Start by importing pandas and loading the dataset into a pandas DataFrame. The file path is 'Modified_Transactions.csv'. Display the first few rows of the data.
Hint
Use the pd.read_csv() function to load the dataset. The file path should be passed as a string argument to this function.
Solution
import pandas as pd

# Load the dataset
df = pd.read_csv('Modified_Transactions.csv')
df.head()
Task 3.2: Identify Invalid Values
Identify any invalid values in the 'price' column. For this dataset, a valid price is any number greater than or equal to 0. Print the rows with the invalid values.
Hint
Create a boolean mask using the less than (<) operator against the column's values to flag the invalid (negative) prices. Index the DataFrame's rows using this mask to produce a subset of the original data. Display the subsetted data.
Solution
# Identify invalid values
invalid_values = df['price'] < 0
df[invalid_values]
Task 3.3: Handle Invalid Values
Begin by importing the numpy library, which will be used to represent NaN values. Then, proceed to replace any invalid values in the 'price' column with NaN. After making these replacements, confirm that the changes are correct by displaying the rows that initially contained invalid values.
Hint
Utilize np.nan from the numpy library for NaN values. Apply the boolean mask, created in Task 3.2, to accurately identify and replace the invalid values in the DataFrame. For example, if your boolean mask is named invalid_values, index the DataFrame with df.loc[invalid_values, 'price'] to specifically target the 'price' column in the rows flagged by the mask.
Solution
import numpy as np

# Replace invalid values with NaN
df.loc[invalid_values, 'price'] = np.nan

# Verify the changes by displaying the rows that originally had invalid values
df[invalid_values]
Task 3.4: Explore for More Invalid Data
Congratulations on reaching this point! There is additional invalid data in the Modified_Transactions.csv dataset. If you want more practice, search for and replace any other invalid data.
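If you'd like a starting point, two general-purpose checks are describe() for impossible numeric ranges and value_counts() for suspicious labels. The column names in this sketch ('quantity', 'category') are illustrative assumptions, not necessarily columns of the real file:

```python
import pandas as pd

# Toy frame standing in for the transactions data; the column names
# ('quantity', 'category') are illustrative assumptions
toy = pd.DataFrame({'quantity': [1, -3, 2],
                    'category': ['food', 'food', 'fod']})

# describe() surfaces impossible ranges, e.g. a negative minimum
print(toy['quantity'].describe()['min'])         # -3.0

# value_counts() surfaces rare labels that may be typos
print(toy['category'].value_counts().to_dict())  # {'food': 2, 'fod': 1}
```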