
Manage Invalid, Duplicate, and Missing Data in Python Hands-on Practice

In this lab, you will tackle key data cleaning challenges using Python and pandas. You'll identify and handle missing values, remove duplicates, and correct invalid data. Along the way, you will use functions like isna(), dropna(), fillna(), and drop_duplicates() to ensure data accuracy, a vital skill for effective data analysis.


Path Info

Level: Beginner
Duration: 29m
Published: Dec 13, 2023


Table of Contents

  1. Challenge

    Identifying and Handling Missing Values

    Jupyter Guide

    To get started, open the file on the right entitled "Step 1...". You'll complete each task for Step 1 in that Jupyter Notebook file. Remember, you must run the cells (Ctrl/Cmd + Enter) for each task before moving on to the next task. Continue until you have completed all tasks in this step. When you are ready for the next step, come back and click on the file for that step, and repeat until you have completed all tasks in every step of the lab.


    Identifying and Handling Missing Values

    To review the concepts covered in this step, please refer to the Handling Missing Values in Your Data module of the Manage Invalid, Duplicate, and Missing Data in Python course.

    Let's get our hands dirty with some real data cleaning! In this step, we will be practicing how to identify and handle missing values in our dataset. Our goal is to ensure that our dataset is free from any NaN values that could potentially skew our analysis. We will be using the isna(), dropna(), and fillna() functions from the pandas library to achieve this. Remember, clean data is happy data!
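    Before working on the lab's dataset, here is a minimal sketch of the three functions in action on a tiny toy DataFrame (not the lab's data), just to show what each call returns:

    ```python
    import pandas as pd
    import numpy as np

    # Toy DataFrame with one deliberate NaN in the 'price' column
    df = pd.DataFrame({"item": ["a", "b", "c"], "price": [1.0, np.nan, 3.0]})

    print(df.isna().sum())          # NaN count per column: item 0, price 1
    print(df.dropna().shape)        # (2, 2) after dropping the row with NaN
    print(df["price"].fillna(0.0))  # the NaN replaced with 0.0
    ```

    Note that dropna() and fillna() return new objects by default rather than modifying df in place.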


    Task 1.1: Importing Libraries and Loading the Dataset

    First up, let's get our environment ready. Import pandas and load 'Modified_Transactions.csv' into a DataFrame. Take a moment to display the first few rows with df.head(). This will give you a glimpse of the dataset, especially the missing values we've introduced for this exercise.

    πŸ” Hint

    Start by importing pandas with import pandas as pd. Then load the dataset using pd.read_csv(filepath) and showcase the first few rows with df.head().

    πŸ”‘ Solution
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('Modified_Transactions.csv')
    df.head()
    

    Task 1.2: Identifying Missing Values

    Now, let's find out where these missing values are. By calculating missing values per column using isna(), you can gauge the extent of missing data. Display the number of NaN values in each column.

    πŸ” Hint

    Use isna() combined with sum(axis=0) on df to count missing values for each column. The axis=0 argument sums down the first (0th) axis, i.e., along each column, giving the number of NaN values per column.

    πŸ”‘ Solution
    # Identify missing values per col
    df.isna().sum(axis=0)
    

    Task 1.3: Removing Rows with Missing Values

    Let's practice cleaning the data by removing rows with any missing values. Use df.dropna() and assign the result to a new variable. Compare the original and the new DataFrame sizes.

    πŸ” Hint

    Apply dropna() to df to remove rows with missing values. Store the cleaned data in df_dropped. Compare the shapes of both DataFrames to see how many rows were affected.

    πŸ”‘ Solution
    # Remove rows with missing values
    df_dropped = df.dropna()
    print("Original shape:", df.shape)
    print("New shape:", df_dropped.shape)
    

    Task 1.4: Filling Missing Values

    Finally, let's fill in some missing values instead of removing them. Use df.fillna() to fill the missing values in the price column. Fill the missing values with the mean of the price column. Use the Series.mean() method to compute the mean.

    Make a copy of the original DataFrame using the provided code. Make your edits to this copy. Display the first few rows to verify the results.

    πŸ” Hint

    First, calculate the mean of the price column with df['price'].mean(). Then use fillna() on the price column, passing in the mean value as the argument.

    πŸ”‘ Solution
    # Provided Code
    df_filled = df.copy()
    
    # Calculate the mean of the price column
    price_mean = df['price'].mean()
    
    # Fill missing values in the price column with the mean
    df_filled['price'] = df['price'].fillna(price_mean)
    
    # Display the first few rows to verify the changes
    df_filled.head()
    
  2. Challenge

    Identifying and Removing Duplicate Rows

    Identifying and Removing Duplicate Rows

    To review the concepts covered in this step, please refer to the Handling Duplicate Values in Your Data module of the Manage Invalid, Duplicate, and Missing Data in Python course.

    Now that we've dealt with missing values, let's move on to another common data cleaning task: dealing with duplicate rows. In this step, we will be practicing how to identify and remove duplicate rows from our dataset using the duplicated() and drop_duplicates() functions in pandas. By the end of this step, our dataset should be free from any redundant information.
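    As a quick preview on toy data (not the lab's dataset), here is how the two functions behave on a DataFrame with one exact duplicate row:

    ```python
    import pandas as pd

    # Toy DataFrame where the third row exactly repeats the second
    df = pd.DataFrame({"id": [1, 2, 2], "amount": [10, 20, 20]})

    print(df.duplicated())       # False, False, True: the repeat is flagged
    print(df.drop_duplicates())  # two rows remain after deduplication
    ```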


    Task 2.1: Importing Necessary Libraries

    Before we start working with our data, we need to import the necessary libraries. In this task, you will import the pandas library which we will use for data manipulation and analysis.

    πŸ” Hint

    Use the import keyword to import the pandas library. It is common practice to import pandas as pd.

    πŸ”‘ Solution
    import pandas as pd
    

    Task 2.2: Loading the Dataset

    Now that we have imported the necessary libraries, the next step is to load our dataset. In this task, you will load the dataset into a pandas DataFrame. The data is in Duplicated_Transactions.csv.

    Once the data is loaded, save it as df and display the first few rows. Notice that there should be some duplicated rows within the first few rows.

    πŸ” Hint

    Use the pd.read_csv() function to read the CSV file. The file path is 'Duplicated_Transactions.csv'. Assign the result to a variable named df. Display the first few rows with the DataFrame.head() method.

    πŸ”‘ Solution
    df = pd.read_csv('Duplicated_Transactions.csv')
    df.head()
    

    Task 2.3: Identifying Duplicate Rows

    With our data loaded, we can now start identifying duplicate rows. In this task, you will use the DataFrame.duplicated() method to identify duplicate rows in our DataFrame. Add the results to a new column, Duplicate.

    Display the head of df after adding the column. Notice that the first occurrence of the duplicate set is not marked as a duplicate.

    πŸ” Hint

    Use the duplicated() function on the DataFrame df. This will return a Series of booleans where True indicates a duplicate row. Assign the result to a new column, df['Duplicate'].

    πŸ”‘ Solution
    duplicates = df.duplicated()
    df['Duplicate'] = duplicates
    df.head()
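    Why isn't the first occurrence marked? duplicated() defaults to keep='first', which leaves the first row of each duplicate set unflagged. A small sketch on toy data (not the lab's dataset) contrasting this with keep=False, which flags every member of a duplicate set:

    ```python
    import pandas as pd

    # Toy DataFrame where the first two rows are identical
    df = pd.DataFrame({"id": [1, 1, 2], "amount": [10, 10, 30]})

    print(df.duplicated().tolist())            # [False, True, False]
    print(df.duplicated(keep=False).tolist())  # [True, True, False]
    ```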
    

    Task 2.4: Counting Duplicate Rows

    After identifying the duplicate rows, it would be useful to know how many there are. In this task, you will print the number of duplicate rows in our DataFrame by summing the Duplicate column we created in Task 2.3.

    πŸ” Hint

    Use the sum() function on the Duplicate column to count the number of True values (i.e., the number of duplicate rows).

    πŸ”‘ Solution
    num_duplicates = duplicates.sum()
    print(num_duplicates)
    

    Task 2.5: Removing Duplicate Rows

    Now that we know how many duplicate rows there are, the next step is to remove them. In this task, first remove the Duplicate column using the provided code, then use the drop_duplicates() function to remove duplicate rows from our DataFrame. Assign the result to a new DataFrame.

    Display the shape of the original df and the new DataFrame.

    πŸ” Hint

    Use the drop_duplicates() function on the DataFrame df. This will return a new DataFrame with the duplicate rows removed. Use the DataFrame.shape attribute to access the shape of the DataFrame.

    πŸ”‘ Solution
    # Provided code to remove the 'Duplicate' column
    df.drop('Duplicate', axis=1, inplace=True)
    
    clean_df = df.drop_duplicates()
    print("Original:", df.shape)
    print("Deduplicated:", clean_df.shape)
    
  3. Challenge

    Identifying and Handling Invalid Values

    Identifying and Handling Invalid Values

    To review the concepts covered in this step, please refer to the Handling Invalid Values in Your Data module of the Manage Invalid, Duplicate, and Missing Data in Python course.

    We're almost there! The final step in our data cleaning journey involves dealing with invalid values. These could be values that are out of range, of the wrong data type, or simply erroneous. In this step, we will be practicing how to identify and handle such values using pandas. By the end of this step, our dataset should be clean, accurate, and ready for analysis!
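    The core pattern for this step is: build a boolean mask describing the invalid rows, then use it to inspect or overwrite those values. A minimal sketch on toy data (not the lab's dataset), assuming negative prices are invalid:

    ```python
    import pandas as pd
    import numpy as np

    # Toy DataFrame with one out-of-range (negative) price
    df = pd.DataFrame({"price": [4.99, -1.0, 12.50]})

    invalid = df["price"] < 0          # boolean mask of invalid rows
    df.loc[invalid, "price"] = np.nan  # replace invalid values with NaN
    print(df["price"].isna().sum())    # 1 value was replaced
    ```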


    Task 3.1: Load the Dataset

    Start by importing pandas and loading the dataset into a pandas DataFrame. The file path is 'Modified_Transactions.csv'. Display the first few rows of the data and

    πŸ” Hint

    Use the pd.read_csv() function to load the dataset. The file path should be passed as a string argument to this function.

    πŸ”‘ Solution
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('Modified_Transactions.csv')
    df.head()
    

    Task 3.2: Identify Invalid Values

    Identify any invalid values in the 'price' column. For this dataset, a valid price is any number greater than 0. Print the rows with the invalid values.

    πŸ” Hint

    Create a boolean mask using the less than (<) operator against the column's values to flag prices below 0. Index the DataFrame's rows using this mask to produce a subset of the original data. Display the subsetted data.

    πŸ”‘ Solution
    # Identify invalid values
    invalid_values = df['price'] < 0
    df[invalid_values]
    

    Task 3.3: Handle Invalid Values

    Begin by importing the numpy library, which will be used to represent NaN values. Then, proceed to replace any invalid values in the 'price' column with NaN. After making these replacements, confirm that the changes are correct by displaying the rows that initially contained invalid values.

    πŸ” Hint

    Utilize np.nan from the numpy library for NaN values. Apply the boolean mask, created in Task 3.2, to accurately identify and replace the invalid values in the DataFrame. For example, if your boolean mask is named invalid_values, index the DataFrame with df.loc[invalid_values, 'price'] to specifically target the 'price' column in the rows flagged by the mask.

    πŸ”‘ Solution
    import numpy as np
    
    # Replace invalid values with NaN
    df.loc[invalid_values, 'price'] = np.nan
    
    # Verify the changes by displaying the rows that originally had invalid values
    df[invalid_values]
    

    Task 3.4: Explore for More Invalid Data

    πŸŽ‰ Congratulations on reaching this point! There is additional invalid data in the Modified_Transactions.csv dataset. If you want more practice, search for and replace any other invalid data.
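    One way to start the search is to scan every numeric column for suspicious values. The helper below is only a hypothetical starting point (the column names are illustrative, not from the lab's dataset); inspect df.dtypes and df.describe() to decide what actually counts as "invalid" for each column:

    ```python
    import pandas as pd

    def count_negatives(df: pd.DataFrame) -> pd.Series:
        """Count negative values in every numeric column."""
        numeric = df.select_dtypes(include="number")
        return (numeric < 0).sum()

    # Illustrative toy data; in the lab you would pass your loaded df
    df = pd.DataFrame({"price": [5.0, -2.0], "quantity": [3, -1]})
    print(count_negatives(df))  # one negative value in each column
    ```

    Negative values are only one kind of invalid data; you might also check for impossible dates, zero quantities, or out-of-range categories, depending on what the columns represent.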
