
Cleaning Data with Pandas Hands-on Practice

This lab provides an environment to practice cleaning data from various datasets. It covers missing, illogical, and duplicate data, as well as correlation analysis and categorical variable encoding.


Path Info

Level: Beginner
Duration: 38m
Published: Nov 14, 2023


Table of Contents

  1. Challenge

    Exploring and Cleaning the Dataset


    To review the concepts covered in this step, please refer to the Introduction to Data Cleaning with Pandas module of the Cleaning Data with Pandas course.

    Data exploration and cleaning are important because they allow us to understand the structure and content of our data, and ensure that it is in the right format for further analysis. This step will involve importing the dataset, exploring its features, and dealing with missing and duplicate values.

    In this step, you will practice importing a dataset using Pandas, and then explore its features using various Pandas functions such as shape, head, info, and describe. You will also practice dealing with missing and duplicate data. The goal is to familiarize yourself with the dataset and ensure it is clean and ready for further analysis. Tools used in this step include the Pandas library and its various functions for data import, exploration, and cleaning.
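    The workflow above can be sketched on a tiny, made-up frame before you open the lab notebook. The column names below echo the lab's 'employees.csv', but the values are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Invented stand-in for employees.csv (values are hypothetical)
df = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Ana', 'Dee'],
    'Gender': ['F', 'M', 'F', None],
    'Salary': [50000.0, np.nan, 50000.0, 61000.0],
})

print(df.shape)               # (4, 3) -- 4 rows, 3 columns
print(df.isnull().sum())      # one missing Gender, one missing Salary
print(df.duplicated().sum())  # row 2 repeats row 0, so 1 duplicate
```

    The same calls (shape, isnull().sum(), duplicated().sum()) are what the tasks below apply to the real dataset.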

    After completing each task, run the Jupyter Notebook cell with Ctrl/Cmd + Enter to apply your changes.


    Task 1.1: Importing the Dataset

    Open the file Step 1 Exploring_and_Cleaning_the_Dataset.ipynb, import the 'employees.csv' dataset using pandas, and assign it to a variable named df.

    πŸ” Hint

    Use the pd.read_csv() function to import the dataset. The file path is 'employees.csv'.

    πŸ”‘ Solution
    import pandas as pd
    
    df = pd.read_csv('employees.csv')
    

    Task 1.2: Exploring the Dataset

    Explore the dataset using the shape, head, info, and describe functions.

    πŸ” Hint

    Use the shape attribute to get the number of rows and columns, head() function to get the first 5 rows, info() function to get a summary of the dataset, and describe() function to get statistical details of the dataset.

    πŸ”‘ Solution
    print('Shape:', df.shape)
    print('Head:')
    print(df.head())
    print('Info:')
    df.info()  # info() prints its summary directly and returns None
    print('Describe:')
    print(df.describe())
    

    Task 1.3: Checking for Missing Values

    Check for missing values in the dataset.

    πŸ” Hint

    Use the isnull() function followed by the sum() function to get the total number of missing values in each column.

    πŸ”‘ Solution
    print('Missing values:')
    print(df.isnull().sum())
    

    Task 1.4: Dropping Rows with Missing Values

    Drop all the rows with missing values in the 'Gender' column. Print the shape of the original dataframe and the new dataframe.

    πŸ” Hint

    You can use the dropna() method to remove rows with missing values. Use the subset keyword argument to drop only the rows with null values in the 'Gender' column.

    πŸ”‘ Solution
    print(df.shape)

    df.dropna(subset=['Gender'], inplace=True)
    # or
    # df = df.dropna(subset=['Gender'])

    print(df.shape)
    

    Task 1.5: Replacing Missing Values

    Replace all missing values in the 'Salary' column with the mean salary.

    πŸ” Hint

    You can use the fillna() method to replace missing values. fillna() is a Series method, so call it on the column and assign the result back to that column. mean() will return the mean of a Series. Here's an example:

    mean = df[col].mean()
    
    πŸ”‘ Solution
    mean = df['Salary'].mean()
    df['Salary'] = df['Salary'].fillna(mean)
    

    Task 1.6: Checking for Duplicate Values

    Check for duplicate values in the dataset.

    πŸ” Hint

    Use the duplicated() function followed by the sum() function to get the total number of duplicate rows.

    πŸ”‘ Solution
    print('Duplicate values:', df.duplicated().sum())
    

    Task 1.7: Dropping Rows with Duplicate Values

    If there are any duplicate values, drop them, keeping only the first occurrence. Print the shape of the new dataframe.

    πŸ” Hint

    You can use the drop_duplicates() function to remove duplicate rows. The keep='first' keyword argument keeps the first occurrence of any duplicates.

    πŸ”‘ Solution
    df = df.drop_duplicates(keep='first')
    # or
    # df.drop_duplicates(inplace=True, keep='first')
    
    print(df.shape)
    
  2. Challenge

    Handling Missing, Illogical, and Duplicate Data


    To review the concepts covered in this step, please refer to the Introduction to Data Cleaning with Pandas module of the Cleaning Data with Pandas course.

    Handling missing, illogical, and duplicate data is important because these issues can significantly affect the accuracy of our data analysis and machine learning models. This step will involve identifying and dealing with these issues in our dataset.

    In this step, you will practice identifying and dealing with illogical data in the dataset. You will use Pandas methods such as drop and value_counts to handle these issues. The goal is to ensure that your dataset is clean and ready for further analysis.
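    As a sketch of the drop-by-index pattern the tasks use, here it is on a small invented Gender column (the 'unknown' entry stands in for the illogical values in the real data):

```python
import pandas as pd

# Invented data: one row carries an illogical 'unknown' gender
df = pd.DataFrame({'Gender': ['Male', 'Female', 'unknown', 'Male']})

print(df['Gender'].value_counts())  # value_counts() exposes the stray category

# Collect the offending row indices, then drop them
indices = df.index[df['Gender'] == 'unknown']
df = df.drop(indices)

print(df['Gender'].value_counts())  # only 'Male' and 'Female' remain
```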

    After completing each task, run the Jupyter Notebook cell with Ctrl/Cmd + Enter to apply your changes.


    Task 2.1: Explore the Gender Column

    Open the file Step 2 Handling_Missing__Illogical__and_Duplicate_Data.ipynb, load the 'employees.csv' dataset into a dataframe, and identify the unique values in the 'Gender' column.

    πŸ” Hint

    Use the value_counts() method to see how many unique values there are in the 'Gender' column.

    πŸ”‘ Solution
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('employees.csv')
    
    # Print the value_counts() of the gender column
    df.Gender.value_counts()
    

    Task 2.2: Drop Illogical Data

    Handle the unknown Gender values by dropping those rows where gender is 'unknown'. Print the shape of the original dataframe and the new dataframe once the rows are dropped. The code provided will give you the rows to drop.

    πŸ” Hint

    Use df.drop with the indices of the rows to drop. Get the indices using the provided code.

    πŸ”‘ Solution
    # Print the shape of the original df
    print(df.shape)
    
    # Provided code
    indices = df.index[df['Gender'] == 'unknown']

    # Handle 'unknown' Gender
    df = df.drop(indices)
    
    # Print the shape of the new df
    print(df.shape)
    

    Task 2.3: Unify the Gender Column

    Use the provided code to convert the 'Male' and 'Female' values to 'M' and 'F'. Print the value counts after you clean the 'Gender' column.

    πŸ” Hint

    To replace 'Male' and 'Female' with 'M' and 'F', use the provided code, which assumes your data is stored in a dataframe named df.

    πŸ”‘ Solution
    # Provided Code
    df.loc[df.Gender=='Female', 'Gender'] = 'F'
    df.loc[df.Gender=='Male', 'Gender'] = 'M'
    
    df.Gender.value_counts()
    
  3. Challenge

    Conducting Correlation Analysis


    To review the concepts covered in this step, please refer to the Correlation Analysis and Data Preparation module of the Cleaning Data with Pandas course.

    Correlation analysis is important because it helps us understand the relationships between different features in our dataset. This can help us identify redundant features and improve the accuracy of our data analysis and machine learning models. This step will involve conducting correlation analysis on our dataset.

    In this step, you will practice conducting correlation analysis on the dataset. You will use the corr function in Pandas to calculate correlation coefficients, and then use a heatmap to visualize these correlations.
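    On invented numbers, the idea looks like this: two related columns produce a coefficient near 1, while an unrelated column stays near 0 (all values below are hypothetical):

```python
import pandas as pd

# Hypothetical study data: score rises with hours; shoe size is unrelated
df = pd.DataFrame({
    'hours': [1, 2, 3, 4, 5],
    'score': [52, 58, 61, 70, 75],
    'shoe':  [7, 9, 6, 8, 7],
})

correlation = df.corr()
print(correlation)
# hours and score correlate strongly (about 0.99);
# shoe is only weakly correlated with either
```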

    After completing each task, run the Jupyter Notebook cell with Ctrl/Cmd + Enter to apply your changes.


    Task 3.1: Load the Dataset

    Open the Step 3 Conducting_Correlation_Analysis.ipynb and load the dataset 'Student_Scores.csv' into a pandas DataFrame. Print the first 5 rows.

    πŸ” Hint

    Use the pd.read_csv() function to read the csv file. For example, df = pd.read_csv('file_name.csv'). Print the first 5 rows with the .head() method.

    πŸ”‘ Solution
    import pandas as pd
    
    df = pd.read_csv('Student_Scores.csv')
    print(df.head())
    

    Task 3.2: Calculate Correlation Coefficients

    Calculate the correlation coefficients between the different numeric features in the dataset using the corr function in pandas. Save the result to a variable named correlation. Print correlation.

    πŸ” Hint

    Use the corr() method on the DataFrame to calculate the correlation coefficients. Pass numeric_only=True so that non-numeric columns are omitted; in recent pandas versions, corr() raises an error on non-numeric columns otherwise.

    πŸ”‘ Solution
    correlation = df.corr(numeric_only=True)
    print(correlation)
    

    Task 3.3: Visualize Correlations with a Heatmap

    Use the provided code to visualize the correlation matrix returned from the corr() method used in Task 3.2.

    πŸ” Hint

    The provided code assumes that the results of the corr() method were saved to a variable named correlation.

    πŸ”‘ Solution
    # Provided code 
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Create a heatmap using seaborn
    sns.heatmap(correlation, annot=True, cmap='coolwarm')
    plt.title('Correlation Heatmap')
    plt.show()
    
  4. Challenge

    Encoding Categorical Features


    To review the concepts covered in this step, please refer to the Correlation Analysis and Data Preparation module of the Cleaning Data with Pandas course.

    Encoding categorical features is important because many machine learning algorithms require numerical input. This step will involve encoding the categorical features in our dataset using one-hot encoding.

    In this step, you will practice encoding categorical features in the dataset. The goal is to convert all categorical features in your dataset to numerical form so that they can be used in machine learning algorithms.
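    On a minimal invented frame, one-hot encoding turns a text column into binary indicator columns:

```python
import pandas as pd

# Invented frame with one categorical column
df = pd.DataFrame({'gender': ['male', 'female', 'male'],
                   'score': [70, 85, 60]})

# get_dummies replaces 'gender' with gender_female / gender_male indicators
encoded = pd.get_dummies(df, columns=['gender'], prefix='gender')
print(encoded)
```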

    After completing each task, run the Jupyter Notebook cell with Ctrl/Cmd + Enter to apply your changes.


    Task 4.1: Load the Dataset

    Open the file Step 4 Encoding_Categorical_Features.ipynb, import pandas, and load the student scores dataset, 'Student_Scores.csv', into a pandas DataFrame called df. Print the first 5 rows.

    πŸ” Hint

    To load the CSV data into a pandas DataFrame, use the pd.read_csv() function. The head() method can be used to preview the first few rows of the DataFrame.

    πŸ”‘ Solution
    # Import pandas 
    import pandas as pd	
    
    # Load the data into a pandas DataFrame.
    df = pd.read_csv('Student_Scores.csv')
    print(df.head())
    

    Task 4.2: One-Hot Encoding

    Transform the 'gender' column in the dataset from text to a binary numerical format. Print the new dataframe.

    πŸ” Hint

    Pandas has a built-in function called pd.get_dummies() that can be used for one-hot encoding. Apply it to the 'gender' column with the columns keyword argument, and use the prefix keyword argument to prefix the new column names with the name of the encoded variable.

    πŸ”‘ Solution
    # Perform one-hot encoding on the 'gender' column with pandas.
    df_with_dummies = pd.get_dummies(df, columns=['gender'], prefix='gender')
    
    # Output the modified DataFrame to check the new columns.
    print(df_with_dummies.head())
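    A related option worth knowing, though the lab does not require it: passing drop_first=True keeps k-1 indicator columns for k categories, which avoids redundant columns in some models. A sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['male', 'female', 'male']})

# With drop_first=True, the first category ('female') is implied
# by the remaining column rather than stored explicitly
encoded = pd.get_dummies(df, columns=['gender'], prefix='gender',
                         drop_first=True)
print(encoded.columns.tolist())  # ['gender_male']
```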
    
