
Cleaning Data with Pandas Hands-on Practice
This lab provides an environment to practice cleaning data from various datasets. It covers handling missing, illogical, and duplicate data, as well as correlation analysis and categorical variable encoding.

Challenge: Exploring and Cleaning the Dataset
To review the concepts covered in this step, please refer to the Introduction to Data Cleaning with Pandas module of the Cleaning Data with Pandas course.
Data exploration and cleaning are important because they allow us to understand the structure and content of our data, and ensure that it is in the right format for further analysis. This step will involve importing the dataset, exploring its features, and dealing with missing and duplicate values.
In this step, you will practice importing a dataset using Pandas, and then explore its features using Pandas functions such as `shape`, `head`, `info`, and `describe`. You will also practice dealing with missing and duplicate data. The goal is to familiarize yourself with the dataset and ensure it is clean and ready for further analysis. Tools used in this step include the Pandas library and its functions for data import, exploration, and cleaning.

After the successful completion of each task, execute the Jupyter Notebook cell with the Ctrl/Cmd + Enter key combination to apply any changes.
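Before opening the notebook, here is a minimal sketch of the exploration workflow on a tiny, made-up DataFrame (the columns below are hypothetical stand-ins, not the lab's `employees.csv`):

```python
import pandas as pd

# Tiny, made-up stand-in for the lab's dataset
df = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Ann', None],
    'Salary': [50000.0, None, 50000.0, 61000.0],
})

print(df.shape)               # (rows, columns) -> (4, 2)
print(df.head())              # first 5 rows
df.info()                     # dtypes and non-null counts
print(df.describe())          # summary stats for numeric columns
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # duplicate rows -> 1 here
```

The same calls apply unchanged to the real dataset once it is loaded.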
Task 1.1: Importing the Dataset
Open the file `Step 1 Exploring_and_Cleaning_the_Dataset.ipynb` and import the `'employees.csv'` dataset using pandas, assigning it to a variable named `df`.

Hint
Use the `pd.read_csv()` function to import the dataset. The file path is `'employees.csv'`.

Solution
```python
import pandas as pd

df = pd.read_csv('employees.csv')
```
Task 1.2: Exploring the Dataset
Explore the dataset using the `shape`, `head`, `info`, and `describe` functions.

Hint
Use the `shape` attribute to get the number of rows and columns, the `head()` function to get the first 5 rows, the `info()` function to get a summary of the dataset, and the `describe()` function to get statistical details of the dataset.

Solution
```python
print('Shape:', df.shape)
print('Head:')
print(df.head())
print('Info:')
df.info()  # info() prints its report directly and returns None
print('Describe:')
print(df.describe())
```
Task 1.3: Checking for Missing Values
Check for missing values in the dataset.
Hint
Use the `isnull()` function followed by the `sum()` function to get the total number of missing values in each column.

Solution
```python
print('Missing values:')
print(df.isnull().sum())
```
Task 1.4: Dropping Rows with Missing Values
Drop all the rows with missing values in the `'Gender'` column. Print the shape of the original dataframe and the new dataframe.

Hint
You can use the `dropna()` method to remove rows with missing values. Use the `subset` keyword argument to drop only the rows with null values in the `'Gender'` column.

Solution
```python
print(df.shape)
df.dropna(subset=['Gender'], inplace=True)
# or: df = df.dropna(subset=['Gender'])
print(df.shape)
```
Task 1.5: Replacing Missing Values
Replace all missing values in the `'Salary'` column with the mean salary.

Hint
You can use the `fillna()` method to replace missing values. `fillna()` is a Series method, so to replace the missing values in a specific column, assign the result of `fillna()` back to that column. `mean()` returns the mean of a Series. For example: `mean = df[col].mean()`

Solution
```python
mean = df['Salary'].mean()
df['Salary'] = df['Salary'].fillna(mean)
```
Task 1.6: Checking for Duplicate Values
Check for duplicate values in the dataset.
Hint
Use the `duplicated()` function followed by the `sum()` function to get the total number of duplicate rows.

Solution
```python
print('Duplicate values:', df.duplicated().sum())
```
Task 1.7: Dropping Rows with Duplicate Values
If there are any duplicate values, drop them, keeping only the first occurrence of any duplicates. Print the shape of the new dataframe.
Hint
You can use the `drop_duplicates()` function to remove duplicate rows. The `keep='first'` keyword argument keeps the first occurrence of any duplicates.

Solution
```python
df = df.drop_duplicates(keep='first')
# or: df.drop_duplicates(inplace=True, keep='first')
print(df.shape)
```
Challenge: Handling Missing, Illogical, and Duplicate Data
To review the concepts covered in this step, please refer to the Introduction to Data Cleaning with Pandas module of the Cleaning Data with Pandas course.
Handling missing, illogical, and duplicate data is important because these issues can significantly affect the accuracy of our data analysis and machine learning models. This step will involve identifying and dealing with these issues in our dataset.
In this step, you will practice identifying and dealing with illogical data in the dataset. You will use Pandas methods such as `drop` and `value_counts` to handle these issues. The goal is to ensure that your dataset is clean and ready for further analysis.

After the successful completion of each task, execute the Jupyter Notebook cell with the Ctrl/Cmd + Enter key combination to apply any changes.
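As a warm-up, the drop-and-relabel pattern used in this step can be sketched on made-up data (the labels below mirror the lab's, but the rows are invented):

```python
import pandas as pd

# Hypothetical column with inconsistent gender labels
df = pd.DataFrame({'Gender': ['Male', 'Female', 'unknown', 'Male']})
print(df['Gender'].value_counts())   # inspect every distinct label

# Drop the rows whose label is 'unknown'
indices = df.index[df['Gender'] == 'unknown']
df = df.drop(indices)

# Unify the remaining labels to single letters
df.loc[df.Gender == 'Female', 'Gender'] = 'F'
df.loc[df.Gender == 'Male', 'Gender'] = 'M'
print(df['Gender'].value_counts())   # M: 2, F: 1
```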
Task 2.1: Explore the Gender Column
Open the `Step 2 Handling_Missing__Illogical__and_Duplicate_Data.ipynb` notebook, load the data from the `'employees.csv'` dataset into a dataframe, and identify the unique values in the `'Gender'` column of the dataset.

Hint
Use the `value_counts()` method to see how many unique values there are in the `'Gender'` column.

Solution
```python
import pandas as pd

# Load the dataset
df = pd.read_csv('employees.csv')

# Print the value_counts() of the Gender column
df.Gender.value_counts()
```
Task 2.2: Drop Illogical Data
Handle the unknown Gender values by dropping the rows where gender is `'unknown'`. Print the shape of the original dataframe and the new dataframe once the rows are dropped. The code provided will give you the rows to drop.

Hint
Use `df.drop` with the indices of the rows to drop. Get the indices using the provided code.

Solution
```python
# Print the shape of the original df
print(df.shape)

# Provided code
indices = df.index[df['Gender'] == 'unknown']

# Handle 'unknown' Gender
df = df.drop(indices)

# Print the shape of the new df
print(df.shape)
```
Task 2.3: Unify the Gender Column
Use the provided code to convert the `'Male'` and `'Female'` values to `'M'` and `'F'`. Print the value counts after you clean the `'Gender'` column.

Hint
To replace `'Male'` and `'Female'` with `'M'` and `'F'`, use the provided code, which assumes your data is stored in a dataframe named `df`.

Solution
```python
# Provided code
df.loc[df.Gender == 'Female', 'Gender'] = 'F'
df.loc[df.Gender == 'Male', 'Gender'] = 'M'
df.Gender.value_counts()
```
Challenge: Conducting Correlation Analysis
To review the concepts covered in this step, please refer to the Correlation Analysis and Data Preparation module of the Cleaning Data with Pandas course.
Correlation analysis is important because it helps us understand the relationships between different features in our dataset. This can help us identify redundant features and improve the accuracy of our data analysis and machine learning models. This step will involve conducting correlation analysis on our dataset.
In this step, you will practice conducting correlation analysis on the dataset. You will use the `corr` function in Pandas to calculate correlation coefficients, and then use a heatmap to visualize these correlations.

After the successful completion of each task, execute the Jupyter Notebook cell with the Ctrl/Cmd + Enter key combination to apply any changes.
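The idea can be previewed on invented numbers: two columns that rise together should produce a coefficient near +1 (the column names here are hypothetical, not those of `Student_Scores.csv`):

```python
import pandas as pd

# Made-up data where 'hours' and 'score' rise together
df = pd.DataFrame({
    'hours': [1, 2, 3, 4, 5],
    'score': [52, 58, 61, 70, 75],
})

correlation = df.corr()
print(correlation)  # off-diagonal entries close to +1
```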
Task 3.1: Load the Dataset
Open the `Step 3 Conducting_Correlation_Analysis.ipynb` notebook and load the dataset `'Student_Scores.csv'` into a pandas DataFrame. Print the first 5 rows.

Hint
Use the `pd.read_csv()` function to read the csv file. For example, `df = pd.read_csv('file_name.csv')`. Print the first 5 rows with the `.head()` method.

Solution
```python
import pandas as pd

df = pd.read_csv('Student_Scores.csv')
print(df.head())
```
Task 3.2: Calculate Correlation Coefficients
Calculate the correlation coefficients between the different numeric features in the dataset using the `corr` function in pandas. Save the result to a variable named `correlation`. Print `correlation`.

Hint
Use the `corr()` method on the DataFrame to calculate the correlation coefficients. Note that non-numeric columns will be omitted (on newer pandas versions you may need to pass `numeric_only=True` to skip them).

Solution
```python
correlation = df.corr()
print(correlation)
```
Task 3.3: Visualize Correlations with a Heatmap
Use the provided code to visualize the correlation matrix returned from the `corr()` method used in Task 3.2.

Hint
The provided code assumes that the results of the `corr()` method were saved to a variable named `correlation`.

Solution
```python
# Provided code
import seaborn as sns
import matplotlib.pyplot as plt

# Create a heatmap using seaborn
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
```
Challenge: Encoding Categorical Features
To review the concepts covered in this step, please refer to the Correlation Analysis and Data Preparation module of the Cleaning Data with Pandas course.
Encoding categorical features is important because many machine learning algorithms require numerical input. This step will involve encoding the categorical features in our dataset using one-hot encoding.
In this step, you will practice encoding categorical features in the dataset. The goal is to convert all categorical features in your dataset to numerical form so that they can be used in machine learning algorithms.
After the successful completion of each task, execute the Jupyter Notebook cell with the Ctrl/Cmd + Enter key combination to apply any changes.
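The transformation can be sketched on a made-up frame first (the rows below are invented; only the `'gender'` column name matches the lab's dataset):

```python
import pandas as pd

# Hypothetical frame with one categorical column
df = pd.DataFrame({'gender': ['M', 'F', 'M'], 'score': [70, 82, 65]})

# One-hot encode: 'gender' is replaced by binary gender_F / gender_M columns
encoded = pd.get_dummies(df, columns=['gender'], prefix='gender')
print(encoded.head())
```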
Task 4.1: Load the Dataset
Open the file
Step 4 Encoding_Categorical_Features.ipynb
, import pandas, and load the student scores dataset,'Student_Scores.csv'
, into a pandas DataFrame calleddf
. Print the first 5 rows.π Hint
To load the CSV data into a pandas DataFrame, use the `pd.read_csv()` function. The `head()` method can be used to preview the first few rows of the DataFrame.
Solution
```python
# Import pandas
import pandas as pd

# Load the data into a pandas DataFrame
df = pd.read_csv('Student_Scores.csv')
print(df.head())
```
Task 4.2: One-Hot Encoding
Transform the 'gender' column in the dataset from text to a binary numerical format. Print the new dataframe.
Hint
Pandas has a built-in function called `pd.get_dummies()` that can be used for one-hot encoding. Apply it to the `'gender'` column with the `columns` keyword argument, and prefix the new columns to indicate the encoded variable with the `prefix` keyword argument.

Solution
```python
# Perform one-hot encoding on the 'gender' column with pandas
df_with_dummies = pd.get_dummies(df, columns=['gender'], prefix='gender')

# Output the modified DataFrame to check the new columns
print(df_with_dummies.head())
```