
Cleaning Data with Pandas Hands-on Practice

This lab provides an environment to practice cleaning data from various datasets. It covers missing, illogical, and duplicate data, as well as correlation analysis and categorical variable encoding.


Path Info

Level: Beginner
Duration: 38m
Published: Nov 14, 2023


Table of Contents

  1. Challenge

    Exploring and Cleaning the Dataset


    To review the concepts covered in this step, please refer to the Introduction to Data Cleaning with Pandas module of the Cleaning Data with Pandas course.

    Data exploration and cleaning are important because they allow us to understand the structure and content of our data, and ensure that it is in the right format for further analysis. This step will involve importing the dataset, exploring its features, and dealing with missing and duplicate values.

    In this step, you will practice importing a dataset using Pandas, and then explore its features using various Pandas functions such as shape, head, info, and describe. You will also practice dealing with missing and duplicate data. The goal is to familiarize yourself with the dataset and ensure it is clean and ready for further analysis. Tools used in this step include the Pandas library and its various functions for data import, exploration, and cleaning.
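    The workflow above can be sketched on a tiny, made-up frame before you open the lab notebook. The column names below echo the lab's 'employees.csv', but the values are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Invented stand-in for employees.csv (values are hypothetical)
df = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Ana', 'Dee'],
    'Gender': ['F', 'M', 'F', None],
    'Salary': [50000.0, np.nan, 50000.0, 61000.0],
})

print(df.shape)               # (4, 3) -- 4 rows, 3 columns
print(df.isnull().sum())      # one missing Gender, one missing Salary
print(df.duplicated().sum())  # row 2 repeats row 0, so 1 duplicate
```

    The same calls (shape, isnull().sum(), duplicated().sum()) are what the tasks below apply to the real dataset.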

    After completing each task, run the Jupyter Notebook cell with Ctrl/Cmd + Enter to apply your changes.


    Task 1.1: Importing the Dataset

    Open the file Step 1 Exploring_and_Cleaning_the_Dataset.ipynb, import the 'employees.csv' dataset using pandas, and assign it to a variable named df.

    πŸ” Hint

    Use the pd.read_csv() function to import the dataset. The file path is 'employees.csv'.

    πŸ”‘ Solution
    import pandas as pd
    
    df = pd.read_csv('employees.csv')
    

    Task 1.2: Exploring the Dataset

    Explore the dataset using the shape, head, info, and describe functions.

    πŸ” Hint

    Use the shape attribute to get the number of rows and columns, head() function to get the first 5 rows, info() function to get a summary of the dataset, and describe() function to get statistical details of the dataset.

    πŸ”‘ Solution
    print('Shape:', df.shape)
    print('Head:')
    print(df.head())
    print('Info:')
    df.info()  # info() prints its summary directly and returns None
    print('Describe:')
    print(df.describe())
    

    Task 1.3: Checking for Missing Values

    Check for missing values in the dataset.

    πŸ” Hint

    Use the isnull() function followed by the sum() function to get the total number of missing values in each column.

    πŸ”‘ Solution
    print('Missing values:')
    print(df.isnull().sum())
    

    Task 1.4: Dropping Rows with Missing Values

    Drop all the rows with missing values in the 'Gender' column. Print the shape of the original dataframe and the new dataframe.

    πŸ” Hint

    You can use the dropna() method to remove rows with missing values. Use the subset keyword argument to drop only the rows with null values in the 'Gender' column.

    πŸ”‘ Solution
    print(df.shape)

    df.dropna(subset=['Gender'], inplace=True)
    # or
    # df = df.dropna(subset=['Gender'])

    print(df.shape)
    

    Task 1.5: Replacing Missing Values

    Replace all missing values in the 'Salary' column with the mean salary.

    πŸ” Hint

    You can use the fillna() method to replace missing values. fillna() is a Series method, so call it on the column and assign the result back to that column. mean() will return the mean of a Series. Here's an example:

    mean = df[col].mean()
    
    πŸ”‘ Solution
    mean = df['Salary'].mean()
    df['Salary'] = df['Salary'].fillna(mean)
    

    Task 1.6: Checking for Duplicate Values

    Check for duplicate values in the dataset.

    πŸ” Hint

    Use the duplicated() function followed by the sum() function to get the total number of duplicate rows.

    πŸ”‘ Solution
    print('Duplicate values:', df.duplicated().sum())
    

    Task 1.7: Dropping Rows with Duplicate Values

    If there are any duplicate values, drop them, keeping only the first occurrence. Print the shape of the new dataframe.

    πŸ” Hint

    You can use the drop_duplicates() function to remove duplicate rows. The keep='first' keyword argument keeps the first occurrence of any duplicates.

    πŸ”‘ Solution
    df = df.drop_duplicates(keep='first')
    # or
    # df.drop_duplicates(inplace=True, keep='first')
    
    print(df.shape)
    
  2. Challenge

    Handling Missing, Illogical, and Duplicate Data


    To review the concepts covered in this step, please refer to the Introduction to Data Cleaning with Pandas module of the Cleaning Data with Pandas course.

    Handling missing, illogical, and duplicate data is important because these issues can significantly affect the accuracy of our data analysis and machine learning models. This step will involve identifying and dealing with these issues in our dataset.

    In this step, you will practice identifying and dealing with illogical data in the dataset. You will use Pandas methods such as drop and value_counts to handle these issues. The goal is to ensure that your dataset is clean and ready for further analysis.
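    As a sketch of the drop-by-index pattern the tasks use, here it is on a small invented Gender column (the 'unknown' entry stands in for the illogical values in the real data):

```python
import pandas as pd

# Invented data: one row carries an illogical 'unknown' gender
df = pd.DataFrame({'Gender': ['Male', 'Female', 'unknown', 'Male']})

print(df['Gender'].value_counts())  # value_counts() exposes the stray category

# Collect the offending row indices, then drop them
indices = df.index[df['Gender'] == 'unknown']
df = df.drop(indices)

print(df['Gender'].value_counts())  # only 'Male' and 'Female' remain
```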

    After completing each task, run the Jupyter Notebook cell with Ctrl/Cmd + Enter to apply your changes.


    Task 2.1: Explore the Gender Column

    Open the file Step 2 Handling_Missing__Illogical__and_Duplicate_Data.ipynb, load the 'employees.csv' dataset into a dataframe, and identify the unique values in the 'Gender' column.

    πŸ” Hint

    Use the value_counts() method to see how many unique values there are in the 'Gender' column.

    πŸ”‘ Solution
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('employees.csv')
    
    # Print the value_counts() of the gender column
    df.Gender.value_counts()
    

    Task 2.2: Drop Illogical Data

    Handle the unknown Gender values by dropping those rows where gender is 'unknown'. Print the shape of the original dataframe and the new dataframe once the rows are dropped. The code provided will give you the rows to drop.

    πŸ” Hint

    Use df.drop with the indices of the rows to drop. Get the indices using the provided code.

    πŸ”‘ Solution
    # Print the shape of the original df
    print(df.shape)
    
    # Provided code
    indices = df.index[df['Gender'] == 'unknown']

    # Handle 'unknown' Gender
    df = df.drop(indices)
    
    # Print the shape of the new df
    print(df.shape)
    

    Task 2.3: Unify the Gender Column

    Use the provided code to convert the 'Male' and 'Female' values to 'M' and 'F'. Print the value counts after you clean the 'Gender' column.

    πŸ” Hint

    To replace 'Male' and 'Female' with 'M' and 'F', use the provided code, which assumes your data is stored in a dataframe named df.

    πŸ”‘ Solution
    # Provided Code
    df.loc[df.Gender=='Female', 'Gender'] = 'F'
    df.loc[df.Gender=='Male', 'Gender'] = 'M'
    
    df.Gender.value_counts()
    
  3. Challenge

    Conducting Correlation Analysis


    To review the concepts covered in this step, please refer to the Correlation Analysis and Data Preparation module of the Cleaning Data with Pandas course.

    Correlation analysis is important because it helps us understand the relationships between different features in our dataset. This can help us identify redundant features and improve the accuracy of our data analysis and machine learning models. This step will involve conducting correlation analysis on our dataset.

    In this step, you will practice conducting correlation analysis on the dataset. You will use the corr function in Pandas to calculate correlation coefficients, and then use a heatmap to visualize these correlations.
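    On invented numbers, the idea looks like this: two related columns produce a coefficient near 1, while an unrelated column stays near 0 (all values below are hypothetical):

```python
import pandas as pd

# Hypothetical study data: score rises with hours; shoe size is unrelated
df = pd.DataFrame({
    'hours': [1, 2, 3, 4, 5],
    'score': [52, 58, 61, 70, 75],
    'shoe':  [7, 9, 6, 8, 7],
})

correlation = df.corr()
print(correlation)
# hours and score correlate strongly (about 0.99);
# shoe is only weakly correlated with either
```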

    After completing each task, run the Jupyter Notebook cell with Ctrl/Cmd + Enter to apply your changes.


    Task 3.1: Load the Dataset

    Open the Step 3 Conducting_Correlation_Analysis.ipynb and load the dataset 'Student_Scores.csv' into a pandas DataFrame. Print the first 5 rows.

    πŸ” Hint

    Use the pd.read_csv() function to read the csv file. For example, df = pd.read_csv('file_name.csv'). Print the first 5 rows with the .head() method.

    πŸ”‘ Solution
    import pandas as pd
    
    df = pd.read_csv('Student_Scores.csv')
    print(df.head())
    

    Task 3.2: Calculate Correlation Coefficients

    Calculate the correlation coefficients between the different numeric features in the dataset using the corr function in pandas. Save the result to a variable named correlation. Print correlation.

    πŸ” Hint

    Use the corr() method on the DataFrame to calculate the correlation coefficients. Pass numeric_only=True so that non-numeric columns are omitted; in recent pandas versions, corr() raises an error on non-numeric columns otherwise.

    πŸ”‘ Solution
    correlation = df.corr(numeric_only=True)
    print(correlation)
    

    Task 3.3: Visualize Correlations with a Heatmap

    Use the provided code to visualize the correlation matrix returned from the corr() method used in Task 3.2.

    πŸ” Hint

    The provided code assumes that the results of the corr() method were saved to a variable named correlation.

    πŸ”‘ Solution
    # Provided code 
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Create a heatmap using seaborn
    sns.heatmap(correlation, annot=True, cmap='coolwarm')
    plt.title('Correlation Heatmap')
    plt.show()
    
  4. Challenge

    Encoding Categorical Features


    To review the concepts covered in this step, please refer to the Correlation Analysis and Data Preparation module of the Cleaning Data with Pandas course.

    Encoding categorical features is important because many machine learning algorithms require numerical input. This step will involve encoding the categorical features in our dataset using one-hot encoding.

    In this step, you will practice encoding categorical features in the dataset. The goal is to convert all categorical features in your dataset to numerical form so that they can be used in machine learning algorithms.
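    On a minimal invented frame, one-hot encoding turns a text column into binary indicator columns:

```python
import pandas as pd

# Invented frame with one categorical column
df = pd.DataFrame({'gender': ['male', 'female', 'male'],
                   'score': [70, 85, 60]})

# get_dummies replaces 'gender' with gender_female / gender_male indicators
encoded = pd.get_dummies(df, columns=['gender'], prefix='gender')
print(encoded)
```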

    After completing each task, run the Jupyter Notebook cell with Ctrl/Cmd + Enter to apply your changes.


    Task 4.1: Load the Dataset

    Open the file Step 4 Encoding_Categorical_Features.ipynb, import pandas, and load the student scores dataset, 'Student_Scores.csv', into a pandas DataFrame called df. Print the first 5 rows.

    πŸ” Hint

    To load the CSV data into a pandas DataFrame, use the pd.read_csv() function. The head() method can be used to preview the first few rows of the DataFrame.

    πŸ”‘ Solution
    # Import pandas 
    import pandas as pd	
    
    # Load the data into a pandas DataFrame.
    df = pd.read_csv('Student_Scores.csv')
    print(df.head())
    

    Task 4.2: One-Hot Encoding

    Transform the 'gender' column in the dataset from text to a binary numerical format. Print the new dataframe.

    πŸ” Hint

    Pandas has a built-in function called pd.get_dummies() that can be used for one-hot encoding. Apply it to the 'gender' column with the columns keyword argument, and use the prefix keyword argument to prefix the new column names with the name of the encoded variable.

    πŸ”‘ Solution
    # Perform one-hot encoding on the 'gender' column with pandas.
    df_with_dummies = pd.get_dummies(df, columns=['gender'], prefix='gender')
    
    # Output the modified DataFrame to check the new columns.
    print(df_with_dummies.head())
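    A related option worth knowing, though the lab does not require it: passing drop_first=True keeps k-1 indicator columns for k categories, which avoids redundant columns in some models. A sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['male', 'female', 'male']})

# With drop_first=True, the first category ('female') is implied
# by the remaining column rather than stored explicitly
encoded = pd.get_dummies(df, columns=['gender'], prefix='gender',
                         drop_first=True)
print(encoded.columns.tolist())  # ['gender_male']
```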
    
