
Manage Invalid, Duplicate, and Missing Data in Python Hands-on Practice

In this lab, you will tackle key data cleaning challenges using Python and pandas. You'll identify and handle missing values, remove duplicates, and correct invalid data. Along the way, you will use functions like isna(), dropna(), fillna(), and drop_duplicates() to ensure data accuracy, a vital skill for effective data analysis.


Path Info

Level: Beginner
Duration: 29m
Published: Dec 13, 2023


Table of Contents

  1. Challenge

    Identifying and Handling Missing Values

    Jupyter Guide

    To get started, open the file on the right entitled "Step 1...". You'll complete each task for Step 1 in that Jupyter Notebook file. Remember, you must run the cells (Ctrl/Cmd + Enter) for each task before moving on to the next task. Continue until you have completed all tasks in this step. When you are ready for the next step, come back and click on the file for that step, and repeat until you have completed all tasks in every step of the lab.


    Identifying and Handling Missing Values

    To review the concepts covered in this step, please refer to the Handling Missing Values in Your Data module of the Manage Invalid, Duplicate, and Missing Data in Python course.

    Let's get our hands dirty with some real data cleaning! In this step, we will be practicing how to identify and handle missing values in our dataset. Our goal is to ensure that our dataset is free from any NaN values that could potentially skew our analysis. We will be using the isna(), dropna(), and fillna() functions from the pandas library to achieve this. Remember, clean data is happy data!
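    Before working on the lab's dataset, here is a minimal sketch of the three functions in action on a tiny toy DataFrame (not the lab's data), just to show what each call returns:

    ```python
    import pandas as pd
    import numpy as np

    # Toy DataFrame with one deliberate NaN in the 'price' column
    df = pd.DataFrame({"item": ["a", "b", "c"], "price": [1.0, np.nan, 3.0]})

    print(df.isna().sum())          # NaN count per column: item 0, price 1
    print(df.dropna().shape)        # (2, 2) after dropping the row with NaN
    print(df["price"].fillna(0.0))  # the NaN replaced with 0.0
    ```

    Note that dropna() and fillna() return new objects by default rather than modifying df in place.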


    Task 1.1: Importing Libraries and Loading the Dataset

    First up, let's get our environment ready. Import pandas and load 'Modified_Transactions.csv' into a DataFrame. Take a moment to display the first few rows with df.head(). This will give you a glimpse of the dataset, especially the missing values we've introduced for this exercise.

    πŸ” Hint

    Start by importing pandas with import pandas as pd. Then load the dataset using pd.read_csv(filepath) and showcase the first few rows with df.head().

    πŸ”‘ Solution
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('Modified_Transactions.csv')
    df.head()
    

    Task 1.2: Identifying Missing Values

    Now, let's find out where these missing values are. By calculating missing values per column using isna(), you can gauge the extent of missing data. Display the number of NaN values in each column.

    πŸ” Hint

    Use isna() combined with sum(axis=0) on df to count missing values for each column. The axis=0 argument sums down the first (0th) axis, i.e., along each column, giving the number of NaN values per column.

    πŸ”‘ Solution
    # Identify missing values per col
    df.isna().sum(axis=0)
    

    Task 1.3: Removing Rows with Missing Values

    Let's practice cleaning the data by removing rows with any missing values. Use df.dropna() and assign the result to a new variable. Compare the original and the new DataFrame sizes.

    πŸ” Hint

    Apply dropna() to df to remove rows with missing values. Store the cleaned data in df_dropped. Compare the shapes of both DataFrames to see how many rows were affected.

    πŸ”‘ Solution
    # Remove rows with missing values
    df_dropped = df.dropna()
    print("Original shape:", df.shape)
    print("New shape:", df_dropped.shape)
    

    Task 1.4: Filling Missing Values

    Finally, let's fill in some missing values instead of removing them. Use df.fillna() to fill the missing values in the price column. Fill the missing values with the mean of the price column. Use the Series.mean() method to compute the mean.

    Make a copy of the original DataFrame using the provided code. Make your edits to this copy. Display the first few rows to verify the results.

    πŸ” Hint

    First, calculate the mean of the price column with df['price'].mean(). Then use fillna() on the price column, passing in the mean value as the argument.

    πŸ”‘ Solution
    # Provided Code
    df_filled = df.copy()
    
    # Calculate the mean of the price column
    price_mean = df['price'].mean()
    
    # Fill missing values in the price column with the mean
    df_filled['price'] = df['price'].fillna(price_mean)
    
    # Display the first few rows to verify the changes
    df_filled.head()
    
  2. Challenge

    Identifying and Removing Duplicate Rows

    Identifying and Removing Duplicate Rows

    To review the concepts covered in this step, please refer to the Handling Duplicate Values in Your Data module of the Manage Invalid, Duplicate, and Missing Data in Python course.

    Now that we've dealt with missing values, let's move on to another common data cleaning task: dealing with duplicate rows. In this step, we will be practicing how to identify and remove duplicate rows from our dataset using the duplicated() and drop_duplicates() functions in pandas. By the end of this step, our dataset should be free from any redundant information.
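    As a quick preview on toy data (not the lab's dataset), here is how the two functions behave on a DataFrame with one exact duplicate row:

    ```python
    import pandas as pd

    # Toy DataFrame where the third row exactly repeats the second
    df = pd.DataFrame({"id": [1, 2, 2], "amount": [10, 20, 20]})

    print(df.duplicated())       # False, False, True: the repeat is flagged
    print(df.drop_duplicates())  # two rows remain after deduplication
    ```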


    Task 2.1: Importing Necessary Libraries

    Before we start working with our data, we need to import the necessary libraries. In this task, you will import the pandas library which we will use for data manipulation and analysis.

    πŸ” Hint

    Use the import keyword to import the pandas library. It is common practice to import pandas as pd.

    πŸ”‘ Solution
    import pandas as pd
    

    Task 2.2: Loading the Dataset

    Now that we have imported the necessary libraries, the next step is to load our dataset. In this task, you will load the dataset into a pandas DataFrame. The data is in Duplicated_Transactions.csv.

    Once the data is loaded, save it as df and display the first few rows. Notice that there should be some duplicated rows within the first few rows.

    πŸ” Hint

    Use the pd.read_csv() function to read the CSV file. The file path is 'Duplicated_Transactions.csv'. Assign the result to a variable named df. Display the first few rows with the DataFrame.head() method.

    πŸ”‘ Solution
    df = pd.read_csv('Duplicated_Transactions.csv')
    df.head()
    

    Task 2.3: Identifying Duplicate Rows

    With our data loaded, we can now start identifying duplicate rows. In this task, you will use the DataFrame.duplicated() method to identify duplicate rows in our DataFrame. Add the results to a new column, Duplicate.

    Display the head of df after adding the column. Notice that the first occurrence of the duplicate set is not marked as a duplicate.

    πŸ” Hint

    Use the duplicated() function on the DataFrame df. This will return a Series of booleans where True indicates a duplicate row. Assign the result to a new column, df['Duplicate'].

    πŸ”‘ Solution
    duplicates = df.duplicated()
    df['Duplicate'] = duplicates
    df.head()
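    Why isn't the first occurrence marked? duplicated() defaults to keep='first', which leaves the first row of each duplicate set unflagged. A small sketch on toy data (not the lab's dataset) contrasting this with keep=False, which flags every member of a duplicate set:

    ```python
    import pandas as pd

    # Toy DataFrame where the first two rows are identical
    df = pd.DataFrame({"id": [1, 1, 2], "amount": [10, 10, 30]})

    print(df.duplicated().tolist())            # [False, True, False]
    print(df.duplicated(keep=False).tolist())  # [True, True, False]
    ```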
    

    Task 2.4: Counting Duplicate Rows

    After identifying the duplicate rows, it would be useful to know how many there are. In this task, you will print the number of duplicate rows in our DataFrame by summing the Duplicate column we created in Task 2.3.

    πŸ” Hint

    Use the sum() function on the Duplicate column to count the number of True values (i.e., the number of duplicate rows).

    πŸ”‘ Solution
    num_duplicates = duplicates.sum()
    print(num_duplicates)
    

    Task 2.5: Removing Duplicate Rows

    Now that we know how many duplicate rows there are, the next step is to remove them. In this task, first remove the Duplicate column using the provided code, then use the drop_duplicates() function to remove duplicate rows from our DataFrame. Assign the result to a new DataFrame.

    Display the shape of the original df and the new DataFrame.

    πŸ” Hint

    Use the drop_duplicates() function on the DataFrame df. This will return a new DataFrame with the duplicate rows removed. Use the DataFrame.shape attribute to access the shape of the DataFrame.

    πŸ”‘ Solution
    # Provided code to remove the 'Duplicate' column
    df.drop('Duplicate', axis=1, inplace=True)
    
    clean_df = df.drop_duplicates()
    print("Original:", df.shape)
    print("Deduplicated:", clean_df.shape)
    
  3. Challenge

    Identifying and Handling Invalid Values

    Identifying and Handling Invalid Values

    To review the concepts covered in this step, please refer to the Handling Invalid Values in Your Data module of the Manage Invalid, Duplicate, and Missing Data in Python course.

    We're almost there! The final step in our data cleaning journey involves dealing with invalid values. These could be values that are out of range, of the wrong data type, or simply erroneous. In this step, we will be practicing how to identify and handle such values using pandas. By the end of this step, our dataset should be clean, accurate, and ready for analysis!
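    The core pattern for this step is: build a boolean mask describing the invalid rows, then use it to inspect or overwrite those values. A minimal sketch on toy data (not the lab's dataset), assuming negative prices are invalid:

    ```python
    import pandas as pd
    import numpy as np

    # Toy DataFrame with one out-of-range (negative) price
    df = pd.DataFrame({"price": [4.99, -1.0, 12.50]})

    invalid = df["price"] < 0          # boolean mask of invalid rows
    df.loc[invalid, "price"] = np.nan  # replace invalid values with NaN
    print(df["price"].isna().sum())    # 1 value was replaced
    ```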


    Task 3.1: Load the Dataset

    Start by importing pandas and loading the dataset into a pandas DataFrame. The file path is 'Modified_Transactions.csv'. Display the first few rows of the data and

    πŸ” Hint

    Use the pd.read_csv() function to load the dataset. The file path should be passed as a string argument to this function.

    πŸ”‘ Solution
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('Modified_Transactions.csv')
    df.head()
    

    Task 3.2: Identify Invalid Values

    Identify any invalid values in the 'price' column. For this dataset, a valid price is any number greater than 0. Print the rows with the invalid values.

    πŸ” Hint

    Create a boolean mask using the less than (<) operator against the column's values to flag prices below 0. Index the DataFrame's rows using this mask to produce a subset of the original data. Display the subsetted data.

    πŸ”‘ Solution
    # Identify invalid values
    invalid_values = df['price'] < 0
    df[invalid_values]
    

    Task 3.3: Handle Invalid Values

    Begin by importing the numpy library, which will be used to represent NaN values. Then, proceed to replace any invalid values in the 'price' column with NaN. After making these replacements, confirm that the changes are correct by displaying the rows that initially contained invalid values.

    πŸ” Hint

    Utilize np.nan from the numpy library for NaN values. Apply the boolean mask, created in Task 3.2, to accurately identify and replace the invalid values in the DataFrame. For example, if your boolean mask is named invalid_values, index the DataFrame with df.loc[invalid_values, 'price'] to specifically target the 'price' column in the rows flagged by the mask.

    πŸ”‘ Solution
    import numpy as np
    
    # Replace invalid values with NaN
    df.loc[invalid_values, 'price'] = np.nan
    
    # Verify the changes by displaying the rows that originally had invalid values
    df[invalid_values]
    

    Task 3.4: Explore for More Invalid Data

    πŸŽ‰ Congratulations on reaching this point! There is additional invalid data in the Modified_Transactions.csv dataset. If you want more practice, search for and replace any other invalid data.
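    One way to start the search is to scan every numeric column for suspicious values. The helper below is only a hypothetical starting point (the column names are illustrative, not from the lab's dataset); inspect df.dtypes and df.describe() to decide what actually counts as "invalid" for each column:

    ```python
    import pandas as pd

    def count_negatives(df: pd.DataFrame) -> pd.Series:
        """Count negative values in every numeric column."""
        numeric = df.select_dtypes(include="number")
        return (numeric < 0).sum()

    # Illustrative toy data; in the lab you would pass your loaded df
    df = pd.DataFrame({"price": [5.0, -2.0], "quantity": [3, -1]})
    print(count_negatives(df))  # one negative value in each column
    ```

    Negative values are only one kind of invalid data; you might also check for impossible dates, zero quantities, or out-of-range categories, depending on what the columns represent.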
