
Index Objects with Pandas Hands-on Practice

In this lab, you'll master data manipulation and retrieval using DataFrame operations, various indexing methods including datetime and multi-indexing, and advanced categorization techniques.

Level: Beginner
Duration: 34m
Published: Dec 06, 2023

Table of Contents

  1. Challenge

    Exploring DataFrames and Indexing in Pandas

    Jupyter Guide

    To get started, open the file on the right entitled "Step 1...". You'll complete each task for Step 1 in that Jupyter Notebook file. Remember, you must run the cells (ctrl/cmd + Enter) for each task before moving on to the next task in the notebook. Continue until you have completed all tasks in this step. When you are ready for the next step, come back and click on the file for that step, and repeat until you have completed all tasks in all steps of the lab.


    To review the concepts covered in this step, please refer to the Introduction to Indexing Objects in Pandas module of the Index Objects with Pandas course.

    Understanding the structure of DataFrames and the concept of indexing in Pandas is important because it forms the foundation for data extraction, manipulation, and modification. This step will allow you to practice the basics of indexing and explore a dataset using position-based indexing.

    Let's dive into the world of data analysis with Pandas! In this step, you'll get hands-on experience with the basics of DataFrames and indexing in Pandas. You'll be using the Learning_Management.csv dataset to practice extracting data using numerical indexing, and selecting subsets of data using both row and column labels. The goal here is to familiarize yourself with the structure of DataFrames and understand the importance of indexing in data extraction.


    Task 1.1: Importing the Pandas Library

    Before you can start working with DataFrames, you need to import the pandas library. Import the pandas library as pd.

    🔍 Hint

    Use the import keyword followed by the library name and the as keyword to give it a short alias. For example, import pandas as pd.

    🔑 Solution
    import pandas as pd
    

    Task 1.2: Loading the Dataset

    Load the Learning_Management.csv file into a DataFrame using pandas. Name the DataFrame df.

    🔍 Hint

    Use the pd.read_csv() function to read the CSV file. Pass the file path as a string to the function. For example, df = pd.read_csv('file_path').

    🔑 Solution
    df = pd.read_csv('Learning_Management.csv')
    

    Task 1.3: Inspecting the DataFrame

    Inspect the first 5 rows of the DataFrame using the head() function.

    🔍 Hint

    Use the head() method on the DataFrame to view the first 5 rows. For example, df.head().

    🔑 Solution
    df.head()
    

    Task 1.4: Selecting a Single Column

    Select the employee_name column from the DataFrame.

    🔍 Hint

    Use the column label as an index to select a single column. For example, df['column_name'].

    🔑 Solution
    df['employee_name']
    

    Task 1.5: Selecting Multiple Columns

    Select the employee_name and course_name columns from the DataFrame.

    🔍 Hint

    Use a list of column labels as an index to select multiple columns. For example, df[['column1', 'column2']].

    🔑 Solution
    df[['employee_name', 'course_name']]
    

    Task 1.6: Selecting Rows Using Index

    Select the first 10 rows of the DataFrame using numerical indexing.

    🔍 Hint

    Use the iloc indexer with a slice to select rows. For example, df.iloc[start:end]. If you want to start at the beginning, leave start empty and include only end. For example, df.iloc[:end].

    🔑 Solution
    df.iloc[:10]
    

    Task 1.7: Selecting Subsets of Data

    Select the employee_name and course_name columns for the first 10 rows of the DataFrame.

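    🔍 Hint

    Use the iloc indexer with a slice for the rows, then select the columns with a list of column labels. Because iloc accepts only integer positions, the column labels are applied in a second indexing step. For example, df.iloc[:10][column_list].

    🔑 Solution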
    df.iloc[:10][['employee_name', 'course_name']]
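
The chained selection above can also be written as a single iloc call. The sketch below is illustrative only: it uses a small made-up DataFrame in place of Learning_Management.csv (the score column and every value are assumptions), and it checks that both forms return the same subset.

```python
import pandas as pd

# Hypothetical stand-in for Learning_Management.csv; only the first two
# column names come from the lab, the rest is invented for illustration.
df = pd.DataFrame({
    "employee_name": ["Ana", "Ben", "Cho", "Dev"],
    "course_name": ["Python", "SQL", "Excel", "Git"],
    "score": [88, 92, 75, 81],
})

# Chained form, as in the solution: rows by position, then columns by label.
chained = df.iloc[:2][["employee_name", "course_name"]]

# Single-step form: translate the labels to integer positions, then use iloc once.
col_positions = df.columns.get_indexer(["employee_name", "course_name"])
single = df.iloc[:2, col_positions]

print(single)
print(chained.equals(single))
```

The single-step form avoids creating an intermediate DataFrame, which matters mostly when you later want to assign values to the selection.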
    
  2. Challenge

    Working with Time Series Data in Pandas

    To review the concepts covered in this step, please refer to the Pandas Index Objects for Time Series Data module of the Index Objects with Pandas course.

    Understanding how to use datetime and timedelta indexing in Pandas is important because it allows for efficient handling and manipulation of time-series data. This step will provide you with the opportunity to practice these concepts using a real-world dataset.

    Time to tackle time-series data! In this step, you'll explore how to use datetime and timedelta indexing in Pandas to manipulate and extract data. Using the completion_date column from the Learning_Management.csv dataset, you'll practice creating a datetime index, extracting data for specific time periods, and performing basic operations on pandas built-in datetime objects. The goal is to get comfortable with handling time-series data in Pandas.


    Task 2.1: Load the Dataset

    Start by loading the Learning_Management.csv dataset into a pandas DataFrame. Name the DataFrame df.

    After loading the data, display the head of the DataFrame to view the first few rows.

    🔍 Hint

    Use the pd.read_csv() function to load the dataset. The file path is 'Learning_Management.csv'.

    🔑 Solution
    import pandas as pd
    
    df = pd.read_csv('Learning_Management.csv')
    df.head()
    

    Task 2.2: Convert the 'completion_date' Column to Datetime

    Convert the 'completion_date' column in the DataFrame to the datetime data type. This will allow you to perform time-series operations on the data.

    After converting, print the data type of the column again to see the change.

    🔍 Hint

    Use the pd.to_datetime() function to convert the 'completion_date' column to datetime. Make sure to assign the result back to the 'completion_date' column in the DataFrame.

    🔑 Solution
    # Provided code
    print(df['completion_date'].dtype)
    
    # Convert the 'completion_date' column to datetime and print column dtype
    df['completion_date'] = pd.to_datetime(df['completion_date'])
    print(df['completion_date'].dtype)
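
To see the dtype change without the lab file, here is a small sketch on a hypothetical three-value Series (the dates are invented for illustration). It also shows that the .dt accessor becomes usable once the values are real datetimes.

```python
import pandas as pd

# Hypothetical date strings standing in for the completion_date column.
dates = pd.Series(["2022-05-03", "2022-05-17", "2022-06-01"], name="completion_date")
print(dates.dtype)      # a string-like dtype before conversion

converted = pd.to_datetime(dates)
print(converted.dtype)  # a datetime64 dtype after conversion

# Datetime-specific attributes are available through the .dt accessor.
months = converted.dt.month.tolist()
print(months)
```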
    

    Task 2.3: Set the 'completion_date' Column as the DataFrame Index

    Set the 'completion_date' column as the index of the DataFrame. This will allow you to use datetime indexing to select data based on the completion date.

    After setting the index, display the head of the DataFrame to see the changes.

    🔍 Hint

    Use the df.set_index() method to set the 'completion_date' column as the index. Make sure to assign the result back to df.

    🔑 Solution
    df = df.set_index('completion_date')
    df.head()
    

    Task 2.4: Select Data for a Specific Time Period

    Select all rows in the DataFrame where the completion date is in May 2022.

    After selecting, display the selected data to verify the result.

    🔍 Hint

    Use the df.loc[] indexer to select data for May 2022. A partial date string in the form 'YYYY-MM' selects every row that falls in that month.

    🔑 Solution
    may_2022_data = df.loc['2022-05']
    may_2022_data
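
Selecting a month with a partial date string can be tried on a tiny example. This is a sketch under stated assumptions: the four rows below are made up and only the column names mirror the lab's dataset. It also shows that a slice of partial strings includes the whole end month.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the lab.
df = pd.DataFrame({
    "employee_name": ["Ana", "Ben", "Cho", "Dev"],
    "completion_date": ["2022-04-28", "2022-05-03", "2022-05-17", "2022-06-01"],
})
df["completion_date"] = pd.to_datetime(df["completion_date"])
df = df.set_index("completion_date").sort_index()

# A 'YYYY-MM' string matches every timestamp in that month.
may_2022 = df.loc["2022-05"]

# A slice of partial strings covers both endpoints in full.
may_to_june = df.loc["2022-05":"2022-06"]

print(may_2022)
print(may_to_june)
```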
    

    Task 2.5: Perform a Basic Operation on a Datetime Object

    Calculate the number of days between the earliest and latest completion dates in the DataFrame.

    Display the result to see the number of days.

    🔍 Hint

    Use the df.index.min() and df.index.max() methods to get the earliest and latest completion dates, respectively. Subtract the earliest date from the latest date to get the span between them. The result is a Timedelta; its days attribute holds the whole number of days.

    🔑 Solution
    num_days = df.index.max() - df.index.min()
    num_days
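
Subtracting two timestamps yields a Timedelta rather than a plain integer. The sketch below uses three invented dates (not taken from the lab's dataset) to show how to pull the day count out of that Timedelta.

```python
import pandas as pd

# Hypothetical completion dates used only for illustration.
index = pd.to_datetime(["2022-01-10", "2022-03-01", "2022-12-25"])

span = index.max() - index.min()  # a pandas Timedelta
num_days = span.days              # the whole-day component as an int

print(span)
print(num_days)
```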
    
  3. Challenge

    Interval, Categorical, and Period Indexing in Pandas

    To review the concepts covered in this step, please refer to the Interval, Categorical, and Period Indexing in Pandas module of the Index Objects with Pandas course.

    Understanding how to create and use interval, categorical, and period indices in Pandas is important because these indexing techniques enable advanced data extraction from a DataFrame. This step will allow you to practice creating these indices and using them to extract data from a DataFrame.

    Ready to level up your indexing skills? In this step, you'll delve into interval, categorical, and period indexing in Pandas. You'll create each kind of index from sample data and then use it to extract rows from a DataFrame.


    Task 3.1: Create an Interval Index

    Create an interval index based on a range of values. Use the pd.cut() function to divide the range 0 to 100 into 5 equal-width intervals. This technique is known as binning or bucketing the data.

    🔍 Hint

    Use the pd.cut() function with a range of values (e.g., range(0, 101)) as the first argument and 5 as the second argument to create the intervals. For a plain sequence like this, the function returns a Categorical whose categories form an IntervalIndex, and it can be passed as the index when creating a DataFrame.

    🔑 Solution
    # Provided Code
    import pandas as pd
    import numpy as np
    
    interval_index = pd.cut(range(0, 101), 5)
    

    Task 3.2: Create a DataFrame Using Interval Index

    Use the interval index created in Task 3.1 to create a DataFrame with random data. The DataFrame should have 101 rows and 2 columns. Use the provided code to create the random data.

    🔍 Hint

    Use pd.DataFrame() with np.random.randn(101, 2) to create random data. Use the interval index created in Task 3.1 as the index of the DataFrame.

    🔑 Solution
    # Provided code
    random_data = np.random.randn(101, 2)
    
    df_interval = pd.DataFrame(random_data, index=interval_index, columns=['A', 'B'])
    

    Task 3.3: Index into the Interval Indexed DataFrame

    Select the rows from the DataFrame created in Task 3.2 where the interval index includes the value 42.

    🔍 Hint

    To index into the DataFrame, use the indexer df_interval.loc[] with the specific value (e.g., 42) you want to find within the intervals.

    🔑 Solution
    df_interval.loc[42]
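
It can help to look at what pd.cut() actually returns and at how a scalar is matched against intervals. In the sketch below, the second half builds an explicit pd.IntervalIndex with hand-picked breaks; those breaks and the label column are assumptions for illustration and are not part of the lab.

```python
import pandas as pd

# pd.cut on a plain sequence returns a Categorical; the bins it created
# are stored as an IntervalIndex in its .categories attribute.
binned = pd.cut(range(0, 101), 5)
print(type(binned).__name__)             # Categorical
print(type(binned.categories).__name__)  # IntervalIndex
print(len(binned.categories))            # 5

# Hypothetical explicit bins: (0, 20], (20, 40], (40, 60], (60, 80], (80, 100].
bins = pd.IntervalIndex.from_breaks([0, 20, 40, 60, 80, 100])
df_bins = pd.DataFrame({"label": ["a", "b", "c", "d", "e"]}, index=bins)

# A scalar passed to .loc selects the row whose interval contains it.
row = df_bins.loc[42]
print(row["label"])
```

With one row per interval, as in df_bins, the lookup returns a single row. In the lab's df_interval every interval label repeats many times, so the same lookup returns all rows that fall in the matching bin.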
    

    Task 3.4: Create a Categorical Index

    Create a categorical index using a list of 4 categories. Categorical is a pandas data type corresponding to categorical variables in statistics.

    🔍 Hint

    Create a list of categories and pass it to pd.Categorical(). This creates an array-like Categorical object representing a categorical variable; pandas turns it into a CategoricalIndex when it is used as the index of a DataFrame.

    🔑 Solution
    categories = ['Category1', 'Category2', 'Category3', 'Category4']
    categorical_index = pd.Categorical(categories)
    

    Task 3.5: Create a DataFrame Using Categorical Index

    Use the categorical index created in Task 3.4 to create a DataFrame with random data. The DataFrame should have 4 rows and 2 columns. Use the provided code to create the random data.

    🔍 Hint

    Use pd.DataFrame() with np.random.randn(4, 2) to create random data. Use the categorical index created in Task 3.4 as the index of the DataFrame.

    🔑 Solution
    # Provided code
    random_data = np.random.randn(4, 2)
    
    df_categorical = pd.DataFrame(random_data, index=categorical_index, columns=['A', 'B'])
    

    Task 3.6: Index into the Categorical Indexed DataFrame

    Select the row from the DataFrame created in Task 3.5 that corresponds to your second category.

    🔍 Hint

    To index into the DataFrame, use the indexer df_categorical.loc[] with the specific category (e.g., 'Category2') you want to access.

    🔑 Solution
    df_categorical.loc['Category2']
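
The following sketch repeats the categorical lookup with fixed values instead of random data, so the result is predictable. The column values are invented for illustration; the category names match the lab's solution.

```python
import pandas as pd

categories = ["Category1", "Category2", "Category3", "Category4"]
categorical = pd.Categorical(categories)

# Fixed, made-up values so the lookup result is easy to verify.
df_cat = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=categorical)

# The Categorical becomes a CategoricalIndex once it is used as an index.
print(type(df_cat.index).__name__)

# Label-based lookup works as with any other index.
value = df_cat.loc["Category2", "A"]
print(value)
```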
    

    Task 3.7: Create a Period Index

    Create a period index representing each month in 2023. Period indices are useful for time series data that needs to be aggregated or indexed by a particular time period.

    🔍 Hint

    Use the pd.period_range() function to create a period index that represents each month in a year. The first argument should be in the format 'YYYY-MM', followed by the keyword arguments periods=12 and freq='M'. This function returns a PeriodIndex which can be used to index data in a DataFrame.

    🔑 Solution
    period_index = pd.period_range('2023-01', periods=12, freq='M')
    

    Task 3.8: Create a DataFrame Using Period Index

    Use the period index created in Task 3.7 to create a DataFrame with random data. The DataFrame should have 12 rows and 2 columns.

    🔍 Hint

    Use pd.DataFrame() with np.random.randn(12, 2) to create random data. Use the period index created in Task 3.7 as the index of the DataFrame.

    🔑 Solution
    # Provided code
    random_data = np.random.randn(12, 2)
    
    df_period = pd.DataFrame(random_data, index=period_index, columns=['A', 'B'])
    

    Task 3.9: Index into the Period Indexed DataFrame

    Select the row from the DataFrame created in Task 3.8 that corresponds to the month '2023-05'.

    🔍 Hint

    To index into the DataFrame, use the indexer df_period.loc[] with the specific period (e.g., '2023-05') you want to access.

    🔑 Solution
    df_period.loc['2023-05']
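
As a quick check of period-based lookups, the sketch below replaces the random data with the month number (an assumption made purely so the output is predictable) and shows both a single-period lookup and a slice across several periods.

```python
import pandas as pd

period_index = pd.period_range("2023-01", periods=12, freq="M")

# Made-up values: column A simply holds the month number 1..12.
df_period = pd.DataFrame({"A": range(1, 13)}, index=period_index)

# A string that matches the index frequency selects exactly one period.
may = df_period.loc["2023-05"]

# Slicing with period strings includes both endpoints.
q2 = df_period.loc["2023-04":"2023-06"]

print(may["A"])
print(q2)
```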
    
  4. Challenge

    Multi-indexing in Pandas

    To review the concepts covered in this step, please refer to the Multi-indexing in Pandas module of the Index Objects with Pandas course.

    Understanding how to create and use a MultiIndex in Pandas is important because it allows for efficient organization and retrieval of hierarchical data. This step will provide you with the opportunity to practice creating a MultiIndex and using it to retrieve data at different hierarchy levels.

    Let's dive into the world of multi-indexing! In this step, you'll learn how to create a MultiIndex for hierarchical data organization in a DataFrame. Using the Learning_Management.csv dataset, you'll practice creating a MultiIndex and using it to retrieve data at different hierarchy levels. The goal is to understand the benefits of using MultiIndexing in pandas for hierarchical data organization and efficient data retrieval.


    Task 4.1: Importing Pandas

    Before we start working with the data, we need to import pandas. In this task, import the pandas library which will be used throughout this step.

    🔍 Hint

    Use the import keyword to import pandas. It's common to import pandas as pd.

    🔑 Solution
    import pandas as pd
    

    Task 4.2: Loading the Dataset

    Now that we have imported pandas, let's load the dataset. The dataset is stored in a CSV file named 'Learning_Management.csv'.

    🔍 Hint

    Use the read_csv function from pandas to load the dataset. The file path is 'Learning_Management.csv'.

    🔑 Solution
    df = pd.read_csv('Learning_Management.csv')
    

    Task 4.3: Creating a MultiIndex

    Now that we have loaded the dataset, let's create a MultiIndex. We will use the 'employee_id' and 'course_id' columns as our index. This will allow us to organize our data hierarchically.

    After creating the MultiIndex, display the head of the DataFrame to visualize the change.

    🔍 Hint

    Use the set_index method on the DataFrame and pass in a list of column names ['employee_id', 'course_id'] to create a MultiIndex. Then use df.head() to display the first few rows of the DataFrame.

    🔑 Solution
    df.set_index(['employee_id', 'course_id'], inplace=True)
    df.head()
    # or
    # df = df.set_index(['employee_id', 'course_id'])
    # df.head()
    

    Task 4.4: Retrieving Data Using MultiIndex

    With the MultiIndex created, your next task is to retrieve data for a specific course. Find all employees who completed the course with course_id 'C002'.

    🔍 Hint

    Use the xs (cross-section) method on the DataFrame to retrieve data for a specific 'course_id'. You'll need to specify the course id (e.g., 'C002') and the level ('course_id') at which to perform the cross-section.

    🔑 Solution
    df.xs('C002', level='course_id')
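
To see retrieval at different hierarchy levels side by side, here is a minimal sketch on a hypothetical four-row table. The index column names match the lab; the ids, the score column, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for Learning_Management.csv.
df = pd.DataFrame({
    "employee_id": ["E001", "E001", "E002", "E003"],
    "course_id": ["C001", "C002", "C002", "C003"],
    "score": [90, 85, 70, 95],
}).set_index(["employee_id", "course_id"])

# Inner level: everyone who took course C002. The course_id level is
# dropped from the result, leaving employee_id as the index.
c002 = df.xs("C002", level="course_id")

# Outer level: all courses for one employee, via plain .loc.
e001 = df.loc["E001"]

# Both levels at once: a tuple of labels identifies a single row.
one_score = df.loc[("E001", "C002"), "score"]

print(c002)
print(e001)
print(one_score)
```

Note that xs is needed mainly for the inner level; the outermost level can be reached directly with .loc.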
    
