
Up and Running with Pandas Hands-on Practice

In this lab, you'll dive into using pandas for data manipulation and analysis. The lab starts with creating and manipulating DataFrames, where you'll learn to create DataFrames from dictionaries, combine them, reset indexes, and fill missing values. Next, you'll work with JSON and CSV files, focusing on loading data from these formats into DataFrames, examining DataFrame metadata, and converting DataFrames back to JSON and CSV. Finally, the lab explores data evaluation, including displaying data, getting statistical properties and correlation scores, filtering data based on conditions, and determining the size of the DataFrame.

Path Info

Level: Beginner
Duration: 34m
Published: Dec 06, 2023


Table of Contents

  1. Challenge

    Creating and Manipulating DataFrames

    To review the concepts covered in this step, please refer to the Understanding Data Fundamentals with Pandas module of the Up and Running with Pandas course.

    Understanding DataFrames is important because they are a fundamental part of pandas and are used to store and manipulate tabular data.

    In this step, you will create a DataFrame from scratch and learn how to combine multiple DataFrames together. You will also practice resetting the row index of a DataFrame and filling in missing values with NaN values. Use the pd.DataFrame() function to create a DataFrame and the pd.concat() function to combine DataFrames.

    After completing each task, run the Jupyter Notebook cell with Shift + Enter to apply your changes.


    Task 1.1: Creating a DataFrame

    Import pandas and then create two DataFrames from the provided lists of dictionaries. Name the first DataFrame df1 and the second df2.

    Data for df1:

    [
        {'name': 'Alice', 'age': None, 'city': 'New York'},
        {'name': 'Bob', 'age': 26, 'city': 'Los Angeles'},
        {'name': 'Oliver', 'age': None, 'city': 'Salt Lake'}
    ]
    

    Data for df2:

    [
        {'name': 'Charlie', 'age': 35, 'city': 'Chicago'},
        {'name': 'Diana', 'age': 82, 'city': 'Miami'}
    ]
    

    Use the pd.DataFrame() function to create each DataFrame. Print both DataFrames.

    🔍 Hint

    To create a DataFrame, call the pd.DataFrame() function and pass the list of dictionaries as an argument.


    **Context:** In pandas, a DataFrame is a fundamental structure for storing and manipulating tabular data. It can be created from various data formats. Using lists of dictionaries is a straightforward method, where each dictionary represents a row in the DataFrame, and the keys correspond to column names.
    🔑 Solution
    import pandas as pd
    
    # Create DataFrames
    df1 = pd.DataFrame([
        {'name': 'Alice', 'age': None, 'city': 'New York'},
        {'name': 'Bob', 'age': 26, 'city': 'Los Angeles'},
        {'name': 'Oliver', 'age': None, 'city': 'Salt Lake'}
    ])
    
    df2 = pd.DataFrame([
        {'name': 'Charlie', 'age': 35, 'city': 'Chicago'},
        {'name': 'Diana', 'age': 82, 'city': 'Miami'}
    ])
    
    # Display DataFrames
    df1, df2
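
    As the context above notes, a DataFrame can be built from several formats. As an optional aside (not required for the task), here is a minimal sketch using a dictionary of lists instead, where each key becomes a column and each list holds that column's values:

    # Hypothetical alternative construction: a dictionary of column lists
    df_from_columns = pd.DataFrame({
        'name': ['Alice', 'Bob'],
        'age': [None, 26],
        'city': ['New York', 'Los Angeles']
    })
    df_from_columns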
    

    Task 1.2: Combine DataFrames

    Now, combine df1 and df2 into a single DataFrame named combined_df. Display the result.

    🔍 Hint

    Use the pd.concat() function and pass a list containing df1 and df2.


    **Context:** Concatenating in pandas merges two or more DataFrames either vertically, adding rows, or horizontally, adding columns. The `pd.concat()` function facilitates this operation, aligning data by index labels to ensure consistency. This process is essential for aggregating and comparing data from different sources.
    🔑 Solution
    # Combine DataFrames
    combined_df = pd.concat([df1, df2])
    
    # Display the combined DataFrame
    combined_df
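
    The context above mentions that pd.concat() can combine DataFrames vertically or horizontally. As an optional illustration (not part of the task), the ignore_index and axis parameters control this behavior:

    # Vertical concatenation with a fresh 0..n-1 row index
    pd.concat([df1, df2], ignore_index=True)

    # Horizontal concatenation (axis=1) places the frames side by side,
    # aligning rows by index label; df1 is simply repeated here for illustration
    pd.concat([df1, df1], axis=1)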
    

    Task 1.3: Reset the Index

    Reset the row index of combined_df to ensure it's continuous and sequential.

    🔍 Hint

    Use the reset_index() method with the drop=True parameter to reset the index without keeping the old index. Optionally, use inplace=True to modify the existing DataFrame in place instead of creating a new one.


    **Context:** Resetting the row index of a DataFrame renumbers the rows from zero and can turn the old index into a column. This is often done after data manipulation operations like sorting or filtering, which may leave gaps in the original index sequence. The `reset_index()` function in pandas makes this adjustment, helping to maintain a continuous, sequential index.
    🔑 Solution
    # Reset the index
    combined_df = combined_df.reset_index(drop=True)
    # or
    # combined_df.reset_index(drop=True, inplace=True)

    # Display the new DataFrame
    combined_df
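
    For comparison with the solution above, calling reset_index() without drop=True keeps the old row labels as a new 'index' column rather than discarding them. This sketch is only illustrative:

    # The old index (0, 1, 2, 0, 1) is preserved as an 'index' column
    pd.concat([df1, df2]).reset_index()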
    

    Task 1.4: Fill in Missing Values

    The 'age' column in your concatenated DataFrame has some missing values. Fill the missing values in the 'age' column with the median age.

    🔍 Hint

    First, calculate the median of the 'age' column using the median() method. Then, call fillna() on the 'age' column, passing in the median age, and assign the result back to combined_df['age'].


    **Context:** In datasets, missing numerical values are often filled with a measure of central tendency, like the mean or median. The median is robust to outliers and is a better choice when the data distribution is skewed. Using the median ensures that the filled values are typical for the dataset without being affected by extreme values.
    🔑 Solution
    # Calculate the median of the 'age' column
    median_age = combined_df['age'].median()
    
    # Fill in missing values in the 'age' column with the median age
    combined_df['age'] = combined_df['age'].fillna(median_age)
    
    # Display the DataFrame
    combined_df
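
    To see why the median is preferred when values are skewed, as described in the context above, here is a small illustrative example with made-up ages. One extreme value pulls the mean up sharply while the median barely moves:

    # Illustration only: the outlier (90) distorts the mean but not the median
    ages = pd.Series([25, 26, 27, 28, 90])
    print(ages.mean())    # 39.2 -- skewed upward by the outlier
    print(ages.median())  # 27.0 -- still a typical value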
    
  2. Challenge

    Working with JSON and CSV Files

    To review the concepts covered in this step, please refer to the Programmatically Representing Data with Pandas module of the Up and Running with Pandas course.

    Knowing how to work with JSON and CSV files is important because these are common data formats that you will encounter in data analysis.

    In this step, you will practice converting JSON and CSV files into DataFrames and vice versa. You will also learn how to examine DataFrame metadata and make minimal transformations to the data. Use the pd.read_json() and pd.read_csv() functions to read JSON and CSV files, respectively, and the to_json() and to_csv() methods to convert DataFrames back to these formats.


    Task 2.1: Load the CSV file into a DataFrame

    Import pandas and use the pd.read_csv() function to load the CSV file 'Student_Scores.csv' into a DataFrame. Name the DataFrame student_scores.

    🔍 Hint

    Use the pd.read_csv() function and pass the name of the CSV file as a string.

    🔑 Solution
    import pandas as pd
    
    # Load the CSV file into a DataFrame
    student_scores = pd.read_csv('Student_Scores.csv')
    

    Task 2.2: Examine the DataFrame

    Use the head() function to display the first 5 rows of the DataFrame. Then, display the metadata of the DataFrame using the info() method.

    🔍 Hint

    Call the head() and info() methods on the student_scores DataFrame.


    **Context:** The `head()` function allows analysts and data scientists to swiftly inspect the first few rows of a DataFrame, providing an immediate snapshot of its structure and contents. Typically, the `head()` function displays the initial five rows, but this can be customized as needed. In contrast, the `info()` function furnishes vital metadata about the DataFrame, including data types, non-null counts, and memory utilization.
    🔑 Solution
    # Display the first 5 rows of the DataFrame
    student_scores.head()
    

    # Display the metadata of the DataFrame
    student_scores.info()
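
    As an optional extra, head() accepts a row count, and the related tail() method shows the last rows instead; neither is required for this task:

    # Show the first 10 rows instead of the default 5
    student_scores.head(10)

    # Show the last 5 rows
    student_scores.tail()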
    

    Task 2.3: Convert the DataFrame to a JSON file

    Convert the DataFrame created in Task 2.1, student_scores, to a JSON file.

    🔍 Hint

    Convert the DataFrame to a JSON file using the to_json() method.

    🔑 Solution
    # Convert the DataFrame to a JSON file
    student_scores.to_json('student_scores.json')
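
    If you want to experiment further (this is not required for the task), to_csv() works the same way; index=False is a common option that omits the row index from the output file. The output filename below is just an example:

    # Write the DataFrame back out as CSV without the row index column
    student_scores.to_csv('student_scores_copy.csv', index=False)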
    

    Task 2.4: Load the JSON file into a DataFrame

    Load the JSON file from the previous Task 2.3, student_scores.json, into a DataFrame called student_scores_from_json.

    🔍 Hint

    Use the pd.read_json() function and pass the name of the JSON file as a string.

    🔑 Solution
    # Load the JSON file into a DataFrame
    student_scores_from_json = pd.read_json('student_scores.json')
    

    Task 2.5: Compare the two DataFrames

    Compare the DataFrame created in Task 2.1, student_scores, with the DataFrame created in Task 2.4, student_scores_from_json, and check whether they are identical.

    🔍 Hint

    Call the equals() method on the student_scores DataFrame and pass student_scores_from_json as the argument.


    **Context:** The equals() function in pandas is employed to determine if two DataFrames are identical by comparing their shape and values. It returns a Boolean result (True or False) based on whether the DataFrames match or not. This function is indispensable for quality control and data validation, ensuring data consistency and accuracy when comparing or verifying the equality of two datasets in a concise and straightforward manner.
    🔑 Solution
    # Check if the two DataFrames are identical
    student_scores.equals(student_scores_from_json)
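
    A small illustrative aside on the difference between equals() and the element-wise == operator: equals() returns a single boolean and treats NaN values in the same position as equal, while == does not. The tiny frames below are made up for the example:

    # equals() compares shape and values, counting aligned NaNs as equal
    a = pd.DataFrame({'x': [1.0, None]})
    b = pd.DataFrame({'x': [1.0, None]})
    print(a.equals(b))           # True
    print((a == b).all().all())  # False, because NaN == NaN is False element-wise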
    
  3. Challenge

    Exploring and Evaluating Data

    To review the concepts covered in this step, please refer to the Exploring and Evaluating Data with Pandas module of the Up and Running with Pandas course.

    Exploring and evaluating data is important because it allows you to understand the characteristics and relationships within your data, which is crucial for any data analysis task.

    In this step, you will practice deriving different properties from a DataFrame, filtering data according to column names, and determining the size of a DataFrame. You will also learn how to interpret correlation scores and get statistical properties of numerical columns. Use the describe() and corr() methods to get statistical properties and correlation scores, respectively.


    Task 3.1: Load the Data

    Import pandas and load the 'Student_Scores.csv' data into a pandas DataFrame called 'student_scores'.

    🔍 Hint

    Use the pd.read_csv() function to load the data. The file is 'Student_Scores.csv'.

    🔑 Solution
    import pandas as pd
    
    # Load the data
    student_scores = pd.read_csv('Student_Scores.csv')
    

    Task 3.2: Explore the Data

    Use the head() function to display the first 5 rows of the DataFrame 'student_scores'.

    🔍 Hint

    Use the head() function on the DataFrame student_scores.


    > **Tip:** To display more (or fewer) than the default 5 rows, pass the desired number as an argument, e.g. `df.head(10)`.
    🔑 Solution
    # Display the first 5 rows of the DataFrame
    student_scores.head()
    

    Task 3.3: Get Statistical Properties

    Get statistical properties of the numerical columns in the DataFrame 'student_scores'.

    🔍 Hint

    Use the describe() function on the DataFrame student_scores to view the summary.


    **Context:** The `describe()` function in pandas generates summary statistics for numeric columns in a DataFrame, including measures like mean, standard deviation, and quartiles. It offers a quick overview of the central tendencies and distribution characteristics of the data, aiding in initial data exploration and analysis.
    🔑 Solution
    # Get statistical properties
    student_scores.describe()
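
    Optionally, describe() also accepts parameters such as include and percentiles; for example, include='all' summarizes non-numeric columns as well (counts, unique values, most frequent value). This is not required for the task:

    # Summarize every column, not just the numeric ones
    student_scores.describe(include='all')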
    

    Task 3.4: Get Correlation Scores

    Find the correlation scores between numerical columns in the DataFrame 'student_scores'.

    🔍 Hint

    Use the corr() function on the DataFrame student_scores.


    **Context:** The `corr()` function in pandas calculates the pairwise correlation coefficients between numeric columns in a DataFrame, measuring the strength and direction of linear relationships. This function is valuable for uncovering relationships among variables in data analysis and decision-making processes.
    🔑 Solution
    # Get correlation scores
    student_scores.corr()
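
    One caveat worth noting: on pandas 2.0 and later, corr() raises an error if the DataFrame contains non-numeric columns (for example, a student name column, if the CSV includes one). If you hit that in your environment, restrict the calculation explicitly:

    # Only needed on newer pandas versions when non-numeric columns are present
    student_scores.corr(numeric_only=True)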
    

    Task 3.5: Filter Data

    Filter the DataFrame to only include rows where the data in the 'math_score' column is greater than 90 and save the data into a new DataFrame called 'high_math_scores'. Print the number of students with a math score greater than 90.

    🔍 Hint

    To create 'high_math_scores', find rows where 'math_score' is greater than 90 in your DataFrame. Use a condition like ['math_score'] > 90 to create a filter, and then apply this filter to your DataFrame to extract the desired rows into 'high_math_scores'. To get the number of students in the high_math_scores DataFrame, print its length with the len() function.

    🔑 Solution
    # Filter data
    high_math_scores = student_scores[student_scores['math_score'] > 90]
    print(len(high_math_scores), "students have score > 90")
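
    Boolean filters like the one above can also be combined. As an optional sketch (not required for the task), the & operator intersects two conditions; each condition must be wrapped in parentheses:

    # Students whose math score is above 80 but not above 90
    mid_math_scores = student_scores[(student_scores['math_score'] > 80) & (student_scores['math_score'] <= 90)]
    print(len(mid_math_scores), "students have 80 < score <= 90")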
    

    Task 3.6: Determine DataFrame Size

    Determine the size of the student scores DataFrame using the shape attribute.

    🔍 Hint

    Use the shape attribute on the DataFrame student_scores.


    **Context:** The `shape` attribute in pandas returns a tuple representing the dimensions of a DataFrame. It provides two values: the number of rows and the number of columns in the DataFrame. This attribute is a quick and convenient way to ascertain the size and structure of your dataset, helping you understand its extent and organization at a glance.
    🔑 Solution
    # Determine DataFrame size
    student_scores.shape
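
    Since shape is a (rows, columns) tuple, it can be unpacked directly; this optional snippet prints the two dimensions separately:

    # Unpack the dimensions into separate variables
    n_rows, n_cols = student_scores.shape
    print(f"{n_rows} rows x {n_cols} columns")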
    

What's a lab?

Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.

Provided environment for hands-on practice

We will provide the credentials and environment necessary for you to practice right within your browser.

Guided walkthrough

Follow along with the author’s guided walkthrough and build something new in your provided environment!

Did you know?

On average, you retain 75% more of your learning if you get time for practice.