
Up and Running with Pandas Hands-on Practice
In this lab, you'll dive into using pandas for data manipulation and analysis. The lab starts with creating and manipulating DataFrames, where you'll learn to create DataFrames from dictionaries, combine them, reset indexes, and fill missing values. Next, you'll work with JSON and CSV files, focusing on loading data from these formats into DataFrames, examining DataFrame metadata, and converting DataFrames back to JSON and CSV. Finally, the lab explores data evaluation, including displaying data, getting statistical properties and correlation scores, filtering data based on conditions, and determining the size of the DataFrame.

Creating and Manipulating DataFrames
To review the concepts covered in this step, please refer to the Understanding Data Fundamentals with Pandas module of the Up and Running with Pandas course.
Understanding DataFrames is important because they are a fundamental part of pandas and are used to store and manipulate tabular data.
In this step, you will create a DataFrame from scratch and learn how to combine multiple DataFrames. You will also practice resetting the row index of a DataFrame and filling in missing (NaN) values. Use the `pd.DataFrame()` function to create a DataFrame and the `pd.concat()` function to combine DataFrames.
After completing each task, run the Jupyter Notebook cell with the `Shift + Enter` key combination to apply your changes.
Task 1.1: Creating a DataFrame
Import pandas, then create two DataFrames from the provided lists of dictionaries. Name the first DataFrame `df1` and the second `df2`.
Data for `df1`:

    [
        {'name': 'Alice', 'age': None, 'city': 'New York'},
        {'name': 'Bob', 'age': 26, 'city': 'Los Angeles'},
        {'name': 'Oliver', 'age': None, 'city': 'Salt Lake'}
    ]

Data for `df2`:

    [
        {'name': 'Charlie', 'age': 35, 'city': 'Chicago'},
        {'name': 'Diana', 'age': 82, 'city': 'Miami'}
    ]

Use the `pd.DataFrame()` function to create each DataFrame. Print both DataFrames.
Hint
To create a DataFrame, call the `pd.DataFrame()` function and pass the list of dictionaries as an argument.
**Context:** In pandas, a DataFrame is the fundamental structure for storing and manipulating tabular data. It can be created from many data formats. A list of dictionaries is a straightforward option: each dictionary represents a row, and its keys become the column names.
Solution

    import pandas as pd

    # Create DataFrames
    df1 = pd.DataFrame([
        {'name': 'Alice', 'age': None, 'city': 'New York'},
        {'name': 'Bob', 'age': 26, 'city': 'Los Angeles'},
        {'name': 'Oliver', 'age': None, 'city': 'Salt Lake'}
    ])
    df2 = pd.DataFrame([
        {'name': 'Charlie', 'age': 35, 'city': 'Chicago'},
        {'name': 'Diana', 'age': 82, 'city': 'Miami'}
    ])

    # Print both DataFrames
    print(df1)
    print(df2)
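The following is a small sketch of what happens to the `None` entries when a list of dictionaries becomes a DataFrame. The rows are an illustrative subset of the task data, and the variable names `rows` and `people` are chosen only for this example.

```python
import pandas as pd

# Illustrative rows shaped like the task data
rows = [
    {'name': 'Alice', 'age': None, 'city': 'New York'},
    {'name': 'Bob', 'age': 26, 'city': 'Los Angeles'},
]
people = pd.DataFrame(rows)

# Dictionary keys become column names, in order of first appearance
print(list(people.columns))

# None in a numeric column is stored as NaN, so the column is float64
print(people['age'].dtype)
print(people['age'].isna().tolist())
```

This is why `age` values such as 26 are shown as `26.0` once a missing value is present in the same column.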
Task 1.2: Combine DataFrames
Now, combine `df1` and `df2` into a single DataFrame named `combined_df`. Display the result.
Hint
Use the `pd.concat()` function and pass a list containing `df1` and `df2`.
**Context:** Concatenating in pandas joins two or more DataFrames either vertically, adding rows, or horizontally, adding columns. The `pd.concat()` function performs this operation and aligns the data on the other axis, so when rows are stacked, values are matched by column name. Each DataFrame keeps its original row labels in the result. Concatenation is useful for aggregating and comparing data from different sources.
Solution

    # Combine DataFrames
    combined_df = pd.concat([df1, df2])

    # Display the combined DataFrame
    combined_df
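To illustrate how row labels behave during concatenation, here is a minimal sketch with two tiny, made-up DataFrames (`top` and `bottom` are names invented for this example). It also shows the `ignore_index=True` option of `pd.concat()`, which renumbers the rows while concatenating.

```python
import pandas as pd

top = pd.DataFrame({'name': ['Alice', 'Bob']})
bottom = pd.DataFrame({'name': ['Charlie']})

# Default behavior: each input keeps its own row labels, so 0 appears twice
stacked = pd.concat([top, bottom])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and numbers rows from zero
renumbered = pd.concat([top, bottom], ignore_index=True)
print(renumbered.index.tolist())
```

Duplicate labels are legal in pandas, but they make label-based lookups such as `stacked.loc[0]` return more than one row.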
Task 1.3: Reset the Index
Reset the row index of `combined_df` so that it is continuous and sequential.
Hint
Use the `reset_index()` method with the `drop=True` parameter to reset the index without keeping the old index. Optionally, pass `inplace=True` to modify the existing DataFrame instead of creating a new one.
**Context:** Resetting the row index of a DataFrame renumbers the rows from zero and, unless `drop=True` is passed, turns the old index into a column. This is often done after operations such as sorting, filtering, or concatenating, which can leave gaps or duplicates in the index. The `reset_index()` method restores a continuous, sequential index.
Solution

    # Reset the index, discarding the old row labels
    combined_df = combined_df.reset_index(drop=True)
    # Alternatively, modify the DataFrame in place:
    # combined_df.reset_index(drop=True, inplace=True)

    # Display the new DataFrame
    combined_df
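The difference between keeping and dropping the old index can be seen in a short sketch. The `scores` DataFrame and its out-of-order labels are invented for illustration.

```python
import pandas as pd

# A made-up DataFrame whose row labels are out of order
scores = pd.DataFrame({'score': [70, 95, 88]}, index=[2, 0, 5])

# Without drop=True, the old labels move into a new column named 'index'
kept = scores.reset_index()
print(list(kept.columns))

# With drop=True, the old labels are discarded
dropped = scores.reset_index(drop=True)
print(list(dropped.columns))
print(dropped.index.tolist())
```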
Task 1.4: Fill in Missing Values
The 'age' column in your combined DataFrame has some missing values. Fill the missing values in the 'age' column with the median age.
Hint
First, calculate the median of the 'age' column using the `median()` method. Then, call the `fillna()` method on the `combined_df['age']` Series, passing in the median age.
**Context:** In datasets, missing numerical values are often filled with a measure of central tendency, such as the mean or median. The median is robust to outliers, so it is usually the better choice when the data distribution is skewed: the filled values stay typical for the dataset instead of being pulled toward extreme values.
Solution

    # Calculate the median of the 'age' column
    median_age = combined_df['age'].median()

    # Fill in missing values and assign the result back to the column.
    # (Calling fillna(..., inplace=True) on a selected column is unreliable
    # in recent pandas versions and triggers a warning.)
    combined_df['age'] = combined_df['age'].fillna(median_age)

    # Display the DataFrame
    combined_df
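As a rough illustration of why the median can be the safer fill value, the sketch below uses the three known ages from this step (26, 35, and 82) in a standalone Series named `ages`, a name chosen for this example.

```python
import pandas as pd

# The known ages from this step plus two missing entries
ages = pd.Series([26.0, None, 35.0, 82.0, None])

# The mean is pulled upward by the single large value (82)
print(round(ages.mean(), 2))

# The median is the middle value and ignores how extreme 82 is
print(ages.median())

median_filled = ages.fillna(ages.median())
print(median_filled.tolist())
```

Both `mean()` and `median()` skip missing values by default, so they are computed from the three known ages only.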
Working with JSON and CSV Files
To review the concepts covered in this step, please refer to the Programmatically Representing Data with Pandas module of the Up and Running with Pandas course.
Knowing how to work with JSON and CSV files is important because these are common data formats that you will encounter in data analysis.
In this step, you will practice converting JSON and CSV files into DataFrames and vice versa. You will also learn how to examine DataFrame metadata and make minimal transformations to the data. Use the `pd.read_json()` and `pd.read_csv()` functions to read JSON and CSV files, respectively, and the `to_json()` and `to_csv()` methods to convert DataFrames back to these formats.
Task 2.1: Load the CSV file into a DataFrame
Import pandas and use the `pd.read_csv()` function to load the CSV file 'Student_Scores.csv' into a DataFrame. Name the DataFrame `student_scores`.
Hint
Use the `pd.read_csv()` function and pass the name of the CSV file as a string.
Solution

    import pandas as pd

    # Load the CSV file into a DataFrame
    student_scores = pd.read_csv('Student_Scores.csv')
Task 2.2: Examine the DataFrame
Use the `head()` method to display the first 5 rows of the DataFrame. Then, display the metadata of the DataFrame using the `info()` method.
Hint
Call the `head()` and `info()` methods on the `student_scores` DataFrame.
**Context:** The `head()` method gives a quick look at the first rows of a DataFrame, providing an immediate snapshot of its structure and contents. By default it returns the first five rows, but you can pass a different number. The `info()` method reports metadata about the DataFrame, including column data types, non-null counts, and memory usage.
Solution

    # Display the first 5 rows of the DataFrame
    student_scores.head()

    # Display the metadata of the DataFrame
    # (in its own cell, so the head() table above is still displayed)
    student_scores.info()
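Since the contents of 'Student_Scores.csv' are not shown here, the sketch below uses a small stand-in DataFrame with hypothetical columns (`name`, `math_score`) to show what `head()` and `info()` report. The `buf` argument of `info()` is used only to capture the report as text.

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for the lab's CSV; the real columns may differ
sample = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'math_score': [95, 88, 91, None],
})

# Pass a number to head() to change how many rows are returned
first_two = sample.head(2)
print(first_two)

# info() prints its report; passing buf captures the text instead
buffer = StringIO()
sample.info(buf=buffer)
report = buffer.getvalue()
print(report)

# Non-null counts reveal missing data: only 3 of 4 math scores are present
print(sample['math_score'].count())
```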
Task 2.3: Convert the DataFrame to a JSON file
Convert the DataFrame created in Task 2.1, `student_scores`, to a JSON file.
Hint
Convert the DataFrame to a JSON file using the `to_json()` method.
Solution

    # Convert the DataFrame to a JSON file
    student_scores.to_json('student_scores.json')
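The layout of the JSON that `to_json()` produces depends on its `orient` parameter. The sketch below uses a made-up two-row DataFrame and omits the file path, in which case `to_json()` returns the JSON as a string instead of writing a file.

```python
import json
import pandas as pd

# Made-up data; the lab's file has its own columns
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'math_score': [91, 78]})

# Default layout for a DataFrame: one object per column, keyed by row label
by_column = df.to_json()
print(json.loads(by_column))

# orient='records': a list with one object per row
by_record = df.to_json(orient='records')
print(json.loads(by_record))
```

The records layout is often easier for other tools to consume, but it does not store the row labels.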
Task 2.4: Load the JSON file into a DataFrame
Load the JSON file from Task 2.3, `student_scores.json`, into a DataFrame called `student_scores_from_json`.
Hint
Use the `pd.read_json()` function and pass the name of the JSON file as a string.
Solution

    # Load the JSON file into a DataFrame
    student_scores_from_json = pd.read_json('student_scores.json')
Task 2.5: Compare the two DataFrames
Check whether the DataFrame created in Task 2.1, `student_scores`, and the DataFrame created in Task 2.4, `student_scores_from_json`, are identical.
Hint
Call the `equals()` method on the `student_scores` DataFrame and pass `student_scores_from_json` as the argument.
**Context:** The `equals()` method checks whether two DataFrames are identical by comparing their shape, values, and column data types. Missing values in the same position count as equal. It returns a Boolean result (`True` or `False`), which makes it useful for data validation, for example when verifying that data survived a round trip through another file format. Because such a round trip can change data types, a `False` result does not necessarily mean that the values themselves differ.
Solution

    # Check if the two DataFrames are identical
    student_scores.equals(student_scores_from_json)
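The sketch below shows how `equals()` treats data types and missing values, using small invented DataFrames rather than the lab's files.

```python
import pandas as pd

ints = pd.DataFrame({'score': [90, 85]})
floats = pd.DataFrame({'score': [90.0, 85.0]})

# Element by element, the values compare as equal
same_values = bool((ints == floats).all().all())
print(same_values)

# equals() also requires matching column dtypes (int64 vs. float64 here)
print(ints.equals(floats))

# After converting to the same dtype, the DataFrames are considered equal
print(ints.equals(floats.astype('int64')))

# Missing values in the same position are treated as equal
with_gap = pd.DataFrame({'score': [90.0, None]})
print(with_gap.equals(with_gap.copy()))
```

If `equals()` returns `False` unexpectedly, comparing the `dtypes` attribute of both DataFrames is a reasonable first check.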
Exploring and Evaluating Data
To review the concepts covered in this step, please refer to the Exploring and Evaluating Data with Pandas module of the Up and Running with Pandas course.
Exploring and evaluating data is important because it allows you to understand the characteristics and relationships within your data, which is crucial for any data analysis task.
In this step, you will practice deriving different properties from a DataFrame, filtering data based on column values, and determining the size of a DataFrame. You will also learn how to interpret correlation scores and get statistical properties of numerical columns. Use the `describe()` and `corr()` methods to get statistical properties and correlation scores, respectively.
Task 3.1: Load the Data
Import pandas and load the 'Student_Scores.csv' data into a pandas DataFrame called 'student_scores'.
Hint
Use the `pd.read_csv()` function to load the data. The file is 'Student_Scores.csv'.
Solution

    import pandas as pd

    # Load the data
    student_scores = pd.read_csv('Student_Scores.csv')
Task 3.2: Explore the Data
Use the `head()` method to display the first 5 rows of the DataFrame 'student_scores'.
Hint
Call the `head()` method on the DataFrame `student_scores`.
> **Tip:** To display a different number of rows than the default 5, pass the number you want, for example `df.head(10)`.
Solution

    # Display the first 5 rows of the DataFrame
    student_scores.head()
Task 3.3: Get Statistical Properties
Get statistical properties of the numerical columns in the DataFrame 'student_scores'.
Hint
Call the `describe()` method on the DataFrame `student_scores` to view the summary.
**Context:** The `describe()` method in pandas generates summary statistics for the numeric columns in a DataFrame, including the count, mean, standard deviation, minimum, maximum, and quartiles. It offers a quick overview of the central tendency and spread of the data, which helps with initial data exploration.
Solution

    # Get statistical properties
    student_scores.describe()
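To make the output of `describe()` concrete, here is a sketch on a made-up DataFrame with one text column and one numeric column (both column names are hypothetical).

```python
import pandas as pd

# Made-up data with one text column and one numeric column
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e'],
    'math_score': [60, 70, 80, 90, 100],
})

summary = df.describe()

# Only the numeric column is summarized by default
print(list(summary.columns))

# Each statistic is a row label in the result
print(list(summary.index))
print(summary.loc['mean', 'math_score'])
print(summary.loc['50%', 'math_score'])   # the median
```

Individual statistics can be read from the result with `loc`, as shown in the last two lines.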
Task 3.4: Get Correlation Scores
Find the correlation scores between numerical columns in the DataFrame 'student_scores'.
Hint
Call the `corr()` method on the DataFrame `student_scores`.
**Context:** The `corr()` method in pandas calculates the pairwise correlation coefficients between numeric columns in a DataFrame, measuring the strength and direction of their linear relationships. Values range from -1 (perfect negative relationship) through 0 (no linear relationship) to 1 (perfect positive relationship).
Solution

    # Get correlation scores for the numeric columns.
    # numeric_only=True (available in pandas 1.5+) skips text columns,
    # which pandas 2.0+ would otherwise reject with an error.
    student_scores.corr(numeric_only=True)
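The following sketch shows how to read a correlation matrix, using an invented dataset in which the relationships are perfectly linear. The column names `hours_studied` and `absences` are hypothetical and are not taken from the lab's CSV file. The `numeric_only=True` argument assumes pandas 1.5 or later.

```python
import pandas as pd

# Invented data: the score rises by 9 per hour studied, absences fall by 2
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'hours_studied': [1, 2, 3, 4],
    'math_score': [52, 61, 70, 79],
    'absences': [8, 6, 4, 2],
})

# Skip the text column so only numeric columns are compared
corr = df.corr(numeric_only=True)
print(corr.round(2))

# Close to 1: the two columns rise together
print(round(corr.loc['hours_studied', 'math_score'], 2))

# Close to -1: one column falls as the other rises
print(round(corr.loc['hours_studied', 'absences'], 2))
```

Real data rarely produces values this extreme; the diagonal is always 1 because every column correlates perfectly with itself.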
Task 3.5: Filter Data
Filter the DataFrame to include only rows where the value in the 'math_score' column is greater than `90`, and save the result in a new DataFrame called 'high_math_scores'. Print the number of students with a math score greater than 90.
Hint
To create 'high_math_scores', find the rows where 'math_score' is greater than 90. Use a condition like `student_scores['math_score'] > 90` to create a filter, and then apply this filter to your DataFrame to extract the matching rows into 'high_math_scores'. To get the number of students in the 'high_math_scores' DataFrame, print its length with the `len()` function.
Solution

    # Filter data
    high_math_scores = student_scores[student_scores['math_score'] > 90]

    # Print the number of matching students
    print(len(high_math_scores), "students have a math score > 90")
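A condition on a column produces a Boolean Series, often called a mask, and indexing with it keeps only the rows marked `True`. The sketch below uses invented data; the `reading_score` column is hypothetical and appears only to show how two conditions can be combined.

```python
import pandas as pd

# Invented data; 'reading_score' is a hypothetical second column
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'math_score': [95, 88, 91, 72],
    'reading_score': [80, 93, 97, 99],
})

# The comparison yields one True/False value per row
mask = df['math_score'] > 90
print(mask.tolist())

# Indexing with the mask keeps only the True rows
high_math = df[mask]
print(high_math['name'].tolist())

# Combine conditions with & (and) or | (or); each needs its own parentheses
both_high = df[(df['math_score'] > 90) & (df['reading_score'] > 90)]
print(both_high['name'].tolist())
```

The parentheses are required because `&` binds more tightly than `>` in Python.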
Task 3.6: Determine DataFrame Size
Determine the size of the student scores DataFrame using the `shape` attribute.
Hint
Use the `shape` attribute on the DataFrame `student_scores`.
**Context:** The `shape` attribute in pandas returns a tuple with the dimensions of a DataFrame: the number of rows, followed by the number of columns. It is a quick way to check the size and structure of your dataset.
Solution

    # Determine DataFrame size
    student_scores.shape
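For comparison, this sketch contrasts `shape` with two related ways of measuring a DataFrame, using a small invented example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is an attribute, not a method, so it takes no parentheses
rows, cols = df.shape
print(rows, cols)

# len() gives only the number of rows
print(len(df))

# size gives the total number of cells (rows * columns)
print(df.size)
```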