- Lab
- Data

Merging Data from Different Sources in Python Hands-on Practice
In this lab, you'll gain practical skills in merging data using pandas. Learn to concatenate datasets, implement various joins (inner, left, outer) with pd.merge(), and tackle advanced merging scenarios including multiple columns and overlapping column names. This is crucial for effective data analysis and integration.

Path Info
Table of Contents
-
Challenge
Understanding and Implementing Concatenation
Jupyter Guide
To get started, open the file on the right entitled "Step 1...". You'll complete each task for Step 1 in that Jupyter Notebook file. Remember, you must run the cells
(ctrl/cmd(β) + Enter)
for each task before moving onto the next task in the Jupyter Notebook. Continue until you have completed all tasks in this step. Then when you are ready to move onto the next step, you'll come back and click on the file for the next step until you have completed all tasks in all steps of the lab.
Understanding and Implementing Concatenation
To review the concepts covered in this step, please refer to the Introduction to Data Merging and Concatenation module of the Merging Data from Different Sources in Python course.
Understanding the difference between concatenation and merge is important because it helps in deciding which operation to use based on the structure and requirements of your data. Concatenation is used when we want to combine datasets with the same structure, while merge is used when we want to combine datasets based on shared keys.
In this step, you will practice the concept of concatenation using pandas'
concat()
function. The goal is to combine two datasets with the same structure. You will also learn how to handle duplicate indices during concatenation usingignore_index
andkeys
parameters.
Task 1.1: Load the Data
First, you need to import the
pandas
library aspd
, then load theMarketingChunk1.csv
andMarketingChunk2.csv
data files into two pandas DataFrames. Use the pandasread_csv()
function to accomplish this.π Hint
Use the
read_csv()
function from the pandas library. The file path should be passed as a string to this function.π Solution
import pandas as pd # Load the data marketing_data_1 = pd.read_csv('MarketingChunk1.csv') marketing_data_2 = pd.read_csv('MarketingChunk2.csv')
Task 1.2: Check the Data
Display the first few rows of each DataFrame to verify they have the same column structure.
π Hint
The
DataFrame.head()
function will show the first few rows of a DataFrame.π Solution
Cell 1
marketing_data_1.head()
Cell2
marketing_data_2.head()
Task 1.3: Concatenate the Data
Now, use the pandas
concat()
function to concatenate the two DataFrames. Print the shape of the new dataframe and it's index.π Hint
Pass a list of the two DataFrames you want to concatenate to the
concat()
function. To access shape, use theshape
attribute. For access to the index, use theindex
attribute of the DataFrame.π Solution
# Concatenate the data combined_data = pd.concat([marketing_data_1, marketing_data_2]) print(combined_data.shape) print(combined_data.index)
Task 1.4: Handle Duplicate Indices
When you concatenated the DataFrames, the original indices were preserved. This means that there are duplicate indices in the combined DataFrame. Use the
ignore_index
parameter of theconcat()
function to create a new DataFrame where the indices are reset. Print the index of the new DataFrame.π Hint
Pass the same list of DataFrames to the
concat()
function, but this time also passTrue
to theignore_index
parameter.π Solution
# Handle duplicate indices combined_data_reset = pd.concat([marketing_data_1, marketing_data_2], ignore_index=True) print(combined_data_reset.index)
Task 1.5: Use Keys Parameter
The
keys
parameter of theconcat()
function can be used to create a hierarchical index. Create a new DataFrame where the original DataFrames are identifiable in the combined DataFrame by using thekeys
parameter.π Hint
Pass the same list of DataFrames to the
concat()
function, but this time also pass a list of strings to thekeys
parameter. These strings will be used as the first level of the index in the combined DataFrame. Print the first few rows of this new DataFrame.π Solution
# Use keys parameter combined_data_keys = pd.concat([marketing_data_1, marketing_data_2], keys=['First Half', 'Second Half']) combined_data_keys.head()
-
Challenge
Implementing Joins with pd.merge()
Implementing Joins with pd.merge()
To review the concepts covered in this step, please refer to the Implementing Joins with pd.merge() module of the Merging Data from Different Sources in Python course.
Understanding how to implement joins is important because it allows you to link data from separate sources based on shared keys or identifiers. This is crucial for combining and analyzing data from diverse sources.
In this step, you will practice implementing different types of joins, such as one-to-one, one-to-many, and inner and outer joins using the
pd.merge()
function. The goal is to merge two datasets based on shared keys.
Task 2.1: Importing Required Libraries
Before we start with the tasks, we need to import the required libraries. In this task, you will import pandas library which provides us with the functions to manipulate and analyze data.
π Hint
Use the
import
keyword to import pandas. It is a common practice to import pandas with the aliaspd
.π Solution
import pandas as pd
Task 2.2: Loading the Dataset
In this task, you will load the dataset using pandas. The dataset is stored in a CSV file named 'MarketingChunk1.csv'. Display the first few rows.
π Hint
Use the
pd.read_csv()
function to read the CSV file. The file path should be passed as a string to this function.π Solution
df = pd.read_csv('MarketingChunk1.csv') df.head()
Task 2.3: Loading a Second Dataset
To perform a merge we need data to merge! The dataset we want to merge is stored in a CSV file named 'MarketingMerge1.csv'. Display the first few rows.
π Hint
Use the
pd.read_csv()
function to read the CSV file. The file path should be passed as a string to this function.π Solution
df2 = pd.read_csv('MarketingMerge1.csv') df2.head()
Task 2.4: Performing an Inner Join
In this task, you will perform an Inner join between the two datasets using the
RecordID
column as the key. Print the length and display the first 10 rows.Notice there are no NaN values after the join because only matched rows were preserved.
π Hint
Use the
pd.merge()
function to perform the join. The first two arguments should be the two dataframes. Theon
parameter should be set to the column name that will be used as the key. Thehow
parameter should be set to 'inner' to perform an inner join. To display 10 rows, use the head method with 10 as the argument.π Solution
df_joined = pd.merge(df, df2, on='RecordID', how='inner') print(len(df_joined)) df_joined.head(10)
Task 2.5: Performing a Left Join
In this task, you will perform a left join between the two datasets using the
RecordID
column as the key. Print the length and display the first 10 rows.Notice the NaN values where the outer join preserved all records from the
MarketingChunk1
data and appended NaN values whereMarketingMerge1
had no data.π Hint
Use the
pd.merge()
function to perform the join. The first two arguments should be the two dataframes. Theon
parameter should be set to the column name that will be used as the key. Thehow
parameter should be set to 'left' to perform an left join. To display 10 rows, use the head method with 10 as the argument.π Solution
df_joined = pd.merge(df, df2, on='RecordID', how='left') print(len(df_joined)) df_joined.head(10)
Task 2.6: Performing an Outer Join
In this task, you will perform an outer join between the two datasets using the
RecordID
column as the key. Print the length and display the first 5 rows and the last 5 rows.Notice the NaN values where the outer join preserved all records even if they didn't match.
π Hint
Use the
pd.merge()
function to perform the join. The first two arguments should be the two dataframes. Theon
parameter should be set to the column name that will be used as the key. Thehow
parameter should be set to 'outer' to perform an outer join. To display the first 5 and last 5 rows, usehead()
andtail()
.π Solution
Cell 1
df_joined = pd.merge(df, df2, on='RecordID', how='outer') print(len(df_joined)) df_joined.head()
Cell 2
df_joined.tail()
-
Challenge
Managing Merge Keys and Projecting Columns
Advanced Merging Techniques in Pandas
To review the concepts covered in this step, please refer to the Implementing Joins with pd.merge() module of the Merging Data from Different Sources in Python course.
Let's dive into some practical exercises with pandas, focusing on merging techniques. We're using two datasets:
MarketingChunk1.csv
andMarketingFeedback.csv
, which shareRecordID
andEmail
columns. This setup is perfect for practicing merges on multiple columns, and usingleft_on
andright_on
. Plus, we'll see how to elegantly handle overlapping column names.
Task 3.1: Load the Datasets
Get started by loading
MarketingChunk1.csv
andMarketingFeedback.csv
into pandas DataFrames.π Hint
Use
pd.read_csv()
for each file. Remember, the file paths are just strings.π Solution
import pandas as pd # Loading the datasets marketing_chunk1_df = pd.read_csv('MarketingChunk1.csv') marketing_feedback_df = pd.read_csv('MarketingFeedback.csv')
Task 3.2: Take a Peek at the Data
Let's take a quick look at the first few rows of both DataFrames to understand what we're working with.
π Hint
head()
is your friend here. It'll give you the top 5 rows by default.π Solution
Cell 1 ```python # Checking out MarketingChunk1 marketing_chunk1_df.head() ``` Cell 2 ```python # And now for MarketingFeedback marketing_feedback_df.head() ```
Task 3.3: Merge on Multiple Columns
Merge the two files on
RecordID
andEmail
, then inspect the results.π Hint
In
pd.merge()
, useon=['RecordID', 'Email']
to merge on both columns. Then, callhead()
on the merged DataFrame to see the top rows.π Solution
# Merging on RecordID and Email merged_df = pd.merge(marketing_chunk1_df, marketing_feedback_df, on=['RecordID', 'Email']) # Let's see the result merged_df.head()
Task 3.4: Merge with Different Key Names
Now, try merging
MarketingChunk1.csv
andMarketingFeedback.csv
onRecordID
usingleft_on
andright_on
. This approach is useful when the columns you want to join on have different names in each DataFrame.π Hint
Use
left_on='RecordID'
andright_on='RecordID'
in yourpd.merge()
call. This tells pandas to useRecordID
from both DataFrames as the merging key.π Solution
# Merge using left_on and right_on merged_df = pd.merge(marketing_chunk1_df, marketing_feedback_df, left_on='RecordID', right_on='RecordID') # Inspecting the merged DataFrame merged_df.head()
Task 3.5: Custom Suffixes for Overlapping Columns
When you merged the DataFrames in Task 3.4, you might have noticed the automatic suffixes pandas added to distinguish the
Email
column from each DataFrame. Let's repeat that merge, but this time use custom suffixes and see the result.π Hint
In the
pd.merge()
function, addsuffixes=('_chunk1', '_feedback')
. This will append these suffixes to any overlapping column names.π Solution
# Merging with custom suffixes merged_df = pd.merge(marketing_chunk1_df, marketing_feedback_df, left_on='RecordID', right_on='RecordID', suffixes=('_chunk1', '_feedback')) # Checking the first few rows to see the suffixes in action merged_df.head()
What's a lab?
Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.
Provided environment for hands-on practice
We will provide the credentials and environment necessary for you to practice right within your browser.
Guided walkthrough
Follow along with the authorβs guided walkthrough and build something new in your provided environment!
Did you know?
On average, you retain 75% more of your learning if you get time for practice.