Libraries: If you want this lab, consider one of these libraries.
Data

Merging Data from Different Sources in Python Hands-on Practice

In this lab, you'll gain practical skills in merging data using pandas. Learn to concatenate datasets, implement various joins (inner, left, outer) with pd.merge(), and tackle advanced merging scenarios including multiple columns and overlapping column names. This is crucial for effective data analysis and integration.

Get started Contact sales

Lab Info

Level

Beginner

Last updated

Dec 30, 2026

Duration

33m

Challenge

Understanding and Implementing Concatenation
Jupyter Guide

To get started, open the file on the right entitled "Step 1...". You'll complete each task for Step 1 in that Jupyter Notebook file. Remember, you must run the cells (ctrl/cmd(⌘) + Enter) for each task before moving onto the next task in the Jupyter Notebook. Continue until you have completed all tasks in this step. Then when you are ready to move onto the next step, you'll come back and click on the file for the next step until you have completed all tasks in all steps of the lab.

Understanding and Implementing Concatenation

To review the concepts covered in this step, please refer to the Introduction to Data Merging and Concatenation module of the Merging Data from Different Sources in Python course.

Understanding the difference between concatenation and merge is important because it helps in deciding which operation to use based on the structure and requirements of your data. Concatenation is used when we want to combine datasets with the same structure, while merge is used when we want to combine datasets based on shared keys.

In this step, you will practice the concept of concatenation using pandas' concat() function. The goal is to combine two datasets with the same structure. You will also learn how to handle duplicate indices during concatenation using ignore_index and keys parameters.

Task 1.1: Load the Data

First, you need to import the pandas library as pd, then load the MarketingChunk1.csv and MarketingChunk2.csv data files into two pandas DataFrames. Use the pandas read_csv() function to accomplish this.

🔍 Hint

Use the read_csv() function from the pandas library. The file path should be passed as a string to this function.
🔑 Solution

import pandas as pd # Load the data marketing_data_1 = pd.read_csv('MarketingChunk1.csv') marketing_data_2 = pd.read_csv('MarketingChunk2.csv')
Task 1.2: Check the Data

Display the first few rows of each DataFrame to verify they have the same column structure.

🔍 Hint

The DataFrame.head() function will show the first few rows of a DataFrame.
🔑 Solution

Cell 1

marketing_data_1.head()

Cell2

marketing_data_2.head()
Task 1.3: Concatenate the Data

Now, use the pandas concat() function to concatenate the two DataFrames. Print the shape of the new dataframe and it's index.

🔍 Hint

Pass a list of the two DataFrames you want to concatenate to the concat() function. To access shape, use the shape attribute. For access to the index, use the index attribute of the DataFrame.
🔑 Solution

# Concatenate the data combined_data = pd.concat([marketing_data_1, marketing_data_2]) print(combined_data.shape) print(combined_data.index)
Task 1.4: Handle Duplicate Indices

When you concatenated the DataFrames, the original indices were preserved. This means that there are duplicate indices in the combined DataFrame. Use the ignore_index parameter of the concat() function to create a new DataFrame where the indices are reset. Print the index of the new DataFrame.

🔍 Hint

Pass the same list of DataFrames to the concat() function, but this time also pass True to the ignore_index parameter.
🔑 Solution

# Handle duplicate indices combined_data_reset = pd.concat([marketing_data_1, marketing_data_2], ignore_index=True) print(combined_data_reset.index)
Task 1.5: Use Keys Parameter

The keys parameter of the concat() function can be used to create a hierarchical index. Create a new DataFrame where the original DataFrames are identifiable in the combined DataFrame by using the keys parameter.

🔍 Hint

Pass the same list of DataFrames to the concat() function, but this time also pass a list of strings to the keys parameter. These strings will be used as the first level of the index in the combined DataFrame. Print the first few rows of this new DataFrame.
🔑 Solution

# Use keys parameter combined_data_keys = pd.concat([marketing_data_1, marketing_data_2], keys=['First Half', 'Second Half']) combined_data_keys.head()
Challenge

Implementing Joins with pd.merge()
Implementing Joins with pd.merge()

To review the concepts covered in this step, please refer to the Implementing Joins with pd.merge() module of the Merging Data from Different Sources in Python course.

Understanding how to implement joins is important because it allows you to link data from separate sources based on shared keys or identifiers. This is crucial for combining and analyzing data from diverse sources.

In this step, you will practice implementing different types of joins, such as one-to-one, one-to-many, and inner and outer joins using the pd.merge() function. The goal is to merge two datasets based on shared keys.

Task 2.1: Importing Required Libraries

Before we start with the tasks, we need to import the required libraries. In this task, you will import pandas library which provides us with the functions to manipulate and analyze data.

🔍 Hint

Use the import keyword to import pandas. It is a common practice to import pandas with the alias pd.
🔑 Solution

import pandas as pd
Task 2.2: Loading the Dataset

In this task, you will load the dataset using pandas. The dataset is stored in a CSV file named 'MarketingChunk1.csv'. Display the first few rows.

🔍 Hint

Use the pd.read_csv() function to read the CSV file. The file path should be passed as a string to this function.
🔑 Solution

df = pd.read_csv('MarketingChunk1.csv') df.head()
Task 2.3: Loading a Second Dataset

To perform a merge we need data to merge! The dataset we want to merge is stored in a CSV file named 'MarketingMerge1.csv'. Display the first few rows.

🔍 Hint

Use the pd.read_csv() function to read the CSV file. The file path should be passed as a string to this function.
🔑 Solution

df2 = pd.read_csv('MarketingMerge1.csv') df2.head()
Task 2.4: Performing an Inner Join

In this task, you will perform an Inner join between the two datasets using the RecordID column as the key. Print the length and display the first 10 rows.

Notice there are no NaN values after the join because only matched rows were preserved.

🔍 Hint

Use the pd.merge() function to perform the join. The first two arguments should be the two dataframes. The on parameter should be set to the column name that will be used as the key. The how parameter should be set to 'inner' to perform an inner join. To display 10 rows, use the head method with 10 as the argument.
🔑 Solution

df_joined = pd.merge(df, df2, on='RecordID', how='inner') print(len(df_joined)) df_joined.head(10)
Task 2.5: Performing a Left Join

In this task, you will perform a left join between the two datasets using the RecordID column as the key. Print the length and display the first 10 rows.

Notice the NaN values where the outer join preserved all records from the MarketingChunk1 data and appended NaN values where MarketingMerge1 had no data.

🔍 Hint

Use the pd.merge() function to perform the join. The first two arguments should be the two dataframes. The on parameter should be set to the column name that will be used as the key. The how parameter should be set to 'left' to perform an left join. To display 10 rows, use the head method with 10 as the argument.
🔑 Solution

df_joined = pd.merge(df, df2, on='RecordID', how='left') print(len(df_joined)) df_joined.head(10)
Task 2.6: Performing an Outer Join

In this task, you will perform an outer join between the two datasets using the RecordID column as the key. Print the length and display the first 5 rows and the last 5 rows.

Notice the NaN values where the outer join preserved all records even if they didn't match.

🔍 Hint

Use the pd.merge() function to perform the join. The first two arguments should be the two dataframes. The on parameter should be set to the column name that will be used as the key. The how parameter should be set to 'outer' to perform an outer join. To display the first 5 and last 5 rows, use head() and tail().
🔑 Solution

Cell 1

df_joined = pd.merge(df, df2, on='RecordID', how='outer') print(len(df_joined)) df_joined.head()

Cell 2

df_joined.tail()
Challenge

Managing Merge Keys and Projecting Columns
Advanced Merging Techniques in Pandas

To review the concepts covered in this step, please refer to the Implementing Joins with pd.merge() module of the Merging Data from Different Sources in Python course.

Let's dive into some practical exercises with pandas, focusing on merging techniques. We're using two datasets: MarketingChunk1.csv and MarketingFeedback.csv, which share RecordID and Email columns. This setup is perfect for practicing merges on multiple columns, and using left_on and right_on. Plus, we'll see how to elegantly handle overlapping column names.

Task 3.1: Load the Datasets

Get started by loading MarketingChunk1.csv and MarketingFeedback.csv into pandas DataFrames.

🔍 Hint

Use pd.read_csv() for each file. Remember, the file paths are just strings.
🔑 Solution

import pandas as pd # Loading the datasets marketing_chunk1_df = pd.read_csv('MarketingChunk1.csv') marketing_feedback_df = pd.read_csv('MarketingFeedback.csv')
Task 3.2: Take a Peek at the Data

Let's take a quick look at the first few rows of both DataFrames to understand what we're working with.

🔍 Hint

head() is your friend here. It'll give you the top 5 rows by default.

🔑 Solution
Cell 1 ```python # Checking out MarketingChunk1 marketing_chunk1_df.head() ``` Cell 2 ```python # And now for MarketingFeedback marketing_feedback_df.head() ```

Task 3.3: Merge on Multiple Columns

Merge the two files on RecordID and Email, then inspect the results.

🔍 Hint

In pd.merge(), use on=['RecordID', 'Email'] to merge on both columns. Then, call head() on the merged DataFrame to see the top rows.
🔑 Solution

# Merging on RecordID and Email merged_df = pd.merge(marketing_chunk1_df, marketing_feedback_df, on=['RecordID', 'Email']) # Let's see the result merged_df.head()
Task 3.4: Merge with Different Key Names

Now, try merging MarketingChunk1.csv and MarketingFeedback.csv on RecordID using left_on and right_on. This approach is useful when the columns you want to join on have different names in each DataFrame.

🔍 Hint

Use left_on='RecordID' and right_on='RecordID' in your pd.merge() call. This tells pandas to use RecordID from both DataFrames as the merging key.
🔑 Solution

# Merge using left_on and right_on merged_df = pd.merge(marketing_chunk1_df, marketing_feedback_df, left_on='RecordID', right_on='RecordID') # Inspecting the merged DataFrame merged_df.head()
Task 3.5: Custom Suffixes for Overlapping Columns

When you merged the DataFrames in Task 3.4, you might have noticed the automatic suffixes pandas added to distinguish the Email column from each DataFrame. Let's repeat that merge, but this time use custom suffixes and see the result.

🔍 Hint

In the pd.merge() function, add suffixes=('_chunk1', '_feedback'). This will append these suffixes to any overlapping column names.
🔑 Solution

# Merging with custom suffixes merged_df = pd.merge(marketing_chunk1_df, marketing_feedback_df, left_on='RecordID', right_on='RecordID', suffixes=('_chunk1', '_feedback')) # Checking the first few rows to see the suffixes in action merged_df.head()

About the author

Real skill practice before real-world application

Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.

Learn by doing

Engage hands-on with the tools and technologies you’re learning. You pick the skill, we provide the credentials and environment.

Follow your guide

All labs have detailed instructions and objectives, guiding you through the learning process and ensuring you understand every step.

Turn time into mastery

On average, you retain 75% more of your learning if you take time to practice. Hands-on labs set you up for success to make those skills stick.

Merging Data from Different Sources in Python Hands-on Practice

Lab Info

Table of Contents

Understanding and Implementing Concatenation

Jupyter Guide

Understanding and Implementing Concatenation

Task 1.1: Load the Data

Task 1.2: Check the Data

Task 1.3: Concatenate the Data

Task 1.4: Handle Duplicate Indices

Task 1.5: Use Keys Parameter

Implementing Joins with pd.merge()

Implementing Joins with pd.merge()

Task 2.1: Importing Required Libraries

Task 2.2: Loading the Dataset

Task 2.3: Loading a Second Dataset

Task 2.4: Performing an Inner Join

Task 2.5: Performing a Left Join

Task 2.6: Performing an Outer Join

Managing Merge Keys and Projecting Columns

Advanced Merging Techniques in Pandas

Task 3.1: Load the Datasets

Task 3.2: Take a Peek at the Data

Task 3.3: Merge on Multiple Columns

Task 3.4: Merge with Different Key Names

Task 3.5: Custom Suffixes for Overlapping Columns

About the author

Real skill practice before real-world application

Learn by doing

Follow your guide

Turn time into mastery

Get started with Pluralsight