Hamburger Icon
  • Labs icon Lab
  • Data
Labs

Merging Data from Different Sources in Python Hands-on Practice

In this lab, you'll gain practical skills in merging data using pandas. Learn to concatenate datasets, implement various joins (inner, left, outer) with pd.merge(), and tackle advanced merging scenarios including multiple columns and overlapping column names. This is crucial for effective data analysis and integration.

Labs

Path Info

Level
Clock icon Beginner
Duration
Clock icon 37m
Published
Clock icon Dec 12, 2023

Contact sales

By filling out this form and clicking submit, you acknowledge ourΒ privacy policy.

Table of Contents

  1. Challenge

    Understanding and Implementing Concatenation

    Jupyter Guide

    To get started, open the file on the right entitled "Step 1...". You'll complete each task for Step 1 in that Jupyter Notebook file. Remember, you must run the cells (ctrl/cmd(⌘) + Enter) for each task before moving onto the next task in the Jupyter Notebook. Continue until you have completed all tasks in this step. Then when you are ready to move onto the next step, you'll come back and click on the file for the next step until you have completed all tasks in all steps of the lab.


    Understanding and Implementing Concatenation

    To review the concepts covered in this step, please refer to the Introduction to Data Merging and Concatenation module of the Merging Data from Different Sources in Python course.

    Understanding the difference between concatenation and merge is important because it helps in deciding which operation to use based on the structure and requirements of your data. Concatenation is used when we want to combine datasets with the same structure, while merge is used when we want to combine datasets based on shared keys.

    In this step, you will practice the concept of concatenation using pandas' concat() function. The goal is to combine two datasets with the same structure. You will also learn how to handle duplicate indices during concatenation using ignore_index and keys parameters.


    Task 1.1: Load the Data

    First, you need to import the pandas library as pd, then load the MarketingChunk1.csv and MarketingChunk2.csv data files into two pandas DataFrames. Use the pandas read_csv() function to accomplish this.

    πŸ” Hint

    Use the read_csv() function from the pandas library. The file path should be passed as a string to this function.

    πŸ”‘ Solution
    import pandas as pd
    
    # Load the data
    marketing_data_1 = pd.read_csv('MarketingChunk1.csv')
    marketing_data_2 = pd.read_csv('MarketingChunk2.csv')
    

    Task 1.2: Check the Data

    Display the first few rows of each DataFrame to verify they have the same column structure.

    πŸ” Hint

    The DataFrame.head() function will show the first few rows of a DataFrame.

    πŸ”‘ Solution

    Cell 1

    marketing_data_1.head()
    

    Cell2

    marketing_data_2.head()
    

    Task 1.3: Concatenate the Data

    Now, use the pandas concat() function to concatenate the two DataFrames. Print the shape of the new dataframe and it's index.

    πŸ” Hint

    Pass a list of the two DataFrames you want to concatenate to the concat() function. To access shape, use the shape attribute. For access to the index, use the index attribute of the DataFrame.

    πŸ”‘ Solution
    # Concatenate the data
    combined_data = pd.concat([marketing_data_1, marketing_data_2])
    print(combined_data.shape)
    print(combined_data.index)
    

    Task 1.4: Handle Duplicate Indices

    When you concatenated the DataFrames, the original indices were preserved. This means that there are duplicate indices in the combined DataFrame. Use the ignore_index parameter of the concat() function to create a new DataFrame where the indices are reset. Print the index of the new DataFrame.

    πŸ” Hint

    Pass the same list of DataFrames to the concat() function, but this time also pass True to the ignore_index parameter.

    πŸ”‘ Solution
    # Handle duplicate indices
    combined_data_reset = pd.concat([marketing_data_1, marketing_data_2], ignore_index=True)
    print(combined_data_reset.index)
    

    Task 1.5: Use Keys Parameter

    The keys parameter of the concat() function can be used to create a hierarchical index. Create a new DataFrame where the original DataFrames are identifiable in the combined DataFrame by using the keys parameter.

    πŸ” Hint

    Pass the same list of DataFrames to the concat() function, but this time also pass a list of strings to the keys parameter. These strings will be used as the first level of the index in the combined DataFrame. Print the first few rows of this new DataFrame.

    πŸ”‘ Solution
    # Use keys parameter
    combined_data_keys = pd.concat([marketing_data_1, marketing_data_2], keys=['First Half', 'Second Half'])
    combined_data_keys.head()
    
  2. Challenge

    Implementing Joins with pd.merge()

    Implementing Joins with pd.merge()

    To review the concepts covered in this step, please refer to the Implementing Joins with pd.merge() module of the Merging Data from Different Sources in Python course.

    Understanding how to implement joins is important because it allows you to link data from separate sources based on shared keys or identifiers. This is crucial for combining and analyzing data from diverse sources.

    In this step, you will practice implementing different types of joins, such as one-to-one, one-to-many, and inner and outer joins using the pd.merge() function. The goal is to merge two datasets based on shared keys.


    Task 2.1: Importing Required Libraries

    Before we start with the tasks, we need to import the required libraries. In this task, you will import pandas library which provides us with the functions to manipulate and analyze data.

    πŸ” Hint

    Use the import keyword to import pandas. It is a common practice to import pandas with the alias pd.

    πŸ”‘ Solution
    import pandas as pd
    

    Task 2.2: Loading the Dataset

    In this task, you will load the dataset using pandas. The dataset is stored in a CSV file named 'MarketingChunk1.csv'. Display the first few rows.

    πŸ” Hint

    Use the pd.read_csv() function to read the CSV file. The file path should be passed as a string to this function.

    πŸ”‘ Solution
    df = pd.read_csv('MarketingChunk1.csv')
    df.head()
    

    Task 2.3: Loading a Second Dataset

    To perform a merge we need data to merge! The dataset we want to merge is stored in a CSV file named 'MarketingMerge1.csv'. Display the first few rows.

    πŸ” Hint

    Use the pd.read_csv() function to read the CSV file. The file path should be passed as a string to this function.

    πŸ”‘ Solution
    df2 = pd.read_csv('MarketingMerge1.csv')
    df2.head()
    

    Task 2.4: Performing an Inner Join

    In this task, you will perform an Inner join between the two datasets using the RecordID column as the key. Print the length and display the first 10 rows.

    Notice there are no NaN values after the join because only matched rows were preserved.

    πŸ” Hint

    Use the pd.merge() function to perform the join. The first two arguments should be the two dataframes. The on parameter should be set to the column name that will be used as the key. The how parameter should be set to 'inner' to perform an inner join. To display 10 rows, use the head method with 10 as the argument.

    πŸ”‘ Solution
    df_joined = pd.merge(df, df2, on='RecordID', how='inner')
    print(len(df_joined))
    df_joined.head(10)
    

    Task 2.5: Performing a Left Join

    In this task, you will perform a left join between the two datasets using the RecordID column as the key. Print the length and display the first 10 rows.

    Notice the NaN values where the outer join preserved all records from the MarketingChunk1 data and appended NaN values where MarketingMerge1 had no data.

    πŸ” Hint

    Use the pd.merge() function to perform the join. The first two arguments should be the two dataframes. The on parameter should be set to the column name that will be used as the key. The how parameter should be set to 'left' to perform an left join. To display 10 rows, use the head method with 10 as the argument.

    πŸ”‘ Solution
    df_joined = pd.merge(df, df2, on='RecordID', how='left')
    print(len(df_joined))
    df_joined.head(10)
    

    Task 2.6: Performing an Outer Join

    In this task, you will perform an outer join between the two datasets using the RecordID column as the key. Print the length and display the first 5 rows and the last 5 rows.

    Notice the NaN values where the outer join preserved all records even if they didn't match.

    πŸ” Hint

    Use the pd.merge() function to perform the join. The first two arguments should be the two dataframes. The on parameter should be set to the column name that will be used as the key. The how parameter should be set to 'outer' to perform an outer join. To display the first 5 and last 5 rows, use head() and tail().

    πŸ”‘ Solution

    Cell 1

    df_joined = pd.merge(df, df2, on='RecordID', how='outer')
    print(len(df_joined))
    df_joined.head()
    

    Cell 2

    df_joined.tail()
    
  3. Challenge

    Managing Merge Keys and Projecting Columns

    Advanced Merging Techniques in Pandas

    To review the concepts covered in this step, please refer to the Implementing Joins with pd.merge() module of the Merging Data from Different Sources in Python course.

    Let's dive into some practical exercises with pandas, focusing on merging techniques. We're using two datasets: MarketingChunk1.csv and MarketingFeedback.csv, which share RecordID and Email columns. This setup is perfect for practicing merges on multiple columns, and using left_on and right_on. Plus, we'll see how to elegantly handle overlapping column names.


    Task 3.1: Load the Datasets

    Get started by loading MarketingChunk1.csv and MarketingFeedback.csv into pandas DataFrames.

    πŸ” Hint

    Use pd.read_csv() for each file. Remember, the file paths are just strings.

    πŸ”‘ Solution
    import pandas as pd
    
    # Loading the datasets
    marketing_chunk1_df = pd.read_csv('MarketingChunk1.csv')
    marketing_feedback_df = pd.read_csv('MarketingFeedback.csv')
    

    Task 3.2: Take a Peek at the Data

    Let's take a quick look at the first few rows of both DataFrames to understand what we're working with.

    πŸ” Hint

    head() is your friend here. It'll give you the top 5 rows by default.

    πŸ”‘ Solution Cell 1 ```python # Checking out MarketingChunk1 marketing_chunk1_df.head() ``` Cell 2 ```python # And now for MarketingFeedback marketing_feedback_df.head() ```

    Task 3.3: Merge on Multiple Columns

    Merge the two files on RecordID and Email, then inspect the results.

    πŸ” Hint

    In pd.merge(), use on=['RecordID', 'Email'] to merge on both columns. Then, call head() on the merged DataFrame to see the top rows.

    πŸ”‘ Solution
    # Merging on RecordID and Email
    merged_df = pd.merge(marketing_chunk1_df, marketing_feedback_df, on=['RecordID', 'Email'])
    
    # Let's see the result
    merged_df.head()
    

    Task 3.4: Merge with Different Key Names

    Now, try merging MarketingChunk1.csv and MarketingFeedback.csv on RecordID using left_on and right_on. This approach is useful when the columns you want to join on have different names in each DataFrame.

    πŸ” Hint

    Use left_on='RecordID' and right_on='RecordID' in your pd.merge() call. This tells pandas to use RecordID from both DataFrames as the merging key.

    πŸ”‘ Solution
    # Merge using left_on and right_on
    merged_df = pd.merge(marketing_chunk1_df, marketing_feedback_df, left_on='RecordID', right_on='RecordID')
    
    # Inspecting the merged DataFrame
    merged_df.head()
    

    Task 3.5: Custom Suffixes for Overlapping Columns

    When you merged the DataFrames in Task 3.4, you might have noticed the automatic suffixes pandas added to distinguish the Email column from each DataFrame. Let's repeat that merge, but this time use custom suffixes and see the result.

    πŸ” Hint

    In the pd.merge() function, add suffixes=('_chunk1', '_feedback'). This will append these suffixes to any overlapping column names.

    πŸ”‘ Solution
    # Merging with custom suffixes
    merged_df = pd.merge(marketing_chunk1_df, marketing_feedback_df, left_on='RecordID', right_on='RecordID', suffixes=('_chunk1', '_feedback'))
    
    # Checking the first few rows to see the suffixes in action
    merged_df.head()
    

What's a lab?

Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.

Provided environment for hands-on practice

We will provide the credentials and environment necessary for you to practice right within your browser.

Guided walkthrough

Follow along with the author’s guided walkthrough and build something new in your provided environment!

Did you know?

On average, you retain 75% more of your learning if you get time for practice.