Normalize Data to Make It Appropriate for an Analysis with Pandas Hands-on Practice

In this lab, you'll master data normalization with Pandas and Sklearn in Python. You'll practice standard scaling, Min-Max scaling, and l1, l2, and max normalizations. By creating datasets, applying various techniques, and visualizing the outcomes, you'll gain a deep understanding of data preprocessing methods and their effects on data distribution.

Path Info

Level: Beginner
Duration: 34m
Published: Dec 12, 2023


Table of Contents

  1. Challenge

    Introduction to Normalization

    Jupyter Guide

    To get started, open the file on the right entitled "Step 1...". You'll complete each task for Step 1 in that Jupyter Notebook file. Remember, you must run the cells (ctrl/cmd(⌘) + Enter) for each task before moving on to the next task in the Jupyter Notebook. Continue until you have completed all tasks in this step. Then, when you are ready to move on to the next step, come back and click on the file for the next step. Repeat until you have completed all tasks in all steps of the lab.


    Introduction to Normalization

    To review the concepts covered in this step, please refer to the Some Simple Normalization Techniques module of the Normalize Data to Make It Appropriate for an Analysis with Pandas course.

    Data normalization is important because it puts the features of a dataset on a comparable scale, so that no feature outweighs the others in a model simply because of its units or magnitude. Here, we will practice normalizing features in a dataset using Pandas and the standard scaling technique.

    This technique is a straightforward way to bring all attributes to the same scale by subtracting the mean and dividing by the standard deviation. You will use sklearn and the pandas library to normalize a randomly sampled dataset. We'll use the StandardScaler transform from sklearn to normalize the data so that each column has a mean of 0 and a standard deviation of 1. Observe the effects by plotting the distribution before and after normalization.
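
    The subtract-the-mean, divide-by-the-standard-deviation step can be sketched by hand and compared with StandardScaler. This is only an illustration, not one of the lab's tasks; the seed and the names rng, manual, and scaled are assumptions made for the example. Note that StandardScaler divides by the population standard deviation (ddof=0), while the pandas std() method defaults to ddof=1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
df = pd.DataFrame(rng.normal(2, 5, size=(100, 5)))

# Standard scaling by hand: subtract each column's mean,
# then divide by its population standard deviation (ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)

# The same transformation with sklearn
scaled = StandardScaler().fit_transform(df)

same = np.allclose(manual.to_numpy(), scaled)
print(same)
```

    If the two results agree, same is True, which suggests the scaler is doing nothing more than this per-column arithmetic.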


    Task 1.1: Importing Necessary Libraries

    Before we can start working with data, we need to import the necessary libraries. Import pandas, numpy, matplotlib, and the StandardScaler from sklearn.preprocessing.

    🔍 Hint

    Use the import keyword to import a library. For example, import pandas as pd imports the pandas library and assigns it to the alias pd.

    🔑 Solution
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    

    Task 1.2: Creating a Random Dataset

    Create a pandas DataFrame with 100 rows and 5 columns. The columns should be filled with random numbers from a normal distribution centered at mu = 2 and with standard deviation sigma = 5. Display the first few rows of the dataframe.

    🔍 Hint

    Use np.random.normal(mu, sigma, size=(rows, cols)) to generate the array. Use pd.DataFrame(array) to convert the array to a DataFrame. Use df.head() to display the first few rows of the data.

    🔑 Solution
    array = np.random.normal(2, 5, size=(100, 5))
    df = pd.DataFrame(array)
    df.head()
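
    Because np.random.normal draws fresh values on every run, the numbers you see will change each time the cell is executed. If repeatable output is useful, one option is a seeded generator, sketched below. The seed value and the feature_0 to feature_4 column names are hypothetical and chosen only for readability; the lab itself uses the default integer column labels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary
mu, sigma = 2, 5

# Hypothetical column names, used only to make the output easier to read
columns = [f"feature_{i}" for i in range(5)]
df = pd.DataFrame(rng.normal(mu, sigma, size=(100, 5)), columns=columns)

print(df.head())
# Sample means should land near mu and sample standard deviations near sigma
print(df.describe().loc[["mean", "std"]])
```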
    

    Task 1.3: Plotting the Distribution Before Normalization

    Plot the distribution of the values in each column of the DataFrame before normalization. Use DataFrame.plot(kind='density') to create the plot.

    🔍 Hint

    Using DataFrame.plot(kind='density') will plot the distribution as a smoothed line plot.

    🔑 Solution
    df.plot(kind='density')
    

    Task 1.4: Normalizing the Data

    Normalize the data in the DataFrame using the StandardScaler. Display the head of the transformed data.

    🔍 Hint

    Use StandardScaler() to create a scaler. Use scaler.fit(df) to fit the scaler to the data. Use scaler.transform(df) to transform the data.

    🔑 Solution
    scaler = StandardScaler()
    scaler.fit(df)
    df_normalized = pd.DataFrame(scaler.transform(df))
    df_normalized.head()
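
    As a quick check on a result like this, the sketch below inspects what the scaler learned and confirms that the transformed columns have a mean close to 0 and a standard deviation close to 1. It uses fit_transform, which combines the fit and transform calls, and passes columns and index so the new DataFrame keeps the original labels. The seed and the printed checks are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)  # arbitrary seed for repeatability
df = pd.DataFrame(rng.normal(2, 5, size=(100, 5)))

scaler = StandardScaler()
# fit_transform learns the per-column mean and scale, then applies them in one call
df_normalized = pd.DataFrame(
    scaler.fit_transform(df), columns=df.columns, index=df.index
)

print(scaler.mean_)   # per-column means learned from df
print(scaler.scale_)  # per-column standard deviations learned from df
print(df_normalized.mean().round(6))       # each close to 0
print(df_normalized.std(ddof=0).round(6))  # each close to 1
```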
    

    Task 1.5: Plotting the Distribution After Normalization

    Plot the normalized data and the original data on the same axes, then compare the two distributions.

    🔍 Hint

    Use DataFrame.plot(kind="density") to plot the distributions. To overlay the original data, call plot on both the original and the normalized DataFrames with the ax keyword argument set to the same Axes object. For example:

    fig, ax = plt.subplots()
    df.plot(ax=ax, kind="density")
    
    🔑 Solution
    fig, ax = plt.subplots()
    df.plot(ax=ax, kind='density')
    df_normalized.plot(ax=ax, kind='density')
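
    Both DataFrames use the default column labels 0 to 4, so the overlaid plot carries two curves for each label. One way to tell them apart, sketched here as an optional variation, is to rename the columns with add_prefix before plotting. The prefixes, title, and seed are hypothetical choices; kind="density" relies on SciPy being installed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)  # arbitrary seed for repeatability
df = pd.DataFrame(rng.normal(2, 5, size=(100, 5)))
df_normalized = pd.DataFrame(StandardScaler().fit_transform(df))

fig, ax = plt.subplots()
# add_prefix renames the columns so each set of curves gets its own labels
df.add_prefix("original_").plot(ax=ax, kind="density")
df_normalized.add_prefix("scaled_").plot(ax=ax, kind="density")
ax.set_xlabel("value")
ax.set_title("Before and after standard scaling")
```

    The original curves should appear wide and centered near 2, while the scaled curves should be narrower and centered near 0.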
    
  2. Challenge

    Applying Simple Scaling and Min-Max Scaling Techniques

    Applying Min-max Scaling Technique

    To review the concepts covered in this step, please refer to the Some Simple Normalization Techniques module of the Normalize Data to Make It Appropriate for an Analysis with Pandas course.

    Understanding different normalization techniques is crucial because it allows you to choose the correct method based on the characteristics of your data. In this step, you will apply min-max scaling.

    Let's take your data normalization skills to the next level by practicing another scaling technique: Min-Max scaling. Min-Max scaling rescales each feature so that its values fall in the range [0, 1]. Use sklearn to try out this technique on a randomly sampled DataFrame, and observe its effect on the data distribution by plotting the sample dataset before and after normalization.
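
    The Min-Max transformation can be written by hand as (x - min) / (max - min), applied column by column. The sketch below compares that formula with MinMaxScaler on a small random dataset; the seed and the names manual and scaled are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)  # arbitrary seed for repeatability
df = pd.DataFrame(rng.uniform(8, 10, size=(100, 5)))

# Min-Max scaling by hand: shift each column by its minimum,
# then divide by the column's range
manual = (df - df.min()) / (df.max() - df.min())

# The same transformation with sklearn
scaled = MinMaxScaler().fit_transform(df)

same = np.allclose(manual.to_numpy(), scaled)
print(same)
print(scaled.min(axis=0), scaled.max(axis=0))  # per-column minimum and maximum
```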


    Task 2.1: Import Required Libraries

    Before we can start working with data, we need to import the necessary libraries. Import pandas, numpy, matplotlib, and the MinMaxScaler from sklearn.preprocessing.

    🔍 Hint

    Use the import keyword to import a library. For example, import pandas as pd imports the pandas library and assigns it to the alias pd.

    🔑 Solution
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import MinMaxScaler
    

    Task 2.2: Creating a Random Dataset

    Create a pandas DataFrame with 100 rows and 5 columns. The columns should be filled with random numbers from a uniform distribution in the range [8, 10]. Display the first few rows of the DataFrame.

    🔍 Hint

    Use np.random.uniform(low, high, size=(rows, cols)) to generate the array. Use pd.DataFrame(array) to convert the array to a DataFrame. Use df.head() to display the first few rows of the data.

    🔑 Solution
    array = np.random.uniform(8, 10, size=(100, 5))
    df = pd.DataFrame(array)
    df.head()
    

    Task 2.3: Plotting the Distribution Before Normalization

    Plot the distribution of the values in each column of the DataFrame before normalization. Use DataFrame.plot(kind='density') to create the plot.

    🔍 Hint

    Using DataFrame.plot(kind='density') will plot the distribution as a smoothed line plot.

    🔑 Solution
    df.plot(kind='density')
    

    Task 2.4: Apply Min-Max Scaling

    Normalize the data in the DataFrame using the MinMaxScaler. Display the head of the transformed data.

    🔍 Hint

    Use MinMaxScaler() to create a scaler. Use scaler.fit(df) to fit the scaler to the data. Use scaler.transform(df) to transform the data.

    🔑 Solution
    scaler = MinMaxScaler()
    scaler.fit(df)
    df_normalized = pd.DataFrame(scaler.transform(df))
    df_normalized.head()
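
    MinMaxScaler also accepts a feature_range argument for target ranges other than [0, 1], and its inverse_transform method maps scaled values back to the original units. The sketch below goes beyond what the task asks for; the (-1, 1) range and the seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)  # arbitrary seed for repeatability
df = pd.DataFrame(rng.uniform(8, 10, size=(100, 5)))

# Rescale each column to [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(df)

# inverse_transform undoes the scaling using the fitted minimum and range
restored = scaler.inverse_transform(scaled)

print(scaled.min(axis=0), scaled.max(axis=0))
print(np.allclose(restored, df.to_numpy()))
```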
    

    Task 2.5: Plotting the Distribution After Normalization

    Plot the normalized data and the original data on the same axes, then compare the two distributions.

    🔍 Hint

    Use DataFrame.plot(kind="density") to plot the distributions. To overlay the original data, call plot on both the original and the normalized DataFrames with the ax keyword argument set to the same Axes object. For example:

    fig, ax = plt.subplots()
    df.plot(ax=ax, kind="density")
    
    🔑 Solution
    fig, ax = plt.subplots()
    df.plot(ax=ax, kind='density')
    df_normalized.plot(ax=ax, kind='density')
    
  3. Challenge

    Experiment with Gaussian Normalization

    Experiment with Different Normalizations

    To review the concepts covered in this step, please refer to the Different Types of Normalization module of the Normalize Data to Make It Appropriate for an Analysis with Pandas course.

    Exploring different normalization techniques is essential to understand how each method affects your data. In this step, you will experiment with l1, l2, and max normalizations. Unlike the scalers in the previous steps, which rescale each column, these normalizations rescale each row so that its l1, l2, or max norm equals 1. We'll use Sklearn to apply these normalization techniques to a randomly generated dataset and observe the differences in data distribution before and after normalization.
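
    To make the three norms concrete, the sketch below computes them by hand, row by row, and compares each result with sklearn's Normalizer. The l1 norm of a row is the sum of its absolute values, the l2 norm is the square root of the sum of squares, and the max norm is the largest absolute value. The seed and variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(5)  # arbitrary seed for repeatability
df = pd.DataFrame(rng.uniform(5, 15, size=(100, 5)))
values = df.to_numpy()

# Divide every row by its own norm (axis=1 works across each row)
l1 = values / np.abs(values).sum(axis=1, keepdims=True)
l2 = values / np.sqrt((values ** 2).sum(axis=1, keepdims=True))
mx = values / np.abs(values).max(axis=1, keepdims=True)

# Compare each hand-computed result with the Normalizer output
checks = {
    "l1": np.allclose(l1, Normalizer(norm="l1").fit_transform(df)),
    "l2": np.allclose(l2, Normalizer(norm="l2").fit_transform(df)),
    "max": np.allclose(mx, Normalizer(norm="max").fit_transform(df)),
}
print(checks)
```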


    Task 3.1: Import Required Libraries

    Similar to the previous tasks, import the necessary libraries. This time, include the Normalizer from sklearn.preprocessing.

    🔍 Hint

    Remember to import pandas, numpy, matplotlib, and now the Normalizer.

    🔑 Solution
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import Normalizer
    

    Task 3.2: Creating a Random Dataset

    Create another DataFrame with 100 rows and 5 columns, filled with random numbers from a uniform distribution in the range [5, 15]. Display the first few rows of this new dataframe.

    🔍 Hint

    You can generate the data using np.random.uniform() and convert it into a DataFrame using pd.DataFrame().

    🔑 Solution
    array = np.random.uniform(5, 15, size=(100, 5))
    df = pd.DataFrame(array)
    df.head()
    

    Task 3.3: Plotting the Distribution Before Normalization

    Plot the distribution of the values in the DataFrame before applying any normalization.

    🔍 Hint

    Use DataFrame.plot(kind='density') for a density plot of the DataFrame.

    🔑 Solution
    df.plot(kind='density')
    

    Task 3.4: Apply l1 and l2 Normalization

    Normalize the data using l1 and l2 normalizations. For each, display the head of the transformed data.

    🔍 Hint

    Create two normalizers, one with the norm='l1' parameter and the other with norm='l2'. Use fit_transform to apply the normalization.

    🔑 Solution
    normalizer_l1 = Normalizer(norm='l1')
    df_normalized_l1 = pd.DataFrame(normalizer_l1.fit_transform(df))
    normalizer_l2 = Normalizer(norm='l2')
    df_normalized_l2 = pd.DataFrame(normalizer_l2.fit_transform(df))
    print(df_normalized_l1.head())
    print(df_normalized_l2.head())
    

    Task 3.5: Apply Max Normalization

    Now, apply max normalization to the data and display the first few rows of the transformed data.

    🔍 Hint

    Max normalization uses the norm='max' parameter in the Normalizer.

    🔑 Solution
    normalizer_max = Normalizer(norm='max')
    df_normalized_max = pd.DataFrame(normalizer_max.fit_transform(df))
    df_normalized_max.head()
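
    Normalizer differs from StandardScaler and MinMaxScaler in that it is stateless: each row is divided by its own norm, so fitting learns nothing from the data. The sketch below illustrates this for the max norm by transforming the first row on its own and comparing it with the same row from the full result. The seed and the names full, row_max, and first_row_alone are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(6)  # arbitrary seed for repeatability
df = pd.DataFrame(rng.uniform(5, 15, size=(100, 5)))

normalizer_max = Normalizer(norm="max")
full = normalizer_max.fit_transform(df)

# After max normalization, the largest absolute value in every row is 1
row_max = np.abs(full).max(axis=1)

# A row is scaled by its own norm only, so transforming the first row
# alone gives the same values as that row in the full result
first_row_alone = normalizer_max.transform(df.iloc[[0]])
stateless = np.allclose(first_row_alone[0], full[0])

print(row_max[:5])
print(stateless)
```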
    

    Task 3.6: Plotting the Distributions After Normalization

    Plot the distributions of the normalized data (l1, l2, and max) to compare how each normalization technique has transformed the data.

    🔍 Hint

    Overlay the plots of the normalized DataFrames. Use the ax keyword argument of the DataFrame.plot method to plot all of the normalized distributions on the same axes. For example:

    fig, ax = plt.subplots()
    df.plot(ax=ax, kind='density')
    
    🔑 Solution
    fig, ax = plt.subplots()
    df_normalized_l1.plot(ax=ax, kind='density')
    df_normalized_l2.plot(ax=ax, kind='density')
    df_normalized_max.plot(ax=ax, kind='density')
    
