Introduction

In this guide, you are going to learn about the fundamentals of plotting univariate and bivariate distribution data using the Seaborn library.

By the end of this guide you will be able to implement the following concepts:

- Visualizing univariate distribution data
- Visualizing bivariate distribution data
- Visualizing pairwise relationships in a dataset

In this guide we are going to use the following libraries:

**Syntax**

```python
# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

A univariate distribution, as its name suggests, is built upon a single quantitative variable. It can be visualized using a histogram, a kernel density estimation, a rug plot, or a combination of all three.

Let us generate some random data and learn to plot a univariate distribution on it using all of the above-mentioned approaches.

**Generating Univariate Data**

```python
# Setting a random seed to receive same data every time
np.random.seed(42)

# Generating the random data
uni = np.random.rand(2000)
```

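A histogram is built in the Seaborn library using the `distplot` method, which is also responsible for building the other univariate plots. Note that `distplot` is deprecated as of Seaborn 0.11 in favor of `histplot` and `displot`; on versions that still ship it, the calls in this guide run with a deprecation warning.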

```python
# Setting the figure size
plt.figure(figsize=(8, 5))

# Building the histogram
sns.distplot(uni,
             bins=100,    # Number of bins
             kde=False,   # Whether to fit KDE plot or not <=> alternate of kdeplot
             rug=False,   # Vertical lines on each observation <=> alternate of rugplot
             color='g')

# Labelling the title
plt.title('Histogram', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
```
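To see what the bars represent, the binning can be reproduced with NumPy alone. The following is a minimal sketch, assuming the same seeded `uni` array as above; `np.histogram` is used here only for illustration, and the names `counts` and `edges` are not part of the original guide.

```python
import numpy as np

# Same data as in the guide: 2000 draws from a uniform distribution on [0, 1)
np.random.seed(42)
uni = np.random.rand(2000)

# Count the observations that fall into each of 100 equal-width bins
counts, edges = np.histogram(uni, bins=100)

print(len(counts))    # number of bins
print(len(edges))     # number of bin edges (one more than the number of bins)
print(counts.sum())   # every observation lands in exactly one bin
```

Each entry of `counts` corresponds roughly to the height of one bar in the histogram when the KDE is disabled.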

Notice the shape of the histogram peaks in the figure above. A Kernel Density Estimation (KDE) plot fits a smooth curve to the data that approximates its underlying density. The `distplot` method draws the KDE curve over the histogram by default, so you can either pass `kde=True` or simply omit `kde=False`. To show only the curve, pass `hist=False` to disable the histogram.

```python
# Setting the figure size
plt.figure(figsize=(8, 5))

# Building the KDE plot
sns.distplot(uni,
             bins=100,
             hist=False,  # Disables the histogram
             rug=False,
             color='g')

# Labelling the title
plt.title('KDE plot', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
```
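For intuition about what the KDE curve is, here is a minimal NumPy sketch of a Gaussian kernel density estimate. It is not Seaborn's implementation: the helper `gaussian_kde_1d`, the evaluation grid, and the Scott's-rule bandwidth are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(42)
uni = np.random.rand(2000)

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    # Standardized distances between grid points and samples,
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Scott's rule of thumb for the bandwidth (an assumption for this sketch)
bandwidth = uni.std(ddof=1) * len(uni) ** (-1 / 5)

# Evaluate the estimate on a grid that extends past the data range
grid = np.linspace(-0.5, 1.5, 400)
density = gaussian_kde_1d(uni, grid, bandwidth)

# A density estimate should integrate to about 1
area = np.sum(density) * (grid[1] - grid[0])
print(f"Area under the estimated density: {area:.3f}")
```

A smaller bandwidth makes the curve follow individual observations more closely, while a larger one smooths it out.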

You can also build the above figure using the `kdeplot` method as shown:

```python
# Using the kdeplot method
sns.kdeplot(uni, color='g')
```

The rug plot marks each observation with a short vertical line on an axis (usually the bottom x-axis), so dense regions show up as clusters of ticks. In the code below, we take only the first 50 observations so that the individual ticks remain distinguishable.

```python
# Setting the figure size
plt.figure(figsize=(12, 3))

# Building the rug plot
sns.distplot(uni[:50],    # Initial 50 observations
             bins=100,
             hist=False,  # Disables the histogram
             kde=False,   # Disables the KDE plot
             rug=True,
             color='g')

# Labelling the title
plt.title('Rug plot', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
```

Alternatively, you can use the `rugplot` method to achieve the above figure:

```python
# Using the rugplot method
sns.rugplot(uni[:50], color='g')
```

We can also merge all three plots into one. Keep in mind that computing the KDE takes noticeably longer on large datasets, so consider leaving it out in that case.

```python
# Setting the figure size
plt.figure(figsize=(12, 5))

# Combining all plots
sns.distplot(uni,
             bins=100,
             hist=True,
             kde=True,
             rug=True,
             color='g')

# Labelling the title
plt.title('Combining all univariate plots', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
```

To visualize the joint distribution of two variables, the following plots are commonly used:

- Scatter plot
- Hexbin plot
- KDE plot

We have already learned about the KDE plot functionality in the univariate section; this time we will learn how to use it for bivariate data.

Let us generate some random bivariate data to be used while discussing these three plots:

```python
# Setting a random seed to receive same data every time
np.random.seed(42)
var1 = np.random.randn(2000)
np.random.seed(100)
var2 = np.random.randn(2000)

# Generating the random data
bi = pd.DataFrame({'Var1': var1,
                   'Var2': var2})
```

A scatter plot helps to determine the relationship between two variables, for example whether they are correlated. With Seaborn, we can draw a scatter plot together with the distribution of each variable using the `jointplot` method, as shown:

```python
# Scatter plot with bivariate distribution
sns.jointplot(x='Var1', y='Var2', data=bi, height=7, color='g')

# Displaying the figure
plt.show()
```
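A visual impression of correlation can be cross-checked numerically. The sketch below, which assumes the same seeded `bi` DataFrame, computes the Pearson correlation coefficient with pandas; the variable name `r` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Recreate the bivariate dataset from the guide
np.random.seed(42)
var1 = np.random.randn(2000)
np.random.seed(100)
var2 = np.random.randn(2000)
bi = pd.DataFrame({'Var1': var1, 'Var2': var2})

# Pearson correlation coefficient between the two columns
r = bi['Var1'].corr(bi['Var2'])
print(f"Pearson r = {r:.3f}")
```

Because the two columns are generated independently, the coefficient should be close to zero, which matches a scatter plot with no visible trend.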

As you can observe from the above figure, the top axis has the histogram of `Var1` and the right axis has the histogram of `Var2`.

The hexbin plot is similar to a histogram: it shows the number of observations that fall into each hexagonal bin over a 2D space. The darker a bin, the more observations it holds, and vice versa.

The hexbin plot is implemented in Seaborn using the `jointplot` method with the `kind='hex'` argument. Let us implement the hexbin plot on the `bi` dataset:

```python
# Hexbin plot with bivariate distribution
sns.jointplot(x='Var1', y='Var2', data=bi, kind='hex', height=7, color='g')

# Displaying the figure
plt.show()
```

The figure is darkest in the center, which suggests a high number of observations in the vicinity of the coordinate (0, 0).
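The same observation can be checked with a plain 2D count. The following sketch uses `np.histogram2d`, so square bins stand in for the hexagons; the 10 x 10 grid and the [-4, 4] range are assumptions made for this example.

```python
import numpy as np

# Recreate the two variables from the guide
np.random.seed(42)
var1 = np.random.randn(2000)
np.random.seed(100)
var2 = np.random.randn(2000)

# Count observations in a 10 x 10 grid of square bins covering [-4, 4] on both axes
counts, x_edges, y_edges = np.histogram2d(
    var1, var2, bins=10, range=[[-4, 4], [-4, 4]]
)

# Locate the fullest bin and compute its centre
i, j = np.unravel_index(np.argmax(counts), counts.shape)
x_center = (x_edges[i] + x_edges[i + 1]) / 2
y_center = (y_edges[j] + y_edges[j + 1]) / 2
print(f"Densest bin centre: ({x_center:.1f}, {y_center:.1f}), count = {int(counts[i, j])}")
```

The fullest bin should sit next to the origin, in line with the dark center of the hexbin figure.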

The KDE plot for bivariate data can be obtained using the `jointplot` method with the `kind='kde'` argument. This draws the estimated density curve of each variable on the marginal axes and density contours on the main axes.

Let us visualize it with our dataset:

```python
# KDE plot with bivariate distribution
sns.jointplot(x='Var1', y='Var2', data=bi, kind='kde', height=7, color='g')

# Displaying the figure
plt.show()
```

The contour plot suggests that the data is densest in the center and that the density decreases as we move further out. The marginal KDE curves confirm this.
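To put numbers on the contour levels, a bivariate density estimate can be evaluated at individual points. The sketch below is a simplified product-kernel estimator written in NumPy, not the routine Seaborn uses; the helper `gaussian_kde_2d`, the per-axis Scott's-rule bandwidths, and the two probe points are assumptions for illustration.

```python
import numpy as np

# Recreate the two variables from the guide
np.random.seed(42)
var1 = np.random.randn(2000)
np.random.seed(100)
var2 = np.random.randn(2000)

def gaussian_kde_2d(x, y, px, py, hx, hy):
    """Estimate the joint density at (px, py) with a product of Gaussian kernels."""
    zx = (px - x) / hx
    zy = (py - y) / hy
    kernels = np.exp(-0.5 * (zx**2 + zy**2)) / (2 * np.pi * hx * hy)
    return kernels.mean()

# Scott's rule in d dimensions scales the spread by n ** (-1 / (d + 4)); here d = 2
n = len(var1)
hx = var1.std(ddof=1) * n ** (-1 / 6)
hy = var2.std(ddof=1) * n ** (-1 / 6)

# Compare the estimated density at the center with a point far from it
center = gaussian_kde_2d(var1, var2, 0.0, 0.0, hx, hy)
edge = gaussian_kde_2d(var1, var2, 3.0, 3.0, hx, hy)
print(f"Density at (0, 0): {center:.4f}")
print(f"Density at (3, 3): {edge:.6f}")
```

The estimate at the origin should be far larger than the one at (3, 3), which is what the tightly nested contours around the center express.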

Most of the time, a dataset under study has more than two variables. We cannot use any of the above methods to visualize the relationship among all the variables at the same time. Therefore, Seaborn provides us a different approach to tackle such cases.

Seaborn has the `pairplot` method through which we can create a matrix of plots (or subplots) using all or specific variables from the dataset.

Let us generate a dataset with four variables to implement the `pairplot` method:

```python
# Setting a random seed to receive same data every time
np.random.seed(42)
var1 = np.random.rand(2000)
np.random.seed(100)
var2 = np.random.randn(2000)
np.random.seed(67)
var3 = np.random.randn(2000)
np.random.seed(150)
var4 = np.random.rand(2000)

# Generating the random data
multi = pd.DataFrame({'Var1': var1,
                      'Var2': var2,
                      'Var3': var3,
                      'Var4': var4})
```

Implementing the `pairplot` method on the dataset:

```python
# Pair plot with distributions
with sns.color_palette("summer"):
    sns.pairplot(multi)

# Displaying the figure
plt.show()
```

We can also select specific variables while building the pair plot. In addition, instead of showing histograms on the main diagonal, we can replace them with KDE plots:

```python
# Pair plot with distributions of only first two variables along with KDE
with sns.color_palette("summer"):
    sns.pairplot(multi.iloc[:, :2], diag_kind='kde', height=4)

# Displaying the figure
plt.show()
```

In this guide, you have learned the fundamentals required to visualize the distribution of univariate, bivariate, and multivariate data.

