Author avatar

Vivek Kumar

Visualizing the Distribution of a Dataset

Vivek Kumar

  • Apr 19, 2019
  • 9 Min read
  • 1,346 Views
  • Apr 19, 2019
  • 9 Min read
  • 1,346 Views
Data
Seaborn

Introduction

In this guide, you are going to learn about the fundamentals of plotting univariate and bivariate distribution data using the Seaborn library.

By the end of this guide you will be able to implement the following concepts:

  1. Visualizing a univariate distribution data
  2. Visualizing a bivariate distribution data
  3. Visualizing pairwise relationships in a dataset

The Baseline

In this guide we are going to use the following libraries:

Syntax

1
2
3
4
5
# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
python

Visualizing a Univariate Distribution Data

A univariate distribution, as its name suggests, is build upon a quantitative variable. It can be visualized either using a histogram, a kernel density estimation, a rug plot, or combining all of them.

Let us generate some random data and learn to plot a univariate distribution on it using all of the above-mentioned approaches.

Generating Univariate Data

1
2
3
4
5
# Setting a random seed to receive same data every time
np.random.seed(42)

# Generating the random data
uni = np.random.rand(2000)
python

1. Histogram

A histogram is built in the Seaborn library using the distplot method. The distplot method is also responsible for the building of other univariate plots.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Setting the figure size
plt.figure(figsize=(8, 5))

# Building the histogram
sns.distplot(uni,
             bins=100,		# Number of bins
             kde=False,		# Whether to fit KDE plot or not <=> alternate of kdeplot
             rug=False,		# Vertical lines on each observation <=> alternate of rugplot
             color='g')

# Labelling the title
plt.title('Histogram', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
python

hist

2. Kernel Density Estimation Plot

In the above figure notice the shape of histogram peaks, using the Kernel Density Estimation (KDE) plot you can fit the best line for the data. KDE plot is built on the histogram bin peaks by default using the distplot method. Therefore, you can either use kde=True or remove kde=False from the method. Also, make sure to pass hist=False to disable the histogram from the plot.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Setting the figure size
plt.figure(figsize=(8, 5))

# Building the KDE plot
sns.distplot(uni,
             bins=100,
             hist=False,    # Disables the histogram
             rug=False,
             color='g')

# Labelling the title
plt.title('KDE plot', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
python

KDE

You can also build the above figure using the kdeplot method as shown:

1
2
# Using the kdeplot method
sns.kdeplot(uni, color='g')
python

3. Rug Plot

The rug plot provides the density of observations using vertical lines on an axis (usually the bottom X-axis). In the below code, we take only the initial 50 observations for better understanding.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Setting the figure size
plt.figure(figsize=(12, 3))

# Building the rug plot
sns.distplot(uni[:50],    # Initial 50 observations
             bins=100,
             hist=False,    # Disables the histogram
             kde=False,     # Disables the KDE plot
             rug=True,
             color='g')

# Labelling the title
plt.title('Rug plot', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
python

rug

Alternatively, you can also use the rugplot method to achieve the above figure:

1
2
# Using the rugplot method
sns.rugplot(uni[:50], color='g')
python

4. Combining All Univariate Plots

We can also merge all three plots into one plot. However, KDE plotting consumes more time for a larger dataset so it should be avoided.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Setting the figure size
plt.figure(figsize=(12, 5))

# Combining all plots
sns.distplot(uni,
             bins=100,
             hist=True,    
             kde=True,    
             rug=True,
             color='g')

# Labelling the title
plt.title('Combining all univariate plots', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
python

All

Visualizing a Bivariate Distribution Data

When you have to find the distribution between the two variables, the following plots are used:

  1. Scatter plot
  2. Hexbin plot
  3. KDE plot

We have already learned about the KDE plot functionality in the univariate section; this time we will learn how to use it for bivariate data.

Let us generate some random bivariate data to be used while discussing these three plots:

1
2
3
4
5
6
7
8
9
# Setting a random seed to receive same data every time
np.random.seed(42)
var1 = np.random.randn(2000)

np.random.seed(100)
var2 = np.random.randn(2000)

# Generating the random data
bi = pd.DataFrame({'Var1': var1, 'Var2': var2})
python

1. Scatter Plot

A scatter plot is helpful to determine the relationship between two variables. Using a scatter plot, we can determine if there is any correlation between the two or not. Through Seaborn, we can visualize a scatter plot as well as visualize the distribution of each variable. This can be achieved using the jointplot method as shown:

1
2
3
4
5
# Scatter plot with bivariate distribution
sns.jointplot(x='Var1', y='Var2', data=bi, height=7, color='g')

# Displaying the figure
plt.show()
python

Scatter

As you can observe from the above figure, the top axis has the histogram of Var1 and the right axis has the histogram of Var2.

2. Hexbin Plot

The hexbin plot is similar to that of a histogram plot because it presents the count of each observation falling in the hex bins over a 2D space. The denser a bin, the more the number of observations it holds and vice-versa.

Hexbin plot is implemented in Seaborn using the jointplot method with kind='hex' argument. Let us implement the hex bin on the bi dataset:

1
2
3
4
5
# Hexbin plot with bivariate distribution
sns.jointplot(x='Var1', y='Var2', data=bi, kind='hex', height=7, color='g')

# Displaying the figure
plt.show()
python

hex

The figure is densest in the center suggesting there is a high number of observations in the vicinity of the coordinate (0, 0).

3. KDE Plot

The KDE plot for a bivariate data can be obtained using the jointplot method with kind='kde' argument. This will provide the best fit lines over the axes and the contour plots inside the axes.

Let us visualize it with our dataset:

1
2
3
4
5
# KDE plot with bivariate distribution
sns.jointplot(x='Var1', y='Var2', data=bi, kind='kde', height=7, color='g')

# Displaying the figure
plt.show()
python

KDE

After observing the contour plot we can suggest that there is high-density data in the center and the data density decreases as we go further out. The same is verified from the KDE plot.

Visualizing Pairwise Relationships in a Dataset

Most of the time, a dataset under study has more than two variables. We cannot use any of the above methods to visualize the relationship among all the variables at the same time. Therefore, Seaborn provides us a different approach to tackle such cases.

Seaborn has the pairplot method through which we can create a matrix of plots (or subplots) using all or specific variables from the dataset.

Let us generate a dataset with four variables to implement the pairplot method:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Setting a random seed to receive same data everytime
np.random.seed(42)
var1 = np.random.rand(2000)

np.random.seed(100)
var2 = np.random.randn(2000)

np.random.seed(67)
var3 = np.random.randn(2000)

np.random.seed(150)
var4 = np.random.rand(2000)

# Generating the random data
multi = pd.DataFrame({'Var1': var1, 'Var2': var2,
                  'Var3': var3, 'Var4': var4})
python

Implementing the pairplot method on the dataset:

1
2
3
4
5
6
# Pair plot with distributions
with sns.color_palette("summer"):
    sns.pairplot(multi)

# Displaying the figure
plt.show()
python

Pairplot

We also have control over selecting specific variables while building the pair plot. Also, instead of having the histogram on the parent diagonal, we can replace it with the KDE plot:

1
2
3
4
5
6
# Pair plot with distributions of only first two variables along with KDE
with sns.color_palette("summer"):
    sns.pairplot(multi.iloc[:, :2], diag_kind='kde', height=4)

# Displaying the figure
plt.show()
python

PairKDE

Conclusion

In this guide, you have learned about the fundamentals required to visualize the distribution of a univariate, bivariate, and multivariate data.

To learn more about Seaborn, you can refer the following guides:

1