 Vivek Kumar

# Visualizing the Distribution of a Dataset

• Apr 19, 2019
• 1,820 Views
Data
Seaborn

## Introduction

In this guide, you are going to learn about the fundamentals of plotting univariate and bivariate distribution data using the Seaborn library.

By the end of this guide you will be able to implement the following concepts:

1. Visualizing a univariate distribution
2. Visualizing a bivariate distribution
3. Visualizing pairwise relationships in a dataset

## The Baseline

In this guide we are going to use the following libraries:

Syntax

```python
# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

## Visualizing a Univariate Distribution

A univariate distribution, as its name suggests, is built upon a single quantitative variable. It can be visualized using a histogram, a kernel density estimation (KDE) plot, a rug plot, or a combination of all three.

Let us generate some random data and learn to plot a univariate distribution on it using all of the above-mentioned approaches.

Generating Univariate Data

```python
# Setting a random seed to receive same data every time
np.random.seed(42)

# Generating the random data
uni = np.random.rand(2000)
```

### 1. Histogram

In Seaborn, a histogram is built using the `distplot` function. The same function can also draw the other univariate plots (KDE and rug) through its `kde` and `rug` arguments.

```python
# Setting the figure size
plt.figure(figsize=(8, 5))

# Building the histogram
sns.distplot(uni,
             bins=100,    # Number of bins
             kde=False,   # Whether to fit KDE plot or not <=> alternate of kdeplot
             rug=False,   # Vertical lines on each observation <=> alternate of rugplot
             color='g')

# Labelling the title
plt.title('Histogram', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
```
### 2. Kernel Density Estimation Plot

In the above figure, notice the overall shape traced by the histogram bars. A Kernel Density Estimation (KDE) plot draws a smooth curve that estimates this underlying density directly from the observations. The `distplot` function draws the KDE curve by default, so you can either pass `kde=True` or simply omit `kde=False`. Also, make sure to pass `hist=False` to remove the histogram from the plot.

```python
# Setting the figure size
plt.figure(figsize=(8, 5))

# Building the KDE plot
sns.distplot(uni,
             bins=100,
             hist=False,    # Disables the histogram
             rug=False,
             color='g')

# Labelling the title
plt.title('KDE plot', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
```
You can also build the above figure using the `kdeplot` function, as shown:

```python
# Using the kdeplot method
sns.kdeplot(uni, color='g')
```
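To see what a KDE curve represents, it can help to compute one by hand. The sketch below places a Gaussian kernel on every observation and averages them. The helper name `gaussian_kde_1d`, the evaluation grid, and the use of Scott's rule of thumb for the bandwidth are assumptions made for this illustration, so the curve may not match Seaborn's output exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE of `samples` at each point of `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every observation
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at those distances
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average over observations and rescale by the bandwidth
    return kernels.mean(axis=1) / bandwidth

# Same data as above, regenerated so this block is self-contained
np.random.seed(42)
uni = np.random.rand(2000)

# Scott's rule of thumb for the bandwidth (an assumption for this sketch)
bandwidth = uni.std(ddof=1) * len(uni) ** (-1 / 5)

# Evaluate the density on a grid that extends past the data range
grid = np.linspace(-0.5, 1.5, 401)
density = gaussian_kde_1d(uni, grid, bandwidth)

# A density curve should enclose an area of roughly 1
area = density.sum() * (grid[1] - grid[0])
```

Plotting `density` against `grid` with `plt.plot(grid, density)` gives a curve of the same kind as the one `kdeplot` draws.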

### 3. Rug Plot

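The rug plot marks each observation with a short vertical line on an axis (usually the bottom X-axis), so regions where the lines crowd together indicate a higher density of observations. In the code below, we take only the first 50 observations so that the individual lines remain distinguishable.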

```python
# Setting the figure size
plt.figure(figsize=(12, 3))

# Building the rug plot
sns.distplot(uni[:50],    # Initial 50 observations
             bins=100,
             hist=False,    # Disables the histogram
             kde=False,     # Disables the KDE plot
             rug=True,
             color='g')

# Labelling the title
plt.title('Rug plot', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
```
Alternatively, you can use the `rugplot` function to produce the above figure:

```python
# Using the rugplot method
sns.rugplot(uni[:50], color='g')
```

### 4. Combining All Univariate Plots

We can also merge all three plots into one figure. However, computing the KDE takes noticeably longer on larger datasets, so consider leaving it out (`kde=False`) when the data is large.

```python
# Setting the figure size
plt.figure(figsize=(12, 5))

# Combining all plots
sns.distplot(uni,
             bins=100,
             hist=True,
             kde=True,
             rug=True,
             color='g')

# Labelling the title
plt.title('Combining all univariate plots', weight='bold', fontsize=16)

# Displaying the figure
plt.show()
```
## Visualizing a Bivariate Distribution

To examine the joint distribution of two variables, the following plots are commonly used:

1. Scatter plot
2. Hexbin plot
3. KDE plot

We have already learned about the KDE plot functionality in the univariate section; this time we will learn how to use it for bivariate data.

Let us generate some random bivariate data to be used while discussing these three plots:

```python
# Setting a random seed to receive same data every time
np.random.seed(42)
var1 = np.random.randn(2000)

np.random.seed(100)
var2 = np.random.randn(2000)

# Generating the random data
bi = pd.DataFrame({'Var1': var1, 'Var2': var2})
```

### 1. Scatter Plot

A scatter plot helps determine the relationship between two variables, including whether they are correlated. With Seaborn, we can draw a scatter plot together with the distribution of each variable. This can be achieved using the `jointplot` function, as shown:

```python
# Scatter plot with bivariate distribution
sns.jointplot(x='Var1', y='Var2', data=bi, height=7, color='g')

# Displaying the figure
plt.show()
```
As you can observe from the above figure, the top axis shows the histogram of `Var1` and the right axis shows the histogram of `Var2`.
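To back the visual impression from the scatter plot with a number, you can compute the Pearson correlation coefficient of the two columns. The following is a minimal sketch using pandas. Since `Var1` and `Var2` are generated independently, the value should come out close to zero.

```python
import numpy as np
import pandas as pd

# Same bivariate data as above, regenerated so this block is self-contained
np.random.seed(42)
var1 = np.random.randn(2000)
np.random.seed(100)
var2 = np.random.randn(2000)
bi = pd.DataFrame({'Var1': var1, 'Var2': var2})

# Pearson correlation coefficient between the two columns
r = bi['Var1'].corr(bi['Var2'])
print(f"Pearson r = {r:.3f}")
```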

### 2. Hexbin Plot

The hexbin plot is the two-dimensional analogue of a histogram: it divides the 2D space into hexagonal bins and shows the number of observations falling in each one. The darker a bin, the more observations it holds, and vice versa.

A hexbin plot is implemented in Seaborn using the `jointplot` function with the `kind='hex'` argument. Let us build a hexbin plot on the `bi` dataset:

```python
# Hexbin plot with bivariate distribution
sns.jointplot(x='Var1', y='Var2', data=bi, kind='hex', height=7, color='g')

# Displaying the figure
plt.show()
```
The figure is densest in the center, suggesting a high number of observations in the vicinity of the coordinate (0, 0).
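The per-bin counts behind a hexbin plot can also be inspected directly. The sketch below uses Matplotlib's `Axes.hexbin` rather than Seaborn's `jointplot`. The `gridsize=20` value and the `'Greens'` colormap are arbitrary choices for this illustration, so the bins will not necessarily line up with those in the figure above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same bivariate data as above, regenerated so this block is self-contained
np.random.seed(42)
var1 = np.random.randn(2000)
np.random.seed(100)
var2 = np.random.randn(2000)

fig, ax = plt.subplots(figsize=(7, 7))

# gridsize=20 is an arbitrary choice for this sketch
hb = ax.hexbin(var1, var2, gridsize=20, cmap='Greens')

counts = hb.get_array()       # Number of observations in each hexagon
centers = hb.get_offsets()    # (x, y) center of each hexagon

# Locate the hexagon that holds the most observations
densest = centers[counts.argmax()]
print("Observations binned:", int(counts.sum()))
print("Center of the densest bin:", densest)
```

Every observation falls into exactly one hexagon, so the counts add up to the size of the dataset, and the densest hexagon should sit near the origin.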

### 3. KDE Plot

The KDE plot for bivariate data can be obtained using the `jointplot` function with the `kind='kde'` argument. This draws a density curve for each variable on the marginal axes and density contours inside the main axes.

Let us visualize it with our dataset:

```python
# KDE plot with bivariate distribution
sns.jointplot(x='Var1', y='Var2', data=bi, kind='kde', height=7, color='g')

# Displaying the figure
plt.show()
```
From the contour plot, we can see that the data density is highest in the center and decreases as we move further out. The marginal KDE curves, which peak near zero, show the same pattern.

## Visualizing Pairwise Relationships in a Dataset

Most of the time, a dataset under study has more than two variables. We cannot use any of the above methods to visualize the relationship among all the variables at the same time. Therefore, Seaborn provides us a different approach to tackle such cases.

Seaborn has the `pairplot` function, through which we can create a matrix of plots (or subplots) using all or specific variables from the dataset.

Let us generate a dataset with four variables to try out the `pairplot` function:

```python
# Setting a random seed to receive same data every time
np.random.seed(42)
var1 = np.random.rand(2000)

np.random.seed(100)
var2 = np.random.randn(2000)

np.random.seed(67)
var3 = np.random.randn(2000)

np.random.seed(150)
var4 = np.random.rand(2000)

# Generating the random data
multi = pd.DataFrame({'Var1': var1, 'Var2': var2,
                      'Var3': var3, 'Var4': var4})
```

Applying the `pairplot` function to the dataset:

```python
# Pair plot with distributions
with sns.color_palette("summer"):
    sns.pairplot(multi)

# Displaying the figure
plt.show()
```
We can also select specific variables while building the pair plot. In addition, the histograms on the main diagonal can be replaced with KDE plots:

```python
# Pair plot with distributions of only first two variables along with KDE
with sns.color_palette("summer"):
    sns.pairplot(multi.iloc[:, :2], diag_kind='kde', height=4)

# Displaying the figure
plt.show()
```
## Conclusion

In this guide, you have learned the fundamentals required to visualize the distribution of univariate, bivariate, and multivariate data.