Vivek Kumar

Plotting with Categorical Data

  • Apr 19, 2019
  • 9 Min read
  • 1,921 Views
Data
Seaborn

Introduction

In this guide, we will learn the fundamentals of plotting categorical data using the Seaborn library.

By the end of this guide you will be able to implement the following concepts:

  1. Categorical scatter plots
  2. Categorical distribution plots
  3. Categorical estimate plots

The Baseline

In this guide we are going to use the following libraries:

# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Categorical Scatter Plots

Consider the energy consumption of 600 houses recorded over three consecutive days. To plot a quantitative variable (energy consumed, in kWh) against a qualitative variable (the day: Mon, Tue, or Wed), we can use the catplot method. By default, catplot draws a strip plot: it applies a small random jitter along the categorical axis, but points with equal or similar values can still overlap. To see the duplicate records individually, we can pass kind='swarm' to the catplot method, which shifts points along the categorical axis so that they do not overlap.

Let us look at both scenarios using the following code:

# Generating random energy data
np.random.seed(42)
energy = np.linspace(50, 400, 200, dtype=int)
np.random.shuffle(energy)
energy = np.concatenate((energy, energy, energy))
np.random.shuffle(energy)

# Building a dataset
data = pd.DataFrame({'Days': ['Mon']*200 + ['Tue']*200 + ['Wed']*200,
                    'Energy': energy})
data.head()

   Days  Energy
0   Mon     143
1   Mon     171
2   Mon      71
3   Mon     132
4   Mon     123
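
Before plotting, it can help to check how many records share a value, since those are the points that land on top of each other in a scatter plot. The sketch below rebuilds the dataset from the code above and counts the repeats with pandas. The names repeats_overall and repeats_per_day are illustrative and not part of the original guide.

```python
import numpy as np
import pandas as pd

# Rebuild the dataset with the same steps as the guide
np.random.seed(42)
energy = np.linspace(50, 400, 200, dtype=int)
np.random.shuffle(energy)
energy = np.concatenate((energy, energy, energy))
np.random.shuffle(energy)
data = pd.DataFrame({'Days': ['Mon']*200 + ['Tue']*200 + ['Wed']*200,
                     'Energy': energy})

# How often does each energy value occur in the whole dataset?
# (repeats_overall is an illustrative name)
repeats_overall = data['Energy'].value_counts()
print(repeats_overall.unique())

# How many records repeat a value already seen on the same day?
# (repeats_per_day is an illustrative name)
repeats_per_day = (data.duplicated(subset=['Days', 'Energy'])
                       .groupby(data['Days'])
                       .sum())
print(repeats_per_day)
```

Because the 200 distinct values are concatenated three times, every value occurs exactly three times overall. How those copies are spread across the days depends on the shuffle.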

Case I: Keeping the Duplicate Records

# Plotting
sns.catplot(x='Days', y='Energy', data=data, aspect=2)
plt.title('With duplicate records', weight='bold', fontsize=16)
plt.show()

# ===============
# Alternative: stripplot
# ===============

# sns.stripplot(x='Days', y='Energy', data=data)

[Figure: strip plot of Energy by Days, titled "With duplicate records"]

Case II: Separating the Duplicate Records

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='swarm', aspect=2)
plt.title('Separating duplicate records', weight='bold', fontsize=16)
plt.show()

# ===============
# Alternative: swarmplot
# ===============
# sns.swarmplot(x='Days', y='Energy', data=data)

[Figure: swarm plot of Energy by Days, titled "Separating duplicate records"]
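
The commented-out alternatives in the two cases refer to the axes-level functions stripplot and swarmplot. As a rough sketch, both can be drawn side by side on ordinary Matplotlib axes to compare the cases directly. The dataset is rebuilt so that the block runs on its own; the Agg backend, the figure size, the marker size, and jitter=False are choices made for this illustration, not settings taken from the guide.

```python
import matplotlib
matplotlib.use("Agg")  # illustrative choice: lets the sketch run without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Rebuild a 600-record dataset like the one in the guide
np.random.seed(42)
energy = np.linspace(50, 400, 200, dtype=int)
energy = np.concatenate((energy, energy, energy))
np.random.shuffle(energy)
data = pd.DataFrame({'Days': ['Mon']*200 + ['Tue']*200 + ['Wed']*200,
                     'Energy': energy})

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

# Left: strip plot with jitter switched off, so equal values overlap completely
sns.stripplot(x='Days', y='Energy', data=data, jitter=False, ax=axes[0])
axes[0].set_title('stripplot, jitter=False')

# Right: swarm plot, which moves points sideways so they do not overlap
# (size=3 is an illustrative marker size that leaves more room per point)
sns.swarmplot(x='Days', y='Energy', data=data, size=3, ax=axes[1])
axes[1].set_title('swarmplot')

fig.tight_layout()
```

Unlike catplot, these functions draw onto an existing axes object, which makes them easier to combine in a custom layout.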

We can also distinguish subgroups within a plot. For instance, suppose that 280 of the 600 houses have more than five family members. We can mark these 280 houses in the visualization using the hue argument, as shown:

# Allocating 280 houses randomly with more than five members
members = np.array(['Yes']*280 + ['No']*320)
np.random.seed(42)
np.random.shuffle(members)

# Adding data back to the dataset
data['MembersMoreThanFive'] = members

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='swarm', hue='MembersMoreThanFive', aspect=2)
plt.title('Describing Hue parameter', weight='bold', fontsize=16)
plt.show()

[Figure: swarm plot of Energy by Days, colored by MembersMoreThanFive]
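
Since the hue column is assigned by shuffling, it may be worth confirming how the two groups are spread over the days before reading the plot. The following is a small sketch using pandas only; the names days, flag, and table are illustrative.

```python
import numpy as np
import pandas as pd

# Same allocation as in the guide: 280 'Yes' and 320 'No', shuffled
members = np.array(['Yes']*280 + ['No']*320)
np.random.seed(42)
np.random.shuffle(members)

# Illustrative names for the two categorical columns
days = pd.Series(['Mon']*200 + ['Tue']*200 + ['Wed']*200, name='Days')
flag = pd.Series(members, name='MembersMoreThanFive')

# Overall size of each hue group
print(flag.value_counts())

# Size of each hue group within each day
table = pd.crosstab(days, flag)
print(table)
```

Each row of the cross-tabulation sums to 200, the number of records per day, while the split between the two hue levels varies from day to day.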

Categorical Distribution Plots

In the previous section, we visualized a quantitative variable against a qualitative variable. Now, let us look at how the distribution of the quantitative data varies across the categories of the qualitative data. Such distributions can be shown using any of the following plots:

  1. Box plot
  2. Boxen plot
  3. Violin plot

Let us implement each of them in turn.

1. Box Plot

A box plot shows how quantitative data is spread around its 25th, 50th, and 75th percentiles. The box spans the 25th to the 75th percentile, the line inside it marks the median, and the whiskers extend to the most extreme values that are not treated as outliers; any outliers are drawn as individual points. Let us take our previous dataset and draw a box plot for each day. This can again be done with the catplot method, as shown:

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='box', aspect=2)
plt.title('Box Plot', weight='bold', fontsize=16)
plt.show()

# ==============
# Alternative: boxplot
# ==============
# sns.boxplot(x='Days', y='Energy', data=data)

[Figure: box plot of Energy by Days]
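
To make the elements of a box plot concrete, the quantities it draws can be computed by hand. This is a minimal sketch, assuming the common rule in which whiskers reach the most extreme values within 1.5 times the interquartile range of the box; box_summary is a hypothetical helper written for this illustration, not a Seaborn function.

```python
import numpy as np

def box_summary(values, whis=1.5):
    """Return the quantities a box plot draws for one group of values.

    Hypothetical helper; whis=1.5 is the assumed whisker rule.
    """
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    return {
        'q1': q1, 'median': median, 'q3': q3,
        'whisker_low': inside.min(),    # lowest value within the fences
        'whisker_high': inside.max(),   # highest value within the fences
        'n_outliers': int(values.size - inside.size),
    }

# Evenly spaced values from 50 to 400 kWh: nothing falls outside the fences
print(box_summary(np.linspace(50, 400, 200)))

# A single extreme reading is flagged as an outlier
print(box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

In the second call, the reading of 100 lies beyond the upper fence, so the upper whisker stops at 9 and one outlier is reported.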

We can also split the box plot by a condition. For instance, let us draw a box plot for each day, separated by whether a house has more than five members:

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='box', hue='MembersMoreThanFive', aspect=2)
plt.title('Box Plot with Hue parameter', weight='bold', fontsize=16)
plt.show()

[Figure: box plot of Energy by Days, split by MembersMoreThanFive]

From the above figures, we can see that there are no outliers in the data and that the data has little skewness.

2. Boxen Plot

The boxen plot works much like the box plot. However, it shows more percentiles of the data and is therefore better suited to large datasets.

Let us build a larger dataset with 51,000 records and draw a boxen plot of it:

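# Generating evenly spaced energy values (despite the seed, nothing random is
# drawn here, so each day ends up covering a different third of the range)
np.random.seed(42)
energy = np.linspace(100, 100000, 51000)

# Building a dataset of 51,000 records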
data = pd.DataFrame({'Days': ['Mon']*17000 + ['Tue']*17000 + ['Wed']*17000,
                    'Energy': energy})

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='boxen', aspect=2)
plt.title('Boxen Plot', weight='bold', fontsize=16)
plt.show()

[Figure: boxen plot of Energy by Days]
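
The extra boxes in a boxen plot correspond to progressively smaller tail fractions of the data: quarters, eighths, sixteenths, and so on. A minimal sketch of those quantiles follows. The helper letter_values and its fixed depth of four levels are assumptions made for this illustration; Seaborn decides on its own how many levels to draw.

```python
import numpy as np

def letter_values(values, depth=4):
    """Return (tail fraction, lower quantile, upper quantile) per level.

    Hypothetical helper; depth=4 is an illustrative choice.
    """
    values = np.asarray(values, dtype=float)
    levels = []
    for k in range(1, depth + 1):
        tail = 0.5 ** (k + 1)          # 1/4, 1/8, 1/16, ...
        lower, upper = np.quantile(values, [tail, 1 - tail])
        levels.append((tail, lower, upper))
    return levels

# Monday's share of the guide's dataset: the first 17,000 evenly spaced values
energy = np.linspace(100, 100000, 51000)
monday = energy[:17000]

print('median:', np.median(monday))
for tail, lower, upper in letter_values(monday):
    print(f'tail {tail:.4f}: {lower:10.1f} to {upper:10.1f}')
```

Each level encloses the previous one, which is why the boxes in the figure step outward from the median toward the tails.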

As the above figure shows, a boxen plot presents more information about the distribution, through its additional percentiles, than a box plot does. We can also use the hue parameter with the boxen plot, as shown:

# Allocating 30000 houses randomly with more than five members
members = np.array(['Yes']*30000 + ['No']*21000)
np.random.seed(42)
np.random.shuffle(members)

# Adding data back to the dataset
data['MembersMoreThanFive'] = members

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='boxen', hue='MembersMoreThanFive', aspect=2)
plt.title('Boxen Plot with Hue parameter', weight='bold', fontsize=16)
plt.show()

[Figure: boxen plot of Energy by Days, split by MembersMoreThanFive]

3. Violin Plot

The violin plot combines a box plot with a kernel density estimate (KDE). A single plot therefore conveys two kinds of information: the summary statistics and the shape of the distribution.

Let us use the energy dataset of 51,000 records to draw the violin plot:

Case I: Without the Hue Parameter

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='violin', aspect=2)
plt.title('Violin Plot', weight='bold', fontsize=16)
plt.show()

[Figure: violin plot of Energy by Days]

In the center of each violin, you can see a miniature box plot; the left and right boundaries trace the kernel density estimate, mirrored symmetrically about the center.
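
For readers unfamiliar with kernel density estimation, the outline of a violin can be thought of as an average of small Gaussian bumps, one centred on each data point. The following is a simplified sketch of that idea in NumPy, not Seaborn's own implementation. The function gaussian_kde_sketch, the stand-in sample, and the use of Scott's rule of thumb for the bandwidth are assumptions made for this illustration.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each sample (hypothetical helper)."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# A small stand-in sample (200 evenly spaced energy values), chosen for speed
samples = np.linspace(100, 1000, 200)

# Scott's rule of thumb for the bandwidth (an assumed choice)
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

# Evaluate the density a little beyond the data on both sides
grid = np.linspace(samples.min() - 4 * bandwidth,
                   samples.max() + 4 * bandwidth, 500)
density = gaussian_kde_sketch(samples, grid, bandwidth)

# A density should integrate to roughly 1 over the grid
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Plotting density against grid, and mirroring it about a vertical axis, gives the familiar violin outline. A larger bandwidth produces a smoother, wider shape.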

Case II: With the Hue Parameter

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='violin', hue='MembersMoreThanFive', aspect=2)
plt.title('Violin Plot with the Hue parameter', weight='bold', fontsize=16)
plt.show()

[Figure: violin plot of Energy by Days, with one violin per MembersMoreThanFive level]

Case III: With the Hue and the Split Parameters

We can also merge the two hue-level violins for the same day into a single violin using the split argument, so that each half shows one hue level, as shown:

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='violin', hue='MembersMoreThanFive', split=True, aspect=2)
plt.title('Violin Plot with the Hue and Split parameters', weight='bold', fontsize=16)
plt.show()

[Figure: split violin plot of Energy by Days, with one half per MembersMoreThanFive level]

Categorical Estimate Plots

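If we need a summary estimate of the quantitative data for each category, rather than its full distribution, we can use estimate plots instead of distribution plots. By default, these plots show the mean of each category together with a confidence interval. The bar plot and the point plot are two such estimate plots.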

Let us implement the bar plot using the catplot method on the energy dataset of 51,000 houses:

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='bar', hue='MembersMoreThanFive', palette="ch:10.75")
plt.title('Bar Plot', weight='bold', fontsize=16)
plt.show()

[Figure: bar plot of Energy by Days, grouped by MembersMoreThanFive]
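
The height of each bar is an estimate (by default the mean) and the error bar is a confidence interval around it. A minimal sketch of that computation per day is shown here, assuming a percentile bootstrap at the 95% level. The helper bootstrap_ci, the 200 resamples, and the fixed seed are choices made for this illustration, and the hue split is left out to keep it short.

```python
import numpy as np
import pandas as pd

# Rebuild the 51,000-record dataset from the guide
energy = np.linspace(100, 100000, 51000)
data = pd.DataFrame({'Days': ['Mon']*17000 + ['Tue']*17000 + ['Wed']*17000,
                     'Energy': energy})

def bootstrap_ci(values, n_boot=200, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - level) / 2
    return np.percentile(boot_means, [tail, 100 - tail])

rows = []
for day, group in data.groupby('Days', sort=False):
    low, high = bootstrap_ci(group['Energy'])
    rows.append({'Days': day, 'mean': group['Energy'].mean(),
                 'ci_low': low, 'ci_high': high})
summary = pd.DataFrame(rows).set_index('Days')

print(summary)
```

With 17,000 records per day, the intervals are narrow compared with the means, which is why the error bars in such a plot are barely visible.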

Similarly, a point plot draws the estimates as points and connects those that share the same hue level, as shown on the same dataset:

# Plotting
sns.catplot(x='Days', y='Energy', data=data, kind='point', hue='MembersMoreThanFive', 
            markers=["^", "o"], linestyles=["-", "--"])
plt.title('Point Plot', weight='bold', fontsize=16)
plt.show()

[Figure: point plot of Energy by Days, with one line per MembersMoreThanFive level]

Here, the two lines overlap, suggesting that the estimates for both hue levels are nearly the same.
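
The overlap can be checked numerically by computing the group means that the point plot summarizes. The following sketch rebuilds the dataset and the hue column from the earlier code and uses a pandas groupby; the name means is illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the 51,000-record dataset and the hue column from the guide
energy = np.linspace(100, 100000, 51000)
data = pd.DataFrame({'Days': ['Mon']*17000 + ['Tue']*17000 + ['Wed']*17000,
                     'Energy': energy})
members = np.array(['Yes']*30000 + ['No']*21000)
np.random.seed(42)
np.random.shuffle(members)
data['MembersMoreThanFive'] = members

# Mean energy per day and hue level, one column per hue level
# (means is an illustrative name)
means = (data.groupby(['Days', 'MembersMoreThanFive'])['Energy']
             .mean()
             .unstack())
print(means.round(1))

# Gap between the two hue levels within each day
print((means['Yes'] - means['No']).abs().round(1))
```

Because the hue labels are assigned at random, the gap within each day is small relative to the differences between days, which is consistent with the overlapping lines.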

Conclusion

In this guide, you have learned about the fundamental visualizations for categorical data: scatter plots, distribution plots, and estimate plots.

To learn more about Seaborn, you can refer to the following guides:
