Visualizing the Regression Variants

Vivek Kumar

  • Apr 19, 2019
  • 9 Min read

Tags: Data, Seaborn

Introduction

In this guide, you are going to learn the fundamentals of plotting a regression and its variants, along with their related diagnostics, using the Seaborn library.

By the end of this guide you will be able to implement the following concepts:

  1. Visualizing a linear regression
  2. Visualizing a polynomial and a logistic regression
  3. Handling the plot aesthetics

The Baseline

In this guide we are going to use the following libraries:

# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Visualizing a Linear Regression

As per the Merriam-Webster dictionary, linear regression is:

The process of finding a straight line (as by least squares) that best approximates a set of points on a graph.

A simple linear regression has one dependent (target) variable and one independent variable. Using the independent variable, we have to estimate the future values of the dependent variable.

To implement any of the regression variants programmatically, you can use the Scikit-Learn library, as sketched below. In the rest of this section, let us try to find the best-fit line between a dependent and an independent variable.
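For instance, here is a minimal sketch of fitting a simple linear regression with Scikit-Learn; the tiny toy dataset is purely illustrative and not part of this guide's data:

# A minimal sketch: fitting a simple linear regression with Scikit-Learn
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # independent variable (2D, as Scikit-Learn expects)
y = np.array([2, 4, 5, 4, 6])                 # dependent variable

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # slope and intercept of the best-fit line
print(model.predict([[6]]))            # estimated dependent value at X = 6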

Generating Data for Simple Linear Regression

# Initializing the data
X = np.array([1, 5, 10, 13, 16, 20, 23, 26, 29, 20])
y = np.array([1, 6, 6, 10, 5, 9, 8, 11, 7, 25])

# Storing the data in a DataFrame
sl = pd.DataFrame({'Independent': X, 'Dependent': y})

Visualizing the Best Fit Line along with the Confidence Interval

The best-fit line is plotted using the lmplot method available in Seaborn.

# Plotting
with sns.color_palette('summer'):
    sns.lmplot(x="Independent", y="Dependent", data=sl)

# Labelling the title, x-label and y-label
plt.title('Simple Linear Regression Plot', weight='bold', fontsize=18)
plt.xlabel('Independent', fontsize=16)
plt.ylabel('Dependent', fontsize=16)

# Displaying the plot
plt.show()

[Figure: Simple linear regression plot with its confidence band]

From the above graph, the following observations can be made:

  1. There is a large confidence interval around the best-fit line, shown as the shaded band. (Its width can be controlled, as sketched below.)
  2. There is an outlier at the coordinates (20, 25), which has affected both the position of the line and the confidence interval.
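By default, lmplot draws a 95% confidence band around the line. A minimal sketch of the ci argument, which narrows or removes that band:

# A minimal sketch: controlling the confidence band with `ci`
with sns.color_palette('summer'):
    sns.lmplot(x="Independent", y="Dependent", data=sl, ci=68)      # narrower, 68% band
    # sns.lmplot(x="Independent", y="Dependent", data=sl, ci=None)  # no band at all
plt.show()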

Verification of Simple Linear Regression

We were able to plot the best-fit line on the given dataset, and we also figured out that one data point has affected the result. However, on a large dataset, arriving at such a conclusion by eye would be much more difficult. In such cases, it's advisable to use residual plots.

If the values in a residual plot (implemented by the residplot method in Seaborn) are scattered randomly around y=0, the linear regression model is a good one. However, if there is any pattern in the plotted residuals, the model is a bad one; the stronger the pattern, the worse the model.
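To make the idea concrete, a residual is simply the observed value minus the fitted value. Here is a minimal sketch (not part of the original guide) that computes the residuals by hand with np.polyfit, which fits the same least-squares line that lmplot draws:

# A minimal sketch: computing residuals by hand
slope, intercept = np.polyfit(sl['Independent'], sl['Dependent'], deg=1)
fitted = slope * sl['Independent'] + intercept
residuals = sl['Dependent'] - fitted   # the values residplot scatters around y = 0
print(residuals.round(2))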

Let us build the residual plot on the above dataset:

# Setting the figure size
plt.figure(figsize=(8, 5))

# Plotting
with sns.color_palette('summer'):
    sns.residplot(x="Independent", y="Dependent", data=sl)

# Labelling the title, x-label and y-label
# (residplot puts the independent variable on the x-axis, not the fitted values)
plt.title('Residual Plot', weight='bold', fontsize=18)
plt.xlabel('Independent', fontsize=16)
plt.ylabel('Residuals', fontsize=16)

# Displaying the plot
plt.show()

[Figure: Residual plot of the simple linear regression]

As you can observe, there is no pattern around y=0. However, the outlier is clearly separated from the other points, and removing it should improve the model.

Visualizing the Best Fit Line without the Outlier

First, we are going to remove the outlier from the data. To do that, we can use the following code:

# Dropping the last record consisting of an outlier
sl = sl.drop(9)

Next, we will plot the simple linear regression along with its residual plot.

# You can further pass the `robust=True` argument to the `lmplot` method.
# This will further reduce the effect of outliers on the model.

# Plotting
with sns.color_palette('summer'):
    sns.lmplot(x="Independent", y="Dependent", data=sl)

# Labelling the title, x-label and y-label
plt.title('Simple Linear Regression Plot', weight='bold', fontsize=18)
plt.xlabel('Independent', fontsize=16)
plt.ylabel('Dependent', fontsize=16)

# Displaying the plot
plt.show()

[Figure: Simple linear regression plot after removing the outlier]

# Plotting
with sns.color_palette('summer'):
    sns.residplot(x="Independent", y="Dependent", data=sl)

# Labelling the title, x-label and y-label
plt.title('Residual Plot', weight='bold', fontsize=18)
plt.xlabel('Independent', fontsize=16)
plt.ylabel('Residuals', fontsize=16)

# Displaying the plot
plt.show()

[Figure: Residual plot after removing the outlier]

Visualizing a Polynomial and a Logistic Regression

Let us now learn to visualize a polynomial and a logistic regression using one dependent and one independent variable.

1. Polynomial Regression Plot

Polynomial regression is needed when a straight line cannot fit the data points and a linear model's predictions worsen. For instance, let's create polynomial data using one of the most commonly used signals, a sinusoidal wave.

# Initializing the data
X, y = np.arange(30), np.sin(np.arange(30))

# Storing the data in a DataFrame
pr = pd.DataFrame({'Independent': X, 'Dependent': y})

To visualize the polynomial regression on this data, we can again use the lmplot method but this time we also have to define the degree (using the order argument) of the polynomial regression model.

# Plotting
with sns.color_palette('summer'):
    sns.lmplot(x="Independent", y="Dependent", data=pr, order=12, aspect=2)

# Limiting the x and y axis
plt.axis([0, 9, -2.5, 2.5])

# Labelling the title, x-label and y-label
plt.title('Polynomial Regression Plot', weight='bold', fontsize=18)
plt.xlabel('Independent', fontsize=16)
plt.ylabel('Dependent', fontsize=16)

# Displaying the plot
plt.show()

[Figure: 12th-order polynomial regression plot on the sinusoidal data]

In the above figure, we have used a 12th-order polynomial regression.

Note: The higher the order of a polynomial regression model, the higher the chances of overfitting.

Overcoming overfitting is a big challenge when constructing a polynomial regression model. Therefore, always spend a good amount of time choosing the best degree, for example by comparing several candidates as sketched below.
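One simple way to compare candidate degrees (a minimal sketch, not part of the original guide) is to fit polynomials of increasing order to the pr DataFrame above and watch the in-sample error. Remember that a vanishing error at high order is usually a sign of overfitting rather than a better model:

# A minimal sketch: comparing polynomial degrees on the sinusoidal data
from numpy.polynomial import Polynomial

for degree in (1, 3, 5, 12):
    # Polynomial.fit rescales the x-domain internally, which keeps the fit stable
    p = Polynomial.fit(pr['Independent'], pr['Dependent'], deg=degree)
    sse = np.sum((pr['Dependent'] - p(pr['Independent'])) ** 2)
    print(f'degree {degree:2d}: sum of squared errors = {sse:.3f}')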

2. Logistic Regression Plot

Logistic regression models the probability of one of two or more events occurring. For now, let us consider only two events: first, that a cab driver is a woman and, second, that the cab driver is a man. We take the distance covered by a driver's cab as the independent variable and the driver's gender as the dependent variable, where W represents a woman and M represents a man.

Generating the Dataset

# Getting the tips dataset from Seaborn
tips = sns.load_dataset("tips")

# Converting the tip into a binary outcome (True if the tip exceeds 15% of the bill)
tips["tip"] = (tips.tip / tips.total_bill) > .15

# Framing the above data into our scenario
lr = pd.DataFrame({'Distance': tips['total_bill'], 'Probability': tips['tip']})

Visualizing the Logistic Regression

# Plotting (`logistic=True` requires the statsmodels package)
with sns.color_palette('summer'):
    sns.lmplot(x="Distance", y="Probability", data=lr,
               logistic=True, aspect=1.5, y_jitter=0.05)

# Labelling the title, x-label and y-label
plt.title('Logistic Regression Plot', weight='bold', fontsize=18)
plt.xlabel('Distance covered (in miles) in a day', fontsize=16)
plt.ylabel('Probability', fontsize=16)

# Labelling the classes
plt.text(1, -0.01, 'M', weight='bold', fontsize=16)
plt.text(1, 1, 'W', weight='bold', fontsize=16)

# Displaying the plot
plt.show()

[Figure: Logistic regression plot of gender against distance covered]

In the above figure, let's place the decision boundary at a probability of 0.5. Using the given data, we can then suggest that a person covering 40 miles a day is probably a man, whereas a person covering 12 miles a day is probably a woman.
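To check that reading numerically, here is a minimal sketch (not part of the original guide) that fits Scikit-Learn's LogisticRegression to the same data. Note that Scikit-Learn regularizes by default, so its curve can differ slightly from the one Seaborn draws:

# A minimal sketch: predicting the class probabilities at 12 and 40 miles
from sklearn.linear_model import LogisticRegression

X = lr[['Distance']].values          # 2D feature array, as Scikit-Learn expects
y = lr['Probability'].astype(int)    # 1 = W (True), 0 = M (False)

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[12.0], [40.0]])[:, 1])   # P(W) at 12 and 40 miles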

Handling the Plot Aesthetics

The plot aesthetics are governed by the following attributes (see the sketch after this list):

  1. ax: Used to draw the plot onto an existing matplotlib axis (available on axis-level functions such as regplot).
  2. height: Used to control the height of the figure.
  3. aspect: Used to control the width-to-height ratio of the figure.
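A minimal sketch of these arguments, assuming the sl DataFrame from earlier; height and aspect belong to figure-level functions such as lmplot, while ax belongs to axis-level functions such as regplot:

# A minimal sketch: the aesthetic arguments in action
fig, ax = plt.subplots(figsize=(8, 5))
sns.regplot(x="Independent", y="Dependent", data=sl, ax=ax)   # draw onto our own axis

sns.lmplot(x="Independent", y="Dependent", data=sl,
           height=4, aspect=2)   # a 4-inch-tall figure, twice as wide as tall
plt.show()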

Conclusion

In this guide, you have learned about the fundamentals of plotting simple linear regression, polynomial regression, and logistic regression.

To learn more about Seaborn, you can refer to its official documentation and tutorials.
