Introduction

In this guide, you will learn the fundamentals of plotting a regression and its variants, along with their derived features, using the Seaborn library.

By the end of this guide you will be able to implement the following concepts:

- Visualizing a linear regression
- Visualizing a polynomial and a logistic regression
- Handling the plot aesthetics

In this guide we are going to use the following libraries:

**Syntax**

```python
# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

As per the Merriam-Webster dictionary, linear regression is:

> The process of finding a straight line (as by least squares) that best approximates a set of points on a graph.

A simple linear regression has one dependent (target) variable and one independent variable. Using the independent variable, we have to estimate the future values of the dependent variable.

To implement any of the regression variants, you can use the `Scikit-Learn` library. Here, let us try to find a best fit line between a dependent and an independent variable.

```python
# Initializing the data
X = np.array([1, 5, 10, 13, 16, 20, 23, 26, 29, 20])
y = np.array([1, 6, 6, 10, 5, 9, 8, 11, 7, 25])

# Storing the data in a DataFrame
sl = pd.DataFrame({'Independent': X, 'Dependent': y})
```

The best fit line is plotted using the `lmplot` method available in Seaborn.

```python
# Plotting
with sns.color_palette('summer'):
    sns.lmplot(x="Independent", y="Dependent", data=sl)

# Labelling the title, x-label and y-label
plt.title('Simple Linear Regression Plot', weight='bold', fontsize=18)
plt.xlabel('Independent', fontsize=16)
plt.ylabel('Dependent', fontsize=16)

# Displaying the plot
plt.show()
```

From the above graph, the following observations can be made:

- There is a large confidence interval area around the best fit line, shown in the shaded area.
- There is an outlier at coordinate (20, 25) which has affected the position of the line as well as the confidence interval.

We are able to plot the best fit line on the given dataset, and we also figured out that one data point has affected the result. However, with a large dataset, arriving at such a conclusion would be more difficult. In such cases, it's advised to use residual plots.

If the values around `y=0` in a residual plot (implemented by the `residplot` method in Seaborn) are scattered randomly, then the linear regression model is a good fit. However, if there is any pattern in the residuals, the model is a poor fit. The stronger the pattern, the worse the model.
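As a sketch of what a residual plot is built from (assuming an ordinary least-squares fit, which is `residplot`'s default when no `lowess` or `order` option is given), you can compute the residuals directly with NumPy on the same data:

```python
import numpy as np

# Same data as in the simple linear regression example above
X = np.array([1, 5, 10, 13, 16, 20, 23, 26, 29, 20], dtype=float)
y = np.array([1, 6, 6, 10, 5, 9, 8, 11, 7, 25], dtype=float)

# Fit a least-squares line and compute the fitted values
slope, intercept = np.polyfit(X, y, deg=1)
fitted = slope * X + intercept

# Residuals are the vertical distances from each point to the line
residuals = y - fitted

# The outlier (20, 25) is the last record, and it has by far the
# largest absolute residual
print(int(np.argmax(np.abs(residuals))))
```

A residual plot is simply these residuals drawn against the independent variable, which is why the outlier stands out so clearly in it.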

Let us build the residual plot on the above dataset:

```python
# Setting the figure size
plt.figure(figsize=(8, 5))

# Plotting
with sns.color_palette('summer'):
    sns.residplot(x="Independent", y="Dependent", data=sl)

# Labelling the title, x-label and y-label
plt.title('Residual Plot', weight='bold', fontsize=18)
plt.xlabel('Fitted values', fontsize=16)
plt.ylabel('Residuals', fontsize=16)

# Displaying the plot
plt.show()
```

As you can observe, there is no pattern across `y=0`. However, the outlier data point sits apart from the other points; removing it can improve the model.

First, we are going to remove the outlier from the data. To do that, we can use the following code:

```python
# Dropping the last record consisting of an outlier
sl = sl.drop(9)
```

Next, we will plot the simple linear regression along with its residual plot.

`1 2 3 4 5 6 7 8 9 10 11 12 13 14`

`# You can further use `robust=True` argument inside the `lmplot` method. # This will further reduce the effect of outliers on the model. # Plotting with sns.color_palette('summer'): sns.lmplot(x="Independent", y="Dependent", data=sl) # Labelling the title, x-label and y-label plt.title('Simple Linear Regression Plot', weight='bold', fontsize=18) plt.xlabel('Independent', fontsize=16) plt.ylabel('Dependent', fontsize=16) # Displaying the plot plt.show()`

python

```python
# Plotting
with sns.color_palette('summer'):
    sns.residplot(x="Independent", y="Dependent", data=sl)

# Labelling the title, x-label and y-label
plt.title('Residual Plot', weight='bold', fontsize=18)
plt.xlabel('Fitted values', fontsize=16)
plt.ylabel('Residuals', fontsize=16)

# Displaying the plot
plt.show()
```

Let us now learn to visualize a polynomial and a logistic regression using one dependent and one independent variable.

Polynomial regression is needed when a straight line cannot fit all of the data points and prediction results tend to get worse. For instance, let's create polynomial data using one of the most commonly used signals, a sinusoidal wave.

```python
# Initializing the data
X, y = np.arange(30), np.sin(np.arange(30))

# Storing the data in a DataFrame
pr = pd.DataFrame({'Independent': X, 'Dependent': y})
```

To visualize the polynomial regression on this data, we can again use the `lmplot` method, but this time we also have to define the degree of the polynomial regression model using the `order` argument.

```python
# Plotting
with sns.color_palette('summer'):
    sns.lmplot(x="Independent", y="Dependent", data=pr, order=12, aspect=2)

# Limiting the x and y axis
plt.axis([0, 9, -2.5, 2.5])

# Labelling the title, x-label and y-label
plt.title('Polynomial Regression Plot', weight='bold', fontsize=18)
plt.xlabel('Independent', fontsize=16)
plt.ylabel('Dependent', fontsize=16)

# Displaying the plot
plt.show()
```

In the above figure, we have used the 12th order polynomial regression.

Note: The higher the order of a polynomial regression model, the higher the chances of overfitting.

Overcoming overfitting is a big challenge when constructing a polynomial regression model, so always spend a good amount of time choosing the best degree.
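One way to see why training fit alone cannot pick the degree is to watch the in-sample error as the order grows. This is a sketch, not part of the guide's plotting code; it uses NumPy's `Polynomial.fit`, which rescales x internally so that high orders stay numerically stable:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Same sinusoidal data as in the polynomial regression example above
X, y = np.arange(30), np.sin(np.arange(30))

# In-sample mean squared error for increasing polynomial orders.
# Polynomial.fit maps x to [-1, 1] internally, keeping high orders stable.
mses = {}
for order in (1, 12, 25):
    p = Polynomial.fit(X, y, deg=order)
    mses[order] = np.mean((p(X) - y) ** 2)
    print(order, round(mses[order], 6))

# The in-sample error always shrinks as the order grows, which is exactly
# why training error alone cannot tell you the right degree.
```

Because each higher-order model contains every lower-order one, the in-sample error can only go down with the order; a validation set (or a residual plot on held-out data) is what actually reveals overfitting.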

Logistic regression models the probability of one of two or more events occurring. For now, let us consider only two events: first, that a cab driver is a woman, and second, that the cab driver is a man. We take the independent variable to be the distance covered by their cabs and the dependent variable to be the gender, where `W` represents a woman and `M` represents a man.
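Under the hood, logistic regression passes a linear combination of the inputs through the logistic (sigmoid) function to get a probability. The coefficients below are hypothetical, chosen only to mirror this scenario; a real model would learn them from the data:

```python
import numpy as np

# The logistic (sigmoid) function squashes any real value into (0, 1),
# which is what lets the model output a probability
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for illustration only: probability of `W`
# decreases as the distance covered grows
intercept, slope = 5.0, -0.25

probs = {}
for distance in (12, 20, 40):
    probs[distance] = sigmoid(intercept + slope * distance)
    print(distance, round(probs[distance], 3))
```

With these made-up coefficients, 12 miles yields a probability above 0.5 (likely `W`) and 40 miles a probability near 0 (likely `M`), matching the interpretation of the plot below.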

```python
# Getting the tips dataset from Seaborn
tips = sns.load_dataset("tips")
tips["tip"] = (tips.tip / tips.total_bill) > .15

# Framing the above data into our scenario
lr = pd.DataFrame({'Distance': tips['total_bill'], 'Probability': tips['tip']})
```

```python
# Plotting
with sns.color_palette('summer'):
    sns.lmplot(x="Distance", y="Probability", data=lr,
               logistic=True, aspect=1.5, y_jitter=0.05)

# Labelling the title, x-label and y-label
plt.title('Logistic Regression Plot', weight='bold', fontsize=18)
plt.xlabel('Distance covered (in miles) in a day', fontsize=16)
plt.ylabel('Probability', fontsize=16)

# Labelling the class
plt.text(1, -0.01, 'M', weight='bold', fontsize=16)
plt.text(1, 1, 'W', weight='bold', fontsize=16)

# Displaying the plot
plt.show()
```

In the above figure, let's take the probability boundary at 0.5. Now, using the given data we can suggest that a person who is covering 40 miles a day is probably a man, whereas a person covering 12 miles a day is probably a woman.

The plot aesthetics are governed by the following attributes:

- `ax`: Used to draw the plot onto an existing Matplotlib axis (supported by axes-level methods such as `regplot` and `residplot`).
- `height`: Used to set the height of the figure, in inches.
- `aspect`: Used to handle the shape of the figure; the width is `aspect * height`.
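As a quick sketch of `height` and `aspect` in action (with a small made-up dataset), the figure below comes out 8 inches wide and 4 inches tall:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# A small illustrative dataset
rng = np.random.RandomState(0)
df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10) + rng.randn(10)})

# height sets each facet's height in inches, and the width is
# height * aspect, so height=4 with aspect=2 yields an 8 x 4 inch figure
sns.lmplot(x="x", y="y", data=df, height=4, aspect=2)
print(plt.gcf().get_size_inches())

# Displaying the plot
plt.show()
```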

In this guide, you have learned about the fundamentals of plotting simple linear regression, polynomial regression, and logistic regression.
