In this guide, we are going to implement a logistic regression model from scratch and compare its accuracy with the scikit-learn logistic regression package. Logistic regression is a classification technique in machine learning, and classification problems come up everywhere in data science.
Logistic regression is also one of the simplest and most commonly used models. It provides a baseline for any binary classification problem, where the outcome is either True/False or Yes/No: it can predict whether an email is spam or whether an individual has diabetes, but it cannot predict continuous values such as house prices. The other category is multinomial (multi-class) classification, in which more than two classes are possible, such as whether the weather will be sunny, rainy, or humid, or which species an animal belongs to.
In this guide, we will explore a known multi-class problem using the iris dataset from the UCI Machine Learning Repository to predict the species of flowers based on their given dimensions.
You can download the dataset here. Please note that you will need a Kaggle account to get the dataset.
So let us begin our journey!
The iris dataset contains observations of three iris species: Iris-setosa, Iris-versicolor, and Iris-virginica.
There are 50 observations of each species, for a total of 150 observations with 4 features each (sepal length, sepal width, petal length, petal width). Based on the combination of these four features, the statistician R. A. Fisher developed a linear discriminant model to distinguish the species from each other.
Before we start, we need to define some ML terms.
Attributes (features): An attribute is a measurable property of each observation that the model uses to determine its class. In this case, the attributes are the length and width of the petals and sepals.
```python
# Import libraries
import numpy as np   # linear algebra
import pandas as pd  # data processing

import warnings
warnings.filterwarnings('ignore')  # ignore warnings

# Visualization
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.metrics import confusion_matrix  # confusion matrix
from sklearn.metrics import accuracy_score    # accuracy score

# Splitting into training and testing sets
from sklearn.model_selection import train_test_split

# Advanced optimization
from scipy import optimize as op
```
```python
# Loading the data
data_iris = pd.read_csv('Iris.csv')            # if Iris.csv is in the working directory
# data_iris = pd.read_csv('../input/Iris.csv') # otherwise, give the path to the directory where the csv is stored
```
```python
data_iris.head()  # first 5 entries of the dataset
```
| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
```python
data_iris.tail()  # last 5 entries of the dataset
```
| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
```python
data_iris.info()
```
The following command gives the descriptive statistics (percentiles, mean, standard deviation) for all 150 observations.
```python
data_iris.describe()
```
| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 75.500000 | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 1.000000 | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 38.250000 | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 75.500000 | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 112.750000 | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 150.000000 | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
```python
data_iris['Species'].value_counts()
```
```
Iris-setosa        50
Iris-virginica     50
Iris-versicolor    50
Name: Species, dtype: int64
```
Check for missing values.
```python
data_iris.isnull().values.any()
```
```
False
```
Great! There is no missing data. Now let's describe the statistical details for each flower species.
```python
print('Iris-setosa')
setosa = data_iris['Species'] == 'Iris-setosa'
print(data_iris[setosa].describe())

print('\nIris-versicolor')
versicolor = data_iris['Species'] == 'Iris-versicolor'
print(data_iris[versicolor].describe())

print('\nIris-virginica')
virginica = data_iris['Species'] == 'Iris-virginica'
print(data_iris[virginica].describe())
```
```python
# Histogram (univariate plot) for each measurement
iris_measurements = data_iris.drop('Id', axis=1)  # drop the Id column (don't name this `np`, which would shadow numpy)
iris_measurements.hist(edgecolor='blue', linewidth=1.2)
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()
```
```python
# Plotting a scatter plot of petal length vs. petal width
# (in seaborn >= 0.9 the FacetGrid `size` argument is called `height`)
petalPlt = sb.FacetGrid(data_iris, hue="Species", height=6).map(plt.scatter, "PetalLengthCm", "PetalWidthCm")
plt.legend(loc='upper left')
plt.title("Petal Length VS Width")
```
```python
# Plotting a scatter plot of sepal length vs. sepal width
sepalPlt = sb.FacetGrid(data_iris, hue="Species", height=6).map(plt.scatter, "SepalLengthCm", "SepalWidthCm")
plt.legend(loc='upper right')
plt.title("Sepal Length VS Width")
```
Here we can see that the petal features give a much cleaner separation between the species than the sepal features. Let's check the bivariate relation between each pair of features.
```python
# Using a seaborn pairplot to see the bivariate relation between each pair of features
sb.set_palette('husl')

iris_no_id = data_iris.drop('Id', axis=1)  # drop the Id column
sb.pairplot(iris_no_id, hue="Species", diag_kind="kde", markers='+', height=3)
plt.show()
```
Takeaways from the above visualization:
Iris-setosa (in pink) is distinctly separated from the other two species, while Iris-versicolor and Iris-virginica partially overlap.
```python
# Data setup
Species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

m = data_iris.shape[0]  # number of examples
n = 4                   # number of features
k = 3                   # number of classes

# Feature matrix: a leading column of ones for the intercept term, then the four measurements
X = np.ones((m, n + 1))
X[:, 1] = data_iris['PetalLengthCm'].values
X[:, 2] = data_iris['PetalWidthCm'].values
X[:, 3] = data_iris['SepalLengthCm'].values
X[:, 4] = data_iris['SepalWidthCm'].values

# Labels
y = data_iris['Species'].values

# Mean normalization of the feature columns (column 0, the intercept, is left as ones)
for j in range(1, n + 1):
    X[:, j] = X[:, j] - X[:, j].mean()
```
```python
# 80% of the data is used for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

print(X_train.shape)
print(y_test.shape)
```
```
(120, 5)
(30,)
```
We can turn the linear model into a classifier by applying a link function. For logistic regression, that link is the sigmoid (logistic) function, sigmoid(z) = 1 / (1 + e^(-z)), which maps any real value into the range (0, 1) and produces the characteristic S-curve.
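To see the S-curve for yourself, here is a minimal sketch that plots the sigmoid over a small range, assuming numpy and matplotlib are imported as above:
```python
# Plot the sigmoid to visualize the S-curve
z = np.linspace(-10, 10, 200)
sigmoid_curve = 1.0 / (1 + np.exp(-z))

plt.plot(z, sigmoid_curve)
plt.axhline(0.5, linestyle='--', linewidth=0.8)  # the usual 0.5 decision threshold
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.title('Sigmoid (logistic) function')
plt.show()
```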
Regularization addresses over-fitting and under-fitting while training the model. Over-fitting occurs when the variance is high: the model is overly complicated, with lots of unnecessary curves and angles, so it fits the training data extremely well but performs poorly on the test data. Under-fitting occurs when the bias is high: the model is too simple, so it fits neither the training data nor the test data well. Regularization shrinks the parameters θj without discarding any features; the regularization term is added at the end of the cost function and of its gradient.
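For reference, the regularized cost function and gradient that the code below implements are the standard ones for regularized logistic regression (the intercept θ0 is conventionally left out of the regularization term):
```latex
h_\theta(x) = \mathrm{sigmoid}(\theta^{T} x)

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}

\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)} + \frac{\lambda}{m}\theta_j \quad (j \ge 1)
```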
There are multiple ways to translate these equations into code. Choose the approach you are comfortable with, and make sure it reproduces the mathematical expressions correctly.
```python
# Sigmoid function
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

# Regularized cost function
def reglrCostFunction(theta, X, y, lambda_s=0.1):
    m = len(y)
    h = sigmoid(X.dot(theta))
    J = (1 / m) * (-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1 - h)))
    reg = (lambda_s / (2 * m)) * np.sum(theta[1:] ** 2)  # the intercept theta[0] is not regularized
    return J + reg

# Regularized gradient
def reglrGradient(theta, X, y, lambda_s=0.1):
    m, n = X.shape
    theta = theta.reshape((n, 1))
    y = y.reshape((m, 1))
    h = sigmoid(X.dot(theta))
    reg = (lambda_s / m) * theta
    reg[0] = 0  # the intercept theta[0] is not regularized
    gd = (1 / m) * X.T.dot(h - y) + reg
    return gd.flatten()

# Optimizing logistic regression (theta) for one class
def logisticRegression(X, y, theta):
    result = op.minimize(fun=reglrCostFunction, x0=theta, args=(X, y),
                         method='TNC', jac=reglrGradient)
    return result.x
```
```python
# Training: one-vs-all. For each species we train a binary classifier
# that separates that species from the other two.
all_theta = np.zeros((k, n + 1))

for i, flower in enumerate(Species):
    tmp_y = np.array(y_train == flower, dtype=int)  # 1 for the current species, 0 otherwise
    optTheta = logisticRegression(X_train, tmp_y, np.zeros(n + 1))
    all_theta[i] = optTheta
```
```python
# Predictions
Prob = sigmoid(X_test.dot(all_theta.T))  # probability of each class for each test example
pred = [Species[np.argmax(Prob[i, :])] for i in range(X_test.shape[0])]

print("Test Accuracy ", accuracy_score(y_test, pred) * 100, '%')
```
```
Test Accuracy  96.66666666666667 %
```
Our model is performing well, with an accuracy of roughly 97%. It is also worth checking the correlation between features with a heat map: a high positive or negative value indicates that two features are strongly correlated. After that, we can evaluate the predictions themselves with a confusion matrix.
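Here is a minimal sketch of such a correlation heat map, assuming the data_iris DataFrame and the seaborn/matplotlib imports from above:
```python
# Pairwise correlation between the four measurements
corr = data_iris.drop(['Id', 'Species'], axis=1).corr()
sb.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Feature correlation')
plt.show()
```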
```python
# Confusion matrix
cnfm = confusion_matrix(y_test, pred, labels=Species)

sb.heatmap(cnfm, annot=True, xticklabels=Species, yticklabels=Species)
plt.show()
```
```python
# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
```
Our math checks out: the functions clearly work and the output looks right. Still, it never hurts to verify. Let's check whether we get the same results with scikit-learn's logistic regression, training the model with the .fit() method and generating predictions with the .predict() method.
```python
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Test Accuracy for Scikit-Learn model:', metrics.accuracy_score(y_test, y_pred) * 100, '%')
```
```
Test Accuracy for Scikit-Learn model: 96.66666666666667 %
```
```python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
```
The accuracy of our from-scratch logistic regression model matches that of the scikit-learn model, which is exactly what we expected.
Although our accuracy is close to 97%, there is still room for improvement. In production it is usually preferable to use the scikit-learn logistic regression model: it requires far less code and relies on highly optimized solvers, so it is faster and easier to maintain.
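If you do use scikit-learn, you can also control the solver and the regularization strength explicitly. Here is a minimal sketch; the parameter values are illustrative, and C is the inverse of the regularization strength, playing a role similar to lambda_s in our implementation:
```python
from sklearn.linear_model import LogisticRegression

# Explicitly chosen solver, regularization strength, and iteration budget
# (values here are illustrative, not tuned)
logreg_tuned = LogisticRegression(solver='lbfgs', C=1.0, max_iter=200)
logreg_tuned.fit(X_train, y_train)
print('Tuned test accuracy:', logreg_tuned.score(X_test, y_test) * 100, '%')
```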
So, when does it pay off to build an algorithm from scratch? When you need to design a model for a more complex problem, or for a problem in a new domain.
For this guide, we selected a simple dataset with only a few features. Just keep in mind that the performance and the choice of techniques depend on the volume and variety of the data.
One final tip: don't worry if your algorithm is not as optimized and polished as the popular packages. Those packages are the result of a strong foundation and years of consistent improvement.
I hope you liked this guide. If you have any queries, feel free to contact me at CodeAlphabet.