In this guide, we are going to implement a logistic regression model from scratch and compare its accuracy with the scikit-learn logistic regression package. Logistic regression is part of the classification technique of machine learning, which solves many problems in data science.
Logistic regression is also one of the simplest and most commonly used models. It implements a baseline for any binary classification problem with outcomes that are either True/False or Yes/No. It can predict whether mail is spam or predict diabetes in an individual, but it can't predict things like house prices. Another category is multinomial classification, in which more than two classes are determined, such as whether the weather will be sunny, rainy or humid, the species of animals, etc.
In this guide, we will explore a known multi-class problem using the iris dataset from the UCI Machine Learning Repository to predict the species of flowers based on their given dimensions.
So let us begin our journey!
The iris dataset contains observations of three iris species: Iris-setosa, Iris-versicolor, and Iris-virginica.
There are 50 observations of each species for a total of 150 observations with 4 features each (sepal length, sepal width, petal length, petal width). Based on the combination of the mentioned four features, data scientist R. A. Fisher developed a linear discriminant model to distinguish the species from each other.
Before we start, we need to define some ML terms.
Attributes (features): An attribute is a list of specifications in a dataset used to determine classification. In this case, the attributes are the petals' and sepals' length and width.
1# import libraries 2from subprocess import check_output 3import numpy as np # linear algebra 4import pandas as pd # data processing 5import warnings 6warnings.filterwarnings('ignore') #ignore warnings 7from math import ceil 8 9#Visualization 10import matplotlib.pyplot as plt 11import seaborn as sb 12from sklearn.metrics import confusion_matrix #Confusion matrix 13from sklearn.metrics import accuracy_score # Accuracy score 14 15# Spliting training and testing 16from sklearn.model_selection import train_test_split 17 18#Advanced optimization 19from scipy import optimize as op
1# Loading the data 2data_iris = pd.read_csv('Iris.csv') # If your input csv file is placed with working directory 3# data_iris=pd.read_csv('../input/Iris.csv') # Enter the path of the directory where input csv is stored
1data_iris.head() # first 5 entries of the dataset
1data_iris.tail() # last 5 entries of dataset
The following command gives the descriptive statistics (percentile, mean, standard deviation) for all 150 observations.
1Iris-setosa 50 2Iris-virginica 50 3Iris-versicolor 50 4Name: Species, dtype: int64
Check for missing values.
Great! There is no missing data. Now let's describe the statistical details for each flower species.
1print('Iris-setosa') 2setosa = data_iris['Species'] == 'Iris-setosa' 3print(data_iris[setosa].describe()) 4print('\nIris-versicolor') 5versicolor = data_iris['Species'] == 'Iris-versicolor' 6print(data_iris[versicolor].describe()) 7print('\nIris-virginica') 8virginica = data_iris['Species'] == 'Iris-virginica' 9print(data_iris[virginica].describe())
1# The Histogram representation of the univariate plots for each measurement 2 3np = data_iris.drop('Id', axis=1) # dropping the Id 4np.hist(edgecolor='blue', linewidth=1.2) 5fig=plt.gcf() 6fig.set_size_inches(12,6) 7plt.show()
1# ploting scatter plot with respect to petal length 2 3petalPlt = sb.FacetGrid(data_iris, hue="Species", size=6).map(plt.scatter, "PetalLengthCm", "PetalWidthCm") 4plt.legend(loc='upper left'); 5plt.title("Petal Length VS Width");
1# Plotting scatter plot with respect to sepal length 2sepalPlt = sb.FacetGrid(data_iris, hue="Species", size=6).map(plt.scatter, "SepalLengthCm", "SepalWidthCm") 3plt.legend(loc='upper right'); 4plt.title("Sepal Length VS Width")
Here we can see that the petal features are giving a better cluster division. Let's check the bivariate relation between each pair of features.
1# Using seaborn pairplot to see the bivariate relation between each pair of features 2 3import seaborn as sns 4sns.set_palette('husl') 5 6nl = data_iris.drop('Id', axis=1) # dropping the Id 7b = sns.pairplot(nl,hue="Species",diag_kind="kde", markers='+',size =3 ); 8plt.show()
Takeaways from the above visualization:
Iris-Setosa(in pink) is distinctly different from those of the other two species.
1# Data setup 2import numpy as np 3Species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] 4# Number of examples 5m = data_iris.shape 6# Features 7n = 4 8# Number of classes 9k = 3 10 11X = np.ones((m,n + 1)) 12y = np.array((m,1)) 13X[:,1] = data_iris['PetalLengthCm'].values 14X[:,2] = data_iris['PetalWidthCm'].values 15X[:,3] = data_iris['SepalLengthCm'].values 16X[:,4] = data_iris['SepalWidthCm'].values 17 18# Labels 19y = data_iris['Species'].values 20 21# Mean normalization 22for j in range(n): 23 X[:, j] = (X[:, j] - X[:,j].mean())
1X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 11) 2# it shows 80% of data is split for training and 20% of the data goes to testing. 3 4X = data_iris.drop(['Id', 'Species'], axis=1) 5y = data_iris['Species'] 6# print(X.head()) 7print(X_train.shape) 8# print(y.head()) 9print(y_test.shape)
1(120, 5) 2(30,)
We can transform a logistic model to a predictor by using a link function. We will be using the sigmoid function for our model. The image shows the S-curve and the equation of the sigmoid function.
Regularization addresses the error of over-fitting or under-fitting the model while training the data. Over-fitting occurs when the variance is high and the model is complicated with lots of unnecessary curves and angles. The model is best fitted for the training data, but it performs poorly while testing the data. Under-fitting occurs when the variance is low and the model is too simple. It will give the best accuracy while testing the data, but it will not fit the training data well. When we apply regularization, the model adjusts θj without neglecting any features. The regularization parameter is added at the end of cost function and gradient descent function equations.
There are multiple ways to write mathematical equations into code format. Choose the method you are comfortable with, and make sure it justifies the mathematical expression correctly.
1# Sigmoid function 2def sigmoid(z): 3 return 1.0 / (1 + np.exp(-z)) 4#____________________________________________________________________________# 5# Regularized cost function 6def reglrCostFunction(theta, X, y, lambda_s = 0.1): 7 m = len(y) 8 h = sigmoid(X.dot(theta)) 9 J = (1 / m) * (-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1 - h))) 10 reg = (lambda_s/(2 * m)) * np.sum(theta**2) 11 J = J + reg 12 13 return J 14#____________________________________________________________________________# 15# Regularized gradient function 16def reglrGradient(theta, X, y, lambda_s = 0.1): 17 m, n = X.shape 18 theta = theta.reshape((n, 1)) 19 y = y.reshape((m, 1)) 20 h = sigmoid(X.dot(theta)) 21 reg = lambda_s * theta /m 22 gd = ((1 / m) * X.T.dot(h - y)) 23 gd = gd + reg 24 25 return gd 26#____________________________________________________________________________# 27# Optimizing logistic regression (theta) 28def logisticRegression(X, y, theta): 29 result = op.minimize(fun = reglrCostFunction, x0 = theta, args = (X, y), 30 method = 'TNC', jac = reglrGradient) 31 32 return result.x
1# Training 2all_theta = np.zeros((k, n + 1)) 3 4# One vs all 5i = 0 6for flower in Species: 7 tmp_y = np.array(y_train == flower, dtype = int) 8 optTheta = logisticRegression(X_train, tmp_y, np.zeros((n + 1,1))) 9 all_theta[i] = optTheta 10 i += 1 11
1# Predictions 2Prob = sigmoid(X_test.dot(all_theta.T)) # probability for each flower 3pred = [Species[np.argmax(Prob[i, :])] for i in range(X_test.shape)] 4 5print(" Test Accuracy ", accuracy_score(y_test, pred) * 100 , '%')
1Test Accuracy 96.66666666666667 %
Our model is performing well, as the accuracy is approximately 97%. Let's find out the correlation between features in the dataset using a heat map. A high positive or negative value will show that the features have a high correlation.
1# Confusion Matrix 2cnfm = confusion_matrix(y_test, pred, labels = Species) 3 4sb.heatmap(cnfm, annot = True, xticklabels = Species, yticklabels = Species);
1# classification report 2from sklearn.metrics import classification_report 3print(classification_report(y_test, pred))
Our math looks fine. It is pretty clear that our functions are working, and the output is perfect. However, there is nothing wrong with verifying truth. Let's check whether we get the same results with scikit-learn logistic regression.
1from sklearn.linear_model import LogisticRegression 2from sklearn import metrics 3logreg = LogisticRegression() 4logreg.fit(X_train, y_train) 5y_pred = logreg.predict(X_test) 6print('Test Accuracy for Scikit-Learn model:', metrics.accuracy_score(y_test, y_pred)* 100,'%')
1Test Accuracy for Scikit-Learn model: 96.66666666666667 %
1from sklearn.metrics import classification_report 2print(classification_report(y_test, y_pred))
The accuracy of the logistic regression model using functions and using scikit-learn is a match! (Which makes sense.)
Although we have accuracy nearing 97%, there is still room for improvement. It is always preferred to use the sklearn logistic regression model in production, as it takes less processing time (compare lines of code), and the package uses a highly optimized solver.
So, when does it come in handy to build algorithms from scratch? When you have to design a model to fit more complex problems or the problem of a new domain.
For this guide, we have selected a simple dataset with fewer features. Just keep in mind that the performance and selection of techniques are dependent on the volume and variety of the data.
One final tip: don't worry if the algorithm is not as optimized and fancy as other exciting packages. Those packages are the result of a strong base and consistent improvement in development.
I hope you liked this guide. If you have any queries, feel free to contact me at CodeAlphabet.