In this guide, we are going to implement a logistic regression model from scratch and compare its accuracy with the scikit-learn logistic regression package. Logistic regression is part of the classification technique of machine learning, which solves many problems in data science.
Logistic regression is also one of the simplest and most commonly used models. It implements a baseline for any binary classification problem with outcomes that are either True/False or Yes/No. It can predict whether mail is spam or predict diabetes in an individual, but it can't predict things like house prices. Another category is multinomial classification, in which more than two classes are determined, such as whether the weather will be sunny, rainy or humid, the species of animals, etc.
In this guide, we will explore a known multi-class problem using the iris dataset from the UCI Machine Learning Repository to predict the species of flowers based on their given dimensions.
So let us begin our journey!
The iris dataset contains observations of three iris species: Iris-setosa, Iris-versicolor, and Iris-virginica.
There are 50 observations of each species for a total of 150 observations with 4 features each (sepal length, sepal width, petal length, petal width). Based on the combination of the mentioned four features, data scientist R. A. Fisher developed a linear discriminant model to distinguish the species from each other.
Before we start, we need to define some ML terms.
Attributes (features): An attribute is a list of specifications in a dataset used to determine classification. In this case, the attributes are the petals' and sepals' length and width.
Target variable: In the context of ML, this is the variable that is or should be the output. Here the target variables are the three flower species.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# import libraries from subprocess import check_output import numpy as np # linear algebra import pandas as pd # data processing import warnings warnings.filterwarnings('ignore') #ignore warnings from math import ceil #Visualization import matplotlib.pyplot as plt import seaborn as sb from sklearn.metrics import confusion_matrix #Confusion matrix from sklearn.metrics import accuracy_score # Accuracy score # Spliting training and testing from sklearn.model_selection import train_test_split #Advanced optimization from scipy import optimize as op
1 2 3
# Loading the data data_iris = pd.read_csv('Iris.csv') # If your input csv file is placed with working directory # data_iris=pd.read_csv('../input/Iris.csv') # Enter the path of the directory where input csv is stored
data_iris.head() # first 5 entries of the dataset
data_iris.tail() # last 5 entries of dataset
The following command gives the descriptive statistics (percentile, mean, standard deviation) for all 150 observations.
1 2 3 4
Iris-setosa 50 Iris-virginica 50 Iris-versicolor 50 Name: Species, dtype: int64
Check for missing values.
Great! There is no missing data. Now let's describe the statistical details for each flower species.
1 2 3 4 5 6 7 8 9
print('Iris-setosa') setosa = data_iris['Species'] == 'Iris-setosa' print(data_iris[setosa].describe()) print('\nIris-versicolor') versicolor = data_iris['Species'] == 'Iris-versicolor' print(data_iris[versicolor].describe()) print('\nIris-virginica') virginica = data_iris['Species'] == 'Iris-virginica' print(data_iris[virginica].describe())
1 2 3 4 5 6 7
# The Histogram representation of the univariate plots for each measurement np = data_iris.drop('Id', axis=1) # dropping the Id np.hist(edgecolor='blue', linewidth=1.2) fig=plt.gcf() fig.set_size_inches(12,6) plt.show()
1 2 3 4 5
# ploting scatter plot with respect to petal length petalPlt = sb.FacetGrid(data_iris, hue="Species", size=6).map(plt.scatter, "PetalLengthCm", "PetalWidthCm") plt.legend(loc='upper left'); plt.title("Petal Length VS Width");
1 2 3 4
# Plotting scatter plot with respect to sepal length sepalPlt = sb.FacetGrid(data_iris, hue="Species", size=6).map(plt.scatter, "SepalLengthCm", "SepalWidthCm") plt.legend(loc='upper right'); plt.title("Sepal Length VS Width")
Here we can see that the petal features are giving a better cluster division. Let's check the bivariate relation between each pair of features.
1 2 3 4 5 6 7 8
# Using seaborn pairplot to see the bivariate relation between each pair of features import seaborn as sns sns.set_palette('husl') nl = data_iris.drop('Id', axis=1) # dropping the Id b = sns.pairplot(nl,hue="Species",diag_kind="kde", markers='+',size =3 ); plt.show()
Takeaways from the above visualization:
Iris-Setosa(in pink) is distinctly different from those of the other two species.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# Data setup import numpy as np Species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] # Number of examples m = data_iris.shape # Features n = 4 # Number of classes k = 3 X = np.ones((m,n + 1)) y = np.array((m,1)) X[:,1] = data_iris['PetalLengthCm'].values X[:,2] = data_iris['PetalWidthCm'].values X[:,3] = data_iris['SepalLengthCm'].values X[:,4] = data_iris['SepalWidthCm'].values # Labels y = data_iris['Species'].values # Mean normalization for j in range(n): X[:, j] = (X[:, j] - X[:,j].mean())
1 2 3 4 5 6 7 8 9
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 11) # it shows 80% of data is split for training and 20% of the data goes to testing. X = data_iris.drop(['Id', 'Species'], axis=1) y = data_iris['Species'] # print(X.head()) print(X_train.shape) # print(y.head()) print(y_test.shape)
(120, 5) (30,)
We can transform a logistic model to a predictor by using a link function. We will be using the sigmoid function for our model. The image shows the S-curve and the equation of the sigmoid function.
Regularization addresses the error of over-fitting or under-fitting the model while training the data. Over-fitting occurs when the variance is high and the model is complicated with lots of unnecessary curves and angles. The model is best fitted for the training data, but it performs poorly while testing the data. Under-fitting occurs when the variance is low and the model is too simple. It will give the best accuracy while testing the data, but it will not fit the training data well. When we apply regularization, the model adjusts θj without neglecting any features. The regularization parameter is added at the end of cost function and gradient descent function equations.
There are multiple ways to write mathematical equations into code format. Choose the method you are comfortable with, and make sure it justifies the mathematical expression correctly.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
# Sigmoid function def sigmoid(z): return 1.0 / (1 + np.exp(-z)) #____________________________________________________________________________# # Regularized cost function def reglrCostFunction(theta, X, y, lambda_s = 0.1): m = len(y) h = sigmoid(X.dot(theta)) J = (1 / m) * (-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1 - h))) reg = (lambda_s/(2 * m)) * np.sum(theta**2) J = J + reg return J #____________________________________________________________________________# # Regularized gradient function def reglrGradient(theta, X, y, lambda_s = 0.1): m, n = X.shape theta = theta.reshape((n, 1)) y = y.reshape((m, 1)) h = sigmoid(X.dot(theta)) reg = lambda_s * theta /m gd = ((1 / m) * X.T.dot(h - y)) gd = gd + reg return gd #____________________________________________________________________________# # Optimizing logistic regression (theta) def logisticRegression(X, y, theta): result = op.minimize(fun = reglrCostFunction, x0 = theta, args = (X, y), method = 'TNC', jac = reglrGradient) return result.x
1 2 3 4 5 6 7 8 9 10 11
# Training all_theta = np.zeros((k, n + 1)) # One vs all i = 0 for flower in Species: tmp_y = np.array(y_train == flower, dtype = int) optTheta = logisticRegression(X_train, tmp_y, np.zeros((n + 1,1))) all_theta[i] = optTheta i += 1
1 2 3 4 5
# Predictions Prob = sigmoid(X_test.dot(all_theta.T)) # probability for each flower pred = [Species[np.argmax(Prob[i, :])] for i in range(X_test.shape)] print(" Test Accuracy ", accuracy_score(y_test, pred) * 100 , '%')
Test Accuracy 96.66666666666667 %
Our model is performing well, as the accuracy is approximately 97%. Let's find out the correlation between features in the dataset using a heat map. A high positive or negative value will show that the features have a high correlation.
1 2 3 4
# Confusion Matrix cnfm = confusion_matrix(y_test, pred, labels = Species) sb.heatmap(cnfm, annot = True, xticklabels = Species, yticklabels = Species);
1 2 3
# classification report from sklearn.metrics import classification_report print(classification_report(y_test, pred))
Our math looks fine. It is pretty clear that our functions are working, and the output is perfect. However, there is nothing wrong with verifying truth. Let's check whether we get the same results with scikit-learn logistic regression.
1 2 3 4 5 6
from sklearn.linear_model import LogisticRegression from sklearn import metrics logreg = LogisticRegression() logreg.fit(X_train, y_train) y_pred = logreg.predict(X_test) print('Test Accuracy for Scikit-Learn model:', metrics.accuracy_score(y_test, y_pred)* 100,'%')
Test Accuracy for Scikit-Learn model: 96.66666666666667 %
from sklearn.metrics import classification_report print(classification_report(y_test, y_pred))
The accuracy of the logistic regression model using functions and using scikit-learn is a match! (Which makes sense.)
Although we have accuracy nearing 97%, there is still room for improvement. It is always preferred to use the sklearn logistic regression model in production, as it takes less processing time (compare lines of code), and the package uses a highly optimized solver.
So, when does it come in handy to build algorithms from scratch? When you have to design a model to fit more complex problems or the problem of a new domain.
For this guide, we have selected a simple dataset with fewer features. Just keep in mind that the performance and selection of techniques are dependent on the volume and variety of the data.
One final tip: don't worry if the algorithm is not as optimized and fancy as other exciting packages. Those packages are the result of a strong base and consistent improvement in development.
I hope you liked this guide. If you have any queries, feel free to contact me at CodeAlphabet.