Building Classification Models with Scikit-Learn

Vivek Kumar

  • May 13, 2019
  • 10 Min read
Data
scikit-learn

Introduction

Scikit-learn is a widely used, open-source Python library that implements a range of machine learning operations, such as pre-processing, cross-validation, and visualization, through a unified interface. In this guide, you'll get a gist of a few of the classification algorithms available in Scikit-learn.

Brief on Scikit-Learn

To install the Scikit-learn library, execute the following command at the command prompt:

pip install scikit-learn

Unique features of Scikit-learn include:

  1. It is a simple tool for data mining and data analysis, offering a variety of classification, regression, and clustering algorithms such as support vector machines, random forests, gradient boosting, and k-means.
  2. It is open source, so anyone can access and customize it.
  3. It is built on top of libraries like NumPy, SciPy, and Matplotlib.

The Baseline

Throughout this guide, we will be using the following packages and modules:

import numpy as np
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Now, let us set up the scenario for the classification problems.

Consider a case where students have written an exam for three subjects and the results have been announced. The junior class students want to predict their respective results based on the data of their seniors.

Let us learn to implement the classifiers available in Scikit-learn, keeping the following information in mind:

  1. There are three columns, which represent the marks the students secured in three subjects.
  2. The last column holds the record of whether each student passed the exam or not.
  3. The first three rows, which consisted of garbage data, have already been removed.

Let us now create the train data to be used for classification:

X = [[18, 80, 44], [17, 70, 43], [16, 60, 38], [15, 54, 37], [16, 65, 40],
     [19, 90, 47], [17, 64, 39], [17, 70, 40], [15, 55, 37], [17, 75, 42],
     [18, 85, 43], [18, 75, 41], [18, 77, 41]]

Y = ['pass', 'pass', 'fail', 'fail', 'pass', 'pass', 'fail', 'fail', 'fail', 'pass', 'pass', 'fail', 'fail']

Here, X represents the independent variables, columns whose values do not depend on any other column, and Y represents the dependent variable, whose value depends on X, the set of independent variables.
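Scikit-learn happily accepts plain Python lists like these and converts them to NumPy arrays internally; if you want to check the shapes yourself, a quick optional sketch:

# Optional: convert the lists to NumPy arrays to inspect their shapes
X_arr = np.array(X)   # shape (13, 3): 13 students, 3 subject marks
Y_arr = np.array(Y)   # shape (13,): one 'pass'/'fail' label per student
print(X_arr.shape, Y_arr.shape)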

In the next step, we will initialize the test data to be used for testing the classification models.

test_data = [[19, 70, 43], [14, 75, 42], [18, 65, 40]]
test_labels = ['pass', 'pass', 'pass']

For simplicity, we keep the above data (X, Y, test_data, and test_labels) in plain Python lists; in practice, such data would typically be stored in a Pandas DataFrame.
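If you would rather work with a DataFrame, a minimal sketch might look like the following (the column names subject_1, subject_2, subject_3, and result are invented for illustration):

import pandas as pd

# The same training data as a DataFrame (column names are made up here)
train_df = pd.DataFrame(X, columns=['subject_1', 'subject_2', 'subject_3'])
train_df['result'] = Y
print(train_df.head())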

Introduction to Classification Algorithms

In supervised machine learning, there are two categories of algorithms:

  1. Regression
  2. Classification

Classification mainly deals with assigning class labels to data, where the output could be binary, ordinal, or nominal.

In this guide, we are going to briefly discuss four classifiers available in Scikit-learn:

  1. Decision tree
  2. Random forest
  3. Support Vector Machine (SVM)
  4. Logistic regression

Decision Tree

Decision trees fall into the class of supervised learning algorithms. They belong to the family of information-based learning algorithms, which use measures of information gain for learning. You can use decision trees for problems with both continuous and categorical input and target features.

The main job of a decision tree is to find the descriptive features that contain the most information about the target feature, and then split the dataset along the values of these features so that the target feature values of the resulting sub-datasets are as pure as possible.

Building an optimal decision tree is the key challenge when using a decision tree classifier. In general, many decision trees can be constructed from a given set of attributes. While some of these trees are more accurate than others, finding the optimal tree is computationally infeasible because of the large size of the search space.

Many efficient algorithms have nevertheless been developed to construct reasonably accurate decision trees. These algorithms usually employ a greedy approach that grows a decision tree by making a series of locally optimal decisions about which attribute to use for partitioning the data. Examples of greedy decision tree induction algorithms include ID3, C4.5, and CART.

# Implementing a basic decision tree using the DecisionTreeClassifier method
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(X, Y)
dtc_prediction = decision_tree.predict(test_data)
dtc_prediction

# Output:
# ['pass', 'pass', 'fail']
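If you want to see which features the fitted tree actually split on, recent versions of scikit-learn (0.21+) provide the export_text helper; a small sketch, with feature names invented for illustration:

from sklearn.tree import export_text

# Print the learned splits as plain text (feature names are invented here)
rules = export_text(decision_tree,
                    feature_names=['subject_1', 'subject_2', 'subject_3'])
print(rules)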

Random Forest

Random forest also falls into the category of supervised learning algorithms, and it is one of the most flexible and easy-to-use ones. Just as a forest is comprised of trees, a random forest creates decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by means of voting. We can also use a random forest to measure the importance of each feature, as shown after the code below.

Random forest is an ensemble method: it is made up of numerous decision trees and helps to tackle the problem of overfitting that individual decision trees suffer from. These decision trees are constructed by selecting random features (and samples) from the given dataset.

The final prediction is the class that receives the maximum number of votes from the decision trees.

# Building a basic random forest using the RandomForestClassifier method
random_forest = RandomForestClassifier()
random_forest.fit(X, Y)
rfc_prediction = random_forest.predict(test_data)
rfc_prediction

# Output:
# ['pass', 'pass', 'fail']
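As mentioned above, a fitted random forest reports how much each feature contributed to its decisions through the feature_importances_ attribute; a short sketch (again with invented feature names):

# Impurity-based importance of each column (marks in the three subjects)
for name, importance in zip(['subject_1', 'subject_2', 'subject_3'],
                            random_forest.feature_importances_):
    print(name, round(importance, 3))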

Support Vector Machine (SVM)

Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best separates the two classes.

In SVM, it is ideal to have a linear hyperplane between the two classes. For easily separable classes, an optimal linear hyperplane does the job. However, some classes cannot be separated by a linear hyperplane; say, one class of data points is spread in a circle and encloses the other class within it. To tackle this problem, SVM uses a method called the kernel trick. The kernel trick acts as a function which takes a low-dimensional input space and transforms it into a higher-dimensional space. It is mostly useful in non-linear separation problems like the one stated above. Simply put, it performs some complex data transformations to find a way to separate the data based on the defined classes.

# Building a basic SVM model using the SVC method
support_vector = SVC()
support_vector.fit(X, Y)
s_prediction = support_vector.predict(test_data)
s_prediction

# Output:
# ['pass', 'pass', 'fail']
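The kernel trick described above is controlled by SVC's kernel parameter; by default SVC uses the non-linear RBF kernel, but you can request a plain linear hyperplane explicitly. A small sketch:

# A linear hyperplane, for classes separable by a straight boundary
linear_svm = SVC(kernel='linear')
linear_svm.fit(X, Y)
print(linear_svm.predict(test_data))

# The default RBF kernel applies the kernel trick for non-linear boundaries
rbf_svm = SVC(kernel='rbf')
rbf_svm.fit(X, Y)
print(rbf_svm.predict(test_data))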

Logistic Regression

Logistic regression is one of the most commonly used machine learning algorithms, next to linear regression. In many ways, linear regression and logistic regression are similar, but the biggest difference is that linear regression is used to predict or forecast continuous values, while logistic regression is used for classification tasks.

The output of logistic regression follows a sigmoid curve, or S-curve, which maps the values of the independent variables to a probability for the dependent variable. In binary logistic regression, there are only two possible outcomes, such as 0 and 1. A threshold value makes prediction easy: if the predicted probability is less than the threshold, the outcome is taken as 0, and if it is greater, the outcome is taken as 1.

# Building a basic logistic regression model using the LogisticRegression method
logistic = LogisticRegression()
logistic.fit(X, Y)
l_prediction = logistic.predict(test_data)
l_prediction

# Output:
# ['fail', 'pass', 'fail']
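To see the threshold logic in action, LogisticRegression exposes the underlying probabilities through predict_proba; a short sketch (for a binary problem, predict() is equivalent to thresholding the positive-class probability at 0.5):

# Per-class probabilities for each test sample; columns follow logistic.classes_
probs = logistic.predict_proba(test_data)
print(logistic.classes_)   # alphabetical order: ['fail' 'pass']
print(probs)

# Manually applying the 0.5 threshold to the 'pass' probability (column 1)
manual = ['pass' if p > 0.5 else 'fail' for p in probs[:, 1]]
print(manual)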

Evaluating the Classification Algorithms Using Accuracy Metric

After getting a brief idea about the classification algorithms, let's see how to evaluate them.

In machine learning, merely fitting a model is not enough; finding the right model is what matters. Therefore, using evaluation metrics to measure how well a model has performed is essential. These metrics quantify how well the model fits the testing data.

There are many evaluation metrics for classification. One of the most fundamental is accuracy, the fraction of test samples that the model labels correctly.

To find the accuracy of a model, Scikit-learn provides the accuracy_score() function. Here, we compute the accuracy of the decision tree classifier built earlier. Note that accuracy_score() expects the true labels first and the predictions second:

# Evaluating the decision tree model using the accuracy score
tree_acc = accuracy_score(test_labels, dtc_prediction)
tree_acc

# Output:
# 0.6666666666666666
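The same metric can be used to compare all four classifiers built in this guide:

# Compare all four classifiers on the same test set
predictions = {
    'decision tree': dtc_prediction,
    'random forest': rfc_prediction,
    'SVM': s_prediction,
    'logistic regression': l_prediction,
}
for name, pred in predictions.items():
    print(name, accuracy_score(test_labels, pred))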

Conclusion

By going through this guide, you have gained an insight into several Scikit-learn classifiers and how to evaluate them with the accuracy metric.
