Vaibhav Sharma

A Deep Learning Model to Perform Binary Classification

  • May 17, 2019
  • 9 Min read
  • 2,203 Views
Data
Deep Learning

Introduction

Binary classification is one of the most common and frequently tackled problems in the machine learning domain. In its simplest form, the user tries to classify an entity into one of two possible categories. For example, given the attributes of a fruit such as weight, color, and peel texture, classify the fruit as either a peach or an apple. Through the effective use of Neural Networks (Deep Learning Models), binary classification problems can be solved with a fairly high degree of accuracy.

In this guide, we will classify molecules as either active or inactive based on physical properties such as the mass of the molecule, the radius of gyration, and electronegativity. The data set has been created just for the sake of this tutorial and is only indicative. To avoid confusion, the properties are listed simply as prop_1, prop_2, and so on, instead of mass, radius of gyration, etc.

The Keras library, which ships with the TensorFlow library, will be used to build the Deep Learning model.

Importing Data

Let us have a look at a sample of the dataset we will be working with:

import pandas as pd
df = pd.read_csv('molecular_activity.csv')
print(df.head())

Output

   prop_1  prop_2  prop_3  prop_4  Activity
0    4.06   71.01   57.20    5.82         1
1    3.63   65.62   52.68    5.44         1
2    3.63   68.90   58.29    6.06         1
3    4.11   75.59   62.81    6.44         1
4    4.00   70.86   58.05    6.06         1

As mentioned before, prop_1, prop_2, prop_3, and prop_4 are the properties associated with the molecules, and Activity can be thought of as antibiotic activity or anti-inflammatory activity. If Activity is 1, the molecule is active; otherwise it is not. The whole data set is provided in the appendix for anyone who wants to replicate the example.

Splitting Dataset into Train and Test Feature Matrix and Dependent Vector

The dataset we imported needs pre-processing before it can be fed into the neural network. The first step is to split it into the independent features and the dependent vector. For our molecular activity dataset, prop_1, prop_2, prop_3, and prop_4 are the independent features, while Activity is the dependent variable.

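# Collect all column names, then drop the target column from the list
properties = list(df.columns.values)
properties.remove('Activity')
print(properties)
# Feature matrix (independent variables)
X = df[properties]
# Dependent vector (target)
y = df['Activity']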

The above code first creates a list of the column names available in the dataset and assigns it to the variable properties. Subsequently, the dependent variable name (Activity) is removed from properties. The X matrix is defined by taking all the data in the data frame (df) apart from the Activity column. Similarly, the y vector is created by taking the Activity column from df.
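
An equivalent and slightly more compact way to separate the features from the target is the pandas DataFrame.drop method. The sketch below is only illustrative: instead of reading molecular_activity.csv, it rebuilds the five sample rows printed earlier as an in-memory frame, and the names sample, X_sample, and y_sample are hypothetical.

```python
import pandas as pd

# The five rows shown by df.head() earlier, rebuilt in memory for illustration.
sample = pd.DataFrame({
    'prop_1': [4.06, 3.63, 3.63, 4.11, 4.00],
    'prop_2': [71.01, 65.62, 68.90, 75.59, 70.86],
    'prop_3': [57.20, 52.68, 58.29, 62.81, 58.05],
    'prop_4': [5.82, 5.44, 6.06, 6.44, 6.06],
    'Activity': [1, 1, 1, 1, 1],
})

# Feature matrix: every column except the target.
X_sample = sample.drop(columns=['Activity'])
# Dependent vector: the target column on its own.
y_sample = sample['Activity']

print(list(X_sample.columns))
print(X_sample.shape, y_sample.shape)
```

Because drop returns a new frame by default, the original frame still contains the Activity column afterwards.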

print(X.head())

Output

   prop_1  prop_2  prop_3  prop_4
0    4.06   71.01   57.20    5.82
1    3.63   65.62   52.68    5.44
2    3.63   68.90   58.29    6.06
3    4.11   75.59   62.81    6.44
4    4.00   70.86   58.05    6.06

print(y.head())

Output

0    1
1    1
2    1
3    1
4    1

The next step is to divide the data into train and test sets. This is achieved using the train_test_split function provided in the model_selection module of the sklearn package.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

The above code splits the data set such that a randomly selected seventy percent of the data is put into the train set, while the remaining thirty percent is kept aside as the test set, which is used to evaluate the trained model. Setting random_state=0 makes the split reproducible.
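
To make the 70/30 split concrete, the sketch below applies the same call to placeholder data. The row count of 540 is an inference from the training log shown later in this guide (378 training samples plus 162 test samples), and the arrays X_demo and y_demo are hypothetical stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 540 rows, 4 features (assumed size, see the note above).
X_demo = np.arange(540 * 4, dtype=float).reshape(540, 4)
y_demo = np.arange(540) % 2  # alternating 0/1 labels, purely illustrative

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Number of rows that ended up in the training part and the held-out part.
print(X_tr.shape, X_te.shape)
```

If the two classes are imbalanced, passing stratify=y_demo to train_test_split keeps the class proportions roughly the same in both parts.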

Model Creation, Compilation, Fitting, and Evaluation

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(4,)),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
])

The above code creates a Neural Network that has three layers. There are two hidden layers of 16 nodes each and one output node; the Flatten layer at the top only declares the shape of the four input features and has no trainable weights. The output node uses the sigmoid activation function, which squeezes any real-valued input into the range between 0 and 1 along an S-shaped curve. The other two layers use ReLU (Rectified Linear Unit) as the activation function. ReLU is a half-rectified function; that is, for all inputs less than or equal to 0 (e.g. -120, -6.7, -0.0344, 0) the output is 0, while any positive input (e.g. 10, 15, 34) is passed through unchanged. A single output unit is used because, for each record in X, one probability is predicted. If it is high (> 0.9), then the molecule is very likely active. If it is low (< 0.2), then it is very likely inactive.
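
The two activation functions are simple enough to write out by hand. The following is a minimal sketch in plain Python, not the TensorFlow implementation; the helper names relu and sigmoid are defined here only for illustration, and the ReLU inputs are the example values from the paragraph above.

```python
import math

def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0.0, x)

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Inputs taken from the examples in the text.
for value in [-120, -6.7, -0.0344, 0, 10, 15, 34]:
    print(value, '->', relu(value))

# Sigmoid is 0.5 at zero and approaches 0 or 1 as the input moves away from zero.
for value in [-5, 0, 5]:
    print(value, '->', round(sigmoid(value), 4))
```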

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=50, batch_size=1)
test_loss, test_acc = model.evaluate(X_test, y_test)

The above code compiles the network. It uses Adam, an adaptive optimizer that incorporates momentum. The loss function used is binary_crossentropy. For binary classification problems that give output in the form of a probability, binary_crossentropy is usually the loss function of choice. mean_squared_error may also be used instead of binary_crossentropy. The metric tracked is accuracy. The model is trained for 50 epochs with a batch size of 1. Finally, the trained model is evaluated on the test set to check the accuracy.
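
For a true label y (0 or 1) and a predicted probability p, binary cross-entropy is -(y log p + (1 - y) log(1 - p)), averaged over all samples. The sketch below computes it in plain Python to show how the loss rewards confident correct predictions and punishes confident wrong ones. The function here is a hand-written helper, not the Keras implementation, and both the clipping constant eps and the sample values are assumptions chosen for illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the true labels
confident_wrong = [0.1, 0.9, 0.2, 0.8]  # probabilities far from the true labels

loss_right = binary_crossentropy(labels, confident_right)
loss_wrong = binary_crossentropy(labels, confident_wrong)
print(loss_right)  # small loss
print(loss_wrong)  # much larger loss
```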

The complete code, assembled from the steps above, is as follows:

import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
import numpy as np
df = pd.read_csv('molecular_activity.csv')
properties = list(df.columns.values)
properties.remove('Activity')
X = df[properties]
y = df['Activity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(4,)),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=50, batch_size=1)

test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)

Output

Epoch 1/50
378/378 [==============================] - 1s 2ms/sample - loss: 0.6704 - acc: 0.6958
Epoch 2/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.5604 - acc: 0.7672
Epoch 3/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.5554 - acc: 0.7725
Epoch 4/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.5536 - acc: 0.7751
Epoch 5/50
'
'
'
Epoch 44/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4138 - acc: 0.8360
Epoch 45/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4214 - acc: 0.8280
Epoch 46/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4268 - acc: 0.8333
Epoch 47/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4130 - acc: 0.8280
Epoch 48/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4146 - acc: 0.8307
Epoch 49/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4161 - acc: 0.8333
Epoch 50/50
378/378 [==============================] - 1s 1ms/sample - loss: 0.4111 - acc: 0.8254
162/162 [==============================] - 0s 421us/sample - loss: 0.3955 - acc: 0.8333
Test accuracy: 0.8333333

The model reaches a test accuracy of just over 83%. It may be possible to improve this further by tuning the number of epochs, the number of layers, or the number of nodes per layer.

Now, let us use the trained model to predict probability values for new data. The code below passes two feature arrays to the trained model and prints the predicted probability for each.

a = np.array([[4.02, 70.86, 62.05, 7.0], [2.99, 60.30, 57.46, 6.06]])
print(model.predict(a))

Output

[[0.8603756 ]
 [0.05907778]]
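
The model returns probabilities rather than class labels. A common way to turn them into active/inactive decisions is to apply a cut-off. The sketch below uses the two probabilities printed above and an assumed threshold of 0.5; the threshold is a choice, not something fixed by the model.

```python
import numpy as np

# Probabilities copied from the prediction output above.
probabilities = np.array([[0.8603756], [0.05907778]])

# Assumed decision threshold of 0.5; adjust it if false positives and
# false negatives carry different costs.
threshold = 0.5
labels = (probabilities > threshold).astype(int).ravel()

print(labels)  # 1 = active, 0 = inactive
```

With this threshold, the first molecule is labelled active and the second inactive.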

Conclusion

In this example, we developed a working Neural Network for a binary classification problem. The same problem can also be solved using other algorithms such as Logistic Regression, Naive Bayes, and K-Nearest Neighbours. The choice of algorithm should be driven by the problem at hand and by factors such as how much data is available and how much computation power can be spared. Deep Networks or Neural Networks are generally recommended when a large amount of data is available.
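
As a point of comparison, a Logistic Regression baseline takes only a few lines with scikit-learn. The sketch below is not run on the molecular activity data: it uses synthetic data generated from an arbitrary noisy linear rule, so the names X_syn, y_syn, and weights and the resulting accuracy are purely illustrative and say nothing about how this model would perform on the real data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 500 rows, 4 features, labels from a noisy linear rule.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 4))
weights = np.array([1.5, -2.0, 1.0, 0.5])  # arbitrary illustrative weights
noise = rng.normal(scale=0.5, size=500)
y_syn = (X_syn @ weights + noise > 0).astype(int)

X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

# Fit the baseline and report accuracy on the held-out part.
baseline = LogisticRegression()
baseline.fit(X_syn_train, y_syn_train)
baseline_acc = baseline.score(X_syn_test, y_syn_test)
print('Baseline test accuracy:', baseline_acc)
```

Running a simple baseline like this on the same train/test split as the neural network gives a reference point for judging whether the extra model complexity pays off.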

Appendix

I have compiled the complete data set which can be found at my GitHub.
