Binary classification is one of the most frequently tackled problems in machine learning. In its simplest form, the user tries to classify an entity into one of two possible categories. For example, given attributes of a fruit such as weight, color, and peel texture, classify the fruit as either a peach or an apple. Through the effective use of neural networks (deep learning models), binary classification problems can be solved to a fairly high degree of accuracy.
In this guide, we will classify molecules as either active or inactive based on physical properties such as the mass of the molecule, radius of gyration, electronegativity, etc. The data set has been created just for the sake of this tutorial and is only indicative. To avoid confusion, the properties are listed simply as prop_1, prop_2, and so on, instead of mass, radius of gyration, etc.
The Keras library, which ships with TensorFlow, will be used to build the deep learning model.
Let us have a look at a sample of the dataset we will be working with:
import pandas as pd
df = pd.read_csv('molecular_activity.csv')
print(df.head())
Output
   prop_1  prop_2  prop_3  prop_4  Activity
0    4.06   71.01   57.20    5.82         1
1    3.63   65.62   52.68    5.44         1
2    3.63   68.90   58.29    6.06         1
3    4.11   75.59   62.81    6.44         1
4    4.00   70.86   58.05    6.06         1
As mentioned before, prop_1, prop_2, prop_3, and prop_4 are the properties associated with the molecules, and Activity can be thought of as antibiotic activity or anti-inflammatory activity. If Activity is 1, the molecule is active; otherwise it is not. The whole data set is provided in the appendix for anyone who wants to replicate the example.
The dataset we imported needs pre-processing before it can be fed into the neural network. The first step is to split it into the independent features and the dependent vector. For our molecular activity dataset, prop_1, prop_2, prop_3, and prop_4 are the independent features, while Activity is the dependent variable.
properties = list(df.columns.values)
properties.remove('Activity')
print(properties)
X = df[properties]
y = df['Activity']
The above code first creates a list of the column names in the dataset and assigns it to the variable properties. The name of the dependent variable (Activity) is then removed from properties. The X matrix is defined by taking all the columns of the data frame (df) apart from Activity. Similarly, the y vector is created by taking the Activity column from df.
print(X.head())
Output
   prop_1  prop_2  prop_3  prop_4
0    4.06   71.01   57.20    5.82
1    3.63   65.62   52.68    5.44
2    3.63   68.90   58.29    6.06
3    4.11   75.59   62.81    6.44
4    4.00   70.86   58.05    6.06
print(y.head())
Output
0    1
1    1
2    1
3    1
4    1
The next step is to divide the data into train and test sets. This is done with the train_test_split function provided in the model_selection module of the sklearn package.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
The above code splits the data set so that a randomly selected seventy percent of the records goes into the train set and the remaining thirty percent is kept aside as the test set, which will be used for validation. Setting random_state=0 fixes the random seed so that the split is reproducible.
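To make the 70/30 arithmetic concrete, the sketch below mimics the split on row indices using only the standard library. It is a simplified stand-in, not scikit-learn's implementation: the helper name split_indices, the use of random.Random for shuffling, and rounding the test share up are all assumptions made for illustration. The figure of 540 rows is inferred from the training log shown later in this guide, which reports 378 training and 162 test samples.

```python
import math
import random


def split_indices(n_rows, test_size=0.3, seed=0):
    """Shuffle row indices and split them into train and test lists.

    Hypothetical helper for illustration only; it will not reproduce the
    exact rows that sklearn's train_test_split selects.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # reproducible shuffle
    n_test = math.ceil(n_rows * test_size)  # assumption: round the test share up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(540)
print(len(train_idx), len(test_idx))  # → 378 162
```

Every row index ends up in exactly one of the two lists, which is the property that matters: the test rows are never seen during training.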
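import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(4,)),            # four input properties per molecule
    keras.layers.Dense(16, activation=tf.nn.relu),     # first hidden layer
    keras.layers.Dense(16, activation=tf.nn.relu),     # second hidden layer
    keras.layers.Dense(1, activation=tf.nn.sigmoid),   # single probability output
])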
The above code creates a neural network with three dense layers: two hidden layers of 16 nodes each and a single output node. The output node uses the sigmoid activation function, which squashes any real-valued input into the range between 0 and 1. The two hidden layers use ReLU (Rectified Linear Unit) as the activation function. ReLU is a half-rectified function: for any input less than or equal to 0 (e.g. -120, -6.7, -0.0344, 0) the output is 0, while any positive input (e.g. 10, 15, 34) is passed through unchanged. A single output unit is used because, for each record in X, the network predicts one probability. If it is high (> 0.9), the molecule is very likely active. If it is low (< 0.2), it is very likely inactive.
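The two activation functions can be checked by hand. The following is a minimal standard-library sketch, not the TensorFlow implementation; the function names relu and sigmoid are chosen here for illustration, and the ReLU inputs are the example values from the paragraph above.

```python
import math


def relu(x):
    """Return 0 for inputs at or below zero, and the input itself otherwise."""
    return max(0.0, float(x))


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


print([relu(v) for v in [-120, -6.7, -0.0344, 0, 10, 15, 34]])
# → [0.0, 0.0, 0.0, 0.0, 10.0, 15.0, 34.0]

print(sigmoid(0))                            # → 0.5
print(sigmoid(-4) < 0.2, sigmoid(4) > 0.9)   # → True True
```

An input of 0 to the sigmoid maps to exactly 0.5, so strongly negative inputs to the output node correspond to "inactive" and strongly positive inputs to "active".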
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=50, batch_size=1)
test_loss, test_acc = model.evaluate(X_test, y_test)
The above code compiles the network. It uses Adam, an adaptive optimizer that incorporates momentum. The loss function is binary_crossentropy. For binary classification problems whose output is a probability, binary_crossentropy is usually the loss function of choice. mean_squared_error may also be used instead. The metric tracked is accuracy. The model is trained for 50 epochs with a batch size of 1. Finally, the trained model is evaluated on the test set to check its accuracy.
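To see how the two loss functions differ, the sketch below computes both by hand with the standard library. It follows the textbook definitions rather than the Keras source; the clipping constant eps and the sample predictions are assumptions made for illustration.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


labels = [1, 1]              # both molecules are truly active (made-up data)
good = [0.9, 0.9]            # two confident, correct predictions
bad = [0.9, 0.01]            # the second prediction is confidently wrong

print(binary_crossentropy(labels, good), binary_crossentropy(labels, bad))
print(mean_squared_error(labels, good), mean_squared_error(labels, bad))
```

Running it shows that a confidently wrong prediction inflates the cross-entropy far more than the squared error, which can never exceed 1 per sample. That stronger penalty is one common reason to prefer binary_crossentropy for probability outputs.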
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
import numpy as np
df = pd.read_csv('molecular_activity.csv')
properties = list(df.columns.values)
properties.remove('Activity')
X = df[properties]
y = df['Activity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(4,)),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=50, batch_size=1)

test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)
Output
Epoch 1/50
378/378 [==============================] - 1s 2ms/sample - loss: 0.6704 - acc: 0.6958
Epoch 2/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.5604 - acc: 0.7672
Epoch 3/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.5554 - acc: 0.7725
Epoch 4/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.5536 - acc: 0.7751
Epoch 5/50
...
Epoch 44/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4138 - acc: 0.8360
Epoch 45/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4214 - acc: 0.8280
Epoch 46/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4268 - acc: 0.8333
Epoch 47/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4130 - acc: 0.8280
Epoch 48/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4146 - acc: 0.8307
Epoch 49/50
378/378 [==============================] - 0s 1ms/sample - loss: 0.4161 - acc: 0.8333
Epoch 50/50
378/378 [==============================] - 1s 1ms/sample - loss: 0.4111 - acc: 0.8254
162/162 [==============================] - 0s 421us/sample - loss: 0.3955 - acc: 0.8333
Test accuracy: 0.8333333
The test accuracy achieved by the model is just over 83%. It may be possible to improve it by tuning the number of epochs, the number of layers, or the number of nodes per layer.
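The accuracy metric is simply the fraction of predictions that match the true labels. The reported value of 0.8333333 on 162 test samples corresponds to 135 correct predictions (135/162). The sketch below shows the calculation on made-up labels; the helper name accuracy is chosen for illustration and is not a Keras function.

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that equal the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: six samples, five of them predicted correctly.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # → 0.8333333333333334
```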
Now, let us use the trained model to predict probability values for new data. The code below passes two feature arrays to the trained model and prints the predicted probabilities.
a = np.array([[4.02, 70.86, 62.05, 7.0], [2.99, 60.30, 57.46, 6.06]])
print(model.predict(a))
Output
[[0.8603756 ]
 [0.05907778]]
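The model returns probabilities, so a cutoff is needed to turn them into active/inactive labels. The sketch below applies a threshold of 0.5, which is a common default but an assumption here, since this guide does not prescribe one; the helper name to_labels is likewise illustrative. The two probabilities are copied from the output above.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 (active) or 0 (inactive)."""
    return [1 if p >= threshold else 0 for p in probabilities]


predicted = [0.8603756, 0.05907778]  # values from the prediction output above
print(to_labels(predicted))  # → [1, 0]
```

Raising the threshold makes the "active" label stricter, while lowering it makes the model flag more molecules as active.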
In this example, we developed a working neural network for a binary classification problem. The same problem can also be solved using other algorithms such as Logistic Regression, Naive Bayes, and K-Nearest Neighbours. The choice of algorithm should be driven by the problem at hand and by factors such as the amount of data available and the computational power at your disposal. Deep networks or neural networks are generally recommended when a large amount of data is available.
I have compiled the complete data set, which can be found on my GitHub.