Natural language processing (NLP) is a broad domain within data science and artificial intelligence. Its applications include sentiment analysis, machine translation, speech recognition, chatbot creation, market intelligence, and text classification. PyTorch is a popular and powerful deep learning library with rich capabilities for natural language processing tasks. In this guide, you will learn how to perform the natural language processing task of text classification with PyTorch.
To begin, load the required libraries. The first package to import is torch, which is used to define tensors and perform mathematical operations. The second is torchtext, the NLP library in PyTorch that contains data processing utilities.
import torch
import torchtext
The next step is to load the dataset. The torchtext library contains the module torchtext.datasets, which provides several datasets for natural language processing tasks. In this guide, you will carry out text classification using the built-in SogouNews dataset. It is a supervised news dataset with five labels: 0 for Sports, 1 for Finance, 2 for Entertainment, 3 for Automobile, and 4 for Technology.
The lines of code below load the dataset. Setting NGRAMS = 2 ensures that each text entry in the dataset is represented as a list of single words plus bi-grams.
from torchtext.datasets import text_classification
NGRAMS = 2
import os
if not os.path.isdir('./.data'):
    os.mkdir('./.data')
train_dataset, test_dataset = text_classification.DATASETS['SogouNews'](
    root='./.data', ngrams=NGRAMS, vocab=None)
BATCH_SIZE = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Output:
sogou_news_csv.tar.gz: 384MB [00:04, 94.7MB/s]
450000lines [08:18, 902.49lines/s]
450000lines [17:33, 427.10lines/s]
60000lines [02:19, 428.77lines/s]
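If you are curious what the bi-gram representation looks like, the short snippet below is a minimal illustration using torchtext's ngrams_iterator utility. The token list is made up purely for demonstration.

from torchtext.data.utils import ngrams_iterator

# Hypothetical token list, used only for illustration
tokens = ['pytorch', 'makes', 'text', 'classification', 'easy']

# With ngrams=2, each document is represented by its single words plus
# the adjacent word pairs joined by a space.
print(list(ngrams_iterator(tokens, 2)))
# ['pytorch', 'makes', 'text', 'classification', 'easy',
#  'pytorch makes', 'makes text', 'text classification', 'classification easy']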
You have loaded the data, and the next step is to set up the model architecture.

The model architecture consists of an embedding layer followed by a linear layer. The nn.EmbeddingBag layer computes the mean of a "bag" of embeddings, and no padding is needed because the text lengths are saved in offsets.

The first two lines of code below import the required modules. The third line defines the TextSentiment() model class, which subclasses nn.Module.

Then, set up the architecture with the __init__() method. The self argument in __init__() represents the instance of the object itself. The parameter vocab_size represents the size of the vocabulary, embed_dim represents the dimension of the word embeddings, and num_class represents the number of classes in the target variable.

You have defined the layers, but you also need to define how they interact with each other. This is done with the forward() method; in simple terms, data moves from input to output in a forward pass.
import torch.nn as nn
import torch.nn.functional as F
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
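To make the role of offsets concrete, the toy snippet below (with made-up sizes and indices) shows how nn.EmbeddingBag collapses two variable-length documents, packed into one flat index tensor, into one mean embedding vector per document.

import torch
import torch.nn as nn

# Toy example: vocabulary of 10 ids, 3-dimensional embeddings (made-up sizes)
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='mean')

# Two "documents" concatenated into a single 1-D tensor of word ids
text = torch.tensor([1, 2, 4, 5, 4, 3, 9])
# offsets mark where each document starts: doc 1 = ids 0-3, doc 2 = ids 4-6
offsets = torch.tensor([0, 4])

print(bag(text, offsets).shape)  # torch.Size([2, 3]) -> one mean vector per document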
The model architecture is set, and the next step is to define the arguments discussed above and use them to instantiate the model. This is done with the code below.
VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)
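As a quick, optional sanity check, you can print the model and count its trainable parameters; the embedding table dominates the count, since it stores one EMBED_DIM-sized vector for every entry in the large n-gram vocabulary.

print(model)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), 'trainable parameters')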
The text entries in the dataset have different lengths, which makes it necessary to write a function that generates data batches. This task is performed by the generate_batch() function in the code below.
def generate_batch(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label
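To see what generate_batch() produces, here is a small illustration on a hypothetical mini-batch of (label, token-id tensor) pairs, mirroring the entries the dataset yields.

# Hypothetical mini-batch: two entries of different lengths
batch = [(0, torch.tensor([3, 7, 1])), (2, torch.tensor([5, 9]))]
text, offsets, label = generate_batch(batch)
print(text)     # tensor([3, 7, 1, 5, 9]) -> all ids concatenated
print(offsets)  # tensor([0, 3])          -> start index of each entry
print(label)    # tensor([0, 2])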
The next step is to define the training function. This is done with the helper function train_func(sub_train_) in the code below.

Import the DataLoader module from the torch.utils.data utility to make it easy to load the data in parallel.

The main arguments are:

batch_size: The number of samples to load per batch. The default value is 1.

collate_fn: Merges a list of samples to form a mini-batch of tensor(s). The generate_batch function created earlier is passed to this argument.

shuffle: Set to True to have the data reshuffled at every epoch. The default is False.

You'll also create the test(data_) helper function to evaluate the model on the validation and test datasets.
from torch.utils.data import DataLoader
def train_func(sub_train_):
    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the learning rate
    scheduler.step()

    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets)
            loss += criterion(output, cls).item()
            acc += (output.argmax(1) == cls).sum().item()

    return loss / len(data_), acc / len(data_)
Before training the model, you must split the data, which is done with the random_split function from torch.utils.data.dataset. Use the stochastic gradient descent optimizer to optimize the network, and since this is a classification problem, use cross-entropy as the loss function. The lr argument specifies the learning rate of the optimizer. Finally, each epoch's performance and the time taken are printed.
import time
from torch.utils.data.dataset import random_split
N_EPOCHS = 5
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs / 60
    secs = secs % 60

    print('Epoch: %d' % (epoch + 1), " | time in %d minutes, %d seconds" % (mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')
Output:
Epoch: 1 | time in 18 minutes, 17 seconds
    Loss: 0.0128(train) | Acc: 93.8%(train)
    Loss: 0.0000(valid) | Acc: 95.0%(valid)
Epoch: 2 | time in 17 minutes, 59 seconds
    Loss: 0.0080(train) | Acc: 96.1%(train)
    Loss: 0.0000(valid) | Acc: 95.9%(valid)
Epoch: 3 | time in 18 minutes, 9 seconds
    Loss: 0.0065(train) | Acc: 96.8%(train)
    Loss: 0.0000(valid) | Acc: 96.3%(valid)
Epoch: 4 | time in 18 minutes, 19 seconds
    Loss: 0.0056(train) | Acc: 97.2%(train)
    Loss: 0.0000(valid) | Acc: 95.9%(valid)
Epoch: 5 | time in 18 minutes, 17 seconds
    Loss: 0.0048(train) | Acc: 97.6%(train)
    Loss: 0.0000(valid) | Acc: 96.4%(valid)
You can see from the output above that the model achieves a very good training accuracy of 97.6% in the fifth epoch. Next, evaluate the model's performance on the test data.

The lines of code below compute and display the result on the test data.
print('Model result on test data...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')
Output:
Model result on test data...
    Loss: 0.0000(test) | Acc: 96.2%(test)
The model accuracy on the test data is 96.2%, which is consistent with the model performance on training data. This shows that the model is achieving good results.
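As a final step, you may want to classify a new piece of raw text with the trained model. The snippet below is a minimal sketch of single-example inference, assuming the vocabulary from train_dataset, the NGRAMS value set earlier, and the label names listed at the start of the guide; the sample sentence is made up for illustration.

from torchtext.data.utils import get_tokenizer, ngrams_iterator

labels = ['Sports', 'Finance', 'Entertainment', 'Automobile', 'Technology']
vocab = train_dataset.get_vocab()
tokenizer = get_tokenizer('basic_english')

def predict(text_str):
    with torch.no_grad():
        # Convert the raw string into the same unigram + bi-gram ids used in training
        ids = torch.tensor([vocab[token] for token in
                            ngrams_iterator(tokenizer(text_str), NGRAMS)])
        # A single example starts at offset 0
        output = model(ids.to(device), torch.tensor([0]).to(device))
        return labels[output.argmax(1).item()]

print(predict('the home team won the championship game last night'))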
In this guide, you learned how to build a text classification model with the high-performance deep learning library PyTorch. You learned the architecture and key components of a text classification pipeline built with the torch and torchtext packages of PyTorch.
To learn more about data science using Python, please refer to the following guides.