Natural language processing (NLP) is a broad domain within data science and artificial intelligence. Its applications include sentiment analysis, machine translation, speech recognition, chatbot creation, market intelligence, and text classification. PyTorch is a popular and powerful deep learning library with rich capabilities for natural language processing tasks. In this guide, you will learn how to perform the natural language processing task of text classification with PyTorch.
To begin, load the required libraries. The first package to import is torch, which is used to define tensors and perform mathematical operations. The second is torchtext, the NLP library in PyTorch that contains data processing utilities.
import torch
import torchtext
The next step is to load the dataset. The torchtext library contains the module torchtext.datasets, which provides several datasets for natural language processing tasks. In this guide, you will carry out text classification using the built-in SogouNews dataset. It is a supervised news dataset with five labels: 0 for Sports, 1 for Finance, 2 for Entertainment, 3 for Automobile, and 4 for Technology.
The lines of code below load the dataset. Setting NGRAMS = 2 ensures that each text entry in the dataset is represented as a list of single words plus bi-grams.
from torchtext.datasets import text_classification
NGRAMS = 2
import os
if not os.path.isdir('./.data'):
    os.mkdir('./.data')
train_dataset, test_dataset = text_classification.DATASETS['SogouNews'](
    root='./.data', ngrams=NGRAMS, vocab=None)
BATCH_SIZE = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Output:
sogou_news_csv.tar.gz: 384MB [00:04, 94.7MB/s]
450000lines [08:18, 902.49lines/s]
450000lines [17:33, 427.10lines/s]
60000lines [02:19, 428.77lines/s]
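If you are curious what the bi-gram representation looks like, the short snippet below is a minimal illustration using torchtext's ngrams_iterator utility. The token list is made up purely for demonstration.

from torchtext.data.utils import ngrams_iterator

# Hypothetical token list, used only for illustration
tokens = ['pytorch', 'makes', 'text', 'classification', 'easy']

# With ngrams=2, each document is represented by its single words plus
# the adjacent word pairs joined by a space.
print(list(ngrams_iterator(tokens, 2)))
# ['pytorch', 'makes', 'text', 'classification', 'easy',
#  'pytorch makes', 'makes text', 'text classification', 'classification easy']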
You have loaded the data, and the next step is to set up the model architecture.

The model architecture consists of an embedding layer followed by a linear layer. The nn.EmbeddingBag layer computes the mean of a "bag" of embeddings, and no padding is needed because the text lengths are saved in offsets.

The first two lines of code below import the required modules. The third line defines the TextSentiment() model class, which subclasses nn.Module.

Then, set up the architecture with the __init__() method. The self argument in __init__() represents the instance of the object itself. The parameter vocab_size represents the size of the vocabulary, embed_dim represents the dimension of the word embeddings, and num_class represents the number of classes in the target variable.

You have defined the layers, but you also need to define how they interact with each other. This is done with the forward() method; in simple terms, data moves from input to output in a forward pass.
import torch.nn as nn
import torch.nn.functional as F
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
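To make the role of offsets concrete, the toy snippet below (with made-up sizes and indices) shows how nn.EmbeddingBag collapses two variable-length documents, packed into one flat index tensor, into one mean embedding vector per document.

import torch
import torch.nn as nn

# Toy example: vocabulary of 10 ids, 3-dimensional embeddings (made-up sizes)
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='mean')

# Two "documents" concatenated into a single 1-D tensor of word ids
text = torch.tensor([1, 2, 4, 5, 4, 3, 9])
# offsets mark where each document starts: doc 1 = ids 0-3, doc 2 = ids 4-6
offsets = torch.tensor([0, 4])

print(bag(text, offsets).shape)  # torch.Size([2, 3]) -> one mean vector per document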
The model architecture is set, and the next step is to define the arguments discussed above and use them to instantiate the model. This is done with the code below.
VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)
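As a quick, optional sanity check, you can print the model and count its trainable parameters; the embedding table dominates the count, since it stores one EMBED_DIM-sized vector for every entry in the large n-gram vocabulary.

print(model)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), 'trainable parameters')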
The text entries in the dataset have different lengths, which makes it necessary to write a function that generates data batches. This task is performed by the generate_batch() function in the code below.
def generate_batch(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label
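To see what generate_batch() produces, here is a small illustration on a hypothetical mini-batch of (label, token-id tensor) pairs, mirroring the entries the dataset yields.

# Hypothetical mini-batch: two entries of different lengths
batch = [(0, torch.tensor([3, 7, 1])), (2, torch.tensor([5, 9]))]
text, offsets, label = generate_batch(batch)
print(text)     # tensor([3, 7, 1, 5, 9]) -> all ids concatenated
print(offsets)  # tensor([0, 3])          -> start index of each entry
print(label)    # tensor([0, 2])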
The next step is to define the training function. This is done with the helper function train_func(sub_train_) in the code below.

Import the DataLoader module from the torch.utils.data utility to make it easy to load the data in parallel.

The main arguments are:

batch_size: The number of samples to load per batch. The default value is 1.

collate_fn: Merges a list of samples to form a mini-batch of tensor(s). The generate_batch function created earlier is passed to this argument.

shuffle: Set to True to have the data reshuffled at every epoch. The default is False.

You'll also create the test(data_) helper function to evaluate the model on the validation and test datasets.
from torch.utils.data import DataLoader
def train_func(sub_train_):
    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the learning rate
    scheduler.step()

    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets)
            loss += criterion(output, cls).item()
            acc += (output.argmax(1) == cls).sum().item()

    return loss / len(data_), acc / len(data_)
Before training the model, you must split the data, which is done with the random_split function from torch.utils.data.dataset. Use the stochastic gradient descent optimizer to optimize the network, and since this is a classification problem, use cross-entropy as the loss function. The lr argument specifies the learning rate of the optimizer. Finally, each epoch's performance and the time taken are printed.
import time
from torch.utils.data.dataset import random_split
N_EPOCHS = 5
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs / 60
    secs = secs % 60

    print('Epoch: %d' % (epoch + 1), " | time in %d minutes, %d seconds" % (mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')
Output:
Epoch: 1 | time in 18 minutes, 17 seconds
    Loss: 0.0128(train) | Acc: 93.8%(train)
    Loss: 0.0000(valid) | Acc: 95.0%(valid)
Epoch: 2 | time in 17 minutes, 59 seconds
    Loss: 0.0080(train) | Acc: 96.1%(train)
    Loss: 0.0000(valid) | Acc: 95.9%(valid)
Epoch: 3 | time in 18 minutes, 9 seconds
    Loss: 0.0065(train) | Acc: 96.8%(train)
    Loss: 0.0000(valid) | Acc: 96.3%(valid)
Epoch: 4 | time in 18 minutes, 19 seconds
    Loss: 0.0056(train) | Acc: 97.2%(train)
    Loss: 0.0000(valid) | Acc: 95.9%(valid)
Epoch: 5 | time in 18 minutes, 17 seconds
    Loss: 0.0048(train) | Acc: 97.6%(train)
    Loss: 0.0000(valid) | Acc: 96.4%(valid)
You can see from the output above that the model achieves a very good training accuracy of 97.6% in the fifth epoch. Next, evaluate the model's performance on the test data.

The lines of code below compute and display the result on the test data.
print('Model result on test data...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')
Output:
Model result on test data...
    Loss: 0.0000(test) | Acc: 96.2%(test)
The model accuracy on the test data is 96.2%, which is consistent with the model performance on training data. This shows that the model is achieving good results.
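As a final step, you may want to classify a new piece of raw text with the trained model. The snippet below is a minimal sketch of single-example inference, assuming the vocabulary from train_dataset, the NGRAMS value set earlier, and the label names listed at the start of the guide; the sample sentence is made up for illustration.

from torchtext.data.utils import get_tokenizer, ngrams_iterator

labels = ['Sports', 'Finance', 'Entertainment', 'Automobile', 'Technology']
vocab = train_dataset.get_vocab()
tokenizer = get_tokenizer('basic_english')

def predict(text_str):
    with torch.no_grad():
        # Convert the raw string into the same unigram + bi-gram ids used in training
        ids = torch.tensor([vocab[token] for token in
                            ngrams_iterator(tokenizer(text_str), NGRAMS)])
        # A single example starts at offset 0
        output = model(ids.to(device), torch.tensor([0]).to(device))
        return labels[output.argmax(1).item()]

print(predict('the home team won the championship game last night'))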
In this guide, you learned how to build a text classification model with the high-performance deep learning library PyTorch. You learned the architecture and key components of a text classification pipeline built with the torch and torchtext packages of PyTorch.
To learn more about data science using Python, please refer to the following guides.