Deepika Singh

Natural Language Processing with PyTorch

  • Jun 24, 2020
  • 11 Min read

Natural language processing (NLP) is a large field within data science and artificial intelligence. Its applications include sentiment analysis, machine translation, speech recognition, chatbot creation, market intelligence, and text classification. PyTorch is a popular deep learning library with strong support for NLP tasks. In this guide, you will learn how to perform text classification with PyTorch.

Data and Required Libraries

To begin, load the required libraries. The first package you’ll import is the torch library, which is used to define tensors and perform mathematical operations. The second library to import is the torchtext library, which is the NLP library in PyTorch that contains data processing utilities.

import torch
import torchtext

The next step is to load the dataset. The torchtext library contains the torchtext.datasets module, which provides several datasets for natural language processing tasks. In this guide, you will carry out text classification using the built-in SogouNews dataset. It's a supervised learning news dataset with five labels: 0 for Sports, 1 for Finance, 2 for Entertainment, 3 for Automobile, and 4 for Technology.
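The label scheme described above can be captured in a small lookup table, which is handy for turning a predicted class index into a readable category. This is only an illustrative sketch: the names SOGOU_LABELS and label_name are hypothetical helpers and are not part of torchtext.

```python
# Label ids as described in this guide for the SogouNews dataset.
SOGOU_LABELS = {
    0: "Sports",
    1: "Finance",
    2: "Entertainment",
    3: "Automobile",
    4: "Technology",
}


def label_name(class_index):
    """Return the readable category for a class index (hypothetical helper)."""
    return SOGOU_LABELS[class_index]


print(label_name(3))
```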

The lines of code below load the dataset. Setting NGRAMS = 2 means that each text in the dataset is represented as a list of single words plus bigrams.

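from torchtext.datasets import text_classification
NGRAMS = 2  # use single words plus bigrams
import os
# Create the download directory if it does not exist yet
if not os.path.isdir('./.data'):
    os.mkdir('./.data')
train_dataset, test_dataset = text_classification.DATASETS['SogouNews'](
    root='./.data', ngrams=NGRAMS, vocab=None)
# Batch size for the DataLoader used later (assumed value; adjust to your memory budget)
BATCH_SIZE = 16
# Train on the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")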


sogou_news_csv.tar.gz: 384MB [00:04, 94.7MB/s]
450000lines [08:18, 902.49lines/s]
450000lines [17:33, 427.10lines/s]
60000lines [02:19, 428.77lines/s]
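To make the NGRAMS = 2 setting concrete, here is a minimal pure-Python sketch of how a tokenized sentence can be expanded into single words plus bigrams. The function add_ngrams and the sample sentence are illustrative assumptions; torchtext performs this step internally and its implementation may differ in detail.

```python
def add_ngrams(tokens, ngrams):
    """Return the tokens followed by every n-gram (n = 2..ngrams), joined by spaces."""
    result = list(tokens)
    for n in range(2, ngrams + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


# Hypothetical, already tokenized sentence.
tokens = ["pytorch", "makes", "nlp", "easier"]
features = add_ngrams(tokens, ngrams=2)
print(features)
```

With ngrams=2, the four words are kept and three bigrams are appended, so word order contributes to the features even though the model later averages them.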

You have loaded the data and the next step is to set up the model architecture.

Model Architecture

The model architecture consists of an embedding layer followed by a linear layer. The nn.EmbeddingBag layer computes the mean of each bag of embeddings. No padding is needed even though the texts have different lengths, because the starting position of each text in the concatenated batch is stored in offsets.
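The following pure-Python sketch illustrates the idea of mean pooling over bags delimited by offsets. It is not the nn.EmbeddingBag implementation; the function embedding_bag_mean, the tiny embedding table, and the token ids are all made up for illustration, and every bag is assumed to be non-empty.

```python
def embedding_bag_mean(embedding_table, flat_token_ids, offsets):
    """Average the embedding rows of each bag; bag i starts at offsets[i]."""
    # Each bag ends where the next one starts; the last bag ends at the end of the input.
    bag_ends = list(offsets[1:]) + [len(flat_token_ids)]
    dim = len(embedding_table[0])
    pooled = []
    for start, end in zip(offsets, bag_ends):
        rows = [embedding_table[token_id] for token_id in flat_token_ids[start:end]]
        pooled.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return pooled


# Hypothetical 4-word vocabulary with 2-dimensional embeddings.
embedding_table = [
    [0.0, 1.0],
    [2.0, 3.0],
    [4.0, 5.0],
    [6.0, 7.0],
]
flat_token_ids = [0, 1, 2, 3, 3]  # two texts concatenated: [0, 1, 2] and [3, 3]
offsets = [0, 3]                  # start position of each text
pooled = embedding_bag_mean(embedding_table, flat_token_ids, offsets)
print(pooled)
```

Each text collapses to a single vector of the embedding dimension regardless of its length, which is why the linear layer that follows can accept texts of any size.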

The first two lines of code below import the required modules. The third line defines the TextSentiment model class as a subclass of nn.Module.

Then, set up the architecture with the __init__() method. The self argument represents the instance of the object itself. The vocab_size argument is the size of the vocabulary, embed_dim is the dimension of the word embeddings, and num_class is the number of classes in the target variable.

You have defined the layers, but you also need to define how they interact with each other. This is done with the forward() method, which describes how the input flows through the layers to produce the output.

import torch.nn as nn
import torch.nn.functional as F
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        # Mean-pools the embeddings of each text in the batch
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        # Uniform initialization in [-initrange, initrange], zero bias
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

The model architecture is set, and the next step is to define the important arguments discussed above that will be used in model building. This is done with the code below.

VOCAB_SIZE = len(train_dataset.get_vocab())
# Embedding dimension (assumed value; tune as needed)
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

Batch Generation

The text in the dataset will have different lengths, which makes it necessary to write a function that will generate data batches. This task is performed by the generate_batch() function in the code below.

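def generate_batch(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    # Collect the length of each text, then convert lengths into start positions
    offsets = [0] + [len(entry) for entry in text]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # Concatenate all texts into a single 1-D tensor of token ids
    text = torch.cat(text)
    return text, offsets, label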
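The offset arithmetic in generate_batch() can be traced on a tiny example without tensors. The sketch below mirrors the same logic with plain lists; flatten_with_offsets and the token ids are illustrative assumptions, not part of the guide's code.

```python
from itertools import accumulate


def flatten_with_offsets(texts):
    """Concatenate token-id lists and record where each one starts."""
    lengths = [len(text) for text in texts]
    # A text starts at the combined length of all texts that precede it.
    offsets = [0] + list(accumulate(lengths[:-1]))
    flat = [token_id for text in texts for token_id in text]
    return flat, offsets


# Three hypothetical texts of lengths 3, 2, and 4.
texts = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
flat, offsets = flatten_with_offsets(texts)
print(flat)
print(offsets)
```

The length of the last text is never needed, because the final bag simply runs to the end of the flattened list. This is why generate_batch() drops the last element before taking the cumulative sum.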

Define the Training Function

The next step is to define the training function, train_func(sub_train_), which is shown in the code below.

Import the DataLoader class from torch.utils.data to make it easy to load data in batches and in parallel. The main arguments are:

  1. batch_size: The number of samples to load per batch. The default value is 1.

  2. collate_fn: Merges a list of samples to form a mini-batch of Tensor(s). The generate_batch function that was created earlier becomes an input to this argument.

  3. shuffle: Is set to True to have the data reshuffled at every epoch. The default option is False.
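The role of these three arguments can be illustrated with a small pure-Python stand-in. This is a simplified sketch, not the real DataLoader, which also handles samplers, worker processes, and more. The function iterate_batches, its seed parameter, and the toy collate function are hypothetical.

```python
import random


def iterate_batches(dataset, batch_size=1, shuffle=False, collate_fn=list, seed=None):
    """Yield collated batches; a toy stand-in for the three arguments above."""
    indices = list(range(len(dataset)))
    if shuffle:
        # Reorder the samples before cutting them into batches.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        samples = [dataset[i] for i in indices[start:start + batch_size]]
        yield collate_fn(samples)


def collate_labels_and_lengths(samples):
    """Toy collate function: split (label, tokens) pairs into labels and text lengths."""
    labels = [label for label, _ in samples]
    lengths = [len(tokens) for _, tokens in samples]
    return labels, lengths


# Hypothetical dataset of (label, token ids) pairs.
dataset = [(0, [1, 2]), (1, [3]), (2, [4, 5, 6]), (3, [7]), (4, [8, 9])]
batches = list(iterate_batches(dataset, batch_size=2,
                               collate_fn=collate_labels_and_lengths))
print(batches)
```

The last batch holds a single sample because five samples do not divide evenly into batches of two.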

You'll also create a helper function, test(), to evaluate the model on the validation and test datasets.

from torch.utils.data import DataLoader

def train_func(sub_train_):
    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()
        # Move the batch to the same device as the model
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the learning rate
    scheduler.step()

    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets)
            # Keep the batch loss in its own variable so the running total is not overwritten
            batch_loss = criterion(output, cls)
            loss += batch_loss.item()
            acc += (output.argmax(1) == cls).sum().item()

    return loss / len(data_), acc / len(data_)
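The two quantities tracked by these functions, cross-entropy loss and argmax accuracy, can be worked through by hand on a tiny batch. The sketch below uses only the standard library. The logits and targets are made up, and the helper names are hypothetical, so treat it as an illustration of the arithmetic rather than PyTorch's implementation.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one sample."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


def batch_metrics(batch_logits, targets):
    """Return (mean cross-entropy, number of correct argmax predictions)."""
    losses = [cross_entropy(row, target) for row, target in zip(batch_logits, targets)]
    predictions = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    correct = sum(int(pred == target) for pred, target in zip(predictions, targets))
    return sum(losses) / len(losses), correct


# Hypothetical model outputs for three samples and five classes.
batch_logits = [
    [2.0, 0.5, 0.1, 0.0, -1.0],  # confident and correct
    [0.0, 0.0, 3.0, 0.0, 0.0],   # confident and correct
    [1.0, 1.0, 1.0, 1.0, 1.0],   # undecided; argmax falls back to class 0
]
targets = [0, 2, 4]
mean_loss, correct = batch_metrics(batch_logits, targets)
print(f"mean loss: {mean_loss:.4f}, correct: {correct} of {len(targets)}")
```

One detail worth keeping in mind when reading the training output: CrossEntropyLoss averages over the batch by default, so summing the per-batch losses and dividing by the number of samples, as train_func() does, gives roughly the mean loss divided by the batch size rather than the mean loss itself.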

Model Building

Before building the model, split the training data into training and validation subsets with the random_split function from torch.utils.data. Use the stochastic gradient descent (SGD) optimizer to optimize the network and, because this is a classification problem, use cross entropy as the loss function. The lr argument specifies the learning rate of the optimizer, and the StepLR scheduler multiplies it by gamma after every epoch. Finally, the performance of each epoch and the time taken are printed.
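The training code in this section holds out 5% of the training set for validation and decays the learning rate by a factor of 0.9 after each epoch, starting from 4.0. The arithmetic can be checked without PyTorch. The sketch below assumes the 450,000 training samples reported when the dataset was loaded and a plain step-decay formula; split_lengths and step_lr are hypothetical helper names.

```python
def split_lengths(n_samples, train_fraction=0.95):
    """Return (train, validation) sizes for a fractional split."""
    train_len = int(n_samples * train_fraction)
    return train_len, n_samples - train_len


def step_lr(initial_lr, gamma, epoch, step_size=1):
    """Learning rate in effect during a 0-based epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


train_len, valid_len = split_lengths(450000)
print(f"train: {train_len}, validation: {valid_len}")

for epoch in range(5):
    print(f"epoch {epoch + 1}: lr = {step_lr(4.0, 0.9, epoch):.4f}")
```

The split leaves about 427,500 samples for training and 22,500 for validation, and by the fifth epoch the learning rate has decayed from 4.0 to roughly 2.62.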

import time
from torch.utils.data import random_split
N_EPOCHS = 5
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

# Hold out 5% of the training data for validation
train_len = int(len(train_dataset) * 0.95)
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs // 60
    secs = secs % 60

    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')


Epoch: 1  | time in 18 minutes, 17 seconds
	Loss: 0.0128(train)	|	Acc: 93.8%(train)
	Loss: 0.0000(valid)	|	Acc: 95.0%(valid)
Epoch: 2  | time in 17 minutes, 59 seconds
	Loss: 0.0080(train)	|	Acc: 96.1%(train)
	Loss: 0.0000(valid)	|	Acc: 95.9%(valid)
Epoch: 3  | time in 18 minutes, 9 seconds
	Loss: 0.0065(train)	|	Acc: 96.8%(train)
	Loss: 0.0000(valid)	|	Acc: 96.3%(valid)
Epoch: 4  | time in 18 minutes, 19 seconds
	Loss: 0.0056(train)	|	Acc: 97.2%(train)
	Loss: 0.0000(valid)	|	Acc: 95.9%(valid)
Epoch: 5  | time in 18 minutes, 17 seconds
	Loss: 0.0048(train)	|	Acc: 97.6%(train)
	Loss: 0.0000(valid)	|	Acc: 96.4%(valid)

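You can see from the output above that the model achieves a training accuracy of 97.6% and a validation accuracy of 96.4% in the fifth epoch. The validation loss column is not informative here: it was produced by a test() function that reassigned its loss variable on every batch instead of accumulating it, which is why it prints as 0.0000. Next, evaluate the model performance on test data.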

Model Evaluation

The lines of code below check and display the result on test data.

print('Model result on test data...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')


Model result on test data...
	Loss: 0.0000(test)	|	Acc: 96.2%(test)

The model accuracy on the test data is 96.2%, which is close to the 96.4% validation accuracy from the final epoch. This suggests that the model generalizes well to unseen data.


In this guide, you learned how to build a text classification model with PyTorch, a high-performing deep learning library. You learned the architecture and key components of a text classification model built with the torch and torchtext packages.

To learn more about data science using Python, please refer to the following guides.