Encoders and Decoders for Neural Machine Translation

By Gaurav Singhal

Nov 19, 2020 • 11 Minute Read

Introduction

There are over 7,000 languages in the world. However, only 23 of them, including English, Mandarin Chinese, Hindi, and Spanish, are the most widely spoken around the globe. As the world becomes more connected, language translation bridges the communication gap.


Google Translate can translate not only text but also speech and images in real time. You can use it on your laptop, mobile, or even smartwatch. This guide will show the technology behind this magic.

Before moving further, I would recommend reading these guides on RNN and LSTM.

To follow along with this guide, download and unzip the spa-eng.zip file here. You will only use the spa.txt file for this process.

Let's get started.

The Power of Sequence2Sequence (seq2seq) Modeling

Seq2seq modeling can solve many tasks, including text summarization, speech recognition, image and video captioning, and question answering. It can also be used in genomics for DNA sequence modeling. A seq2seq model has two parts: an encoder and a decoder. The two are separate networks that are combined to form one larger neural network model.

This architecture can handle input and output sequences of variable length. The image below shows the types of RNN models and their use cases.

Encoder and Decoder

The following sections cover the encoder and the decoder in depth.

Encoder

The encoder is at the feeding end; it reads the input sequence and compresses it into a lower-dimensional representation. This representation has a fixed size and is known as the context vector. The context vector acts as the input to the decoder, which generates the output sequence until it reaches the end token. Hence, you can call these seq2seq models encoder-decoder models.
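
To make the idea of a fixed-size context vector concrete, here is a minimal NumPy sketch of a plain tanh RNN encoder (a simplification, not the LSTM used later in this guide). The weights are random rather than trained, and the names encode, W_xh, and W_hh are illustrative assumptions. The only point is that inputs of different lengths produce a context vector of the same size.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 8          # toy sizes, chosen arbitrarily

# Randomly initialised weights stand in for trained parameters.
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

def encode(sequence):
    """Run a simple tanh RNN over the sequence and return the final hidden state."""
    h = np.zeros(hidden_dim)
    for x_t in sequence:              # one update per time step
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h                          # the final state plays the role of the context vector

short_seq = rng.normal(size=(3, input_dim))    # 3 time steps
long_seq = rng.normal(size=(11, input_dim))    # 11 time steps

context_short = encode(short_seq)
context_long = encode(long_seq)
print(context_short.shape, context_long.shape)   # (8,) (8,)
```

Both sequences end up as a vector of length hidden_dim, regardless of how many time steps went in.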


Decoder

If you use LSTM for the encoder, use the same for the decoder. The decoder is slightly more complex than the encoder network, though. You can say the decoder is in an "aware state": it knows what words it has generated so far and what the previous hidden state was. The first layer of the decoder is initialized with the context vector 'C' from the encoder network. A special token is fed in at the start to signal the beginning of output generation, and a similar token marks the end. The first output word is generated by running the stacked LSTM layers, with a SoftMax activation function applied to the last layer. Its job is to turn the layer's outputs into a probability distribution over the possible output tokens. The generated word is then fed back into the decoder as the input for the next step, and the process repeats until the end token is produced.
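
The generation loop described above can be sketched as follows. This is a toy greedy decoder in NumPy under several assumptions: it uses a plain tanh RNN instead of stacked LSTM layers, the vocabulary and the names greedy_decode, W_eh, W_hh, and W_hy are hypothetical, and the weights are random, so the output is meaningless. It only shows the control flow: start from the context vector, feed in the start token, apply softmax, feed the chosen token back in, and stop at the end token.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<start>", "<end>", "hola", "mundo"]   # hypothetical toy vocabulary
START, END = 0, 1
hidden_dim, max_len = 8, 10

# Random weights stand in for a trained decoder.
W_eh = rng.normal(scale=0.5, size=(len(vocab), hidden_dim))   # token -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))   # hidden -> hidden
W_hy = rng.normal(scale=0.5, size=(hidden_dim, len(vocab)))   # hidden -> vocabulary scores

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def greedy_decode(context):
    h = context                        # decoder state starts from the encoder's context vector
    token = START                      # generation begins with the start token
    output = []
    for _ in range(max_len):           # safety cap in case the end token never appears
        h = np.tanh(W_eh[token] + h @ W_hh)
        probs = softmax(h @ W_hy)
        token = int(np.argmax(probs))  # pick the most probable next token
        if token == END:
            break
        output.append(vocab[token])    # the chosen token is fed back in on the next step
    return output

context = rng.normal(size=hidden_dim)
decoded = greedy_decode(context)
print(decoded)
```

In a trained model, the same loop would run with learned weights, and beam search is a common alternative to always taking the single most probable token.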

The accuracy of the encoder-decoder model depends on multiple factors. Choices such as the optimizer, the loss function (for example, cross-entropy), and the learning rate play an important role in the model's performance.

Importing the Libraries and Loading the Dataset

This example covers a simple implementation of seq2seq modeling in Keras. I would suggest running the model on a GPU. You can take advantage of Google Colab's free GPU feature.

Go to Edit, then Notebook Settings, select GPU as the hardware accelerator, and save.

Mount your drive first:

from google.colab import drive
drive.mount('/content/drive')

Copy and paste the authentication code and press enter.

Set up the environment by importing the libraries:

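import numpy as np  # used later to build the one-hot arrays
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.utils import *
from tensorflow.keras.initializers import *
from tensorflow.keras.optimizers import *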

Define the parameters and set the path to the spa.txt file you saved to your drive earlier. The parameters are the batch size, the number of epochs to train for, the LSTM latent dimensionality for the encoder, and the number of samples.

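batch_size = 64  # batch size for training
epochs = 100  # number of epochs to train for
latent_dim = 256  # LSTM latent dimensionality for the encoder
num_samples = 10000  # number of samples to use
# set the data_path accordingly
data_path = "/content/drive/My Drive/spa.txt"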

Change the data_path accordingly.

Pre-processing

You won't be required to conduct in-depth text pre-processing steps. But if you want to know more about the noise found in text and how to pre-process it, kindly refer to this Importance of Text Processing guide. Tokenization converts an input sentence into a sequence of integers; for word-level tokens, you can do this with Keras’s Tokenizer() class. This guide works at the character level and builds the character-to-integer mapping by hand in the steps that follow.
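
As a rough illustration of what tokenization produces, the plain-Python sketch below builds a word-to-integer vocabulary and converts a sentence into an integer sequence. It is not the Keras Tokenizer() class itself; the helper names build_vocab and to_sequence are hypothetical, and reserving index 0 for padding is an assumption made here.

```python
def build_vocab(sentences):
    """Give each distinct lowercase word an integer id, starting at 1 (0 is left for padding)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def to_sequence(sentence, vocab):
    """Map a sentence to a list of integer ids, skipping words not in the vocabulary."""
    return [vocab[word] for word in sentence.lower().split() if word in vocab]

sentences = ["I am here", "I am home"]
vocab = build_vocab(sentences)
print(vocab)                             # {'i': 1, 'am': 2, 'here': 3, 'home': 4}
print(to_sequence("I am home", vocab))   # [1, 2, 4]
```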

Next, vectorize the data. The code below creates empty containers for the texts and their characters, then reads the file and splits it into a list of lines.

input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")

This example limits the data to 10,000 samples (num_samples). The first two lines of the code below put the English text in input_text and the Spanish text in target_text.

for line in lines[: min(num_samples, len(lines) - 1)]:
    # Each line holds the English text, the Spanish text, and a third field that is ignored.
    input_text, target_text, _ = line.split("\t")
    ############### A ###############
    # Use "\t" as the start-of-sequence character and "\n" as the end-of-sequence character.
    target_text = "\t" + target_text + "\n"
    input_texts.append(input_text)
    target_texts.append(target_text)
    ############### B ###############
    # Collect the unique characters seen in the input and target texts.
    input_characters.update(input_text)
    target_characters.update(target_text)
print(input_characters)
print(target_characters)

In the loop above, a tab ( \t ) is prepended to each target text as the start-of-sequence character, and \n is appended as the end-of-sequence character.

Along with the English and Spanish text, you'll also want the sets of their unique characters, which the two print statements at the end display.
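
To see what the loop does to a single line, here is a small sketch that applies the same steps to one made-up line. The sample_line value is an assumption written in the tab-separated layout the loop expects; it is not copied from spa.txt.

```python
# A made-up line in the same layout the loop expects:
# English text, Spanish text, and a third field that is ignored.
sample_line = "Go.\tVe.\tattribution (ignored)"

input_text, target_text, _ = sample_line.split("\t")
target_text = "\t" + target_text + "\n"    # add the start (\t) and end (\n) markers

input_characters = set(input_text)         # unique characters in the English text
target_characters = set(target_text)       # unique characters in the Spanish text, markers included

print(repr(input_text))                    # 'Go.'
print(repr(target_text))                   # '\tVe.\n'
print(sorted(input_characters))            # ['.', 'G', 'o']
print(sorted(target_characters))           # ['\t', '\n', '.', 'V', 'e']
```

Note that the start and end markers become part of the target character set, so the decoder can learn to emit the end marker.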

Define the parameters. They are important for feature engineering and for building the model.

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print("No.of samples:", len(input_texts))
print("No.of unique input tokens:", num_encoder_tokens)
print("No.of unique output tokens:", num_decoder_tokens)
print("Maximum seq length for inputs:", max_encoder_seq_length)
print("Maximum seq length for outputs:", max_decoder_seq_length)

Now that you have a list of the characters, map each input and target character to an integer index.

input_token_index = {char: i for i, char in enumerate(input_characters)}
target_token_index = {char: i for i, char in enumerate(target_characters)}

print(input_token_index)
print(target_token_index)

Notice that each character is now associated with an integer value.
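
The same mapping on a toy string looks like the sketch below. The reverse_index dictionary is an addition for illustration: it is not defined in this guide's code, but a reverse lookup of this kind is what turns predicted integers back into characters.

```python
characters = sorted(set("hola"))           # ['a', 'h', 'l', 'o']
token_index = {char: i for i, char in enumerate(characters)}
reverse_index = {i: char for char, i in token_index.items()}   # integer -> character

encoded = [token_index[char] for char in "hola"]
decoded = "".join(reverse_index[i] for i in encoded)

print(token_index)   # {'a': 0, 'h': 1, 'l': 2, 'o': 3}
print(encoded)       # [1, 3, 2, 0]
print(decoded)       # hola
```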

Refer to the Keras documentation on pre-processing for more detail.

Feature Engineering

To generate feature vectors, one-hot encoding is used. Create 3D NumPy arrays to store the one-hot encodings. The feature variables are encoder_input_data, decoder_input_data, and decoder_target_data. encoder_input_data and decoder_input_data contain the one-hot vectorization of the English and Spanish sentences, respectively.

The first dimension, len(input_texts), is the number of sample texts (10,000 in this case). The second dimension, max_encoder_seq_length (English) or max_decoder_seq_length (Spanish), is the longest encoder/decoder sequence length within the samples. The third dimension, num_encoder_tokens (English) or num_decoder_tokens (Spanish), is the number of unique characters in input_characters or target_characters.

The decoder_target_data is like decoder_input_data; the only difference is that decoder_target_data is offset by one timestep. decoder_target_data[:, t, :] is the same as decoder_input_data[:, t + 1, :].
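
The one-timestep offset can be checked on a single toy sample. The sketch below builds the two arrays for one made-up target text and leaves out padding for simplicity; the shorter variable names decoder_input and decoder_target are used here only for illustration.

```python
import numpy as np

target_text = "\tVe.\n"                    # start marker, text, end marker
chars = sorted(set(target_text))
index = {c: i for i, c in enumerate(chars)}

steps = len(target_text)
decoder_input = np.zeros((1, steps, len(chars)), dtype="float32")
decoder_target = np.zeros((1, steps, len(chars)), dtype="float32")

for t, char in enumerate(target_text):
    decoder_input[0, t, index[char]] = 1.0
    if t > 0:
        # The target at step t - 1 is the character the decoder receives as input at step t.
        decoder_target[0, t - 1, index[char]] = 1.0

# Every target step matches the decoder input one step later.
print(np.array_equal(decoder_target[:, :-1, :], decoder_input[:, 1:, :]))   # True
```

During training, this means the decoder sees the correct previous character at each step and is asked to predict the next one.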

Now that everything is set, build the feature arrays from the above variables so they can be fed to the encoder-decoder model.

# One-hot arrays with shape (samples, timesteps, tokens)
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype="float32"
)

decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)

decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    # Pad the rest of the encoder sequence with the space character
    encoder_input_data[i, t + 1 :, input_token_index[" "]] = 1.0
    for t, char in enumerate(target_text):
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data is ahead of decoder_input_data by one timestep
            # and does not include the start character
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    # Pad the rest of the decoder sequences with the space character
    decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
    decoder_target_data[i, t:, target_token_index[" "]] = 1.0
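
One way to sanity-check arrays like these is to convert a one-hot sample back into text. The sketch below does this on toy data; the helper one_hot_to_text and the three-character vocabulary are hypothetical and not part of the guide's code.

```python
import numpy as np

def one_hot_to_text(one_hot_rows, index_to_char):
    """Convert a (timesteps, num_tokens) one-hot array back into a string."""
    return "".join(index_to_char[int(i)] for i in one_hot_rows.argmax(axis=-1))

# Toy data: a 3-character vocabulary and one sample padded to length 4 with spaces.
chars = [" ", "a", "b"]
char_to_index = {c: i for i, c in enumerate(chars)}
index_to_char = {i: c for c, i in char_to_index.items()}

text, max_len = "ab", 4
encoded = np.zeros((max_len, len(chars)), dtype="float32")
for t, char in enumerate(text):
    encoded[t, char_to_index[char]] = 1.0
encoded[len(text):, char_to_index[" "]] = 1.0    # pad the remaining steps with spaces

print(repr(one_hot_to_text(encoded, index_to_char)))   # 'ab  '
```

Applied to a row of encoder_input_data with a reverse mapping of input_token_index, the same idea should reproduce the original sentence followed by space padding.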
    

Conclusion

The fundamental idea of this guide was to give a brief understanding of the seq2seq model, encoder, and decoder. This guide will help you take this to the next level by teaching you how to build a model using LSTM RNN.

You can now try this with any language of your choice. Just download the dataset for the language you want to translate and set data_path to point to it. Before moving further, make sure you understand LSTM well. Feel free to ask at Codealphabet if you have any queries regarding this guide.

Gaurav S.

Gaurav is a Data Scientist with a strong background in computer science and mathematics. He has extensive research experience in data structures, statistical data analysis, and mathematical modeling. With a solid background in Web development, he works with Python, Java, Django, HTML, Struts, Hibernate, Vaadin, Web Scraping, Angular, and React. His data science skills include Python, Matplotlib, TensorFlow, Pandas, NumPy, Keras, CNN, ANN, NLP, Recommenders, and Predictive analysis. He has built systems that have used both basic machine learning algorithms and complex deep neural networks. He has worked on many data science projects, including product recommendation, user sentiments, Twitter bots, information retrieval, predictive analysis, data mining, image segmentation, SVMs, and RandomForest.
