Machine Translation (mostly) from Scratch using `PyTorch`#

Peter Nadel (primary author), Kyle Monahan, Joseph Robertson

In this workshop, we’ll build a machine translator using the neural network framework PyTorch. We will implement a sequence-to-sequence (Seq2Seq) encoder-decoder architecture with attention to translate between French and English. We’ll then see an example using another dataset.

Adapted from: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

To make things run a bit faster, go to Runtime > Change runtime type and select GPU under Hardware accelerator.

from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Data#

For this example, we’ll use the eng-fra.txt from the data.zip file linked from the PyTorch page. This file, and all of those which we will look at in this notebook, will be arranged in the following way:

sentence_i in lang1\tsentence_i in lang2\n.

The file originally comes from https://www.manythings.org/anki/, which hosts several other language pairs with varying corpus sizes. I assume the PyTorch folks chose French/English translation because of the size of the corpus (roughly 136,000 aligned sentence pairs in the version distributed with the PyTorch tutorial). Later in the notebook, I’ll swap this large aligned corpus for one of the smaller corpora to anticipate working with Dakota.
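
To make the line layout concrete, here is a minimal sketch of how one such line splits into a sentence pair. The sample line and the helper name `split_pair` are made up for illustration and are not taken from eng-fra.txt.

```python
def split_pair(line):
    """Split one 'lang1<TAB>lang2<NEWLINE>' line into its two sentences."""
    first, second = line.rstrip("\n").split("\t")
    return first, second


# Hypothetical line in the same layout as the data file
sample_line = "Go.\tVa !\n"
pair = split_pair(sample_line)
print(pair)  # ('Go.', 'Va !')
```

The notebook's own reader does the same split on the tab character, then normalizes each half.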

!wget 'https://tufts.box.com/shared/static/v5370zthsaiy5m5xqptv9clsndgyyx1i.zip'
!mv v5370zthsaiy5m5xqptv9clsndgyyx1i.zip data.zip

!unzip data.zip

data_path = "data/eng-fra.txt"

Before we can dig into this file, we need to make a class that will help us keep track of all of the words in our corpus. In particular, we need this class to do two things:

Give each word a unique ID
Let us one-hot encode each word at the index of its ID

This class will also give us the opportunity to encode our start of sentence (SOS) and end of sentence (EOS) tokens, which we’ll place at the beginning and end of each sentence.
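
As a rough illustration of what one-hot encoding at the index of an ID means, the sketch below uses a made-up four-entry vocabulary. The names `one_hot` and `toy_word2index` are hypothetical and only serve this example.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


# Toy vocabulary; IDs 0 and 1 are reserved for SOS and EOS as in this notebook
toy_word2index = {"SOS": 0, "EOS": 1, "hello": 2, "world": 3}
encoded = one_hot(toy_word2index["world"], len(toy_word2index))
print(encoded)  # [0, 0, 0, 1]
```

In practice we never build these vectors explicitly: an embedding layer takes the integer ID directly, which is equivalent to multiplying the one-hot vector by the embedding matrix.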

Let’s start by tracking how many times each word occurs.

SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        word_list = sentence.replace("\t", " ").split(" ")
        for word in word_list:
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1  # increments on each new word
        else:
            self.word2count[word] += 1  # increments on individual word

# read the file as UTF-8 and close it when done
with open(data_path, encoding="utf-8") as f:
    ex_data = f.readlines()
ex_data[300]

ex = Lang("ex")
ex.addSentence(ex_data[300])

ex.word2index

ex.word2count

ex.index2word

We have a slight problem here. I'm is not a single word; it is a contraction of two words. We want our model to be able to handle contractions like this, so we’ll first need to strip out all of the punctuation.

Also, because this data is Unicode text, we’ll convert it to plain ASCII by stripping accents. This step matters for languages such as French that use accented Latin letters. Note that the regular expression below keeps only the letters a-z and a few punctuation marks, so a language written in a non-Latin script would need a different normalization step.

The code below does both.

def unicodeToAscii(s):
    return "".join(
        c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn"
    )


def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

# split up I'm and normalized the text
# contractions such as "i m" are matched by the prefix filter defined below
normalizeString(ex_data[300])
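
To see the effect of both steps without the data file, here is a standalone sketch. The functions `strip_accents` and `normalize` are illustrative copies of the notebook's helpers under different names, and the input sentences are made up.

```python
import re
import unicodedata


def strip_accents(text):
    # NFD splits "é" into "e" plus a combining accent (category "Mn"), which we drop
    return "".join(
        ch
        for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )


def normalize(text):
    text = strip_accents(text.lower().strip())
    text = re.sub(r"([.!?])", r" \1", text)  # put a space before . ! ?
    text = re.sub(r"[^a-zA-Z.!?]+", " ", text)  # everything else becomes one space
    return text


english = normalize("I'm OK.")
french = normalize("J'ai été là-bas.")
cyrillic = normalize("Привет!")
print(english)  # i m ok .
print(french)  # j ai ete la bas .
print(repr(cyrillic))  # ' !'
```

The last call shows the limitation of this approach: text in a non-Latin script is removed entirely, leaving only the punctuation.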

Now we can set up a method to read in the whole file, pair up the aligned sentences and read them into our Lang class. We’ll keep it as general as possible so we can swap in another dataset later.

def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = (
        open(f"data/{lang1}-{lang2}.txt", encoding="utf-8").read().strip().split("\n")
    )

    # Split every line on the tab character and normalize
    pairs = [[normalizeString(s) for s in l.split("\t")] for l in lines]

    # Reverse pairs for when we want to go from lang2 to lang1
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs

We also want a way to control the maximum length of the sentences that we’ll pass into our translator. A small limit trains faster (the original PyTorch tutorial uses 10), while a larger limit such as the 50 used here keeps more of the data and can give a better translator at the cost of longer training.

MAX_LENGTH = 50

# dealing with most contractions
eng_prefixes = (
    "i am ",
    "i m ",
    "he is",
    "he s ",
    "she is",
    "she s ",
    "you are",
    "you re ",
    "we are",
    "we re ",
    "they are",
    "they re ",
)


def filterPair(p):
    return (
        len(p[0].split(" ")) < MAX_LENGTH
        and len(p[1].split(" ")) < MAX_LENGTH
        and p[1].startswith(eng_prefixes)
    )


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
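
The filter can be tried on a couple of made-up pairs. This is a minimal sketch: `MAX_WORDS`, `PREFIXES`, `keep_pair` and `toy_pairs` are hypothetical names, and the prefix list is shortened for the example.

```python
MAX_WORDS = 50  # same role as MAX_LENGTH in this notebook
PREFIXES = ("i am ", "i m ", "he is", "he s ")  # shortened list for the example


def keep_pair(pair):
    source, target = pair
    return (
        len(source.split(" ")) < MAX_WORDS
        and len(target.split(" ")) < MAX_WORDS
        and target.startswith(PREFIXES)  # str.startswith accepts a tuple of options
    )


toy_pairs = [
    ["je suis fatigue .", "i m tired ."],
    ["il pleut .", "it is raining ."],
]
kept = [pair for pair in toy_pairs if keep_pair(pair)]
print(kept)  # [['je suis fatigue .', 'i m tired .']]
```

The second pair is dropped only because its English side does not start with one of the prefixes, which is why this filter shrinks the corpus so much.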

The full data processing pipeline is as follows:

Read text file and split into lines, then split those lines into pairs
Normalize each pair and filter by length
Make word lists from the sentence pairs

We can stack everything together in the function below.

def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData("eng", "fra", True)
print(random.choice(pairs))

The Seq2Seq Model#

Useful vocabulary for this section

Recurrent Neural Net (RNN) - a network that operates on a sequence and uses its own output as an input for subsequent steps.
Seq2Seq network or Encoder Decoder network - a model consisting of two RNNs: (1) an encoder that reads an input sequence and outputs a vector encoding of the sequence and (2) a decoder that reads the encoded vector and outputs a sequence.
Hidden state - a vector of arbitrary size that an RNN carries from one time step to the next. It is internal to the network rather than its input or final output.
Gated Recurrent Unit (GRU) - a layer of a neural net which constructs a hidden state at a given time step t from the input at time t and the hidden state of time t-1. It uses sigmoid-activated reset and update gates to decide how much of the previous hidden state to keep, and a tanh activation to compute the candidate new state.
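
For reference, the GRU update can be written out as follows. This follows the formulation given in the PyTorch nn.GRU documentation; \(x_t\) is the input at time t, \(h_{t-1}\) the previous hidden state, the \(W\) and \(b\) terms are learned weights and biases, \(\sigma\) is the sigmoid function and \(\odot\) is element-wise multiplication.

\[
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
\]

Here \(r_t\) is the reset gate, \(z_t\) the update gate, \(n_t\) the candidate state and \(h_t\) the new hidden state.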

The Seq2Seq network allows us to input an arbitrarily sized sentence in any language into the encoder and have the vector representation produced by the encoder be decoded into another language by the decoder.

This architecture, though now used in other contexts, is ideal for machine translation. Even though words may come in a different order or be represented by multiple words in a target language, the Seq2Seq network will be able to render a translation because it is decoding a vector encoded in a multilingual space.

Finally, this architecture is easy(ish) to implement in PyTorch, as we can first build the encoder, then the decoder, and then put them together.

The Encoder#

This piece of our Seq2Seq network will be an RNN that outputs some value for every word from the input sentence. For every input word, it will output a vector and a hidden state, which will be used for the next input word.

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

To break this down a bit:

Input is embedded by nn.Embedding
The input embedding and the previous hidden state are passed through nn.GRU
nn.GRU returns an output vector and a new hidden state
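
The tensor shapes involved in one encoder step can be traced with a small sketch. It assumes PyTorch is installed, and the toy sizes and variable names are chosen only for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_size = 10, 4  # toy sizes, far smaller than a real vocabulary

embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)

word_index = torch.tensor([3])  # ID of a single word
hidden = torch.zeros(1, 1, hidden_size)  # initial hidden state, all zeros

# Reshape to (sequence length = 1, batch size = 1, hidden_size), the layout nn.GRU expects
embedded = embedding(word_index).view(1, 1, -1)
output, hidden_next = gru(embedded, hidden)

shapes = (tuple(embedded.shape), tuple(output.shape), tuple(hidden_next.shape))
outputs_match = torch.allclose(output, hidden_next)
print(shapes)  # ((1, 1, 4), (1, 1, 4), (1, 1, 4))
print(outputs_match)  # True
```

Because we feed one word at a time through a single-layer GRU, the output and the new hidden state hold the same values at each step.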

The Decoder#

This piece of our Seq2Seq network will be another RNN that takes the encoder output (the embedding held by the variable output above) and outputs a sequence of words that will constitute the translation. We’ll touch on two different types of decoders: a simple decoder and an attention decoder.

Simple decoder#

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

Let’s break this down again:

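The input (the SOS token at first, then the previous output word) is embedded by nn.Embedding, while the encoder’s final hidden state serves as the decoder’s initial hidden state
This embedding is activated by F.relu
An output and hidden state are created by passing the activated embedding through nn.GRU
The hidden state is saved and the output is passed through a linear layer and a log-softmax layer to create log-probabilities over the output vocabulary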
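
The last step can be sketched in plain Python. This is an illustration with made-up scores for a four-word vocabulary, not the notebook's code: nn.LogSoftmax produces log-probabilities, and the nn.NLLLoss criterion used during training takes the negative log-probability at the index of the correct word.

```python
import math


def log_softmax(scores):
    """Turn raw scores into log-probabilities whose exponentials sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    log_total = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_total for s in scores]


# Made-up scores for a 4-word output vocabulary
scores = [2.0, 1.0, 0.1, -1.0]
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

target_index = 0
nll = -log_probs[target_index]  # negative log-likelihood of the correct word
predicted_index = max(range(len(log_probs)), key=log_probs.__getitem__)
print(round(sum(probs), 6), predicted_index, round(nll, 4))
```

The loss is small when the correct word receives a high probability and grows as that probability approaches zero.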

Attention Decoder#

What is attention

As we see above, the only language data being passed from the encoder to the decoder is the single vector output of nn.GRU. Attention allows the decoder to pay attention to different parts of the encoder’s output. First, we’ll calculate attention weights with an nn.Linear layer using the decoder’s input and the hidden state as inputs. These weights will be multiplied by the encoder outputs to create a weighted combination (called attn_applied below) that will contain information about a specific part of the input and then help the decoder to choose the correct output words.

Note: There are also other forms of attention, for example ‘local attention’.

See here for a diagram: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
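
The weighted combination can be sketched in plain Python before reading the full PyTorch class. The encoder outputs and attention scores here are made up; in the notebook the scores come from an nn.Linear layer and the weighted sum is computed with torch.bmm.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy encoder outputs: 3 input positions, each a 2-dimensional vector
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Made-up attention scores, one per input position
attn_scores = [0.0, 0.0, math.log(2.0)]

attn_weights = softmax(attn_scores)  # approximately [0.25, 0.25, 0.5]
# Weighted sum of the encoder outputs, one value per vector dimension
attn_applied = [
    sum(w * vec[d] for w, vec in zip(attn_weights, encoder_outputs))
    for d in range(2)
]
print(attn_weights, attn_applied)
```

The third input position has the highest weight, so it contributes most to the combined vector that the decoder sees at this step.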

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1
        )
        attn_applied = torch.bmm(
            attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0)
        )

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

We can break this down the same way as with the simple decoder:

First, the input is embedded using nn.Embedding
Then, we concatenate this embedding with the previous hidden state and pass the result through our attention layer, nn.Linear. Here the concatenated vector is the x, and the layer’s learned weights and bias are the A and b, in the linear transformation: \(y = xA^{T} + b\) (this should look familiar from linear regression). A softmax turns the result into attention weights, which are multiplied by the encoder outputs
From here, we follow the same steps as in the simple decoder. The attention-weighted encoder output is combined with the original input embedding and activated by F.relu
As above, an output and hidden state are created by passing the activated embedding through nn.GRU
The hidden state is saved and the output is passed through a linear layer and a log-softmax layer to create log-probabilities over the output vocabulary

Training#

Now that we understand the architecture of our network, we can begin training. First, we’ll need to convert the indices of our sentence pairs into tensors which can be input into our encoder.

def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(" ")]


def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)  # appending the EOS token!
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)
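
Leaving the tensor conversion aside, the index lookup itself can be sketched in plain Python. The vocabulary `toy_word2index` and the helper `indexes_with_eos` are hypothetical stand-ins for Lang.word2index and the functions above.

```python
EOS = 1  # same ID as EOS_token in this notebook

# Toy vocabulary standing in for Lang.word2index
toy_word2index = {"i": 2, "m": 3, "ok": 4, ".": 5}


def indexes_with_eos(word2index, sentence):
    indexes = [word2index[word] for word in sentence.split(" ")]
    indexes.append(EOS)  # mark the end of the sentence
    return indexes


indexes = indexes_with_eos(toy_word2index, "i m ok .")
print(indexes)  # [2, 3, 4, 5, 1]
```

Note that a word missing from the vocabulary raises a KeyError in this lookup, so sentences passed to the model must use only words seen when the vocabulary was built.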

Training function#

There is one final concept we need to explore before training our Seq2Seq network: teacher forcing. Teacher forcing is when we use the true target outputs (in this case, the correct English language indices) as each next input, instead of using the decoder’s guess for that input. This allows the network to converge faster, but the trained network can be unstable when it later has to rely on its own predictions. Teacher-forced networks often produce output with coherent grammar that nonetheless strays from the correct translation: the network has learned to represent the output grammar well, but not how to create the translation from the input.

PyTorch lets us implement teacher forcing with a simple if statement. We can also use teacher_forcing_ratio to control how often teacher forcing is applied.

teacher_forcing_ratio = 0.5


def train(
    input_tensor,
    target_tensor,
    encoder,
    decoder,
    encoder_optimizer,
    decoder_optimizer,
    criterion,
    max_length=MAX_LENGTH,
):
    encoder_hidden = encoder.initHidden()  # initialized the encoder

    # zeroes encoder/decoder gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # input tensors from `tensorsFromPair`
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    # encoder forward pass
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    # add SOS token to beginning of decoder input
    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden

    # decoder forward pass
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    # backward pass for whole Seq2Seq network
    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
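
The role of teacher_forcing_ratio can be illustrated without any model. In this sketch the helper `choose_next_input` and the example tokens are made up; the point is that the random draw happens once per training sentence, so a ratio of 0.5 means roughly half of the sentences are decoded with teacher forcing.

```python
import random


def choose_next_input(target_token, predicted_token, use_teacher_forcing):
    """Pick what the decoder sees at the next step."""
    return target_token if use_teacher_forcing else predicted_token


random.seed(0)  # fixed seed so the run is repeatable
ratio = 0.5
n_sentences = 10_000
forced = sum(random.random() < ratio for _ in range(n_sentences))
fraction_forced = forced / n_sentences
print(fraction_forced)  # close to 0.5

# With teacher forcing the true word is fed in, otherwise the model's own guess
next_input_forced = choose_next_input("cat", "dog", use_teacher_forcing=True)
next_input_free = choose_next_input("cat", "dog", use_teacher_forcing=False)
print(next_input_forced, next_input_free)  # cat dog
```

Setting the ratio to 1.0 would always feed the correct word, and 0.0 would always feed the decoder's own prediction.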

# utilities for timing

import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return "%dm %ds" % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return "%s (- %s)" % (asMinutes(s), asMinutes(rs))

# utilities for plotting

import matplotlib.pyplot as plt

# plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    # plt.subplots() already creates the figure, so no separate plt.figure() is needed
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    ax.plot(points)

# evaluation will do a forward pass in each RNN and return the words which are most likely
# to be the translation of the input sentence as determined by softmax probabilities


def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append("<EOS>")
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[: di + 1]

def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print(">", pair[0])
        print("=", pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = " ".join(output_words)
        print("<", output_sentence)
        print("")
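
The decoding loop inside evaluate is a form of greedy decoding: at each step it keeps the single most likely word and feeds it back in. A minimal sketch of that idea, with a hypothetical `toy_step` function standing in for the trained decoder, might look like this:

```python
EOS = 1  # same ID as EOS_token in this notebook


def greedy_decode(step_fn, start_token, max_length):
    """Repeatedly feed the most likely token back in until EOS or max_length."""
    token, decoded = start_token, []
    for _ in range(max_length):
        scores = step_fn(token)  # one score per vocabulary entry
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax, like topk(1)
        decoded.append(token)
        if token == EOS:
            break
    return decoded


# Hypothetical stand-in for the decoder: it prefers (previous + 2) while that
# is a valid ID in an 8-word vocabulary, then prefers EOS
def toy_step(previous_token):
    scores = [0.0] * 8
    nxt = previous_token + 2
    scores[nxt if nxt < 8 else EOS] = 1.0
    return scores


decoded = greedy_decode(toy_step, start_token=0, max_length=10)
truncated = greedy_decode(toy_step, start_token=0, max_length=2)
print(decoded)  # [2, 4, 6, 1]
print(truncated)  # [2, 4]
```

The second call shows why a maximum length is needed: without an EOS prediction, the loop would otherwise never stop.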

Training will proceed as follows:

Start the timer
Initialize optimizers and loss function
Sample the training pairs
Call train on each pair and record the loss for printing and plotting
Plot the loss, then evaluate our model

def trainIters(
    encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01
):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs)) for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(
            input_tensor,
            target_tensor,
            encoder,
            decoder,
            encoder_optimizer,
            decoder_optimizer,
            criterion,
        )
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print(
                "%s (%d %d%%) %.4f"
                % (
                    timeSince(start, iter / n_iters),
                    iter,
                    iter / n_iters * 100,
                    print_loss_avg,
                )
            )

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)

Training results#

hidden_size = 1024
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(
    device
)

trainIters(encoder1, attn_decoder1, 750000, print_every=50)

evaluateRandomly(encoder1, attn_decoder1)

Trying a new dataset#

As I mentioned above, now that we have implemented the Seq2Seq architecture, we can try it on a new dataset, one that is comparable in size to the forthcoming Dakota dataset. In this example, I’ll use Breton, the language spoken in Brittany, which is now a part of France but was at one point a sovereign nation with its own language. Breton is one of the few remaining Celtic languages. Speakers of Celtic languages, much like the Native Americans of this continent many centuries later, were murdered and forcibly assimilated into Roman society during the conquests of Julius Caesar. Though speakers of Breton are few, they are tenacious.

!unzip bre-eng.zip

# there's some extra cleaning we need to do to make it look like the eng-fra dataset...
!head bre.txt

# making a new file which we can pass into the data prep process
raw = open("bre.txt").readlines()
clean = [re.sub("(?=\tCC).*", "", r) for r in raw]
clean[:10]
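
To show what the regular expression removes, here is a sketch on a single made-up line. It assumes the three-column layout of English, Breton and a licence note starting with CC; the sample text is for illustration only and is not taken from bre.txt.

```python
import re

# Made-up line in the assumed layout: English, Breton, attribution
raw_line = "Hello.\tDemat.\tCC-BY 2.0 (France) Attribution: example\n"

# Same pattern as in the notebook: from the tab before "CC" to the end of the line
cleaned = re.sub("(?=\tCC).*", "", raw_line)
print(repr(cleaned))  # 'Hello.\tDemat.\n'
```

Because `.` does not match a newline, the line ending is kept, so the cleaned lines can be written straight back to a file.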

with open("eng-bre.txt", "w") as f:
    for c in clean:
        f.write(c)

!head eng-bre.txt

# just need to move this file into the data directory
!cp eng-bre.txt data

MAX_LENGTH = 50


def prepareData(lang1, lang2, trim=True, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    if trim:
        pairs = filterPairs(pairs)
        print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData("eng", "bre", trim=False, reverse=True)
print(random.choice(pairs))

hidden_size = 64
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(
    device
)

trainIters(encoder1, attn_decoder1, 5000, print_every=500)

evaluateRandomly(encoder1, attn_decoder1)

# can save our models for later using pickle
import pickle

# the with blocks make sure each file is closed after writing
with open("breton_english_encoder.p", "wb") as f:
    pickle.dump(encoder1, f)
with open("breton_english_decoder.p", "wb") as f:
    pickle.dump(attn_decoder1, f)
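
An alternative to pickling the whole module is to save only its learned parameters with state_dict, which is the approach the PyTorch documentation generally recommends. The sketch below uses a small stand-in module and an in-memory buffer so that it runs on its own; in the notebook the module would be encoder1 or attn_decoder1, and the buffer would be a file path of your choosing.

```python
import io

import torch
import torch.nn as nn

# Small stand-in module for this example
model = nn.GRU(4, 4)

buffer = io.BytesIO()  # a file path works the same way
torch.save(model.state_dict(), buffer)  # save only the learned parameters

buffer.seek(0)
restored = nn.GRU(4, 4)  # rebuild the architecture first
restored.load_state_dict(torch.load(buffer))  # then load the saved weights into it

same_weights = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same_weights)  # True
```

The trade-off is that the class definition and its constructor arguments (such as hidden_size) must be available when loading.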

Conclusion#

Seq2Seq models take advantage of the sequential nature of the textual data which they encode and decode. That said, encoder-decoder models are able to encode and decode much more than just aligned sentence pairs. In fact, they can be used with any sequential data, which ends up being most data in general. Transformers, the architecture behind systems like ChatGPT, build on the same attention idea we used in our decoder: the question is turned into vector representations, and the answer is then generated from them one token at a time. Stable Diffusion and other image generation models take in text, encode it, and then decode an image.

Also, the embedding layers can be swapped out for pre-trained word embeddings from word2vec or GloVe. These embeddings are amazing resources for research and can offer new insights into large datasets across the social sciences and the humanities.
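
As a rough sketch of how such a swap could look, nn.Embedding.from_pretrained builds an embedding layer from an existing matrix of vectors. The tiny matrix here is made up; real vectors would be loaded from a word2vec or GloVe file, with one row per word ID in the vocabulary.

```python
import torch
import torch.nn as nn

# Made-up 3-word vocabulary with 2-dimensional "pre-trained" vectors
pretrained_vectors = torch.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# freeze=True keeps the vectors fixed during training
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

looked_up = embedding(torch.tensor([2]))  # the row for word ID 2
is_frozen = not embedding.weight.requires_grad
print(looked_up, is_frozen)
```

In the encoder and decoder above, the embedding dimension would also need to match hidden_size, since the GRU layers take the embedding as their input.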
