Machine Translation (mostly) from Scratch using PyTorch#
Peter Nadel (primary author), Kyle Monahan, Joseph Robertson
In this workshop, we’ll build a machine translator using the neural network framework PyTorch. We will implement a sequence-to-sequence (Seq2Seq) encoder-decoder architecture with attention to translate between French and English. We’ll then see an example using another dataset.
Adapted from: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
To make things run a bit faster, go to Runtime > Change runtime Type and select GPU under Hardware Accelerator.
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Data#
For this example, we’ll use the eng-fra.txt from the data.zip file linked from the PyTorch page. This file, and all of those which we will look at in this notebook, will be arranged in the following way:
sentence_i in lang1\tsentence_i in lang2\n.
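As a minimal sketch of that layout, the snippet below splits one line into a sentence pair. The sample line and the helper name `parse_line` are made up for illustration and are not taken from the data file.

```python
def parse_line(line):
    """Split one 'lang1<TAB>lang2' line into a [lang1, lang2] pair."""
    # Drop the trailing newline, then split on the tab separator.
    return line.rstrip("\n").split("\t")


# Hypothetical sample line in the same layout as eng-fra.txt
sample = "I am cold.\tJ'ai froid.\n"
pair = parse_line(sample)
print(pair)
```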
This file comes from https://www.manythings.org/anki/, which hosts several other languages, all with varying corpus sizes. I assume the PyTorch folks chose French/English translation because of the size of the corpus (~14000 aligned sentences). Later in the notebook, I’ll swap this large aligned corpus for one of the smaller corpora to anticipate working with Dakota.
!wget 'https://tufts.box.com/shared/static/v5370zthsaiy5m5xqptv9clsndgyyx1i.zip'
!mv v5370zthsaiy5m5xqptv9clsndgyyx1i.zip data.zip
!unzip data.zip
data_path = "data/eng-fra.txt"
Before we can dig into this file, we need to make a class that will help us keep track of all of the words in our corpus. In particular, we need this class to do two things:
Give each word a unique ID
One-hot encode each word at the index of its ID
This class will also give us the opportunity to encode our start of sentence (SOS) and end of sentence (EOS) tokens, which we’ll place at the beginning and end of each sentence.
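To make the two jobs concrete, here is a small sketch in plain Python, assuming a toy three-word sentence. The helper `one_hot` is hypothetical and only illustrates the idea; in the model itself, `nn.Embedding` takes the word ID directly, so the one-hot vector is never built explicitly.

```python
# Toy vocabulary: IDs 0 and 1 are reserved for SOS and EOS, as in this notebook.
word2index = {"SOS": 0, "EOS": 1}
for word in "i am cold".split(" "):
    if word not in word2index:
        word2index[word] = len(word2index)  # next free ID


def one_hot(word, word2index):
    """Return a list of zeros with a single 1 at the word's ID."""
    vector = [0] * len(word2index)
    vector[word2index[word]] = 1
    return vector


print(word2index)
print(one_hot("am", word2index))
```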
Let’s start by tracking how many times each word occurs.
SOS_token = 0
EOS_token = 1
class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS
    def addSentence(self, sentence):
        word_list = sentence.replace("\t", " ").split(" ")
        for word in word_list:
            self.addWord(word)
    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1  # increments on each new word
        else:
            self.word2count[word] += 1  # increments on individual word
ex_data = open(data_path).readlines()
ex_data[300]
ex = Lang("ex")
ex.addSentence(ex_data[300])
ex.word2index
ex.word2count
ex.index2word
We have a slight problem here. I'm is not a word. Instead, it’s two words. In fact, we want our model to be able to understand contractions like this, but we’ll first need to strip out all of the punctuation.
Also, because this data is Unicode encoded, we’ll need to convert it to ASCII by stripping accents. Note that the regular expression below keeps only the letters a-z and basic punctuation, so this step would need to be adjusted for languages that do not use the Latin alphabet.
The code below does both.
def unicodeToAscii(s):
    return "".join(
        c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn"
    )
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
# split up I'm and normalized the text
# we'll turn i m -> i am soon
normalizeString(ex_data[300])
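The following self-contained sketch repeats the same two steps on a few made-up strings, to show what each step does. The names `strip_accents` and `normalize` are stand-ins for the functions above, and the sample sentences are not from the data file.

```python
import re
import unicodedata


def strip_accents(s):
    # NFD splits an accented letter into the bare letter plus a combining
    # mark (category "Mn"); dropping the marks leaves the bare letter.
    return "".join(
        c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn"
    )


def normalize(s):
    s = strip_accents(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # any other run becomes one space
    return s


print(normalize("I'm fine."))
print(normalize("Ça va très bien !"))
print(repr(normalize("привет")))  # Cyrillic input: no Latin letters survive
```

The last call shows a limitation of this approach: text in a non-Latin script is reduced to whitespace, so the second regular expression would have to be widened before using this pipeline on such a language.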
Now we can set up a method to read in the whole file, pair up the aligned sentences and read them into our Lang class. We’ll keep it as general as possible so we can swap in another dataset later.
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")
    # Read the file and split into lines
    lines = (
        open(f"data/{lang1}-{lang2}.txt", encoding="utf-8").read().strip().split("\n")
    )
    # Split every line on the tab character and normalize
    pairs = [[normalizeString(s) for s in l.split("\t")] for l in lines]
    # Reverse pairs for when we want to go from lang2 to lang1
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)
    return input_lang, output_lang, pairs
We also want a way to control the maximum length of the sentences that we’ll pass into our translator. A small limit trains faster; a larger limit admits longer sentences and can give a better translator, but takes longer to train.
MAX_LENGTH = 50
# dealing with most contractions
eng_prefixes = (
    "i am ",
    "i m ",
    "he is",
    "he s ",
    "she is",
    "she s ",
    "you are",
    "you re ",
    "we are",
    "we re ",
    "they are",
    "they re ",
)
def filterPair(p):
    return (
        len(p[0].split(" ")) < MAX_LENGTH
        and len(p[1].split(" ")) < MAX_LENGTH
        and p[1].startswith(eng_prefixes)
    )
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
The full data processing pipeline is as follows:
Read text file and split into lines, then split those lines into pairs
Normalize each pair and filter by length
Make word lists from the sentence pairs
We can stack everything together in the function below.
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs
input_lang, output_lang, pairs = prepareData("eng", "fra", True)
print(random.choice(pairs))
The Seq2Seq Model#
Useful vocabulary for this section
Recurrent Neural Net (RNN) - a network that uses its output sequence as an input for subsequent steps.
Seq2Seq network or Encoder Decoder network - a model consisting of two RNNs: (1) an encoder that reads an input sequence and outputs a vector encoding of the sequence and (2) a decoder that reads the encoded vector and outputs a sequence.
Hidden state - a layer of arbitrary size which does not come at the beginning or end of the network. In an RNN, it carries information from one time step to the next.
Gated Recurrent Unit (GRU) - a layer of a neural net which constructs a hidden state at a given time step t from the hidden state at time t-1. It uses a tanh activation for the candidate state and sigmoid functions for the gates that control how much of the previous state is kept.
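To make the GRU definition above more concrete, here is a sketch that recomputes a single GRU step by hand and compares it with `nn.GRUCell`. The sizes and the random inputs are arbitrary choices for illustration; the gate layout (reset, update, candidate) follows the PyTorch documentation for `nn.GRUCell`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 4, 3  # toy sizes
cell = nn.GRUCell(input_size, hidden_size)

x = torch.randn(1, input_size)        # input at time t
h_prev = torch.randn(1, hidden_size)  # hidden state from time t-1

# Recompute the step from the cell's own weights.
gi = x @ cell.weight_ih.T + cell.bias_ih
gh = h_prev @ cell.weight_hh.T + cell.bias_hh
i_r, i_z, i_n = gi.chunk(3, dim=1)
h_r, h_z, h_n = gh.chunk(3, dim=1)

r = torch.sigmoid(i_r + h_r)         # reset gate
z = torch.sigmoid(i_z + h_z)         # update gate
n = torch.tanh(i_n + r * h_n)        # candidate state
h_manual = (1 - z) * n + z * h_prev  # hidden state at time t

h_torch = cell(x, h_prev)
print(torch.allclose(h_manual, h_torch, atol=1e-6))
```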
The Seq2Seq network allows us to input an arbitrarily sized sentence in any language into the encoder and have the vector representation produced by the encoder be decoded into another language by the decoder.
This architecture, though now used in other contexts, is ideal for machine translation. Even though words may come in a different order or be represented by multiple words in the target language, the Seq2Seq network will be able to render a translation because it is decoding a vector encoded in a multilingual space.
Finally, this architecture is easy(ish) to implement in PyTorch as we can first build the encoder and then the decoder then put them together.
The Encoder#
This piece of our Seq2Seq network will be an RNN that outputs some value for every word from the input sentence. For every input word, it will output a vector and a hidden state, which will be used for the next input word.
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
To break this down a bit:
Input is embedded by nn.Embedding
Input embedding is activated with nn.GRU and previous hidden state
nn.GRU outputs an output embedding and another hidden state
The Decoder#
This piece of our Seq2Seq network will be another RNN that takes the encoder output (the embedding held by the variable output above) and will output a sequence of words that will constitute the translation. We’ll touch on two different types of decoders: a simple decoder and an attention decoder.
Simple decoder#
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
Let’s break this down again:
Input (the previous output word, starting with SOS) is embedded by nn.Embedding
This embedding is activated by F.relu
An output and hidden state are created by passing the activated embedding through nn.GRU, starting from the encoder's final hidden state
The hidden state is saved and the output is passed through a linear layer and a log-softmax layer to create log-probabilities over the output vocabulary
Attention Decoder#
What is attention
As we see above, the only language data being passed from the encoder to the decoder is the single vector output of nn.GRU. Attention allows the decoder to pay attention to different parts of the encoder’s output. First, we’ll calculate attention weights with an nn.Linear layer using the decoder’s input and the hidden state as inputs. These weights will be multiplied by the encoder outputs to create a weighted combination (called attn_applied below) that contains information about a specific part of the input and so helps the decoder choose the correct output words.
Note: There are also other forms of attention, for example ‘local attention’.
See here for a diagram: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
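The attention step described above can be sketched in isolation as follows. This is a toy walk-through with random tensors and arbitrary sizes, meant only to show the shapes involved; it is not a trained model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size, max_length = 8, 5  # toy sizes, chosen only for illustration

attn = nn.Linear(hidden_size * 2, max_length)

embedded = torch.randn(1, hidden_size)                  # current decoder input
hidden = torch.randn(1, hidden_size)                    # previous decoder hidden state
encoder_outputs = torch.randn(max_length, hidden_size)  # one row per input position

# One weight per input position; softmax makes the weights sum to 1.
attn_weights = F.softmax(attn(torch.cat((embedded, hidden), dim=1)), dim=1)

# Weighted sum of the encoder outputs: (1, 1, L) x (1, L, H) -> (1, 1, H)
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

print(attn_weights.shape, attn_applied.shape)
```

Because the attention layer has a fixed output size of max_length, this form of attention only works for input sentences up to that length, which is why MAX_LENGTH is passed to the decoder.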
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)
    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1
        )
        attn_applied = torch.bmm(
            attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0)
        )
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
We can break this down the same way as with the simple decoder:
First, the input is embedded using nn.Embedding
Then, we pass this embedding, concatenated with the previous hidden state, into our attention layer nn.Linear. The concatenation is the x in the linear transformation \(y = xA^{T} + b\), while A and b are the layer's learned weights and bias (this should look familiar from linear regression). A softmax turns the result into attention weights
The attention weights are applied to the encoder outputs, and the result is combined with the original input embedding and activated by F.relu
From here, we follow the same steps as in the simple decoder: an output and hidden state are created by passing the activated embedding through nn.GRU
The hidden state is saved and the output is passed through a linear layer and a log-softmax layer to create log-probabilities over the output vocabulary
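Both decoders end in a log-softmax, and the training code pairs that with nn.NLLLoss. As a small sketch with made-up scores, the snippet below shows that this pairing gives the same value as nn.CrossEntropyLoss applied to the raw scores, and that exponentiating the log-probabilities recovers ordinary probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 6                        # toy output vocabulary size
scores = torch.randn(1, vocab_size)   # raw scores from a final linear layer
target = torch.tensor([2])            # ID of the correct next word

log_probs = F.log_softmax(scores, dim=1)
probs = log_probs.exp()               # ordinary probabilities, summing to 1

nll = nn.NLLLoss()(log_probs, target)       # picks out -log_probs[0, 2]
ce = nn.CrossEntropyLoss()(scores, target)  # log-softmax and NLL in one step
print(probs.sum().item(), nll.item(), ce.item())
```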
Training#
Now that we understand the architecture of our network, we can begin training. First, we’ll need to convert the indices of our sentence pairs into tensors which can be input into our encoder.
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(" ")]
def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)  # appending the EOS token!
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)
def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)
Training function#
There is one final concept we need to explore before training our Seq2Seq network: teacher forcing. Teacher forcing is when we use the true target outputs (in this case, the correct English-language indices) as each next input, instead of using the decoder’s guess for that input. This allows the network to converge faster, but a network trained this way can be unstable when it later has to rely on its own predictions. Teacher-forced networks tend to produce grammatical output but can stray from the correct translation more easily: the network has learned to represent the output grammar well, but not how to build the translation from the input.
PyTorch lets us implement teacher forcing with a simple if statement. Also, we can use teacher_forcing_ratio to control how often teacher forcing is used.
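Before the full training function, here is the decision in miniature, written in plain Python. The helper `next_decoder_input` and the word IDs are hypothetical and exist only to illustrate the idea.

```python
import random


def next_decoder_input(target_id, predicted_id, use_teacher_forcing):
    """Pick what the decoder sees at the next step."""
    return target_id if use_teacher_forcing else predicted_id


random.seed(0)
teacher_forcing_ratio = 0.5

# The coin is flipped once per training sentence, not once per word.
decisions = [random.random() < teacher_forcing_ratio for _ in range(10000)]
share = sum(decisions) / len(decisions)
print(round(share, 2))  # close to teacher_forcing_ratio

target_ids = [5, 9, 3]     # hypothetical correct word IDs
predicted_ids = [5, 7, 3]  # hypothetical decoder guesses
forced = [next_decoder_input(t, p, True) for t, p in zip(target_ids, predicted_ids)]
free = [next_decoder_input(t, p, False) for t, p in zip(target_ids, predicted_ids)]
print(forced, free)
```

With teacher forcing, the wrong guess at the second step (7 instead of 9) never reaches the decoder; without it, the decoder has to continue from its own mistake.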
teacher_forcing_ratio = 0.5
def train(
    input_tensor,
    target_tensor,
    encoder,
    decoder,
    encoder_optimizer,
    decoder_optimizer,
    criterion,
    max_length=MAX_LENGTH,
):
    encoder_hidden = encoder.initHidden()  # initialize the encoder hidden state
    # zero the encoder/decoder gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    # input tensors from `tensorsFromPair`
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)
    loss = 0
    # encoder forward pass
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]
    # start the decoder with the SOS token
    decoder_input = torch.tensor([[SOS_token]], device=device)
    decoder_hidden = encoder_hidden
    # decoder forward pass
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing
    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input
            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break
    # backward pass for whole Seq2Seq network
    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()
    return loss.item() / target_length
# utilities for timing
import time
import math
def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return "%dm %ds" % (m, s)
def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return "%s (- %s)" % (asMinutes(s), asMinutes(rs))
# utilities for plotting
import matplotlib.pyplot as plt
# plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np
def showPlot(points):
    # plt.subplots() already creates a figure, so no separate plt.figure() call is needed
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    ax.plot(points)
# evaluation will do a forward pass in each RNN and return the words which are most likely
# to be the translation of the input sentence as determined by softmax probabilities
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)
        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]
        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS
        decoder_hidden = encoder_hidden
        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)
        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append("<EOS>")
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])
            decoder_input = topi.squeeze().detach()
        return decoded_words, decoder_attentions[: di + 1]
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print(">", pair[0])
        print("=", pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = " ".join(output_words)
        print("<", output_sentence)
        print("")
Training will proceed as follows:
Start the timer
Initialize optimizers and loss function
Create training batch
Record loss for plotting
Evaluate our model
def trainIters(
    encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01
):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every
    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs)) for i in range(n_iters)]
    criterion = nn.NLLLoss()
    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]
        loss = train(
            input_tensor,
            target_tensor,
            encoder,
            decoder,
            encoder_optimizer,
            decoder_optimizer,
            criterion,
        )
        print_loss_total += loss
        plot_loss_total += loss
        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print(
                "%s (%d %d%%) %.4f"
                % (
                    timeSince(start, iter / n_iters),
                    iter,
                    iter / n_iters * 100,
                    print_loss_avg,
                )
            )
        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
    showPlot(plot_losses)
Training results#
hidden_size = 1024
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(
device
)
trainIters(encoder1, attn_decoder1, 750000, print_every=50)
evaluateRandomly(encoder1, attn_decoder1)
Trying a new dataset#
As I mentioned above, now that we have implemented the Seq2Seq architecture, we can try it on a new dataset, one that is comparable in size to the forthcoming Dakota dataset. In this example, I’ll use Breton, the language spoken by people living in Brittany, which is currently a part of France but was at one point a sovereign nation with its own language. Breton is one of the few remaining Celtic languages, the speakers of which, similar to the Native Americans of this continent thousands of years later, were murdered and forcibly assimilated into Roman society during the conquests of Julius Caesar. Though speakers of Breton are few, they are tenacious.
!unzip bre-eng.zip
# there's some extra cleaning we need to do to make it look like the eng-fra dataset...
!head bre.txt
# making a new file which we can pass into the data prep process
raw = open("bre.txt").readlines()
clean = [re.sub("(?=\tCC).*", "", r) for r in raw]
clean[:10]
with open("eng-bre.txt", "w") as f:
    for c in clean:
        f.write(c)
!head eng-bre.txt
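To see what the cleaning regular expression does, here is a sketch on a single made-up line. It assumes the raw file has a third, tab-separated attribution column beginning with "CC"; the sample text itself is hypothetical.

```python
import re

# Hypothetical raw line: two sentences followed by an attribution column.
raw_line = "Hello!\tDemat!\tCC-BY 2.0 (France) Attribution: example\n"

# The lookahead matches at the tab before "CC"; ".*" then consumes the rest
# of the line. "." does not match the newline, so the line ending survives.
clean_line = re.sub("(?=\tCC).*", "", raw_line)
print(repr(clean_line))
```

Keeping the newline matters here because the cleaned lines are written back to a file without adding any separators.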
# just need to move this file into the data directory
!cp eng-bre.txt data
MAX_LENGTH = 50
def prepareData(lang1, lang2, trim=True, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    if trim:
        pairs = filterPairs(pairs)
        print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs
input_lang, output_lang, pairs = prepareData("eng", "bre", False, True)
print(random.choice(pairs))
hidden_size = 64
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(
device
)
trainIters(encoder1, attn_decoder1, 5000, print_every=500)
evaluateRandomly(encoder1, attn_decoder1)
# can save our models for later using pickle
import pickle
pickle.dump(encoder1, open("breton_english_encoder.p", "wb"))
pickle.dump(attn_decoder1, open("breton_english_decoder.p", "wb"))
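Pickling the whole model object works, but it ties the saved file to the exact class definitions in this notebook. A common alternative, sketched below, is to save only the weights with state_dict. The nn.GRU module and the file name are stand-ins so the example runs on its own; in the notebook the same calls would be applied to encoder1 and attn_decoder1.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in module so the sketch is self-contained.
model = nn.GRU(4, 4)

path = os.path.join(tempfile.mkdtemp(), "model_state.pt")
torch.save(model.state_dict(), path)  # save the weights only

restored = nn.GRU(4, 4)               # rebuild the same architecture first
restored.load_state_dict(torch.load(path))
restored.eval()                       # switch to inference mode (disables dropout)
```

To reload the translator this way, the Lang vocabularies would also need to be saved, since the word IDs must match the ones used during training.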
Conclusion#
Seq2Seq models like the one we built take advantage of the sequential nature of the textual data which they encode and decode. That said, encoder-decoder models, including the transformers that have largely succeeded RNNs, are able to encode and decode much more than just aligned sentence pairs. In fact, they can be used with any sequential data, which ends up being most data in general. For instance, a system like ChatGPT learns how to answer questions through a closely related process: the question is turned into vector representations, and the answer is then decoded from that vector space. Stable Diffusion and other image generation models take in text, encode it and then decode an image.
Also, the embedding layers can be swapped out for pre-trained word embeddings from word2vec or GloVe. These embeddings are amazing resources for research and can offer new insights into large datasets across the social sciences and the humanities.