{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Translation (mostly) from Scratch using `PyTorch`\n",
    "\n",
    "Peter Nadel (primary author), Kyle Monahan, Joseph Robertson\n",
    "\n",
    "In this workshop, we'll build a machine translator using the neural net framework, `PyTorch`. We will implement the transformer architecture to translate between French and English. We'll then see an example using another dataset.\n",
    "\n",
    "Adapted from: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make things run a bit faster, go to Runtime > Change runtime Type and select GPU under Hardware Accelerator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import unicode_literals, print_function, division\n",
    "from io import open\n",
    "import unicodedata\n",
    "import string\n",
    "import re\n",
    "import random\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "from torch import optim\n",
    "import torch.nn.functional as F\n",
    "\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "For this example, we'll use the `eng-fra.txt` from the `data.zip` file linked from the `PyTorch` page. This file, and all of those which we will look at in this notebook, will be arranged in the following way:\n",
    "\n",
    "`sentence_i in lang1\\tsentence_i in lang2\\n`.\n",
    "\n",
    "It should be noted that this is where this file comes from: https://www.manythings.org/anki/. There are several other languages here, all with varying corpus sizes. I assume the `PyTorch` folks chose French/English translation because of the size of the corpus (~14000 aligned sentences). Later in the notebook, I'll switch out this large aligned corpus with one of the smaller corpuses to anticipate working with Dakota."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget 'https://tufts.box.com/shared/static/v5370zthsaiy5m5xqptv9clsndgyyx1i.zip'\n",
    "!mv v5370zthsaiy5m5xqptv9clsndgyyx1i.zip data.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!unzip data.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_path = \"data/eng-fra.txt\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we can dig into this file, we need to make a class that will help us keep track of all of the words in our corpus. In particular, we need this case to do two things:\n",
    "\n",
    "*   Give each word a unique ID\n",
    "*   One-hot encode each word at the index of its ID\n",
    "*   This class will also give us the opportunity to encode our start of sentence (SOS) and end of sentence (EOS) tokens, which we'll place at the beginning and end of each sentence.\n",
    "\n",
    "Let's start by tracking how many times each word occurs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SOS_token = 0\n",
    "EOS_token = 1\n",
    "\n",
    "\n",
    "class Lang:\n",
    "    def __init__(self, name):\n",
    "        self.name = name\n",
    "        self.word2index = {}\n",
    "        self.word2count = {}\n",
    "        self.index2word = {0: \"SOS\", 1: \"EOS\"}\n",
    "        self.n_words = 2  # Count SOS and EOS\n",
    "\n",
    "    def addSentence(self, sentence):\n",
    "        word_list = sentence.replace(\"\\t\", \" \").split(\" \")\n",
    "        for word in word_list:\n",
    "            self.addWord(word)\n",
    "\n",
    "    def addWord(self, word):\n",
    "        if word not in self.word2index:\n",
    "            self.word2index[word] = self.n_words\n",
    "            self.word2count[word] = 1\n",
    "            self.index2word[self.n_words] = word\n",
    "            self.n_words += 1  # increments on each new word\n",
    "        else:\n",
    "            self.word2count[word] += 1  # increments on individual word"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ex_data = open(data_path).readlines()\n",
    "ex_data[300]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ex = Lang(\"ex\")\n",
    "ex.addSentence(ex_data[300])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ex.word2index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ex.word2count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ex.index2word"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have a slight problem here. `I'm` is not a word. Instead, it's two words. In fact, we want our model to be able to understand contractions like this, but we'll first need to strip out all of the punctuation.\n",
    "\n",
    "Too, because this data is `unicode` encoded, we'll need to convert it to `ASCII`. This step will be especially important when we need to work with languages that do not use the Latin alphabet.\n",
    "\n",
    "The code below does both."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def unicodeToAscii(s):\n",
    "    return \"\".join(\n",
    "        c for c in unicodedata.normalize(\"NFD\", s) if unicodedata.category(c) != \"Mn\"\n",
    "    )\n",
    "\n",
    "\n",
    "def normalizeString(s):\n",
    "    s = unicodeToAscii(s.lower().strip())\n",
    "    s = re.sub(r\"([.!?])\", r\" \\1\", s)\n",
    "    s = re.sub(r\"[^a-zA-Z.!?]+\", r\" \", s)\n",
    "    return s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split up I'm and normalized the text\n",
    "# we'll turn i m -> i am soon\n",
    "normalizeString(ex_data[300])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can set up a method to read in the whole file, pair up the aligned sentences and read them into our `Lang` class. We'll keep it as general as possible so we can swap in another dataset later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def readLangs(lang1, lang2, reverse=False):\n",
    "    print(\"Reading lines...\")\n",
    "\n",
    "    # Read the file and split into lines\n",
    "    lines = (\n",
    "        open(f\"data/{lang1}-{lang2}.txt\", encoding=\"utf-8\").read().strip().split(\"\\n\")\n",
    "    )\n",
    "\n",
    "    # Split every line on the tab character and normalize\n",
    "    pairs = [[normalizeString(s) for s in l.split(\"\\t\")] for l in lines]\n",
    "\n",
    "    # Reverse pairs for when we want to go from lang2 to lang1\n",
    "    if reverse:\n",
    "        pairs = [list(reversed(p)) for p in pairs]\n",
    "        input_lang = Lang(lang2)\n",
    "        output_lang = Lang(lang1)\n",
    "    else:\n",
    "        input_lang = Lang(lang1)\n",
    "        output_lang = Lang(lang2)\n",
    "\n",
    "    return input_lang, output_lang, pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also want a way to control the size of the sentence that we'll pass into our translator. Right now, we want to train a translator quickly so we'll set it to be small, but we can increase this for a better translator (that will take longer to train)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MAX_LENGTH = 50\n",
    "\n",
    "# dealing with most contractions\n",
    "eng_prefixes = (\n",
    "    \"i am \",\n",
    "    \"i m \",\n",
    "    \"he is\",\n",
    "    \"he s \",\n",
    "    \"she is\",\n",
    "    \"she s \",\n",
    "    \"you are\",\n",
    "    \"you re \",\n",
    "    \"we are\",\n",
    "    \"we re \",\n",
    "    \"they are\",\n",
    "    \"they re \",\n",
    ")\n",
    "\n",
    "\n",
    "def filterPair(p):\n",
    "    return (\n",
    "        len(p[0].split(\" \")) < MAX_LENGTH\n",
    "        and len(p[1].split(\" \")) < MAX_LENGTH\n",
    "        and p[1].startswith(eng_prefixes)\n",
    "    )\n",
    "\n",
    "\n",
    "def filterPairs(pairs):\n",
    "    return [pair for pair in pairs if filterPair(pair)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The full data processing pipeline is as follows:\n",
    "\n",
    "*   Read text file and split into lines, then split those lines into pairs\n",
    "*   Normalize each pair and filter by length\n",
    "*   Make word lists from the sentence pairs\n",
    "\n",
    "We can stack everything together in the function below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def prepareData(lang1, lang2, reverse=False):\n",
    "    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)\n",
    "    print(\"Read %s sentence pairs\" % len(pairs))\n",
    "    pairs = filterPairs(pairs)\n",
    "    print(\"Trimmed to %s sentence pairs\" % len(pairs))\n",
    "    print(\"Counting words...\")\n",
    "    for pair in pairs:\n",
    "        input_lang.addSentence(pair[0])\n",
    "        output_lang.addSentence(pair[1])\n",
    "    print(\"Counted words:\")\n",
    "    print(input_lang.name, input_lang.n_words)\n",
    "    print(output_lang.name, output_lang.n_words)\n",
    "    return input_lang, output_lang, pairs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_lang, output_lang, pairs = prepareData(\"eng\", \"fra\", True)\n",
    "print(random.choice(pairs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Seq2Seq Model\n",
    "\n",
    "**Useful vocabulary for this section**\n",
    "\n",
    "*   Recurrent Neural Net (RNN) - a network that uses its output sequence as an input for subsequent steps.\n",
    "*   Seq2Seq network  or Encoder Decoder network - a model consisting of two RNNs: (1) an encoder that reads an input sequence and outputs a vector encoding of the sequence and (2) a decoder that reads the encoded vector and outputs a sequence.\n",
    "*   Hidden state - a layer of arbitrary size which comes does not come at the beginning or end of the network.\n",
    "*   [Gated Recurrent Unit (GRU)](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html) - a layer of a neural net which constructs a hidden state at a given time step `t` from the hidden state of time `t-1`. It uses a `tanh` activation and passes all parameters through a sigmoid function.\n",
    "\n",
    "The Seq2Seq network allows us to input an arbitrarily sized sentence in any language into the encoder and have the vector representation produced by the encoder be decoded into another language by the decoder.\n",
    "\n",
    "This architecture, though now used in other contexts, is ideal for machine translation. Even though words may come in different orders or are represented by multiple words in a target language, the transformer will be able to render a translation because it is decoding an vector encoded in a multilingual space.\n",
    "\n",
    "Finally, this architecture is easy(ish) to implement in `PyTorch` as we can first build the encoder and then the decoder then put them together.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Encoder\n",
    "\n",
    "This piece of our Seq2Seq network will be a RNN that outputs some value for every word from the input sentence. For every input word, it will putput a vector and a hidden state, which will be used for the next input word."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class EncoderRNN(nn.Module):\n",
    "    def __init__(self, input_size, hidden_size):\n",
    "        super(EncoderRNN, self).__init__()\n",
    "        self.hidden_size = hidden_size\n",
    "\n",
    "        self.embedding = nn.Embedding(input_size, hidden_size)\n",
    "        self.gru = nn.GRU(hidden_size, hidden_size)\n",
    "\n",
    "    def forward(self, input, hidden):\n",
    "        embedded = self.embedding(input).view(1, 1, -1)\n",
    "        output = embedded\n",
    "        output, hidden = self.gru(output, hidden)\n",
    "        return output, hidden\n",
    "\n",
    "    def initHidden(self):\n",
    "        return torch.zeros(1, 1, self.hidden_size, device=device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To break this down a bit:\n",
    "\n",
    "*   Input is embedded by [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding)\n",
    "*   Input embedding is activated with [`nn.GRU`](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html) and previous hidden state\n",
    "*   `nn.GRU` outputs an output embedding and another hidden state\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Decoder\n",
    "\n",
    "This piece of out Seq2Seq network will be another RNN that take the encoder output (the embedding held by the variable `output` above) and will output a sequence of words that will constitute the translation. We'll touch on two different type of decoders: a simple decoder and an attention decoder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Simple decoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class DecoderRNN(nn.Module):\n",
    "    def __init__(self, hidden_size, output_size):\n",
    "        super(DecoderRNN, self).__init__()\n",
    "        self.hidden_size = hidden_size\n",
    "\n",
    "        self.embedding = nn.Embedding(output_size, hidden_size)\n",
    "        self.gru = nn.GRU(hidden_size, hidden_size)\n",
    "        self.out = nn.Linear(hidden_size, output_size)\n",
    "        self.softmax = nn.LogSoftmax(dim=1)\n",
    "\n",
    "    def forward(self, input, hidden):\n",
    "        output = self.embedding(input).view(1, 1, -1)\n",
    "        output = F.relu(output)\n",
    "        output, hidden = self.gru(output, hidden)\n",
    "        output = self.softmax(self.out(output[0]))\n",
    "        return output, hidden\n",
    "\n",
    "    def initHidden(self):\n",
    "        return torch.zeros(1, 1, self.hidden_size, device=device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's break this down again:\n",
    "\n",
    "*   Input (output of encoder) is embedded by `nn.Embedding`\n",
    "*   This embedding is activated by [`F.ReLu`](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU)\n",
    "*   An output and hidden state are created by passing the activated embedding through `nn.GRU`\n",
    "*   The hidden state is saved and the output is passed through a softmax layer to create probabilities from the embedding\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Attention Decoder\n",
    "\n",
    "**What is attention**\n",
    "\n",
    "As we see above, the only language data being passed from the encoder to the decoder is the single vector output of `nn.GRU`. *Attention* allows the decoder to pay attention to different parts of the encoder's output. First, we'll calculate *attention weights* with a [`nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear) layer using the decoder's input and the hidden state as inputs. These will be multiplied by the encoder output to create a combination (called `attn_applied` below) that will contain information about a specific part of the input and then help the decoder to choose the correct output words.\n",
    "\n",
    "*Note*: There are also other forms of attention, for example 'local attention'.\n",
    "\n",
    "See here for a diagram: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class AttnDecoderRNN(nn.Module):\n",
    "    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):\n",
    "        super(AttnDecoderRNN, self).__init__()\n",
    "        self.hidden_size = hidden_size\n",
    "        self.output_size = output_size\n",
    "        self.dropout_p = dropout_p\n",
    "        self.max_length = max_length\n",
    "\n",
    "        self.embedding = nn.Embedding(self.output_size, self.hidden_size)\n",
    "        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)\n",
    "        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)\n",
    "        self.dropout = nn.Dropout(self.dropout_p)\n",
    "        self.gru = nn.GRU(self.hidden_size, self.hidden_size)\n",
    "        self.out = nn.Linear(self.hidden_size, self.output_size)\n",
    "\n",
    "    def forward(self, input, hidden, encoder_outputs):\n",
    "        embedded = self.embedding(input).view(1, 1, -1)\n",
    "        embedded = self.dropout(embedded)\n",
    "\n",
    "        attn_weights = F.softmax(\n",
    "            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1\n",
    "        )\n",
    "        attn_applied = torch.bmm(\n",
    "            attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0)\n",
    "        )\n",
    "\n",
    "        output = torch.cat((embedded[0], attn_applied[0]), 1)\n",
    "        output = self.attn_combine(output).unsqueeze(0)\n",
    "\n",
    "        output = F.relu(output)\n",
    "        output, hidden = self.gru(output, hidden)\n",
    "\n",
    "        output = F.log_softmax(self.out(output[0]), dim=1)\n",
    "        return output, hidden, attn_weights\n",
    "\n",
    "    def initHidden(self):\n",
    "        return torch.zeros(1, 1, self.hidden_size, device=device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can break this down the same way as with the simple decoder:\n",
    "\n",
    "*   First, the input is embedded using `nn.Embedding`\n",
    "*   Then, we pass this input into our attention layer using `nn.Linear` with the input as our x and the previous hidden state as the A in the linear transformation: $y = xA^{T} + b$ (this should look familiar from linear regression)\n",
    "*   From here, we follow the same steps as in the simple decoder. The attention layer is then combined with the original input embedding and activated by `F.ReLu`\n",
    "*   As above, an output and hidden state are created by passing the activated embedding through `nn.GRU`\n",
    "*   The hidden state is saved and the output is passed through a softmax layer to create probabilities from the embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training\n",
    "\n",
    "Now that we understand the architecture of our network, we can begin training. First, we'll need to convert the indices of our sentence pairs into tensors which can be input into our encoder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def indexesFromSentence(lang, sentence):\n",
    "    return [lang.word2index[word] for word in sentence.split(\" \")]\n",
    "\n",
    "\n",
    "def tensorFromSentence(lang, sentence):\n",
    "    indexes = indexesFromSentence(lang, sentence)\n",
    "    indexes.append(EOS_token)  # appending the EOS token!\n",
    "    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)\n",
    "\n",
    "\n",
    "def tensorsFromPair(pair):\n",
    "    input_tensor = tensorFromSentence(input_lang, pair[0])\n",
    "    target_tensor = tensorFromSentence(output_lang, pair[1])\n",
    "    return (input_tensor, target_tensor)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training function\n",
    "\n",
    "There is one final concept we need to explore before training our Seq2Seq network: *Teacher forcing*. Teacher forcing is when we use the true target outputs (in this case, the correct English language indices) as the each next input, instead of using the decoder's guess for that input. This allows the network to converge faster but can be abused if the dataset is not robust enough. We see that teacher-forced networks have much better understanding of grammar rules, but can stray from the correct translation easier. In fact, it has learned to represent the output grammar well, but not how to create the translation.\n",
    "\n",
    "`PyTorch` lets us implement teacher forcing with a simple `if` statement. Too, we can use `teacher_forcing_ratio` to control how much teacher forcing we want to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "teacher_forcing_ratio = 0.5\n",
    "\n",
    "\n",
    "def train(\n",
    "    input_tensor,\n",
    "    target_tensor,\n",
    "    encoder,\n",
    "    decoder,\n",
    "    encoder_optimizer,\n",
    "    decoder_optimizer,\n",
    "    criterion,\n",
    "    max_length=MAX_LENGTH,\n",
    "):\n",
    "    encoder_hidden = encoder.initHidden()  # initialized the encoder\n",
    "\n",
    "    # zeroes encoder/decoder gradients\n",
    "    encoder_optimizer.zero_grad()\n",
    "    decoder_optimizer.zero_grad()\n",
    "\n",
    "    # input tensors from `tensorsFromPair`\n",
    "    input_length = input_tensor.size(0)\n",
    "    target_length = target_tensor.size(0)\n",
    "\n",
    "    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)\n",
    "\n",
    "    loss = 0\n",
    "\n",
    "    # encoder forward pass\n",
    "    for ei in range(input_length):\n",
    "        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)\n",
    "        encoder_outputs[ei] = encoder_output[0, 0]\n",
    "\n",
    "    # add SOS token to beginning of decoder input\n",
    "    decoder_input = torch.tensor([[SOS_token]], device=device)\n",
    "\n",
    "    decoder_hidden = encoder_hidden\n",
    "\n",
    "    # decoder forward pass\n",
    "    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False\n",
    "\n",
    "    if use_teacher_forcing:\n",
    "        # Teacher forcing: Feed the target as the next input\n",
    "        for di in range(target_length):\n",
    "            decoder_output, decoder_hidden, decoder_attention = decoder(\n",
    "                decoder_input, decoder_hidden, encoder_outputs\n",
    "            )\n",
    "            loss += criterion(decoder_output, target_tensor[di])\n",
    "            decoder_input = target_tensor[di]  # Teacher forcing\n",
    "\n",
    "    else:\n",
    "        # Without teacher forcing: use its own predictions as the next input\n",
    "        for di in range(target_length):\n",
    "            decoder_output, decoder_hidden, decoder_attention = decoder(\n",
    "                decoder_input, decoder_hidden, encoder_outputs\n",
    "            )\n",
    "            topv, topi = decoder_output.topk(1)\n",
    "            decoder_input = topi.squeeze().detach()  # detach from history as input\n",
    "\n",
    "            loss += criterion(decoder_output, target_tensor[di])\n",
    "            if decoder_input.item() == EOS_token:\n",
    "                break\n",
    "\n",
    "    # backward pass for whole Seq2Seq network\n",
    "    loss.backward()\n",
    "\n",
    "    encoder_optimizer.step()\n",
    "    decoder_optimizer.step()\n",
    "\n",
    "    return loss.item() / target_length"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# utilities for timing\n",
    "\n",
    "import time\n",
    "import math\n",
    "\n",
    "\n",
    "def asMinutes(s):\n",
    "    m = math.floor(s / 60)\n",
    "    s -= m * 60\n",
    "    return \"%dm %ds\" % (m, s)\n",
    "\n",
    "\n",
    "def timeSince(since, percent):\n",
    "    now = time.time()\n",
    "    s = now - since\n",
    "    es = s / (percent)\n",
    "    rs = es - s\n",
    "    return \"%s (- %s)\" % (asMinutes(s), asMinutes(rs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# utilities for plotting\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# plt.switch_backend('agg')\n",
    "import matplotlib.ticker as ticker\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def showPlot(points):\n",
    "    plt.figure()\n",
    "    fig, ax = plt.subplots()\n",
    "    # this locator puts ticks at regular intervals\n",
    "    loc = ticker.MultipleLocator(base=0.2)\n",
    "    ax.yaxis.set_major_locator(loc)\n",
    "    plt.plot(points)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# evaluation will do a forward pass in each RNN and return the words which are most likely\n",
    "# to be the translation of the input sentence as determined by softmax probabilities\n",
    "\n",
    "\n",
    "def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):\n",
    "    with torch.no_grad():\n",
    "        input_tensor = tensorFromSentence(input_lang, sentence)\n",
    "        input_length = input_tensor.size()[0]\n",
    "        encoder_hidden = encoder.initHidden()\n",
    "\n",
    "        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)\n",
    "\n",
    "        for ei in range(input_length):\n",
    "            encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)\n",
    "            encoder_outputs[ei] += encoder_output[0, 0]\n",
    "\n",
    "        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS\n",
    "\n",
    "        decoder_hidden = encoder_hidden\n",
    "\n",
    "        decoded_words = []\n",
    "        decoder_attentions = torch.zeros(max_length, max_length)\n",
    "\n",
    "        for di in range(max_length):\n",
    "            decoder_output, decoder_hidden, decoder_attention = decoder(\n",
    "                decoder_input, decoder_hidden, encoder_outputs\n",
    "            )\n",
    "            decoder_attentions[di] = decoder_attention.data\n",
    "            topv, topi = decoder_output.data.topk(1)\n",
    "            if topi.item() == EOS_token:\n",
    "                decoded_words.append(\"<EOS>\")\n",
    "                break\n",
    "            else:\n",
    "                decoded_words.append(output_lang.index2word[topi.item()])\n",
    "\n",
    "            decoder_input = topi.squeeze().detach()\n",
    "\n",
    "        return decoded_words, decoder_attentions[: di + 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluateRandomly(encoder, decoder, n=10):\n",
    "    for i in range(n):\n",
    "        pair = random.choice(pairs)\n",
    "        print(\">\", pair[0])\n",
    "        print(\"=\", pair[1])\n",
    "        output_words, attentions = evaluate(encoder, decoder, pair[0])\n",
    "        output_sentence = \" \".join(output_words)\n",
    "        print(\"<\", output_sentence)\n",
    "        print(\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Training will proceed as follows:\n",
    "\n",
    "*   Start the timer\n",
    "*   Initialize optimizers and loss function\n",
    "*   Create training batch\n",
    "*   Record loss for plotting\n",
    "*   Evaluate our model\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def trainIters(\n",
    "    encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01\n",
    "):\n",
    "    start = time.time()\n",
    "    plot_losses = []\n",
    "    print_loss_total = 0  # Reset every print_every\n",
    "    plot_loss_total = 0  # Reset every plot_every\n",
    "\n",
    "    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)\n",
    "    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)\n",
    "    training_pairs = [tensorsFromPair(random.choice(pairs)) for i in range(n_iters)]\n",
    "    criterion = nn.NLLLoss()\n",
    "\n",
    "    for iter in range(1, n_iters + 1):\n",
    "        training_pair = training_pairs[iter - 1]\n",
    "        input_tensor = training_pair[0]\n",
    "        target_tensor = training_pair[1]\n",
    "\n",
    "        loss = train(\n",
    "            input_tensor,\n",
    "            target_tensor,\n",
    "            encoder,\n",
    "            decoder,\n",
    "            encoder_optimizer,\n",
    "            decoder_optimizer,\n",
    "            criterion,\n",
    "        )\n",
    "        print_loss_total += loss\n",
    "        plot_loss_total += loss\n",
    "\n",
    "        if iter % print_every == 0:\n",
    "            print_loss_avg = print_loss_total / print_every\n",
    "            print_loss_total = 0\n",
    "            print(\n",
    "                \"%s (%d %d%%) %.4f\"\n",
    "                % (\n",
    "                    timeSince(start, iter / n_iters),\n",
    "                    iter,\n",
    "                    iter / n_iters * 100,\n",
    "                    print_loss_avg,\n",
    "                )\n",
    "            )\n",
    "\n",
    "        if iter % plot_every == 0:\n",
    "            plot_loss_avg = plot_loss_total / plot_every\n",
    "            plot_losses.append(plot_loss_avg)\n",
    "            plot_loss_total = 0\n",
    "\n",
    "    showPlot(plot_losses)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hidden_size = 1024\n",
    "encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)\n",
    "attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(\n",
    "    device\n",
    ")\n",
    "\n",
    "trainIters(encoder1, attn_decoder1, 750000, print_every=50)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluateRandomly(encoder1, attn_decoder1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trying a new dataset\n",
    "\n",
    "As I mentioned above, now that we have implemented the transformer architecture, we can now try on a new dataset, one that is comparable in size to the forthcoming Dakota dataset. In this example, I'll use [Breton](https://en.wikipedia.org/wiki/Breton_language), the language spoken by people living in Brittany, currently a part of France but was at one point a sovereign nation with its own language. Breton is one of the few remaining Celtic languages, the speakers of which, similar to the native Americans of this continent thousands of years later, were murdered and forcibly assimilated into Roman society during the conquests of Julius Caesar. Though speakers of Breton are few, they are tenacious.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!unzip bre-eng.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# there's some extra cleaning we need to do to make it look like the eng-fre dataset...\n",
    "!head bre.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# making a new file which we can pass into the data prep process\n",
    "raw = open(\"bre.txt\").readlines()\n",
    "clean = [re.sub(\"(?=\\tCC).*\", \"\", r) for r in raw]\n",
    "clean[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"eng-bre.txt\", \"w\") as f:\n",
    "    for c in clean:\n",
    "        f.write(c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head eng-bre.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# just need to move this file into the data directory\n",
    "!cp eng-bre.txt data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MAX_LENGTH = 50\n",
    "\n",
    "\n",
    "def prepareData(lang1, lang2, trim=True, reverse=False):\n",
    "    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)\n",
    "    print(\"Read %s sentence pairs\" % len(pairs))\n",
    "    if trim:\n",
    "        pairs = filterPairs(pairs)\n",
    "        print(\"Trimmed to %s sentence pairs\" % len(pairs))\n",
    "    print(\"Counting words...\")\n",
    "    for pair in pairs:\n",
    "        input_lang.addSentence(pair[0])\n",
    "        output_lang.addSentence(pair[1])\n",
    "    print(\"Counted words:\")\n",
    "    print(input_lang.name, input_lang.n_words)\n",
    "    print(output_lang.name, output_lang.n_words)\n",
    "    return input_lang, output_lang, pairs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_lang, output_lang, pairs = prepareData(\"eng\", \"bre\", False, True)\n",
    "print(random.choice(pairs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hidden_size = 64\n",
    "encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)\n",
    "attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(\n",
    "    device\n",
    ")\n",
    "\n",
    "trainIters(encoder1, attn_decoder1, 5000, print_every=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluateRandomly(encoder1, attn_decoder1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# can save our models for later using pickle\n",
    "import pickle\n",
    "\n",
    "pickle.dump(encoder1, open(\"breton_english_encoder.p\", \"wb\"))\n",
    "pickle.dump(attn_decoder1, open(\"breton_english_decoder.p\", \"wb\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "\n",
    "Transformers take advantage of the sequential nature of the textual data which they encode and decode. That said, transformers are able to encode and decode much more than just aligned sentence pairs. In fact, they can be used with any sequential data, which ends up being most data in general. For instance, a system like ChatGPT learns how to answer questions through the same process. The question is encoded by an encoder and then the answer is decoded from that vector space by a decoder. Stable diffusion and other image generation models take in text and encode them and then decode an image.\n",
    "\n",
    "Too, encoding embeddings can be swapped out with pre-trained word embeddings from word2vec or GloVe. These encoding layers are amazing resources for research and can offer new insights into large datasets across the social sciences and the humanities."
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [
    "w7XeSWa-_ILZ",
    "tzjYwyy5Ogrn",
    "SafG9zXlePLU"
   ],
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  },
  "mystnb": {
   "execution_mode": "off"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}