!wget "https://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A1999.02.0008" -O atticus.xml
Encoder-only Transformers: Bidirectional Encoder Representations from Transformers (BERT)#
As we saw last week with the decoder-only architecture, attention-based transformers are very good at learning text features and predicting the next token from a sequence of tokens. Today, we will explore the other half of the transformer: the encoder, and encoder-only architectures. In this lesson, we will implement a specific encoder-only transformer called BERT, named after the title of the paper that introduced it: Bidirectional Encoder Representations from Transformers. BERT was the cutting edge of NLP for many years before being unseated by decoder-only transformers, but it is still used in many different applications. As with word2vec, BERT gives us embeddings for individual words. These embeddings can serve as extracted features, allowing us to build further models for tasks like NER and token classification, as we did in week 10.
Parts of the Encoder-only transformer#
The encoder-only transformer is made up of several parts (see schematic below):
Embeddings: Just like word2vec, the RNN and the decoder-only architecture, the encoder-only architecture takes advantage of an embedding layer. As in the decoder-only transformer, there are two different types of embeddings: token embeddings and positional encodings.
Positional Encodings: These are added to the input embeddings to give the model information about the position of each token in the sequence. Like the token embeddings, this is just an embedding layer that learns which areas of the block_size are more important based on the tokens.
Masked Multi-Head Attention: Unlike what we saw with the decoder-only model, we train encoder-only models by masking a certain percentage of tokens in each sequence and having the model guess which tokens we masked. The attention computation itself, however, works exactly the same way!
Feed forward: This layer allows the model to process the information from the attention layer through non-linear transformations, increasing the model’s capacity to learn complex patterns.
Last linear layer: This last linear layer allows the model to make its predictions for the masked tokens in the sequence.
Softmax: As we have seen since word2vec, this function transforms the logits of a linear layer into a probability distribution from which we can sample to get the index of the predicted token.
It is worth noting that a “Block” is made up of the masked multi-head attention, the normalization layers and the feed forward layer. This Block can be repeated many times before a prediction is actually made. In fact, the only difference between smaller and larger models often comes down to how many repetitions of these blocks there are.
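To make the softmax step concrete, here is a minimal sketch with made-up logits for a five-word vocabulary at a single position. The values and the `demo_` names are assumptions for illustration only; they are not taken from the model built later.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a 5-word vocabulary at one position.
demo_logits = torch.tensor([2.0, 0.5, -1.0, 0.0, 1.0])

# Softmax exponentiates and normalizes, so every entry is positive and they sum to 1.
demo_probs = F.softmax(demo_logits, dim=-1)
print(demo_probs, demo_probs.sum())

# The most likely token is the index with the largest logit.
demo_best = int(torch.argmax(demo_probs))
print(demo_best)
```

Softmax preserves the ordering of the logits, so taking the argmax of the probabilities picks the same token as taking the argmax of the raw logits.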

Data#
We are going to follow the decoder-only notebook as closely as possible to show you how similar these two architectures really are. (For truly, they are just two sides of the same coin.) So, just like in that notebook, I will be using Cicero’s Letters to Atticus from Perseus.
Unlike the decoder-only notebook, I will be using a more standard tokenization scheme. Rather than each of our tokens being characters, we will use nltk’s word_tokenize function to tokenize our sentences. In a later lesson, we will see how to create our own very robust tokenizer, but for now, this will do.
# extracting text from XML
from bs4 import BeautifulSoup
import re
with open("atticus.xml", "r") as f:
    soup = BeautifulSoup(f.read(), features="xml")
letters = []
for d in soup.find_all("div2"):
    dateline = d.dateline.extract().get_text().strip()
    salute = d.salute.extract().get_text().strip()
    text = re.sub(r"\s+", " ", d.get_text().strip().replace("\n", ""))
    letters.append(dateline + "\n" + salute + "\n" + text)
text = "\n\n".join(letters)
print(len(text))
print(text[:1000])
import nltk
nltk.download("punkt_tab")
tokenized_text = nltk.word_tokenize(text) # tokenizing the text
print(len(tokenized_text))
print(tokenized_text[:100])
As mentioned above, BERT is a Masked Language Model (MLM) meaning that we mask a certain percentage of tokens and ask the model to fill in the gaps. To that end, we need to add a MASK token which will stand in for the masked tokens.
Also unlike the decoder-only model, we will need a PAD token, so that all of our sequences are the same length. In the decoder, we relied on next token prediction to create batches of training data. In this model, we cannot rely on that, so some sequences will be shorter than others, specifically whenever a sequence is shorter than block_size. In these cases, we can use this PAD token to fill the remaining positions.
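As a small illustration of what padding does, the sketch below right-pads a few made-up token-id sequences to a common length. The helper `pad_to_length`, the id `0` for PAD, and the example sequences are all hypothetical and exist only for this example.

```python
# Hypothetical token-id sequences of different lengths; 0 stands in for the PAD id.
demo_pad_id = 0
demo_block_size = 6
demo_sequences = [[5, 3, 9], [7, 2, 8, 4, 1, 6], [4]]


def pad_to_length(seq, length, pad_id):
    """Truncate or right-pad a list of token ids to exactly `length` items."""
    return seq[:length] + [pad_id] * max(0, length - len(seq))


demo_padded = [pad_to_length(s, demo_block_size, demo_pad_id) for s in demo_sequences]
for row in demo_padded:
    print(row)
```

Once every row has the same length, the rows can be stacked into a single tensor and processed as one batch.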
tokens = list(set([w.lower() for w in tokenized_text])) + [
    "MASK",
    "PAD",
]  # added tokens
print(len(tokens))
print(tokens[:100])
# same as what we saw with the decoder
stoi = {ch: i for i, ch in enumerate(tokens)}
itos = {i: ch for i, ch in enumerate(tokens)}
encode = lambda s: [
    stoi[c.lower()] if c != "MASK" else stoi["MASK"] for c in nltk.word_tokenize(s)
]
decode = lambda l: " ".join([itos[i] for i in l])
print(encode("salve mundus"))
print(decode(encode("salve mundus")))
import torch
data = torch.tensor(encode(text), dtype=torch.long) # tokenizing our data
print(data.shape, data.dtype)
print(data[:1000])
# as before, reserving 10% for validation
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
Now that our data is tokenized, we can work on developing a single method, get_batch, which will collate the data tensor above into multiple training examples.
In the last notebook, this was somewhat straightforward as we knew we were trying to predict the next token based on a given sequence of tokens. Recall, though, we want today’s language model to predict randomly masked tokens. This task will train the token embeddings to match the semantic relationships between words.
Here is how we’ll set up our training examples:
Select a random sequence of training data (just the token numbers)
From this sequence, select a subset of tokens as “masked” tokens, that are covered up and unknown to the model
Return the newly masked sequence (x), the target sequence (y) and the token mask itself, along with any other data structures we need.
To this end, our new get_batch method will need to:
Select a random sequence of training data (this is the same code as we saw in the last lesson).
We will also randomly cut off and ‘pad’ the end of some sequences to give the model different context lengths.
We will then create an ‘attention mask’, which starts off as just 1s, with all of the padded tokens set to 0s. This attention mask is the encoder-only equivalent of the tril mask in the decoder-only model. The mask itself is not learned: it only marks which positions may receive attention. Through the forward and backward passes, the encoder learns how much weight to apply to each of the tokens marked with a 1.
We can then randomly mask some of the tokens, as mentioned above, keeping track of the masked tokens in a specific data structure.
# necessary hyperparameters
batch_size = 4
block_size = 8
vocab_size = len(tokens)
ix = torch.randint(
    len(train_data) - block_size, (batch_size,)
)  # random starting positions in the training data
print(ix)
x = torch.stack([train_data[i : i + block_size] for i in ix])  # will be masked
y = x.clone() # will become targets
pad_token_id = stoi["PAD"]
pad_token_id
mask_token_id = stoi["MASK"]
mask_token_id
# 50/50 chance that the sequence will be cut off and padded with the pad token
# helps the model learn to embed words from a variety of sequence lengths
import random
for i in range(batch_size):
    if random.random() < 0.5:
        pad_length = random.randint(1, block_size // 2)  # random amount to pad
        x[i, -pad_length:] = pad_token_id
        y[i, -pad_length:] = pad_token_id
# attention mask: 1 for real tokens, 0 for padding (fixed, not learned)
attention_mask = (x != pad_token_id).float()
attention_mask
# masking 15% of the tokens in the sequence
mask = torch.rand(x.shape) < 0.15
mask = mask & (x != pad_token_id)
mask
The original BERT paper treated the tokens selected for masking (not all of the tokens) as follows:
80% are replaced with MASK token
10% are replaced with a random token
10% are left unchanged
# 80% are replaced with the MASK token
mask_replace = mask & (torch.rand(x.shape) < 0.8)
mask_replace
# 10% (50% of left over mask tokens) are replaced with a random token
# 10% (other 50% of left over mask tokens) are left unchanged
mask_random = mask & (torch.rand(x.shape) < 0.5) & ~mask_replace
mask_random
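The two-stage draw can be checked numerically. Of the selected tokens, 80% are replaced in the first draw; the second draw of 0.5 only applies to the remaining 20%, so 0.2 × 0.5 = 10% receive a random token and the last 10% stay unchanged. The sketch below simulates this on a large number of hypothetical positions, using a local generator and `demo_` names (both assumptions of this example) so the notebook's own variables and random state are untouched.

```python
import torch

# Local generator so the notebook's global random state is left untouched.
demo_gen = torch.Generator().manual_seed(0)
demo_n = 200_000  # pretend every one of these positions was selected for masking

# Stage 1: about 80% get the MASK token.
demo_replace = torch.rand(demo_n, generator=demo_gen) < 0.8
# Stage 2: half of the positions NOT replaced in stage 1 get a random token.
demo_random = (torch.rand(demo_n, generator=demo_gen) < 0.5) & ~demo_replace
# Whatever is left stays unchanged.
demo_keep = ~demo_replace & ~demo_random

frac_replace = demo_replace.float().mean().item()
frac_random = demo_random.float().mean().item()
frac_keep = demo_keep.float().mean().item()
print(f"MASK: {frac_replace:.3f}  random: {frac_random:.3f}  unchanged: {frac_keep:.3f}")
```

The printed fractions should land close to 0.8, 0.1 and 0.1, though on a real batch of only a few masked tokens the observed split will be much noisier.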
# applying the mask token to selected ids
x[mask_replace] = mask_token_id
x
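# applying the random token to the selected ids
# the upper bound is exclusive, so ids run from 0 to vocab_size - 2 and never include PAD (the last id)
random_tokens = torch.randint(vocab_size - 1, x[mask_random].shape)
# extra safeguard: if a PAD id were ever drawn, swap it for MASK
random_tokens = torch.where(random_tokens == pad_token_id, mask_token_id, random_tokens)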
random_tokens
# pulling it all together into a single tensor
x[mask_random] = random_tokens
x
# return the:
## training example (x), masked tensor
## targets for this example (y)
## attention mask - will change depending on pads and masks
## mask - "answer key" for the targets
x, y, attention_mask, mask
# single function that does all of this
def get_batch(split, mask_ratio=0.15):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = x.clone()
    for i in range(batch_size):
        if random.random() < 0.5:
            pad_length = random.randint(1, block_size // 2)
            x[i, -pad_length:] = pad_token_id
            y[i, -pad_length:] = pad_token_id
    attention_mask = (x != pad_token_id).float()
    mask = torch.rand(x.shape) < mask_ratio
    mask = mask & (x != pad_token_id)
    mask_replace = mask & (torch.rand(x.shape) < 0.8)
    mask_random = mask & (torch.rand(x.shape) < 0.5) & ~mask_replace
    x[mask_replace] = mask_token_id
    random_tokens = torch.randint(vocab_size - 1, x[mask_random].shape)
    random_tokens = torch.where(
        random_tokens == pad_token_id, mask_token_id, random_tokens
    )
    x[mask_random] = random_tokens
    return x, y, attention_mask, mask
# what the tokens look like
xb, yb, attention_mask, pred_mask = get_batch("train")
xb, yb, attention_mask, pred_mask
# what the actual words look like
for b in range(batch_size):
    print(decode(xb[b].tolist()))
    print(decode(yb[b].tolist()))
    print()
Attention#
Our data, though in a different configuration than in the last notebook, is ready for a single forward pass through an attention head. As we saw in the last notebook, a single head of attention is made up of:
Attention mask: in this example this came from our get_batch method. In the last notebook, we used tril to create this.
Three linear projection layers:
Key
Query
Value
A projection layer that projects our weights from head_size to n_embd
In addition to this, to complete a full forward pass we’ll also need:
A token embedding table: these are learnable parameters that will become the word embeddings/vectors.
A positional embedding table: these learnable parameters tell the model where each token sits in the sequence, up to the maximum length block_size.
A feed forward layer: Containing a non-linearity, this layer lets the model capture complex patterns beyond linear transformations.
A final projection layer: This layer takes our weights from n_embd to vocab_size, so that the token most likely to be the masked token receives the highest score.
Cross entropy loss function (negative log likelihood): This is the loss function for choosing one class out of many (here, one token out of the whole vocabulary), as we have seen in the past.
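To see why cross entropy is also called the negative log likelihood, the sketch below computes it twice on made-up logits: once with PyTorch's built-in function and once by hand from log-probabilities. The logits, targets and `demo_` names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for 3 positions over a 4-word vocabulary, with their true token ids.
demo_ce_logits = torch.tensor(
    [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 3.0], [0.5, 0.5, 0.5, 0.5]]
)
demo_ce_targets = torch.tensor([0, 3, 2])

# Built-in: log-softmax followed by negative log likelihood, averaged over positions.
demo_ce_builtin = F.cross_entropy(demo_ce_logits, demo_ce_targets)

# By hand: take log-probabilities and pick out the one assigned to each true token.
demo_log_probs = F.log_softmax(demo_ce_logits, dim=-1)
demo_ce_manual = -demo_log_probs[torch.arange(3), demo_ce_targets].mean()

print(demo_ce_builtin.item(), demo_ce_manual.item())
```

The two numbers agree, which is all the loss is doing: rewarding the model for putting high probability on the true token at each scored position.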
import torch.nn as nn
from torch.nn import functional as F
n_embd = 64
vocab_size = len(tokens)
token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)
tok_emb = token_embedding_table(xb) # token embeddings
pos_emb = position_embedding_table(torch.arange(block_size)) # position embeddings
tok_emb.shape, pos_emb.shape
x = tok_emb + pos_emb # elementwise addition to create x
x.shape
head_size = 16
key = nn.Linear(n_embd, head_size, bias=False)
query = nn.Linear(n_embd, head_size, bias=False)
k = key(x)
k.shape
q = query(x)
q.shape
weights = (
    q @ k.transpose(-2, -1) * head_size**-0.5
)  # transpose k so the matmul yields (B, T, T) attention scores
weights.shape
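The factor `head_size**-0.5` keeps the attention scores at a stable scale. A rough sketch of why, assuming queries and keys with roughly unit-variance entries (an assumption of this example, as are the `demo_` names): the variance of an unscaled dot product grows with `head_size`, and the scaling brings it back to about 1, which keeps the softmax from saturating.

```python
import torch

# Local generator so the notebook's global random state is left untouched.
demo_scale_gen = torch.Generator().manual_seed(0)
demo_head_size = 64

# Hypothetical queries and keys with roughly unit-variance entries.
demo_q = torch.randn(1000, demo_head_size, generator=demo_scale_gen)
demo_k = torch.randn(1000, demo_head_size, generator=demo_scale_gen)

demo_raw = demo_q @ demo_k.T  # variance is roughly head_size
demo_scaled = demo_raw * demo_head_size**-0.5  # variance is roughly 1

print(demo_raw.var().item(), demo_scaled.var().item())
```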
attention_mask = attention_mask.unsqueeze(1).expand(-1, block_size, -1)
attention_mask.shape
weights = weights.masked_fill(attention_mask == 0, float("-inf"))
weights
weights = F.softmax(weights, dim=-1)
weights
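The effect of filling padded positions with negative infinity is easiest to see on a tiny example. In this sketch, one query attends over four keys and the last key is a PAD; the scores and `demo_` names are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical attention scores for one query over 4 keys; the last key is a PAD.
demo_scores = torch.tensor([[1.0, 2.0, 0.5, 3.0]])
demo_key_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])

# Padded keys get -inf, so exp(-inf) = 0 inside the softmax.
demo_masked_scores = demo_scores.masked_fill(demo_key_mask == 0, float("-inf"))
demo_attn = F.softmax(demo_masked_scores, dim=-1)

print(demo_attn)  # last entry is exactly 0; the remaining entries sum to 1
```

Note that the padded key had the largest raw score here, yet it receives no attention at all once the mask is applied.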
value = nn.Linear(n_embd, head_size, bias=False)
v = value(x)
v.shape
out = weights @ v
out.shape
proj = nn.Linear(head_size, n_embd)
out = proj(out)
out.shape
class FeedFoward(nn.Module):
    """two linear layers with a non-linearity in between"""
    def __init__(self, n_embd, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
    def forward(self, x):
        return self.net(x)
ffwd = FeedFoward(n_embd)
out = ffwd(out)
out.shape
lm_head = nn.Linear(n_embd, vocab_size)
logits = lm_head(out)
logits.shape
logits = logits.view(-1, vocab_size)
logits.shape
yb.view(-1).shape
targets = yb.view(-1)
targets.shape
targets
pred_mask = pred_mask.view(-1)
pred_mask.shape
pred_mask
masked_logits = logits[pred_mask]
masked_logits.shape
masked_logits
masked_targets = targets[pred_mask]
masked_targets.shape
masked_targets
loss = F.cross_entropy(masked_logits, masked_targets)
loss
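With a batch this small, it is possible that no position ends up selected for masking, in which case the loss is a mean over zero elements and is undefined (PyTorch typically reports NaN). One way to guard against that is sketched below. The helper `masked_lm_loss` is hypothetical and not part of the notebook's model; it is only an illustration of the idea.

```python
import torch
import torch.nn.functional as F


def masked_lm_loss(logits, targets, pred_mask):
    """Cross entropy over positions where pred_mask is True (hypothetical helper).

    logits: (N, vocab_size) float, targets: (N,) long, pred_mask: (N,) bool.
    Returns a zero that is still connected to `logits` when nothing is masked.
    """
    if not pred_mask.any():
        return logits.sum() * 0.0
    return F.cross_entropy(logits[pred_mask], targets[pred_mask])


demo_loss_logits = torch.zeros(4, 5)  # uniform predictions over 5 tokens
demo_loss_targets = torch.tensor([1, 2, 3, 4])
demo_some = torch.tensor([True, False, True, False])
demo_none = torch.zeros(4, dtype=torch.bool)

demo_loss_some = masked_lm_loss(demo_loss_logits, demo_loss_targets, demo_some)
demo_loss_none = masked_lm_loss(demo_loss_logits, demo_loss_targets, demo_none)
print(demo_loss_some, demo_loss_none)
```

With uniform logits the loss equals the log of the vocabulary size (here log 5), which is also a useful sanity check for the loss of an untrained model.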
Full model#
Below are all of the modules needed to fully construct the BERT model.
Single head of attention#
class Head(nn.Module):
    """one head of self-attention with a padding mask"""
    def __init__(self, head_size, n_embd=64, dropout=0.0):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, attention_mask):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        # scale by the head size, as in the single-head walkthrough above
        weights = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        # (B, T) -> (B, T, T): every query row shares the same key mask
        attention_mask = attention_mask.unsqueeze(1).expand(-1, T, -1)
        weights = weights.masked_fill(attention_mask == 0, float("-inf"))
        weights = F.softmax(weights, dim=-1)
        weights = self.dropout(weights)
        v = self.value(x)
        out = weights @ v
        return out
Multihead attention, feedfoward layer and a single Block#
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size, n_embd=64, dropout=0.0):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(head_size, n_embd, dropout) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, attention_mask):
        out = torch.cat([h(x, attention_mask) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
class FeedFoward(nn.Module):
    """a simple linear layer followed by a non-linearity"""
    def __init__(self, n_embd, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
    def forward(self, x):
        return self.net(x)
class Block(nn.Module):
    """Transformer block: communication followed by computation"""
    def __init__(self, n_embd, n_head, dropout=0.0):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size, n_embd=n_embd, dropout=dropout)
        # pass dropout through so the feed forward layer uses it as well
        self.ffwd = FeedFoward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
    def forward(self, x, attention_mask=None):
        x = x + self.sa(self.ln1(x), attention_mask)
        x = x + self.ffwd(self.ln2(x))
        return x
Final transformer all together#
class Transformer(nn.Module):
    def __init__(
        self, vocab_size, n_embd, n_head, n_layer, block_size, device, dropout=0.0
    ):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head, dropout) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.device = device
    def forward(self, idx, targets=None, attention_mask=None, pred_mask=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=self.device))
        x = tok_emb + pos_emb
        # iterate manually because each block also needs the attention mask
        for block in self.blocks:
            x = block(x, attention_mask)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        if targets is None:
            loss = None
        else:
            # flatten to (B*T, vocab_size) and score only the masked positions
            logits = logits.view(-1, logits.size(-1))
            targets = targets.view(-1)
            pred_mask = pred_mask.view(-1)
            masked_logits = logits[pred_mask]
            masked_targets = targets[pred_mask]
            loss = F.cross_entropy(masked_logits, masked_targets)
        return logits, loss
Training and evaluation#
import torch
import torch.nn as nn
from torch.nn import functional as F
# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel
block_size = 64 # what is the maximum context length for predictions
max_iters = 5000 # number of training iterations (one batch each), not full epochs
eval_interval = 100 # every this many iterations we look at the validation set
learning_rate = 1e-5 # learning rate for the optimizer
device = "cuda" if torch.cuda.is_available() else "cpu" # what device to use
eval_iters = 200 # how many iterations in the evaluation
n_embd = 256 # embedding size
n_head = 8 # attention heads
n_layer = 4 # how many blocks
dropout = 0.1 # amount of dropout
# ------------
model = Transformer(
    n_embd=n_embd,
    n_head=n_head,
    n_layer=n_layer,
    vocab_size=vocab_size,
    block_size=block_size,
    device=device,
    dropout=dropout,  # otherwise the dropout hyperparameter above is never used
)
m = model.to(device)
print(sum(p.numel() for p in m.parameters()) / 1e6, "M parameters")
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y, attention_mask, pred_mask = get_batch(split, mask_ratio=0.15)
            X, Y, attention_mask, pred_mask = (
                X.to(device),
                Y.to(device),
                attention_mask.to(device),
                pred_mask.to(device),
            )
            logits, loss = model(X, Y, attention_mask, pred_mask)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
train_losses = []
valid_losses = []
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
for epoch in range(max_iters):
    if epoch % eval_interval == 0 or epoch == max_iters - 1:
        losses = estimate_loss()
        train_losses.append(losses["train"])
        valid_losses.append(losses["val"])
        print(
            f"step {epoch}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}"
        )
    xb, yb, attention_mask, pred_mask = get_batch("train")
    xb, yb, attention_mask, pred_mask = (
        xb.to(device),
        yb.to(device),
        attention_mask.to(device),
        pred_mask.to(device),
    )
    logits, loss = model(xb, yb, attention_mask, pred_mask)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
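Once the model is trained, its logits can be read off at the MASK positions to see which tokens it would fill in. The sketch below shows the indexing on toy tensors standing in for real model output. The helper `top_k_at_masks`, the toy vocabulary of six ids, and the choice of id 5 as MASK are all assumptions for illustration.

```python
import torch


def top_k_at_masks(logits, input_ids, mask_id, k=3):
    """Return the k highest-scoring token ids for every MASK position (hypothetical helper).

    logits: (B, T, vocab_size), input_ids: (B, T). Output: (num_masked, k).
    """
    at_mask = input_ids == mask_id
    return logits[at_mask].topk(k, dim=-1).indices


# Toy stand-in for model output: a vocabulary of 6 ids, where id 5 plays the role of MASK.
demo_ids = torch.tensor([[0, 5, 2]])
demo_out = torch.zeros(1, 3, 6)
demo_out[0, 1] = torch.tensor([0.1, 0.2, 3.0, 0.0, 1.5, -1.0])

demo_topk = top_k_at_masks(demo_out, demo_ids, mask_id=5, k=2)
print(demo_topk)
```

With the notebook's own objects, the same helper could presumably be fed the logits from `model(xb, attention_mask=attention_mask)` (called without targets, and after `model.eval()`), using `mask_token_id` as the mask id and `decode` to turn the returned ids back into words.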
import matplotlib.pyplot as plt
plt.plot(train_losses, label="train")
plt.plot(valid_losses, label="valid")
plt.legend()
plt.title("Loss")
plt.xlabel("Evaluation checkpoint (every eval_interval iterations)")
plt.ylabel("Loss")
plt.show()
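fero_idx = stoi["fero"]
with torch.no_grad():
    # look the id up on the model's own device, then copy the vector to the CPU
    # (calling .to("cpu") on the layer itself would move it away from the rest of the model)
    fero_embedding = model.token_embedding_table(
        torch.tensor([fero_idx], device=device)
    ).cpu()
fero_embedding.shape
fero_embedding
def get_embedding(word):
    idx = stoi[word]
    with torch.no_grad():
        embedding = model.token_embedding_table(
            torch.tensor([idx], device=device)
        ).cpu()
    return embedding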
visualize_words = [
    # example words - names
    "antonius",
    "caesar",
    "pompei",
    "galba",
    "catilina",
    "cornificius",
    "scipio",
    "lucullus",
    "pontius",
]
embeddings = [get_embedding(word) for word in visualize_words]
visualize_vecs = torch.stack(embeddings)
visualize_vecs = visualize_vecs.squeeze(1).to("cpu").numpy()
visualize_idx = [stoi[word] for word in visualize_words]
import numpy as np
temp = visualize_vecs - np.mean(visualize_vecs, axis=0)
covariance = 1.0 / len(visualize_idx) * temp.T.dot(temp)
U, S, V = np.linalg.svd(covariance)
coord = temp.dot(U[:, 0:2])
for i in range(len(visualize_words)):
    plt.text(
        coord[i, 0],
        coord[i, 1],
        visualize_words[i],
        bbox=dict(facecolor="green", alpha=0.1),
    )
plt.xlim((np.min(coord[:, 0] - 0.5), np.max(coord[:, 0] + 0.5)))
plt.ylim((np.min(coord[:, 1] - 0.5), np.max(coord[:, 1] + 0.5)))
plt.show()
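A 2D projection discards most of the embedding dimensions, so a complementary check is to compare words directly with cosine similarity in the full embedding space. The sketch below does this on a toy four-word table. The helper `nearest_by_cosine` and the toy vectors are hypothetical; with the trained model one could presumably pass `model.token_embedding_table.weight` instead.

```python
import torch
import torch.nn.functional as F


def nearest_by_cosine(weight, idx, k=3):
    """Ids of the k rows of `weight` most similar to row `idx` (hypothetical helper)."""
    weight = weight.detach()
    sims = F.cosine_similarity(weight[idx].unsqueeze(0), weight, dim=-1)
    sims[idx] = float("-inf")  # exclude the word itself
    return sims.topk(k).indices.tolist()


# Toy 4-word embedding table standing in for a trained embedding matrix.
demo_table = torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])

demo_neighbors = nearest_by_cosine(demo_table, idx=0, k=2)
print(demo_neighbors)
```

In the toy table, row 1 points in almost the same direction as row 0, so it comes back first, followed by the orthogonal row 2; the opposite-facing row 3 is left out.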