!pip install sentence_transformers plotly -Uq
!pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122 -Uq
!wget https://tufts.box.com/shared/static/325sgkodnq30ez61ugazvctif6r24hsu.csv -O daf.csv

Introducing Semantic Search

Information retrieval is a large and complicated field. In this notebook, we’ll look at the steps involved in a specific information retrieval algorithm called “semantic search,” which employs a language model to compare the similarity of a search query to chunks of original data. The steps are as follows:

1. Load our model
2. Read in and chunk our data
3. Embed the chunks
4. Take in and embed our user query
5. Take the dot product between our user query and our document embeddings
6. Align relevant indices with original chunked data
7. Return chunks to the user or another process

At the end of the notebook, we’ll pass this information that we retrieved to an LLM and complete a process called Retrieval Augmented Generation (RAG).
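The retrieval steps listed above can be sketched end to end on toy data. This is only an illustration: `toy_embed` is a hypothetical stand-in for a real embedding model (it counts words instead of modeling meaning), and the three chunks are made up for the example.

```python
import numpy as np


def toy_embed(texts, vocab):
    """Hypothetical stand-in for an embedding model: word counts, scaled to unit length."""
    vectors = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            if word in vocab:
                vectors[row, vocab[word]] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


# steps 1-2: "load" the toy model and chunk our data
chunks = [
    "the goths crossed the danube",
    "the senate met in rome",
    "the goths sacked rome",
]
vocab = {w: i for i, w in enumerate(sorted({w for c in chunks for w in c.split()}))}

chunk_vecs = toy_embed(chunks, vocab)              # step 3: embed the chunks
query_vec = toy_embed(["goths danube"], vocab)[0]  # step 4: embed the query
scores = query_vec @ chunk_vecs.T                  # step 5: dot product
top_idx = scores.argsort()[::-1][:2]               # step 6: indices of the best matches
top_chunks = [chunks[i] for i in top_idx]          # step 7: return the chunks
print(top_chunks)
```

The rest of the notebook follows the same shape, with a real embedding model in place of `toy_embed`.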

Some key concepts in semantic search

Masked Language Modeling: The type of language modeling that we use for semantic search may seem confusing because it is unlike the modeling we have done in other notebooks. That said, it is more similar than it might seem. As we will see, the models we use for this task take in a string (usually representing a sentence or paragraph) and output a vector of numbers. Unlike generative models, they do not produce more text or images; instead, they show us how they interpret language. The vectors and matrices that these models produce (called embeddings) represent how the model understands the text we give it. In training, rather than predicting the next token, the model is given a full sentence in which a random selection of words has been replaced with a special mask token, and it has to guess the masked words. Because the model can use the context on both sides of each masked word, this type of training tends to give it an internal sense of semantic meaning that is well suited to comparing whole sentences.
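To make the masking idea concrete, here is a minimal sketch of how a training example might be prepared. The function name, the `[MASK]` string, and the masking probability are assumptions for illustration; real training pipelines work on subword tokens and add further tricks, so this is not the procedure used for any particular model.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; return the masked sequence and the hidden targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[position] = token  # what the model would be asked to predict
        else:
            masked.append(token)
    return masked, targets


tokens = "the goths crossed the danube in great numbers".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

During training, the model sees `masked` and is scored on how well it recovers the words stored in `targets`.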

Dot Product: Once we have generated embeddings for our source documents and our query string, we need some way of comparing them. We would like a function that takes in a vector and a matrix of compatible sizes and returns how similar each row of the matrix is to the vector. Thankfully, in linear algebra, this exact function exists. It is called the “dot product” (because we normalize our embeddings to unit length, the dot product here is equivalent to cosine similarity). Given a vector, \(V\), of size (1, N) and a matrix, \(M\), of size (K, N), \(V \cdot M^{T}\) will return a row vector of size (1, K). Each element of this new vector will be a score from -1 to 1 which represents how similar \(V\) is to the corresponding row of \(M\). More details to follow.
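A small numeric example may help here. The values below are made up, with N = 2 and four unit-length rows in \(M\), chosen so the scores are easy to check by hand.

```python
import numpy as np

V = np.array([[1.0, 0.0]])  # query vector, shape (1, 2)
M = np.array(
    [
        [1.0, 0.0],   # same direction as V
        [0.0, 1.0],   # orthogonal to V
        [-1.0, 0.0],  # opposite direction
        [0.6, 0.8],   # partially aligned
    ]
)  # document matrix, shape (4, 2)

scores = V @ M.T  # (1, 2) @ (2, 4) -> (1, 4)
print(scores.shape)
print(scores)
```

Identical directions score 1, orthogonal ones 0, and opposite ones -1; everything else falls in between.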

Data and model prep

# imports
import re
from pprint import pprint

import nltk
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import torch
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# tokenizer data needed by nltk.sent_tokenize
nltk.download("punkt_tab")

# loading our embedding model
model = SentenceTransformer("BAAI/bge-m3", trust_remote_code=True)

df = pd.read_csv("daf.csv")
df  # our data

df = df.drop("footnotes", axis=1)
df["sentences"] = df["text"].apply(nltk.sent_tokenize)
sentences = df.explode("sentences")
# removing all short sentences (fewer than 25 characters)
mask = sentences["sentences"].apply(lambda x: len(x) < 25)
sentences = sentences[~mask]

sentences

Below we begin a process called ‘embedding’, where we take our individual sub-documents (in this case, each sentence from the Decline and Fall) and pass them through our embedding model. As mentioned above, this model is trained to output a representation of the given strings in multi-dimensional space in the form of vectors. When we give a model like this multiple sentences to embed, it outputs multiple vectors, all stacked on top of each other. This vertical arrangement of row vectors is also called a matrix and in this case has the shape: number of inputs x the model’s hidden state dimension (this number is fixed by the model’s architecture, so we have no control over it).

embeddings = model.encode(
    sentences.sentences.to_list(),  # our sentences
    batch_size=64,  # high batch size = faster embedding, more VRAM
    show_progress_bar=True,
    device="cuda",
    normalize_embeddings=True,  # scales each embedding to unit (L2) length, so dot products equal cosine similarities
)

embeddings.shape  # number of documents x the model's hidden state dimension.

embeddings[0]  # single vector representing the first sentence in our list

# an embedding is a single vector of the size of the model's hidden state
embeddings[0].shape
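To see what `normalize_embeddings=True` buys us, here is a small sketch with made-up two-dimensional vectors standing in for real embeddings. It shows that once each row is scaled to unit length, a plain dot product gives the same number as cosine similarity on the raw vectors. The `cosine` helper is defined here only for the comparison.

```python
import numpy as np

# toy un-normalized "embeddings" (values chosen for easy arithmetic)
raw = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])

# scale each row to unit L2 length
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)


def cosine(a, b):
    """Cosine similarity computed directly from un-normalized vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


dot_after = float(unit[0] @ unit[1])  # dot product of normalized vectors
cos_before = cosine(raw[0], raw[1])   # cosine similarity of the raw vectors
print(np.linalg.norm(unit, axis=1), dot_after, cos_before)
```

This is why the similarity scores later in the notebook stay between -1 and 1.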

Digression: Visualizing Embeddings

To build a better intuition for what embeddings are and how they work, we will use some simple data visualization techniques to see what these embeddings tell us about the underlying data.

# using PCA to project our 1024-dimensional vectors down to 2 dimensions
pca = PCA(n_components=2)
pca.fit(embeddings)
X = pca.transform(embeddings)
X.shape  # 7880, 1024 -> 7880, 2

# making a dataframe to visualize the embeddings with the original sentences
plotting = pd.DataFrame(
    {
        "x": X[:, 0],
        "y": X[:, 1],
        "title": sentences.title,
        "sentence": sentences.sentences,
    }
)
plotting["sentence"] = (
    plotting["sentence"].str.wrap(100).apply(lambda x: x.replace("\n", "<br>"))
)

fig = px.scatter(plotting, x="x", y="y", hover_data="sentence")
fig.show()

In the scatter plot above, each dot represents a single embedding, which in turn represents a single sentence. Because the model places sentences with similar meanings close together, similar sentences tend (though not always) to get grouped together after projection. This creates clusters and subclusters of similar sentences. This internal structure of the embeddings will help us conduct information retrieval.
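For readers curious about what the PCA step does, the sketch below reproduces the idea with plain NumPy on synthetic data: center the vectors, find the directions of greatest variance with an SVD, and project onto the first two. The two random clusters are invented for the example, and the result should match a library PCA only up to the sign of each axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "embeddings": two clusters of 20 points each in 5 dimensions
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 5))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(20, 5))
toy = np.vstack([cluster_a, cluster_b])

# PCA by hand: center, decompose, keep the top two directions
centered = toy - toy.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
projected = centered @ components[:2].T  # (40, 5) -> (40, 2)

print(projected.shape)
print(projected[:20, 0].mean(), projected[20:, 0].mean())
```

The two clusters land on opposite sides of the first axis, which is the same effect that produces the visible groupings in the scatter plot.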

Query-based retrieval

Now that we have some intuition for how embeddings work, we can put them to the test with a sample query.

Below we will use an extra string called retrieval_instruction. Often when we take in a query from the user, it will be much shorter than the typical document in our sentence list. Prepending this extra string to the user query makes the query more comparable to the documents in our embeddings.

retrieval_instruction = "Represent this sentence for searching relevant passages: "
query = "Who were the Goths"
query_embedding = model.encode(
    retrieval_instruction + query, device="cuda", normalize_embeddings=True
)

query_embedding.shape  # just like a single embedding from above

# relevancy measure: dot product
# (n,) @ (n, o) = (o,), in our case: (1024,) @ (1024, 7880)
sim_vector = query_embedding @ embeddings.T
sim_vector.shape  # (7880,): one similarity score between the query and each sentence in our original list

# argsort returns the indices that would sort the array in ascending order
sim_vector.argsort()

sim_vector.argsort()[::-1]  # reversed: indices in descending order of similarity

k = 20
rel_idx = sim_vector.argsort()[::-1][:k]  # selects top k indices from the array
rel_idx

# get back our sentences by position (avoids rebuilding the list for every index)
rel_chunks = sentences.sentences.iloc[rel_idx].to_list()
rel_chunks  # read through these to verify that we're on the right track
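The ranking steps above can be wrapped in a small helper that also returns the scores, which is handy when deciding whether the top results are actually relevant. The function name `top_k` and the toy vectors are hypothetical; the logic is the same argsort, reverse, and slice used in the cells above.

```python
import numpy as np


def top_k(query_vec, doc_matrix, k):
    """Return (indices, scores) for the k rows of doc_matrix most similar to query_vec."""
    scores = doc_matrix @ query_vec       # one score per document
    order = np.argsort(scores)[::-1][:k]  # best matches first
    return order, scores[order]


# toy unit-length "document embeddings" and a query
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
idx, top_scores = top_k(np.array([0.0, 1.0]), docs, k=2)
print(idx, top_scores)
```

With the notebook's variables, a call along the lines of `top_k(query_embedding, embeddings, 20)` should give the same indices as `rel_idx`.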

Retrieval Augmented Generation (RAG)

Semantic search is interesting and useful by itself, but recently it has taken on a new importance. Users of modern AI systems are always seeking new ways to condition AI output on relevant data. Semantic search offers a good way of dealing with this problem and thus constitutes the first phase of a process called Retrieval Augmented Generation, or RAG: first we use semantic search to get relevant documents, and then we pass those documents to an LLM in a prompt. Below is a quick example of doing so.

# loading our LLM
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-7B-Instruct-GGUF",
    filename="*q4_0.gguf",
    verbose=True,
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=3000,
)

# RAG prompt, feel free to change and see the differences
base_prompt = """
# Question answering task
You are a helpful AI assistant that is skilled at answering user questions based on a given context.

## User question
{question}

## Context
{context}
""".strip()

message = [
    {
        "role": "user",
        "content": base_prompt.format(
            question=query,  # our query from above
            context="\n".join(rel_chunks),  # relevant chunks
        ),
    }
]

# may take some time (~5-10 minutes)
text = llm.create_chat_completion(message, max_tokens=-1)

pprint(text["choices"][0]["message"]["content"])  # output
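One practical concern with the prompt above is that the retrieved chunks must fit in the model's context window. The sketch below shows one simple way to cap the context, keeping the highest-ranked chunks until a budget is reached. The helper `build_context` and the `max_chars` parameter are hypothetical, and counting characters is only a rough proxy for the token limit set by `n_ctx`.

```python
def build_context(chunks, max_chars=2000):
    """Join chunks in ranked order, stopping before the total length exceeds max_chars."""
    selected, used = [], 0
    for chunk in chunks:
        cost = len(chunk) + 1  # +1 for the newline separator
        if used + cost > max_chars:
            break
        selected.append(chunk)
        used += cost
    return "\n".join(selected)


# toy chunks, already sorted from most to least relevant
example_chunks = ["a" * 10, "b" * 10, "c" * 10]
context = build_context(example_chunks, max_chars=25)
print(context)
```

In the notebook, the result of `build_context(rel_chunks)` could be passed as the `context` argument in place of `"\n".join(rel_chunks)`.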

Conclusion

In this notebook, we have begun an exploration of embeddings, but there is much more to understand. In future lessons, we’ll see other ways to use document-level embeddings and train our own embedding model for languages other than English. If you are interested in exploring more, I would check out the documentation of the package we used to load the embedding model, Sentence Transformers (SBERT). It has a lot of good articles on semantic search and other applications.