{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Natural Language Processing: Using Word Vectors for Textual Analysis\n", "\n", "We've seen how neural nets and deep learning can help us answer questions relevant to natural language processing, like Named Entity Recognition. Today, we'll talk about a more basic process that underlies the work we did with `spaCy` in the last lab, namely, the word2vec algorithm.\n", "\n", "In this lab, we will:\n", "\n", "* Understand the goal behind word2vec\n", "* Visualize the complex relationships it can capture\n", "* See what word2vec is doing behind the scenes\n", "* Apply these complex relationships to solve a problem\n", "\n", "This lab seeks to draw together a lot of disparate concepts and show you how neural nets, though they seem intimidating, are a logical outcome of the data science you have already seen in this course.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is word2vec?\n", "\n", "Before we dive into how it works, let's first take a look at what the goal of word2vec is. As the name implies, this very simple neural net seeks to transform words into vectors. As we will see, each unique word in a corpus will have an associated vector attached to it. This vector can be manipulated like any other vector, allowing us to apply complex mathematical operations to word meaning and sense.\n", "\n", "In deep learning, this is called *feature extraction* because we are teaching a model how to extract the linguistic features from a word (though it could be anything, including images or audio). Feature extraction is rarely the end point in analysis. Instead, we can use these extracted features as the inputs to another model which will do some analysis. Before we get there though, let's look at how we can harness the power of word2vec."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install gensim datasets -Uq # gensim handles word2vec for us; datasets downloads our data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The packages we'll need\n", "\n", "`gensim` implements word2vec for us in an incredibly efficient and parallelized manner. This means we don't even need to use GPUs or other advanced techniques to run it.\n", "\n", "In this lab, we'll look at a collection of New York Times headlines that I created. It is a useful dataset, but we need to download it from a place called HuggingFace, using their `datasets` library. (If you're interested in how I built this dataset and how the `datasets` library works, let me know and I'd be happy to share the code with you.)\n", "\n", "But first, we're going to examine a pretrained set of 300-dimensional GloVe vectors, `glove-wiki-gigaword-300`, which was trained on Wikipedia articles and the Gigaword newswire corpus. GloVe is a different algorithm from word2vec, but it produces the same kind of word vectors. Working with this model will give us good intuition for working with the NYT dataset at the end of the lab."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install numpy==1.26.0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import gensim\n", "from datasets import load_dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import gensim.downloader as api\n", "\n", "model = api.load(\"glove-wiki-gigaword-300\") # takes about 3 minutes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this model takes in a word and gives back a vector (word TO vec)\n", "model[\"obama\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(model[\"obama\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model[\"obama\"].shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this model is from about 8 years ago so some things aren't covered\n", "model[\"icespice\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using word vectors\n", "\n", "It's cool that we can turn text into numbers, but what can we do with them?\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Similarity measurement\n", "\n", "The dot product (`np.dot` or `@`) tells us how similar two vectors are. The same is true for word vectors. 
The higher the number the stronger the relationship between the two vectors.\n", "\n", "This is mathematically equivalent to:\n", "${\\displaystyle \\mathbf {a} \\cdot \\mathbf {b} =\\mathbf {a} ^{\\mathsf {T}}\\mathbf {b}}$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# strong relationship\n", "obama_vector = model[\"obama\"]\n", "president_vector = model[\"president\"]\n", "obama_vector.T @ president_vector, np.dot(obama_vector, president_vector)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a less strong relationship\n", "random_vector = model[\"random\"] # random word\n", "obama_vector @ random_vector" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cosine similarity, scaled version of the dot product\n", "# this will ensure that values are between -1 and 1\n", "president_scaled_sim = (\n", " obama_vector\n", " @ president_vector\n", " / (np.linalg.norm(obama_vector) * np.linalg.norm(president_vector))\n", ")\n", "random_scaled_sim = (\n", " obama_vector\n", " @ random_vector\n", " / (np.linalg.norm(obama_vector) * np.linalg.norm(random_vector))\n", ")\n", "\n", "print(president_scaled_sim, random_scaled_sim)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# strongest relationships\n", "# gensim docs: https://radimrehurek.com/gensim/models/keyedvectors.html\n", "model.most_similar(\"obama\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing word vectors\n", "\n", "If you look online for word vectors, you may find 2-d or 3-d representations of word vectors. These can be very interesting to look at and can give us a good intuition for what the word2vec model is doing (before we see exactly what it's doing below)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import TruncatedSVD\n", "import pandas as pd\n", "\n", "# 2-d dimensionality reduction\n", "svd = TruncatedSVD(2)\n", "reduced = svd.fit_transform(model.vectors)\n", "\n", "# putting into a dataframe: one row per word with its 2-d coordinates\n", "df = pd.DataFrame(\n", "    {\"word\": model.index_to_key, \"x\": reduced[:, 0], \"y\": reduced[:, 1]}\n", ")\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.express as px\n", "\n", "fig = px.scatter(df, x=\"x\", y=\"y\", hover_data=[\"word\"])\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# taking a closer look\n", "capitals_countries = [\"paris\", \"france\", \"london\", \"england\", \"rome\", \"italy\"]\n", "cc = df.loc[df.word.isin(capitals_countries)]\n", "fig = px.scatter(cc, x=\"x\", y=\"y\", hover_data=[\"word\"])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that there is an internal logic to these vectors that is detected just from the data itself. We didn't have to prompt or show the model anything. Instead, we just gave it the text and it figured out that there is the same relationship between Paris and France as there is between London and England. Maybe you're starting to see why we sometimes call this 'artificial intelligence'." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### TASK 1\n", "\n", "Because these word vectors are just vectors that represent word meaning, we can do arithmetic on the vectors and expect to see a similar change in meaning. See the examples below and then try a set of semantic analogies for yourself."
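] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the arithmetic concrete before using the real model, here is a minimal sketch on made-up 2-d vectors. The five words and their coordinates are invented for illustration (they are not taken from GloVe), and `toy_analogy` is a hypothetical helper, not part of `gensim`. The first axis loosely stands for \"royalty\" and the second for \"gender\", so king minus man plus woman should land on queen." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# made-up 2-d vectors (illustrative only, not real embeddings)\n", "toy = {\n", "    \"king\": np.array([1.0, 1.0]),\n", "    \"queen\": np.array([1.0, -1.0]),\n", "    \"man\": np.array([0.0, 1.0]),\n", "    \"woman\": np.array([0.0, -1.0]),\n", "    \"apple\": np.array([-1.0, 0.2]),\n", "}\n", "\n", "\n", "def toy_analogy(positive, negative):\n", "    # add the positive vectors and subtract the negative ones\n", "    query = sum(toy[w] for w in positive) - sum(toy[w] for w in negative)\n", "    # score every word that was not part of the question by cosine similarity\n", "    scores = {\n", "        w: v @ query / (np.linalg.norm(v) * np.linalg.norm(query))\n", "        for w, v in toy.items()\n", "        if w not in positive + negative\n", "    }\n", "    # return the best-scoring word\n", "    return max(scores, key=scores.get)\n", "\n", "\n", "toy_analogy(positive=[\"king\", \"woman\"], negative=[\"man\"])"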
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ex1: what are toes minus feet plus hand?\n", "# feet are to toes as hand is to ___\n", "model.most_similar(positive=[\"toes\", \"hand\"], negative=[\"feet\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ex2: what is a king minus a man plus a woman?\n", "# men are to kings as women are to ___\n", "model.most_similar(positive=[\"king\", \"woman\"], negative=[\"man\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ex3: what is a turtle minus a shell plus teeth?\n", "# shells are to turtles as teeth are to ___\n", "model.most_similar(positive=[\"turtle\", \"teeth\"], negative=[\"shell\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# try a couple out for yourself:\n", "model.most_similar(positive=[\"moat\", \"swamp\"], negative=[\"water\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# bonus: try to code this out without the `most_similar` function, don't worry about the similarity score\n", "# hint: if you have an index, you can find the word that goes with it with the model.index_to_key list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p1 = model[\"king\"]\n", "p2 = model[\"woman\"]\n", "n1 = model[\"man\"]\n", "\n", "pos = np.add(p1, p2)\n", "pos.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sub = np.subtract(pos, n1)\n", "sub.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# shapes: (400000, 300) @ (300,) -> (400000,)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sims 
= []\n", "for v in range(len(model.vectors)):\n", "    sims.append(model.vectors[v] @ sub.T)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the same scores in a single matrix multiplication\n", "sim_scores = model.vectors @ sub.T\n", "sim_scores.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# without `most_similar`\n", "# note: this ranks by raw dot product and does not exclude the input words,\n", "# so results can differ from gensim's `most_similar`\n", "def my_most_similar(model, positive, negative, to_see=10):\n", "    # add the positive vectors and subtract the negative ones\n", "    query = sum(model[w] for w in positive) - sum(model[w] for w in negative)\n", "    # indices of the highest-scoring words, best first\n", "    sim = (model.vectors @ query).argsort()[::-1][:to_see]\n", "    return [model.index_to_key[s] for s in sim]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_most_similar(model, positive=[\"toes\", \"hand\"], negative=[\"feet\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How does word2vec work?\n", "\n", "This section will have a lot of explanation and some code, though I don't plan on explaining each line. If you'd like to see an example of word2vec coded out from scratch, let me know. Instead, this section will focus on giving you an intuition for how this works." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training data\n", "\n", "Before we can start training, we need to prepare our data. One of these steps is tokenization, which we'll carry out on the NYT headlines later in this lab.\n", "\n", "### Training objective\n", "First, we need to better understand what the objective of word2vec is. We saw that it could take in some text and convert it into a vector. But this is, ironically, more of a byproduct of the training process than the goal.\n", "\n", "Instead, word2vec seeks to predict which words appear near a given word, such as the word that comes next. (The vectors we got above are the weights that the model learned for each unique word. See below.) 
To do so, we need to arrange our data in a format that lets us train in this way.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Block size\n", "`block_size` represents a moving context window which arranges our text into usable training data.\n", "\n", "\n", "Below is a diagram of this process using a `block_size` of 2 (taken from: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/).\n", "\n", "\n", "![training_data.png]()\n", "\n", "\n", "For the purposes of this model, we would take in the first member of each of these tuples and try to predict the second." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Embedding layer\n", "\n", "Now, how do we take the objective above and get the result we saw from `gensim`?\n", "\n", "An embedding \"layer\" is just a data structure that holds all of the unique words in our texts and, for each one, has an arbitrarily long vector associated with it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torch import nn\n", "\n", "embedl = nn.Embedding(10, 100) # this embedding has 10 words, each with a vector of length 100\n", "embedl.weight.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embedl.weight[0] # random weights of a single \"word\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right now, these numbers are completely random, but the training process will adjust them so that they represent the meaning of the word they are assigned to." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear layer\n", "\n", "A linear layer is a fancy way of making many of the word-level (\"what word comes next?\") predictions at the same time and will look very familiar from linear regression:\n", "\n", "$y_{pred} = xA^{\\mathsf {T}}+b$, where $A$ is our embedding matrix. 
This *linear* transformation will give us the predictions for any piece of the data.\n", "\n", "We can then compare these predictions to the correct values and train this model using gradient descent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using word vectors for analysis\n", "\n", "Now that we see how these vectors are created we can use them to answer a question. Specifically, we'll look at the entire NYT dataset and try to predict, based on just a headline, what section of the Times it's from." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# loading in our (92k) headlines\n", "ds = load_dataset(\"pnadel/nyt_headlines\")\n", "full = ds[\"train\"].to_pandas()\n", "hls = full.headline.to_list()\n", "hls[:10], len(hls)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "full" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "full.label.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization\n", "\n", "Before we can turn our words into vectors, we need to *tokenize* our text. Tokenization is the process of turning our sentences into a list of words. Tokenization can be very complicated (with HuggingFace even providing a `tokenizers` library) but we'll keep it very simple for this example." 
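] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough sketch of what a very simple tokenizer does, the cell below lowercases a headline and keeps only runs of letters. This is not how `gensim` implements its tokenizer; `toy_tokenize` and the example headline are made up for illustration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "\n", "def toy_tokenize(text):\n", "    # lowercase the text, then keep runs of letters;\n", "    # punctuation and digits are dropped\n", "    return re.findall(r\"[a-z]+\", text.lower())\n", "\n", "\n", "toy_tokenize(\"Biden Signs $1.2 Trillion Bill, Again!\")"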
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.utils import simple_preprocess # gensim's native tokenization function\n", "\n", "simple_preprocess(hls[1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using a list comprehension to tokenize our whole list\n", "hls = [simple_preprocess(hl) for hl in hls]\n", "hls[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model training\n", "\n", "If we have a list of lists of tokens (aka a list of tokenized sentences), `gensim` makes it really easy to train a word2vec model.\n", "\n", "But what are all these parameters?\n", "\n", "* `vector_size`: the length of our word vectors; the longer the vector, the more linguistic information it can hold, but it also has a higher chance to overfit\n", "* `window`: the size of the context window that controls the training data (the same idea as `block_size` in the section above)\n", "* `min_count`: if a word occurs fewer times than this number, it is ignored\n", "* `epochs`: how many passes to make over the training data; like `vector_size`, a higher number can give a better model, but it also has a higher chance to overfit\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models import Word2Vec\n", "\n", "# training the model (~1 min)\n", "model = Word2Vec(hls, vector_size=50, window=8, min_count=2, epochs=40)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# vector representing 'biden'\n", "model.wv[\"biden\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# vector representing 'president'\n", "model.wv[\"president\"], model.wv[\"president\"].shape, type(model.wv[\"president\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up our classification model\n", "\n", "If we were to encode all the words in our 
headlines with their individual word embeddings, we wouldn't be able to train a classifier because the number of words in each headline varies. As a result, to compute a single vector for a headline, we'll just take the mean. This is a process called **mean pooling** and is very popular when creating document-level embeddings. Can you think of why this might not be the best option in all cases? What are some alternatives?\n", "\n", "I'll also do some data splits below. Here they are spelled out:\n", "\n", "* First split: `full_vector` -> `df` + `valid` (.33 split)\n", "* Second split: `df` -> `train` + `test` (.33 split)\n", "\n", "At the end, we'll be left with three datasets. We'll train on `train`, test on `test` and then evaluate our classification model with `valid`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### TASK 2\n", "\n", "As mentioned above, we need to take the mean of each set of embeddings of a headline. Below, try to do so by any means that makes sense to you.\n", "\n", "Some useful variables:\n", "\n", "* `hls` - list of lists holding the words of each headline\n", "* `model` - calling `model.wv[word]` for any word will give you the embedding for that word\n", "\n", "At the end, you should have a list of mean embeddings for each headline.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# for each headline we want a 50-long vector that is the mean of all of its word embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# one possible answer: average the vectors of the words the model knows\n", "sent_vectors = []\n", "for hl in hls:\n", "    # words dropped by `min_count` have no vector, so skip them\n", "    vecs = [model.wv[w] for w in hl if w in model.wv]\n", "    # a headline with no known words gets a zero vector\n", "    sent_vectors.append(\n", "        np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)\n", "    )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sent_vectors[0], hls[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### DataFrames and Splits" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from 
sklearn.model_selection import train_test_split\n", "import pandas as pd\n", "\n", "full_vector = pd.DataFrame(sent_vectors)\n", "full_vector[\"section\"] = full[\"label\"]\n", "\n", "# first split\n", "df, valid = train_test_split(full_vector, test_size=0.33, random_state=1337)\n", "df\n", "\n", "# second split\n", "train, test = train_test_split(df, test_size=0.33, random_state=1337)\n", "X_train = train[train.columns[:-1]]\n", "X_test = test[test.columns[:-1]]\n", "\n", "y_train = train[train.columns[-1]]\n", "y_test = test[test.columns[-1]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(train), len(test), len(valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "lr = LogisticRegression(random_state=0)\n", "lr_model = lr.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = lr_model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# simple accuracy on test set\n", "(y_pred == y_test).sum() / len(y_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", "\n", "# confusion matrix\n", "cm = confusion_matrix(y_test, y_pred)\n", "disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=lr_model.classes_)\n", "disp.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random Forest Classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import 
RandomForestClassifier\n", "\n", "rf = RandomForestClassifier()\n", "rf_model = rf.fit(X_train, y_train) # takes about 30 seconds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = rf_model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# simple accuracy\n", "(y_pred == y_test).sum() / len(y_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# confusion matrix\n", "cm = confusion_matrix(y_test, y_pred)\n", "disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rf_model.classes_)\n", "disp.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check on our validation set\n", "X_v = valid[valid.columns[:-1]]\n", "y_v = valid[valid.columns[-1]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "v_pred = rf_model.predict(X_v)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# simple accuracy\n", "(v_pred == y_v).sum() / len(v_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cm = confusion_matrix(y_v, v_pred)\n", "disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rf_model.classes_)\n", "disp.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check where we're wrong\n", "check = pd.DataFrame(\n", " {\"headline\": full.headline.iloc[y_v.index], \"true\": y_v, \"pred\": v_pred}\n", ")\n", "wrong = check.loc[check.true != check.pred]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = wrong.sample(5).apply(\n", " lambda x: print(\n", " f\"Text: {x['headline']}\\nTrue Label: {x['true']}\\nPredicted Label: {x['pred']}\\n---------\"\n", " ),\n", " axis=1,\n", ")" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "\n", "Language modeling can, at times, be both very complicated and very simple. As we saw, the word2vec model attempts to recreate the simple logic that we all have while reading a text, yet the details of individual choices can still seem mystifying.\n", "\n", "View this notebook as an invitation into the realm of natural language processing. I find that online resources for NLP range from incredibly informative to almost worthless, so I've provided some links below to some places to visit if you're interested in learning more:\n", "\n", "* [Andrej Karpathy's YouTube account](https://www.youtube.com/@AndrejKarpathy) is an amazing resource for learning how to design NLP models of the generation after Word2Vec.\n", "* [Jeremy Howard's fast.ai course](https://course.fast.ai/) is designed for data scientists to learn about deep learning.\n", "* [HuggingFace's NLP course](https://huggingface.co/course/chapter1/1) gives a practical walkthrough of the HuggingFace NLP framework.\n", "\n", "And as always, you can feel free to email me with a question or to set up a meeting at peter.nadel@tufts.edu.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" }, "mystnb": { "execution_mode": "off" } }, "nbformat": 4, "nbformat_minor": 0 }