{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "DATASETS_DIR=\"utils/datasets\"\n", "mkdir -p $DATASETS_DIR\n", "cd $DATASETS_DIR\n", "\n", "# Get Stanford Sentiment Treebank\n", "if hash wget 2>/dev/null; then\n", " wget http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip\n", "else\n", " curl -L http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip -o stanfordSentimentTreebank.zip\n", "fi\n", "unzip stanfordSentimentTreebank.zip\n", "rm stanfordSentimentTreebank.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# word2vec using `numpy`\n", "\n", "In this notebook, I'll walk you through how to implement the popular and flexible word2vec model using `numpy`.\n", "\n", "As well as showing how word2vec works, I also want to present to you a workflow that facilitates experimentation and ease of use that we'll carry through into all of our lessons on neural nets and deep learning. There are off-the-shelf implementations of this algorithm, but writing our own will help us better understand it, `numpy` and Python programming more generally.\n", "\n", "* First, we'll need to work with the unique challenges of text data. We will develop some Python helper objects that will help us to convert our textual sources into tensors which can be mathematically manipulated.\n", "* Next, we'll examine the word2vec Continuous Bag of Words (CBoW) and Skipgram architectures and how they work. 
In doing so, we'll build out the `numpy` modules needed to turn these architectures from theory to practice.\n", "* Last, we'll create a standard training loop which we can alter and experiment with.\n", "\n", "The learning objectives for this notebook are as follows:\n", "* Understand the advantages and disadvantages of neural approaches in NLP, and how this relates to our previous use of word2vec.\n", "* Train and interpret your own word2vec model using a non-English language.\n", "* Study the basics of the mathematical underpinnings of deep learning, including backpropagation.\n", "* Learn how to run deep learning models efficiently.\n", "\n", "## Motivation\n", "\n", "Before we dive into how it works, let's first take a look at what the goal of word2vec is. As the name implies, this very simple neural net seeks to transform words into vectors. A **vector**, in this sense, refers to a list of numbers whose values represent the meaning of a given word.\n", "\n", "This probably sounds a little funky... Why do we need to convert words we know the meaning of into lists of numbers whose meaning is hard to grasp? Ultimately, we want to give our computer a way of understanding text it hasn't seen before, and unlike us, a computer can't use text to learn meaning. *It can only use numbers*. What we really want is a black box that we can give a word and that will spit out the meaning of that word to the computer as a list of numbers. In other words, we want to *model* word meaning. This is what word2vec does.\n", "\n", "As we will see, each unique word in our text will have a vector associated with it. This vector can be manipulated like any other vector, allowing us to apply complex mathematical operations to word meaning and sense.\n", "\n", "In deep learning, this is called *feature extraction* because we are teaching a model how to extract the linguistic features from a word (though it could be anything, including images or audio). 
Feature extraction is rarely the end point in analysis. Instead, we can use these extracted features as the inputs to another model which will do some analysis. Before we get there though, let's look at how we can harness the power of word2vec.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import random" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\n", "\n", "For this example, we will be using the Stanford Sentiment Treebank. This dataset consists of sentences excerpted from English-language movie reviews. They have been marked for sentiment value by human annotators, but we won't be using that data for this lesson. Let's take a look at what some of the data looks like. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = \"utils/datasets/stanfordSentimentTreebank\"\n", "with open(f\"{path}/datasetSentences.txt\", \"r\") as f:\n", "    sentences = f.readlines()\n", "\n", "for sentence in sentences[:6]:\n", "    print(sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These sources are in English (obviously), but the word2vec method is not tied to a particular language or dialect. What makes this framework so successful is that it is flexible and not language dependent. To that end, you will apply this framework to a non-English language of your choice for your assignment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization\n", "\n", "We need to do some processing to make these sentences usable and then we need to do some tokenization. We've talked a bit about tokenization before, but in this case we need to do it ourselves. We'll also need to skip the first line because it is just the column headings.\n", "\n", "In the code below, we use \"whitespace tokenization\", that is, we split the sentence up on whitespace (called with no arguments, `.split` splits on any run of whitespace). 
This strategy is very easy and works well for English, but may not work as well for other languages. We'll revisit multi-lingual tokenization later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Before Tokenization: \", sentences[6])\n", "print(\n", " \"After Tokenization: \", sentences[6].strip().split()[1:]\n", ") # \"whitespace tokenization\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = \"utils/datasets/stanfordSentimentTreebank\"\n", "\n", "\n", "def get_sentences():\n", " sentences = []\n", " with open(f\"{path}/datasetSentences.txt\", \"r\") as f:\n", " first = True\n", " for line in f:\n", " if first:\n", " first = False\n", " continue\n", " split = line.strip().split()[1:]\n", " sentences += [[w.lower() for w in split]]\n", " sent_lens = np.array([len(s) for s in sentences])\n", " cum_sent_lens = np.cumsum(sent_lens)\n", " return sentences, sent_lens, cum_sent_lens\n", "\n", "\n", "sentences, sent_lens, cum_sent_lens = get_sentences()\n", "\n", "for sent in sentences[:3]:\n", " print(sent)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(sentences), len(sentences[0]) # a list of list of strings (word/tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating our training dataset: Collation\n", "\n", "We have done a good job in getting our data from the downloaded files and then tokenizing them, but unfortunately our dataset is far from usable to train a model. As I mentioned above, computers can't work directly with string or text data. Instead we will somehow need to convert our text into numbers, and not just that, we'll also need to arrange these numbers so that all of the lists of words are the same length. This means that even though we have a bunch of different sentences with different sizes, we need to standardize them to a single size. 
This process is called **collation**.\n", "\n", "This size is called `block_size`, `context_size`, or `context_window`. You can think about it as the memory of the model. When the model is making predictions about a word, it will only be able to use this context. As a result, we want `block_size` to be as high as possible, but we are limited by computational resources.\n", "\n", "We will choose an arbitrary word as a *center* word. Then we get the `block_size` words before and after the center word. From this context, we want our model to predict the center word. This is how the model will learn (more on this in a bit). We will end up comparing the correct center word and the predicted center word and adjusting our model according to how similar or dissimilar they are.\n", "\n", "\n", "Below is a diagram of this process using a `block_size` of 2 (taken from: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/).\n", "\n", "![training_data.png]()\n", "\n", "\n", "\n", "So first, we need to standardize our sentences so that they all have the same context size. We will set a center token and then retrieve up to `C` words before and after it (fewer when the center word is near the start or end of its sentence)."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_random_context(C=5):\n", " sent_id = random.randint(0, len(sentences) - 1)\n", " sent = sentences[sent_id]\n", " word_id = random.randint(0, len(sent) - 1)\n", "\n", " context = sent[max(0, word_id - C) : word_id]\n", " if word_id + 1 < len(sent):\n", " context += sent[word_id + 1 : min(len(sent), word_id + C + 1)]\n", "\n", " center = sent[word_id]\n", " context = [w for w in context if w != center]\n", "\n", " if len(context) > 0:\n", " return center, context\n", " else:\n", " return get_random_context(C)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "center, context = get_random_context()\n", "center, context # we will use the context words to predict the center word" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"But,\" you may be asking yourself, \"this doesn't solve the problem about computers not being able to read text at all! We still have strings! What gives, Peter >:(\"\n", "\n", "Good point! That part is a bit easier: Because our center word is picked randomly, we can assign each word in our text a number somewhat at random. In fact, we will go sequentially and assign a *token id* to each word that we haven't seen before. We will create two dictionaries: one with keys that are each word and values are their token id, and another with keys that are the token ids and the values are their words. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_tokens():\n", "    tokens = {}\n", "    tok_freq = {}\n", "    word_count = 0\n", "    rev_tokens = []\n", "    idx = 0\n", "\n", "    for sent in sentences:\n", "        for w in sent:\n", "            word_count += 1\n", "            if w not in tokens:\n", "                tokens[w] = idx\n", "                rev_tokens += [w]\n", "                idx += 1\n", "                tok_freq[w] = 1\n", "            else:\n", "                tok_freq[w] += 1\n", "\n", "    tokens[\"UNK\"] = idx\n", "    rev_tokens += [\"UNK\"]\n", "    tok_freq[\"UNK\"] = 1\n", "    word_count += 1\n", "    return tokens, tok_freq, rev_tokens, word_count\n", "\n", "\n", "tokens, tok_freq, rev_tokens, word_count = get_tokens()\n", "len(tokens), len(tok_freq), len(rev_tokens), word_count" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Center word: *\", center, \"* Center word token id: \", tokens[center])\n", "for i in range(len(context)):\n", "    print(\n", "        \"Context word: *\", context[i], \"* Context word token id: \", tokens[context[i]]\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Negative Sampling\n", "\n", "We'll come back to this if we have time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the data came with some splits in our data\n", "# we can apply them with this function\n", "def dataset_split():\n", "    split = [[] for _ in range(3)]\n", "    with open(f\"{path}/datasetSplit.txt\", \"r\") as f:\n", "        first = True\n", "        for line in f:\n", "            if first:\n", "                first = False\n", "                continue\n", "            # each row is \"sentence_index,split_label\"; keep it in its own\n", "            # variable so we don't overwrite the `split` list of lists\n", "            parts = line.strip().split(\",\")\n", "            split[int(parts[1]) - 1] += [int(parts[0]) - 1]\n", "    return split\n", "\n", "\n", "split = dataset_split()\n", "len(split), len(split[0]), len(split[1]), len(split[2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table_size = 1e8\n", "\n", "\n", "def sampleTable():\n", "    tokens_num = len(tokens)\n", "
sampling_freq = np.zeros((tokens_num,))\n", "\n", " i = 0\n", " for w in range(tokens_num):\n", " w = rev_tokens[i]\n", " if w in tok_freq:\n", " freq = 1.0 * tok_freq[w]\n", " freq = freq**0.75\n", " else:\n", " freq = 0.0\n", " sampling_freq[i] = freq\n", " i += 1\n", "\n", " sampling_freq /= np.sum(sampling_freq)\n", " sampling_freq = np.cumsum(sampling_freq) * table_size\n", "\n", " sample_table = np.zeros((int(table_size),))\n", "\n", " j = 0\n", " for i in range(int(table_size)):\n", " while i > sampling_freq[j]:\n", " j += 1\n", " sample_table[i] = j\n", "\n", " return sample_table\n", "\n", "\n", "sample_table = sampleTable()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def reject_prob():\n", " threshold = 1e-5 * word_count\n", " reject_prob = np.zeros((len(tokens),))\n", " for i in range(len(tokens)):\n", " w = rev_tokens[i]\n", " freq = 1.0 * tok_freq[w]\n", " reject_prob[i] = max(0, 1 - np.sqrt(threshold / freq))\n", " return reject_prob\n", "\n", "\n", "reject_prob = reject_prob()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Our complete dataset object\n", "\n", "Now that we have coded out the data specific functions, we can compile it all into a single class from which we can call these functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class StanfordSentiment:\n", " \"\"\"\n", " Class for reading and loading Stanford Sentiment Treebank. 
We ignore the sentiment component of the treebank and extract just the text.\n", " \"\"\"\n", "\n", " def __init__(self, path=None, table_size=1000000):\n", " if not path:\n", " path = \"utils/datasets/stanfordSentimentTreebank\"\n", "\n", " self.path = path\n", " self.table_size = table_size\n", "\n", " self.get_sentences()\n", " self.get_tokens()\n", " self.get_all_sentences()\n", " self.dataset_split()\n", " self.sampleTable()\n", "\n", " def get_tokens(self):\n", " if hasattr(self, \"tokens\") and self.tokens:\n", " return self.tokens\n", "\n", " tokens = {}\n", " tok_freq = {}\n", " word_count = 0\n", " rev_tokens = []\n", " idx = 0\n", "\n", " for sent in self.sentences:\n", " for w in sent:\n", " word_count += 1\n", " if w not in tokens:\n", " tokens[w] = idx\n", " rev_tokens += [w]\n", " idx += 1\n", " tok_freq[w] = 1\n", " else:\n", " tok_freq[w] += 1\n", "\n", " tokens[\"UNK\"] = idx\n", " rev_tokens += [\"UNK\"]\n", " tok_freq[\"UNK\"] = 1\n", " word_count += 1\n", "\n", " self.tokens = tokens\n", " self.tok_freq = tok_freq\n", " self.rev_tokens = rev_tokens\n", " self.word_count = word_count\n", " return self.tokens\n", "\n", " def get_sentences(self):\n", " if hasattr(self, \"sentences\") and self.sentences:\n", " return self.sentences\n", "\n", " sentences = []\n", " with open(f\"{self.path}/datasetSentences.txt\", \"r\") as f:\n", " first = True\n", " for line in f:\n", " if first:\n", " first = False\n", " continue\n", " split = line.strip().split()[1:]\n", " sentences += [[w.lower() for w in split]]\n", " sent_lens = np.array([len(s) for s in sentences])\n", " cum_sent_lens = np.cumsum(sent_lens)\n", "\n", " self.sentences = sentences\n", " self.sent_lens = sent_lens\n", " self.cum_sent_lens = cum_sent_lens\n", " return sentences\n", "\n", " def get_reject_prob(self):\n", " if hasattr(self, \"reject_prob\") and self.reject_prob:\n", " return self.reject_prob\n", "\n", " threshold = 1e-5 * self.word_count\n", " reject_prob = 
np.zeros((len(self.tokens),))\n", "        n_tokens = len(self.tokens)\n", "        for i in range(n_tokens):\n", "            w = self.rev_tokens[i]\n", "            freq = 1.0 * self.tok_freq[w]\n", "            reject_prob[i] = max(0, 1 - np.sqrt(threshold / freq))\n", "        self.reject_prob = reject_prob\n", "        return reject_prob\n", "\n", "    def get_all_sentences(self):\n", "        if hasattr(self, \"all_sentences\") and self.all_sentences:\n", "            return self.all_sentences\n", "\n", "        sentences = self.get_sentences()\n", "        reject_prob = self.get_reject_prob()\n", "        tokens = self.get_tokens()\n", "        all_sentences = [\n", "            [\n", "                w\n", "                for w in s\n", "                if 0 >= reject_prob[tokens[w]]\n", "                or random.random() >= reject_prob[tokens[w]]\n", "            ]\n", "            for s in sentences * 30\n", "        ]\n", "        all_sentences = [s for s in all_sentences if len(s) > 1]\n", "        self.all_sentences = all_sentences\n", "        return all_sentences\n", "\n", "    def get_random_context(self, C=5):\n", "        sentences = self.get_all_sentences()\n", "        sent_id = random.randint(0, len(sentences) - 1)\n", "        sent = sentences[sent_id]\n", "        word_id = random.randint(0, len(sent) - 1)\n", "\n", "        context = sent[max(0, word_id - C) : word_id]\n", "        if word_id + 1 < len(sent):\n", "            context += sent[word_id + 1 : min(len(sent), word_id + C + 1)]\n", "\n", "        center = sent[word_id]\n", "        context = [w for w in context if w != center]\n", "\n", "        if len(context) > 0:\n", "            return center, context\n", "        else:\n", "            return self.get_random_context(C)\n", "\n", "    def dataset_split(self):\n", "        if hasattr(self, \"split\") and self.split:\n", "            return self.split\n", "\n", "        split = [[] for _ in range(3)]\n", "        with open(f\"{self.path}/datasetSplit.txt\", \"r\") as f:\n", "            first = True\n", "            for line in f:\n", "                if first:\n", "                    first = False\n", "                    continue\n", "                # each row is \"sentence_index,split_label\"; keep it in its own\n", "                # variable so we don't overwrite the `split` list of lists\n", "                parts = line.strip().split(\",\")\n", "                split[int(parts[1]) - 1] += [int(parts[0]) - 1]\n", "        self.split = split\n", "        return split\n", "\n", "    def sampleTable(self):\n", "        if hasattr(self, \"sample_table\") and 
self.sample_table:\n", "            return self.sample_table\n", "\n", "        tokens_num = len(self.tokens)\n", "        sampling_freq = np.zeros((tokens_num,))\n", "\n", "        i = 0\n", "        for w in range(tokens_num):\n", "            w = self.rev_tokens[i]\n", "            if w in self.tok_freq:\n", "                freq = 1.0 * self.tok_freq[w]\n", "                freq = freq**0.75\n", "            else:\n", "                freq = 0.0\n", "            sampling_freq[i] = freq\n", "            i += 1\n", "\n", "        sampling_freq /= np.sum(sampling_freq)\n", "        sampling_freq = np.cumsum(sampling_freq) * self.table_size\n", "\n", "        self.sample_table = np.zeros((int(self.table_size),))\n", "\n", "        j = 0\n", "        for i in range(int(self.table_size)):\n", "            while i > sampling_freq[j]:\n", "                j += 1\n", "            self.sample_table[i] = j\n", "\n", "        return self.sample_table\n", "\n", "    def get_random_train_sentence(self):\n", "        split = self.dataset_split()\n", "        sent_id = random.choice(split[0])\n", "        return self.all_sentences[sent_id]\n", "\n", "    def get_split_sentences(self, split=0):\n", "        # keep the split table separate from the `split` index argument\n", "        ds_split = self.dataset_split()\n", "        sentences = [self.all_sentences[i] for i in ds_split[split]]\n", "        return sentences\n", "\n", "    def get_train_sentences(self):\n", "        return self.get_split_sentences(0)\n", "\n", "    def get_test_sentences(self):\n", "        return self.get_split_sentences(1)\n", "\n", "    def get_val_sentences(self):\n", "        return self.get_split_sentences(2)\n", "\n", "    def sampleTokenIdx(self):\n", "        return self.sample_table[random.randint(0, self.table_size - 1)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset = StanfordSentiment()  # takes about 45sec" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokens = dataset.tokens\n", "num_words = len(tokens)\n", "print(num_words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model\n", "\n", "As I mentioned in the motivation, we are trying to create a model that gives us a vector for each word that represents the meaning of that word. 
How could we do this?\n", "\n", "It may not seem like it, but the problem has the shape $y = Ax$, where $x$ is our inputs (in our case, the context words) and $y$ is our expected outputs (in our case, the center word). Like I said, we are going to have our model predict a center word from the context words and then compare that prediction to the actual center word. Based on how close we were, we can then adjust the model so that it does a better job on another training example. This leaves two major questions:\n", "\n", "1. How can we measure similarity between words mathematically?\n", "2. How can we \"adjust\" our model? What does that even mean?\n", "\n", "### Stochastic Gradient Descent\n", "Luckily, there is a single process which will answer both of these questions.\n", "\n", "> Before we discuss this topic, realize that no one just woke up one day and \"discovered\" this process. It took decades of mathematical and computational experimentation to develop. To that end, I do not expect you to *just* understand it; instead, I want you to compile the questions that you have. Pay close attention to what you are confused about and where you stop understanding.\n", "\n", "It is called Stochastic Gradient Descent, or SGD, and it will allow us both to create a quantitative similarity metric and to \"learn\" from what it tells us. As I posited above, this problem can be simplified to the following: $y = Ax$, where $x$ is our inputs (the context words) and $y$ is our expected outputs (the center word). In that case, what is $A$? $A$ will be a matrix of \"weights\" which, when multiplied by our $x$s, will produce our $y$s.\n", "\n", "We don't need to know $A$ in advance. It will start out as completely random and then we will learn its values through training. Importantly, each row in the $A$ matrix is a single vector representing a word, so $A$ will have as many rows as there are words in our vocabulary. 
So for each word in our text, we will have an initially random vector, whose size we will call `embedding_dim` or embedding dimension (`vector_dim` in the code below).\n", "\n", "Now that each word has a (random) vector associated with it, we can directly compare them. Using **the scaled dot product**, we can determine how similar two vectors are. The scaled dot product between two vectors, $\\frac{x \\cdot y}{\\lVert x \\rVert \\lVert y \\rVert}$ (also known as cosine similarity), produces a number between -1 and 1, representing how similar or dissimilar the two vectors are. This answers our first question above. Note that the code below uses the plain dot product $x \\cdot y$, which is not confined to that range.\n", "\n", "Before we attack the second question, let's first take a look at what we just described looks like in code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vector_dim = 10\n", "word_vecs = np.concatenate(\n", "    (\n", "        (np.random.rand(num_words, vector_dim) - 0.5) / vector_dim,\n", "        np.zeros(\n", "            (num_words, vector_dim)\n", "        ),  # for simplicity's sake, we will have a separate set of vectors for each context word as well as for each center word\n", "    ),\n", "    axis=0,\n", ")\n", "\n", "word_vecs.shape  # 2*num_words (one for the context vector and another for the center vector) x vector_dim\n", "# initially random vectors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# getting center word vecs and context word vecs\n", "# each word will have two word vectors: center and context, we will only care about the center word vectors\n", "center_word_vecs = word_vecs[:num_words, :]\n", "outside_word_vecs = word_vecs[num_words:, :]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "block_size = 5\n", "center_word, context = dataset.get_random_context(block_size)\n", "center_word, context  # get a center word and context" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# find index of center word\n", "center_word_idx = dataset.tokens[center_word]\n", 
"center_word_idx" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# getting the random word vec for this index/word\n", "center_word_vec = center_word_vecs[center_word_idx]\n", "center_word_vec # still random" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we are able to get the center word vector, we can get the vectors for the outside words in the same way. For each one, we are going to take the similarity (dot product) between it and and the center word vector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# example with just one outside word\n", "outside_word_idx = dataset.tokens[context[1]]\n", "outside_word_idx" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "outside_word_vec = outside_word_vecs[outside_word_idx]\n", "outside_word_vec # start as zeros" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dot_products = np.dot(\n", " outside_word_vecs, center_word_vec\n", ") # take the dot product between all outside words and the center word\n", "dot_products.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's see what this dot product produces\n", "import matplotlib.pyplot as plt\n", "\n", "plt.plot(dot_products)\n", "plt.show() # it's all zeros because all of the outside word vectors are zero" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How can we take these numbers and get a prediction for word? Remember what we want to do: compare a predicted word to the actually correct center word. But right now all we have an array of zeros. How can we turn this into a prediction?\n", "\n", "We will be using something called the **softmax** function, which is defined as: $\\sigma(x_i) = \\dfrac{e^{x_i}}{\\sum_{j=1}^{K}e^{x_j}}$. This might look really scary, but don't worry. 
All this function does is turn a set of numbers into a probability distribution. See below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def softmax(x):\n", " orig_shape = x.shape\n", "\n", " if len(x.shape) > 1:\n", " # Matrix\n", " tmp = np.max(x, axis=1)\n", " x -= tmp.reshape((x.shape[0], 1))\n", " x = np.exp(x)\n", " tmp = np.sum(x, axis=1)\n", " x /= tmp.reshape((x.shape[0], 1))\n", " else:\n", " # Vector\n", " tmp = np.max(x)\n", " x -= tmp\n", " x = np.exp(x)\n", " tmp = np.sum(x)\n", " x /= tmp\n", "\n", " assert x.shape == orig_shape\n", " return x" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "softmax_probs = softmax(dot_products)\n", "softmax_probs.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "softmax_probs[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's see what this looks like\n", "plt.hist(softmax_probs)\n", "plt.show() # probabilities are even, at zero, as you might expect" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, all of the words have the same probability: zero or almost zero (~5e-5), as softmax cannot output a value of zero. As a result, we can pick any word and compare it to our center word. To do so, we introduce a value called *loss* which represents how close we are to the true word for a given training example. There are many ways to calculate a loss, but because we are picking individual words, we are going to use **negative log likelihood**: $Loss = -y_{o,c}\\ln(p_{o,c})$.\n", "\n", "This also probably looks scary, but all it says is that we take the predicted word and the correct word and from their dot product can calculate a single number." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loss = -np.log(softmax_probs[outside_word_idx]) # nll in code\n", "loss\n", "# this number represents how good our prediction is\n", "# zero is the lowest number that we can predict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stochastic Gradient Descent continued\n", "\n", "We have completed one half of our model! What we just did was called the **forward pass**. We are now going to investigate the **backwards pass**.\n", "\n", "It is called a backward pass because we are going to go backwards through all of the steps in the forward pass to figure out what we need to do to make the prediction better. This process is also called *back propagation* or *backprop*.\n", "\n", "So we need to figure out what elements of our $A$ matrix (our word vectors) we need to change, and how to change them, so that we **minimize our loss**. This is an *optimization* problem, meaning we need to determine the *most optimal* values for each cell of $A$ such that the loss between the predicted $y$s and the actual $y$s is the lowest we can make it. Thus, when our loss has reached the lowest it will go, we will have trained our word vectors so that they actually represent the meaning of their respective words. We will be able to show this by the end.\n", "\n", "But how do we optimize? Well, it involves some calculus. We want to see how much we need to change each and every element of $A$, so we use a *derivative*, which will tell us how far away we are from reaching the lowest point in our loss function.\n", "\n", "Thankfully, the derivatives we will need to calculate are fairly simple. All we have done is some multiplication and addition, which are very easy to take the derivative of. 
Don't worry about this though, the derivatives will be provided below.\n", "\n", "Once we have the derivative, we can take a small step in that direction by multiplying it by a small number (called a step size or learning rate) and subtract it from our values of $A$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's see an example in code\n", "loss # need to differential with respect to all of the values of A" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_grad_center_vec = -outside_word_vecs[outside_word_idx] + np.dot(\n", " softmax_probs, outside_word_vecs\n", ") # derivative of dot product for the center word vec\n", "current_grad_center_vec" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_grad_outside_vecs = np.outer(\n", " softmax_probs, center_word_vec\n", ") # derivative of dot product for the outer word vecs\n", "current_grad_outside_vecs[outside_word_idx] -= center_word_vec\n", "current_grad_outside_vecs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grad_center_vecs = np.zeros(center_word_vecs.shape) # holder for our derivative values\n", "grad_outside_vecs = np.zeros(\n", " outside_word_vecs.shape\n", ") # holder for our derivative values\n", "\n", "grad_center_vecs[center_word_idx] += current_grad_center_vec\n", "grad_outside_vecs += current_grad_outside_vecs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now that we've calculated our derivatives we can take a step\n", "step = 1\n", "center_word_vecs -= step * grad_center_vecs\n", "outside_word_vecs -= step * grad_outside_vecs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and then run another forward pass\n", "dot_products = np.dot(outside_word_vecs, center_word_vec)\n", 
"dot_products  # no longer zero!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "softmax_probs = softmax(dot_products)\n", "softmax_probs  # a bit more variation!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loss = -np.log(softmax_probs[outside_word_idx])\n", "loss  # slightly lower!!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Negative Sampling\n", "\n", "We just saw how using softmax and gradient descent can reduce our loss, meaning that we can learn word meaning and represent that meaning with vectors! That's great, but it takes too long. Softmax is a very expensive operation because it has to be computed over every word in the vocabulary. We'll use it in later lessons, but here, we're going to use a similar technique with a different activation function: **sigmoid**. Additionally, instead of applying softmax to every context word vector, we are only going to sample a small subset and estimate the loss based on that sample. This process is called *negative sampling*.\n", "\n", "Sigmoid is defined as follows: $\\frac{1}{1+e^{-x}}$\n", "\n", "Like softmax, sigmoid squashes its input into the range between 0 and 1, so each output can be read as a probability. Unlike softmax, it treats each number independently, so the outputs do not have to sum to 1.\n", "\n", "Let's run through an example of negative sampling in code now."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# new example copying from above\n", "vector_dim = 10\n", "word_vecs = np.concatenate(\n", "    (\n", "        (np.random.rand(num_words, vector_dim) - 0.5) / vector_dim,\n", "        np.zeros(\n", "            (num_words, vector_dim)\n", "        ),  # for simplicity's sake, we will have a separate set of vectors for each context word as well as for each center word\n", "    ),\n", "    axis=0,\n", ")\n", "\n", "block_size = 5\n", "center_word, context = dataset.get_random_context(block_size)\n", "\n", "center_word_idx = dataset.tokens[center_word]\n", "center_word_vec = word_vecs[center_word_idx]\n", "\n", "outside_word_idxs = [dataset.tokens[w] for w in context]\n", "\n", "center_word_vecs = word_vecs[:num_words, :]\n", "outside_word_vecs = word_vecs[num_words:, :]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# first we need a data structure that we can easily sample from\n", "# we can use a table of values\n", "table_size = int(1e8)  # an int, so it can be passed to random.randint\n", "\n", "\n", "def sampleTable():\n", "    tokens_num = len(tokens)\n", "    sampling_freq = np.zeros((tokens_num,))\n", "\n", "    for i in range(tokens_num):\n", "        w = rev_tokens[i]\n", "        if w in tok_freq:\n", "            freq = 1.0 * tok_freq[w]\n", "            freq = freq**0.75\n", "        else:\n", "            freq = 0.0\n", "        sampling_freq[i] = freq\n", "\n", "    sampling_freq /= np.sum(sampling_freq)\n", "    sampling_freq = np.cumsum(sampling_freq) * table_size\n", "\n", "    sample_table = np.zeros((int(table_size),))\n", "\n", "    j = 0\n", "    for i in range(int(table_size)):\n", "        while i > sampling_freq[j]:\n", "            j += 1\n", "        sample_table[i] = j\n", "\n", "    return sample_table\n", "\n", "\n", "sample_table = sampleTable()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_table[random.randint(0, int(table_size) - 1)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {},
"outputs": [], "source": [ "negSampleWordIndices = [None] * 5\n", "for k in range(5):\n", "    newidx = sample_table[random.randint(0, int(table_size) - 1)]\n", "    print(newidx)\n", "    negSampleWordIndices[k] = newidx\n", "[int(n) for n in negSampleWordIndices]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# function version\n", "def get_negative_samples(outsideWordIdx, dataset, K):\n", "    negSampleWordIndices = [None] * K\n", "    for k in range(K):\n", "        newidx = dataset.sampleTokenIdx()\n", "        while newidx == outsideWordIdx:\n", "            newidx = dataset.sampleTokenIdx()\n", "        negSampleWordIndices[k] = newidx\n", "    return [int(n) for n in negSampleWordIndices]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "outside_word_idx = outside_word_idxs[0]\n", "\n", "neg_samples = get_negative_samples(outside_word_idx, dataset, 5)\n", "neg_samples  # neg samples" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grad_center_vec = np.zeros(center_word_vec.shape)\n", "grad_outside_vecs = np.zeros(outside_word_vecs.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "u_0 = outside_word_vecs[outside_word_idx]\n", "u_0  # vector for the true context word" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "z_0 = np.dot(u_0, center_word_vec)\n", "z_0  # dot product as before" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sigmoid(x):\n", "    return 1 / (1 + np.exp(-x))\n", "\n", "\n", "p_0 = sigmoid(z_0)\n", "p_0  # new sigmoid transformation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loss = -np.log(p_0)  # loss for just this part\n", "loss" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# derivatives for this part\n", "grad_center_vec += (p_0 - 1) * u_0\n", "grad_outside_vecs[outside_word_idx] += (p_0 - 1) * center_word_vec" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for k in neg_samples:  # loop through neg sample idxs\n", "    u_k = outside_word_vecs[k]  # find the correct context vector for a sample idx\n", "    z_k = np.dot(u_k, center_word_vec)  # take the dot product as above\n", "    p_k = sigmoid(-z_k)  # activate using sigmoid\n", "    loss -= np.log(p_k)  # calculate the loss\n", "\n", "    # derivatives for this negative sample\n", "    grad_center_vec -= (p_k - 1) * u_k\n", "    grad_outside_vecs[k] -= (p_k - 1) * center_word_vec" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loss  # new loss, now including the negative samples" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# example backwards pass\n", "step = 1\n", "# only the center word's row has a gradient in this example\n", "center_word_vecs[center_word_idx] -= step * grad_center_vec\n", "outside_word_vecs -= step * grad_outside_vecs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# checking if our grad descent worked like last time\n", "u_0 = outside_word_vecs[outside_word_idx]\n", "z_0 = np.dot(u_0, center_word_vec)\n", "p_0 = sigmoid(z_0)\n", "loss = -np.log(p_0)\n", "\n", "for k in neg_samples:\n", "    u_k = outside_word_vecs[k]\n", "    z_k = np.dot(u_k, center_word_vec)\n", "    p_k = sigmoid(-z_k)\n", "    loss -= np.log(p_k)\n", "\n", "loss  # went down slightly!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running the model\n", "\n", "We have covered *A LOT* in this lesson, so I have assembled the functions that we will need to train the model below. You have seen all of the code in them, though the presentation and order might look a little different."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def w2v_wrapper(model, w2i, word_vecs, dataset, block_size, loss_and_grad):\n", "    batch_size = 50\n", "    loss = 0.0\n", "    grad = np.zeros(word_vecs.shape)\n", "    N = word_vecs.shape[0]\n", "    center_word_vecs = word_vecs[: int(N / 2), :]\n", "    outside_word_vecs = word_vecs[int(N / 2) :, :]\n", "    for i in range(batch_size):\n", "        block_size1 = random.randint(1, block_size)\n", "        center_word, context = dataset.get_random_context(block_size1)\n", "\n", "        c, grad_in, grad_out = model(\n", "            center_word,\n", "            block_size1,\n", "            context,\n", "            w2i,\n", "            center_word_vecs,\n", "            outside_word_vecs,\n", "            dataset,\n", "            loss_and_grad,\n", "        )\n", "        loss += c / batch_size\n", "        grad[: int(N / 2), :] += grad_in / batch_size\n", "        grad[int(N / 2) :, :] += grad_out / batch_size\n", "\n", "    return loss, grad" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def sigmoid(x):\n", "    return 1 / (1 + np.exp(-x))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def softmax(x):\n", "    \"\"\"Compute the softmax function for each row of the input x.\n", "    It is crucial that this function is optimized for speed because\n", "    it will be used frequently in later code.\n", "\n", "    Arguments:\n", "    x -- A D dimensional vector or N x D dimensional numpy matrix.\n", "    Return:\n", "    x -- You are allowed to modify x in-place\n", "    \"\"\"\n", "    orig_shape = x.shape\n", "\n", "    if len(x.shape) > 1:\n", "        # Matrix\n", "        tmp = np.max(x, axis=1)\n", "        x -= tmp.reshape((x.shape[0], 1))\n", "        x = np.exp(x)\n", "        tmp = np.sum(x, axis=1)\n", "        x /= tmp.reshape((x.shape[0], 1))\n", "    else:\n", "        # Vector\n", "        tmp = np.max(x)\n", "        x -= tmp\n", "        x = np.exp(x)\n", "        tmp = np.sum(x)\n", "        x /= tmp\n", "\n", "    assert x.shape == orig_shape\n", "    return x" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# normal softmax\n", "def softmaxloss_gradient(center_word_vec, outside_word_idx, outside_word_vecs, dataset):\n", "    dot_products = np.dot(outside_word_vecs, center_word_vec)\n", "    softmax_probs = softmax(dot_products)\n", "    loss = -np.log(softmax_probs[outside_word_idx])\n", "\n", "    grad_center_vec = -outside_word_vecs[outside_word_idx] + np.dot(\n", "        softmax_probs, outside_word_vecs\n", "    )\n", "    grad_outside_vecs = np.outer(softmax_probs, center_word_vec)\n", "    grad_outside_vecs[outside_word_idx] -= center_word_vec\n", "\n", "    return loss, grad_center_vec, grad_outside_vecs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_negative_samples(outsideWordIdx, dataset, K):\n", "    \"\"\"Samples K indexes which are not the outsideWordIdx\"\"\"\n", "\n", "    negSampleWordIndices = [None] * K\n", "    for k in range(K):\n", "        newidx = dataset.sampleTokenIdx()\n", "        while newidx == outsideWordIdx:\n", "            newidx = dataset.sampleTokenIdx()\n", "        negSampleWordIndices[k] = newidx\n", "    return [int(n) for n in negSampleWordIndices]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# negative sampling\n", "def negative_samplingloss_gradient(\n", "    center_word_vec, outside_word_idx, outside_word_vecs, dataset, K=10\n", "):\n", "    neg_samples = get_negative_samples(outside_word_idx, dataset, K)\n", "\n", "    grad_center_vec = np.zeros(center_word_vec.shape)\n", "    grad_outside_vecs = np.zeros(outside_word_vecs.shape)\n", "\n", "    u_0 = outside_word_vecs[outside_word_idx]\n", "    z_0 = np.dot(u_0, center_word_vec)\n", "    p_0 = sigmoid(z_0)\n", "    loss = -np.log(p_0)\n", "\n", "    grad_center_vec += (p_0 - 1) * u_0\n", "    grad_outside_vecs[outside_word_idx] += (p_0 - 1) * center_word_vec\n", "\n", "    for k in neg_samples:\n", "        u_k = outside_word_vecs[k]\n", "        z_k = np.dot(u_k, center_word_vec)\n", "        p_k = sigmoid(-z_k)\n", "        loss -= np.log(p_k)\n", "\n", "        grad_center_vec -= (p_k - 1) * u_k\n", "        grad_outside_vecs[k] -= (p_k - 1) * center_word_vec\n", "\n", "    return loss, grad_center_vec, grad_outside_vecs" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def skipgram(\n", "    current_center_word,\n", "    block_size1,\n", "    outside_words,\n", "    w2i,\n", "    center_word_vecs,\n", "    outside_word_vecs,\n", "    dataset,\n", "    loss_and_grad,\n", "):\n", "    loss = 0.0\n", "    grad_center_vecs = np.zeros(center_word_vecs.shape)\n", "    grad_outside_vecs = np.zeros(outside_word_vecs.shape)\n", "\n", "    center_word_idx = w2i[current_center_word]\n", "    center_word_vec = center_word_vecs[center_word_idx]\n", "\n", "    for outside_word in outside_words:\n", "        outside_word_idx = w2i[outside_word]\n", "        current_loss, current_grad_center_vec, current_grad_outside_vecs = (\n", "            loss_and_grad(center_word_vec, outside_word_idx, outside_word_vecs, dataset)\n", "        )\n", "        loss += current_loss\n", "        grad_center_vecs[center_word_idx] += current_grad_center_vec\n", "        grad_outside_vecs += current_grad_outside_vecs\n", "\n", "    return loss, grad_center_vecs, grad_outside_vecs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "import glob\n", "import random\n", "import numpy as np\n", "import os.path as op\n", "\n", "SAVE_PARAMS_EVERY = 2000\n", "\n", "\n", "def load_saved_params():\n", "    \"\"\"\n", "    A helper function that loads previously saved parameters and resets\n", "    iteration start.\n", "    \"\"\"\n", "    st = 0\n", "    for f in glob.glob(\"saved_params_*.npy\"):\n", "        iter = int(op.splitext(op.basename(f))[0].split(\"_\")[2])\n", "        if iter > st:\n", "            st = iter\n", "\n", "    if st > 0:\n", "        params_file = \"saved_params_%d.npy\" % st\n", "        state_file = \"saved_state_%d.pickle\" % st\n", "        params = np.load(params_file)\n", "        with open(state_file, \"rb\") as f:\n", "            state = pickle.load(f)\n", "        return st, params, state\n", "    else:\n", "        return st, None, None\n", "\n", "\n",
"def save_params(iter, params):\n", "    params_file = \"saved_params_%d.npy\" % iter\n", "    np.save(params_file, params)\n", "    with open(\"saved_state_%d.pickle\" % iter, \"wb\") as f:\n", "        pickle.dump(random.getstate(), f)\n", "\n", "\n", "losses = []\n", "\n", "\n", "def sgd(f, x0, step, iterations, use_saved=False, PRINT_EVERY=10):\n", "    ANNEAL_EVERY = 5000\n", "    if use_saved:\n", "        start_iter, oldx, state = load_saved_params()\n", "        if start_iter > 0:\n", "            x0 = oldx\n", "            # resume with the step size the annealing schedule had reached\n", "            step *= 0.5 ** (start_iter // ANNEAL_EVERY)\n", "        if state:\n", "            random.setstate(state)\n", "    else:\n", "        start_iter = 0\n", "\n", "    x = x0\n", "    exploss = None\n", "\n", "    for iter in range(start_iter + 1, iterations + 1):\n", "        loss, gradient = f(x)\n", "        x -= step * gradient\n", "\n", "        # exponentially smoothed loss, updated once per iteration\n", "        if exploss is None:\n", "            exploss = loss\n", "        else:\n", "            exploss = 0.95 * exploss + 0.05 * loss\n", "\n", "        if iter % PRINT_EVERY == 0:\n", "            print(\"iter %d: %f\" % (iter, exploss))\n", "            losses.append(exploss)\n", "\n", "        if iter % SAVE_PARAMS_EVERY == 0 and use_saved:\n", "            save_params(iter, x)\n", "\n", "        if iter % ANNEAL_EVERY == 0:\n", "            step *= 0.5\n", "\n", "    return x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training loop\n", "\n", "Now that we have all of these functions and utilities, we can finally put everything together and train a word2vec model. Training with these parameters will take about 20 minutes, so plan accordingly."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "import matplotlib.pyplot as plt\n", "\n", "random.seed(314)\n", "dataset = StanfordSentiment()\n", "tokens = dataset.tokens\n", "num_words = len(tokens)\n", "\n", "vector_dim = 10\n", "C = 5\n", "\n", "random.seed(31415)\n", "np.random.seed(9265)\n", "\n", "start_time = time.time()\n", "word_vecs = np.concatenate(\n", "    (\n", "        (np.random.rand(num_words, vector_dim) - 0.5) / vector_dim,\n", "        np.zeros((num_words, vector_dim)),\n", "    ),\n", "    axis=0,\n", ")\n", "word_vecs = sgd(\n", "    lambda vec: w2v_wrapper(\n", "        skipgram, tokens, vec, dataset, C, negative_samplingloss_gradient\n", "    ),\n", "    word_vecs,\n", "    step=1,  # was .01\n", "    iterations=10000,\n", "    use_saved=True,\n", "    PRINT_EVERY=10,\n", ")\n", "\n", "print(\"training took %d seconds\" % (time.time() - start_time))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation\n", "\n", "Now that our model is trained, let's make sure that our word vectors make sense and reflect underlying word meaning. These types of evaluations are difficult because word meaning is inherently qualitative, not quantitative. But we can still make some interpretations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the smoothed loss is recorded once every PRINT_EVERY (10) iterations\n", "plt.plot(losses, label=\"Loss\")\n", "plt.xlabel(\"Iterations (x10)\")\n", "plt.ylabel(\"Loss\")\n", "plt.title(\"Loss vs. Iterations\")\n", "plt.legend()\n", "plt.show()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trained_word_vectors = np.concatenate(\n", "    (word_vecs[:num_words, :], word_vecs[num_words:, :]), axis=0\n", ")  # center word vecs first, then outside word vecs; the lookups below use the center rows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# visualize_words = [\n", "#     \"great\", \"cool\", \"brilliant\", \"wonderful\", \"well\", \"amazing\",\n", "#     \"worth\", \"sweet\", \"enjoyable\", \"boring\", \"bad\", \"dumb\",\n", "#     \"annoying\", \"female\", \"male\", \"queen\", \"king\", \"man\", \"woman\", \"rain\", \"snow\",\n", "#     \"hail\", \"coffee\", \"tea\"]\n", "\n", "visualize_words = [\"Paris\", \"London\", \"England\", \"France\"]  # analogies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "visualize_idx = [tokens[word.lower()] for word in visualize_words]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# dimension reduction for visualization\n", "visualize_vecs = trained_word_vectors[visualize_idx, :]\n", "temp = visualize_vecs - np.mean(visualize_vecs, axis=0)\n", "covariance = 1.0 / len(visualize_idx) * temp.T.dot(temp)\n", "U, S, V = np.linalg.svd(covariance)\n", "coord = temp.dot(U[:, 0:2])  # project onto the top two principal directions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "for i in range(len(visualize_words)):\n", "    plt.text(\n", "        coord[i, 0],\n", "        coord[i, 1],\n", "        visualize_words[i],\n", "        bbox=dict(facecolor=\"green\", alpha=0.1),\n", "    )\n", "\n", "plt.xlim((np.min(coord[:, 0]), np.max(coord[:, 0])))\n", "plt.ylim((np.min(coord[:, 1]), np.max(coord[:, 1])))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concluding remarks\n", "\n",
"Implementing anything from scratch, as we did in this notebook, is **not** easy. External libraries can be incredibly helpful, but challenging ourselves not to use them gives us a lot of insight into their inner workings." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" }, "mystnb": { "execution_mode": "off" } }, "nbformat": 4, "nbformat_minor": 0 }