!wget https://tufts.box.com/shared/static/325sgkodnq30ez61ugazvctif6r24hsu.csv -O daf.csv -q

Textual Feature Extraction using Traditional Machine Learning#

In this workshop, we are going to learn how to conduct feature extraction on text using sci-kit learn.

First, what is feature extraction? Feature extraction or vectorization is any process by which we can convert raw data into a format that can be understood by a computer. Text is not consistent in format, meaning that, unlike an image, there is no single format that all documents come in. Usually, each sentence is a different length and could be made up of specialized terminology. As a result, feature extraction allows us to standardize textual inputs so that they can be used for a variety of tasks. For example, in the Traditional Topic Modeling in SKLearn workshop, I use these features to extract topics from a text.

In this notebook, we’ll look at two different algorithms that are commonly used for textual feature extraction:

Count Vectorization
Term frequency - inverse document frequency (TF-IDF) vectorization

But before we can look at these, we’ll look at the simplest type of feature extraction: the so-called ‘bag of words’ approach.

Imports and data#

For this example, we’ll be using Edward Gibbon’s Decline and Fall of the Roman Empire. This text is really long so it will be a good example of a lot of the problems that pop up in NLP.

import pandas as pd

daf = pd.read_csv("daf.csv")[["title", "text"]]
daf.head()

text = " ".join(daf.text).lower()
text[:1001]

Bag of Words#

The Bag of Words approach consists simply in counting individual word occurrence. But walking through this process will teach us about NLP fundamentals that carry over into the more advanced methods.

Tokenization#

Before we can count words, we need to decide what constitutes a word and then split our text up by word. This process is called tokenization and is very important at all levels of NLP. There are many different types of tokenizers, but for this example we’ll use the simplest one from the Natural Language Toolkit (nltk).

import nltk

nltk.download("punkt")

from nltk.tokenize import word_tokenize

tokens = word_tokenize(daf.iloc[1].text)
tokens[:10]

words = [word for word in word_tokenize(text) if word.isalpha()]
# word.isalpha returns True is a word is made up of only alphabet characters and not numbers or spaces
words[:10]

Counting words#

Now that we’ve split the text into words, we can count them. I’ll show three ways to do it, but they will all return the same result. I only do so to show you how easy it is to do this task in Python.

Each one follows the same algorithm:

Create an empty data structure to hold the frequencies
Loop through all words
If we see a word that is in in our data structure already, increment the associated frequency by 1
If we see a word that is NOT in our data structure already, add the word and set its value to 1

# longest version
counts = {}  # data structure
for word in words:  # looping through all words
    if word in counts:  # if the word is already in our data structure
        counts[word] += 1  # increment by 1
    else:
        counts[word] = 1  # set as 1

counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)  # dictionary sorting
counts[:10]

# shorter version
import collections

counts = collections.defaultdict(int)  # data structure
for word in words:  # looping through all words
    counts[word] += 1  # increment by 1

counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)
counts[:10]

# shortest version
counts = collections.Counter(words)
counts.most_common(10)

Removing stop words#

As you see above, the most common words in this text are not very expressive of the content of the text. Rather they are just the most common words in the English language. These words are sometimes called ‘stop words’ and, when using word frequency, it is common to remove them.

For this example, we’ll use nltk’s stop words list. See below for the whole list.

from nltk.corpus import stopwords

nltk.download("stopwords")
print(stopwords.words("english"))

# an example
[
    word
    for word in word_tokenize(daf.iloc[0].text.lower())
    if word.isalpha() and word not in stopwords.words("english")
][:10]

Counting words without stop words#

import collections

counts = collections.defaultdict(int)  # data structure

stop_words = stopwords.words("english")  # list of stop words
for word in words:  # looping through all words
    if word not in stop_words:  # if the word is not in stop words list
        counts[word] += 1  # increment by 1

counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)
counts[:10]

Count Vectorization#

The bag of words model of feature extraction is straightforward and easy to understand but it is limited. We would prefer a single class that could do all of the above steps in a efficient and fast way. Thankfully, sci-kit learn provides this in the CounterVectorizer.

CountVectorizer will take each document in our corpus (each chapter, before we were just joining them all together) and convert it into an array of ones and zeros. See the example below.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # initialize the class

# taken from: https://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

X = vectorizer.fit_transform(corpus)
X.shape  # number of documents by number of unique words

pd.DataFrame(
    X.toarray(), columns=vectorizer.get_feature_names_out()
)  # document-term matrix or dtm

As we see above, each unique word in the corpus is made into a column. Within each cell, then, the count of how many times that word occurs in that document is recorded. This makes it so that each sentence now has a consistent shape and it’s one that’s a bit more verbose than just a single number, while at the same time representing the same information as the bag of words model. See below how we can apply this to the large text.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(daf.text)
X.shape  # number of documents by number of unique words

# most of our entries are 0
# we call this a 'sparse' matrix
cv = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
cv

cv["empire"].to_numpy()

# we can recreate the same results as above but now on the basis of each document
# results vary slightly because of how sklearn tokenizes their text
sorted(cv.sum(axis=0).to_dict().items(), key=lambda x: x[1], reverse=True)[:10]

Term frequency - inverse document frequency (TF-IDF)#

A final alteration we can make to this counter is normalizing the counts by the number of times that word occurs. Normalization is any mathematical transformation we make to a number that will make it more normal or regular. In this case, someone could argue: “‘might’ occurs so many times not because it’s important but rather because it is a word that has several meanings and is therefore used more often than words with only one meaning.”

That’s where TF-IDF comes in. This normalization technique is the product of two statistics: term frequency and inverse document frequency.

The first is a count of how many times a term occurs in a document (which is what we have in the CounterVectorizer above) divided by the total number of words in the document. Given a term \(t\) and a document \(d\), term frequency \(\mathrm{tf}(t,d)\) is defined as: \(\mathrm {tf} (t,d)={\frac {f_{t,d}}{\sum _{t'\in d}{f_{t',d}}}}\).
The second responds directly to the objection above. It is a measure of the importance of the term in the context of the rest of the documents, determining whether this term is rare or common. It is obtained by dividing the total number of documents \(N\) by the number of documents containing the term \(|\{d \in D: t \in d\}|\), where \(d\) is a document in our documents \(D\) and \(t\) is any term in document \(d\). This inverse document frequency is defined as \(\mathrm{idf}(t, D) = \frac{N}{|\{d \in D: t \in d\}|}\).

Thus TF-IDF takes the form: \(\mathrm{tfidf} = \mathrm {tf} \cdot \mathrm {idf}\). In practice, log normalization (taking the natural log) is applied to both parts of TF-IDF. Below we’ll see how to use sci-kit learn to implement TF-IDF.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(daf.text)  # this is called the document-term matrix
X.shape  # number of documents by number of unique words

# now all of the zeros are decimals
tfidf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
tfidf

# a single column
# this single column represents the tf-idf score for the word 'Rome' in each chapter
tfidf["rome"]

What can you do with TF-IDF?#

There are many applications for TF-IDF (be sure to check out the Traditional Topic Modeling in SKLearn workshop). Here we’ll do some document comparison. Because we now have a 13035 long vector representing each chapter, we can compare them to each other to find chapters that have similar word usage.

If we have two vectors, we can compare them using the dot product. For the two vectors \(A = \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix}\) and \(B = \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix}\), \(A \cdot B = (a_1)(b_1) + (a_2)(b_2) + (a_3)(b_3)\). If we then scale this value by the product of the magnitudes of \(A\) and \(B\), we are guaranteed to get back a number that is between -1 and 1. This is often called cosine similarity and is used frequently to compare vectors in NLP. A score of exactly 1 means that the two vectors are the same, whereas a score of -1 means that the vectors have the same magnitude but are pointing in opposite directions. See the implementation below.

import numpy as np

ch1 = X.toarray()[0]  # chapter 1
ch2 = X.toarray()[1]  # chapter 2

# cosine similarity
np.dot(ch1, ch2) / (
    np.linalg.norm(ch1) * np.linalg.norm(ch2)
)  # np.linalg.norm returns the magnitude of a vector

def docsim(id1, id2):
    ch1 = X.toarray()[id1]
    ch2 = X.toarray()[id2]
    return np.dot(ch1, ch2) / (np.linalg.norm(ch1) * np.linalg.norm(ch2))

for i in range(len(daf[:50])):
    print(f"Chapter 1 compared with Chapter {i+1}: {docsim(0, i)}")

import matplotlib.pyplot as plt

# this will take some time (~1 min)
ch1_sims = [
    docsim(0, i) for i in range(1, len(daf[1:]) + 1)
]  # remove the first chapter

# plotting
# similarity goes down over the whole book, as we might expect
plt.plot(ch1_sims)
plt.title("Chapter 1 similarity")
plt.xlabel("Chapter")
plt.ylabel("Similarity")
plt.show()

from sklearn.linear_model import LinearRegression

x = np.array(range(1, len(daf[1:]) + 1))
y = np.array(ch1_sims)
X = x[:, np.newaxis]  # add bias term
reg = LinearRegression().fit(X, ch1_sims)
r = reg.score(X, y)
m = reg.coef_[0]
b = reg.intercept_

# plot with regression line
plt.scatter(x, ch1_sims)
plt.plot(
    m * X + b,
    label=f"$y={round(m,4)}x + {round(b,4)}$ with $r = {round(r,4)}$",
    color="red",
)
plt.title("Chapter 1 similarity")
plt.xlabel("Chapter")
plt.ylabel("Similarity")
plt.legend(fontsize=9)
plt.show()

The above plot shows us that as we read through the text chronologically, we see content that becomes more and more different from the first chapter. Specifically, the word usage throughout the rest of the book different significantly from the word usage of the first chapter. This plot also helps us identitfy places where this trend may not be completely true. For instance, we see a cluster in the 150s and 200s of chapters that have a higher similarity score than the chapters around them. We could select this points and investigate further.

All of this would not have been possible without vectorizing our text. As a result, feature extraction is usually the first step in any NLP model, like we’ve made above. There are more advanced ways of doing it, but TF-IDF has stood the test of time and is still used today.