!wget https://www.gutenberg.org/files/25717/25717-0.txt
Regular Expressions#
In this workshop, we’ll go over how to use regular expressions (AKA regex) for simple text processing. We’ll practice these skills in Python but what you learn here can be applied to any regular expression system in any programming language.
What are regular expressions?#
A regular expression is a string of characters that represents a pattern. The goal is to extract any substrings that match a given pattern from a larger text.
You can think of regular expressions as a mini programming language. Just like in Python, you can type anything and call it a regular expression, but it won’t necessarily be valid and run. Regular expressions have a unique syntax and structure that must be practiced. That said, they are simpler than most programming languages, so the goal of this notebook is to familiarize yourself with regular expressions so you can go on and learn more about them.
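As a quick taste of what this mini language looks like, here is a one-line sketch using Python’s built-in re module (which we’ll introduce properly below) on an invented string:

```python
import re

# the pattern r"\d+" means "one or more digits in a row"
matches = re.findall(r"\d+", "Chapter 12 begins on page 345")
print(matches)  # → ['12', '345']
```

The pattern itself is just a string; the re library interprets it and does the searching.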
The data#
For this example, we’ll be looking at Edward Gibbon’s Decline and Fall of the Roman Empire. This is a useful example because we’ll get a lot of results for general patterns and very few results for very specific patterns. There are also a lot of weird characters, so it simulates a difficult-to-work-with dataset that you might encounter.
First we’ll see the basics of regular expressions, and then we’ll apply what we’ve learned to the task of curating this raw text into a usable dataset.
# read in data
with open("25717-0.txt") as f:
    data = f.read()
data[:1000]
The simplest patterns#
Let’s start out with some very simple patterns to get used to how Python implements regular expressions.
import re  # regular expression library from the Python standard library
re
re Methods#
# search
re.search("the", data)
# search
re.search("the", data).group(0), re.search("the", data).start(), re.search(
"the", data
).end()
# findall, similar to ctrl-f
re.findall("the", data)
# findall
len(re.findall("the", data))
# finditer
re.finditer("the", data)
# be careful, can split on 'the' in a word, like 'other'
re.split("the", data)
# sub
re.sub("the", "THE", data)
More complex patterns#
We’ve seen how we can use regular expressions like the ctrl-f function to find a specific string in a larger string, but we can use the regular expression meta language to get more complex results.
For instance, we can create a complex pattern to extract all of the elements of the table of contents.
# Let's try to get all of the table of contents
re.findall("\n\nChapter ", data) # \n means new line
chapter_list = re.findall("\nChapter .*\n", data)
chapter_list
Let’s break that down:
\n - new line character
Chapter  - arbitrary string to find (including the space)
. - match any character
* - quantifier that looks for the preceding pattern zero or more times, in this case modifying .
.* is a common idiom in regex that means match any character until the next part of the pattern, in this case a \n
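To see those pieces working together, here is the same pattern run on a small invented snippet shaped like the Gutenberg text:

```python
import re

# an invented snippet shaped like the Gutenberg text
snippet = "...\nChapter I: The Extent Of The Empire\nIn the second century..."
print(re.findall("\nChapter .*\n", snippet))
# → ['\nChapter I: The Extent Of The Empire\n']
```

The .* runs from the space after "Chapter" to the end of the line, and the final \n stops the match before the body text begins.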
Let’s try a more complex example: extracting the footnotes. Footnotes can be valuable resources of information, but they can also confuse downstream text analysis.
This is what one from Gibbon looks like:
60 (return) [ Vegetius finishes his second book, and the description of the legion, with the following emphatic words:—“Universa quæ in quoque belli genere necessaria esse creduntur, secum legio debet ubique portare, ut in quovis loco fixerit castra, armatam faciat civitatem.”]
As you see, they start with a number, followed by the string (return) (this might look weird, but it comes from the optical character recognition that produced the text), and then the footnote in brackets.
# first part
re.findall(r"\d+\s\(return\)", data)
Breaking it down:
\d - matches any digit
+ - quantifier that matches the preceding character 1 or more times (very similar to *)
\s - any whitespace (meaning spaces, tabs, or new lines)
\( and \) - as we’ll see in more detail later, ( and ) are special characters in regex, so we have to “escape” them using the \ character
return - a specific string we’re looking for
# second part
re.findall(r"\d+\s\(return\)\s\[.*\]", data)
We’re getting results, but our example isn’t there. That’s because our pattern doesn’t account for text that is on multiple lines. Let’s break down what we have so far, though:
\d - matches any digit
+ - quantifier that matches the preceding character 1 or more times (very similar to *)
\s - matches any whitespace (meaning spaces, tabs, or new lines)
\( and \) - as we’ll see in more detail later, ( and ) are special characters in regex, so we have to “escape” them using the \ character
return - a specific string we’re looking for
\s - matches any whitespace
\[ and \] - similar to the parentheses, [ and ] are special characters, so we have to “escape” them using the \ character
Unfortunately, .* does not match the \n character, so we’ll need to be creative.
# the solution
footnotes = re.findall(r"\d+\s\(return\)\s\[.*?\]", data, re.DOTALL)
footnotes
Above we only added two features:
In the parameters of re.findall, re.DOTALL - this flag allows . (and therefore .*) to match \n
In the pattern, ? - the question mark denotes a non-greedy match, meaning the pattern .*? will match as few characters as possible
We need both of these additions because while re.DOTALL allows us to match \n, it will match all of the text between the first \[ and the last \]. This leads to one big string that has the entire text from the beginning of the first footnote to the end of the last footnote. Obviously, this is not the behavior we want, so we have to use the ? which will stop the matching after the first \] it sees.
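The greedy versus non-greedy distinction is easier to see on a toy string with two bracketed notes:

```python
import re

text = "[first note] some text [second note]"
# greedy: .* runs to the LAST closing bracket
print(re.findall(r"\[.*\]", text))
# → ['[first note] some text [second note]']
# non-greedy: .*? stops at the FIRST closing bracket it can
print(re.findall(r"\[.*?\]", text))
# → ['[first note]', '[second note]']
```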
Now we can finally clean up these footnotes to make them more readable, also using regex.
We’ll extract the text in the brackets
Then we’ll remove any \n and tabs (\t)
clean_fn = []
for footnote in footnotes:
    number = re.search(r"\d+", footnote).group()
    text = re.search(r"\[.*?\]", footnote, re.DOTALL).group(0)
    text = re.sub(r"\n|\s{6}|\[|\]", "", text).strip()
    print(number, text)
    clean_fn.append(text)
I’ve added some new symbols in the second pattern so let’s take a closer look:
| - logical OR in regular expressions. In this case, we want to get rid of a bunch of types of characters, and this operator allows us to search for many different patterns at the same time
{6} - in the footnotes, there were strings of six spaces following a \n. We can’t just remove all spaces because most of the spaces are needed to separate the words. Instead, we can ask to remove only six consecutive spaces. The {} are a type of quantifier like * and +, but we can specify the number of elements to expect.
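Both new symbols can be demonstrated on toy strings:

```python
import re

# | = OR: match either alternative
print(re.findall("cat|dog", "cat dog bird"))  # → ['cat', 'dog']
# {3} = exactly three repetitions of the preceding element
print(re.findall("a{3}", "a aa aaa aaaa"))    # → ['aaa', 'aaa']
```

Note that the second pattern finds 'aaa' inside 'aaaa' too: the quantifier counts repetitions, it doesn’t require word boundaries.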
Cleaning the text with regular expressions#
Now let’s apply what we’ve learned to a specific use: cleaning this text so that it can be analyzed. We want to:
Remove any text added at the beginning or end by Project Gutenberg (licenses, disclaimers, etc…)
Remove the table of contents or other reference material
Split the text into chapters
Store the accompanying footnotes in a separate data structure
We can do all of this with what we’ve already learned.
# title page
daf_start = re.search(r"HISTORY OF THE DECLINE.*?\(Revised\)", data, re.DOTALL).end()
daf_end = re.search(r"\n{5}\*{3} END", data).start()
daf = data[daf_start:daf_end]
chapter_list = re.findall("\nChapter .*\n", data)
chapter_list
re.search(chapter_list[0].strip().split(":")[0], daf)
chapter_pattern = "|".join(set([f"{c.strip().split(':')[0]}:" for c in chapter_list]))
chapter_pattern
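One caveat on building a pattern by joining raw strings like this: our chapter titles happen to be safe, but if a title ever contained a regex metacharacter such as ( or ., the joined pattern would misbehave. re.escape guards against that; a sketch with an invented title:

```python
import re

# hypothetical title containing parentheses, which are regex metacharacters
titles = ["Chapter I: Title (Part 1)"]
# re.escape backslash-escapes metacharacters so each title is treated
# as a literal string inside the joined pattern
pattern = "|".join(re.escape(t) for t in titles)
print(pattern)
```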
chapter_split = re.split(chapter_pattern, daf)
len(chapter_split)
re.search("\n\n ", chapter_split[1])
chapter_dict = {}
chapter_split = chapter_split[1:] # removing introduction
for chapter in chapter_split:
    # getting chapter text
    title_match = re.search("\n\n", chapter)
    title_end = title_match.start()
    text_start = title_match.end()
    chapter_title = chapter[:title_end].replace("\n", " ").strip()
    chapter_text = chapter[text_start:].replace("\n", " ").strip()
    # separating and removing footnotes
    footnotes = re.findall(r"\d+.*?\s\(return\)\s\[.*?\]", chapter_text, re.DOTALL)
    chapter_text = re.sub(r"\d+.*?\s\(return\)\s\[.*?\]", "", chapter_text)
    chapter_text = re.sub(r"\d+\s{4,6}", "", chapter_text)
    clean_fn = []
    for footnote in footnotes:
        number = re.search(r"\d+", footnote).group()
        text = re.search(r"\[.*?\]", footnote, re.DOTALL).group(0)
        text = re.sub(r"\n|\s{4,7}|\[|\]", "", text).strip()
        clean_fn.append((number, text))
    # load into dictionary
    chapter_dict[chapter_title] = (chapter_text, clean_fn)
list(chapter_dict.items())[0]
# a bit of pandas
import pandas as pd
df = (
    pd.DataFrame.from_dict(chapter_dict, orient="index")
    .reset_index()
    .rename(columns={"index": "title", 0: "text", 1: "footnotes"})
)
df
df.to_csv("gibbon_chapters.csv", index=False)
Some example analysis#
Below I’ll show you the types of analysis we can achieve once we have cleaned our data. In this example, I use gensim to generate a word2vec model from the raw text we cleaned. To learn more, check out our forthcoming “Working with gensim” workshop.
gensim allows us to convert the linguistic features of words into numerical objects called vectors. We can then manipulate these word vectors to analyze the text. It uses the context of the words around a given word to calculate these mathematical representations, thus data cleaning is very important. For instance, footnotes would skew and corrupt gensim, so we had to extract them and save them for later use.
!pip install gensim -Uq
import gensim
import nltk
nltk.download("punkt")
from nltk import sent_tokenize
sentences = list(df.text.apply(sent_tokenize).explode())
sentences[:10]
from gensim.utils import tokenize
sentences = [[t for t in tokenize(sentence)] for sentence in sentences]
model = gensim.models.Word2Vec(
    sentences, vector_size=100, window=5, min_count=1, workers=4
)
# this function gives us the words which are most similar to a given word
model.wv.most_similar("Rome")