Using pretrained models for Named Entity Recognition (NER)#

In this notebook, we are going to explore an important subfield of natural language processing, named entity recognition or NER.

By the end of today’s class you’ll be able to:

  • Use a pretrained spaCy model to find named entities, especially for a non-English language

  • Explain why finding named entities is challenging without the use of a pretrained token classification model

  • Employ list comprehensions and advanced dictionaries in Python to parse model output

  • Install spaCy and download associated models in a Colab notebook

What is NER and why does it matter?#

Named entity recognition is any method that uses computation to extract the names of people, places, or things from unstructured text. It is a hard classification task, meaning that every token in a document either belongs to exactly one type of named entity or to none. For example, in the following sentences:

My name is Peter Nadel. I work at Tufts University.

the span ‘Peter Nadel’ could be tagged with a PERSON tag, whereas ‘Tufts University’ could be tagged with a PLACE tag. Importantly, in NER, no token can receive more than one tag.
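To make this concrete, here is a minimal, illustrative sketch of token-level tagging. The tokens and tag names (PERSON, PLACE, and "O" for tokens outside any entity) are made up for this example, not produced by a real model:

```python
# one tag per token; "O" marks tokens that are not part of any entity
sentence = ["My", "name", "is", "Peter", "Nadel", ".",
            "I", "work", "at", "Tufts", "University", "."]
tags     = ["O",  "O",    "O",  "PERSON", "PERSON", "O",
            "O",  "O",    "O",  "PLACE",  "PLACE",  "O"]

# print only the tokens that are part of some entity
for token, tag in zip(sentence, tags):
    if tag != "O":
        print(token, tag)
```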

As a result, NER can be used in a wide variety of fields and applications.

How do you do NER?#

Just like many other NLP tasks, there are two main ways of conducting NER:

  1. Rules-based: This approach involves developing a list of rules that identify a named entity deterministically. For example, if we wanted to identify someone’s name, we could write a rule like: find two capitalized words next to each other. This has the advantage that we will always find the entities we have rules for, but has the disadvantage that we have to write a huge number of rules for the approach to be effective.

  2. Machine learning: This approach involves collecting and manually annotating many examples of what named entities look like in context. We can then teach a computer what a named entity looks like, allowing it to identify named entities in new texts. This has the advantage that we don’t need to know exactly what a named entity looks like for the approach to work, but it requires considerable manual annotation to get started.
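As a sketch of the rules-based approach, the "two capitalized words next to each other" rule from point 1 could be written as a single regular expression. This is illustrative only: a real rules-based system would need many more patterns, and this one will also match non-names, such as capitalized word pairs at the start of a sentence.

```python
import re

# toy rule: two adjacent capitalized words
name_rule = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

text = "My name is Peter Nadel. I work at Tufts University."
print(name_rule.findall(text))  # ['Peter Nadel', 'Tufts University']
```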

In this notebook, we will use a machine learning model to conduct NER. This will be a pretrained model, meaning that someone else has already spent the time and energy to train it, so we don’t need to. (However, later in the course we will train an NER model from scratch.)

Preparing for NER#

We’ll be using a package called spaCy to conduct our NER. spaCy has a variety of pretrained models that we can take advantage of. The number of languages that spaCy supports is somewhat small, but throughout this class we’ll see how we can supplement it with other languages. For this example, we’ll use LatinCy, a spaCy model for the Latin language. The model we’ll be using was trained by Patrick Burns, a researcher at NYU’s Institute for the Study of the Ancient World.

Neither spaCy nor LatinCy comes with this Colab notebook by default, so we’ll need to install them. We will be using pip, a command line tool for installing Python packages, to do so.

# installations: recall that we use the '!' to indicate that this is a shell command
# this cell will take about 5 min to run
!pip install spacy transformers
!python -m spacy download en_core_web_lg
!pip uninstall -y spacy_lookups_data  # -y skips the confirmation prompt, which would stall the notebook
!pip install "la-core-web-lg @ https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl"

Using spaCy for Named Entity Recognition#

English examples#

Before we turn to LatinCy, let’s take a look at what this task looks like for some simple English texts. Then we can apply the same rationale to using the Latin model with complex Latin texts.

import spacy

english_nlp = spacy.load(
    "en_core_web_lg"
)  # the nlp object takes in the model name and gives us back a tool we can work with
english_nlp
# example from above
text = """
My name is Peter Nadel. I work at Tufts University.
""".strip()
doc = english_nlp(text)  # call english_nlp with text to get a doc object
type(doc)
# investigate the Doc object
from spacy.tokens.doc import Doc

Doc  # can find same info here: https://spacy.io/api/doc
# break for two to three minutes to think of some questions
# get entities
entities = doc.ents
for i, entity in enumerate(entities):
    print(f"Entity {i+1}: ", entity.text, "| Entity Type: ", entity.label_)
# now let's try a more complex example: a passage from Edward Gibbon's 'Decline and Fall of the Roman Empire'
# Go back and replace this text when you're ready
text = """
It was after the Nycene synod, and under the reign of the pious Irene, that the popes consummated the separation of Rome and Italy, by the translation of the empire to the less orthodox Charlemagne. They were compelled to choose between the rival nations: religion was not the sole motive of their choice; and while they dissembled the failings of their friends, they beheld, with reluctance and suspicion, the Catholic virtues of their foes. The difference of language and manners had perpetuated the enmity of the two capitals; and they were alienated from each other by the hostile opposition of seventy years. In that schism the Romans had tasted of freedom, and the popes of sovereignty: their submission would have exposed them to the revenge of a jealous tyrant; and the revolution of Italy had betrayed the impotence, as well as the tyranny, of the Byzantine court. The Greek emperors had restored the images, but they had not restored the Calabrian estates 85 and the Illyrian diocese, 86 which the Iconociasts had torn away from the successors of St. Peter; and Pope Adrian threatens them with a sentence of excommunication unless they speedily abjure this practical heresy. 87 The Greeks were now orthodox; but their religion might be tainted by the breath of the reigning monarch: the Franks were now contumacious; but a discerning eye might discern their approaching conversion, from the use, to the adoration, of images. The name of Charlemagne was stained by the polemic acrimony of his scribes; but the conqueror himself conformed, with the temper of a statesman, to the various practice of France and Italy. In his four pilgrimages or visits to the Vatican, he embraced the popes in the communion of friendship and piety; knelt before the tomb, and consequently before the image, of the apostle; and joined, without scruple, in all the prayers and processions of the Roman liturgy. Would prudence or gratitude allow the pontiffs to renounce their benefactor? 
Had they a right to alienate his gift of the Exarchate? Had they power to abolish his government of Rome? The title of patrician was below the merit and greatness of Charlemagne; and it was only by reviving the Western empire that they could pay their obligations or secure their establishment. By this decisive measure they would finally eradicate the claims of the Greeks; from the debasement of a provincial town, the majesty of Rome would be restored: the Latin Christians would be united, under a supreme head, in their ancient metropolis; and the conquerors of the West would receive their crown from the successors of St. Peter. The Roman church would acquire a zealous and respectable advocate; and, under the shadow of the Carlovingian power, the bishop might exercise, with honor and safety, the government of the city. 88
""".strip().replace(
    " \n", " "
)
doc = english_nlp(text)
entities = doc.ents
for i, entity in enumerate(entities):
    print(f"Entity {i+1}: ", entity.text, "| Entity Type: ", entity.label_)

That’s a lot more entities, so let’s store this data in a data structure. In Introduction to Digital Humanities, you probably saw how to count words in a block of text. Here we’ll do something similar: first we’ll count the number of times each entity is mentioned, and then we’ll count how many times each entity type is mentioned.

We’ll do each of these counts in two different ways:

  • defaultdict: a default dictionary is a data structure in Python that works like a regular dictionary, except that looking up a missing key creates it with a default value (here 0), so we can increment counts without checking whether a key exists yet.

  • Counter: a dictionary subclass designed for counting the discrete elements of a list or string.
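A quick, self-contained comparison of the two (the word list is made up for illustration):

```python
from collections import Counter, defaultdict

words = ["rome", "italy", "rome"]

# defaultdict(int) treats missing keys as 0, so we can increment directly
counts = defaultdict(int)
for w in words:
    counts[w] += 1

# Counter does the same counting in one step
counter = Counter(words)

print(dict(counts))            # {'rome': 2, 'italy': 1}
print(counter.most_common(1))  # [('rome', 2)]
```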

Additionally, for the Counter, we’ll need to separate the entities list out into a list of entities and a list of their labels. To do so, we’ll use list comprehensions. A list comprehension is a special Python syntax that allows us to put a loop on a single line. See the example below:

# normal for loop
elements = ["a", "b", "c"]  # a small sample list so the code runs
holder = []
for element in elements:
    holder.append(element)
# list comprehension
holder = [element for element in elements]

Importantly, these two blocks of code do the same thing; the list comprehension just fits on a single line. Comprehensions are usually a little faster than an explicit loop and, once you’re used to the syntax, easier to read.
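List comprehensions can also filter elements with an optional `if` clause, a pattern you’ll see often; a small sketch with made-up data:

```python
numbers = [1, 2, 3, 4, 5]

# keep only the even numbers while looping
evens = [n for n in numbers if n % 2 == 0]
print(evens)  # [2, 4]
```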

# method one: defaultdict
from collections import defaultdict

entity_counts = defaultdict(int)
entity_type_counts = defaultdict(int)

# for loop for incrementing
for entity in entities:
    entity_counts[entity.text] += 1
    entity_type_counts[entity.label_] += 1

# top 3 of each
# you may not have seen lambda before, we will discuss later in the course, link for those interested: https://docs.python.org/3/glossary.html#term-lambda
for entity_type, count in sorted(
    entity_type_counts.items(), key=lambda x: x[1], reverse=True
)[:3]:
    print(f"{entity_type}: {count}")
print("-" * 10)
for entity, count in sorted(entity_counts.items(), key=lambda x: x[1], reverse=True)[
    :3
]:
    print(f"{entity}: {count}")
# method two: Counter
from collections import Counter

# we need two lists: one of entity texts and one of their labels
entity_texts = [ent.text for ent in entities]
entity_labels = [ent.label_ for ent in entities]

entity_counts = Counter(entity_texts)
entity_type_counts = Counter(entity_labels)

# top 3 of each
for entity_type, count in entity_type_counts.most_common(3):
    print(f"{entity_type}: {count}")
print("-" * 10)
for entity, count in entity_counts.most_common(3):
    print(f"{entity}: {count}")
# we can now even plot the results
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
plt.barh(list(entity_counts.keys()), list(entity_counts.values()))
plt.xlabel("Count")
plt.ylabel("Entity")
plt.title("Entity Counts")

plt.subplot(1, 2, 2)
plt.barh(list(entity_type_counts.keys()), list(entity_type_counts.values()))
plt.xlabel("Count")
plt.ylabel("Entity Type")
plt.title("Entity Type Counts")
plt.show()

Non-English case: Parsing Latin texts#

import spacy

nlp = spacy.load("la_core_web_lg")  # loading the latin model instead of the english one

Data collection and scraping#

# for this example we'll use Cicero's Letters to Atticus
# here we download it in XML form and parse it with BeautifulSoup4
# if you don't remember this from the intro class, don't worry, we'll revisit it in week 5
!wget https://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A1999.02.0008 -O atticus.xml
from bs4 import BeautifulSoup

with open("atticus.xml", "r") as f:  # context manager closes the file for us
    soup = BeautifulSoup(f.read(), features="xml")
soup.find("div2")  # first letter
import re  # need to use regular expressions to do some cleaning, we'll revisit this too

letters = []
for d in soup.find_all("div2"):
    dateline = d.dateline.extract().get_text().strip()
    salute = d.salute.extract().get_text().strip()
    text = re.sub(r"\s+", " ", d.get_text().strip().replace("\n", ""))
    letters.append([dateline, salute, text])

print(letters[0])
# now we can use pandas to store the data for each letter
import pandas as pd

df = pd.DataFrame(letters, columns=["dateline", "salute", "text"])
df.head()
# example parse with one letter
first_letter = df.text.iloc[0]
first_letter_doc = nlp(first_letter)
first_letter_entities = first_letter_doc.ents
for i, entity in enumerate(first_letter_entities):
    print(
        f"Entity {i+1}: ",
        entity.text,
        "| Entity Type: ",
        entity.label_,
        "| Entity Lemma: ",
        entity.lemma_,
    )
# above I also print out each entity's lemma, the base form of the word, for counting purposes
# more on this next week
def get_entity_counts(text):
    doc = nlp(text)
    entities = doc.ents
    entity_texts = [ent.lemma_ for ent in entities]  # counting lemmas not text
    entity_labels = [ent.label_ for ent in entities]
    entity_counts = Counter(entity_texts)
    entity_type_counts = Counter(entity_labels)
    return entity_counts, entity_type_counts


df["entity_counts"] = df.text.apply(get_entity_counts)
df["entity_type_counts"] = df.entity_counts.apply(
    lambda x: x[1]
)  # taking the type counts
df["entity_counts"] = df.entity_counts.apply(lambda x: x[0])  # taking the lemma counts
all_entity_counts = df.entity_counts.sum()
all_type_counts = df.entity_type_counts.sum()
# limiting the plot below to 15 so that there aren't too many
top_15_entities = sorted(all_entity_counts.items(), key=lambda x: x[1], reverse=True)[
    :15
]
top_15_entities = dict(top_15_entities)
top_15_entities
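The `.sum()` calls above work because `Counter` objects support addition, which merges two counters key by key; a minimal illustration with made-up entries:

```python
from collections import Counter

a = Counter({"Roma": 2, "Atticus": 1})
b = Counter({"Roma": 1, "Cicero": 3})

merged = a + b  # counts for shared keys are added together
print(merged["Roma"], merged["Cicero"], merged["Atticus"])  # 3 3 1
```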
plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
plt.barh(list(top_15_entities.keys()), list(top_15_entities.values()))
plt.xlabel("Count")
plt.ylabel("Entity")
plt.title("Entity Counts")

plt.subplot(1, 2, 2)
plt.barh(list(all_type_counts.keys()), list(all_type_counts.values()))
plt.xlabel("Count")
plt.ylabel("Entity Type")
plt.title("Entity Type Counts")

plt.show()

Conclusion#

We’ve seen today how using specialized, pretrained models can help us do tasks like named entity recognition. We also worked on our Python skills in data parsing and plotting. In the next class, we will discuss some of the other features of spaCy models.