{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using pretrained models for Named Entity Recognition (NER)\n", "\n", "In this notebook, we are going to explore an important subfield of natural language processing, named entity recognition or NER.\n", "\n", "By the end of today's class you'll be able to:\n", "* Use a pretrained `spaCy` model to find named entities, especially for a non-English language\n", "* Explain why finding named entities is challenging without the use of a pretrained token classification model\n", "* Employ list comprehensions and advanced dictionaries in Python to parse model output\n", "* Install spaCy and download associated models in a Colab notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is NER and why does it matter?\n", "\n", "Named entity recognition describes any computational method for extracting the names of people, places, or things from unstructured text. It is a hard classification task, meaning that every word in a document either is a type of named entity or it is not. For example, in the following sentences:\n", "> My name is Peter Nadel. I work at Tufts University.\n", "\n", "the span 'Peter Nadel' could receive a PERSON tag, whereas 'Tufts University' could receive a PLACE tag. Importantly, in NER, no token can receive more than one tag.\n", "\n", "As a result, NER can be used in a wide variety of fields and applications.\n", "\n", "## How do you do NER?\n", "Just like many other NLP tasks, there are two main ways of conducting NER:\n", "1. **Rules-based**: This approach involves developing a list of rules which can identify a named entity deterministically. For example, if we wanted to identify someone's name, we would develop a rule like: find two words that are capitalized next to each other. 
This has the advantage that we will always find the entities we have rules for, but has the disadvantage that we have to write a huge number of rules for this approach to be effective. \n", "2. **Machine learning**: This approach involves collecting and manually annotating many examples of what named entities look like in context. We can then teach a computer what a named entity looks like, allowing it to identify named entities in new texts. This has the advantage that we don't need to specify exactly what a named entity looks like ahead of time, but it requires considerable manual annotation to get started.\n", "\n", "In this notebook, we will use a machine learning *model* to conduct NER. This will be a *pretrained* model, meaning that someone else has already spent the time and energy to train it, so we don't need to worry about that. (However, later in the course we will train an NER model from scratch.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing for NER\n", "\n", "We'll be using a package called `spaCy` to conduct our NER. `spaCy` has a variety of pretrained models that we can take advantage of. The number of languages that `spaCy` supports is somewhat small, but throughout this class we'll see how we can supplement it with other languages. For this example, we'll use `LatinCy`, a `spaCy` model for the Latin language. The model we'll be using was trained by [Patrick Burns](https://isaw.nyu.edu/people/staff/patrick-burns), a researcher at NYU's Institute for the Study of the Ancient World.\n", "\n", "Neither `spaCy` nor `LatinCy` comes with this Colab notebook by default, so we'll need to install them. We will be using `pip`, a command line tool for installing Python packages, to do so." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# installations: recall that we use the '!' 
to indicate that this is a shell command\n", "# this cell will take about 5 min to run\n", "!pip install spacy transformers\n", "!python -m spacy download en_core_web_lg\n", "!pip uninstall -y spacy_lookups_data\n", "!pip install \"la-core-web-lg @ https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using `spaCy` for Named Entity Recognition\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### English examples\n", "\n", "Before we turn to `LatinCy`, let's take a look at what this task looks like for some simple English texts. Then we can apply the same rationale to using the Latin model with complex Latin texts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "\n", "english_nlp = spacy.load(\n", " \"en_core_web_lg\"\n", ") # spacy.load takes in the model name and gives us back an nlp object we can work with\n", "english_nlp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# example from above\n", "text = \"\"\"\n", "My name is Peter Nadel. 
I work at Tufts University.\n", "\"\"\".strip()\n", "doc = english_nlp(text) # call english_nlp with text to get a doc object\n", "type(doc)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# investigate the Doc object\n", "from spacy.tokens.doc import Doc\n", "\n", "Doc # can find same info here: https://spacy.io/api/doc\n", "# break for two to three minutes to think of some questions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get entities\n", "entities = doc.ents\n", "for i, entity in enumerate(entities):\n", " print(f\"Entity {i+1}: \", entity.text, \"| Entity Type: \", entity.label_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now let's try a more complex example: a passage from Gibbon's Decline and Fall of the Roman Empire\n", "# Go back and replace this text when you're ready\n", "text = \"\"\"\n", "It was after the Nicene synod, and under the reign of the pious Irene, that the popes consummated the separation of Rome and Italy, by the translation of the empire to the less orthodox Charlemagne. They were compelled to choose between the rival nations: religion was not the sole motive of their choice; and while they dissembled the failings of their friends, they beheld, with reluctance and suspicion, the Catholic virtues of their foes. The difference of language and manners had perpetuated the enmity of the two capitals; and they were alienated from each other by the hostile opposition of seventy years. In that schism the Romans had tasted of freedom, and the popes of sovereignty: their submission would have exposed them to the revenge of a jealous tyrant; and the revolution of Italy had betrayed the impotence, as well as the tyranny, of the Byzantine court. 
The Greek emperors had restored the images, but they had not restored the Calabrian estates 85 and the Illyrian diocese, 86 which the Iconoclasts had torn away from the successors of St. Peter; and Pope Adrian threatens them with a sentence of excommunication unless they speedily abjure this practical heresy. 87 The Greeks were now orthodox; but their religion might be tainted by the breath of the reigning monarch: the Franks were now contumacious; but a discerning eye might discern their approaching conversion, from the use, to the adoration, of images. The name of Charlemagne was stained by the polemic acrimony of his scribes; but the conqueror himself conformed, with the temper of a statesman, to the various practice of France and Italy. In his four pilgrimages or visits to the Vatican, he embraced the popes in the communion of friendship and piety; knelt before the tomb, and consequently before the image, of the apostle; and joined, without scruple, in all the prayers and processions of the Roman liturgy. Would prudence or gratitude allow the pontiffs to renounce their benefactor? Had they a right to alienate his gift of the Exarchate? Had they power to abolish his government of Rome? The title of patrician was below the merit and greatness of Charlemagne; and it was only by reviving the Western empire that they could pay their obligations or secure their establishment. By this decisive measure they would finally eradicate the claims of the Greeks; from the debasement of a provincial town, the majesty of Rome would be restored: the Latin Christians would be united, under a supreme head, in their ancient metropolis; and the conquerors of the West would receive their crown from the successors of St. Peter. The Roman church would acquire a zealous and respectable advocate; and, under the shadow of the Carlovingian power, the bishop might exercise, with honor and safety, the government of the city. 
88\n", "\"\"\".strip().replace(\n", " \" \\n\", \" \"\n", ")\n", "doc = english_nlp(text)\n", "entities = doc.ents\n", "for i, entity in enumerate(entities):\n", " print(f\"Entity {i+1}: \", entity.text, \"| Entity Type: \", entity.label_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a lot more entities, so let's store this data in a data structure. In Introduction to Digital Humanities, you probably saw how to count words in a block of text. Here we'll do something similar: first we'll count the number of times each entity is mentioned, and then we'll count how many times each entity type is mentioned.\n", "\n", "We'll do both of these in two different ways:\n", "* `defaultdict`: a data structure in Python that functions like a dictionary, but supplies a default value for any missing key (here `int()`, i.e. `0`).\n", "* `Counter`: a dictionary that is designed for counting the discrete elements of a list or string.\n", "\n", "Additionally, for the `Counter`, we'll need to separate the entities out into a list of entity texts and a list of their labels. To do so, we'll use list comprehensions. A list comprehension is a special Python syntax that allows us to put a loop on a single line. See the example below:\n", "\n", "```python\n", "# normal for loop\n", "holder = []\n", "for element in elements:\n", " holder.append(element)\n", "```\n", "```python\n", "# list comprehension\n", "holder = [element for element in elements]\n", "```\n", "Importantly, these two blocks of code do the same thing; the list comprehension just fits on a single line. This can be slightly faster for building small- to medium-sized lists and is often easier to read."
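, "\n",
"\n",
"For instance, here is a small self-contained sketch (the variable names `numbers` and `squares` are just illustrative, not part of the lesson's data):\n",
"\n",
"```python\n",
"numbers = [1, 2, 3, 4]\n",
"\n",
"# normal for loop\n",
"squares = []\n",
"for n in numbers:\n",
"    squares.append(n * n)\n",
"\n",
"# the same thing as a list comprehension\n",
"squares = [n * n for n in numbers]\n",
"print(squares)  # [1, 4, 9, 16]\n",
"```"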
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# method one: defaultdict\n", "from collections import defaultdict\n", "\n", "entity_counts = defaultdict(int)\n", "entity_type_counts = defaultdict(int)\n", "\n", "# for loop for incrementing\n", "for entity in entities:\n", " entity_counts[entity.text] += 1\n", " entity_type_counts[entity.label_] += 1\n", "\n", "# top 3 of each\n", "# you may not have seen lambda before; we will discuss it later in the course, link for those interested: https://docs.python.org/3/glossary.html#term-lambda\n", "for entity_type, count in sorted(\n", " entity_type_counts.items(), key=lambda x: x[1], reverse=True\n", ")[:3]:\n", " print(f\"{entity_type}: {count}\")\n", "print(\"-\" * 10)\n", "for entity, count in sorted(entity_counts.items(), key=lambda x: x[1], reverse=True)[\n", " :3\n", "]:\n", " print(f\"{entity}: {count}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# method two: Counter\n", "from collections import Counter\n", "\n", "# we need two lists: one for entity texts and one for their labels\n", "entity_texts = [ent.text for ent in entities]\n", "entity_labels = [ent.label_ for ent in entities]\n", "\n", "entity_counts = Counter(entity_texts)\n", "entity_type_counts = Counter(entity_labels)\n", "\n", "# top 3 of each\n", "for entity_type, count in entity_type_counts.most_common(3):\n", " print(f\"{entity_type}: {count}\")\n", "print(\"-\" * 10)\n", "for entity, count in entity_counts.most_common(3):\n", " print(f\"{entity}: {count}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we can now even plot the results\n", "import matplotlib.pyplot as plt\n", "\n", "plt.figure(figsize=(20, 10))\n", "plt.subplot(1, 2, 1)\n", "plt.barh(list(entity_counts.keys()), list(entity_counts.values()))\n", "plt.xlabel(\"Count\")\n", "plt.ylabel(\"Entity\")\n", "plt.title(\"Entity Counts\")\n", "\n", 
"plt.subplot(1, 2, 2)\n", "plt.barh(list(entity_type_counts.keys()), list(entity_type_counts.values()))\n", "plt.xlabel(\"Count\")\n", "plt.ylabel(\"Entity Type\")\n", "plt.title(\"Entity Type Counts\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Non-English case: Parsing Latin texts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "\n", "nlp = spacy.load(\"la_core_web_lg\") # loading the Latin model instead of the English one" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data collection and scraping" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# for this example we'll use Cicero's Letters to Atticus\n", "# here we download it in XML form and parse it with BeautifulSoup4\n", "# if you don't remember this from the intro class, don't worry, we'll revisit this in week 5\n", "!wget \"https://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A1999.02.0008\" -O atticus.xml" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(open(\"atticus.xml\", \"r\").read(), features=\"xml\")\n", "soup.find(\"div2\") # first letter" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re # need to use regular expressions to do some cleaning, we'll revisit this too\n", "\n", "letters = []\n", "for d in soup.find_all(\"div2\"):\n", " dateline = d.dateline.extract().get_text().strip()\n", " salute = d.salute.extract().get_text().strip()\n", " text = re.sub(r\"\\s+\", \" \", d.get_text().strip().replace(\"\\n\", \"\"))\n", " letters.append([dateline, salute, text])\n", "\n", "print(letters[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now we can use pandas to store the data for each letter\n", 
"import pandas as pd\n", "\n", "df = pd.DataFrame(letters, columns=[\"dateline\", \"salute\", \"text\"])\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# example parse with one letter\n", "first_letter = df.text.iloc[0]\n", "first_letter_doc = nlp(first_letter)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_letter_entities = first_letter_doc.ents\n", "for i, entity in enumerate(first_letter_entities):\n", " print(\n", " f\"Entity {i+1}: \",\n", " entity.text,\n", " \"| Entity Type: \",\n", " entity.label_,\n", " \"| Entity Lemma: \",\n", " entity.lemma_,\n", " )\n", "# here I also print out each word's lemma, the base form of the word, for counting purposes\n", "# more on this next week" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_entity_counts(text):\n", " doc = nlp(text)\n", " entities = doc.ents\n", " entity_texts = [ent.lemma_ for ent in entities] # counting lemmas not text\n", " entity_labels = [ent.label_ for ent in entities]\n", " entity_counts = Counter(entity_texts)\n", " entity_type_counts = Counter(entity_labels)\n", " return entity_counts, entity_type_counts\n", "\n", "\n", "df[\"entity_counts\"] = df.text.apply(get_entity_counts)\n", "df[\"entity_type_counts\"] = df.entity_counts.apply(\n", " lambda x: x[1]\n", ") # taking the type counts\n", "df[\"entity_counts\"] = df.entity_counts.apply(lambda x: x[0]) # taking the lemma counts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_entity_counts = df.entity_counts.sum()\n", "all_type_counts = df.entity_type_counts.sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# limiting the plot below to 15 so that there aren't too many\n", "top_15_entities = sorted(all_entity_counts.items(), key=lambda x: x[1], reverse=True)[\n", " 
:15\n", "]\n", "top_15_entities = dict(top_15_entities)\n", "top_15_entities" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(20, 10))\n", "plt.subplot(1, 2, 1)\n", "plt.barh(list(top_15_entities.keys()), list(top_15_entities.values()))\n", "plt.xlabel(\"Count\")\n", "plt.ylabel(\"Entity\")\n", "plt.title(\"Entity Counts\")\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.barh(list(all_type_counts.keys()), list(all_type_counts.values()))\n", "plt.xlabel(\"Count\")\n", "plt.ylabel(\"Entity Type\")\n", "plt.title(\"Entity Type Counts\")\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "We've seen today how using specialized, pretrained models can help us do tasks like named entity recognition. We also worked on our Python skills in data parsing and plotting. In the next class, we will discuss some of the other features of `spaCy` models." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" }, "mystnb": { "execution_mode": "off" } }, "nbformat": 4, "nbformat_minor": 0 }