!pip install datasets wiktionaryparser -Uq

Language Hacking: Using spaCy for Morphosyntactic Analysis

This workshop is inspired by an article written by Tufts professor Dr. Gregory Crane. Crane teaches in the Classical Studies department, where he studies historical languages like Ancient Greek and Latin, so you might be wondering how natural language processing could be relevant to such a discipline. The answer is language hacking: the process of taking a text in a language which you may or may not know and using pretrained language models to gain a deeper understanding of it.

Professor Crane is a ‘digital philologist.’ Philology (from φιλολογία, the “love of words”) is the study of language in historical sources, which can include everything from ancient literature to contemporary song lyrics. As a result, philologists are interested in sources written in many different languages, but no one, no matter how good a philologist, can learn every language they might be interested in.

This is where language hacking comes in. Deep learning models like the ones we’ll play around with today offer new opportunities for research, language-learning and cross-cultural exchange.

What is spaCy?

Above I mentioned that we would be using a language model to tell us about the meaning of words in languages we don’t know. In this lesson, we’ll download these models through a Python package called spaCy. spaCy is a very powerful package with a lot of functionality. We’ll only be using a small part of what it offers: its pretrained language models.

Data and Model Preparation

Before we can start language hacking, we need to set up our texts and models. For this example, I’ll be using the original French version of Alexandre Dumas’ The Three Musketeers (Les trois mousquetaires). As a result, we’ll also be using the French spaCy model, which we’ll need to download.

from datasets import load_dataset

dataset = load_dataset("pnadel/les_trois_mousquetaires")
data = dataset["train"].to_pandas()
data
data.iloc[0]["text"][:1000]
# downloading french model from spacy
!python -m spacy download fr_core_news_md
# loading the model
import spacy

nlp = spacy.load("fr_core_news_md")
nlp  # working!

Apply the Model

Now that we have our data and our model, we can apply one to the other. The nlp object that we made above can be called like a function on a string of text, as the simple example in the next section shows.

Introduction to Morphosyntax

This section serves as an introduction to some important NLP vocabulary and Python syntax for using spaCy.

# this is a spaCy `Doc` object
example = nlp(
    "Je m'appelle Peter. J'aime les jeux vidéos."
)  # "My name is Peter. I like video games"
type(example)
# .text just gives us the text as a string
example.text, type(example.text)
# iterable to access each sentence in the original text
example.sents, type(example.sents)
# iterating through each sentence
for sent in example.sents:
    print(sent.text)
# iterating through each token
for token in example:
    print(token.text)
# iterating through each token
# AND getting some morphosyntactic information
for token in example:
    print(token.text, token.pos_, token.dep_, token.lemma_, sep="\t\t")

See below for a glossary of relevant terms for the rest of this notebook.

  • Lemma: A lemma is the root form of a word. For example, the words “ran”, “running” and “runs” all come from the root “run”. In this case, “run” would be the lemma of “ran”, “running” and “runs” (Accessed through the lemma_ property).

  • Part of speech: You might be familiar with part of speech as the function that a word takes in a sentence, but there are a couple of different standards for representing this information.

    • UPOS or Universal part of speech: This is the “normal” part of speech that you likely saw while learning English (Accessed through the pos_ property).

    • XPOS or Language-specific part of speech: These are part of speech tags that might change depending on language. Oddly, they don’t always change from language to language. In fact, they can be shared between languages but are often much more specific about the part of speech of a word (Accessed through the tag_ property).

  • Morphology: To quote the spaCy docs, “Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech.” So, where a lemma is the root form of a word, morphological features are what are added to the lemma to create grammatically correct variants of the same lemma (Accessed through the morph property).

  • Sentence relation or Sentence dependency: Each word, in addition to its part of speech and lemma, can be identified by what words it depends on or what words depend on it (or both). For example, the word “apple” is a NOUN, yet it can be used either as a subject (“The apple is red”) or an object (“I ate the apple”). The UPOS tag would be the same for each, but the sentence relation, that is, what the word is doing in the sentence, would be different (Accessed through the dep_ property). You can find a list of dependencies here: English or French.

  • Treebanks: A sentence is made up of a series of words which are dependent on one another. This understanding allows us to construct tree-like structures of sentences. This is similar to diagraming sentences, if you have ever done that. It can be quite useful to use this model when language hacking, as it will give you a better idea about how to progress through a sentence. Below you can see the treebank that spaCy created for the first sentence.

from spacy import displacy

displacy.render(list(example.sents)[0], style="dep", jupyter=True)

As in all treebanks, the verb (“appelle”) depends on only one thing: root, a placeholder which represents the semantic beginning of the sentence. From here, all other words depend on the main verb of the sentence, either directly or through another word. Take a look at the next example to see a variation.

# copulative example
displacy.render(
    nlp("Je suis fatigué"), style="dep", jupyter=True
)  # "I am tired" in English

In this treebank, there is no word with the VERB tag. Instead, the root word is “fatigué”, or “tired” in English. This is because the grammatical verb “suis” (“am” in English) is tagged as cop, or copula (this is sometimes called a ‘linking’ verb in English education; copula < Lat. co-, together, and apere, fasten). These verbs only join a subject to a predicate, such as an adjective, but do not indicate any action. It is for this reason that in many languages they are left out (cf. A. Gk. “μακρὸς ὁ οἶκος”, meaning “the house is large”, but literally “large the house”). Even in English, certain dialects like African American Vernacular English (AAVE) sometimes do not express copular verbs. For all of these reasons, they are marked differently than other verbs in treebanks.
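
To make the copular analysis concrete, the sketch below writes the parse of “Je suis fatigué” out by hand as plain Python tuples. The tags and head indices are my own transcription of the analysis described above, not output copied from the model, and the `ParsedToken` name is only for illustration.

```python
from typing import NamedTuple


class ParsedToken(NamedTuple):
    text: str
    upos: str  # universal part of speech
    head: int  # index of the token this one depends on
    dep: str   # dependency relation to the head


# Hand-annotated parse of "Je suis fatigué" (an assumption, not model output).
# The root points at itself, mirroring how spaCy stores a root's head.
sentence = [
    ParsedToken("Je", "PRON", 2, "nsubj"),
    ParsedToken("suis", "AUX", 2, "cop"),
    ParsedToken("fatigué", "ADJ", 2, "ROOT"),
]

# The root is the token whose head is itself.
root = next(tok for i, tok in enumerate(sentence) if tok.head == i)
print("root:", root.text, root.upos)

# Every other token, including the copula, attaches to the adjective.
for tok in sentence:
    if tok is not root:
        print(f"{tok.text} --{tok.dep}--> {sentence[tok.head].text}")

# No token carries the VERB tag in this analysis.
has_verb = any(tok.upos == "VERB" for tok in sentence)
print("contains a VERB tag:", has_verb)
```

Storing only a head index and a label per token is enough to recover the whole tree, which is why the same lookup works for any sentence.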

Using the Data

Now that we have a grasp on the core vocabulary, we can begin to delve into real French literature.

first_chapter = data.iloc[0]["text"]
first_chapter_doc = nlp(first_chapter)
first_sentence = list(first_chapter_doc.sents)[0]
first_sentence.text

In English: A short time ago, while making researches in the Royal Library for my History of Louis XIV., I stumbled by chance upon the Memoirs of M. d’Artagnan, printed—as were most of the works of that period, in which authors could not tell the truth without the risk of a residence, more or less long, in the Bastille—at Amsterdam, by Pierre Rouge.

displacy.render(first_sentence, style="dep", jupyter=True)

This tree is very complex, so let’s break it down using spaCy’s head, lefts, rights and children attributes.

# each token has a "head" word
# this is the word that it depends on
first_sentence[10], first_sentence[10].head
# the head word of faisant is tombai
# picking out the root verb
# looking for a token whose head is itself
root = [token for token in first_sentence if token.head == token][0]
root.text, root.dep_, root.pos_
# `lefts` returns a generator of all words to the left of a given token
# which have that given token as their head
print("LEFTS")
for t in root.lefts:
    print(t.text, t.dep_, t.pos_, sep="\t\t")

print()
# `rights` returns a generator of all words to the right of a given token
# which have that given token as their head
print("RIGHTS")
for t in root.rights:
    print(t.text, t.dep_, t.pos_, sep="\t\t")
# `children` returns a generator of the immediate dependents of a given word
# (use `subtree` to get all of its descendants)
for child in root.children:
    print(child.text, child.dep_, child.pos_, sep="\t\t")
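
As a rough mental model of what these attributes compute, here is a plain-Python sketch over a hand-annotated parse. The sentence, the head indices, and the helper names (`children`, `lefts`, `rights`, `subtree`) are illustrative assumptions that only mimic the behaviour described above; they are not spaCy's implementation.

```python
# Hand-annotated parse of "Le chat noir mange la souris"
# ("The black cat eats the mouse"). heads[i] is the index of the token
# that words[i] depends on; the root points at itself.
words = ["Le", "chat", "noir", "mange", "la", "souris"]
heads = [1, 3, 1, 3, 5, 3]


def children(i):
    """Indices of tokens whose head is token i (immediate dependents only)."""
    return [j for j, h in enumerate(heads) if h == i and j != i]


def lefts(i):
    """Immediate dependents that appear before token i."""
    return [j for j in children(i) if j < i]


def rights(i):
    """Immediate dependents that appear after token i."""
    return [j for j in children(i) if j > i]


def subtree(i):
    """Token i plus all of its descendants, in sentence order."""
    found = {i}
    stack = [i]
    while stack:
        for child in children(stack.pop()):
            found.add(child)
            stack.append(child)
    return sorted(found)


root = next(i for i, h in enumerate(heads) if h == i)
print("root:    ", words[root])
print("lefts:   ", [words[i] for i in lefts(root)])
print("rights:  ", [words[i] for i in rights(root)])
print("children:", [words[i] for i in children(root)])
print("subtree of 'chat':", [words[i] for i in subtree(1)])
```

The difference between the last two lines is the one to keep in mind: the children of a token are only its direct dependents, while its subtree also includes everything that hangs off those dependents.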

Trying to Read in a Language We Don’t Know

We can now move on to using these morphosyntactic annotations to read text in a language we don’t know.

# first let's apply the model to the whole text
from tqdm import tqdm

tqdm.pandas()  # for a progress bar

data["spacy_docs"] = data["text"].progress_apply(nlp)
# using wiktionary api as our dictionary
from wiktionaryparser import WiktionaryParser

parser = WiktionaryParser()
word = parser.fetch("hasard", "french")
word[0]["definitions"][0]["text"][1:]
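def look_up_definition(word):
    """Return the first Wiktionary definition of a French word."""
    parser = WiktionaryParser()
    entries = parser.fetch(word, "french")
    try:
        # skip the first item of the definition text, as in the cell above
        return entries[0]["definitions"][0]["text"][1:]
    except (IndexError, KeyError, TypeError):
        # no entry or no definitions were returned
        return "No definition for this word. It is likely a proper noun."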
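
Because the same lemma tends to be looked up many times, it may be worth caching results so that repeated words don't trigger repeated network requests. The sketch below shows the idea with `functools.lru_cache`; `fake_fetch`, `cached_definition` and the tiny dictionary are hypothetical stand-ins for the Wiktionary call so that the example runs offline.

```python
from functools import lru_cache

# Hypothetical offline stand-in for the Wiktionary lookup; the glosses
# here are illustrative, not copied from Wiktionary.
FAKE_DICTIONARY = {"hasard": ["chance", "luck"], "entrer": ["to enter"]}
calls = []  # records every simulated request so we can count them


def fake_fetch(lemma):
    """Pretend to query a remote dictionary for a lemma."""
    calls.append(lemma)
    return FAKE_DICTIONARY.get(lemma, [])


@lru_cache(maxsize=None)
def cached_definition(lemma):
    """Return a tuple of glosses for a lemma, fetching each lemma only once."""
    glosses = fake_fetch(lemma)
    if not glosses:
        return ("No definition for this word. It is likely a proper noun.",)
    return tuple(glosses)


for lemma in ["hasard", "entrer", "hasard", "Amsterdam", "hasard"]:
    print(lemma, "->", cached_definition(lemma))

print("requests made:", len(calls))
```

Returning a tuple rather than a list keeps the cached value immutable, so one caller cannot accidentally change what later callers receive.
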
import pprint

pp = pprint.PrettyPrinter(indent=4)

# picking an easy example sentence, but feel free to alter the index
# to get a more complex sentence
example_sentence = list(data.iloc[-1].spacy_docs.sents)[4]
pp.pprint(example_sentence.text)
# treebank
displacy.render(example_sentence, style="dep", jupyter=True)

Steps to get started language hacking:

  1. Find the root. This will usually be the verb.

  2. Look at what directly depends on the root. Begin translation.

  3. For each dependent word, look at what depends directly on it. Continue translation.

  4. Make observations on word usage and syntax.
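
The steps above can be sketched as a breadth-first walk over the dependency tree: start at the root, then visit its direct dependents, then theirs. The parse and the glosses below are hand-written for illustration (they are not model or Wiktionary output), and `reading_order` is a hypothetical helper name.

```python
from collections import deque

# Hand-annotated parse of "Le jeune roi lit une lettre"
# ("The young king reads a letter"), given as
# (text, head index, dependency label, English gloss).
tokens = [
    ("Le", 2, "det", "the"),
    ("jeune", 2, "amod", "young"),
    ("roi", 3, "nsubj", "king"),
    ("lit", 3, "ROOT", "reads"),
    ("une", 5, "det", "a"),
    ("lettre", 3, "obj", "letter"),
]


def reading_order(tokens):
    """Yield (depth, text, dep, gloss) from the root outwards, level by level."""
    # Step 1: find the root, the token that is its own head.
    root = next(i for i, tok in enumerate(tokens) if tok[1] == i)
    queue = deque([(root, 0)])
    while queue:
        i, depth = queue.popleft()
        text, _, dep, gloss = tokens[i]
        yield depth, text, dep, gloss
        # Steps 2 and 3: queue up whatever depends directly on this token.
        for j, tok in enumerate(tokens):
            if tok[1] == i and j != i:
                queue.append((j, depth + 1))


order = list(reading_order(tokens))
for depth, text, dep, gloss in order:
    print("  " * depth + f"{text} ({dep}): {gloss}")
```

The indentation in the printed output mirrors the depth in the tree, which gives a rough reading plan: verb first, then subject and object, then their modifiers.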

# find root and look it up
root = [token for token in example_sentence if token.head == token][0]
root.text, root.dep_, root.pos_, root.morph, look_up_definition(root.lemma_)

From this information, we can see that the root of this sentence is “entra”, that it means “enter” and that it is a 3rd person singular, indicative, past tense verb, which would translate to “entered” in English. Now we can look at the lefts and rights to find what depends on the root.

print(f"LEFTS: {[r for r in root.lefts]}")
first_dep = [r for r in root.lefts][0]
first_dep.text, first_dep.dep_, first_dep.pos_, first_dep.morph, look_up_definition(
    first_dep.lemma_
)

From the ‘nsubj’ tag, we can tell that this word is the subject of the verb “entra” and it means “he”. In the context of the story, this “he” refers to King Louis XIII.

print(f"RIGHTS: {[r for r in root.rights]}")
next_dep = [r for r in root.rights][0]
next_dep.text, next_dep.dep_, next_dep.pos_, next_dep.morph, look_up_definition(
    next_dep.lemma_
)

From the “obl:arg” tag, we can see that the king entered some kind of suburb. Let’s explore this word’s dependencies to find out more.

for descendant in next_dep.children:
    if descendant.dep_ != "dep":
        print(
            descendant.text,
            descendant.dep_,
            descendant.pos_,
            look_up_definition(descendant.lemma_),
            sep="\t\t",
        )

From this information, we can see that “he [the king] entered by the Saint-Jacques suburb.” We’re almost there!

last_dep = [r for r in root.rights][1]
last_dep.text, last_dep.dep_, last_dep.pos_, last_dep.morph, look_up_definition(
    last_dep.lemma_
)
for descendant in last_dep.children:
    if descendant.dep_ != "dep":
        print(
            descendant.text,
            descendant.dep_,
            descendant.pos_,
            look_up_definition(descendant.lemma_),
            sep="\t\t",
        )

Perfect! Now we have enough to create a translation of the whole sentence:

“He [the king] entered by the Saint-Jacques suburb in a splendid ceremony.”

Feel free to go back up and follow the same procedure with a different sentence.
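
When repeating the procedure, the morphological features are often the hardest part of the output to read. They follow the Universal Dependencies `Feature=Value` convention, with features joined by `|`, so they can be unpacked mechanically. The feature string below is a hand-written example consistent with the description of “entra” above rather than copied model output, and `parse_features`, `describe` and the small lookup table are hypothetical helpers.

```python
# Hand-written example of a Universal Dependencies feature string
# (an assumption consistent with "3rd person singular, indicative, past").
features = "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin"

# Small, deliberately incomplete table of readable names for some values.
READABLE = {
    ("Mood", "Ind"): "indicative",
    ("Number", "Sing"): "singular",
    ("Number", "Plur"): "plural",
    ("Person", "1"): "1st person",
    ("Person", "2"): "2nd person",
    ("Person", "3"): "3rd person",
    ("Tense", "Past"): "past tense",
    ("Tense", "Pres"): "present tense",
    ("VerbForm", "Fin"): "finite",
}


def parse_features(feature_string):
    """Split 'A=x|B=y' into {'A': 'x', 'B': 'y'}; an empty string gives {}."""
    if not feature_string:
        return {}
    return dict(pair.split("=", 1) for pair in feature_string.split("|"))


def describe(feature_string):
    """Turn a feature string into a readable phrase, keeping unknown pairs as-is."""
    parsed = parse_features(feature_string)
    return ", ".join(
        READABLE.get((name, value), f"{name}={value}")
        for name, value in parsed.items()
    )


print(parse_features(features))
print(describe(features))
```

As far as I know, spaCy's own morph objects also offer a `to_dict()` method that gives a similar dictionary, so in the notebook only the readable-name table would need to be written by hand.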

Limitations of Language Hacking

Language hacking is a very useful paradigm for reading text in languages you either don’t know or are learning. It allows scholars to explore traditions and cultures that they would have been excluded from in the past. That said, it comes with some key limitations.

  • Dictionaries: As we saw above, we relied heavily on the open-source Wiktionary as our dictionary. This will work fine for a language like French with millions of speakers. But for languages that few or no people speak, a specialized dictionary will be necessary.

  • Available language models: The point above about dictionaries also holds for pretrained language models. spaCy has pretrained language models for a number of languages, but there are many, many more languages that it does not support. Training a new model for a new language is possible, but very time-consuming.

  • Inexact translations: Language hacking is by no means a substitute for a prepared translation. Translators can provide a much more competent rendering of the original language, but language hacking gives scholars an additional method to interrogate linguistic questions in texts from other languages.

Please contact me at peter.nadel@tufts.edu for any questions.