{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Gentle Introduction to Natural Language Processing\n", "Today, we will take a text, *War and Peace* by Leo Tolstoy, and try to verify or refute the claims of an outside source, *Tolstoy's Phoenix: From Method to Meaning in War and Peace* by George R. Clay (1998). This is a very common task both in and outside of the digital humanities and will introduce you to the popular NLP Python package, the Natural Language Toolkit (NLTK), and expose you to common methodologies for wrangling textual data. \n", "\n", "## Our overall task\n", "\n", "We will start by downloading the book and then we will learn how to clean the text, perform basic statistics, create visualizations, and discuss what we found and how to present those results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Goals\n", "* Understand the rights we have to access books on Project Gutenberg\n", "* Read in a text from Project Gutenberg\n", "* Clean textual data using regular expressions\n", "* Perform basic word frequency statistics\n", "* Create visualizations of these statistics\n", "* Discuss how to communicate these results\n", "* Return to our research question" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## General methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read in our data\n", "We will start with a URL from Project Gutenberg. All of the texts from Gutenberg are in the public domain, so we won't have to worry about rights, but be aware of who owns the intellectual property to a text before you scrape it. In this section, we will break the text up by chapter division. Later we'll do the same but by book division.\n", "\n", "#### The requests library\n", "Here, we are using a library called `requests`.
This library makes it easy to send HTTP requests, which are how programs ask web servers for data.\n", "\n", "More details below:\n", "https://pypi.org/project/requests/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "url = \"https://www.gutenberg.org/cache/epub/2600/pg2600.txt\"\n", "text = requests.get(url).text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## IMPORTANT: this pipeline will work for any text, but you will need to understand your text\n", "## There is no one-size-fits-all for text cleaning, especially for extra information, like tables\n", "## of contents and disclaimers\n", "\n", "# In this case, we found the first line that marks the start of the book. We found this by looking at the raw file.\n", "# Link to book: https://www.gutenberg.org/cache/epub/2600/pg2600.txt\n", "\n", "start = text.find(\"BOOK ONE: 1805\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n\")\n", "start" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "end = text.find(\"END OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE\")\n", "end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Using the start and end points that we found above, we can filter out all of the text before the start of the book\n", "## and after its end.\n", "\n", "## In Python, we can use square brackets to slice out everything between the new start and the new end, as we see below:\n", "\n", "text = text[start:end]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a text that we are confident is nothing but the text of the book, we can begin to dissect it into its component parts. First, let's break it up into chapters."
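To make the find-and-slice pattern we just used concrete, here is a minimal, self-contained sketch on a toy string (the `START`/`END` markers below are invented for illustration; *War and Peace* uses the markers shown above):

```python
# toy stand-in for a downloaded text; the markers are invented
toy = "HEADER JUNK...START Actual book text here.END more junk"

start = toy.find("START ") + len("START ")  # skip past the marker itself
end = toy.find("END")                       # stop just before the end marker

print(toy[start:end])  # -> Actual book text here.
```

The same two `find` calls and one slice are all the chapter-trimming code above does, just with longer markers.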
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do so, I am using a very versatile package called `re`, Python's regular expression (regex) library, to do some advanced string parsing. The regex function `finditer` takes in a pattern and a text and returns a match for every time that pattern occurs in the text. These patterns can look very complicated, but, in this case, it is '(CHAPTER [A-Z]+)', which means: *find all of the times when there is the word 'CHAPTER' followed by a space and then one or more capital letters*.\n", "This pattern fits the Roman numeral numbering of the chapters (ex. CHAPTER I or CHAPTER XII)\n", "\n", "`finditer` also returns the indices where the chapter title begins and ends. So, we can use the ending of one chapter title and the beginning of the next one to get all of the text in between the two chapters. This text is, by definition, the text in that chapter.\n", "\n", "You can read more about regex [here](https://librarycarpentry.org/lc-data-intro-archives/04-regular-expressions/index.html) and you can play around with your own regex at [regex101](https://regex101.com/). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "ch_dict = {}\n", "ch_list = list(re.finditer(r\"(CHAPTER [A-Z]+)\", text))\n", "for i, m in enumerate(ch_list):\n", " if i < len(ch_list) - 1:\n", " ch_dict[f\"Chapter {i+1}\"] = (m.end(0), ch_list[i + 1].start(0) - 1)\n", " else:\n", " ch_dict[f\"Chapter {i+1}\"] = (m.end(0), len(text) - 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have all the text extracted! But... dictionaries are not the most useful data structure. They can be difficult to query and to get basic statistics from. So we will convert our dictionary into a more robust data structure, a dataframe from the `pandas` package."
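Before we do that conversion, a toy example may help make the `finditer` bookkeeping concrete (the string and "chapters" below are invented for illustration):

```python
import re

toy = "CHAPTER I text one. CHAPTER II text two."

# finditer yields one match object per occurrence of the pattern
matches = list(re.finditer(r"CHAPTER [A-Z]+", toy))

# each match knows exactly where it sits in the string
first, second = matches
print(first.group(0), first.start(0), first.end(0))  # -> CHAPTER I 0 9

# the text of the first "chapter" lies between the two headings
print(toy[first.end(0):second.start(0)].strip())  # -> text one.
```

This is exactly the logic of the loop above: each chapter's text runs from the end of its own heading to the start of the next one.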
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd ## You will almost always see pandas imported this way.\n", "\n", "## Let's see what happens when we input the dictionary directly\n", "pd.DataFrame(ch_dict)\n", "## it's close but not quite what we wanted" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## pandas expects a numerical index, which our dictionary (being keyed by strings, not positions) does not have\n", "## we can tell pandas to use the string keys as the index with the 'from_dict' method and the 'orient' keyword argument\n", "## the 'reset_index' method will then turn our index into a column and give us a numerical index for the rows\n", "ch_df = pd.DataFrame.from_dict(ch_dict, orient=\"index\")\n", "ch_df = ch_df.reset_index()\n", "ch_df = ch_df.rename(columns={\"index\": \"chapter\", 0: \"start_idx\", 1: \"end_idx\"})\n", "ch_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the dataframe is in the correct orientation, let's use the indices that we got to select the texts of each chapter. All we need to do is input `start_idx:end_idx` into square brackets next to the `text` object. Below is an example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = ch_df.iloc[0][\"start_idx\"] ## .iloc[0] gives us the first row of the dataframe\n", "e = ch_df.iloc[0][\"end_idx\"]\n", "text[s:e]\n", "## the first full chapter of War and Peace\n", "## you can change the number after iloc to change which chapter you see" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is great, but now how do we do it with the whole column?\n", "\n", "We can use the `.apply` method from `pandas`, which allows us to cast a single function to every cell in a column. In this case, we need to use two different columns, so we will use a `lambda` expression.
`lambda` expressions are a way to build very simple functions in Python. They are limited to a single expression, which makes them a good fit for short, one-off transformations like this one. [Here](https://www.w3schools.com/python/python_lambda.asp) is a useful starter guide to `lambda` from W3Schools.\n", "\n", "For us, we need a `lambda` expression that pulls the values from the two index columns and returns the text between them. Note the `axis=1` keyword parameter. This tells `pandas` to apply the function to each row, so the `lambda` receives a whole row (with both columns) at a time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ch_df[\"text\"] = ch_df.apply(lambda x: text[x[\"start_idx\"] : x[\"end_idx\"]], axis=1)\n", "ch_df[\"text\"] = ch_df[\"text\"].apply(\n", " lambda x: x.replace(\"\\r\", \"\").replace(\"\\n\", \" \").replace(\"  \", \" \").lower()\n", ") ## using lambda again for some simple cleaning with the replace and lower methods\n", "ch_df = ch_df.drop(\n", " [\"start_idx\", \"end_idx\"], axis=1\n", ") ## and now we can drop our indices just so they don't clutter up our dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ch_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization\n", "Now that we have read in our data and begun some cleaning, we will tokenize our text. Tokenization is a process by which we can split up the chapters we extracted into relevant units.
There are two main types of tokenization:\n", "* Sentence-level tokenization: splitting up the text into sentences\n", "* Word-level tokenization: splitting up the text into words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## The Natural Language Toolkit (NLTK) provides a lot of very useful utilities for cleaning and analyzing textual data.\n", "## It requires this 'punkt' download for us to be able to tokenize\n", "import nltk\n", "\n", "nltk.download(\"punkt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sentence-level tokenization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ch_df[\"sents\"] = ch_df.text.apply(nltk.sent_tokenize)\n", "type(\n", " ch_df[\"sents\"].iloc[0]\n", ") ## each element of the 'sents' column is a list of sentences in each chapter" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Now we can 'explode' these lists, so that each sentence has its own cell in the 'sents' column.\n", "sent_explode = ch_df.explode(\"sents\").drop([\"text\"], axis=1).reset_index(drop=True)\n", "sent_explode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Word-level tokenization\n", "To conduct word-level tokenization, we'll be using regex again, but this time we'll use a useful function in NLTK that takes in a regular expression and splits out the words matching that pattern. This is in opposition to so-called 'white-space tokenization', where words are delimited by spaces. White-space tokenization can be useful in some cases, but for most natural language it will be insufficient. That's why this NLTK function is so helpful. All we need to do is input the pattern below and we will get very good English tokenization.\n", "\n", "Note that tokenization is language dependent. Just because a certain method works for English does not mean it will work for other languages."
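To see why plain white-space tokenization falls short, compare a simple `split` with a regex-based approach on a punctuated sentence. This is a minimal `re`-only sketch of the idea (NLTK's `RegexpTokenizer`, used below, applies a pattern in essentially this way; the sentence is adapted from the novel's opening line):

```python
import re

sentence = "well, prince, so genoa and lucca are now just family estates."

# white-space tokenization leaves punctuation glued to the words
print(sentence.split())  # 'well,' and 'estates.' keep their punctuation

# a regex that keeps word characters (plus internal hyphens/apostrophes)
tokens = re.findall(r"\w+(?:[-’]\w+)*", sentence)
print(tokens)  # clean words only
```

With the punctuation stripped at tokenization time, counting word frequencies later becomes a simple matter of comparing strings.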
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.tokenize import RegexpTokenizer\n", "\n", "tokenizer = RegexpTokenizer(r\"(?x)\\w+(?:[-’]\\w+)*\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ex_sentence = sent_explode.sents.iloc[5]\n", "print(ex_sentence)\n", "tokenizer.tokenize(ex_sentence)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sent_explode[\"word_tokenized\"] = sent_explode[\"sents\"].apply(tokenizer.tokenize)\n", "sent_explode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualization\n", "We can now visualize word use over the course of the book. For this task, we are more interested in general trends, so let's zoom back out to the book-level.\n", "\n", "For more information on how to create visualizations in Python, please go through [this training](https://tuftsdatalab.github.io/uep239-data-analysis/)."
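Before plotting, it can help to sanity-check raw counts. A minimal sketch using the standard library's `collections.Counter` on a toy token list (the tokens below are invented for illustration; in practice you would pass in a row of the `word_tokenized` column):

```python
from collections import Counter

# toy stand-in for one book's word_tokenized list
tokens = ["napoleon", "said", "the", "prince", "napoleon", "the", "the"]

counts = Counter(tokens)
print(counts.most_common(2))  # -> [('the', 3), ('napoleon', 2)]
```

A quick look at `most_common` like this often reveals cleaning problems (stray punctuation, inconsistent casing) before they distort a plot.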
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## creating a new dataframe organized by book\n", "## this is the same code as above\n", "book_dict = {}\n", "book_list = list(re.finditer(r\"(BOOK [A-Z]+:)\", text))\n", "for i, m in enumerate(book_list):\n", " if i < len(book_list) - 1:\n", " book_dict[f\"Book {i+1}\"] = (m.end(0), book_list[i + 1].start(0) - 1)\n", " else:\n", " book_dict[f\"Book {i+1}\"] = (m.end(0), len(text) - 1)\n", "\n", "book_df = (\n", " pd.DataFrame.from_dict(book_dict, orient=\"index\")\n", " .reset_index()\n", " .rename(columns={\"index\": \"book\", 0: \"start_idx\", 1: \"end_idx\"})\n", ")\n", "\n", "book_df[\"text\"] = book_df.apply(lambda x: text[x[\"start_idx\"] : x[\"end_idx\"]], axis=1)\n", "book_df[\"text\"] = book_df[\"text\"].apply(\n", " lambda x: x.replace(\"\\r\", \"\").replace(\"\\n\", \" \").replace(\"  \", \" \").lower()\n", ")\n", "book_df = book_df.drop([\"start_idx\", \"end_idx\"], axis=1)\n", "\n", "book_df[\"word_tokenized\"] = book_df.text.apply(tokenizer.tokenize)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "search = \"napoleon\"\n", "book_df[\"search_count\"] = book_df[\"word_tokenized\"].apply(\n", " lambda x: x.count(search.lower())\n", ")\n", "\n", "book_df.plot(x=\"book\", y=\"search_count\", kind=\"bar\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalizing\n", "Once I am happy with how my pipeline is functioning, I like to write a function that is as general as possible. In this case, we started with a Gutenberg URL to a `txt` file, so let's take that as the starting place for our general function. Please take some time to develop a function that returns a `DataFrame` we can use to generate a visualization like the one above from any Gutenberg URL. Be careful to note which parts of the code above are specific to *War and Peace* and which are general."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_gutenberg_data(\n", " url, # the Gutenberg URL\n", " # you may need more arguments for your function\n", "):\n", " \"\"\"\n", " Takes in a Gutenberg URL\n", " Returns a DataFrame of relevant information\n", " \"\"\"\n", " ## GET TEXT FROM URL WITH A REQUEST\n", "\n", " ## TRIM YOUR TEXT BASED ON SUBSTRINGS\n", "\n", " content_dict = {}\n", " content_list = list(re.finditer(r\"YOUR REGEX HERE!!\", text))\n", "\n", " ## CREATE A DICTIONARY FOR THE INDEX POSITIONS OF YOUR CONTENT\n", "\n", " ## CONVERT YOUR DICTIONARY INTO A DATAFRAME\n", "\n", " ## CREATE A TEXT COLUMN IN YOUR DATAFRAME\n", "\n", " ## CLEAN YOUR TEXT\n", "\n", " ## TOKENIZE (SENTENCE AND WORD LEVEL)\n", "\n", " ## RETURN YOUR DATAFRAME" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "search = \"\" ## INPUT SEARCH TERM HERE\n", "\n", "## ADD A COUNTS COLUMN\n", "\n", "## PLOT A GRAPH" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Specific Application" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From *Tolstoy's Phoenix: From Method to Meaning in War and Peace* by George R. Clay (1998), page 7:\n", "\n", ">In the most direct use of Tolstoy's techniques, he categorizes in his own voice. Instead of \"The princess smiled, thinking she knew more about the subject than Prince Vasili,\" he will write: \"The princess smiled, *as people do* who think they know more about the subject under discussion than those they are talking with\" (1:2, p.77; emphasis added). The first version expresses a private thought, but the one Tolstoy used implies (as R. F. Christian phrased it) that \"there is a basic denominator of human experience\" -- a sameness about our pattern of behavior, so that we all know what kind of smile it is... By writing \"as people do,\" Tolstoy isn't telling us what this smile is like, he is assuming that we *know* what it is like: that we have seen it many times before..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to find more examples and see if they fall in line with Clay's analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the phrases Clay identifies as relevant\n", "phrases = [\n", " \"as people do\",\n", " \"of one who\",\n", " \"peculiar to\",\n", " \"as with everyone\",\n", " \"as one who\",\n", " \"only used by persons who\",\n", " \"of a man who\",\n", " \"which a man has\",\n", " \"in the way\",\n", " \"as is usually the case\",\n", " \"such as often occurs\",\n", " \"which usually follows\",\n", "]\n", "\n", "\n", "def printExcerpt(row):\n", " print(row[\"chapter\"], row[\"sents\"], sep=\"\\t\")\n", " print(\"\\n\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for phrase in phrases:\n", " print(phrase.upper())\n", " sub_df = sent_explode.loc[sent_explode[\"sents\"].str.contains(phrase)]\n", " sub_df.apply(printExcerpt, axis=1)\n", " print(\"-----\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great!
Now we can see all of the times that Tolstoy uses these particular phrases and we can begin to see what Clay is saying. In these lines, Tolstoy is not giving us a direct characterization or description of a person or thing. Instead, he is telling us about the type of person or thing it is. The next step would then be to see if there is a trend in the people or things he refers to in this manner or if this is, as Clay contends, a general feature of Tolstoy's prose. At every step in that process, we must make sure to compare our results with Clay and with Tolstoy himself. It can be easy to trick yourself into a discovery, and the only way to avoid that is to have a close connection to the data itself. Always spend time playing around with your data, even after you think it's 100% clean, to see what it hides." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reviewing what we learned\n", "* How to take advantage of free repositories of text like Project Gutenberg\n", "* Taking a URL to a text and reading it into our Python runtime\n", "* Cleaning our data so that we can find some general statistics\n", "* How to create visualizations of these statistics\n", "* Applying simple methodologies to a complex question\n", "* Outlining a method we could use to answer such a question" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a challenge, now that we have a good baseline for picking out the phrases Clay points us to, can you create a dataframe that stores the sentence, the phrase from Clay it uses, and the chapter it is in? Be sure to start with a dictionary, as we did when we created a dataframe together, and then convert it into a dataframe using `from_dict`. When you're done, you can export it as a .csv file, so you can use it later, with the method `whatever_your_df_is_called.to_csv('name_of_file.csv')`. This way you can share your work with other scholars and later deploy it to a website or another publication.
If you have trouble or just want to show off how you did it, feel free to reach out and let me know at peter.nadel@tufts.edu." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Thanks for reading" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3.10.8 ('webscrape_env')", "language": "python", "name": "python3" }, "language_info": { "name": "python" }, "mystnb": { "execution_mode": "off" }, "vscode": { "interpreter": { "hash": "9fab2b31d7f94569bb304ed7220415dbab9e94896d45f4079aae060fbbe4f8bf" } } }, "nbformat": 4, "nbformat_minor": 0 }