{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install whoosh paginate-whoosh streamlit -Uq\n", "!wget https://tufts.box.com/shared/static/325sgkodnq30ez61ugazvctif6r24hsu.csv -O daf.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating a Search Engine for your own data using `Whoosh`\n", "\n", "No matter the discipline, scholars tend to accumulate a vast array of textual sources. Whether these are primary or secondary sources, researchers often need help wading through them and finding the best places to start digging into the text.\n", "\n", "In this notebook, we'll explore how to create and customize your own search engine so that you can easily and quickly search through your data. We will be using the Python library `Whoosh`, which implements indexing, complex logical queries and searching in pure Python, meaning that it doesn't require a compiler or Java. `Whoosh` is not a search engine itself, but rather a library that allows users to develop their own search engine. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting up the data\n", "\n", "In this section I'll download some data and put it into a form that is easy for `Whoosh` to index. In this notebook, we'll be searching through Edward Gibbon's *Decline and Fall of the Roman Empire*, a notoriously long and difficult book about the history of Europe from ~200 to ~1400 CE." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "daf = pd.read_csv(\"daf.csv\")[[\"title\", \"text\"]]\n", "daf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I have this data already in a DataFrame, much like a spreadsheet. `Whoosh` does not require this format; I'm using it simply as an easy way to store and access the data." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing our data\n", "\n", "Now that we have some data to search through, we can begin indexing it. When it indexes text data, `Whoosh` creates a variety of files for us. These files contain tables which relate document names (in our case, the names of the chapters, the `title` column) to vectorized collections of words. Vectorization is the process of turning natural language into long lists of numbers so that we can conduct automated processes. There are many ways to do this, so if you are interested, I recommend checking out the `Textual Feature Extraction using Traditional Machine Learning` workshop after this one.\n", "\n", "To index data in `Whoosh` we need a couple of things in addition to our data:\n", "\n", "* An empty directory, where we can save the files that `Whoosh` produces\n", "\n", "* A **schema**, a list of names and data types that we give to `Whoosh` so that the indexer knows how to store our data\n", "\n", "See the example below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# designing a schema\n", "# we have two fields, title and text, both strings, so this will be a relatively simple schema\n", "\n", "from whoosh.fields import Schema, TEXT, ID\n", "from whoosh.analysis import StemmingAnalyzer\n", "\n", "schema = Schema(\n", "    title=ID(stored=True),\n", "    text=TEXT(\n", "        analyzer=StemmingAnalyzer(), stored=True\n", "    ), # (optional) applies stemming to the text -> run, runs, running are all found when searching for \"run\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now that we have a schema we can index our text\n", "import os\n", "from whoosh import index\n", "\n", "if not os.path.exists(\"daf-index\"): # creates an empty directory\n", "    os.mkdir(\"daf-index\")\n", "\n", "ix = index.create_in(\"daf-index\", schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Nota bene*: When you create an index, as we have in the last line of the cell above, that directory is now classified as a `FileIndex` type object. This means that if you need to start over, you'll need to **delete this folder and make it again**. Sometimes, this can be confusing, especially because the error doesn't give you much guidance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "writer = ix.writer() # allows us to add documents\n", "for i, row in daf.iterrows(): # looping through our data\n", "    writer.add_document(\n", "        title=row[\"title\"], text=row[\"text\"]\n", "    ) # adding each row to our index\n", "\n", "writer.commit() # commits the added documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Search our data\n", "\n", "With our completed index, we can begin searching through our data. 
`Whoosh` provides many options for searching, including boolean operators (AND, OR, NOT) and the stemmer we applied when indexing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from whoosh.qparser import QueryParser\n", "from pprint import pprint\n", "\n", "qp = QueryParser(\"text\", schema=ix.schema) # field we want to search in\n", "q = qp.parse(\"The Crusades\")\n", "\n", "# can print out the results below\n", "with ix.searcher() as s:\n", "    results = s.search(q) # dictionary-like hits\n", "    for i, hit in enumerate(results): # loop through them\n", "        print(f\"Result {i+1}\")\n", "        print(hit[\"title\"]) # chapter name from our schema\n", "        pprint(hit[\"text\"]) # text\n", "        print(\"--\" * 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yay! Our search engine worked! But there's more that we can do. To start, we're only seeing the first 10 results. This is the default, as this `results` list is *paginated*, meaning that in the searcher method we could specify what \"page\" of ten results we want to see at a particular time. This will be useful a bit later, but for now we can set the keyword argument `limit` to `None`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q = qp.parse(\"The Crusades\")\n", "with ix.searcher() as s:\n", "    results = s.search(q, limit=None)\n", "    print(len(results))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# logical operators\n", "q = qp.parse(\"The Crusades AND Bohemond\")\n", "with ix.searcher() as s:\n", "    results = s.search(q, limit=None)\n", "    print(q, len(results))\n", "\n", "q = qp.parse(\"The Crusades OR Bohemond\")\n", "with ix.searcher() as s:\n", "    results = s.search(q, limit=None)\n", "    print(q, len(results))\n", "\n", "q = qp.parse(\"The Crusades NOT Bohemond\")\n", "with ix.searcher() as s:\n", "    results = s.search(q, limit=None)\n", "    print(q, len(results))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# can also filter by chapter\n", "from whoosh import query\n", "\n", "q = qp.parse(\"The Crusades\")\n", "with ix.searcher() as s:\n", "    allow_q = query.Term(\n", "        \"title\", \"The Crusades.—Part I.\"\n", "    ) # query.Term takes a schema field (title) and something from that field\n", "    results = s.search(q, filter=allow_q, limit=None)\n", "    for hit in results: # only return results from that chapter\n", "        print(hit[\"title\"])\n", "        pprint(hit[\"text\"])\n", "        print(\"--\" * 20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display, HTML\n", "\n", "# can also highlight where a keyword appears\n", "q = qp.parse(\"The Crusades\")\n", "\n", "with ix.searcher() as s:\n", "    results = s.search(q, limit=None)\n", "    results.fragmenter.maxchars = 1000 # increasing context of the highlight\n", "    results.fragmenter.surround = 250\n", "    for i, hit in enumerate(results):\n", "        print(f\"Result {i+1}\")\n", "        print(hit[\"title\"])\n", "        display(\n", "            HTML(hit.highlights(\"text\"))\n", "        ) # highlights are wrapped in bold HTML tags, using IPython to display this\n", "        print(\"--\" * 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a search interface\n", "\n", "Searching our data is great, but we can go one step further and put a simple graphical interface in front of the search engine. I'll use a Python-based web app framework called `streamlit`. This notebook won't go into depth about how this app is created, but if you are interested, please check out our `Introduction to Streamlit` workshop." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#@title Load all of the code\n", "\n", "#@markdown Run this cell to get all of the code needed for the application. You need to run this cell only once.\n", "%%writefile app.py\n", "\n", "from whoosh.index import open_dir\n", "from whoosh import query\n", "from whoosh.qparser import QueryParser\n", "from IPython.display import display, HTML, clear_output\n", "import re\n", "import streamlit as st\n", "\n", "@st.cache_data\n", "def get_index():\n", "    # must match the directory the index was created in above\n", "    return open_dir(\"daf-index\")\n", "\n", "ix = get_index()\n", "\n", "st.title(\"Gibbon's *Decline and Fall of the Roman Empire* Simple Search\")\n", "with st.expander(\"Searching tips\"):\n", "    st.write(\"\"\"\n", " * If you'd like to search for just a single term, you can enter it in the box below.\n", " * If you'd like to search for a phrase, you can enclose it in quotations, such as \"serious complications\".\n", " * A query like \"serious complications\"~5 would return results where \"serious\" and \"complications\" are at most 5 words away from each other.\n", " * AND can be used as a boolean operator and will return results where two terms are both in a passage. 
AND is automatically placed in a query of two words, so 'latent syphilis' is internally represented as latent AND syphilis.\n", " * OR can be used as a boolean operator and will return results where either one of two terms is in a passage.\n", " * NOT can be used as a boolean operator and will return results which do not include the term following the NOT.\n", " * From these boolean operators, one can construct complex queries like: syphilis AND hospitals NOT \"serious complications\". This query would return results that have both syphilis and hospitals in them, but do not have \"serious complications\".\n", " * Parentheses can be used to group boolean statements. For example, the query syphilis AND (\"serious complications\" OR hospitals) would return results that have syphilis and either serious complications or hospitals in them.\n", " * If you'd like to search in a specific date range, you can specify it with the date: field. For example, year:[19500101 TO 19600101] syphilis would return results between January 1st, 1950 and January 1st, 1960 that have syphilis in them.\n", " \"\"\")\n", "\n", "if 'page_count' not in st.session_state:\n", "    st.session_state['page_count'] = 0\n", "\n", "if 'to_see' not in st.session_state:\n", "    st.session_state['to_see'] = 10\n", "\n", "if 'pages' not in st.session_state:\n", "    st.session_state['pages'] = []\n", "\n", "def clear_session_state():\n", "    st.session_state['page_count'] = 0\n", "    st.session_state['to_see'] = 10\n", "    st.session_state['pages'] = []\n", "\n", "query_str = st.text_input(\"Search\", key=\"search\", on_change=clear_session_state)\n", "stemmer = st.toggle('Use stemming', help='If selected, the search will use stemming to find words with the same root. 
For example, \"running\" will match \"run\" and \"ran\".', on_change=clear_session_state)\n", "\n", "if stemmer:\n", " parser = QueryParser(\"text\", ix.schema, termclass=query.Variations)\n", "else:\n", " parser = QueryParser(\"text\", ix.schema)\n", "\n", "query = parser.parse(query_str)\n", "\n", "html_template = \"\"\"\n", "
{hit}
\n", "