!pip install whoosh paginate-whoosh streamlit -Uq
!wget https://tufts.box.com/shared/static/325sgkodnq30ez61ugazvctif6r24hsu.csv -O daf.csv
Creating a Search Engine for your own data using Whoosh#
No matter the discipline, scholars tend to accumulate a vast array of textual sources. Regardless of whether these are primary or secondary sources, researchers often need help wading through them and finding the best places to start digging into the text.
In this notebook, we’ll explore how to create and customize your own search engine so that you can easily and quickly search through your data. We will be using a Python library Whoosh, which implements indexing, complex logical queries and searching in pure Python, meaning that it doesn’t require a compiler or Java. Whoosh is not a search engine itself, but rather a library that allows users to develop their own search engine.
Setting up the data#
In this section I’ll download some data and put it into a form that is easy for Whoosh to index. In this notebook, we’ll be searching through Edward Gibbon’s Decline and Fall of the Roman Empire, a notoriously long and difficult history of Rome and Europe from roughly the 2nd century CE to the fall of Constantinople in the 15th century.
import pandas as pd
daf = pd.read_csv("daf.csv")[["title", "text"]]
daf
I already have this data in a DataFrame. Whoosh does not require this format; I’m simply using it as an easy way to store and access the data.
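If your texts live in plain files rather than a spreadsheet, you can assemble the same title/text pairs by hand. A minimal sketch, assuming a hypothetical chapters/ directory of .txt files (created here just so the example runs on its own):

```python
import pathlib

# Hypothetical setup: one plain-text file per chapter
chapters = pathlib.Path("chapters")
chapters.mkdir(exist_ok=True)
(chapters / "Chapter I.txt").write_text("The extent and military force of the empire.")

# Build title/text records analogous to the DataFrame's two columns
records = [
    {"title": p.stem, "text": p.read_text()}
    for p in sorted(chapters.glob("*.txt"))
]
print(records[0]["title"])  # Chapter I
```

Any structure that yields a title and a body of text per document will work equally well for the indexing step below.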
Indexing our data#
Now that we have some data to search through, we can begin indexing it. When it indexes text data, Whoosh creates a variety of files for us. These files contain tables that relate document names (in our case, the names of the chapters, the title column) to the words they contain, stored in a vectorized form. Vectorization is the process of turning natural language into long lists of numbers so that we can run automated processes over it. There are many ways to do this; if you are interested, I recommend checking out the Textual Feature Extraction using Traditional Machine Learning workshop after this one.
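One way to picture what those index files contain: at their core, they map each term to the documents in which it occurs (an inverted index). A toy sketch of that idea in plain Python, with made-up chapter names:

```python
from collections import defaultdict

# Two tiny "documents", keyed by a made-up chapter name
docs = {
    "ch1": "the fall of the western empire",
    "ch2": "the rise of the eastern empire",
}

# Map each word to the set of documents that contain it
inverted = defaultdict(set)
for name, text in docs.items():
    for word in text.split():
        inverted[word].add(name)

print(sorted(inverted["empire"]))  # ['ch1', 'ch2']
print(sorted(inverted["fall"]))    # ['ch1']
```

Whoosh builds a far more sophisticated version of this mapping on disk, but the lookup idea is the same: a query term leads directly to the documents that contain it.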
To index data in Whoosh we need a couple of things in addition to our data:

* An empty directory, where we can save the files that Whoosh produces
* A schema, a list of field names and data types that we give to Whoosh so that the indexer knows how to store our data

See the example below.
# designing a schema
# we have two fields, title and text, both strings, so this will be a relatively simple schema
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import StemmingAnalyzer

schema = Schema(
    title=ID(stored=True),
    text=TEXT(
        analyzer=StemmingAnalyzer(), stored=True
    ),  # (optional) applies stemming to the text -> "run", "running", and "ran" are all found when searching for "run"
)
# now that we have a schema we can index our text
import os

from whoosh import index

if not os.path.exists("daf-index"):  # create an empty directory for the index files
    os.mkdir("daf-index")
ix = index.create_in("daf-index", schema)
Nota bene: once you create an index, as in the last line of the cell above, that directory is treated as a FileIndex object. This means that if you need to start over, you’ll need to delete the folder and make it again. This can be confusing, especially because the error you get otherwise doesn’t give you much guidance.
writer = ix.writer()  # allows us to add documents
for i, row in daf.iterrows():  # looping through our data
    writer.add_document(
        title=row["title"], text=row["text"]
    )  # adding each row to our index
writer.commit()  # commits the added documents
Searching our data#
With our completed index, we can begin searching through our data. Whoosh provides many options for searching, including boolean operators (AND, OR, NOT) and applying the stemmer we used when indexing.
from whoosh.qparser import QueryParser
from pprint import pprint

qp = QueryParser("text", schema=ix.schema)  # field we want to search in
q = qp.parse("The Crusades")

# print out the results below
with ix.searcher() as s:
    results = s.search(q)  # an iterable of Hit objects
    for i, hit in enumerate(results):  # loop through them
        print(f"Result {i+1}")
        print(hit["title"])  # chapter name from our schema
        pprint(hit["text"])  # text
        print("--" * 20)
Yay! Our search engine worked! But there’s more we can do. To start, we’re only seeing the first 10 results. This is the default: the results list is paginated, meaning that in the searcher method we could specify which “page” of ten results we want to see at a particular time. This will be useful a bit later, but for now we can set the keyword argument limit to None.
q = qp.parse("The Crusades")

with ix.searcher() as s:
    results = s.search(q, limit=None)
    print(len(results))
# logical operators
q = qp.parse("The Crusades AND Bohemond")
with ix.searcher() as s:
    results = s.search(q, limit=None)
    print(q, len(results))

q = qp.parse("The Crusades OR Bohemond")
with ix.searcher() as s:
    results = s.search(q, limit=None)
    print(q, len(results))

q = qp.parse("The Crusades NOT Bohemond")
with ix.searcher() as s:
    results = s.search(q, limit=None)
    print(q, len(results))
# we can also filter by chapter
from whoosh import query

q = qp.parse("The Crusades")
with ix.searcher() as s:
    allow_q = query.Term(
        "title", "The Crusades.—Part I."
    )  # query.Term takes a schema field (title) and a value from that field
    results = s.search(q, filter=allow_q, limit=None)
    for hit in results:  # only return results from that chapter
        print(hit["title"])
        pprint(hit["text"])
        print("--" * 20)
from IPython.display import display, HTML

# we can also highlight where a keyword appears
q = qp.parse("The Crusades")
with ix.searcher() as s:
    results = s.search(q, limit=None)
    results.fragmenter.maxchars = 1000  # increase the amount of context around the highlight
    results.fragmenter.surround = 250
    for i, hit in enumerate(results):
        print(f"Result {i+1}")
        print(hit["title"])
        display(
            HTML(hit.highlights("text"))
        )  # highlights are wrapped in bold HTML (<b></b>) tags, so we use IPython to render them
        print("--" * 20)
Creating a search interface#
Searching our data is great, but we can go one step further and put a simple graphical interface in front of the search engine. I’ll use a Python-based web app framework called Streamlit. This notebook won’t go into depth about how this app is created, but if you are interested, please check out our Introduction to Streamlit workshop.
#@title Load all of the code
#@markdown Run this cell to get all of the code needed for the application. You need to run this cell only once.
%%writefile app.py
from whoosh.index import open_dir
from whoosh import query
from whoosh.qparser import QueryParser
import re

import streamlit as st

@st.cache_resource  # the index handle is a shared resource, not pickleable data
def get_index():
    return open_dir("daf-index")  # must match the directory we indexed into above

ix = get_index()

st.title("Gibbon's *Decline and Fall of the Roman Empire* Simple Search")
with st.expander("Searching tips"):
    st.write("""
    * If you'd like to search for just a single term, you can enter it in the box below.
    * If you'd like to search for a phrase, you can enclose it in quotations, such as "Roman Senate".
    * A query like "Roman Senate"~5 would return results where "Roman" and "Senate" are at most 5 words away from each other.
    * AND can be used as a boolean operator and will return results where two terms are both in a passage. AND is automatically placed in a query of two words, so 'Crusades Bohemond' is internally represented as Crusades AND Bohemond.
    * OR can be used as a boolean operator and will return results where either one of two terms is in a passage.
    * NOT can be used as a boolean operator and will return results which do not include the term following the NOT.
    * From these boolean operators, one can construct complex queries like: Crusades AND Constantinople NOT "Roman Senate". This query would return results that have both Crusades and Constantinople in them, but do not have "Roman Senate".
    * Parentheses can be used to group boolean statements. For example, the query Crusades AND ("Roman Senate" OR Constantinople) would return results that have Crusades and either "Roman Senate" or Constantinople in them.
    """)
if 'page_count' not in st.session_state:
    st.session_state['page_count'] = 0
if 'to_see' not in st.session_state:
    st.session_state['to_see'] = 10
if 'pages' not in st.session_state:
    st.session_state['pages'] = []

def clear_session_state():
    st.session_state['page_count'] = 0
    st.session_state['to_see'] = 10
    st.session_state['pages'] = []

query_str = st.text_input("Search", key="search", on_change=clear_session_state)
stemmer = st.toggle('Use stemming', help='If selected, the search will use stemming to find words with the same root. For example, "running" will match "run" and "ran".', on_change=clear_session_state)

if stemmer:
    parser = QueryParser("text", ix.schema, termclass=query.Variations)
else:
    parser = QueryParser("text", ix.schema)
query = parser.parse(query_str)
html_template = """
<p>{hit}</p>
<hr/>
""".strip()

class Page:
    def __init__(self, results, pageno, items_per_page):
        self.results = results
        self.pageno = pageno
        self.items_per_page = items_per_page

    def __len__(self):
        return len(self.results)

    def __call__(self):
        for i, hit in enumerate(self.results):
            st.write(f"<small>Document {i+1} of {len(self.results)}</small>", unsafe_allow_html=True)
            title = hit['title']
            st.write(f"<h4>{title}</h4>", unsafe_allow_html=True)
            r = re.split(r'\w\.\.\.\w', hit.highlights("text").replace("\n\n", ""))
            for h in r:
                st.write(html_template.format(hit=h), unsafe_allow_html=True)
if query_str:
    st.session_state['pages'] = []
    with ix.searcher() as searcher:
        with st.spinner("Searching..."):
            res = searcher.search(query, limit=None)
            res.fragmenter.maxchars = 1000
            res.fragmenter.surround = 250
            to_see = st.session_state['to_see']
            n_pages = -(-len(res) // to_see)  # ceiling division, so a final partial page is kept
            for i in range(n_pages):
                page = res[i * to_see:(i + 1) * to_see]
                p = Page(page, i, to_see)
                st.session_state['pages'].append(p)
        with st.sidebar:
            st.markdown("# Page Navigation")
            if st.button('See next page', key='next'):
                st.session_state['page_count'] += 1
            if st.button('See previous page', key='prev'):
                st.session_state['page_count'] -= 1
            page_swap = st.number_input('What page do you want to visit?', min_value=1, max_value=max(len(st.session_state['pages']), 1), value=1)
            if st.button('Go to page'):
                st.session_state['page_count'] = page_swap - 1
            st.write(f"Page {st.session_state['page_count'] + 1} of {n_pages}")
        if st.session_state['page_count'] < 0:  # clamp before indexing so negative counts don't wrap around
            st.session_state['page_count'] = 0
        if st.session_state['page_count'] < len(st.session_state['pages']):
            selected_page = st.session_state['pages'][st.session_state['page_count']]
            selected_page()
        else:
            st.write("No more pages!")
# @title Start the application
# @markdown Run this cell and then wait for the URL to be printed below.
# @markdown
# @markdown Click on it and use the password/Endpoint IP for the password.
!npm install localtunnel
!streamlit run /content/app.py &>/content/logs.txt &
print("\n" * 3)
import urllib.request

print(
    "Password/Endpoint IP for localtunnel is:",
    urllib.request.urlopen("https://ipv4.icanhazip.com")
    .read()
    .decode("utf8")
    .strip("\n"),
)
print("Copy and paste this in the box in the link below.")
print("\n" * 3)
print("Click on the link below.")
!npx localtunnel --port 8501