{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install google-colab-selenium" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Webscraping II: Dynamic websites\n", "In the last notebook, we saw how to use `BeautifulSoup` to scrape data from a website and read that data into a powerful data structure called a `pandas` `DataFrame`. In this notebook, we'll do something very similar. We'll take a website URL, pass it through third-party software and extract useful information that we can use to populate a DataFrame. This time, however, we will be scraping a *dynamic* website, that is, a website whose HTML is generated by an application.\n", "\n", "## Our task\n", "The Packard Humanities Institute runs [this site](https://latin.packhum.org/), which collates and presents various textual data on classical Latin literature. It is a very useful resource, but they do not allow users to download anything,\n", "so we will enter a search query and scrape the resulting data into a `DataFrame` and save it.\n", "\n", "## Goals:\n", "* Understand what a dynamic website is and how it is different from a static website\n", "* Install `Selenium` along with its associated dependencies\n", "* Navigate the content of the site both in `Selenium` and `BeautifulSoup`\n", "* Get more experience with generating DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is a dynamic website?\n", "Let's delve a bit deeper into what a dynamic website is and why we can't just use `BeautifulSoup` to parse it as we can with static websites. While a static webpage would require a manual update before content on the site can change, a dynamic website takes advantage of client- and server-side scripting to be more adaptable to a user's needs.\n", "* Client-side scripting: code that is executed by the user's browser, generally using JavaScript. 
This scripting renders changes to the site when the user interacts with it. This can be anything from selecting a choice in a drop-down menu to full-fledged games like Wordle. This type of scripting is also common in many static sites.\n", "* Server-side scripting: code that is executed by the server before sending content to the user's browser. This code can be written in a wide variety of languages like Ruby (`RubyOnRails`), JavaScript (`VueJS`, `NodeJS`) and Python (`Django`, `Flask`). This code generally gets its inputs by querying a database associated with the site and outputs HTML code from a template. This way, programmers can update elements in their sites without having to rewrite large sections of the site. But it also means that the HTML is not yet generated when we make a GET request." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import IFrame\n", "\n", "phi_url = \"https://latin.packhum.org/\"\n", "IFrame(phi_url, width=800, height=500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can easily pass the HTML for this page into `BeautifulSoup`, but we can't interact with the page. We need to click \"Click here if you agree to this License\", but we don't yet know how. What tool can we use to interact with the webpage and give us back the result of that interaction?\n", "\n", "The answer is: `Selenium`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting started with `Selenium`\n", "\n", "The syntax for `selenium` can be quite confusing compared to `BeautifulSoup`. Almost everything you do in `selenium` will go through a `WebDriver` object, below called `driver`. You should think of this as a browser window on your own computer. We can do a variety of things with a `driver`, but the most important are:\n", "\n", "1. Submit a GET request to a webpage;\n", "2. Find elements on a page (like `BeautifulSoup`);\n", "3. 
Create a screenshot of what the page looks like;\n", "4. Click on elements;\n", "5. Write characters into an input box." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import google_colab_selenium as gs # from the google-colab-selenium package, will be different if you want to do this locally\n", "\n", "driver = gs.Chrome() # driver object is the main entry point for selenium" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# just like using requests, we will use a GET request\n", "driver.get(phi_url) # this time it's a method on driver" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from selenium.webdriver.common.by import By # allows us to select by different things\n", "\n", "enter = driver.find_element(By.CLASS_NAME, \"lic\") # selecting by class name\n", "enter.click() # selenium allows us to click!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Navigating using `Selenium`\n", "\n", "At this point, we will start to use `Selenium` to **automate the actions we would normally perform**. This is incredibly powerful. It means that anything you can do once in your own browser, you can replicate in `Selenium`, so you can link up any research process you have to Python, scraping and collecting along the way. Oftentimes though, `Selenium` is not as reliable as you might expect and does things in an \"illogical\" or \"weird\" way. As a result, you need to foster a sense of what `Selenium` will do for any given action.\n", "\n", "As always, adding logging and print statements can be very useful."
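] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because of that flakiness, it can help to wrap individual actions in a small helper that retries on failure and logs each attempt. This is only a sketch of the pattern: the `retry` function below is our own name, not part of `Selenium`'s API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "\n", "def retry(action, attempts=3, delay=1.0):\n", "    # call action() until it succeeds, logging each failure before retrying\n", "    for attempt in range(1, attempts + 1):\n", "        try:\n", "            return action()\n", "        except Exception as exc:\n", "            print(\"Attempt\", attempt, \"failed:\", exc)\n", "            if attempt == attempts:\n", "                raise\n", "            time.sleep(delay)\n", "\n", "\n", "# in real use the action might be something like\n", "# lambda: driver.find_element(By.CLASS_NAME, \"lic\").click()\n", "retry(lambda: \"ok\", attempts=2, delay=0.1)" 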
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import Image\n", "\n", "\n", "def take_and_show_screenshot(driver, filename):\n", " driver.save_screenshot(filename) # driver can take a screenshot\n", " return Image(filename=filename)\n", "\n", "\n", "take_and_show_screenshot(driver, \"enter_button.png\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list_elements = driver.find_elements(By.TAG_NAME, \"li\")\n", "for i, element in enumerate(list_elements):\n", " print(\"Element number:\", i, \"; Text: \", element.text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# second element is \"Word Search\"\n", "list_elements[1].click() # click on it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "take_and_show_screenshot(driver, \"word_search.png\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "search_term = \"artifex\" # \"artist\" in Latin" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "search_bar = driver.find_element(By.TAG_NAME, \"input\")\n", "search_bar.send_keys(search_term) # send_keys inputs the characters into a web element\n", "search_bar.send_keys(\"\\n\") # equivalent to hitting enter" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "take_and_show_screenshot(driver, \"search_results.png\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Starting on our task\n", "\n", "As I mentioned above, our task in this notebook is to collect search results. PHI will take in a search term, as we showed above, and return every occurrence of that term in its corpus of Latin texts. It will show one line above and one line below each match. 
We would like to get all of these passages and organize them into a `DataFrame`, making sure to track the author, title and citation of the work.\n", "\n", "### XPATH\n", "Above we used `CLASS_NAME` and `TAG_NAME` to select web elements, but what if we need a more complex query? When using `BeautifulSoup`, we can use the `attrs` keyword argument in `find` and `find_all` to create complex queries. In `Selenium` we use the `By.XPATH` option. What is XPATH? XML Path Language, or XPATH, is a simple expression language designed to query markup language documents, like XML and HTML. The syntax can look hard, but here are the key things to keep in mind:\n", "* `/` steps down the markup tree, from a tag to its children;\n", "* `//` matches the following tag name anywhere in the document and starts the query there;\n", "* `@` indicates an attribute.\n", "\n", "The query below, `//div[@id='results']/ul/li`, tells `Selenium` to find the `div` tag with attribute `id='results'`, step into its `ul` tag, and then collect all of the `li` tags inside it.\n", "\n", "To learn more, visit [this page](https://developer.mozilla.org/en-US/docs/Web/XPath)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matches = driver.find_elements(By.XPATH, \"//div[@id='results']/ul/li\")\n", "len(matches)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! We were able to use XPATH to get the results from PHI. But how do we then go to the next page?\n", "\n", "How would you do this in a browser? You would scroll down to the bottom of the page and then click on \"Next\". We can do the same in `Selenium`."
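] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, the XPATH syntax described above can be tried without a browser. Python's standard-library `xml.etree.ElementTree` supports a limited subset of XPath, which is enough to mimic the query we just used. The snippet of markup below is invented for illustration, not PHI's real HTML." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xml.etree.ElementTree as ET\n", "\n", "# a tiny, made-up stand-in for a results page\n", "markup = '''\n", "<html><body>\n", "  <div id='results'>\n", "    <ul>\n", "      <li>match one</li>\n", "      <li>match two</li>\n", "    </ul>\n", "  </div>\n", "</body></html>\n", "'''\n", "\n", "items = ET.fromstring(markup).findall(\".//div[@id='results']/ul/li\")\n", "print([li.text for li in items])" 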
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this code scrolls to the bottom of the page\n", "driver.execute_script(\n", " \"window.scrollTo(0, document.body.scrollHeight);\"\n", ") # this is running a JavaScript snippet, not Python code" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "take_and_show_screenshot(\n", " driver, \"scroll_down.png\"\n", ") # nice, we scrolled; now we need to click" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "next_button = driver.find_element(\n", " By.XPATH, \"//a[@class='plink pg_n']\"\n", ") # find the next button\n", "next_button.click() # click!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "take_and_show_screenshot(driver, \"next_page.png\") # and now we're on the next page" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matches.extend(\n", " driver.find_elements(By.XPATH, \"//div[@id='results']/ul/li\")\n", ") # let's do the same as above, get the matches and add them to our list\n", "len(matches)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matches[0].text # huh? why's it not working?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is a `StaleElementReferenceException`, and why does the error look so strange? `Selenium` is not a native Python package: its core is written in another programming language, Java (not to be confused with JavaScript). As a result, it is not running in the same runtime as our normal Python code. We don't really need to worry about this until there's an error, at which point the traceback won't show us exactly what caused it. 
\n", "\n", "In this case, a `StaleElementReferenceException` is what happens when **`Selenium` attempts to access a web element that is no longer on the page**. This can happen a lot, especially if you are scrolling and moving between pages. To deal with this problem, we need to extract all of the text/data we need from a web element while we are still on its original page." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a function that extracts all of the relevant info\n", "def get_relevant_info(one_match):\n", " text = one_match.text\n", " cite_link = one_match.find_element(By.TAG_NAME, \"a\").get_attribute(\"href\")\n", " return text, cite_link" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matches = driver.find_elements(By.XPATH, \"//div[@id='results']/ul/li\")\n", "for one_match in matches:\n", " print(get_relevant_info(one_match))\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pulling it all together\n", "\n", "Now that we've explored the first few pages of results, we are ready to pull all of the pieces together and create a short script that will take in a user query and scrape all of the relevant data from it."
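] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One building block we will need is pulling the result count out of the stats line at the top of the page, which a regular expression handles well. The sample string below is invented; PHI's actual wording may differ slightly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "sample_stats = \"1234 results on 13 pages.\"  # made-up stand-in for the stats text\n", "sample_count = int(re.search(r\"(\\d+) results\", sample_stats).group(1))\n", "print(sample_count)" 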
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "# last thing: there's a tracker at the top of the page that tells us how many results to expect\n", "# we can get the text there to make sure we have everything\n", "num_matches_raw = driver.find_element(By.XPATH, \"//div[@id='stats']\").text\n", "num_matches = int(re.search(r\"(\\d+) results\", num_matches_raw).group(1))\n", "num_matches" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from selenium.common.exceptions import (\n", " NoSuchElementException,\n", ") # the exception raised when an element can't be found\n", "\n", "# reinitializing the driver\n", "driver = gs.Chrome()\n", "driver.get(phi_url)\n", "\n", "# click on enter button\n", "enter = driver.find_element(By.CLASS_NAME, \"lic\")\n", "enter.click()\n", "\n", "# click on \"Word Search\"\n", "list_elements = driver.find_elements(By.TAG_NAME, \"li\")\n", "list_elements[1].click()\n", "\n", "# submit search term\n", "search_term = \"artifex\" # feel free to change\n", "search_bar = driver.find_element(By.TAG_NAME, \"input\")\n", "search_bar.send_keys(search_term)\n", "search_bar.send_keys(\"\\n\")\n", "\n", "# get number of matches\n", "num_matches_raw = driver.find_element(By.XPATH, \"//div[@id='stats']\").text\n", "num_matches = int(re.search(r\"(\\d+) results\", num_matches_raw).group(1))\n", "print(\"Number of expected results: \", num_matches)\n", "\n", "matches = []\n", "while True:\n", " page_matches = []\n", " for element in driver.find_elements(By.XPATH, \"//div[@id='results']/ul/li\"):\n", " page_matches.append(get_relevant_info(element))\n", " matches.extend(page_matches)\n", " print(\"Current length: \", len(matches))\n", " driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n", " try:\n", " next_button = driver.find_element(By.XPATH, \"//a[@class='plink pg_n']\")\n", 
" next_button.click()\n", " except NoSuchElementException:\n", " break\n", "\n", "print(\"Final length: \", len(matches))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Huh! Why didn't it work? What is a `NoSuchElementException`? Like the last error we saw, the traceback is not very helpful, though it does tell us which XPATH set off the error: `//div[@id='stats']`.\n", "\n", "This is getting thrown because `Selenium` is trying to find the result count before it has loaded into the page. We get a `NoSuchElementException` because, at the moment we run the line `driver.find_element(By.XPATH, \"//div[@id='stats']\").text`, the XPATH `//div[@id='stats']` does not exist; in fact, the page has likely not finished loading at all.\n", "\n", "As a result, we need to have the `driver` wait before we run that code. We can use `time.sleep` to do this. *Note*: There are ways of doing this in `Selenium` as well, but they're a little hit-or-miss, and most programmers use `time.sleep` anyway. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "# reinitializing the driver\n", "driver = gs.Chrome()\n", "driver.get(phi_url)\n", "\n", "# click on enter button\n", "enter = driver.find_element(By.CLASS_NAME, \"lic\")\n", "enter.click()\n", "\n", "# click on \"Word Search\"\n", "list_elements = driver.find_elements(By.TAG_NAME, \"li\")\n", "list_elements[1].click()\n", "\n", "# submit search term\n", "search_term = \"artifex\" # feel free to change\n", "search_bar = driver.find_element(By.TAG_NAME, \"input\")\n", "search_bar.send_keys(search_term)\n", "search_bar.send_keys(\"\\n\")\n", "\n", "time.sleep(3) # wait for the search to load\n", "\n", "# get number of matches\n", "num_matches_raw = driver.find_element(By.XPATH, \"//div[@id='stats']\").text\n", "num_matches = int(re.search(r\"(\\d+) results\", num_matches_raw).group(1))\n", "print(\"Number of expected results: \", num_matches)\n", "print()\n", "\n", "matches = []\n", "current_page = 1\n", "\n", "while True:\n", " print(driver.find_element(By.XPATH, \"//div[@id='stats']\").text)\n", " page_matches = []\n", " for element in driver.find_elements(By.XPATH, \"//div[@id='results']/ul/li\"):\n", " page_matches.append(get_relevant_info(element))\n", " matches.extend(page_matches)\n", " print(\"Current length: \", len(matches))\n", " print(\"Current page: \", current_page)\n", " driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n", " time.sleep(1) # wait for the scroll to load\n", "\n", " try:\n", " next_button = driver.find_element(By.XPATH, \"//a[@class='plink pg_n']\")\n", " next_button.click()\n", " current_page += 1\n", " time.sleep(3) # wait for the next page to load\n", " except NoSuchElementException:\n", " break # break out of loop if the next button isn't there\n", " print()\n", "\n", "print(\"Final length: \", len(matches)) # takes about a minute to run" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "## Curating our data\n", "\n", "We now have a lot of useful data from PHI, but it's not in a useful format. For example, there are some results that have one passage of three lines and others that have multiple passages and citations in them. This is a problem. To create a good dataset, the shape of our data should be homogeneous." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one match per result\n", "matches[-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# multiple matches in one result\n", "matches[-2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "splits = matches[-2][0].split(\"\\n\") # \\n is a helpful delimiter\n", "len(splits)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "splits[0] # citation info" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "citations = re.search(r\"[A-Za-z\\s]+, [A-Za-z\\s]+ (.*)\", splits[0]).group(\n", " 1\n", ") # regex to separate the citation numbers ([A-Za-z] matches letters only)\n", "citations.split(\", \") # split on the comma" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've separated out the citations and the passages, we need to align each citation to the correct passage. Unfortunately, each line is just separated by `\\n` and there is nothing separating the passages from each other. Thus, below I show how we can separate the raw data into a list of lists of three lines each, using the `range` function."
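] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The general pattern here is splitting a flat list into fixed-size groups. Packaged as a helper, it looks like this (the name `chunk` is our own, purely for illustration, and the letters are toy data); the cells that follow build up the same idea step by step on our real data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def chunk(items, size):\n", "    # split a flat list into consecutive groups of `size` items\n", "    return [items[i : i + size] for i in range(0, len(items), size)]\n", "\n", "\n", "print(chunk([\"a\", \"b\", \"c\", \"d\", \"e\", \"f\"], 3))" 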
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in range(\n", " 0, # where to start\n", " len(splits[1:]), # how far to go\n", " 3, # how many steps to take for each iteration\n", "):\n", " print(i)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in range(\n", " 0, # where to start\n", " len(splits[1:]), # how far to go\n", " 3, # how many steps to take for each iteration\n", "):\n", " print(splits[1:][i]) # current line\n", " print(splits[1:][i + 1]) # the next line\n", " print(splits[1:][i + 2]) # the last line\n", " print()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[splits[1:][i : i + 3] for i in range(0, len(splits[1:]), 3)] # as a list comprehension" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "passages = [\n", " \"\\n\".join(splits[1:][i : i + 3]) for i in range(0, len(splits[1:]), 3)\n", "] # joining lists into a single passage\n", "passages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "citations = citations.split(\", \")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list(zip(citations, passages)) # zip to align them" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# full code for formatting all of the info\n", "data = []\n", "for one_match in matches:\n", " splits = one_match[0].split(\"\\n\")\n", " author_title = re.search(r\"[A-Za-z\\s]+, [A-Za-z\\s]+\", splits[0]).group(0)\n", " author = author_title.split(\", \")[0]\n", " title = author_title.split(\", \")[1]\n", "\n", " if len(splits) > 4: # the case where we have more than one passage per result\n", " # same code as above\n", " citations = re.search(r\"[A-Za-z\\s]+, [A-Za-z\\s]+ (.*)\", splits[0]).group(1)\n", " citations = citations.split(\", 
\")\n", " passages = [\n", " \"\\n\".join(splits[1:][i : i + 3]) for i in range(0, len(splits[1:]), 3)\n", " ]\n", " data.extend(\n", " [\n", " (title, author, r[0], r[1], one_match[1])\n", " for r in list(zip(citations, passages))\n", " ]\n", " )\n", " else: # the case where one result has a single match\n", " # this is new\n", " citation = re.search(r\"[A-Za-z\\s]+, [A-Za-z\\s\\.]+ (.*)\", splits[0]).group(\n", " 1\n", " ) # gets the citation the same way\n", " text = \"\\n\".join(splits[1:]) # collects and joins text\n", " data.append((title, author, citation, text, one_match[1]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# putting it in a DataFrame\n", "df = pd.DataFrame(data, columns=[\"title\", \"author\", \"citation\", \"text\", \"cite_link\"])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final function\n", "\n", "Yay! We got the data we wanted and in the form we wanted it in. Now, we can create a function that just takes in the user query and gives back the results."
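] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our task statement also asked us to save the results. `pandas` can write a `DataFrame` straight to CSV with `to_csv`. The frame below is a tiny stand-in so the cell runs on its own; in practice you would save the frame of real results, and the filename `phi_results.csv` is just our own choice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# tiny stand-in frame with the same columns as our real data\n", "sample_df = pd.DataFrame(\n", "    [(\"Metamorphoses\", \"Ovidius\", \"1.1\", \"some text\", \"https://latin.packhum.org/\")],\n", "    columns=[\"title\", \"author\", \"citation\", \"text\", \"cite_link\"],\n", ")\n", "sample_df.to_csv(\"phi_results.csv\", index=False)\n", "print(pd.read_csv(\"phi_results.csv\").shape)" 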
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def search_phi(query):\n", " # collecting the data\n", " driver = gs.Chrome()\n", " driver.get(phi_url)\n", "\n", " # click on enter button\n", " enter = driver.find_element(By.CLASS_NAME, \"lic\")\n", " enter.click()\n", "\n", " # click on \"Word Search\"\n", " list_elements = driver.find_elements(By.TAG_NAME, \"li\")\n", " list_elements[1].click()\n", "\n", " # submit search term\n", " search_term = query\n", " search_bar = driver.find_element(By.TAG_NAME, \"input\")\n", " search_bar.send_keys(search_term)\n", " search_bar.send_keys(\"\\n\")\n", "\n", " time.sleep(3) # wait for the search to load\n", "\n", " # get number of matches\n", " num_matches_raw = driver.find_element(By.XPATH, \"//div[@id='stats']\").text\n", " num_matches = int(re.search(r\"(\\d+) results\", num_matches_raw).group(1))\n", " print(\"Number of expected results: \", num_matches)\n", " print()\n", "\n", " matches = []\n", " current_page = 1\n", "\n", " while True:\n", " print(driver.find_element(By.XPATH, \"//div[@id='stats']\").text)\n", " page_matches = []\n", " for element in driver.find_elements(By.XPATH, \"//div[@id='results']/ul/li\"):\n", " page_matches.append(get_relevant_info(element))\n", " matches.extend(page_matches)\n", " print(\"Current length: \", len(matches))\n", " print(\"Current page: \", current_page)\n", " driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n", " time.sleep(1) # wait for the scroll to load\n", "\n", " try:\n", " next_button = driver.find_element(By.XPATH, \"//a[@class='plink pg_n']\")\n", " next_button.click()\n", " current_page += 1\n", " time.sleep(3) # wait for the next page to load\n", " except NoSuchElementException:\n", " break # break out of loop if the next button isn't there\n", " print()\n", "\n", " print(\"Final length: \", len(matches)) # takes about a minute to run\n", "\n", " # format data\n", " data = []\n", " for 
one_match in matches:\n", " splits = one_match[0].split(\"\\n\")\n", " author_title = re.search(r\"[A-Za-z\\s]+, [A-Za-z\\s]+\", splits[0]).group(0)\n", " author = author_title.split(\", \")[0]\n", " title = author_title.split(\", \")[1]\n", "\n", " if len(splits) > 4: # the case where we have more than one passage per result\n", " # same code as above\n", " citations = re.search(r\"[A-Za-z\\s]+, [A-Za-z\\s]+ (.*)\", splits[0]).group(1)\n", " citations = citations.split(\", \")\n", " passages = [\n", " \"\\n\".join(splits[1:][i : i + 3]) for i in range(0, len(splits[1:]), 3)\n", " ]\n", " data.extend(\n", " [\n", " (title, author, r[0], r[1], one_match[1])\n", " for r in list(zip(citations, passages))\n", " ]\n", " )\n", " else: # the case where one result has a single match\n", " # this is new\n", " citation = re.search(r\"[A-Za-z\\s]+, [A-Za-z\\s\\.]+ (.*)\", splits[0]).group(\n", " 1\n", " ) # gets the citation the same way\n", " text = \"\\n\".join(splits[1:]) # collects and joins text\n", " data.append((title, author, citation, text, one_match[1]))\n", "\n", " return pd.DataFrame(\n", " data, columns=[\"title\", \"author\", \"citation\", \"text\", \"cite_link\"]\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "instar_df = search_phi(\"instar\")\n", "instar_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# visualize titles\n", "instar_df.title.value_counts()[:10].plot(kind=\"barh\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# visualize authors\n", "instar_df.author.value_counts()[:10].plot(kind=\"barh\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" }, "mystnb": { "execution_mode": "off" } }, "nbformat": 4, "nbformat_minor": 0 
}