!pip install google-colab-selenium
Webscraping II: Dynamic websites#
In the last notebook, we saw how to use BeautifulSoup to scrape data from a website and read that data into a powerful data structure called a pandas DataFrame. In this notebook, we’ll do something very similar. We’ll take a website URL, pass it through third-party software, and extract useful information that we can use to populate a DataFrame. This time, however, we will be scraping a dynamic website, that is, a website whose HTML is generated by an application.
Our task#
The Packard Humanities Institute runs this site, which collates and presents various textual data on classical Latin literature. It is a very useful resource, but they do not allow users to download anything,
so we will enter a search query, scrape the resulting data into a DataFrame, and save it.
Goals:#
Understand what a dynamic website is and how it is different from a static website
Install Selenium along with its associated dependencies
Navigate the content of the site both in Selenium and BeautifulSoup
Get more experience with generating DataFrames
What is a dynamic website#
Let’s delve a bit deeper into what a dynamic website is and why we can’t just use BeautifulSoup to parse it as we can with static websites. While a static webpage requires a manual update before content on the site can change, a dynamic website takes advantage of client- and server-side scripting to be more adaptable to a user’s needs.
Client-side scripting: code that is executed by the user’s browser, generally using JavaScript. This scripting renders changes to the site when the user interacts with it. This can be anything from selecting a choice in a drop-down menu to full-fledged games like Wordle. This type of scripting is also common in many static sites.
Server-side scripting: code that is executed by the server before sending content to the user’s browser. This code can be written in a wide variety of languages, like Ruby (RubyOnRails), JavaScript (VueJS, NodeJS) and Python (Django, Flask). This code generally gets its inputs by querying a database associated with the site and outputs HTML code from a template. This way, programmers can update elements of their sites without having to rewrite large sections. But it also means that the HTML is not yet generated when we make a GET request.
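To make the idea concrete, here is a toy sketch of server-side templating using only Python’s standard library. This is an illustration of the concept, not how PHI or any real framework is implemented; frameworks like Django and Flask use much richer template engines, but the core idea is the same: the server fills a template with data before any HTML reaches the browser.

```python
from string import Template

# a toy "server-side" template: the server fills in $items before sending the HTML
page = Template("<ul>\n$items\n</ul>")

def render(titles):
    # in a real site, these rows would come from a database query
    items = "\n".join(f"<li>{t}</li>" for t in titles)
    return page.substitute(items=items)

print(render(["Aeneid", "Metamorphoses"]))
```

A plain GET request made before `render` runs would never see those `<li>` tags, which is exactly the problem we face with dynamic sites.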
from IPython.display import IFrame
phi_url = "https://latin.packhum.org/"
IFrame(phi_url, width=800, height=500)
We can easily pass the HTML for this page into BeautifulSoup, but we can’t interact with the page. We need to click “Click here if you agree to this License”, and BeautifulSoup gives us no way to do that. What tool can we use to interact with the webpage and give us back the result of that interaction?
The answer is: Selenium.
Getting started with Selenium#
The syntax for Selenium can be quite confusing compared to BeautifulSoup. Almost everything you do in Selenium goes through a WebDriver object, below called driver. You should think of this as a browser window on your own computer. We can do a variety of things with a driver, but the most important are:
Submit a GET request to a webpage;
Find elements on a page (like BeautifulSoup);
Create a screenshot of what the page looks like;
Click on elements;
And write characters into an input box.
import google_colab_selenium as gs # from the google-colab-selenium package, will be different if you want to do this locally
driver = gs.Chrome() # driver object is the main entry point for selenium
# just like using requests, we will use a GET request
driver.get(phi_url) # this time it's a method on the driver
from selenium.webdriver.common.by import By # allows us to select by different things
enter = driver.find_element(By.CLASS_NAME, "lic") # selecting by class name
enter.click() # selenium allows us to click!
Starting on our task#
As I mentioned above, our task in this notebook is to collect search results. PHI will take in a search term, as we showed above, and return every place that term appears in its corpus of Latin texts, giving one line above and one line below each match. We would like to get all of these passages and organize them into a DataFrame, making sure to track the author, title and citation of each work.
XPATH#
Above we used CLASS_NAME and TAG_NAME to select web elements, but what if we need a more complex query? When using BeautifulSoup, we can use the attrs keyword argument in find and find_all to create complex queries. In Selenium we use the By.XPATH option. What is XPATH? XML Path Language or XPATH is a simple expression language designed to query markup language documents, like XML and HTML. The syntax can look hard, but here are the key things to keep in mind:
/ navigates through the markup tree through the tags between the slashes;
// gets the following tag name and starts the query there;
@ indicates an attribute.
The query below //div[@id='results']/ul/li tells Selenium to find every div tag with attribute id='results', then the ul tags directly inside it, and then all of the li tags inside those.
To learn more visit this page.
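You can experiment with this query shape outside of Selenium, too. Python’s built-in xml.etree.ElementTree supports a limited subset of XPath (note that it uses a leading . for relative paths instead of a bare //), which is enough to mimic the query above on a toy document:

```python
import xml.etree.ElementTree as ET

# a toy document standing in for the PHI results page
html = """<html><body>
<div id='other'><ul><li>skip me</li></ul></div>
<div id='results'><ul><li>match 1</li><li>match 2</li></ul></div>
</body></html>"""

root = ET.fromstring(html)
# ElementTree writes ".//" where full XPath writes "//": search anywhere below here
items = root.findall(".//div[@id='results']/ul/li")
print([li.text for li in items])  # only the li tags under the 'results' div
```

Only the two li tags under the div with id='results' are returned; the one under id='other' is skipped, just as in the Selenium query.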
matches = driver.find_elements(By.XPATH, "//div[@id='results']/ul/li")
len(matches)
Great! We were able to use XPATH to get the results from PHI. But how do we then go to the next page?
How would you do this in a browser? You would scroll down to the bottom of the page and then click on “Next”. We can do the same in Selenium.
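The cells below use a small helper called take_and_show_screenshot, which is not part of Selenium itself. Here is one possible sketch of it, assuming a notebook environment where IPython is available; driver.save_screenshot is a real WebDriver method, while the helper name and display logic are our own.

```python
def take_and_show_screenshot(driver, filename):
    # imported here so the helper stays self-contained (assumes IPython is installed)
    from IPython.display import Image, display
    # save what the driver's browser window currently shows to a PNG file...
    driver.save_screenshot(filename)
    # ...and render the saved image inline in the notebook
    display(Image(filename))
```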
# this code scrolls to the bottom of the page
driver.execute_script(
"window.scrollTo(0, document.body.scrollHeight);"
) # this is running a JavaScript snippet, not Python code
take_and_show_screenshot(
driver, "scroll_down.png"
) # nice we scrolled, now we need to click
next_button = driver.find_element(
By.XPATH, "//a[@class='plink pg_n']"
) # find the next button
next_button.click() # click!
take_and_show_screenshot(driver, "next_page.png") # and now we're on the next page
matches.extend(
driver.find_elements(By.XPATH, "//div[@id='results']/ul/li")
) # let's do the same as above: get the matches and add them to our list
len(matches)
matches[0].text # huh? why isn't this working?
What is a StaleElementReferenceException and why does the error look so strange? Selenium is not a native Python package: the project’s core is written in Java (not JavaScript), and the Python library is a set of bindings that communicate with a separate browser driver process. As a result, much of its work does not happen in the same runtime as our normal Python code. Thankfully, we don’t really need to worry about this until there’s an error, at which point the traceback won’t show us exactly what caused it.
In this case, a StaleElementReferenceException is raised when Selenium attempts to access a web element that is no longer on the page. This can happen a lot, especially if you are scrolling and moving between pages. To deal with this problem, we need to pull all of the text/data we need out of a web element while it is still on its original page.
# a function that pulls out all of the relevant info
def get_relevant_info(one_match):
text = one_match.text
cite_link = one_match.find_element(By.TAG_NAME, "a").get_attribute("href")
return text, cite_link
matches = driver.find_elements(By.XPATH, "//div[@id='results']/ul/li")
for one_match in matches:
print(get_relevant_info(one_match))
print()
Pulling it all together#
Now that we’ve explored the first few pages of results, we are ready to pull all of the pieces together and create a short script that takes in a user query and scrapes all of the relevant data.
import re
# last thing: there's a tracker at the top of the page that tells us how many results to expect
# we can get the text there to make sure we have everything
num_matches_raw = driver.find_element(By.XPATH, "//div[@id='stats']").text
num_matches = int(re.search(r"(\d+) results", num_matches_raw).group(1))
num_matches
from selenium.common.exceptions import (
NoSuchElementException,
) # the error when the search bar isn't there
# reinitializing the driver
driver = gs.Chrome()
driver.get(phi_url)
# click on enter button
enter = driver.find_element(By.CLASS_NAME, "lic")
enter.click()
# click on "Word Search"
list_elements = driver.find_elements(By.TAG_NAME, "li")
list_elements[1].click()
# submit search term
search_term = "artifex" # feel free to change
search_bar = driver.find_element(By.TAG_NAME, "input")
search_bar.send_keys(search_term)
search_bar.send_keys("\n")
# get number of matches
num_matches_raw = driver.find_element(By.XPATH, "//div[@id='stats']").text
num_matches = int(re.search(r"(\d+) results", num_matches_raw).group(1))
print("Number of expected results: ", num_matches)
matches = []
while True:
page_matches = []
for element in driver.find_elements(By.XPATH, "//div[@id='results']/ul/li"):
page_matches.append(get_relevant_info(element))
matches.extend(page_matches)
print("Current length: ", len(matches))
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
try:
next_button = driver.find_element(By.XPATH, "//a[@class='plink pg_n']")
next_button.click()
except NoSuchElementException:
break
print("Final length: ", len(matches))
Huh! Why didn’t it work? What is a NoSuchElementException? Like the last error we saw, the traceback is not very helpful, though it does tell us which XPATH set off the error: //div[@id='stats'].
This is being thrown because Selenium is trying to find the results count before it has loaded into the page. We get a NoSuchElementException because, at the moment we run driver.find_element(By.XPATH, "//div[@id='stats']").text, the XPATH //div[@id='stats'] matches nothing; most likely, nothing has rendered on the page yet.
As a result, we need to have the driver wait before we run that code. We can use time.sleep to do this. Note: Selenium also has built-in explicit waits (WebDriverWait), but they can be a little hit-or-miss, and many programmers use time.sleep anyway.
import time
# reinitializing the driver
driver = gs.Chrome()
driver.get(phi_url)
# click on enter button
enter = driver.find_element(By.CLASS_NAME, "lic")
enter.click()
# click on "Word Search"
list_elements = driver.find_elements(By.TAG_NAME, "li")
list_elements[1].click()
# submit search term
search_term = "artifex" # feel free to change
search_bar = driver.find_element(By.TAG_NAME, "input")
search_bar.send_keys(search_term)
search_bar.send_keys("\n")
time.sleep(3) # wait for the search to load
# get number of matches
num_matches_raw = driver.find_element(By.XPATH, "//div[@id='stats']").text
num_matches = int(re.search(r"(\d+) results", num_matches_raw).group(1))
print("Number of expected results: ", num_matches)
print()
matches = []
current_page = 1
while True:
print(driver.find_element(By.XPATH, "//div[@id='stats']").text)
page_matches = []
for element in driver.find_elements(By.XPATH, "//div[@id='results']/ul/li"):
page_matches.append(get_relevant_info(element))
matches.extend(page_matches)
print("Current length: ", len(matches))
print("Current page: ", current_page)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(1) # wait for the scroll to load
try:
next_button = driver.find_element(By.XPATH, "//a[@class='plink pg_n']")
next_button.click()
current_page += 1
time.sleep(3) # wait for the next page to load
except NoSuchElementException:
break # break out of loop if the next button isn't there
print()
print("Final length: ", len(matches)) # takes about a minute to run
Curating our data#
We now have a lot of useful data from PHI, but it’s not in a useful format. For example, some results contain a single three-line passage, while others contain multiple passages and citations. This is a problem: to create a good dataset, the shape of our data should be homogeneous.
# one match per result
matches[-1]
# multiple matches in one result
matches[-2]
splits = matches[-2][0].split("\n") # \n is a helpful delimiter
len(splits)
splits[0] # citation info
citations = re.search(r"[A-Za-z\s]+, [A-Za-z\s]+ (.*)", splits[0]).group(1) # regex to separate out the numbers; [A-Za-z] rather than [A-z], which also matches some punctuation
citations.split(", ") # split on the comma
Now that we’ve separated out the citations and the passages, we need to align each citation to the correct passage. Unfortunately, each line is just separated by \n and there is nothing separating the passages from each other. Below, I show how we can separate the raw data into a list of lists of three lines each, using the range function.
for i in range(
0, # where to start
len(splits[1:]), # how far to go
3, # how many steps to take for each iteration
):
print(i)
for i in range(
0, # where to start
len(splits[1:]), # how far to go
3, # how many steps to take for each iteration
):
print(splits[1:][i]) # current line
print(splits[1:][i + 1]) # the next line
print(splits[1:][i + 2]) # the last line
print()
[splits[1:][i : i + 3] for i in range(0, len(splits[1:]), 3)] # as a list comprehension
passages = [
"\n".join(splits[1:][i : i + 3]) for i in range(0, len(splits[1:]), 3)
] # joining lists into a single passage
passages
citations = citations.split(", ")
list(zip(citations, passages)) # zip to align them
# full code for formatting all of the info
data = []
for one_match in matches:
splits = one_match[0].split("\n")
author_title = re.search(r"[A-Za-z\s]+, [A-Za-z\s]+", splits[0]).group(0)
author = author_title.split(", ")[0]
title = author_title.split(", ")[1]
if len(splits) > 4: # the case where we have more than one passage per result
# same code as above
citations = re.search(r"[A-Za-z\s]+, [A-Za-z\s]+ (.*)", splits[0]).group(1)
citations = citations.split(", ")
passages = [
"\n".join(splits[1:][i : i + 3]) for i in range(0, len(splits[1:]), 3)
]
data.extend(
[
(title, author, r[0], r[1], one_match[1])
for r in list(zip(citations, passages))
]
)
else: # the case where one result has one match
# this is new
citation = re.search(r"[A-Za-z\s]+, [A-Za-z\s\.]+ (.*)", splits[0]).group(1) # gets the citation the same way
text = "\n".join(splits[1:]) # collects and joins text
data.append((title, author, citation, text, one_match[1]))
import pandas as pd
# putting it in a dataframe
df = pd.DataFrame(data, columns=["title", "author", "citation", "text", "cite_link"])
df
Final function#
Yay! We got the data we wanted and in the form we wanted it in. Now, we can create a function that just takes in the user query and gives back the results.
def search_phi(query):
# collecting the data
driver = gs.Chrome()
driver.get(phi_url)
# click on enter button
enter = driver.find_element(By.CLASS_NAME, "lic")
enter.click()
# click on "Word Search"
list_elements = driver.find_elements(By.TAG_NAME, "li")
list_elements[1].click()
# submit search term
search_term = query
search_bar = driver.find_element(By.TAG_NAME, "input")
search_bar.send_keys(search_term)
search_bar.send_keys("\n")
time.sleep(3) # wait for the search to load
# get number of matches
num_matches_raw = driver.find_element(By.XPATH, "//div[@id='stats']").text
num_matches = int(re.search(r"(\d+) results", num_matches_raw).group(1))
print("Number of expected results: ", num_matches)
print()
matches = []
current_page = 1
while True:
print(driver.find_element(By.XPATH, "//div[@id='stats']").text)
page_matches = []
for element in driver.find_elements(By.XPATH, "//div[@id='results']/ul/li"):
page_matches.append(get_relevant_info(element))
matches.extend(page_matches)
print("Current length: ", len(matches))
print("Current page: ", current_page)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(1) # wait for the scroll to load
try:
next_button = driver.find_element(By.XPATH, "//a[@class='plink pg_n']")
next_button.click()
current_page += 1
time.sleep(3) # wait for the next page to load
except NoSuchElementException:
break # break out of loop if the next button isn't there
print()
print("Final length: ", len(matches)) # takes about a minute to run
# format data
data = []
for one_match in matches:
splits = one_match[0].split("\n")
author_title = re.search(r"[A-Za-z\s]+, [A-Za-z\s]+", splits[0]).group(0)
author = author_title.split(", ")[0]
title = author_title.split(", ")[1]
if len(splits) > 4: # the case where we have more than one passage per result
# same code as above
citations = re.search(r"[A-Za-z\s]+, [A-Za-z\s]+ (.*)", splits[0]).group(1)
citations = citations.split(", ")
passages = [
"\n".join(splits[1:][i : i + 3]) for i in range(0, len(splits[1:]), 3)
]
data.extend(
[
(title, author, r[0], r[1], one_match[1])
for r in list(zip(citations, passages))
]
)
else: # the case where one result has one match
# this is new
citation = re.search(r"[A-Za-z\s]+, [A-Za-z\s\.]+ (.*)", splits[0]).group(1) # gets the citation the same way
text = "\n".join(splits[1:]) # collects and joins text
data.append((title, author, citation, text, one_match[1]))
return pd.DataFrame(
data, columns=["title", "author", "citation", "text", "cite_link"]
)
instar_df = search_phi("instar")
instar_df
# visualize titles
instar_df.title.value_counts()[:10].plot(kind="barh")
# visualize authors
instar_df.author.value_counts()[:10].plot(kind="barh")