{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Static Webscraping with BeautifulSoup\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, I will walk you through the basics of webscraping. As we will see, webscraping, or programmitically extracting data (usually text) from a webpage, is a very common and useful part of a wide variety of natural language processing tasks. Often times, webscraping will be the first step in your NLP pipeline and so the results we get from webscraping have the potential to affect all of the subsequent steps of your pipeline. Too, smart webscraping can save you a lot of time upfront that you would have had to spend on optimization later on in the process.\n",
"\n",
"## Our task in this notebook\n",
"I will take the Wikipedia page of the [list of all Roman Emperors](https://en.wikipedia.org/wiki/List_of_Roman_emperors) and recreate the tables in it in Python. This is a very common task in NLP. A lot of the time, we need to cross reference a source we are interested in with a secondary source like Wikipedia. I will show you the basics in this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Goals\n",
"* Access a static webpage in Python using the requests library\n",
"* Navigate through raw html code to find the data we are interested in\n",
"* Arrange that data into a useful format\n",
"* Clean the data to fit whatever we want to do with it"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## we want to scrape all of the tables from this wikipedia article and make our own tables of the same same data\n",
"from IPython.display import IFrame\n",
"\n",
"IFrame(\n",
" \"https://en.wikipedia.org/wiki/List_of_Roman_emperors#Principate_(27_BC_–_AD_284)\",\n",
" width=1000,\n",
" height=500,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `requests`\n",
"`requests` is an incredibly useful Python package that comes pre-installed on most distributions of Python. If you need to interact with the web from a .py file or notebook, you will be using requests in some form. Let's take a look at how it works."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## note: although requests is very common, if this cell returns a ModuleNotFound error, uncomment the line below to install it\n",
"# !python3 -m pip install requests\n",
"import requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(requests)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## using the .get method we can access any webpage\n",
"r = requests.get(\"https://en.wikipedia.org/wiki/List_of_Roman_emperors\")\n",
"\n",
"## let's see what r is\n",
"type(r)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## a Response object can give us all of the information we need\n",
"## we can call .text to see the raw html of a webpage\n",
"html = r.text\n",
"type(html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `BeautifulSoup`\n",
"Now that we have our HTML output, we need to parse it as a structured form of text. Above we see that our html object is a string, meaning we can only use the attributes and methods of strings on it. While we could create a parser from scratch that lets us take advantage of the structured nature of the HTML code, as with many tasks in Python, we don't need to because one already exists for us. There are many packages that parse HTML, but we'll be using the most popular, [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A note on copyright*\n",
"
\n",
"Just because you can access and scrape a website doesn't always mean you should. Most websites are public and they have to be to share the information they seek to, but that does not mean that it is free. Most data on the internet can be used without a problem. These sources are generally protected under certain licences like the MIT or Creative Commons Attribution-ShareAlike 3.0. Generally, if a site does not have one of these licenses, you should stay away. Stealing data this way can be dangerous and can break the terms of service of many sites."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## bs4 is in most distributions of Python, but if this cell does not work try:\n",
"## !python3 -m pip install bs4\n",
"from bs4 import BeautifulSoup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## the BeautifulSoup parser takes in any string and attempts to parse it as HTML (or XML)\n",
"soup = BeautifulSoup(r.text, features=\"html\")\n",
"type(soup)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## with this BeautifulSoup object, we can navigate through the tag soup in a systematic way\n",
"soup\n",
"## but now we need to isolate the data we want\n",
"## tranisition to Chrome Developer Tools/HTML highlighting"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(soup.find_all(\"tbody\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## now we have a iterable of each tbody object\n",
"for tbody in soup.find_all(\"tbody\"):\n",
" print(tbody)\n",
" print(\"-\" * 20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## importantly each element of this iterable can be navigated like the whole tree did\n",
"tbody = soup.find_all(\"tbody\")[0]\n",
"for tr in tbody.find_all(\"tr\")[\n",
" 1:\n",
"]: ## the first element (0th in the list) is the columns names, so I have gotten rid of it\n",
" # print(tr)\n",
" # print('_________')\n",
" row = tr.find_all([\"td\", \"th\"])\n",
" print(\n",
" row[0].a[\"href\"],\n",
" row[1].b.get_text(),\n",
" row[1].small.get_text(),\n",
" row[2].get_text(),\n",
" row[3].get_text(),\n",
" row[4].get_text(),\n",
" sep=\"\\n\",\n",
" )\n",
" print(\"_________\")\n",
"## now we're getting somewhere"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I am going to use a very versatile package called `re` or regular expressions (regex) to do some advanced string parsing. This regex function, finditer, takes in a pattern and a text and returns all of the times that pattern occurs in the text. These patterns can look very complicated, but, in this case, it is '(?<=\\)', which means: *find all of the places between an end paraenthesis and the rest of the string*.\n",
"
\n",
"\n",
"You can read more about regex [here](https://librarycarpentry.org/lc-data-intro-archives/04-regular-expressions/index.html) and you can play around with your own regex at [regex101](https://regex101.com/). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## going back to the full list of tbodys and populate a dictionary with all our data\n",
"import re\n",
"\n",
"emp_dict = {}\n",
"for tbody in soup.find_all(\"tbody\"):\n",
" for tr in tbody.find_all(\"tr\")[1:]:\n",
" row = tr.find_all([\"td\", \"th\"])\n",
" if len(row) == 5: ## check if each row is of the correct length\n",
" if not isinstance(row[0].a, type(None)): ## check if each row has an image\n",
" img_url = f\"en.wikipedia.org{row[0].a['href']}\"\n",
"\n",
" life_details = re.split(\n",
" \"(?<=\\))\", row[4].get_text()\n",
" ) ## here I am using regex to search for the place between an end parenthesis ')' and the rest of the string\n",
" if (\n",
" len(life_details) > 1\n",
" ): ## checking if there is a parenthesis, if there isn't then the cause of death recorded\n",
" life_date = life_details[0]\n",
" cod = life_details[1]\n",
" else:\n",
" life_date = life_details[0]\n",
" cod = \"None found.\"\n",
"\n",
" if not isinstance(\n",
" row[1].small, type(None)\n",
" ): ## checking if there is a full name associated with an emperor\n",
" full_name = row[1].small.get_text()\n",
" else:\n",
" full_name = \"None found.\"\n",
"\n",
" emp_dict[row[1].b.get_text()] = [\n",
" img_url,\n",
" full_name,\n",
" row[2].get_text(),\n",
" row[3].get_text(),\n",
" life_date,\n",
" cod,\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"emp_dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"with open(\"emp_dict.pickle\", \"wb\") as handle:\n",
" pickle.dump(emp_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Using `pandas`*\n",
"
\n",
"A dictionary is useful, but with more and more data, you'll find that a dictionary is to simple of a data structure for most webscraped data. A `pandas` dataframe is much more scalable and makes tabular data much easier to work with. That being said, one is NEVER supposed to fill a dataframe with a loop, as we filled the dictionary above. Instead, we can use the dictionary we created above and turn it into a dataframe. This way we can keep both forms of the data in case we need dictionary representation later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd ## you will almost always see pandas imported like this, the 'pd' alias is a very useful shorthand\n",
"\n",
"## Let's see what happens when we input the dictionary directly\n",
"pd.DataFrame(emp_dict)\n",
"## it's close but not quite what we wanted"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## pandas is programmed to look for numerical indices, which dictionaries (because they're an unordered data type) do not have\n",
"## we can coerce it though to accept string values as the index with the 'from_dict' method and the 'orient' keyword argument\n",
"## the 'reset_index' method will then turn our index into a column and give us an index for the rows\n",
"emp_df = pd.DataFrame.from_dict(emp_dict, orient=\"index\").reset_index()\n",
"emp_df = emp_df.rename(\n",
" columns={\n",
" \"index\": \"name\",\n",
" 0: \"img\",\n",
" 1: \"full_name\",\n",
" 2: \"reign\",\n",
" 3: \"succession\",\n",
" 4: \"life_dates\",\n",
" 5: \"cause_of_death\",\n",
" }\n",
") ## last, this is one way to rename columns\n",
"emp_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yay! 🎉 🥳 Our data has been scraped! 🥳 🎉\n",
"... but what can we do with it 🤔\n",
"
\n",
"\n",
"Let's try to plot all of the ages of the emperors, as we have that data in the life_dates column"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## we have a slight problem though...\n",
"## Take Trajan's row for instance\n",
"string = emp_df.loc[emp_df[\"name\"] == \"Trajan\"].life_dates.iloc[\n",
" 0\n",
"] ## gets a contents of the life_dates cell in the Trajan row\n",
"print(string)\n",
"print(str.encode(string))\n",
"## what is \\xc2\\xa0??"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It might not seem like it, but the difference between the two lines above will be very significant in cleaning our data to be used in visualizations.\n",
"\n",
"These collections of letters and numbers preceded by a backslash are byte representations of characters at the index position of the characters themselves. In fact these representations are slightly different characters than they might seem and we will have to normalize them in order to interact with them. To put a long story short, these characters come from a different text encoding (ISO-8895-1) than what Python expects (utf-8), so we must convert these non-standard characters into their standard output. We can use a package called unidecode, which also comes with most distributions of Python."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install unidecode\n",
"import unidecode\n",
"\n",
"print(string)\n",
"print(unidecode.unidecode(string))\n",
"print(str.encode(unidecode.unidecode(string)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def getAges(life_dates):\n",
" ld = unidecode.unidecode(life_dates)\n",
" age = re.search(\n",
" \"(?<=aged )([0-9]+)|(?<=aged approx. )([0-9]+)\", ld\n",
" ) ## more regex to extract the age\n",
" if age:\n",
" return int(age.group(0))\n",
" else:\n",
" return None ## there are some emperors for whom we have no dates for"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"getAges(string)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## apply takes in a function and applies it to all of the members of a column\n",
"emp_df[\"age\"] = emp_df[\"life_dates\"].apply(getAges)\n",
"emp_df[\"age\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import plotly.express as px\n",
"\n",
"fig = px.scatter(x=emp_df[\"name\"], y=emp_df[\"age\"])\n",
"fig.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reviewing what we learned\n",
"* The basics of the `requests` library\n",
"* Navigating HTML using BeautifulSoup\n",
"* How to construct a dictionary for our data\n",
"* Turning that dictionary into a `pandas` dataframe\n",
"* Cleaning our data for a specific purpose with `.apply`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a challenge, try to do what I did for the ages of the emperors, but with the length that they reigned for. This is a much more difficult question and can be done in a couple different ways. You will likely have to use the `datetime` package in Python. If you have trouble or just want to show off how you did it, feel free to reach out and let me know at peter.nadel@tufts.edu"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Thanks for reading"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
},
"mystnb": {
"execution_mode": "off"
},
"vscode": {
"interpreter": {
"hash": "93687b5df5ceb999eb7461347bbd13994c20900ea84c03a9391096e02392089d"
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}