{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Art of Webscraping III: Scraping Reddit\n", "\n", "In the past two notebooks in this webscraping series, we saw how we could use Python to automate getting data from websites. First, `beautifulsoup` gave us a method of navigating the HTML of a static webpage, and then `selenium` allowed us to parse dynamically generated pages.\n", "\n", "In this notebook, we'll look at a specialized source of data that we can pull from: Reddit. Reddit.com is a collection of forums where users can discuss topics of shared interest. A lot of people use Reddit, so many NLP researchers use it as a place to gather data for novel datasets. In this example, we'll collect a swathe of text from [r/latin](https://www.reddit.com/r/latin) and save it as a CSV.\n", "\n", "Over the past few years, Reddit has made it very difficult to get large chunks of their data. That said, another group, pullpush.io, have saved and hosted terabytes of historical Reddit data for public use. We'll be using their API in this notebook.\n", "\n", "**Nota Bene**: pullpush's API service is designed for academic use only! It is an incredible resource, especially because it is free. Please do not abuse it!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# imports\n", "import requests\n", "from datetime import datetime, timezone, timedelta\n", "from tqdm import tqdm\n", "import random\n", "\n", "random.seed(42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because pullpush is an API service, we will ask for data using a URL. This URL will include information like which subreddit we want to search through, the dates we want to search in, and how the results should be ordered. See below for an example."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ex_url = \"https://api.pullpush.io/reddit/search/submission/?subreddit=latin\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's break that down into its component parts:\n", "\n", "\n", "* *https://api.pullpush.io/reddit/search/*: This part of the URL should never change. This is the base URL that we'll be adding to depending on our purposes.\n", "* *submission/*: This addition tells pullpush that we want to search posts (submissions) and not comments. As we will see later, there is a different string that we can use instead that will allow us to search for comments.\n", "* *?*: This question mark is the start of our specific query. It tells pullpush that we are going to be giving it instructions about what data we are expecting to get.\n", "* *subreddit=latin*: This section tells pullpush we want data from the r/latin subreddit. 
This is very simple, but there are many ways we can refine it.\n", "\n", "This is the simplest type of query we can run so let's see what it gives us.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using the requests library to make a GET request\n", "response = requests.get(ex_url)\n", "print(response.status_code)  # status code 200 means it worked" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = response.json()[\"data\"]\n", "len(data), type(data[0])  # 100 dictionary responses" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collecting Submissions (Posts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, each dictionary in the `data` list holds the information from an individual post. But why are there only 100? Pullpush only allows users to get 100 posts per request, meaning that we'll have to get creative with how we request data from Pullpush.\n", "\n", "To do so, we'll have to take advantage of the other request modifiers besides just \"subreddit.\" A list of all of these can be found [here](https://pullpush.io/#docs).\n", "\n", "One method that we can try is using timestamps to segment the data into chunks of 100 posts or fewer. Pullpush allows us to ask for posts within a specific time block. We can then loop through these time blocks until we get the data that we want.\n", "\n", "Below I'll walk through getting data for a single day. According to the pullpush documentation, there is a \"before\" and an \"after\" modifier, but these only accept an \"Epoch value\". What does that mean?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# normal python date\n", "_date = datetime(2022, 1, 1)\n", "_date, type(_date)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# epoch value\n", "dt_with_timezone = _date.replace(tzinfo=timezone.utc)\n", "int(dt_with_timezone.timestamp())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An \"epoch value\", or Unix timestamp, is a standard way of encoding dates for computers. It represents a moment in time as the number of seconds that have elapsed since midnight UTC on January 1, 1970. This might seem arbitrary and that's because it is! That said, we can create a few functions to make translating between normal Python datetime objects and epoch values easier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def convert_utc_to_date(ts):\n", "    \"\"\"\n", "    Converts a UTC timestamp to a formatted date string in the local time zone.\n", "    \"\"\"\n", "    utc_datetime = datetime.fromtimestamp(ts, tz=timezone.utc)\n", "    local_datetime = utc_datetime.astimezone()\n", "    return local_datetime.strftime(\"%Y-%m-%d %H:%M:%S\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def convert_date_to_utc(dt):\n", "    \"\"\"\n", "    Converts a naive datetime object, interpreted as UTC, to a Unix timestamp.\n", "    \"\"\"\n", "    dt_with_timezone = dt.replace(tzinfo=timezone.utc)\n", "    return int(dt_with_timezone.timestamp())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", "    convert_utc_to_date(convert_date_to_utc(_date))\n", ")  # prints 2022-01-01 00:00:00 if your local time zone is UTC, otherwise the local-time equivalent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's try adding this to our request URL and retrieve a day's worth of posts. A day is 86400 seconds, so all we need to do is convert our datetime object to an epoch value and then add 86400." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_date = datetime(2024, 5, 23)\n", "utc_ts = convert_date_to_utc(start_date)\n", "\n", "url_query = f\"https://api.pullpush.io/reddit/search/submission/?after={utc_ts}&before={utc_ts+86400}&subreddit=latin\"  # can use & to join modifiers\n", "url_query" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "res = requests.get(url_query)\n", "if res.status_code == 200:\n", "    data = res.json()[\"data\"]\n", "    print(f\"Number of posts: {len(data)}\")\n", "    print(\n", "        f\"Most recent post: {convert_utc_to_date(data[0]['created_utc'])}, {data[0]['title']}\"\n", "    )\n", "    print(\n", "        f\"Least recent post: {convert_utc_to_date(data[-1]['created_utc'])}, {data[-1]['title']}\"\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wonderful! Now we have a way to get all of the posts for a single day, as long as that day has no more than 100 of them. Now we can create a loop where we go through every day between a start date and an end date, collecting all of the data in between. To facilitate this, we are going to create a generator.\n", "\n", "Generators look like functions, but they're slightly different. Instead of using the `return` keyword, a generator uses the `yield` keyword, which hands back one value at a time and pauses the function until the next value is requested. Refer to the example below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# generating squares\n", "def gen_squares(n):\n", "    for i in range(n):  # loop through each number from 0 to n - 1\n", "        yield i, i * i  # give back the number and the number's square\n", "\n", "\n", "for i in gen_squares(5):  # computation only occurs here\n", "    print(i)\n", "print()  # prints empty line\n", "\n", "type(gen_squares(5))  # type = generator" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# our date generator\n", "def date_range_generator(start_date, end_date):\n", "    current = start_date  # sets current to the start date\n", "    total_days = (\n", "        end_date - start_date\n", "    ).days + 1  # defines the number of days we want to loop through\n", "\n", "    for _ in range(total_days):  # for each day\n", "        yield current  # give back the current date\n", "        current += timedelta(\n", "            days=1\n", "        )  # add a day to the current date, move to the next day" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one last thing... 
adding a progress bar\n", "def date_range_generator(start_date, end_date):\n", "    current = start_date  # sets current to the start date\n", "    total_days = (\n", "        end_date - start_date\n", "    ).days + 1  # defines the number of days we want to loop through\n", "\n", "    for _ in tqdm(\n", "        range(total_days), desc=\"Processing Days\", unit=\"day\"\n", "    ):  # for each day, now with a progress bar\n", "        yield current  # give back the current date\n", "        current += timedelta(\n", "            days=1\n", "        )  # add a day to the current date, move to the next day" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# giving it a try!\n", "start_date = datetime(2024, 5, 1)\n", "end_date = datetime(2024, 5, 7)  # just a week of data\n", "\n", "data = []\n", "for day in date_range_generator(start_date, end_date):\n", "    utc_ts = convert_date_to_utc(day)\n", "    url_query = f\"https://api.pullpush.io/reddit/search/submission/?after={utc_ts}&before={utc_ts+86400}&subreddit=latin\"\n", "    res = requests.get(url_query)\n", "    if res.status_code == 200:\n", "        for post in res.json()[\"data\"]:  # loop through each post and...\n", "            if post not in data:  # checking if it is already in our data list\n", "                data.append(post)  # if not, then we can add it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(data)  # more than 100!!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a good way of getting our data, we can dump it into a DataFrame and save it as a CSV. Most of the information here is either repetitive or not useful, so we can select only a subset of the most valuable data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "cols_of_interest = [\n", "    \"author\",  # username\n", "    \"created_utc\",  # when it was posted\n", "    \"id\",  # id of thread, useful for comments\n", "    \"num_comments\",  # number of comments\n", "    \"score\",  # upvotes - downvotes\n", "    \"selftext\",  # text of the post\n", "    \"title\",  # title of the post\n", "    \"url\",  # url to the thread\n", "]\n", "\n", "df = pd.DataFrame(data)\n", "df = df[cols_of_interest]\n", "df[\"created_utc\"] = df[\"created_utc\"].astype(int)\n", "df[\"date\"] = df[\"created_utc\"].apply(\n", "    convert_utc_to_date\n", ")  # convert utc to normal date format\n", "df.to_csv(\"r_latin20240501to20240507.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collecting Comments\n", "\n", "Now that we've successfully collected all of the posts in a given time frame, we can turn to collecting comments for each post as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# choosing an example from the work above\n", "\n", "more_than_one_comment = df[\n", "    df.num_comments > 1\n", "]  # keeping only posts with more than one comment\n", "ex_post = more_than_one_comment.iloc[0]  # first one\n", "ex_post" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we'll need the value at the id column\n", "ex_id = ex_post[\"id\"]\n", "ex_id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of using \"submission\" in our query URL, we will use \"comment\"." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "comments_url = f\"https://api.pullpush.io/reddit/comment/search?link_id={ex_id}\"\n", "data = requests.get(comments_url).json()[\"data\"]\n", "len(\n", "    data\n", ")  # there will be a mismatch between this number and the num_comments column as this includes responses to existing comments" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define a function\n", "def get_comments(id):\n", "    comments_url = f\"https://api.pullpush.io/reddit/comment/search?link_id={id}\"\n", "    return requests.get(comments_url).json()[\"data\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As opposed to the posts, a list of comments like this does not work well with CSV data. CSVs prefer data to all be the same \"shape\", meaning every row has the same set of columns. Comments can be tricky because the number of them varies from post to post, which means it's impractical to have a column for each comment: we would end up with a lot of empty columns whenever one post gets a large number of comments.\n", "\n", "We can coerce it into a CSV, but the JSON format would suit this data much better, so we'll compose a JSON file keyed by the unique IDs from our DataFrame of posts so that the two can be linked together." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# filtering by the data we're interested in\n", "fields_of_interest = [\n", "    \"score\",\n", "    \"replies\",\n", "    \"id\",\n", "    \"author\",\n", "    \"parent_id\",\n", "    \"body\",\n", "    \"created\",\n", "]\n", "\n", "to_json = {}\n", "for id in df.id:  # loop through our ids\n", "    comments = get_comments(id)  # get our comments\n", "    comments_by_id = []  # empty list to hold the comments\n", "    for comment in comments:\n", "        comments_by_id.append(\n", "            {k: v for k, v in comment.items() if k in fields_of_interest}\n", "        )  # filter by our fields of interest\n", "    to_json[id] = comments_by_id  # assign the list of comments to the original id\n", "\n", "# this loop will take some time because we have to submit a GET request for every ID; for ~100 ids it took 5 minutes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json  # python json library\n", "\n", "with open(\"r_latin20240501to20240507_comments.json\", \"w\") as f:\n", "    json.dump(to_json, f)  # saves dictionary as json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The whole process\n", "\n", "Below, all of the steps we followed are gathered into a Python class. This format allows us to customize our inputs without having to worry about the core functionality working." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class RedditScraper:\n", "    def __init__(self, start_date, end_date, subreddit):\n", "        self.start_date = start_date\n", "        self.end_date = end_date\n", "        self.subreddit = subreddit\n", "        self.output_file = f\"r_{self.subreddit}_{start_date.strftime('%Y-%m-%d')}_{end_date.strftime('%Y-%m-%d')}\"\n", "\n", "        self.cols_of_interest = [\n", "            \"author\",\n", "            \"created_utc\",\n", "            \"id\",\n", "            \"num_comments\",\n", "            \"score\",\n", "            \"selftext\",\n", "            \"title\",\n", "            \"url\",\n", "        ]\n", "        self.fields_of_interest = [\n", "            \"score\",\n", "            \"replies\",\n", "            \"id\",\n", "            \"author\",\n", "            \"parent_id\",\n", "            \"body\",\n", "            \"created\",\n", "        ]\n", "\n", "    def convert_utc_to_date(self, ts):\n", "        \"\"\"\n", "        Converts a UTC timestamp to a formatted date string in the local time zone.\n", "        \"\"\"\n", "        utc_datetime = datetime.fromtimestamp(ts, tz=timezone.utc)\n", "        local_datetime = utc_datetime.astimezone()\n", "        return local_datetime.strftime(\"%Y-%m-%d %H:%M:%S\")\n", "\n", "    def convert_date_to_utc(self, dt):\n", "        \"\"\"\n", "        Converts a naive datetime object, interpreted as UTC, to a Unix timestamp.\n", "        \"\"\"\n", "        dt_with_timezone = dt.replace(tzinfo=timezone.utc)\n", "        return int(dt_with_timezone.timestamp())\n", "\n", "    def date_range_generator(self, start_date, end_date):\n", "        \"\"\"\n", "        Yields each day from start_date through end_date.\n", "        \"\"\"\n", "        current = start_date\n", "        total_days = (end_date - start_date).days + 1\n", "\n", "        for _ in tqdm(range(total_days), desc=\"Processing Days\", unit=\"day\"):\n", "            yield current\n", "            current += timedelta(days=1)\n", "\n", "    def scrape_posts(self):\n", "        \"\"\"\n", "        Scrapes posts from Reddit and dumps output in a DataFrame.\n", "        \"\"\"\n", "        start_date = self.start_date\n", "        end_date = self.end_date\n", "        data = []\n", "        for day in self.date_range_generator(start_date, end_date):\n", "            utc_ts = self.convert_date_to_utc(day)\n", "            
url_query = f\"https://api.pullpush.io/reddit/search/submission/?after={utc_ts}&before={utc_ts+86400}&subreddit={self.subreddit}\"\n", "            res = requests.get(url_query)\n", "            if res.status_code == 200:\n", "                for post in res.json()[\"data\"]:  # loop through each post and...\n", "                    if post not in data:  # checking if it is already in our data list\n", "                        data.append(post)  # if not, then we can add it\n", "\n", "        df = pd.DataFrame(data)\n", "        df = df[self.cols_of_interest]\n", "        df[\"created_utc\"] = df[\"created_utc\"].astype(int)\n", "        df[\"date\"] = df[\"created_utc\"].apply(self.convert_utc_to_date)\n", "        self.df = df\n", "        return self.df\n", "\n", "    def save_post_data(self):\n", "        \"\"\"\n", "        Saves DataFrame to CSV.\n", "        \"\"\"\n", "        self.df.to_csv(f\"{self.output_file}_posts.csv\")\n", "        return self.df\n", "\n", "    def get_comments(self, id):\n", "        comments_url = f\"https://api.pullpush.io/reddit/comment/search?link_id={id}\"\n", "        return requests.get(comments_url).json()[\"data\"]\n", "\n", "    def scrape_comments(self):\n", "        \"\"\"\n", "        Scrapes comments from Reddit given the ids from self.df.\n", "        \"\"\"\n", "        to_json = {}\n", "        for id in tqdm(self.df.id):\n", "            comments = self.get_comments(id)\n", "            comments_by_id = []\n", "            for comment in comments:\n", "                comments_by_id.append(\n", "                    {k: v for k, v in comment.items() if k in self.fields_of_interest}\n", "                )\n", "            to_json[id] = comments_by_id\n", "        self.to_json = to_json\n", "        return to_json\n", "\n", "    def save_comment_data(self):\n", "        \"\"\"\n", "        Saves comments to JSON.\n", "        \"\"\"\n", "        with open(f\"{self.output_file}_comments.json\", \"w\") as f:\n", "            json.dump(self.to_json, f)\n", "        return self.to_json\n", "\n", "    def run(self):\n", "        self.scrape_posts()\n", "        self.save_post_data()\n", "        self.scrape_comments()\n", "        self.save_comment_data()\n", "        return self.df, self.to_json" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "latin_scraper = 
RedditScraper(datetime(2020, 1, 1), datetime(2020, 2, 1), \"latin\")\n", "latin_scraper.run()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" }, "mystnb": { "execution_mode": "off" } }, "nbformat": 4, "nbformat_minor": 0 }