The Art of Webscraping III: Scraping Reddit#
In the past two notebooks in this webscraping series, we saw how we could use Python to automate getting data from websites. First, BeautifulSoup gave us a method of navigating the HTML of a static webpage, and then Selenium allowed us to parse dynamically generated pages.
In this notebook, we’ll look at a specialized source of data that we can pull from: Reddit. Reddit.com is a collection of forums where users can discuss topics of shared interest. A lot of people use Reddit, so many NLP researchers use it as a place to gather data for novel datasets. In this example, we’ll collect a swathe of text from r/latin and save it as a CSV.
Over the past few years, Reddit has made it very difficult to get large chunks of their data. That said, another group, pullpush.io, has saved and hosts terabytes of historical Reddit data for public use. We’ll be using their API in this notebook.
Nota Bene: pullpush’s API service is designed for academic use only! It is an incredible resource, especially because it is free. Please do not abuse it!
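One way to be courteous is to pause between requests and retry on transient failures. Below is a minimal sketch of such a helper; the one-second pause and retry count are my own assumptions, since pullpush does not publish an official rate limit:

```python
import time

import requests


def polite_get(url, pause=1.0, retries=3):
    """GET a URL politely: pause before every attempt and retry on failure."""
    for _ in range(retries):
        time.sleep(pause)  # wait before each request so we don't hammer the API
        try:
            res = requests.get(url, timeout=30)
            if res.status_code == 200:
                return res
        except requests.RequestException:
            pass  # network hiccup: fall through and try again
    return None  # caller should check for a failed request
```

Any `requests.get` call in the rest of the notebook could be swapped for `polite_get` without changing the surrounding logic.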
# imports
import requests
from datetime import datetime, timezone, timedelta
from tqdm import tqdm
import random
random.seed(42)
Because pullpush is an API service, we will ask for data using a URL. This URL will include information like which subreddit we want to search, the dates we want to search within, and how the results should be ordered. See below for an example.
ex_url = "https://api.pullpush.io/reddit/search/submission/?subreddit=latin"
Let’s break that down into its component parts:
*https://api.pullpush.io/reddit/search/*: This part of the URL should never change. This is the base URL that we’ll be adding to depending on our purposes.
submission/: This addition tells pullpush that we want to search posts (submissions) and not comments. As we will see later, there is a different string that we can use instead that will allow us to search for comments.
?: This question mark is the start of our specific query. It tells pullpush that we are going to be giving it instructions about what data we are expecting to get.
subreddit=latin: This section tells pullpush we want data from the r/latin subreddit. This is very simple, but there are many ways we can nuance it significantly.
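As a side note, we don’t have to glue these pieces together by hand: the requests library can encode a dictionary of modifiers into the query string for us via its `params` argument. A quick sketch reproducing the example URL above:

```python
import requests

base_url = "https://api.pullpush.io/reddit/search/submission/"
params = {"subreddit": "latin"}  # each key=value pair becomes one query-string modifier

# prepare() builds the final URL without sending anything over the network
req = requests.Request("GET", base_url, params=params).prepare()
print(req.url)  # https://api.pullpush.io/reddit/search/submission/?subreddit=latin
```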
This is the simplest type of query we can run, so let’s see what it gives us.
# using the requests library to make a GET request
response = requests.get(ex_url)
print(response.status_code) # status code 200 means it worked
data = response.json()["data"]
len(data), type(data[0]) # 100 dictionary responses
data[0]
Collecting Submissions (Posts)#
As we can see, each one of these dictionaries from the data list holds the information from an individual post. But why are there only 100? Pullpush only allows users to get 100 posts per request, meaning that we’ll have to get creative with how we request data from Pullpush.
To do so, we’ll have to take advantage of the other request modifiers besides just “subreddit.” A list of all of these can be found here.
One method that we can try is using timestamps to segment the data into chunks of 100 posts or fewer. Pullpush allows us to ask for posts within a specific time block, so we can loop through these time blocks until we have all of the data that we want.
Below I’ll walk through getting data for a single day. According to the pullpush documentation, there is a “before” and an “after” modifier, but these only accept an “Epoch value”. What does that mean?
# normal python date
_date = datetime(2022, 1, 1)
_date, type(_date)
# epoch value
dt_with_timezone = _date.replace(tzinfo=timezone.utc)
int(dt_with_timezone.timestamp())
An “epoch value”, or Unix timestamp, is a special method of encoding dates for computers. It is a standard which represents dates as the number of seconds that have elapsed since January 1, 1970. This might seem arbitrary, and that’s because it is! That said, we can create a few functions to make translating between normal Python datetime objects and epoch values easier.
def convert_utc_to_date(ts):
    """
    Converts a UTC timestamp to a formatted local date string.
    """
    utc_datetime = datetime.fromtimestamp(ts, tz=timezone.utc)  # utcfromtimestamp() is deprecated as of Python 3.12
    local_datetime = utc_datetime.astimezone()
    return local_datetime.strftime("%Y-%m-%d %H:%M:%S")


def convert_date_to_utc(dt):
    """
    Converts a local datetime object to a UTC timestamp.
    """
    dt_with_timezone = dt.replace(tzinfo=timezone.utc)
    return int(dt_with_timezone.timestamp())
print(convert_utc_to_date(convert_date_to_utc(_date)))  # should print 2022-01-01 00:00:00 (assuming a UTC local timezone)
Now let’s try adding this to our request URL and retrieve a day’s worth of posts. A day is 86400 seconds, so all we need to do is convert our datetime object to an epoch value and then add 86400.
start_date = datetime(2024, 5, 23)
utc_ts = convert_date_to_utc(start_date)
url_query = f"https://api.pullpush.io/reddit/search/submission/?after={utc_ts}&before={utc_ts+86400}&subreddit=latin" # can use & to join modifiers
url_query
res = requests.get(url_query)
if res.status_code == 200:
    data = res.json()["data"]
    print(f"Number of posts: {len(data)}")
    print(f"Most recent post: {convert_utc_to_date(data[0]['created_utc'])}, {data[0]['title']}")
    print(f"Least recent post: {convert_utc_to_date(data[-1]['created_utc'])}, {data[-1]['title']}")
Wonderful! Now we have a way to get all of the posts for a single day. Next, we can create a loop that goes through every day between a start date and an end date, collecting all of the data in between. To facilitate this, we are going to create a generator.
Generators look like functions, but they behave slightly differently. Instead of using the return keyword, a generator uses the yield keyword, which hands back one value at a time and pauses until the next value is requested. Refer to the example below.
# generating squares
def gen_squares(n):
    for i in range(n):  # loop through each number in range 0 to n
        yield i, i * i  # return the number and the number's square


for i in gen_squares(5):  # computation only occurs here
    print(i)
print()  # prints empty line
type(gen_squares(5))  # type = generator
# our date generator
def date_range_generator(start_date, end_date):
    current = start_date  # sets current to the start date
    total_days = (end_date - start_date).days + 1  # the number of days we want to loop through
    for _ in range(total_days):  # for each day
        yield current  # give back the current date
        current += timedelta(days=1)  # add a day to the current date, move to the next day
# one last thing... adding a progress bar
def date_range_generator(start_date, end_date):
    current = start_date  # sets current to the start date
    total_days = (end_date - start_date).days + 1  # the number of days we want to loop through
    for _ in tqdm(range(total_days), desc="Processing Days", unit="day"):  # for each day, now with a progress bar
        yield current  # give back the current date
        current += timedelta(days=1)  # add a day to the current date, move to the next day
# giving it a try!
start_date = datetime(2024, 5, 1)
end_date = datetime(2024, 5, 7)  # just a week of data

data = []
for day in date_range_generator(start_date, end_date):
    utc_ts = convert_date_to_utc(day)
    url_query = f"https://api.pullpush.io/reddit/search/submission/?after={utc_ts}&before={utc_ts+86400}&subreddit=latin"
    res = requests.get(url_query)
    if res.status_code == 200:
        for post in res.json()["data"]:  # loop through each post and...
            if post not in data:  # check whether it is already in our data list
                data.append(post)  # if not, then we can add it

len(data)  # more than 100!!
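One caveat worth flagging: a single day can itself contain more than 100 posts, and in that case the request above silently truncates to the first 100. A hedged sketch of one workaround is to split any window that comes back full and recurse; `fetch` here is a hypothetical stand-in for the pullpush GET request:

```python
def collect_window(fetch, start_ts, end_ts, limit=100):
    """Collect all posts between two epoch values, splitting windows that hit the cap.

    fetch(after, before) stands in for the pullpush request and should return
    the list of post dicts whose timestamps fall in [after, before).
    """
    posts = fetch(start_ts, end_ts)
    if len(posts) < limit or end_ts - start_ts <= 1:
        return posts  # window fits under the cap (or can't be split further)
    mid = (start_ts + end_ts) // 2  # possibly truncated: halve the window and recurse
    return collect_window(fetch, start_ts, mid, limit) + collect_window(
        fetch, mid, end_ts, limit
    )
```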
Now that we have a good way of getting our data, we can dump it into a DataFrame and save it as a CSV. Most of the information here is either repetitive or not useful, so we can select only a subset of the most valuable data.
import pandas as pd

cols_of_interest = [
    "author",  # username
    "created_utc",  # when it was posted
    "id",  # id of thread, useful for comments
    "num_comments",  # number of comments
    "score",  # upvotes - downvotes
    "selftext",  # text of the post
    "title",  # title of the post
    "url",  # url to the thread
]

df = pd.DataFrame(data)
df = df[cols_of_interest]
df["created_utc"] = df["created_utc"].astype(int)
df["date"] = df["created_utc"].apply(convert_utc_to_date)  # convert utc to normal date format
df.to_csv("r_latin20240501to20240507.csv")
df.head()
Collecting Comments#
Now that we’ve successfully collected all of the posts in a given time frame, we can turn to collecting comments for each post as well.
# choosing an example from the work above
more_than_one_comment = df[df.num_comments > 1]  # keeping only posts with more than one comment
ex_post = more_than_one_comment.iloc[0]  # first one
ex_post
# we'll need the value at the id column
ex_id = ex_post["id"]
ex_id
Instead of using “submission” in our query url, we will use “comment”.
comments_url = f"https://api.pullpush.io/reddit/comment/search?link_id={ex_id}"
data = requests.get(comments_url).json()["data"]
len(data)  # there will be a mismatch between this number and the num_comments column, as this count also includes replies to existing comments
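The parent_id field can tell the two kinds apart: Reddit prefixes ids with type codes, so a parent_id starting with t3_ points at the submission itself (a top-level comment), while t1_ points at another comment (a reply). A small sketch that splits a comment list on that prefix:

```python
def split_comments(comments, post_id):
    """Separate top-level comments from replies using Reddit's parent_id prefixes."""
    top_level, replies = [], []
    for comment in comments:
        if comment["parent_id"] == f"t3_{post_id}":  # parent is the submission itself
            top_level.append(comment)
        else:  # parent_id starts with t1_, i.e. this is a reply to another comment
            replies.append(comment)
    return top_level, replies
```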
# define a function
def get_comments(id):
    comments_url = f"https://api.pullpush.io/reddit/comment/search?link_id={id}"
    return requests.get(comments_url).json()["data"]
As opposed to the posts, a list of comments like this does not work well as CSV data. CSVs prefer data to all be the same “shape”, meaning every row has the same set of values. Comments are tricky because their number always differs from post to post, which makes it impractical to have a column for each comment: we would end up with a lot of empty columns whenever one post receives far more comments than the rest.
We can coerce it into a CSV, but the JSON format suits this data much better, so we’ll compose a JSON file keyed by the unique IDs from our DataFrame of posts so that the two can be linked together.
# filtering by the data we're interested in
fields_of_interest = [
    "score",
    "replies",
    "id",
    "author",
    "parent_id",
    "body",
    "created",
]

to_json = {}
for id in df.id:  # loop through our ids
    comments = get_comments(id)  # get our comments
    comments_by_id = []  # empty list to hold the comments
    for comment in comments:
        comments_by_id.append(
            {k: v for k, v in comment.items() if k in fields_of_interest}
        )  # filter by our fields of interest
    to_json[id] = comments_by_id  # assign the list of comments to the original id
# this loop will take some time because we have to submit a GET request for every id; for ~100 ids it took 5 minutes
import json  # python json library

with open("r_latin20240501to20240507_comments.json", "w") as f:
    json.dump(to_json, f)  # saves dictionary as json
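To confirm that the CSV and the JSON file really do link up by id, we can reload both and look up the comments for any post. A self-contained sketch with tiny in-memory stand-ins for the two files (the sample ids and text are invented for illustration):

```python
import io
import json

import pandas as pd

# stand-ins for the saved CSV of posts and JSON of comments
posts_csv = io.StringIO("id,title\nabc123,Salvete omnes\n")
comments_json = json.dumps({"abc123": [{"author": "someone", "body": "Salve!"}]})

posts = pd.read_csv(posts_csv)
comments = json.loads(comments_json)

post_id = posts.iloc[0]["id"]  # take the first post's id...
print(comments[post_id][0]["body"])  # ...and look up its comments: prints Salve!
```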
The whole process#
Below are all of the steps we followed, collected into a single Python class. This format lets us customize our inputs without having to worry about breaking the core functionality.
class RedditScraper:
    def __init__(self, start_date, end_date, subreddit):
        self.start_date = start_date
        self.end_date = end_date
        self.subreddit = subreddit
        self.output_file = f"r_{self.subreddit}_{start_date.strftime('%Y-%m-%d')}_{end_date.strftime('%Y-%m-%d')}"
        self.cols_of_interest = [
            "author",
            "created_utc",
            "id",
            "num_comments",
            "score",
            "selftext",
            "title",
            "url",
        ]
        self.fields_of_interest = [
            "score",
            "replies",
            "id",
            "author",
            "parent_id",
            "body",
            "created",
        ]

    def convert_utc_to_date(self, ts):
        """
        Converts a UTC timestamp to a formatted local date string.
        """
        utc_datetime = datetime.fromtimestamp(ts, tz=timezone.utc)
        local_datetime = utc_datetime.astimezone()
        return local_datetime.strftime("%Y-%m-%d %H:%M:%S")

    def convert_date_to_utc(self, dt):
        """
        Converts a local datetime object to a UTC timestamp.
        """
        dt_with_timezone = dt.replace(tzinfo=timezone.utc)
        return int(dt_with_timezone.timestamp())

    def date_range_generator(self, start_date, end_date):
        """
        Yields the next day between start_date and end_date.
        """
        current = start_date
        total_days = (end_date - start_date).days + 1
        for _ in tqdm(range(total_days), desc="Processing Days", unit="day"):
            yield current
            current += timedelta(days=1)

    def scrape_posts(self):
        """
        Scrapes posts from Reddit and dumps output in a DataFrame.
        """
        start_date = self.start_date
        end_date = self.end_date
        data = []
        for day in self.date_range_generator(start_date, end_date):
            utc_ts = self.convert_date_to_utc(day)  # note the self. prefix now that this is a method
            url_query = f"https://api.pullpush.io/reddit/search/submission/?after={utc_ts}&before={utc_ts+86400}&subreddit={self.subreddit}"
            res = requests.get(url_query)
            if res.status_code == 200:
                for post in res.json()["data"]:  # loop through each post and...
                    if post not in data:  # check whether it is already in our data list
                        data.append(post)  # if not, then we can add it
        df = pd.DataFrame(data)
        df = df[self.cols_of_interest]
        df["created_utc"] = df["created_utc"].astype(int)
        df["date"] = df["created_utc"].apply(self.convert_utc_to_date)
        self.df = df
        return self.df

    def save_post_data(self):
        """
        Saves DataFrame to CSV.
        """
        self.df.to_csv(f"{self.output_file}_posts.csv")
        return self.df

    def get_comments(self, id):
        comments_url = f"https://api.pullpush.io/reddit/comment/search?link_id={id}"
        return requests.get(comments_url).json()["data"]

    def scrape_comments(self):
        """
        Scrapes comments from Reddit given the ids from self.df.
        """
        to_json = {}
        for id in tqdm(self.df.id):
            comments = self.get_comments(id)
            comments_by_id = []
            for comment in comments:
                comments_by_id.append(
                    {k: v for k, v in comment.items() if k in self.fields_of_interest}
                )
            to_json[id] = comments_by_id
        self.to_json = to_json
        return to_json

    def save_comment_data(self):
        """
        Saves comments to JSON.
        """
        with open(f"{self.output_file}_comments.json", "w") as f:
            json.dump(self.to_json, f)
        return self.to_json

    def run(self):
        self.scrape_posts()
        self.save_post_data()
        self.scrape_comments()
        self.save_comment_data()
        return self.df, self.to_json
latin_scraper = RedditScraper(datetime(2020, 1, 1), datetime(2020, 2, 1), "latin")
latin_scraper.run()