
Reddit Subreddit Metadata Scraper

Updated 12.13.2021

Mountains of Data

In my work in the Computational Communication Lab at UC Davis, a significant source of data for study comes from Reddit. In this project, I wrote a script to gather as much data from Reddit subreddits as possible. This included, but was not limited to, subreddit rules, images, buttons, links, descriptions, and wiki pages. The script was written in Python, and the data was scraped with PRAW.

Much of the project's difficulty came from the program's long runtime. To reduce it, I needed to continually find ways to cut API calls, which Reddit throttles to about 30 per minute. I also configured the script to run under Supervisord, a process control system, so it could run uninterrupted for the many months it might take. To address the possibility of interruption or error, the script was also configured to send regular status emails and an alert email on failure. More than anything, this project was an opportunity to develop my shell skills.
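The throttling idea can be sketched as a pacer that spaces calls to stay under a per-minute budget. This is an illustration assuming Reddit's roughly 30-calls-per-minute limit; the Pacer class and its injectable sleep/clock parameters are my own illustrative construction, not the lab's code (PRAW itself also handles rate limiting internally).

```python
import time

class Pacer:
    """Spaces out API calls to stay under a fixed per-minute budget."""

    def __init__(self, calls_per_minute=30, sleep=time.sleep, clock=time.monotonic):
        # sleep/clock are injectable so the pacer can be tested without waiting.
        self.interval = 60.0 / calls_per_minute
        self._sleep = sleep
        self._clock = clock
        self._last = None  # timestamp of the previous call, if any

    def wait(self):
        """Block just long enough to keep calls at least `interval` apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

# Usage: call pacer.wait() before each API request in the scraping loop.
```

Under Supervisord, a `[program:...]` stanza with `autorestart=true` then restarts the loop automatically if it crashes, which pairs naturally with the failure-alert emails.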

Below is a sample of the data for just a single subreddit! In total, data on 2.8 million subreddits was gathered over a 6-month run.

Duration

4 weeks
March - April 2021

Tools

Python
Supervisord
PRAW

Metadata code sample