Reddit Subreddit Metadata Scraper
Updated 12.13.2021
Mountains of Data
In my work in the Computational Communication Lab at UC Davis, a significant source of data for study comes from Reddit. In this project, I wrote a script to gather as much data from Reddit subreddits as possible. This included, but was not limited to, subreddit rules, images, buttons, links, descriptions, and wiki pages. The script was written in Python and the data was scraped with PRAW. A rough sketch of the metadata-gathering step follows.
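The sketch below is a minimal illustration of how PRAW exposes this kind of subreddit metadata, not the script itself; the credentials and the subreddit name are placeholders, and the exact fields collected in the real project differed.

```python
# Minimal sketch of collecting subreddit metadata with PRAW.
# Assumes a registered Reddit app; credential values are placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="subreddit-metadata-scraper by u/your_username",
)

def scrape_subreddit(name):
    """Collect rules, descriptions, images, and wiki pages for one subreddit."""
    sub = reddit.subreddit(name)
    data = {
        "name": sub.display_name,
        "title": sub.title,
        "public_description": sub.public_description,
        "sidebar": sub.description,  # full sidebar markdown (links, buttons)
        "subscribers": sub.subscribers,
        "icon_img": getattr(sub, "icon_img", None),    # images, if set
        "banner_img": getattr(sub, "banner_img", None),
        "rules": [
            {"short_name": r.short_name, "description": r.description}
            for r in sub.rules
        ],
        "wiki_pages": {},
    }
    # Each wiki page is a separate API call, so this loop dominates the cost
    # for subreddits with large wikis.
    for page in sub.wiki:
        data["wiki_pages"][page.name] = page.content_md
    return data

if __name__ == "__main__":
    print(scrape_subreddit("AskScience"))
```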
Much of the difficulty in the project came from the program's long runtime. To reduce it, I continually looked for ways to cut API calls, which Reddit throttles to about 30 per minute. I also configured the script to run under Supervisord, a process control system, so it could run uninterrupted for the many months the job might take. To handle interruptions and errors, the script also sent regular progress emails and an email on failure. More than anything, this project was an opportunity to develop my shell skills.
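A rough sketch of that reliability wrapper is below, assuming a reachable SMTP relay; the addresses, host, interval, and the scrape_subreddit() helper from the earlier sketch are placeholders rather than the project's actual configuration. Under Supervisord, a crashed process is restarted automatically, so the failure email mainly serves as an alert.

```python
# Sketch of a run loop that emails progress updates and a traceback on failure.
# SMTP host and addresses are placeholders; scrape_subreddit() is the
# metadata-gathering helper sketched above.
import smtplib
import traceback
from email.message import EmailMessage

SMTP_HOST = "localhost"
FROM_ADDR = "scraper@example.com"
TO_ADDR = "me@example.com"
PROGRESS_EVERY = 10_000  # subreddits between progress emails

def send_email(subject, body):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = FROM_ADDR
    msg["To"] = TO_ADDR
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

def run(subreddit_names):
    done = 0
    try:
        for name in subreddit_names:
            scrape_subreddit(name)
            done += 1
            if done % PROGRESS_EVERY == 0:
                send_email("Scraper progress", f"{done} subreddits completed.")
    except Exception:
        # Email the traceback before exiting; Supervisord restarts the process.
        send_email("Scraper failed", traceback.format_exc())
        raise
```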
Below is a sample of just a single subreddit's worth of data! In total, data for 2.8 million subreddits was gathered over a 6-month run.
Duration
4 weeks
March - April 2021
Tools
Python
Supervisord
PRAW