Getting Data Quickly with Data Dumps and APIs

Nick Gregory
WhizeCo
Oct 27, 2020

One of the most fundamental problems in building a search engine is gathering the content you want to index and make searchable. Websites are so large now that even ones serving a relatively small group of people can have tens of millions of pages, which puts crawling them out of reach for most due to scale and time constraints. Take, for example, StackOverflow, which primarily serves programmers. The site has upwards of 30 million pages at the time of writing. If we were to index Stack with a normal crawler at a respectful rate of 2 requests per second (rps), it would take just shy of 6 months to scrape. If we wanted to get the index time down to a week, the crawl rate would have to be increased to nearly 50rps. And these numbers don’t even include the overhead that crawlers encounter, since they usually pass through a number of non-content pages like listings of the most recent posts, posts in a given category, and so on. Hitting a site at 50rps for a week straight would generally be considered “abusive” (and the crawler would probably be blocked), so what are we to do?
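For the curious, the back-of-the-envelope math works out like this (a quick sketch, using the rough 30-million-page figure above):

```python
PAGES = 30_000_000        # rough StackOverflow page count at the time of writing
SECONDS_PER_DAY = 86_400

def days_to_crawl(pages: int, rps: float) -> float:
    """How many days a full crawl takes at a given request rate."""
    return pages / rps / SECONDS_PER_DAY

print(f"{days_to_crawl(PAGES, 2):.0f} days at 2rps")    # ~174 days, just shy of 6 months
print(f"{days_to_crawl(PAGES, 50):.0f} days at 50rps")  # ~7 days
```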

There are a few options:

  • Data Dumps
  • APIs
  • Common Crawl

Data Dumps

Luckily, StackOverflow and a handful of other sites provide “data dumps”: complete snapshots of the (public) content of their sites in easily parsable files. These come in all shapes and sizes, as there’s no single standard for how sites should provide them. StackOverflow provides 7zip’d XML files, Wikipedia provides bzip’d XML, others provide compressed JSON, and so on. Since nearly all of the sites that provide these use standard formats at the end of the day, it’s generally easy to quickly build tooling to process the dumps and extract whatever is needed. For Stack, it’s as simple as decompressing the 7zip files and parsing the XML in your favorite programming language. There are even some pre-built libraries available, e.g. https://github.com/kjk/stackoverflow.
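As a rough sketch of what that looks like (assuming the dump’s Posts.xml layout of one row element per post, with attributes such as Id, Title, and Body), a streaming parse in Python might be:

```python
import xml.etree.ElementTree as ET

# Stream-parse Posts.xml so the multi-gigabyte file never has to fit in memory.
for event, elem in ET.iterparse("Posts.xml", events=("end",)):
    if elem.tag == "row":
        post_id = elem.get("Id")
        title = elem.get("Title")   # present on questions, absent on answers
        body = elem.get("Body")     # HTML-formatted post body
        # ...feed the post into your index here...
        elem.clear()                # release memory for rows already processed
```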

In addition to site-provided dumps, there are also a few sites which have “community generated” data dumps, where the sites themselves don’t publish the dumps but other users have been able to reconstruct the entire content of the site (generally via the site’s APIs, which we’ll discuss shortly) and publish it. One of the most well-known of these is Pushshift, which provides both historical dumps and near-real-time streaming of Reddit posts and comments. Pushshift has proven to be extremely valuable to researchers (https://scholar.google.com/scholar?cites=7671696188192149307), as it provides a comprehensive view of what people are doing on one of the largest user-driven sites on the internet. It’s quite an interesting data set to look through, and they even provide a small page showing recent statistics of activity across Reddit: https://pushshift.io/. There are also similar user-driven projects for GitHub and Hacker News that are queryable and/or exportable through Google BigQuery.
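As an illustration (assuming a recent Pushshift comment archive, which has shipped as zstandard-compressed, newline-delimited JSON; the body and subreddit fields come from Reddit’s own comment objects), streaming one of the dumps might look roughly like:

```python
import io
import json
import zstandard as zstd

# Stream a Pushshift comment archive without decompressing it to disk first.
with open("RC_2020-09.zst", "rb") as fh:
    reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        comment = json.loads(line)
        print(comment["subreddit"], comment["body"][:80])
```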


APIs

Site APIs can also be of use. While very few sites have data dumps available, many more have public APIs, especially given the trend in recent years toward multiple frontends (website, native phone apps, etc.) on top of the same data. These are often well documented (or even self-explanatory), generally aren’t throttled/rate limited as heavily, can sometimes respond to bulk requests (returning more than one thing at once), and best of all, usually provide the actual content in an easily accessible form, not mangled in HTML. When available, they’re great to scrape from instead of manually crawling a site. That’s not to say they are without their own downsides, however. First and foremost, sites develop their own APIs for their own needs, which means anyone wanting to use an API needs to implement their own client for each and every site they want to get data from.

We’ve used these data sources and more to build our search engine. Check it out over on whize.co

These clients can be written relatively quickly, but it’s still work. Secondly, depending on what the API was built for, it may not provide a way to iterate over all of the content on the site. It’s not uncommon for API results to be paginated, and as part of that, there are often upper limits (usually in the low thousands) on the number of pages you can request before the API either silently breaks or rejects your requests. Lastly, public APIs may have usage requirements that must be abided by, and these rules sometimes state that the API cannot be used to mirror the site content, despite normal crawlers being allowed to scrape the site.
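As a sketch of what such a client looks like (using the public StackExchange API’s /questions endpoint; the page, pagesize, has_more, and backoff fields below are part of its documented response format, though you should double-check the current docs):

```python
import time
import requests

API = "https://api.stackexchange.com/2.3/questions"

def recent_questions(max_pages: int = 25):
    """Walk the paginated /questions endpoint, honoring the API's backoff hints."""
    page = 1
    while page <= max_pages:
        data = requests.get(API, params={
            "site": "stackoverflow",
            "page": page,
            "pagesize": 100,       # the API's maximum page size
            "order": "desc",
            "sort": "creation",
        }).json()
        yield from data.get("items", [])
        if not data.get("has_more"):
            break
        time.sleep(data.get("backoff", 0))  # pause when the API asks us to
        page += 1
```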

Common Crawl

Finally, there is Common Crawl. Common Crawl isn’t a normal data dump, but rather a limited yet entirely public crawl of the web, refreshed nearly every month. Each dump has a few billion pages, which, while it sounds like a lot, generally leaves gaps in the coverage of nearly every website you’ve used. This makes it useful for doing very broad analysis of the web, but not very useful for analysis of a single site, where a complete copy of the site is needed for the best results.
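To get a feel for those gaps, you can query a single monthly crawl’s index directly (a small sketch using the public CDX index at index.commoncrawl.org; the crawl name below is just one of the monthly snapshots):

```python
import json
import requests

# Ask one monthly crawl which StackOverflow question pages it captured.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2020-40-index"

resp = requests.get(INDEX, params={"url": "stackoverflow.com/questions/*",
                                   "output": "json"})
captures = [json.loads(line) for line in resp.text.splitlines() if line]
print(len(captures), "captures in this crawl")
# Each capture records the WARC filename, offset, and length, so you can
# fetch just that page's bytes instead of downloading the whole archive.
```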


It’s also worth mentioning that Common Crawl (and page crawling in general) has another, generally less considered downside: the content on each crawled page needs to actually be extracted. There is a ton of extraneous content on each page that is downloaded (styling, scripts, titles, headers, etc.) that isn’t the main focus of the page, and determining what text is the real “meat” of the page is not trivial. Common Crawl provides “WET” files which extract all text in the page <body> (removing styling markup and scripts); however, they leave a lot of non-body content interleaved in the results, which can pollute any analysis being done on the pages (e.g. menu titles, breadcrumbs, etc.). There have been some efforts to make it easier for sites to be more descriptive and make their content easier to recognize and extract (e.g. the <article>, <header>, <section>, and <footer> HTML tags); however, many sites still don’t use these, and even with those tags, there can still be irrelevant bits of text in them.
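For instance, pulling the extracted text out of a WET file is straightforward with the warcio library (a sketch; the boilerplate-removal step is left out because it is the genuinely hard part):

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate the plain-text "conversion" records in a Common Crawl WET file.
with open("example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        # `text` still interleaves menus, breadcrumbs, etc. with the article
        # body, so a readability-style extraction pass is usually needed
        # before the page is worth indexing.
```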

A small request to site owners

At the end of the day, very few sites have data dumps available, making “quick” analysis of sites infeasible. Most of the time, website operators see providing these as extra work with little to no return, assuming they even consider releasing them in the first place. Some sites do have good reasons not to provide them (primarily private content, too much content, etc.), however, the vast majority of sites out there could offer dumps but do not.

Given that Common Crawl exists, it’s clear that there is a desire to have website content readily available for analysis. Instead of everyone having to duplicate work in crawling, we would like to see an increase in the number of sites that offer data dumps in some form or another, for the benefit of anyone who wants to do their own research or analysis. Specifically, we see the most value in data dumps of sites that store knowledge (Q&A, news, etc.), as these are typically very large sites with significant potential to be used in research. In general, it would not be hard to develop opt-in plugins for common content management systems (CMSs) to provide dumps automatically, which would eliminate potential pain from the site operator’s point of view while allowing new research and tools to be built on the content.
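To be concrete about how little is involved, a dump exporter for a hypothetical CMS could be as small as the sketch below (everything here, including the table and column names, is made up for illustration):

```python
import gzip
import json
import sqlite3

def export_dump(db_path: str, out_path: str) -> None:
    """Write all public posts to gzipped, newline-delimited JSON."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for row in conn.execute(
            "SELECT id, title, body, published_at FROM posts WHERE public = 1"
        ):
            out.write(json.dumps(dict(row)) + "\n")

export_dump("cms.sqlite3", "site-dump.jsonl.gz")
```

Run something like this on a schedule, publish the resulting file, and the site has a data dump.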

Thank you for reading! If you found this interesting, you can head on over here and sign up for our newsletter. You’ll receive a once-a-month compilation of more of our interesting content!


Nick is a cybersecurity researcher, sysadmin, and now CTO of Bismuth Cloud.