How to make fast, reliable APIs out of slow, unreliable websites

Greg Dingle
Feb 13, 2019 · 6 min read

For a recent project, we had to automate a workflow across several different bioinformatics websites that were never intended to be automated. This posed some interesting problems that we managed to solve quite nicely in the end. I’ll detail the problems and solutions here so that if you’re ever in a similar situation, you can avoid repeating our efforts.

Background

The finished product was to be a web tool that scientists in the Chan Zuckerberg Biohub and elsewhere could use to design and analyze CRISPR edits in batch (now live at crispycrunch.czbiohub.org). With rapid advances in the field of genome engineering, scientists are devising ever-larger experiments that require automation to be practical. At the Biohub, Manuel Leonetti’s team is inserting fluorescent proteins into all 22,000 protein-coding genes in the human genome. Similarly, Ryan Leenay edited 1,521 different genome locations in order to train a model that predicts edit outcomes in immune system T-cells. In order to do experiments of this scale, the scientists needed to automate their manual workflows. Part of CZI’s mission is to apply modern engineering and product methods to accelerate biological research.

Software architecture

The scientists we worked with were already using two key web tools in their CRISPR work: Crispor and Crispresso. A scientist would design a CRISPR edit in Crispor, run their experiment in the wet lab, then analyze the results with Crispresso. Our software, named CrispyCrunch (notice a theme here?), unites the two functions of design and analysis, and it enables batch processing — up to 96 edits in one experiment, the standard size of a cell culture plate.

How CrispyCrunch depends on outside tools

We decided to build CrispyCrunch on top of Crispor and Crispresso because they were already trusted by our scientists, we wanted a working prototype ASAP, and we could replace them with our own components later. The risk was in integration: these tools were not designed to be part of a larger system.

Asking permission

As a preface to all the technical hacks below, it’s worth saying that we asked the owners of Crispor and Crispresso early on whether they would be okay with us treating their sites as APIs. While we didn’t expect them to do any work for us, we hoped they would not block CrispyCrunch at the first sign of trouble. Happily, Luca and Max agreed.

Problem #1: How to make an API out of a website

This is the classic problem of web scraping, which nearly every developer runs into sooner or later. The problem is not so much getting the data out — that part is easy with libraries such as BeautifulSoup, which extract data from web pages by CSS selectors — as making the results reliable and the errors meaningful.

To that end, we employed the following techniques:

  • Use HTML selectors that are least likely to change. For example, prefer HTML IDs over element types.
  • Select the data you want from the DOM directly by class names or IDs. Avoid selecting child or parent elements, which may easily change location in the document hierarchy.
  • When POSTing, don’t rely on default values, which may change. Provide explicit values for all fields.
  • Learn all possible types of errors by systematically varying GET and POST params.
  • Promote string errors to exceptions. An HTTP request may return 200 OK while the web page text contains a blatant “ERROR!!!” (see the sketch after this list).
  • Use assertions to codify assumptions. For example, assert that some HTML list has a non-zero number of elements.
  • Expect the unexpected. Log copiously and get notified of exceptions by email or by a service such as Sentry.
  • Retry to work around intermittent failures, using backoff as explained below (see Problem #7).
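
For example, here is a minimal sketch of the error-promotion and assertion techniques above, using requests and BeautifulSoup. The URL, the #results-table selector and the ScrapeError class are illustrative, not part of Crispor or Crispresso:

import requests
from bs4 import BeautifulSoup

class ScrapeError(Exception):
    """Raised when a page returns 200 OK but its text signals failure."""

def fetch_report_rows(url):
    response = requests.get(url)
    response.raise_for_status()  # promote HTTP errors to exceptions

    # Promote string errors to exceptions: the server may return 200 OK
    # while the page body contains an error message.
    if 'ERROR' in response.text:
        raise ScrapeError(f'Error text in response from {url}')

    soup = BeautifulSoup(response.text, 'html.parser')
    # Select by ID, the selector least likely to change.
    rows = soup.select('#results-table tr')

    # Codify assumptions with assertions.
    assert len(rows) > 0, f'Expected at least one result row at {url}'
    return rows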

Problem #2: How to increase an API client’s throughput

Once we had the base functionality we needed, the next problem was to make it fast. The obvious solution was to send independent requests in parallel.

Using Python and the ubiquitous requests library, you could do something as simple as this:

import requests
from concurrent.futures import ThreadPoolExecutor

paths = [
    '/view_report/mNGplate1',
    '/view_report/mNGplate2',
    '/view_report/mNGplate3',
    # ...more
]

def _request(path):
    return requests.get('http://crispresso.rocks' + path)

with ThreadPoolExecutor() as pool:
    results = pool.map(_request, paths)

Note: ThreadPoolExecutor is typically more efficient than ProcessPoolExecutor for web requests because a thread uses less memory than a process, and the Python GIL is released while blocking on I/O.

Problem #3: How to make an API client resume a partially succeeded batch

When a batch of requests takes more than a few seconds, it’s desirable to be able to restart it without losing the progress made so far. This comes up in several common situations:

  • Intermittent failure
  • Debugging
  • The user needs to go back and change something

The solution here again is simple: cache HTTP responses so that a request for a previously downloaded resource returns a cached copy. When restarting a batch, only incomplete requests will be re-executed.

For caching, we used the handy requests_cache module. Some tips on its usage:

  • Unlike the tutorial, construct an explicit cache session object (CachedSession) rather than patching requests globally, to avoid side effects.
  • You should explicitly delete the cache entry for a URL if you determine that the content is somehow invalid, so you can try again with a fresh request. In other words, errors that are not HTTP errors need to be handled specially.
  • requests_cache will not cache POST requests that include files, because the file boundary will be a random string. You can fix that by monkey-patching urllib3. (This took a while to figure out!)
import urllib3
urllib3.filepost.choose_boundary = lambda: 'crispycrunch_super_special_form_boundary'
  • The remote server may cache responses, sometimes in a way you don’t want. For example, one of the APIs we used cached error responses 🙁. The fix was to append a random string to the URL so it always looks new from the server’s point of view, and to tell requests_cache to ignore that random param, so the URL stays the same from the client’s point of view (see the sketch after this list).
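
Putting these tips together, a sketch of an explicit cached session might look like the following. The cache name, URL and cachebuster param are illustrative, and the exact cache-deletion method and ignored_parameters support vary between requests_cache versions:

import uuid
import requests_cache

# An explicit session avoids the global monkey-patching done by
# requests_cache.install_cache().
session = requests_cache.CachedSession(
    cache_name='crispycrunch_cache',
    ignored_parameters=['cachebuster'],  # exclude our random param from cache keys
)

url = 'http://crispresso.rocks/view_report/mNGplate1'
# Defeat unwanted server-side caching with a random param; the client-side
# cache ignores it, so repeated calls still hit the local cache.
response = session.get(url, params={'cachebuster': uuid.uuid4().hex})

if 'ERROR' in response.text:
    # Not an HTTP error, so handle it specially: drop the cached copy
    # so the next attempt goes back to the server.
    session.cache.delete_url(url)  # newer versions: session.cache.delete(urls=[url])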

Problem #4: How to make an API client testable in isolation

If you have complex response parsing code, you should have unit tests for it that you can run isolated from the internet. The problem is that constructing test data can be time-consuming and awkward. HTML pages can be hundreds of KBs in size.

Caching comes to the rescue again. You can use requests_cache to store a full response in SQLite, then commit the SQLite file to version control. Instant test isolation! Vary the name of the file in test mode, and add it as an exception to your gitignore file (or equivalent). Note: you can verify that a request came from the cache with the from_cache attribute.
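
A minimal sketch of such a test, assuming a test_cache.sqlite file that was populated by a real run and committed to the repo (the file name and URL are illustrative):

import unittest
import requests_cache

class ReportParsingTest(unittest.TestCase):
    def setUp(self):
        # 'test_cache' resolves to test_cache.sqlite in the default SQLite backend.
        self.session = requests_cache.CachedSession(cache_name='test_cache')

    def test_report_comes_from_cache(self):
        response = self.session.get('http://crispresso.rocks/view_report/mNGplate1')
        # Verify the response was served from the committed cache, not the network.
        self.assertTrue(response.from_cache)
        # Parsing assertions on response.text would go here.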

Problem #5: How to make your client not crash the API

Even official web APIs may be brought to their knees by surprisingly little traffic. The only solution to this problem is to stress test the API so that you discover traffic limits before your users do, and rate limit accordingly (see Problem #7 below). Note: you may want to alert the API owner before you do so!

For stress testing, you can use the Apache Bench (ab) command-line tool. Note: servers themselves may cache responses in various ways, so be careful to make your load test realistic.
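
If you want a rough, pure-Python alternative (only a sketch; ab gives much better latency reporting), you can sweep concurrency levels with the same ThreadPoolExecutor pattern from Problem #2 and watch where latencies or errors spike. The URL is illustrative:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = 'http://crispresso.rocks/view_report/mNGplate1'  # illustrative

def timed_get(_):
    start = time.time()
    status = requests.get(URL).status_code
    return time.time() - start, status

for concurrency in (1, 2, 4, 8, 16):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_get, range(concurrency * 5)))
    latencies = [latency for latency, _ in results]
    errors = sum(1 for _, status in results if status >= 500)
    print(f'{concurrency} workers: max latency {max(latencies):.2f}s, {errors} server errors')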

Problem #6: How to avoid race conditions in API calls

A nasty problem we discovered when chaining API calls was intermittent 404s. This is symptomatic of a race condition in which a resource is not ready although preceding HTTP response codes imply that it is. Such race conditions may not be revealed on a website until it is used programmatically, because a human user cannot act as fast as a computer client.

The solution to race conditions in a system you don’t control is either:

  • Introduce an artificial delay such as
import time; time.sleep(1)
  • Block on the resource becoming available
import time
import requests

def _request(path):
    return requests.get('http://crispresso.rocks' + path)

while _request(path).status_code == 404:
    print(f'Waiting on {path}')
    time.sleep(1)

Note: if you are caching responses as described above, the polling request is served from the cache when no waiting is needed, so you won’t pay the cost of an extra round trip.

Problem #7: How to keep just under API rate limits

Some APIs might be nice enough to report when you are stressing them too much. If you want to optimize within the limit, you can use a client-side rate limiter such as ratelimit. Combine it with exponential backoff and you have a respectful and performant client.

import requests
from ratelimit import limits, RateLimitException
from backoff import on_exception, expo

@on_exception(expo, RateLimitException, max_tries=8)
@limits(calls=15, period=900)  # 15 calls per 900 seconds (15 minutes)
def call_api(url):
    return requests.get(url)

Match calls and period to the observed or stated rate limit of the API.

Conclusion

These are the problems we had to solve to make CrispyCrunch work smoothly with the Crispor and Crispresso web tools, which were never intended to be web services. We hope you find the details helpful. If you’d like to learn more about CZI Engineering, check out our Technology page.
