Speed up your HTTP Web Requests with the Asyncio Library in Python
Summary
This is an article about using the Asyncio library to speed up HTTP requests in Python, using data from stats.nba.com. It’s intended for people who aren’t Python experts but are running into input/output (I/O) bottlenecks with large numbers of HTTP requests. (Python code is at the end of the article or available at https://github.com/tkpca/Python-Scripts/tree/master/Web%20Requests.)
Bottom line: I was able to complete 3690 queries (1230 games in an NBA season x 3 different queries/game) in ~5–6 minutes using Asyncio versus ~2–4 hours using the standard requests library. (It could actually run faster, but I found that pushing it would result in an IP ban.)
Here is an example of a scatter chart that was built using the data pulled from above (details on how this was made interactive will come in a future article).
Who is this intended for?
If you are a hobbyist programmer looking for a script you can repurpose for API calls, the code (posted at the end of the article) can be lifted and probably edited for your purposes.
The script I used for the player data is a lot longer, so for this example I pulled the NBA Combine information (https://stats.nba.com/draft/combine/). Both the standard and Asyncio versions of the code are on my GitHub page, https://github.com/tkpca/Python-Scripts/tree/master/Web%20Requests.
Background
I have been looking to brush up on my Python skills, which means finding a personal project. Fortunately, I am a big fan of the NBA and the NBA happens to track a ton of data (there were some challenges figuring out how to use the API but that will be addressed in a separate set of posts). Eventually I figured out how to pull the JSON data from stats.nba.com in Python but found it took forever…I got to the point where I could pull all player data (basic and advanced box score stats) by season, except it was taking 3–4 hours to pull one season of data.
After doing some research I noticed this is somewhat of a limitation of the Python requests library. There are people who can explain the technical reasons for this in more detail (https://hackernoon.com/are-your-python-programs-running-slow-heres-how-you-can-make-them-7x-faster-3d6758cd3305), but in a nutshell, I was doing multiple web queries in my code that all had to happen in sequence: query the server, wait for the result, process it, go to the next query, and so on.
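To make the bottleneck concrete, here is a minimal sketch of that sequential pattern. It is not the actual stats.nba.com code: fake_request is a hypothetical stand-in that uses time.sleep instead of a real HTTP call, and the 0.05 second delay is an arbitrary assumption.

```python
import time

def fake_request(game_id):
    """Hypothetical stand-in for a blocking HTTP call (no network involved)."""
    time.sleep(0.05)  # pretend the server takes 50 ms to answer
    return {"game_id": game_id}

def fetch_all_sequentially(game_ids):
    results = []
    for game_id in game_ids:
        # Each call blocks until its response arrives, so the waits add up.
        results.append(fake_request(game_id))
    return results

start = time.perf_counter()
sequential_results = fetch_all_sequentially(range(5))
sequential_elapsed = time.perf_counter() - start
print(f"{len(sequential_results)} requests took {sequential_elapsed:.2f}s")
```

The total time grows linearly with the number of requests, which is why a few thousand slow queries can turn into hours.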
Why Asyncio?
A friend suggested setting up multiple threads, but quite frankly I wasn’t there yet in Python and wanted something simpler. All I wanted to do was scrape data, so Asyncio + Aiohttp, while admittedly confusing, seemed more reasonable.
Also, make sure you are on Python 3.7 or later (asyncio.run() and asyncio.create_task(), both used below, were added in 3.7). I can’t emphasize this enough since the library seemed to change quite a bit between versions. I didn’t find a lot of the explanations super useful either, though this one was quite helpful: https://realpython.com/async-io-python/.
What does this mean in English?
Real Python has a good analogy on how/why Asyncio works for I/O-bound tasks, but in a nutshell: imagine you are going grocery shopping and need to pick up fruits and some meat from the butcher counter. It takes 5 minutes for the butcher to cut your meat and 10 minutes to grab your fruits; to keep it simple, assume the time to tell the butcher what you want is negligible.
1) You could get your fruits, go to the meat counter, and then wait for your meat to be prepared (or vice versa). Total time is 15 minutes (5+10). This is analogous to the standard Requests approach.
2) You go to the butcher counter and order your meat, then go get your fruits while it is being cut. Total time is now 10 minutes, because the 5-minute wait overlaps with the 10 minutes of fruit picking. This is analogous to the Asyncio/Aiohttp approach.
In other words, rather than stopping to wait for the butcher, you let him do his thing and go about doing something else.
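The two shopping options can be sketched in a few lines of Asyncio. This is only an illustration: the function names are made up for the analogy, and the times are scaled down (0.05 seconds stands in for 5 minutes, 0.10 seconds for 10 minutes).

```python
import asyncio
import time

# Scaled down: 0.05 s stands in for 5 minutes, 0.10 s for 10 minutes.
async def butcher_cuts_meat():
    await asyncio.sleep(0.05)
    return "meat"

async def grab_fruits():
    await asyncio.sleep(0.10)
    return "fruits"

async def shop_one_at_a_time():
    # Option 1: finish one errand before starting the next.
    fruits = await grab_fruits()
    meat = await butcher_cuts_meat()
    return [fruits, meat]

async def shop_concurrently():
    # Option 2: place the order, then pick fruits while the butcher works.
    order = asyncio.create_task(butcher_cuts_meat())
    fruits = await grab_fruits()
    meat = await order  # already done by the time the fruits are picked
    return [fruits, meat]

def timed(coroutine_function):
    """Run a coroutine function to completion and report how long it took."""
    start = time.perf_counter()
    basket = asyncio.run(coroutine_function())
    return basket, time.perf_counter() - start

basket_1, elapsed_1 = timed(shop_one_at_a_time)
basket_2, elapsed_2 = timed(shop_concurrently)
print(f"one at a time: {elapsed_1:.2f}s, concurrently: {elapsed_2:.2f}s")
```

Both versions come home with the same basket; the second one should simply take about as long as the slower errand rather than the sum of the two.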
How does it work?
The attached Jupyter notebook walks through specifics, but at a high level, here is how I interpreted what needs to be done:
1) You need to create a coroutine. Again, there are more technically elegant explanations, but in a nutshell I would describe it as a function that is defined with async def instead of def and that uses the await keyword wherever it has to wait on a slow operation (such as a web request). Async tells Python it is a coroutine, and await pauses the coroutine until the result is ready. The distinction is that, unlike a standard function, a coroutine lets the rest of the code execute while it waits for the value.
2) Once you have some coroutines, you need to schedule them to run. Asyncio does this with asyncio.create_task: call the coroutine as you would a normal function, but wrap the call as asyncio.create_task(<your coroutine here>). This schedules the execution of the coroutine. (Note: create_task was added in Python 3.7; on Python 3.6 or earlier the equivalent is asyncio.ensure_future().)
3) Then you need to gather the results. Real Python explains this better than I can, but basically you set a variable = await asyncio.gather(*<list of tasks from create_task>). This puts all the results into a single list that you can pass back to the main routine.
4) Finally, you set a variable in your main routine to get all of this running. Assuming your code is in a coroutine called main, it would be variable = await main() if you are using a Jupyter notebook (this threw me off). Outside Jupyter, the pattern is variable = asyncio.run(main()). Most articles seem to reference the asyncio.run version, so if you’re using Jupyter hopefully that saves you some frustration.
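Put together, the steps above look roughly like the sketch below. It is a simplified illustration rather than the code from the notebook: fetch_game is a hypothetical coroutine, and asyncio.sleep stands in for the real Aiohttp request so the example runs without a network connection.

```python
import asyncio

async def fetch_game(game_id):
    """Step 1: a coroutine. asyncio.sleep stands in for the HTTP request."""
    await asyncio.sleep(0.01)  # assumed delay; other tasks run while this one waits
    return {"game_id": game_id, "status": "ok"}

async def main(game_ids):
    # Step 2: schedule one task per request so they run concurrently.
    tasks = [asyncio.create_task(fetch_game(game_id)) for game_id in game_ids]
    # Step 3: wait for all of them; results come back in the same order as tasks.
    return await asyncio.gather(*tasks)

# Final step: start everything from a plain Python script.
# In a Jupyter notebook, use `results = await main([1, 2, 3])` instead.
results = asyncio.run(main([1, 2, 3]))
print(results)
```

The reason for the Jupyter difference is that a notebook already has an event loop running, and asyncio.run() raises a RuntimeError when called from inside a running loop.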
And that’s about it. See the embedded code to actually walk through a practical example.
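One practical note, given the IP ban mentioned in the summary: you may want to cap how many requests are in flight at once. A common way to do that is asyncio.Semaphore. The sketch below is one possible approach, not necessarily what my script does; MAX_IN_FLIGHT is an assumed value you would need to tune, and asyncio.sleep again stands in for the real request.

```python
import asyncio

MAX_IN_FLIGHT = 3  # assumed limit; tune it to what the server tolerates

async def fetch_game(game_id, semaphore, in_flight):
    async with semaphore:  # waits here if MAX_IN_FLIGHT requests are already running
        in_flight["now"] += 1
        in_flight["peak"] = max(in_flight["peak"], in_flight["now"])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        in_flight["now"] -= 1
        return game_id

async def main(game_ids):
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    in_flight = {"now": 0, "peak": 0}  # bookkeeping only, to show the cap works
    tasks = [
        asyncio.create_task(fetch_game(game_id, semaphore, in_flight))
        for game_id in game_ids
    ]
    fetched = await asyncio.gather(*tasks)
    return fetched, in_flight["peak"]

fetched, peak = asyncio.run(main(range(10)))
print(f"fetched {len(fetched)} games, at most {peak} requests at once")
```

All the tasks are still created up front, but only MAX_IN_FLIGHT of them can be inside the async with block at any moment, which keeps the load on the server bounded.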