My data is taking too long to get!
Or is it?
This is a question/comment we’ve received quite a bit.
“I think there is an issue with your API, when I try to get one ticker it works fine, but many tickers seem to take a REALLY long time.”
— Random Frustrated Data Engineer
The answer, luckily, is very simple: call the data in parallel as opposed to in series.
Imagine that we had 10 tickers that we wanted to get historical data for, and that we wanted to measure how fast we got that data. We can make parallel calls by using threading or asynchronous programming. Let’s look at a simple Python program.
You’ll see 2 methods here: get_tickers_series and get_tickers_parallel. get_tickers_series is the same as sending one person to the data store 10 times to get the data. That is not very fast. It is like driving your car to the grocery store ten separate times, once for each of the ten items that you want. It is much faster to get them all in one go.
So we want to send the requests in parallel as opposed to the way the first one does it, in series. This is a concept in computer science known as “multi-threading”.
Multi-threading is the process of creating multiple “threads” that execute the commands they are given at the same time, as opposed to waiting in line for each one to finish. A thread is a separate flow of execution, meaning your code will now be doing many things at the same time. Almost every language has a library to help you do this, and there are many ways to do this in Python. The above code makes use of Python’s threading library.
The line:
ticker_thread = threading.Thread(target=ts.get_daily, kwargs={'symbol': ticker, 'outputsize': 'full'})
Makes a new thread, and ticker_thread.start() starts it.
It is important to join the threads later. Joining a thread means waiting for it to finish, and we want to wait until all 10 threads are done so that we can see which method was fastest. The timeit library just helps us time the different methods. Here is the output from my run:
CLEARLY, it is faster to do things in parallel, by a factor of 10 in this case!
NOTE: If you try to have a LOT of threads at the same time, you may get errors like ZMQError: Too many open files, because your computer can’t handle the load.
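As a rough, self-contained sketch of the series-versus-parallel comparison described above, the snippet below swaps the real API call for a hypothetical stand-in, fake_get_daily, which simply sleeps for 0.1 seconds to imitate a network round trip. The ticker list, the stand-in, and the timed helper are all assumptions made for illustration, and time.perf_counter is used for timing here in place of timeit. It is not the original program, only one way the idea might look:

```python
import threading
import time

# Illustrative ticker list (an assumption, not taken from the original program).
TICKERS = ["AAPL", "GOOG", "TSLA", "MSFT", "AMZN",
           "NFLX", "IBM", "ORCL", "INTC", "CSCO"]


def fake_get_daily(symbol, outputsize="full"):
    """Hypothetical stand-in for ts.get_daily: waits like a network call would."""
    time.sleep(0.1)  # pretend the API takes 0.1 s to respond
    return {"symbol": symbol, "outputsize": outputsize}


def get_tickers_series(tickers):
    # One request after another: total time is roughly the sum of all the waits.
    return [fake_get_daily(symbol=t, outputsize="full") for t in tickers]


def get_tickers_parallel(tickers):
    results = {}

    def worker(ticker):
        # Each thread stores its own result under its own key.
        results[ticker] = fake_get_daily(symbol=ticker, outputsize="full")

    threads = [threading.Thread(target=worker, args=(t,)) for t in tickers]
    for thread in threads:
        thread.start()  # all requests are now waiting at the same time
    for thread in threads:
        thread.join()   # block until every thread has finished
    return [results[t] for t in tickers]


def timed(func, tickers):
    # Return the function's result together with the elapsed wall-clock time.
    start = time.perf_counter()
    data = func(tickers)
    return data, time.perf_counter() - start


series_data, series_seconds = timed(get_tickers_series, TICKERS)
parallel_data, parallel_seconds = timed(get_tickers_parallel, TICKERS)
print(f"series:   {series_seconds:.2f} s")
print(f"parallel: {parallel_seconds:.2f} s")
```

With a real API the exact numbers will depend on network latency and rate limits, but the shape of the result should be similar: the series version pays for every wait in turn, while the threaded version overlaps them.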
Async
Now threading is great, but it makes use of a lot of workers, and not all computers can make all those calls. Another way that’s less resource-intensive is using asynchronous calls. We break down when to use each in “Which should you use, multithreading or asynchronous”.
import asyncio
from alpha_vantage.async_support.timeseries import TimeSeries

symbols = ['AAPL', 'GOOG', 'TSLA', 'MSFT']

async def get_data(symbol):
    # With no key argument, the package looks for the
    # ALPHAVANTAGE_API_KEY environment variable.
    ts = TimeSeries()
    data, _ = await ts.get_quote_endpoint(symbol)
    await ts.close()
    return data

async def main():
    # Start one request per symbol and wait for all of them together.
    return await asyncio.gather(*(get_data(symbol) for symbol in symbols))

# asyncio.run creates the event loop, runs main(), and closes the loop.
results = asyncio.run(main())
print(results)
Here we have another code sample, using async instead of multi-threading. This is also much faster than calling in series, it is less resource-intensive than threading, and for the most part it is what we’d want to use!
The Alpha Vantage Python package also has native support for async, and its README.md shows a few examples of how to do this.
Python — advanced threading
Python also has a few other functions to make this even simpler to write out, but you have to be a little sharper with Python:
This uses a ThreadPoolExecutor object. This means that instead of explicitly initializing each thread we want to use, we tell this object: “Ok, you can have a maximum of 10 threads at a time (max_workers=10), have them each run the ts.get_daily function, and give them as arguments the different tickers in tickers, and outputsize=full.”
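NOTE: Many computers have only around 4 CPU cores, but a thread pool can hold more threads than that, because threads waiting on a network response use very little CPU. You can test what works on your machine by changing the number of threads and seeing how it behaves.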
This is because the ThreadPoolExecutor’s map method takes a function and iterables as its inputs. Hit these links for more information on lambda, generators, and iterables.
For more information on how the executor.map method interacts with the function it is passed, check out this on Stack Overflow.
Now, I do have to give a bit of a disclaimer, as threading in Python works a little differently than in other languages.
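A minimal sketch of the ThreadPoolExecutor pattern described above might look like the following. To keep it runnable without an API key, ts.get_daily is replaced by a hypothetical stand-in, fake_get_daily, and the ticker list is made up for illustration. The lambda is there because executor.map passes one item from the iterable to the function per call, so the fixed outputsize argument has to be pinned some other way:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative ticker list (an assumption for this sketch).
tickers = ["AAPL", "GOOG", "TSLA", "MSFT"]


def fake_get_daily(symbol, outputsize="compact"):
    """Hypothetical stand-in for ts.get_daily (no network access needed)."""
    time.sleep(0.05)  # imitate waiting for the API to respond
    return {"symbol": symbol, "outputsize": outputsize}


# The pool creates and joins the threads for us, up to max_workers at a time.
with ThreadPoolExecutor(max_workers=10) as executor:
    # map calls the function once per ticker; the lambda fixes outputsize.
    results = list(
        executor.map(
            lambda t: fake_get_daily(symbol=t, outputsize="full"),
            tickers,
        )
    )

print(results)
```

executor.map returns results in the same order as the input iterable, which is convenient when you want to pair each result back up with its ticker. functools.partial would be another way to pin outputsize instead of a lambda.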
Threading in Python runs concurrently instead of in parallel. Only one thread runs at a time, so it only helps when we use it for I/O-bound tasks. It won’t make any difference to use multithreading for CPU-bound tasks (use multiprocessing in this case).
Thanks Toan Quoc Ho!
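To make the I/O-bound versus CPU-bound distinction a little more concrete, here is a small sketch that runs the same CPU-heavy toy function through a thread pool and through a process pool. The function count_primes, the helper run_with, and the workload sizes are all made up for illustration. On a standard CPython build you would generally expect the process pool to be the one that can use several cores for this kind of work, though the actual timings depend on your machine:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound toy task: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_with(executor_class, limits):
    # Both executor classes share the same interface, so they are easy to swap.
    with executor_class(max_workers=4) as executor:
        return list(executor.map(count_primes, limits))


if __name__ == "__main__":
    # The guard matters for process pools: worker processes may re-import
    # this module, and they must not start pools of their own.
    limits = [20_000] * 4
    print(run_with(ThreadPoolExecutor, limits))   # threads take turns on the CPU
    print(run_with(ProcessPoolExecutor, limits))  # processes can run on separate cores
```

For API calls like the ones in this article, the work is almost entirely waiting on the network, so threads or async remain the better fit. The process pool only pays off once the heavy part is computation.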
Let us know what you think, or better ways you found to get data in parallel, or if something was a little confusing here!
For more information on stock APIs, artificial intelligence, or anything else, be sure to follow me on Medium!