Scaling Python Data Science Apps

This post focuses on using Python libraries to solve data science problems while leaving scaling considerations to the infrastructure. No clever patterns or tricky code are needed.

Python is arguably the leading tool for data science applications today. NumPy and SciPy provide a wide range of well-performing algorithms, and pandas is a powerful, easy-to-use library for tabular and time series data. However, scaling apps built on top of these libraries to handle large volumes of data concurrently is not a trivial task.

‘Classic’ Solutions

Traditionally, if your application wanted to get data from an external source, you had two options. The first option is to adopt a client-server approach.

A client-server approach.

Clients send data to the application whenever they need something processed. Because you don’t know when a client might send data, your application has to be ready at all times. If another client’s data is already being processed, handling both concurrently demands multiple threads of execution, which Python does not excel at. Furthermore, you have to deploy and manage multiple instances of your server application.

The second option is to have clients store data on a server, and have your application retrieve it.

Data server model.

There are several potential pitfalls here. How do you return the data to the requesting client? If multiple instances of your application want to process client data, how do you track which data has already been processed or is currently being processed? What if the intermediary runs out of disk space? Again, these problems are best handled using other tools.

Kafka and Kubernetes to the Rescue

At Wireless Registry we use Kafka and Kubernetes to handle high throughput requirements. Kafka moves the burden of delivering and storing data outside of data science applications. Kubernetes then dynamically manages the number of application instances without facing concurrency problems (i.e., Kubernetes does our “horizontal scaling”).

Kafka handles the tricky bits.

In the Kafka world, your application is known as a consumer. Clients that need data processed are producers. Producers will send data (or messages) to Kafka.
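To make the producer side concrete, here is a minimal sketch. It assumes, purely for illustration, that each message carries one row encoded as a UTF-8 CSV line; encode_row and send_rows are hypothetical helper names, not part of any library. The producer argument can be any object with send() and flush() methods, such as a kafka-python KafkaProducer.

```python
import csv
import io


def encode_row(row):
    """Serialize one row (a list of values) as a UTF-8 CSV line."""
    buffer = io.StringIO()
    # No line terminator: one message holds exactly one row.
    csv.writer(buffer, lineterminator="").writerow(row)
    return buffer.getvalue().encode("utf-8")


def send_rows(producer, topic, rows):
    """Send each row as one message, then flush pending sends.

    `producer` is assumed to expose send(topic, value=...) and flush(),
    for example kafka.KafkaProducer(bootstrap_servers="localhost:9092").
    Returns the number of rows sent.
    """
    count = 0
    for row in rows:
        producer.send(topic, value=encode_row(row))
        count += 1
    producer.flush()
    return count
```

Passing the producer in as an argument keeps the helper independent of any broker, so it can be exercised with a stand-in object before a real Kafka cluster is available.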

When your application is ready, it requests the next message (unit of data) from Kafka. You don’t have to worry about threads and synchronization. Should the application have multiple instances in the same consumer group, Kafka divides the topic’s partitions among them, so under normal operation each message is handled by only one instance.
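Kafka splits a topic into partitions, and within a consumer group each partition is read by one instance at a time. The toy function below only illustrates that idea with a simple round-robin split; it is not Kafka's actual assignment code, and assign_partitions and the instance names are made up for the example.

```python
def assign_partitions(partitions, instances):
    """Spread partitions across instances round-robin.

    Returns a dict mapping each instance name to its list of partitions.
    Every partition ends up with exactly one owner.
    """
    if not instances:
        raise ValueError("at least one instance is required")
    assignment = {name: [] for name in instances}
    for index, partition in enumerate(sorted(partitions)):
        owner = instances[index % len(instances)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by three hypothetical application instances.
print(assign_partitions(range(6), ["app-0", "app-1", "app-2"]))
```

When Kubernetes adds or removes an instance, Kafka redistributes the partitions among the remaining members of the group, which is why the application code itself never needs to coordinate.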

For example, let’s say your algorithm testing was done by processing data from a CSV file. The code may look as follows:

import csv
with open("data.csv") as data:
    data_reader = csv.reader(data)
    for line in data_reader:
        process(line)

Pretty straightforward as all the work is done in the process() function. How does the same code look when Kafka is involved?

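import os
from kafka import KafkaConsumer

# The topic and consumer group come from the environment.
topic = os.getenv("TOPIC")
group = os.getenv("CONSUMER_GROUP")
consumer = KafkaConsumer(topic, group_id=group)
# Iterating over the consumer yields messages as they arrive.
for message in consumer:
    process(message)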

Not that different! You read messages from a consumer just like you read lines from a CSV file. All the synchronization and network connections are handled by Kafka.
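One difference from the CSV version is worth noting: with the kafka-python client, each item yielded by the consumer is a record whose payload sits in its value attribute, as bytes unless a deserializer is configured. Below is a minimal sketch of turning that payload back into a CSV-style row. It assumes, hypothetically, that producers send one UTF-8 CSV line per message; decode_row is an illustrative name.

```python
import csv


def decode_row(raw_value):
    """Turn one message value (UTF-8 CSV bytes) into a list of fields."""
    text = raw_value.decode("utf-8")
    # csv.reader expects an iterable of lines, so wrap the single line.
    return next(csv.reader([text]))
```

Inside the consumer loop this would be used as process(decode_row(message.value)), so process() receives the same kind of list it got from csv.reader during testing.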

Kubernetes supplies your application’s configuration through environment variables, which the code reads with the os.getenv() function. There’s no need to hard-code the topic and group, because Kubernetes tells your application what their values are when your instance is deployed.
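Because os.getenv() returns None when a variable is not set, it can help to fail early with a clear error if the deployment is missing a value. A minimal sketch, assuming the same TOPIC and CONSUMER_GROUP variable names used above; load_settings is a hypothetical helper:

```python
import os


def load_settings(env=None):
    """Read the topic and consumer group from environment variables.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    settings = {}
    for name in ("TOPIC", "CONSUMER_GROUP"):
        value = env.get(name)
        if not value:
            raise RuntimeError(f"environment variable {name} is not set")
        settings[name] = value
    return settings
```

The consumer could then be created with the values from load_settings() in place of the two os.getenv() calls.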

Summary

At Wireless Registry we use Python to solve data science problems. We use Kafka and Kubernetes to scale those solutions. As a result, our data scientists can focus on the problem at hand without worrying about scaling. As a bonus, they get to add ‘big data’ tools to their skill sets.