Our first scalable and distributed architecture for Computer Vision

Wassa Team · Published in Wassa
Jun 27, 2017 · 6 min read

With the rise of Deep Learning, and building on our mobile and back-end expertise, we began to develop a cloud-based solution for our algorithms. It allows us to use computation power out of reach for smart devices, such as heavy GPU-based algorithms.

At first, we wondered: “How will we achieve that?” Since we are quite a small company, we had extra constraints on human resources and technology choices.

The requirements for this architecture were the following:

Scalability

We need to ensure the design supports a heavy ramp-up. The traffic on our product is currently tiny compared to companies like Google, but these days you can’t build this kind of architecture without asking: “What if sales go crazy?”
We learned from what happened to Pokémon Go, even though we would love the same success :) (if we faced their load, we would clearly have problems too).

Modularity

We want the architecture to adapt to our previous, current and future products with as little work as possible. But we know that this platform, our first of this kind, will have weaknesses.
We made technical and technological choices that could turn out to be wrong in the future (if they are not already questionable).
To limit the impact of these choices, we tried to isolate each tool (media server, API, …) so that it performs a simple task and talks to the others through standard interfaces. This lets us replace or update these modules as our needs grow.

Security

We have three main things to protect.

  • Customer data: our customers upload their data to our platform
  • Our expertise: obviously, we don’t want our work to be disclosed
  • Product availability: we need to know how the system will react to heavy load, or even to a DDoS or another kind of attack

User friendly

At Wassa, we develop the full stack: frontend, backend, architecture and the algorithms. The aim of this new architecture is to reduce the interactions between modules and to fully define them, so that we can develop each part on its own and integrate it without changing the other parts.

Few tools

We needed to focus on a limited set of tools to flatten the learning curve of new technologies as much as possible. Limiting the tools also let us keep development consistent and made bug fixing easier.

The API

This part is a Symfony application that handles client sessions and security. It is the only component our final clients know about and serves as the entry point for our customers. Each user is granted an API key to authenticate their API calls, which lets us limit either the bandwidth or the number of captures allowed per client.

When a client sends a capture, the API checks the data and either returns a comprehensible error or pushes a job onto the work queue described below.
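The production entry point is a Symfony (PHP) application, so the Python sketch below is only an illustration of the two mechanisms described above: a per-key capture quota backed by Redis and the hand-off of a validated capture to the task queue. The key names, quota value and job format are assumptions.

```python
# Illustrative sketch only: the real API is a Symfony app. It shows a
# per-API-key daily quota and the push of a validated capture onto the
# Redis task queue consumed by the workers (see the worker loop below).
import json
import uuid

import redis

r = redis.Redis(host="redis-host", port=6379)
DAILY_CAPTURE_QUOTA = 1000            # assumed value, not the real quota


def allow_capture(api_key):
    """Return True if this API key may submit one more capture today."""
    counter = "quota:" + api_key
    used = r.incr(counter)            # atomic increment on every call
    if used == 1:
        r.expire(counter, 24 * 3600)  # the counter resets after one day
    return used <= DAILY_CAPTURE_QUOTA


def enqueue_capture(api_key, image_url):
    """Push a job description onto the task queue and return its id."""
    job = {"id": str(uuid.uuid4()), "api_key": api_key, "image": image_url}
    r.lpush("task_queue", json.dumps(job))  # workers pop from the other end
    return job["id"]
```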

The Task Queue

The Task queue is in fact composed of 4 main lists:

  1. The “Task Queue” is a FIFO queue that contains all the jobs to do.

  2. The “Worker List” contains the active workers connected to the task queue, with each worker’s id and some of its characteristics, such as GPU capability or CPU optimizations (BLAS, NNPACK, etc.).

  3. The “Failed List” contains all failed jobs, along with the error message returned by the algorithm.

  4. The “Result List” contains the results of the jobs that succeeded.

In addition to these, each worker maintains its own list containing its current work.

The queue relies on the Redis command RPOPLPUSH, which does almost all the work. Thanks to the Redis implementation, RPOPLPUSH is an atomic operation, so two workers can never process the same job. In pseudo code, the worker loop can be described as follows.
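Below is a minimal sketch of that loop in Python with redis-py; the key names, the worker id and run_algorithm() are illustrative assumptions rather than the exact production code.

```python
# Minimal worker loop sketch. BRPOPLPUSH atomically moves a job from the
# shared "task_queue" into this worker's own in-flight list, so two
# workers can never pick up the same job.
import json

import redis

r = redis.Redis(host="redis-host", port=6379)

WORKER_ID = "worker-1"
PROCESSING = "worker:" + WORKER_ID + ":processing"  # this worker's own list


def run_algorithm(image_url):
    """Placeholder for the actual computer vision algorithm."""
    raise NotImplementedError


while True:
    # Block until a job is available, then claim it atomically.
    raw = r.brpoplpush("task_queue", PROCESSING, timeout=0)
    job = json.loads(raw)
    try:
        result = run_algorithm(job["image"])
        r.lpush("result_list", json.dumps({"id": job["id"], "result": result}))
    except Exception as exc:
        r.lpush("failed_list", json.dumps({"id": job["id"], "error": str(exc)}))
    finally:
        # The job is finished (or failed): drop it from the in-flight list.
        r.lrem(PROCESSING, 1, raw)
```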

Media Server

Our code uses large databases. They are not large in number of items, but large because most of our data is images: deep learning algorithms use millions of images for the main training and at least a few thousand for fine-tuning.

This data must be accessible from the API (for the customers) and from the compute processes, so we need a central storage system to avoid unnecessary duplication.

For the first release, we used the Bottle.py framework with the Tornado HTTP server. This simple configuration performs really well for the simple task of serving images. We nevertheless added a cache layer using Redis with expiring keys to speed up image transfers to the deep learning processes.
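As a rough illustration of this setup, here is a sketch using Bottle with its Tornado server adapter and Redis as the cache; the media folder, key names, TTL and JPEG-only assumption are placeholders.

```python
# Sketch of the media server: Bottle for routing, Tornado as the HTTP
# server, Redis as a cache with expiring keys. Paths and TTL are assumed.
import os

import redis
from bottle import Bottle, abort, response, run

app = Bottle()
cache = redis.Redis(host="redis-host", port=6379)
MEDIA_ROOT = "/var/media"   # assumed storage location
CACHE_TTL = 300             # seconds before a cached image expires


@app.route("/media/<name>")
def serve_image(name):
    name = os.path.basename(name)          # avoid path traversal
    key = "media:" + name
    data = cache.get(key)                  # cache hit: serve straight from Redis
    if data is None:
        path = os.path.join(MEDIA_ROOT, name)
        if not os.path.isfile(path):
            abort(404, "Unknown image")
        with open(path, "rb") as f:
            data = f.read()
        cache.setex(key, CACHE_TTL, data)  # cache miss: store with an expiring key
    response.content_type = "image/jpeg"
    return data


run(app, server="tornado", host="0.0.0.0", port=8080)
```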

The media server is not accessible to customers, for several reasons including security and performance. Clients have to go through the API server, so that we can check their permissions on the requested image before letting the media server do its job.

Web socket

We added a NodeJS server to provide real-time notifications and to monitor work in progress.

We designed the architecture with a single entry point for the algorithms: a Redis instance hosting the task queue and the result queue. We also use this Redis instance to pass Pub/Sub messages from ongoing processes.

But the synchronous and asynchronous worlds are hard to plug together, so we use this NodeJS server to convert Redis Pub/Sub messages into WebSocket notifications and web hooks to customers or to the internal API.
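The relay itself is written in NodeJS; the Python snippet below only sketches the two ends of the Redis Pub/Sub channel it bridges. The channel name, message format and forwarding function are assumptions.

```python
# Both ends of the Pub/Sub channel. In production the subscriber side is
# the NodeJS server, which forwards each message over WebSocket or as a
# web hook; here the forwarding is left as a stub.
import json

import redis

r = redis.Redis(host="redis-host", port=6379)

# Worker process: publish a progress update for a running job.
r.publish("job_progress", json.dumps({"job_id": "42", "progress": 0.5}))

# Bridge process: listen to the channel and forward every message.
pubsub = r.pubsub()
pubsub.subscribe("job_progress")
for message in pubsub.listen():
    if message["type"] != "message":
        continue                            # skip subscription notifications
    event = json.loads(message["data"])
    # forward_to_websocket_clients(event)   # hypothetical forwarding function
    print("progress update:", event)
```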

Full Design

Full architecture design

Stress test on Raspberry Pi

When we stepped back, the complexity of the architecture scared us. We were afraid it would increase computation time, because the data is transferred between many modules. Furthermore, each module needs to handle errors and the availability of the next module in the chain.

To properly estimate the overhead of the whole pipeline, we deployed our modules on Raspberry Pi 3 boards. With this configuration, we tested one of our algorithms, Facelytics, which detects people’s age and gender.

Obviously, the algorithm itself does not run on a Raspberry Pi, but every other part does. A total of three Raspberry Pis are used: one for Redis, one for the backend and one for the media server.

Installation

Even though the Raspberry Pi is ARM-based, installation was easy: thanks to the community, almost every library exists in an ARM version.

Performances

We tested two configurations: one running Facelytics on CPU only, and one with a GeForce GTX 1070. Both configurations use only one worker at a time. The GPU-based algorithm turned out to be too fast for this low-cost setup.

As shown in the following image, the computation time is lower than the image upload time on the Raspberry Pi, so this test does not really represent a regular configuration.

Performances with GPU

The second test uses the CPU-only version of the process. This configuration allowed us to observe jobs queuing up.

Performance with CPU

We got really good results even on this low-cost configuration, faster than we expected.

Conclusion

We now have our first architecture. It already supports most of our algorithms, but it can still evolve a lot in the future:

  • Using containers to manage the auto-scaling of workers.
  • We currently use our own monitoring tool; if we fully switch to Kubernetes or Docker Swarm, we will get access to reliable tools like Istio.
  • We will improve crash robustness with Redis Sentinel and replication.

Do you want to know more about Wassa?

Wassa is an innovative digital agency expert in Indoor Location and Computer Vision.

Whether you are looking to help your customers find their way in a building, enhance the user experience of your products, collect data from your customers or analyze human traffic and behavior in a location, our Innovation Lab brings scientific expertise to design the solution best adapted to your goals.

