Design principles for a real-time machine learning framework

Introduction

Ameya
Oct 10, 2018

The traditional approach to ML has been to operate on large amounts of data offline, create models using supervised learning methods, and then use those trained models for inference in an online environment. While this has had great success, a new class of applications is emerging. These applications need to infer in a real-time environment, react to sensory inputs, perform many small micro-simulations, and then take actions. This is a typical flow in reinforcement learning (RL) based approaches. Learning directly from the real environment can be very resource heavy or limited by bandwidth, so one approach is to update simulated environments and perform many simulations using methods such as Monte Carlo tree search. This implies that the system must be able to perform simulations faster than real time. Existing frameworks such as Spark or TensorFlow don't appear to be sufficient for this kind of workload: they mostly support static dataflow graphs and are generally not good at extending those graphs in real time. Similarly, data-centric frameworks such as MapReduce focus on bulk parallel operations, as opposed to the massive parallelization of short tasks that this new class of applications needs. This paper from RISELab at UC Berkeley explores the design principles for a framework that could support such real-time needs. A prototype built on the principles outlined below, for a reinforcement-learning agent that plays Atari, performs 63x faster than Spark.

Requirements for the new framework

The requirements for a new framework that can schedule such new age applications and their tasks, can be broadly classified in three categories: Performance, Execution Model, Practicality.

Performance

  1. Low latency: Since these ML applications need to react in real time, quick end-to-end responses, on the order of milliseconds, are needed. (R1)
  2. High throughput: Since Monte Carlo simulations used in RL can create up to millions of small tasks, the scheduler framework needs to sustain high throughput. (R2)

Execution Model requirements

3. Dynamic task creation: New tasks will be created dynamically (e.g. via Monte Carlo tree search) based on the execution flow of other tasks and on task durations. (R3)

4. Heterogeneous tasks: Tasks will vary along the dimensions of resources needed and duration of execution. (R4)

5. Arbitrary dataflow dependencies: Small new tasks will depend on the execution of other tasks, dynamically forming a graph of inter-dependent tasks. (R5)

Practical usage requirements

6. Fault tolerance: This is especially tricky in high-throughput, low-latency environments, but it is needed in any distributed system. (R6)

7. Debugging and profiling: ML applications, especially those that rely on reinforcement learning, are compute intensive, so quickly profiling for performance issues is important. Similarly, debugging capabilities in a distributed and stochastic environment help with quicker troubleshooting. (R7)

Given these requirements, the overarching idea of the framework centers on a hybrid approach. A centralized, stateful scheduler helps with fault tolerance, data-graph lineage replay, and cluster-level resource assignments. At the same time, stateless, bottom-up local schedulers can make efficient machine-level decisions and quickly schedule most tasks by exploiting data locality.

A real world example

Let’s look at a real-world example that makes requirements R1 to R5 more concrete: a representative system used by a robot that moves around using video and LIDAR input.

Robot using different ML components for making decisions in real-time. Box sizes and colors represent duration and diversity of tasks.

Such a robot needs to listen to multiple sensory inputs from the physical world and use that data to make decisions. It might use an RNN-based model for inference and policy decisions, and Monte Carlo tree search to simulate more physical conditions. During the tree search, it may decide to follow certain subtrees and ignore others depending on how long a subtree is taking to compute. Most of this needs to happen in real time for the robot to be effective.

As we can see, R1 is necessary for the real-time needs of the system. R2 is necessary to support Monte Carlo tree searches. R3 is needed because taking different tree branches spawns tasks corresponding to those branches. R4 is apparent from the need to support RNN inference tasks (different functions at different layers), sensory input processing, and various physics simulations. As the diagram shows, R5 is needed because of the complex dataflow dependencies present in the system.

Architecture

Based on the requirements R1-R7 above and the real world example, let’s first look at what primitives need to be built that form the basis of a real-time machine learning architecture.

System Primitives

  1. Task creation needs to be non-blocking: a “future” is returned immediately after task creation. Tasks execute asynchronously, and their results can be obtained via futures as needed.
  2. Any function call can be designated for remote execution, to support various task types and their execution on different hardware. The function call can take plain data values or futures as arguments. Passing a future as an argument expresses a dataflow dependency (a DAG edge) between the task that produces the future and the task that consumes it.
  3. Any task can create new tasks, which may themselves be executed remotely. This removes the dependency on the throughput of the specific worker where the original task is running.
  4. A “get” method on a future returns the value of the task when it is ready; this is a blocking call.
  5. A “wait” method takes multiple futures and a timeout, and returns the futures that have completed when the timeout fires or when a certain threshold of tasks has completed. Such a primitive helps meet the latency requirements: it lets the application ignore straggler tasks that take a long time but provide only marginal improvement, which is common in ML algorithms.
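These primitives can be sketched in plain Python using the standard library’s thread pool; this is only an illustration of the semantics, not the framework’s actual API (the real system would distribute tasks across a cluster, and the `remote`, `simulate`, and `aggregate` names are my own).

```python
# Illustrative sketch of the primitives: non-blocking task creation that
# returns futures, futures as arguments (dataflow dependencies), a
# blocking "get", and a timeout-based "wait".
from concurrent.futures import ThreadPoolExecutor, Future, wait

pool = ThreadPoolExecutor(max_workers=4)

def remote(fn, *args):
    """Non-blocking task creation: immediately returns a future.
    Future arguments are resolved before fn runs, which mimics a
    dataflow (DAG) edge between producer and consumer tasks."""
    def run():
        resolved = [a.result() if isinstance(a, Future) else a for a in args]
        return fn(*resolved)
    return pool.submit(run)

def simulate(seed):
    return seed * 2       # stand-in for a small simulation task

def aggregate(a, b):
    return a + b          # stand-in for a task that consumes results

# Dynamic task graph: aggregate depends on two simulate tasks.
f1 = remote(simulate, 1)
f2 = remote(simulate, 2)
f3 = remote(aggregate, f1, f2)   # futures passed as arguments

print(f3.result())               # "get": blocks until ready -> 6

# "wait": return whichever futures finished within the timeout,
# letting the caller ignore stragglers.
done, not_done = wait([f1, f2, f3], timeout=1.0)
```

The key design point is that `remote` never blocks the caller, so a task can keep spawning new tasks (R3) while earlier ones are still in flight.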

Architecture

Proposed hybrid architecture with local and global schedulers

At a high level, each node in the cluster has a local scheduler that manages multiple worker processes, which share data via shared memory. There is also a global scheduler responsible for managing centralized control information. Together these make for a hybrid scheduling approach.

Centralized scheduler

The global scheduler uses a database to manage the centralized control state of the system. As long as the database is fault-tolerant, the system can recover from component failures. In addition, a pub-sub system propagates significant events that different components in the system can react to. The database can be sharded for better scalability.
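A minimal in-memory sketch of this control plane might look as follows. The key-value store stands in for the sharded, fault-tolerant database, and the pub-sub channel for the event system; the `ControlState` class and topic names are hypothetical, purely to illustrate the shape of the design.

```python
# Sketch: centralized control state (a key-value store) plus a pub-sub
# channel that notifies components of significant events.
from collections import defaultdict

class ControlState:
    def __init__(self):
        self.store = {}                       # (topic, task id) -> event record
        self.subscribers = defaultdict(list)  # topic -> callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Record the event first; a fault-tolerant database here is what
        # lets the system replay task lineage after a component failure.
        self.store[(topic, event["task_id"])] = event
        for cb in self.subscribers[topic]:
            cb(event)

state = ControlState()
seen = []
state.subscribe("task_created", seen.append)
state.publish("task_created", {"task_id": "t1", "deps": []})
```

Because every scheduling event flows through the durable store before subscribers see it, a restarted component can rebuild its view of the task graph from the store alone.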

Hybrid scheduling

Given R1 and R2, it makes sense to make scheduling decisions locally in order to exploit data locality. A worker on a given node submits tasks to the local scheduler, which decides whether to schedule the task on that node or to spill it over to the global scheduler. The global scheduler looks at the overall cluster state, data placement, and resource availability when making its decisions. As tasks create other tasks, those can often be scheduled locally for optimal execution.
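The spillover decision can be sketched as follows. The class names, capacity check, and placement result are all hypothetical; the point is only the bottom-up control flow, where the local fast path is tried first and the global scheduler is the fallback.

```python
# Sketch of bottom-up hybrid scheduling: keep a task local when the node
# has a free worker slot and the task's inputs are already resident,
# otherwise spill over to the global scheduler.
class GlobalScheduler:
    def schedule(self, task):
        # Placeholder: would consult cluster-wide state, data placement,
        # and resource availability to pick a node.
        return ("global", "best_node")

class LocalScheduler:
    def __init__(self, node_id, capacity, global_scheduler, local_objects):
        self.node_id = node_id
        self.capacity = capacity            # free worker slots on this node
        self.global_scheduler = global_scheduler
        self.local_objects = local_objects  # object ids resident on this node

    def submit(self, task):
        inputs_local = all(dep in self.local_objects for dep in task["deps"])
        if self.capacity > 0 and inputs_local:
            self.capacity -= 1
            return ("local", self.node_id)  # fast path: data locality (R1)
        return self.global_scheduler.schedule(task)  # spillover

gs = GlobalScheduler()
ls = LocalScheduler("node-1", capacity=1, global_scheduler=gs,
                    local_objects={"obj_a"})
```

With one free slot, the first task whose inputs are local runs on `node-1`; once capacity is exhausted, subsequent tasks spill over to the global scheduler.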

Conclusion

The system design approach outlined in this paper seems like a good way to design a new-age ML framework: the principles lead naturally to the hybrid scheduling approach. In subsequent posts, I intend to cover the Ray framework, which I believe is based on these principles.
