Topic-based versioning architecture for scalable AI applications

In this article, I explore software design enhancements the Source AI team and I learned at Gamma that can maximize reproducibility, testability and scalability in production-ready AI-driven applications.

Quentin Leffray
GAMMA — Part of BCG X
9 min read · Mar 23, 2020


We have refined this design in projects at large European retailers and airlines, and in our own Source AI platform. Today, these enhancements power machine learning and optimization models at scale, enabling our teams to iterate faster and deliver without fear of losing valuable insights.

Establishing Principles

Our first step in enhancing software design is to establish principles for our algorithms: They must always return a result, never change any state outside of their local environment and, given a set of unchanging input parameters, always return the same result. These principles make the algorithms deterministic, cacheable and testable. They enable efficient applications, but not reproducible ones: if the source data changes, a past run can no longer be replayed.

Guided by these principles, we can assert that, given the same application configuration, input arguments, input data and source code, we will obtain the exact same output result. As an alternative to total-input versioning, we can use topic-based organization for our input data. As per our principles, topics can only be appended to, and messages must be immutable, linearly ordered and consistent within a topic. Furthermore, input data should be referenced by topic and cursor for every topic the application consumes. This data layout, alongside our design principles, enables total reproducibility and scalability for AI-driven applications.

The Uniqueness of Production Software

I’d like to share some knowledge I acquired while working at BCG Gamma as a Software Engineer. One aspect of my job is to think about the best software design given constrained development time and tight deadlines. In my experience, a specific design pattern has proven time and again to be useful when manipulating data frames in Analytics or Data Science projects. I’m not claiming that this pattern is novel or original, nor that I am the first to reveal these principles. In fact, you can find similarities with internal database architecture in the form of a shared commit log. My reason for including it here is to help sprout ideas and discussions around the subject by organizing and sharing that knowledge.

Our shared objective is clear: To improve reproducibility, testability and scalability in AI applications. We have accomplished all three first-hand at BCG Gamma while implementing AI at scale in production projects at large European retailers and airlines, and on our own Source.ai platform. Let me be clear: production software has very different requirements compared to a PoC or an MVP. Something considered acceptable in an earlier stage of a project may have consequences in later phases of production that a PoC or MVP would never face. This is to say that the initial application design matters, because defects are harder to correct when scaling up in width (features) and depth (productionization).

It is equally important to retain all experiment data when prototyping. I’ve witnessed many teams struggle to recover or recreate results they had just presented in a meeting. Failing to retain data not only damages a team’s credibility; it can also lead to the loss of valuable insights that may play a key role in follow-up work or comparisons. Suffice it to say: Your data must be accessible at all times.

Of course, extra performance is always welcome — especially in software that will be used several times a day and drive the company’s bottom line.

Term Definitions

SIDE EFFECT

In computer science, a function or expression is said to have a side effect if it modifies some state outside its local environment (such as when reading a file or writing to a socket).

PURE FUNCTION

In computer science, a pure function is a function that evaluates with no side effects.

DETERMINISTIC

A deterministic algorithm is an algorithm that, given a particular input, will always produce the same output.

CACHEABLE

An expression is called cacheable (referentially transparent) if it can be replaced with its corresponding value without changing the program’s behavior.

MEMOIZATION

Memoization is an optimization technique used to speed up programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again.
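
As a minimal illustration in Python (the fibonacci function here is just a stand-in for any expensive pure computation, and functools.lru_cache is one off-the-shelf way to memoize):

    import functools

    @functools.lru_cache(maxsize=None)
    def fibonacci(n: int) -> int:
        # Pure and deterministic, so each result can be cached and returned
        # directly when the same input occurs again.
        return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)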

TESTABLE

There is a strong correlation between good design and a testable system. Code that has weak cohesion, tight coupling, redundancy and lack of encapsulation is difficult to test.

Designing AI-Driven Scalable Applications

First, let’s establish some principles for the algorithms in our applications:

  • Functions must always return a result.
  • Functions must never change the state outside of their local environment (i.e. with no side effects).
  • Functions must always return the same result given a set of input parameters (referential transparency).

You may wonder how, if functions are forbidden from creating side effects, they can interact with other programs or humans. For that purpose, an application is built with a “pure” algorithmic core and an “impure” wrapper around it. Core functions may call only other pure functions; the wrapper alone is free to call pure or impure functions and to create side effects.

The wrapper must do its best to preserve the deterministic property of the algorithm. In other words, an algorithm can expect to receive from the wrapper a sales table extracted from a SQL database. In this case, the wrapper is allowed to initiate the connection and fetch the table in question, but not to update rows at that time. Some classes of errors cannot be avoided (think database connections, full disks), so true deterministic behavior is only a best effort. From these practical considerations, we can derive the following best practices (a sketch follows the list):

  • Wrappers should read data only before the algorithm is called.
  • Wrappers should write results only after the algorithm has returned.
  • Wrappers are responsible for reading configurations and transmitting their value to algorithms.
  • Applications should fail fast, hard and explicitly. Defer the retrying logic to their scheduler.
  • Write unit tests for the pure core, and integration tests for the impure parts.
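
Here is a minimal sketch of this separation in Python. All names (config.json, the sales table, the forecast function) are illustrative assumptions, not taken from any real project:

    import json
    import sqlite3

    import pandas as pd

    def forecast(sales: pd.DataFrame, horizon: int) -> pd.DataFrame:
        # Pure core: no I/O, no global state; the same inputs always
        # produce the same output.
        return (sales.groupby("store")["amount"]
                     .mean()
                     .to_frame("forecast")
                     .assign(horizon=horizon))

    def main() -> None:
        # Impure wrapper: read configuration and data BEFORE the core runs.
        with open("config.json") as f:
            config = json.load(f)
        with sqlite3.connect(config["db_path"]) as conn:
            sales = pd.read_sql("SELECT store, amount FROM sales", conn)
        # Call the pure, deterministic core.
        result = forecast(sales, horizon=config["horizon"])
        # Write results only AFTER the algorithm has returned.
        result.to_parquet(config["output_path"])

Note how every side effect lives in the wrapper, so the core can be unit-tested with an in-memory data frame and no database at all.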

For your algorithms to remain pure, you must avoid certain functions, such as unseeded random generators: instead, pass the seed as a parameter or configuration value. Database selects and file reads must likewise stay out of the core; let the wrapper perform them and pass their results as parameters. Database updates/deletes and file writes change the state of the database, the filesystem or the network, and belong in the wrapper as well. Logger calls are an exception to the rule and, as such, can be used without restriction.
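
For instance, a sketch of threading the seed through as an explicit parameter (sample_customers is a hypothetical helper):

    import numpy as np

    def sample_customers(customer_ids: list[str], n: int, seed: int) -> list[str]:
        # The seed arrives as a parameter (e.g. from the wrapper's
        # configuration), so the same inputs always yield the same sample.
        rng = np.random.default_rng(seed)
        return list(rng.choice(customer_ids, size=n, replace=False))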

From Deterministic to Reproducible

These principles enable testable and scalable applications. But how can you also enable reproducibility if the source data changes? With the application’s algorithm being deterministic, we can assert that, given the same configuration, input arguments, input data and source code, we must obtain the exact same output result — or no result at all if the application encountered an error. We assume dependencies are properly pinned to explicit versions. Some metadata are easier to obtain than others: configuration, input arguments and source code version do not usually pose problems, but input data may change, since the application has no control over its content or availability.

One possible approach to this problem is to collect all data from all sources into a controlled, versioned environment. This requires that the application copy the entire dataset each time it runs, which can prove quite expensive. In the next part, we will explore a different approach — one that allows for less-expensive storage, as well as for some optimizations.

Topic-based Architecture

Term Definitions

TOPIC

An ordered folder, table, or collection containing messages.

MESSAGE

A file, row, or entry containing an arbitrary amount of data and following an established schema.

CURSOR

A comparable pointer to a specific message within a topic. Usually an integer.

As we did before, let’s establish some principles:

  • Topics are append-only: You cannot delete any messages once they are committed to a topic.
  • Messages are immutable: You cannot update messages once they are committed.
  • Messages are linearly ordered within a topic.
  • Messages should be consistent within a topic, sharing identical or assimilable schema.

Now, let’s unwrap those principles into practical considerations. First, applications can read from multiple topics and write results to a different one. Although reading and writing messages to the same topic within the same application is possible, this practice should be reserved for specific use cases. Logs and metadata can also be written to dedicated topics, the schemas used for reading and writing may differ, and the only way to remove previously committed messages is to delete the entire topic.

To enable reproducibility, applications must take a reference to their input data as a topic and cursor pair. This approach allows a standalone application to read from several sources — considering their state at a given point in time — without mandatory duplication.

Integrating with other sources is always more tedious, and this pattern is no exception to the rule. Reading data from an externally controlled relational database will, for example, often include a copy step to extract the data and import it into our system. This step becomes entirely optional if you can guarantee that the external database upholds our design and follows our principles.

A system implemented with these principles can only expose the following functions:

- get : cursor -> topic -> message
- put : message -> topic -> cursor

Please appreciate the symmetry of their definitions. You may also define optional support functions, such as the following (a toy implementation sketch appears after the list):

- list : topic -> cursors
- delete : topic -> ()
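
To make the interface concrete, here is a toy in-memory sketch in Python. A real implementation would of course sit on a durable store (a Kafka topic, a database table, a folder of files) rather than a Python list:

    Message = bytes
    Cursor = int

    class Topic:
        """An append-only, linearly ordered collection of immutable messages."""

        def __init__(self) -> None:
            self._messages: list[Message] = []

        def put(self, message: Message) -> Cursor:
            # Append-only: committing a message returns its cursor.
            self._messages.append(message)
            return len(self._messages) - 1

        def get(self, cursor: Cursor) -> Message:
            # Messages are immutable: a cursor always resolves to the same message.
            return self._messages[cursor]

        def list(self) -> list[Cursor]:
            # All cursors committed so far, in linear order.
            return list(range(len(self._messages)))

Deleting a topic, in this toy model, is simply dropping the whole object; there is no per-message removal, by design.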

The following illustration is a simple example of topic-based architecture in an AI-driven application:

Topic-based versioning architecture

Now that we’ve covered the basics, we can dive into more advanced usages.

Reproducible Data Science

A pipeline is an ensemble, mesh or DAG of functions (or applications) sharing similar properties and scheduled at different intervals. Each pipeline step may read from a set of topics, compute one or more results, and write those results to dedicated topics. To enable easier auditing, pipelines may compile their configuration, input arguments and git commit hash into a single message and dump it to a “run configuration” topic. When the pipeline is completed, aggregated logs and other metadata should be written to a consolidated message in a “run logs” topic[i]. This ensures that cursor 0 in “run configuration” will match cursor 0 in “run logs.” Finally, you can now build a “run explorer” that will scroll through those topics and match each run configuration with its logs. Neat!
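
As a sketch, reusing the toy Topic class from the previous section (record_run is a hypothetical helper, not a real framework call):

    import json

    run_configs = Topic()  # the "run configuration" topic
    run_logs = Topic()     # the "run logs" topic

    def record_run(config: dict, logs: str) -> Cursor:
        # Writing to both topics in lockstep keeps the cursors aligned:
        # cursor N in "run configuration" matches cursor N in "run logs".
        cursor = run_configs.put(json.dumps(config).encode())
        assert run_logs.put(logs.encode()) == cursor
        return cursor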

Simple Timeseries Storage

Timeseries can take advantage of the total ordering of messages within topics by expressing individual data points as messages. Usually, cursors are defined as unsigned integers, but it is possible to treat them as offsets from an origin timestamp. A function given the time resolution, the origin and the current cursor can then recover the timestamp of the entry, and vice versa.
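
As a sketch of that correspondence (the units are an assumption; here the origin is a Unix timestamp and the resolution is in seconds):

    def cursor_to_timestamp(cursor: int, origin: int, resolution: int) -> int:
        # Each cursor step advances one resolution interval past the origin.
        return origin + cursor * resolution

    def timestamp_to_cursor(timestamp: int, origin: int, resolution: int) -> int:
        # Inverse mapping; assumes timestamps fall exactly on the grid.
        return (timestamp - origin) // resolution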

Event Sourcing

Last but not least, schema updates mandate writing migration functions (up/down) to allow the reading of “old” data with the “new” schema: doing otherwise would violate the consistency rule. As is often the case, writing migrations in early prototyping projects can be quite cumbersome, while erasing entire topics and starting over can lead to information loss. We don’t want to encourage the latter, so we can use “snapshots” instead. We define snapshots as new states aggregating multiple messages into a single message. Let’s see how they are created (a sketch follows the steps):

  1. First, read history from topic A with schema S until the branch point, where the schema changes.
  2. Then, create a new message from the current application state.
  3. Write the state to a new topic B with schema T.
  4. Copy the remaining messages, after the branch point, from topic A to topic B.
  5. Finally, erase topic A and schema S.
  6. Now all messages reside in topic B and are readable with schema T.
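
A sketch of those steps, again reusing the toy Topic class (the aggregate function, which folds schema-S history into a single schema-T state, is application-specific and hypothetical here):

    from typing import Callable

    def snapshot(topic_a: Topic, branch_point: Cursor,
                 aggregate: Callable[[list[Message]], Message]) -> Topic:
        topic_b = Topic()
        # Steps 1-3: fold all history before the branch point into a single
        # schema-T state, written as the first message of topic B.
        history = [topic_a.get(c) for c in topic_a.list() if c < branch_point]
        topic_b.put(aggregate(history))
        # Step 4: copy the remaining messages after the branch point.
        for cursor in topic_a.list():
            if cursor >= branch_point:
                topic_b.put(topic_a.get(cursor))
        # Steps 5-6: topic A can now be erased; every message lives in
        # topic B and is readable with schema T.
        return topic_b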

This approach is very close to something we would do using another pattern called “event sourcing.” Event sourcing also operates under the assumption that the application is capable of aggregating multiple messages into a single state representative of several entries, without causing information loss. Event sourcing is the natural evolution of such applications. (Learn more about this at https://docs.microsoft.com/en-us/azure/architecture/patterns/event-sourcing.)

Conclusion and Annexes

Topic-based architecture is the missing piece in most AI-driven applications. Unlike many other practices, data science truly benefits from the ability to retain earlier data and reproduce results. This design pattern is usable from PoC, to MVP, to production, and requires no software architecture changes between stages. Adopting topic-based versioning in your data science project is a first step toward a truly reproducible AI pipeline. For further reading and information on similar or connected topics, please refer to:

· https://kafka.apache.org/documentation/#intro_topics

· https://martinfowler.com/eaaDev/EventSourcing.html

· https://docs.microsoft.com/en-us/azure/architecture/patterns/event-sourcing

· https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs

· https://blog.ploeh.dk/2016/03/18/functional-architecture-is-ports-and-adapters/

[i] A capable framework would provide the ability to open a message, stream content to it, and commit it on close. Such an interface would be very handy for streaming logs!
