Starting a journey: A look at the engineering behind Seia

They say a journey of a thousand miles begins with a single step. The story of Seia is no different: it is just now taking its first step. As a quick introduction, Seia is a tool that collects the data fed to and generated by machine learning models, detects drift during the course of normal operations and informs users of its probable causes. This text is the account of the backend engineering decisions behind the implementation of Seia. We'll go through what worked, what did not, and the reasoning behind the decisions. Keep in mind these are the first steps; the requirements simply aren't there yet to justify a more complex infrastructure, though once the application gains traction we'll have to adapt to that evolving scenario. Right now, Seia is in an open beta stage, which means anyone who has a use case can request access through this link: https://seia-reliableai.deus.ai/

Functionality

The two main actions a user can take are to send data to Seia and to consult the processed data. The kind of data Seia requires is simple: it needs the predictions the models make (along with the corresponding inputs) and the ground truths for those predictions. The ground truths are technically not necessary, but without them the usefulness of the platform is greatly diminished. The user can send the data in bulk or stream it as the models make the predictions and the ground truths surface. That's it for requirements! This calls for an API to collect the data and expose the processed data to the frontend. It also calls for some sort of database to store the data, as well as some other system that processes the raw data and stores it back into the database. Simple!
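To make the data-sending side concrete, here is a rough sketch of what it could look like from the user's point of view. The endpoint paths, field names and auth header below are illustrative, not Seia's actual API.

    # Illustrative only: endpoint paths, field names and the token are hypothetical.
    import requests

    BASE_URL = "https://seia-reliableai.deus.ai/api"  # illustrative base URL
    HEADERS = {"Authorization": "Bearer <your-token>"}

    # Stream a prediction as the model makes it
    requests.post(
        f"{BASE_URL}/predictions",
        json={
            "model_id": "churn-model",
            "prediction_id": "abc-123",
            "inputs": {"age": 42, "plan": "premium"},
            "prediction": 0.87,
        },
        headers=HEADERS,
    )

    # Send the ground truth later, once it surfaces
    requests.post(
        f"{BASE_URL}/ground-truths",
        json={"prediction_id": "abc-123", "ground_truth": 1},
        headers=HEADERS,
    )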

API

The API is written in Python. Though at first we were going to use some specialized libraries in the API, we ended up deciding against it. The role of the API is to be just an interface (some might say an application interface, even) between the user actions and the database. That's what we went for. This makes its scope much more manageable and, since we don't depend on specific Python libraries, we can port it to another, faster language if we ever feel the need. Though there's no hint that will be necessary any time soon, it's nice to have the option. What's that? I'm glad you asked: yes, it's stateless. Yes, it scales infinitely. No, we just have a few instances for redundancy, stop asking :D
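As a sketch of what "just an interface" means in practice (assuming FastAPI purely for illustration; the repository class and endpoint names are hypothetical, not the actual Seia code):

    # A minimal sketch of the "thin interface" idea, assuming FastAPI.
    from fastapi import FastAPI

    app = FastAPI()


    class PredictionRepository:
        """Stand-in for the real database-access layer."""

        def insert_prediction(self, record: dict) -> None:
            ...  # in the real system this writes to the tenant's Postgres database


    repository = PredictionRepository()


    @app.post("/predictions")
    def create_prediction(payload: dict) -> dict:
        # No processing happens here: the endpoint just hands the payload to the
        # database layer, which is what keeps the API stateless and easy to scale.
        repository.insert_prediction(payload)
        return {"status": "accepted"}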

Workers

"The workers" is what we call the system that takes the raw data from the database, processes it and writes it back. It is also written in Python because it uses pandas, numpy, scipy, all that good stuff. Periodically, the workers search the database for data that has not yet been processed and then process it. The architecture of the code is very modular, with each transformation entrypoint being its own class, and there is a single loop that sleeps for a predetermined amount of time and then executes the list of active transformations. It works essentially as a DAG, but we don't need all the functionality of a DAG, so the transformations just run one after the other, in sequence, one at a time. Eventually, we will outgrow this solution and transition to a DAG implementation (Airflow seems like a good candidate) for more flexibility and functionality. Scaling the workers is not as straightforward as the API and we will eventually reach a limit. The first thing we can do is have one instance of the workers for each tenant (a tenant is a client of the Seia product). We can further scale by having an instance of the workers per model (each tenant may have multiple models active at the same time). If we require a solution of that nature, it means we have a number of big tenants working with us. I'd say that's a pretty good problem to have :)
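A rough sketch of that loop, with illustrative class names (this is the shape of the idea, not the actual Seia code):

    # Each transformation is its own class; a single loop runs the active ones in sequence.
    import time
    from abc import ABC, abstractmethod


    class Transformation(ABC):
        """One unit of processing: read raw rows, compute, write results back."""

        @abstractmethod
        def run(self) -> None:
            ...


    class ComputeDriftMetrics(Transformation):
        def run(self) -> None:
            # 1. query the tenant database for unprocessed rows
            # 2. crunch them with pandas/numpy/scipy
            # 3. write the results back to the database
            pass


    ACTIVE_TRANSFORMATIONS: list[Transformation] = [ComputeDriftMetrics()]
    SLEEP_SECONDS = 60


    def main() -> None:
        while True:
            # No DAG machinery yet: transformations simply run one after the other.
            for transformation in ACTIVE_TRANSFORMATIONS:
                transformation.run()
            time.sleep(SLEEP_SECONDS)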

Database

Then there is the database. It's Postgres, so we're off to a great start already. It has all the functionality we need for the queries we are going to run and is in general a great product. One thing we had in mind from the beginning was that we would need to segregate the users' data. We did this using a different database for each tenant. This way they are completely separate from one another. A nice side effect is that scaling the database is just a matter of moving a tenant's database to a different instance. Other solutions after that may require some sort of sharding or partitioning of data. No sense in thinking about it right now. To further cement the separation, the user credentials are hardcoded to only be able to access a single tenant at a time. We have two kinds of databases: the admin database, for things that are not related to any tenant in particular, and the tenant databases, of which there are many, all with the same schema. The evolution of the admin and tenant databases is controlled by migrations handled in the API. We had to have the migration code somewhere and the API felt more appropriate, even though the workers also access the database. This has ramifications for stored procedures, which we will see in a bit. Yes, we (I) love stored procedures.
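As a quick illustration of the one-database-per-tenant setup (host names, database names and the secrets handling below are made up, not our real configuration):

    # Each tenant gets its own database and credentials; moving a tenant to a
    # different Postgres instance is just a change in how the DSN is resolved.
    import psycopg2


    def tenant_dsn(tenant: str) -> str:
        return (
            f"host=db-{tenant}.internal dbname=seia_{tenant} "
            f"user=seia_{tenant} password=<from-secrets-store>"
        )


    ADMIN_DSN = "host=db-admin.internal dbname=seia_admin user=seia_admin password=<from-secrets-store>"

    # A connection made this way can only ever see one tenant's data.
    with psycopg2.connect(tenant_dsn("acme")) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM predictions")
            print(cur.fetchone())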

There are a couple of issues with stored procedures, though. First, changing or adding them requires a new migration. Migrations are free, but they live in the API repo. If we are working on the workers, does it really make sense to also create a change on the API just for the SQL? Probably not. Furthermore, in that case, the implementation of the SQL lives in another repo, which is an obstacle to the understandability and visibility of what the code does. It's much cleaner to have the SQL inside a method in Python and then invoke the Python method with the required parameters. It's easier to change too, when necessary. Unfortunately, that means that if the API and workers need to perform the same operation, the solution we have is copy-pasting it, with all its downsides. We could conceivably have a separate library that implements the access to the database and is used by both the API and the workers. That's more structure than we feel is worth it and it still comes with the downside of the double change mentioned earlier. So the decision was made to scrap the stored procedures; going forward, all the SQL code is written inside the application code (segregated into its own classes so it's easier to test and mock).
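A sketch of what "SQL inside the application code" looks like in practice (the table and column names are illustrative):

    # One small class per query: the SQL is right next to the code that uses it.
    class UnprocessedPredictionsQuery:
        SQL = """
            SELECT id, inputs, prediction
            FROM predictions
            WHERE processed_at IS NULL
            ORDER BY created_at
            LIMIT %(batch_size)s
        """

        def __init__(self, connection):
            self._connection = connection

        def fetch(self, batch_size: int = 1000) -> list[tuple]:
            with self._connection.cursor() as cur:
                cur.execute(self.SQL, {"batch_size": batch_size})
                return cur.fetchall()

Because the connection is injected, tests can hand the class a fake connection (or point it at a test database) without any of the migration machinery getting involved.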

A sad consequence is that we can't use Python in the stored procedures (because we don't use any). As you may be aware, you can use languages other than plpgsql in stored procedures, like Python and Perl. But did you know that, with fairly little effort, you can use pandas, numpy or any third-party library in stored procedures? Now that's a real abomination, I loved it! It sped up the workers' processing so much, because the data never has to travel back and forth, and we could use all the features of the Python libraries, not just plpgsql. Unfortunately, this brings even more problems in addition to the ones mentioned before. Let's say you're creating a new stored procedure in Python: will your editor think it's an SQL file? Then the Python code is all wrong. Is it a Python file? Then the function declaration headers are a syntax error. Either way, you'll be drowning in editor errors and will have a hard time reading/writing/developing/debugging the code. If you know of any solution to this, let me know. I'm thinking of a custom plugin that knows what plpython3u is, but if a plugin is not required, that's even better (please don't say to inspect the code text of a Python function object, concatenate it with create procedure headers and footers, send it to the server and execute. That's crazy enough to work, but also, you know, crazy).
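For the curious, here's roughly what the abomination looks like: a plpython3u function shipped from a migration, with its body written in Python and free to import pandas and scipy. The function, table and column names below are made up for illustration.

    # Hypothetical migration snippet: the DDL is SQL, but the function body is
    # Python that runs inside Postgres, so the data never leaves the server.
    DRIFT_SCORE_FUNCTION = """
    CREATE OR REPLACE FUNCTION drift_score(model_id int)
    RETURNS double precision
    LANGUAGE plpython3u
    AS $func$
        import pandas as pd
        from scipy import stats

        # plpy is PL/Python's gateway back into the database
        plan = plpy.prepare(
            "SELECT prediction, ground_truth FROM predictions WHERE model_id = $1",
            ["int"],
        )
        rows = plpy.execute(plan, [model_id])
        df = pd.DataFrame(list(rows))
        if df.empty:
            return None

        # Toy drift signal: KS statistic between predictions and ground truths
        return float(stats.ks_2samp(df["prediction"], df["ground_truth"]).statistic)
    $func$;
    """

And there you have the editor problem in miniature: Python inside SQL inside a Python string, and no file extension that makes all three layers happy.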

Public API gateway

To finish everything off we have the API gateway. It's straightforward: it terminates SSL and load balances the application instances. It's nginx. The reason we are using nginx and not a cloud offering that implements the same functionality is twofold: cost and knowledge. Basically, we already have the knowledge to put the pieces in place and connect them, but not in a cloud environment with managed services. This means we either devote time to learning how it works or we develop new features for the product. I'm sure this is a familiar situation for any developer. It has been a mixed bag, but essentially we are drawing on our existing devops knowledge (not necessarily obsolete, since the system is running just fine, but perhaps out of fashion) to deploy the system and learning things as they become relevant for the task at hand. For now, the cost-benefit of a managed API gateway instead of nginx is just not favourable.
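For reference, the whole gateway boils down to a config in this spirit (hostnames, ports and certificate paths are illustrative, not our actual setup):

    # SSL termination + load balancing, and nothing else.
    upstream seia_api {
        server api-1.internal:8000;
        server api-2.internal:8000;  # a few instances, purely for redundancy
    }

    server {
        listen 443 ssl;
        server_name seia-reliableai.deus.ai;

        ssl_certificate     /etc/nginx/certs/seia.crt;
        ssl_certificate_key /etc/nginx/certs/seia.key;

        location / {
            proxy_pass http://seia_api;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }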

In all, the system is up and running and you can check out Seia for yourselves at https://seia-reliableai.deus.ai/ . If you have any questions or would like to get in touch for a chat, you can reach me on LinkedIn.
