Clarity AI Engineering Runtime

Ruben Eguiluz
Published in Clarity AI Tech
10 min read · Jul 10, 2024

Introduction

In the Platform area, one of our goals is to ensure that product development teams (stream-aligned) have autonomy and can innovate quickly with a low cognitive load. With this goal in mind, the Foundational Engineering team has implemented an internal product called ‘Engineering Runtime’ that allows ClarityAI employees to create and evolve their applications and infrastructure easily.

We believe in Platform as a product, where people in the company have the tools to self-service platform and infrastructure resources (for instance, an S3 bucket needed by an application) instead of filing tickets and requesting them from other people. This improves the agility of tech teams and decreases unplanned work in the Platform area, among other benefits. We do not cover 100% of the cases in the company, but close to 90% of them.

Users can independently create and manage the whole application stack with the Engineering Runtime. An application includes the application's base code, pipelines, deployments, environments, and infrastructure. For instance, a user can create a Python application from an existing template, define the environments to deploy it to, set the runtime properties (CPU, memory, etc.), and declare the application's infrastructure needs, all autonomously. The application repository and pipelines are then created automatically, the application is deployed to the defined environments, and the infrastructure is configured. How does this magic work? Keep reading to find out…

Previous solution

Before the Engineering Runtime, at ClarityAI we implemented the Helm-Factory framework to tackle this problem. It aimed to improve the operational freedom of the tech teams and make application creation easier for ClarityAI developers.

The framework abstracts Kubernetes concepts away from developers while keeping them fully aware of the impact of their actions on the application, without having to deal with all the internals involved. A developer can define the application's compute characteristics (Kubernetes-native resources only) without Kubernetes knowledge, using a simple config file like this:

metadata:
  type: http
  env: dev
  project: https://xxxx.domain.ai/infra/tools/letes
  squad: platform-sre

main:
  image: platform/letes
  port: 8620
  replicas: 1
  memory: 32Mi
  cpu: 50m

monitoring:
  metrics:
    enabled: true

probes:
  liveness:
    path: /metrics
  readiness:
    path: /metrics

From the beginning, one of the challenges was to provide an interface where our end users (Clarity's tech team) could operate an application with autonomy and confidence without being exposed to all the internal complexity.

This approach allowed us to gather key insights about the needs of our applications in terms of architecture, operation, and complexity. Thanks to Helm-Factory, we could start the Engineering Runtime initiative from scratch with much more certainty.

Know your users

We did exhaustive research into the existing applications and our engineering needs, which made the development process fairly straightforward. It also allowed us to anticipate some future needs and incorporate them into the design without implementing them yet.

During the development phase, we had sessions with different groups of users to get some feedback from their side so we could adjust our decisions accordingly.

An imperfect solution

We knew the Helm-Factory was not the ultimate or optimal solution to solve the problem. It still presented several challenges and operational limitations:

  • No self-service for infrastructure components (S3 buckets, IAM permissions, databases…).
  • Poor isolation between teams and projects.
  • Limited capacity for anything beyond pure Kubernetes manifest composition.
  • Based on Helm templates, which are not very friendly to develop and test.
  • Rudimentary observability.

It was a cheap step forward with a massive impact on the tech team, both technically and culturally. It homogenized the Kubernetes packaging of 99% (*) of our applications and democratized the common operational tasks over them.

Engineering Runtime

This framework is not a product completely developed by Clarity AI. Rather than reinventing the wheel, it combines existing tools with some custom implementations that act as glue. These are the existing tools the framework uses:

Argo CD

Argo CD is an open-source continuous delivery tool designed to automate the deployment of applications to Kubernetes clusters. It follows the GitOps methodology, which means that the desired state of the application is declared in a Git repository, and Argo CD ensures that the actual state of the deployed application matches the declared state.

One of the key features of Argo CD is its ability to automatically detect changes to the desired state stored in the Git repository and synchronize these changes with the target Kubernetes clusters. This enables developers to implement a declarative approach to managing their application deployments, reducing manual intervention and ensuring consistency across environments.
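To make the declarative model concrete, here is a sketch of an Argo CD Application manifest tracking a desired state from Git. The repository URL, names, and namespaces are hypothetical, not Clarity AI's actual configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fast-api-app
  namespace: argocd
spec:
  project: myproject
  source:
    repoURL: https://xxxx.domain.ai/product/myproject/fast-api-app
    targetRevision: onetag/dev   # Git revision (a tag here) to track
    path: .
  destination:
    server: https://kubernetes.default.svc
    namespace: fast-api-app
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift in the cluster
```

With `automated` sync enabled, Argo CD reconciles the cluster whenever the tracked revision changes, which is what removes the manual `helm upgrade`-style steps.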

Kubernetes manifests can be specified in several ways: Kustomize applications, Helm charts, Jsonnet files, a plain directory of YAML/JSON manifests, or any custom config management tool configured as a config management plugin.

Overall, Argo CD simplifies and streamlines the process of deploying and managing applications on Kubernetes clusters, making it an essential tool for organizations adopting cloud-native technologies and practices.

Argo CD is the deployer layer of the system. We needed a real production-grade approach to manage the deployment of applications, and that's exactly what we found in Argo CD. It reduces the complexity of several parts of the system:

  • No systematic cross-clustering is needed to access EKS clusters from the CI.
  • No systematic cross-accounting is needed to access EKS clusters from the CI.
  • No systematic RBAC-IAM mapping to operate Kubernetes resources.
  • Built-in mechanisms to encapsulate Applications into their Project scope.
  • Native isolation capabilities to avoid conflicts between Applications.
  • It removes the need to use client-side helm upgrade operations.

Argo CD is only in charge of applying the manifests. It uses Appernetes (see below) as a Plugin to generate all the manifests for a given Application.
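One way to wire such a plugin is Argo CD's sidecar ConfigManagementPlugin format, where a `generate` command prints the manifests for the repository to stdout. This is a hedged sketch; the Appernetes command and its arguments are assumptions:

```yaml
# plugin.yaml mounted into the Argo CD repo-server sidecar
apiVersion: argoproj.io/v1alpha1
kind: ConfigManagementPlugin
metadata:
  name: appernetes
spec:
  discover:
    fileName: ".run.yaml"          # activate only for repos containing this file
  generate:
    command: ["appernetes"]        # hypothetical CLI entrypoint
    args: ["synth", "--stdout"]    # emit the full Kubernetes artifact to stdout
```

The `discover.fileName` rule is what lets Argo CD pick the plugin automatically for Runtime applications.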

Crossplane

Crossplane is an open-source Kubernetes add-on that enables infrastructure as code and the management of cloud resources directly from Kubernetes. It extends the Kubernetes API, allowing users to provision infrastructure resources such as databases, storage, and managed services using Kubernetes-style declarative syntax.

Key features of Crossplane include declarative infrastructure management, the composition of complex infrastructure stacks by combining and reusing existing infrastructure abstractions, and extensibility (an ecosystem of providers), among others.

It is a game-changing component of the stack, as it allows the framework to package infrastructure as part of the application itself. The final Kubernetes artifact is thus a composition of pure Kubernetes manifests.

For now, we are NOT using native Crossplane abstractions (Claims) or Appernetes-synthesized resources; instead, we use the CRDs of the Crossplane providers directly.
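For instance, the bucket from the earlier example could be expressed directly with a provider CRD. This is a generic sketch using the Upbound AWS S3 provider; the provider family, region, and ProviderConfig name are assumptions, not Clarity AI's actual setup:

```yaml
# A Crossplane managed resource: applying this manifest creates a real S3 bucket.
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: my-buck
spec:
  forProvider:
    region: eu-west-1            # illustrative region
  providerConfigRef:
    name: default                # ProviderConfig holding AWS credentials
```

Because this is just another Kubernetes manifest, Argo CD can sync it alongside the Deployments and Services of the application.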

GitLab

GitLab is a complete DevOps platform that provides a single application for the entire software development lifecycle. It offers integrated tools for source code management, continuous integration/continuous deployment (CI/CD), issue tracking, code review, and collaboration, all within a single interface.

One of GitLab’s standout features is its robust CI/CD capabilities, known as GitLab CI/CD. With GitLab CI/CD, teams can automate their applications’ building, testing, and deployment, streamlining the release process and ensuring consistent quality across releases. CI/CD pipelines in GitLab are defined using YAML configuration files, making them easy to version control and maintain alongside the codebase.
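As a minimal illustration (job names and steps are hypothetical, not the Runtime framework's actual pipeline), a `.gitlab-ci.yml` defines such a pipeline in YAML:

```yaml
stages:
  - build
  - test

build-image:
  stage: build
  script:
    # CI_REGISTRY_IMAGE and CI_COMMIT_SHORT_SHA are predefined GitLab variables.
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

unit-tests:
  stage: test
  script:
    - pytest
```

Because the pipeline lives in the repository, it is versioned and reviewed together with the application code.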

Appernetes

This is one of the tools implemented by Clarity AI: a piece of software that transforms a specific YAML file into Kubernetes YAML manifests of native and Crossplane custom resources.

The tool generates all the manifests required to put an Application into production. This composition is also known as the Kubernetes artifact. Appernetes is a wrapper around cdk8s that reads the desired state of an Application from a simplified YAML file. This is an example of a .run.yaml file:

application:
  name: fast-api-app
  owner: foundational-engineering
  project: foundational-engineering
  repo: https://xxxx.domain.ai/product/fg/fast-api-app

services:
  - name: fast-api-app
    image: clarity/fast-api-app
    cpu: 250m
    memory: 256Mi
    replicas: 2
    port: 5000

public:
  name: fast-api-app

interfaces:
  - s3://foundational-engineering/buck:ro

buckets:
  - name: my-buck

databases:
  - name: cache
    engine: redis

Within the YAML file, the user defines the following:

  • application: Defines everything about the HTTP Application the user wants to deploy, including how it interacts with the subsystems that make it a functional component in production (tracing, governance, routing, etc.) and any needed infrastructure component, like in-memory caches or S3 buckets.
  • services: The smallest deployable units of computing within an Application. Every Application must contain one or more Services.
  • buckets: Creates cloud storage components as part of the Application infrastructure.
  • databases: Persistence-layer components operated within the context of the Application.
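To make the mapping concrete, here is a hand-written sketch of the kind of Kubernetes Deployment that could be synthesized from the services entry above. Appernetes' actual output is internal to Clarity AI, so the labels and structure here are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fast-api-app
  labels:
    app: fast-api-app
spec:
  replicas: 2                        # from services[0].replicas
  selector:
    matchLabels:
      app: fast-api-app
  template:
    metadata:
      labels:
        app: fast-api-app
    spec:
      containers:
        - name: fast-api-app
          image: clarity/fast-api-app   # from services[0].image
          ports:
            - containerPort: 5000       # from services[0].port
          resources:
            requests:
              cpu: 250m                 # from services[0].cpu
              memory: 256Mi             # from services[0].memory
```

The value of the abstraction is visible here: a six-line service entry expands into a full, consistent Deployment (plus Services, monitoring, and Crossplane resources) without the user writing any of it.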

ChatCommands

This is another tool developed by Clarity AI: a Slack bot that makes several engineering tasks easier, for example:

  • The SaaS release (BE, FE, Report Builder).
  • Resetting AWS keys.
  • Resetting the AWS console password.
  • Several k8s tasks.
  • Generating a temporary AWS admin token.
  • Creating an incident/EDD.
  • Etc.

Some of the tool's commands relate to the ClarityAI Engineering Runtime framework, like "create application". This command creates a new base runtime application in GitLab from an existing template. The new application includes all the resources needed by the Runtime framework, like the .run.yaml file and the pipelines.

The Slack bot is one of the clients of the chatCommands application; there is also a CLI client called ctool to run all the commands from the terminal.

Architecture

How does it work?

The process starts with a user creating a new application using ChatCommands in Slack or the ctool command line. In Slack, for instance, it would look like this:

/clarity create runtime application myapp in environment dev owned by myteam project myproject template fast-api admin_token 4f45ds

This command creates the runtime application myapp and deploys it in the dev environment. The application is owned by the myteam team and belongs to the myproject project. A project is a logical group of infrastructure and Applications that share a specific product context or contribute to the same functionality.

Behind the scenes, this command creates a Git repository from a template. In this case, we used the fast-api template, which is the base code to build a REST API in Python. Apart from creating the repository in GitLab, it generates the following .run.yaml file:

application:
  name: fast-api-app
  owner: foundational-engineering
  project: myproject
  repo: https://xxxxx.yyyy/product/myproject/fast-api-app

services:
  - name: fast-api-app
    image: clarity/fast-api-app
    cpu: 250m
    memory: 256Mi
    replicas: 2
    port: 5000

public:
  name: fast-api-app

In this example, the app would be exposed at fast-api-app.dev.domain.com endpoint.

At this point, Argo CD has already discovered this new application and deploys its manifests in the dev Kubernetes cluster. How? Argo CD is configured to automatically discover new repositories under predefined projects in GitLab. In this example, we added the new application to the "myproject" project, so this project should already be configured as a project to watch in Argo CD.
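A common way to implement this kind of GitLab auto-discovery in Argo CD is an ApplicationSet with an SCM provider generator. The sketch below is an assumption about the mechanism, with hypothetical group and URL values, not Clarity AI's actual configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: myproject-apps
  namespace: argocd
spec:
  generators:
    - scmProvider:
        gitlab:
          group: product/myproject          # GitLab group to scan for repos
          api: https://xxxx.domain.ai
        filters:
          - pathsExist: [".run.yaml"]       # only repos that are Runtime apps
  template:
    metadata:
      name: "{{ repository }}"              # one Application per repository
    spec:
      project: myproject
      source:
        repoURL: "{{ url }}"
        targetRevision: onetag/dev
        path: .
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{ repository }}"
```

With this in place, pushing a new repository under the watched group is enough for a corresponding Argo CD Application to appear.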

But how does Argo CD know that the repository should be deployed in the dev environment? When creating the new repository, the ChatCommands tool also creates the GitLab pipeline, and one of the steps in that pipeline tags the code with a specific onetag/dev tag. This tag tells Argo CD that this revision should be deployed in the dev Kubernetes cluster.
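A pipeline step that moves such an environment tag could be sketched like this. The job name and branch rule are assumptions, and pushing a tag from CI requires a token with write access to the repository:

```yaml
tag-dev:
  stage: deploy
  script:
    # Move onetag/dev to the current commit so Argo CD, which tracks
    # targetRevision onetag/dev, picks up and deploys this version.
    - git tag -f onetag/dev "$CI_COMMIT_SHA"
    - git push -f origin onetag/dev
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
```

Force-moving a tag like this trades Git-history purity for a very simple promotion model: the tag always points at whatever should be live in that environment.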

Argo CD can now discover the new application: the repository belongs to a Runtime project, the code is tagged for deployment to the Kubernetes cluster, and the repository contains a .run.yaml file. Because Argo CD is configured to use Appernetes as a plugin, it applies the manifests generated by Appernetes to the Kubernetes cluster.

What happens if we want to deploy the application to another environment? For instance, deploying to production is as simple as tagging the desired state of the repository with the onetag/pro tag. Argo CD automatically deploys the application (Kubernetes and Crossplane resources) in the production environment: the Kubernetes resources are deployed in the production cluster, and the infrastructure resources are created in AWS.

What if we want to include a database in the application?

In this case, we just have to add these lines to the .run.yaml file:

databases:
  - name: my-database
    engine: postgresql

Push the code, and the database is automatically created in AWS, with its credentials injected into the Kubernetes pods as environment variables.
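Crossplane managed resources can write their connection details (host, username, password) to a Kubernetes Secret via `writeConnectionSecretToRef`, which the generated workload can then consume as environment variables. A hedged sketch of the consuming side, with a hypothetical secret name:

```yaml
# Fragment of a generated Deployment's pod spec (illustrative only):
containers:
  - name: fast-api-app
    image: clarity/fast-api-app
    envFrom:
      - secretRef:
          name: my-database-conn   # Secret written by the Crossplane provider
```

The application code then reads the connection details from its environment without ever handling cloud credentials directly.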

Conclusions

In summary, we are excited to see how this framework transforms how we develop and deploy applications at ClarityAI. We are confident it will provide significant value to all our users.

For this first version, we collaborated with the Data Science team, who acted as beta testers and were able to create Streamlit applications: full web applications with infrastructure resources such as S3 buckets, without needing help or resource requests from the SRE team.

The tool has just been released, and we know there are a lot of features left to implement, but this is just the starting point. We will prioritize delivering value to our users step by step, focusing on their real needs.
