Microcosm: Yes, Kenshoo has also built a PaaS

The Kenshoo Microservices Platform

Moshe Eshel
skai engineering blog
11 min read · May 20, 2020


Logo of the Kenshoo Microservices Platform

This post is based on a talk I gave at Reversim 2017, "One platform to rule them all", refactored for the written form. As you can see, some time has passed, but we're still enjoying this great platform, so we thought we'd share some more details and an update just as we're in the final stages of pushing the next generation of this platform. It's named Kubiverse and will be based on, you guessed it, Kubernetes. So, in addition to the platform being solution agnostic, the infrastructure it runs on will also be cloud agnostic. Enjoy!

Back in 2006–2007, some teams in Kenshoo started extracting code from our monolith into microservices. After some experimentation and a lot of production events, we were pretty happy with pushing ahead with this concept. Our main reason, beyond just adopting the buzz architecture, was that using microservices we could reduce dependencies between teams, enabling faster development cycles with shorter feedback loops.

The problem

Kenshoo had made a strategic decision to use microservices for its architecture. Many efforts & investments were made to do this, but we weren’t happy with our progress. We had over 40 microservices at that point, but we felt that:

  1. Not enough of our teams had taken the leap. Most had never even “touched” a microservice, let alone developed a new one.
  2. Most features were still being developed inside the monolith by default, and our teams were avoiding the transition.

Why? A theory emerges

We developed a theory: while microservices were working just great for anyone who already had one up and running, it was quite difficult to start a new one. Teams had considered making the move and developing new features as microservices, but without prior experience they would (correctly) estimate the timeline to create a new service — including design, code, testing and deployment pipelines, monitoring, and alerting — at somewhere in the vicinity of six-plus months to a working service in production. That was just too long, especially compared to the couple of weeks it would take to create the same feature in the monolith. Granted, development would be less fun, but the monolith is a robust environment with a lot of built-in functionality.

The cliff

The learning cliff, a chart of the tools & concepts that need to be learned to build services (Load Balancer, Auto Scaling, etc.)

So we broke it down further. The "time to service up" comprised many different parts, each requiring its own learning curve for developers and teams who aren't experienced in these areas:

  1. Starting a new code project from scratch. Teams are used to working on a monolith.
  2. A continuous integration (CI) pipeline (Jenkins jobs) for the project.
  3. A deployment pipeline to push the artifacts to production.
  4. Creating live environments — production, lab, etc.
  5. Debugging all these parts to understand why something broke…
  6. Adding metrics, logging and many other different things.
  7. Containers: Docker was a tool we were just starting to introduce into Kenshoo.
  8. Production know-how: AWS, auto-scaling, using a load balancer.

Up to this point our teams relied on quite a few operations teams for these:

  1. A new repo in GitHub requires permissions and setting up webhooks and other security settings. Developers do not have admin access and need to open a ticket.
  2. Though we started transitioning to Jenkins DSL and jobs were beginning to be documented in code, this was far from the norm. In order to create a new job, developers had to submit a ticket to the SRE team.
  3. Creating environments: Most of the org was running in our own data center which required manual VM provisioning, while we were also deployed in the cloud (AWS). We had to submit tickets for new resources, including instances, DNS records, and more.

Issues aren’t over once the service is live:

  1. Trial by fire: Configuration decisions made by uninformed teams can cause serious issues. For example, an ELB brought an entire service down because the team configured a health check that didn't reflect the actual availability of the instances, and another service went down because it wrote files internally but wasn't allocated enough storage. We needed a way to centralize our learnings so that this knowledge could be actively shared and acted upon. (A sketch of a safer health-check configuration follows this list.)
  2. Maintenance tools such as log viewing and alerting, operations such as database backups, and much more.
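To make the ELB example above concrete, here is a minimal sketch of what a safer health-check configuration can look like, using boto3 against the classic ELB API. The load balancer name and the /healthcheck path are hypothetical placeholders, not Kenshoo's actual configuration.

```python
# Minimal sketch (boto3, classic ELB API). "orders-svc-elb" and the
# /healthcheck path are hypothetical placeholders, not our real config.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Point the health check at an endpoint that reflects real readiness
# (DB connectivity, disk space, dependencies), not just an open port.
elb.configure_health_check(
    LoadBalancerName="orders-svc-elb",
    HealthCheck={
        "Target": "HTTP:8080/healthcheck",
        "Interval": 30,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 3,
    },
)
```

Baking a setting like this into the platform is exactly the kind of "battle-tested" default that saves the next team from relearning the same lesson in production.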

Up to this point, each team had to do all of this from scratch for every project, which in practice meant a lot of copy and paste. Many of the issues we ran into stemmed from which project you happened to copy from, and the result was that our 40 existing microservices had exactly 40 different CI/CD pipelines, each understood only by the team or developer who created it. This made the stack very hard to maintain for everyone involved, which in turn created a "Don't touch that" culture: if it works, leave it be, because with so much variance it's hard to reason about why something isn't working.

People prefer avoiding something that is new, hard & complicated when they have a monolith that is easier to reason about and doesn’t require as much upfront effort from them

(Kenshoo Architect)

So what could we do?

We needed to help our teams make the transition to a more "DevOps" mindset and jump on the microservices bandwagon.

If we could reduce that ramp-up time to, let’s say, a week, and resolve the issues, would more teams take up the challenge? We thought — Yes!

We considered several options:

  1. More education: Workshops, courses, etc. Help our developers learn all these skills, and they will hopefully put them into practice.
  2. Buy a packaged solution that enables these changes.
  3. Build our own solution.

Build vs. Buy?

Kenshoo invests heavily in education, with many internal and external workshops and courses. Team leaders can request specific training, and developers have a personal budget they can spend on any online course or book that interests them. However, this only addresses part of the issue, and it takes time. So we felt that we also needed a technological solution.

We had an organizational problem, and as we know — our company is based on this principle — the right technology implementation can help solve organizational problems. So how do we acquire this technology? Do we build it or do we buy it?

The obvious question is, why develop something new? There are many tools out there that do these things. All major cloud offerings come with a PaaS solution (AWS Beanstalk, GCP App-Engine, Heroku, the list goes on). It’s a valid question, and we asked it ourselves.

We investigated many of the tools out there and realized that they don't provide what we needed: they don't lower the entry barrier, they just make the cliff higher.

We thought we could build something that would solve all our requirements, and allow us to implement (and enforce) our organization’s internal best practices and “battle-tested” (production) configurations.

We were quite aware that we didn't have the resources to create a full solution on our own. So what we built is not a full PaaS. In fact, we used existing platforms and tools as much as possible and kept our "PaaS" as a very thin unifying layer that hides the underlying implementation from developers, giving us both control and the flexibility to change what we needed.

Mission

Our mission statement going into the project, which we felt expressed the desired end state, was:

“A self-service platform where developers can manage their services, allowing them to innovate quickly.”

This mission in turn defined various targets and principles:

  1. Platform as a service: Developers won’t need to copy/paste. Everything they need to start coding will be provided by the service. An upgrade to the service provides upgrades to the users.
  2. Best-in-class automated and software-defined resource provisioning and deployment procedures: Create new environments without manual provisioning.
  3. Operations are an integral part. All configurations used by the platform when provisioning a new service will be according to our best practices. This includes wiring in tools used by operations teams to manage and monitor production.
  4. Eating our own dog food: The platform will be self-deploying using its own infrastructure, and the team developing it will need to use the platform to manage the platform. This means that each upgrade/feature is tested on the platform services first…
  5. It has to be a managed platform: no tickets to some IT queue, no waiting for completion.

And… we start(ed)

Our first target was a quick POC: taking a specific project that was already being packaged as a Docker image, we built a thin wrapper around AWS Elastic Beanstalk and wrote code that would deploy it to an environment. A few more days of coding and we had something that sort of worked, with a lot of hand-holding.
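For illustration only, a deployment along these lines can be sketched with boto3's Elastic Beanstalk client; the application, environment, and S3 bundle names below are hypothetical placeholders, not the actual Microcosm code.

```python
# Illustrative sketch (boto3); names and the S3 bundle location are
# hypothetical placeholders, not Microcosm's real implementation.
import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")

# Register the Docker-based bundle (e.g. a zipped Dockerrun.aws.json in S3)
# as a new application version...
eb.create_application_version(
    ApplicationName="my-service",
    VersionLabel="build-1234",
    SourceBundle={"S3Bucket": "deploy-bundles", "S3Key": "my-service/build-1234.zip"},
    Process=True,
)

# ...then roll the environment forward to that version.
eb.update_environment(
    EnvironmentName="my-service-production",
    VersionLabel="build-1234",
)
```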

We then went on to a more robust experiment, starting a real project on the platform from start to finish. The team started a new project with the dedicated support of the platform developer who worked on the POC, so anything they needed was very quickly added/resolved. All the while the platform was being advanced as we found more things that needed to be done.

This service was deployed to production within a month (down from six!), since the platform was essentially growing alongside the service. One major milestone after that first month: the platform (at that point a single microservice called Microcosm) was deploying itself to production, like a self-hosting compiler.

The experiment was now a proven success: the service had reached production in record time, continuous deployment was working, and the team was happy. We identified additional features that could make the platform even more self-service and further improve time to production (TTP).

We brought our operations teams on board. The idea took some getting used to, but once we showed how this tool would both enforce their policies out of the box and dramatically reduce their ticket load (their workload was very high), they became invested in the platform and helped us figure out many things. In fact, we structured the solution so that various parts would be the responsibility of the corresponding operations teams: the CloudFormation templates we use to provision a database were created and are maintained by our DBAs, our Big Data team took on provisioning RabbitMQ servers and Kafka clusters, our Network team worked with us on automating Route53 entry creation and certificate installation on our ELBs, and the list goes on.
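As a hedged illustration of that division of ownership, launching a team-owned CloudFormation template from the platform might look roughly like the sketch below; the template URL and parameter names are hypothetical placeholders, not the DBAs' actual templates.

```python
# Hedged sketch (boto3): launching a DBA-maintained CloudFormation template
# to provision a database. The template URL and parameters are hypothetical.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="my-service-db",
    TemplateURL="https://s3.amazonaws.com/platform-templates/rds-mysql.yaml",
    Parameters=[
        {"ParameterKey": "ServiceName", "ParameterValue": "my-service"},
        {"ParameterKey": "InstanceClass", "ParameterValue": "db.t3.medium"},
    ],
    Capabilities=["CAPABILITY_IAM"],
)

# Wait for the stack before wiring its outputs into the service configuration.
cfn.get_waiter("stack_create_complete").wait(StackName="my-service-db")
```

The appeal of this split is that the platform only needs to know how to launch a stack and read its outputs, while each operations team keeps full ownership of what its templates provision.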

The platform became reality.

The Platform

Microcosm architecture diagram

The platform elements

Microcosm CLI command line output
  • CLI: We developed a command line tool (written in Python + Click) that allows developers to access any functionality provided by the platform. With a simple install, developers have access to our PaaS from their own computer. (A minimal sketch of such a CLI follows this list.)
  • Microcosm: A microservice that handles requests from the CLI and initiates deploys, environment creation, deletion, updates, and more.
  • Skipper: Controls the various CI/CD pipelines we define for each service, creates new Jenkins jobs for each new microservice, and makes sure they are updated to the latest and greatest version.
  • Bootstrap Templates: For each use case we created a sample project that compiles and builds, and serves as a quick copyable template for new projects. The template includes all of our standard dependencies, examples of HTTP endpoints, API documentation (Swagger/OpenAPI), end-to-end tests using Karate and WireMock, and much more.
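As an illustration only, a Click-based command in this style could look like the sketch below. The command name, options, and backend URL are hypothetical placeholders, not the actual Microcosm CLI.

```python
# Hypothetical sketch of a Click-based platform CLI; the command, options,
# and backend URL are illustrative placeholders, not the real Microcosm CLI.
import click
import requests

MICROCOSM_API = "https://microcosm.example.internal"  # placeholder endpoint


@click.group()
def cli():
    """Developer-facing entry point for the platform."""


@cli.command("create-service")
@click.option("--name", required=True, help="Name of the new microservice.")
@click.option("--template", default="java-dropwizard", help="Bootstrap template to use.")
@click.option("--env", "envs", multiple=True, default=("staging",), help="Environments to create.")
def create_service(name, template, envs):
    """Ask the platform backend to bootstrap the repo, CI/CD jobs and environments."""
    resp = requests.post(
        f"{MICROCOSM_API}/services",
        json={"name": name, "template": template, "environments": list(envs)},
        timeout=30,
    )
    resp.raise_for_status()
    click.echo(f"Service '{name}' is being created: {resp.json()}")


if __name__ == "__main__":
    cli()
```

Keeping the CLI as a thin client over an HTTP API is what lets the underlying implementation change without developers reinstalling anything.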

Tools and services we use

  • GitHub: We create and configure new repos using the GitHub API (see the sketch after this list).
  • Jenkins: We use Jenkins DSL to define jobs, views, and more. Jenkins performs all the jobs, running the continuous integration and deployment.
  • AWS: Boto, Cloudformation, Beanstalk, Route53, RDS.
  • Docker
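For example, creating and wiring up a new repository through the GitHub API can be sketched with the PyGithub library as below; the organization name, token handling, and webhook URL are hypothetical placeholders rather than our actual setup.

```python
# Hedged sketch using PyGithub; the organization, token handling and webhook
# URL are hypothetical placeholders, not Kenshoo's actual configuration.
from github import Github

gh = Github("<api-token>")  # assumes a token with org admin scope
org = gh.get_organization("example-org")

# Create the repository with sane defaults...
repo = org.create_repo("my-new-service", private=True, auto_init=True)

# ...and add the CI webhook so Jenkins is notified of pushes and pull requests.
repo.create_hook(
    name="web",
    config={"url": "https://jenkins.example.internal/github-webhook/", "content_type": "json"},
    events=["push", "pull_request"],
    active=True,
)
```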

Where are we now?

As I mentioned earlier, our PaaS has been in production for several years now, and is proving its value every day. Here are some numbers to illustrate:

Speed

  • Developers can create a new project and its corresponding staging and lab environments through CLI commands that create the repo in GitHub, the CI/CD jobs, and more, all from their own computer. Including instance spin-up time, this takes under an hour (with some experience it can go down to under 10 minutes). At that point, even though it's essentially a test app, a merge to master will build, test, and deploy to production, and the service will have an accessible (internal- or external-facing) DNS address.
  • Deployment of a new version takes around 30 minutes from merge to production.

Adoption

  • More than 200 services have been deployed using the PaaS (some were short-term POCs that have since been scrapped), with around 80–90 in production.
  • All of our R&D teams are using the platform. Most have created and manage their own services, and some have even migrated old services to the new platform! Operations teams have also realized they can use it, and so internal tools are also being launched using the PaaS.
  • In the past, a typical feature design document in Kenshoo would only refer to changes in the monolith. Today, designing a new feature usually involves at least two microservices as part of the architecture, and often includes creating a new microservice (or two) as well. Ramp-up time is no longer a factor. Developing in the monolith is still very much a thing, but it's no longer the go-to solution; instead, we focus on what the right architecture or place for the specific feature is.

Other cool stuff

From one starter bootstrap template (Java/Dropwizard, which is still the go-to for many), we expanded our offering to include many more options. Our PaaS now supports creating new repositories and CI/CD flows for:

  1. Python/Flask — mainly, but not only, for our data scientists and BI teams
  2. Scala/Dropwizard
  3. React apps — our Frontend engineers wanted in on the CI/CD action!
  4. Node.js
  5. Nginx — high-performance web servers
  6. Airflow workflows

The success of this internal developer tool inspired more such projects and the formation of infra tooling teams (Frontend, Backend, and Testing). These teams create huge value every day.

The unified infrastructure has also contributed to more “give-back”. We see many contribution suggestions (PRs) to our bootstrap projects. When a team hits an issue or finds a neat solution, these get contributed back to the templates, helping future projects.

Our unified infrastructure has also inspired internal tools and library projects — a job processing-management library, distributed scheduler, and more.

Microcosm TNG (The Next Generation)

We are already hard at work transitioning the underlying infrastructure to Kubernetes. This dovetails well with our operations focus: it gives us much faster roll-outs and scaling, and lets the operations teams achieve better resource utilization. We call this version Kubiverse. It builds on everything we learned from Microcosm and reuses a lot of the services and code we created. Hopefully our blog post about it will come soon enough.
