Our microservice stack
This is an introduction to how we’ve implemented microservices at a mid-size scale-up called Jobteaser, with a mix of Go and Ruby service chassis, gRPC APIs and data replication via Kafka.
Foundation: The service chassis
Back in early 2019, when Jobteaser decided to get serious about breaking up its decade-old Rails monolith into microservices, we assembled a Foundation team that started working on an in-house service chassis.
It was soberly coined `service` and came in two flavours: the `rb-service` framework in Ruby and the `go-service` framework in Go. Four years later, they still form the foundation of our fleet of about 20 services.
They provide a lean set of consistent features across the Ruby and Go flavours:
- a gRPC server component to power our gRPC APIs (more below)
- a Kafka consumer, to consume messages from a Kafka message queue
- a Prometheus exporter, to expose monitoring metrics
- a low-level lib (no ORM) to interact with Postgres, the default database for each service, and optionally Redis for services that require it
- consistent logging, metrics and error reporting for those components
- a bunch of CircleCI, Dockerfile and Kubernetes (k8s) config files to enable automatic deployment to staging and production
- an executable to start the service
- a generator script, to scaffold a new service’s Walking Skeleton with all of the above and get it deployed to prod in less than a day
For a while, the various workloads would all run as threads inside a single process: each instance of a service would run a thread pool for its gRPC server, one for its Kafka consumers, one for its Prometheus exporter, one for its background jobs, etc.
That was great for the local dev environment, because you didn’t need to run a different Docker image for each workload. But in production we quickly split the workload types into different pods, to avoid issues where a buggy background job could take down the API.
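One common way to implement this split is to ship a single binary that selects its workload at startup, so each Kubernetes Deployment runs the same image with a different configuration. The sketch below illustrates the pattern; the `WORKLOAD` env var and function names are illustrative assumptions, not the actual chassis API.

```go
package main

import (
	"fmt"
	"os"
)

// These stubs stand in for the real workload entry points a chassis
// like go-service would provide.
func runGRPCServer()    { fmt.Println("serving gRPC") }
func runKafkaConsumer() { fmt.Println("consuming Kafka topics") }
func runJobs()          { fmt.Println("running background jobs") }

// workloadFor maps a workload name to the entry point to run, so each
// Deployment can run the same image with a different WORKLOAD value,
// isolating a buggy job runner from the API pods.
func workloadFor(name string) (func(), error) {
	switch name {
	case "grpc":
		return runGRPCServer, nil
	case "consumer":
		return runKafkaConsumer, nil
	case "jobs":
		return runJobs, nil
	}
	return nil, fmt.Errorf("unknown workload %q", name)
}

func main() {
	run, err := workloadFor(os.Getenv("WORKLOAD"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	run()
}
```

With this shape, the local dev environment can still run every workload from one image while production keeps them in separate pods.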
Cloud-native Infrastructure
All our services are containerized and run on two Kubernetes clusters, one for production and one for staging, set up and operated with kops on our AWS accounts. Stateful servers like our services’ Postgres databases and Redis are not managed via Kubernetes but provisioned straight in Amazon RDS or Elasticache.
The staging cluster is identical to the prod one, except it runs fewer replicas and slightly less powerful versions of the service pods.
We do Continuous Deployment: whenever we merge a feature branch to master, our CI builds a new Docker image and deploys it to staging then production with the help of helm.
We also have a script that lets us deploy a service’s Git branch to staging to test changes ahead of merging them. But the next merge on master will trigger a deploy that will override it, so prod and staging always end up running the same code again.
Our secrets are stored in Vault. Our monitoring and alerting are handled by Prometheus, Grafana and Loki. GitOps-deployed Terraform templates describe our cloud resource provisioning (RDS databases, S3 buckets) as well as our k8s and GitHub teams and permissions.
APIs: gRPC and Protobuf
You may have noticed one thing missing in my list of features for our service chassis: an http server.
By default, all our APIs are served over Google’s gRPC and Protobuf, rather than the more common http/json.
gRPC serializes the API request and response payloads using a binary format called Protobuf, which forces us to define our APIs’ request and response formats in `.proto` files using Protobuf’s Interface Description Language.
So when we want to create a new endpoint, the first step is always to make a Pull Request to our `proto` GitHub repository, defining a new `rpc` line for the new endpoint, along with the types of its request and response formats.
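Such a Pull Request might add a few lines along these lines. All the names below (package, service, messages) are made up for illustration, not taken from the actual `proto` repo:

```protobuf
syntax = "proto3";

package jobteaser.jobs.v1; // hypothetical package name

// Hypothetical endpoint definition: a new rpc line plus its
// request and response message types.
service JobService {
  rpc SearchJobs(SearchJobsRequest) returns (SearchJobsResponse);
}

message SearchJobsRequest {
  string query = 1;
  int32 page_size = 2;
}

message SearchJobsResponse {
  repeated Job jobs = 1;
}

message Job {
  string id = 1;
  string title = 2;
}
```

Once merged, code generation produces the client and server stubs for both the Ruby and Go services.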
Benefits
The main perks of this setup come from the centralized `proto` repo holding all the API definitions:
- it gives us a centralized catalog documenting all our API endpoints, necessarily up-to-date
- every API change starts with a discussion of the interface in a dedicated Pull Request that both front and backend devs can review
- linters on the Proto repo catch any breaking change, forcing us into proper API versioning practices and enforcing naming conventions
We also get performance benefits from gRPC’s persistent http2 connections and Protobuf’s small binary payloads.
Costs
The Protobuf step adds a bit of work to every API change: you need to make a PR to the `proto` repo, wait for the CI to generate the `proto-ruby` and `proto-go` library releases, then upgrade the libs in the client and server services.
You also pay a small exoticism tax: gRPC is not as widespread as REST or GraphQL, which means you don’t get quite as much tooling and integrations. There’s a bit of an onboarding curve for newcomers. And off-the-shelf components do not always offer a gRPC interface so you can’t always preserve its performance benefits across your whole infrastructure.
Finally, gRPC complicates load-balancing. Persistent http2 connections are tricky to load-balance, and support in infrastructure tools like Kubernetes services has been slow to catch up. So far we’ve resorted to DNS-based client-side load-balancing via Kubernetes headless services, which doesn’t share the load optimally and requires retries to handle failovers. We’re considering adopting a service mesh to support more advanced gRPC load-balancing, among other features, but it comes at the cost of more complexity.
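The client-side approach boils down to resolving the headless service name, which returns one A record per ready pod, and rotating over the resulting addresses. A minimal sketch of that rotation, with hardcoded addresses standing in for the DNS lookup:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// picker does naive round-robin over the pod addresses returned by a
// headless-service DNS lookup. This is a sketch of the client-side
// load-balancing described above, not the chassis' actual code.
type picker struct {
	addrs []string
	next  uint64
}

// pick returns the next address in round-robin order; atomic so it is
// safe to call from concurrent request goroutines.
func (p *picker) pick() string {
	n := atomic.AddUint64(&p.next, 1)
	return p.addrs[(n-1)%uint64(len(p.addrs))]
}

func main() {
	// In the cluster, addrs would come from resolving the headless
	// service name, e.g. net.LookupHost("my-service.ns.svc.cluster.local").
	p := &picker{addrs: []string{"10.0.0.1:50051", "10.0.0.2:50051"}}
	for i := 0; i < 3; i++ {
		fmt.Println("dialing", p.pick())
	}
}
```

The weaknesses mentioned above follow directly from this shape: the client only rebalances when it re-resolves DNS, so new pods are picked up late and failed pods need retries.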
The API Gateway
gRPC works great for communication between persistent services, but it wasn’t designed to work in the browser. We still want our frontend React Single-Page Applications (SPA) to query the gRPC API endpoints of our backend services using good old http/json requests.
We also need to think about access control. Some API endpoints should only be called inside our infrastructure and not be exposed to the world. Others must verify that the request comes from an authorized user.
So our Foundation team implemented an API Gateway in Go, soberly named `apigw`, built on top of the `go-service` chassis, and performing the following duties:
- Conversion of the incoming http/json requests of the frontend into a gRPC/Protobuf request that can be dispatched to the gRPC endpoint of a target service (and conversion of the resulting response from proto to json)
- Identification, by checking the incoming request’s session cookie against our authentication service, and getting the corresponding `user_id`
- Routing to the right service’s endpoint, by checking the request’s path against a whitelist of endpoints in config files, and performing a gRPC/proto request decorated with the `user_id` in the request’s metadata, so the target service can identify the user making the request.
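The routing step can be sketched as a whitelist lookup that yields a target gRPC method plus the metadata to attach. Everything here is illustrative: the paths, method names and map-based config are assumptions, not the actual `apigw` code.

```go
package main

import "fmt"

// route describes one whitelisted endpoint: which gRPC method a given
// HTTP path maps to. In the real apigw this comes from config files;
// the entries below are made up for illustration.
type route struct {
	service string // target service
	method  string // full gRPC method name
}

var whitelist = map[string]route{
	"/api/jobs/search": {service: "jobs", method: "/jobteaser.jobs.v1.JobService/SearchJobs"},
}

// dispatch rejects paths that are not whitelisted, and for known paths
// returns the target route plus the metadata the gateway would attach
// to the outgoing gRPC request (the authenticated user_id).
func dispatch(path, userID string) (route, map[string]string, error) {
	r, ok := whitelist[path]
	if !ok {
		return route{}, nil, fmt.Errorf("no route for %s", path)
	}
	return r, map[string]string{"user_id": userID}, nil
}

func main() {
	r, md, err := dispatch("/api/jobs/search", "42")
	if err != nil {
		panic(err)
	}
	fmt.Println(r.method, "user_id:", md["user_id"])
}
```

The whitelist doubles as the access-control boundary: any endpoint absent from the config files is simply unreachable from the outside.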
Like all our other services, the `apigw` needs the latest version of our proto library to send gRPC requests to the new endpoints we add. That part was automated: whenever we merge an update to our `proto` repo, our CI triggers an upgrade of the `apigw` with the newly generated release of the `proto-go` library and deploys it to production.
Product services and tech bricks
We tend to classify our services in two broad categories:
- Business services: contain business logic and data, roughly map to modules or domains of our product, built around one main business entity: one service for our job board, one for our career appointment booking module, one for our career event management platform…
- Tech bricks: utility services that hold little to no business logic, a bit like libraries that could almost be open-sourced, and typically serve as interfaces between our infrastructure and clients or external services.
Examples of tech bricks include:
- our API Gateway that handles incoming API requests from our clients
- our transactional email service that sends emails via an external service (Sendgrid) and handles templating, i18n, scheduling and retries, so we don’t have to integrate libs for those concerns into each service. Business services can just make gRPC calls to that service using a generic API
- our CMS service, which interfaces with a headless Content Management System (Kontent.ai) to expose articles produced by our internal teams of guidance experts in the Jobteaser product via gRPC APIs
Granularity
Finding the right seams along which to break down a system is one of the biggest challenges of a service-oriented architecture.
We’re not (yet) a tech giant, and despite all the tooling we deployed over the years, creating and maintaining a new service still involves a non-trivial amount of overhead for us, so we had to take a pragmatic approach to keep efficiently delivering value to our users.
That’s why we haven’t gone very micro with our services. A more accurate way to describe our architecture would be “Service-Oriented”, though “microservice” has come to convey the general idea more broadly. A lot of our services are actually small modular monoliths that include several submodules, and we’re intentionally staying very far from the oft-touted ideal of “services you can rewrite in 2 weeks”.
The biggest drawback is that we sometimes end up with several teams working at the same time on a given service, raising similar issues to those you may encounter on a monolith, though at a lesser scale and frequency. This seems like an acceptable trade-off at our current scale, though it may evolve in the future.
Integration: sync and async
We use the database-per-service pattern for isolation so each business service has its own database. That way, if one service is down, it shouldn’t keep the other modules from functioning. And each service can change its internal database schema without impacting the others, as long as we avoid breaking changes in the service’s APIs.
Then of course we often need to integrate data from several services to produce useful features, and that constitutes one of our main challenges.
Communication between services is achieved by a mix of synchronous gRPC APIs, and asynchronous communication via Kafka.
The messages we push into Kafka to propagate changes are also encoded using Protobuf, which provides an API contract for those async messages.
So you can think of gRPC as the service’s synchronous API, and Kafka as its asynchronous API, and the two combined form each service’s public API.
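An event published to Kafka therefore gets a Protobuf definition in the same repo as the gRPC endpoints. A hypothetical example (the package and field names are illustrative, not from the actual `proto` repo):

```protobuf
syntax = "proto3";

package jobteaser.jobs.events.v1; // hypothetical package name

import "google/protobuf/timestamp.proto";

// Hypothetical event published to Kafka when a job posting goes live.
// Consumers decode it with the same generated proto-go / proto-ruby
// code they already use for the gRPC APIs.
message JobPublished {
  string job_id = 1;
  string title = 2;
  google.protobuf.Timestamp published_at = 3;
}
```

Because the event schema lives in the central `proto` repo, the same linting and review process that guards the synchronous APIs also guards the asynchronous ones.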
But integration is such a vast topic that we’ll dig into it in a subsequent post.
Stay tuned!