Our microservice stack
This is an introduction to how we’ve implemented microservices at a mid-size scale-up called Jobteaser, with a mix of Go and Ruby service chassis, gRPC APIs and data replication via Kafka.
Foundation: The service chassis
Back in early 2019, when Jobteaser decided to get serious about breaking up its decade-old Rails monolith into microservices, we assembled a Foundation team that started working on an in-house service chassis.
It was soberly coined `service` and came in two flavours: the `rb-service` framework in Ruby and the `go-service` framework in Go. Four years later, they still form the foundation of our fleet of about 20 services.
They provide a lean set of consistent features across the Ruby and Go flavours:
- a gRPC server component to power our gRPC APIs (more below)
- a Kafka consumer, to consume messages from a Kafka message queue
- a Prometheus exporter, to expose monitoring metrics
- a low-level lib (no ORM) to interact with Postgres, the default database for each service, and optionally Redis for services that require it
- consistent logging, metrics and error reporting for those components
- a bunch of CircleCI, Dockerfile and Kubernetes (k8s) config files to enable automatic deployment to staging and production
- an executable to start the service
- a generator script, to scaffold a new service’s Walking Skeleton with all of the above and get it deployed to prod in less than a day
For a while, the various workloads would all run as threads inside a single process: each instance of a service would run a thread pool for its gRPC server, one for its Kafka consumers, one for its Prometheus exporter, one for its background jobs, etc.
That was great for the local dev environment, because you didn’t need to run a different Docker image for each workload. But in production we quickly split the workload types into different pods, to avoid issues where a buggy background job could take down the API.
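One common way to implement this split is to ship a single binary that selects its workload at startup, so each Kubernetes Deployment runs the same image with a different configuration. The sketch below illustrates the pattern; the `WORKLOAD` env var and function names are illustrative assumptions, not the actual chassis API.

```go
package main

import (
	"fmt"
	"os"
)

// These stubs stand in for the real workload entry points a chassis
// like go-service would provide.
func runGRPCServer()    { fmt.Println("serving gRPC") }
func runKafkaConsumer() { fmt.Println("consuming Kafka topics") }
func runJobs()          { fmt.Println("running background jobs") }

// workloadFor maps a workload name to the entry point to run, so each
// Deployment can run the same image with a different WORKLOAD value,
// isolating a buggy job runner from the API pods.
func workloadFor(name string) (func(), error) {
	switch name {
	case "grpc":
		return runGRPCServer, nil
	case "consumer":
		return runKafkaConsumer, nil
	case "jobs":
		return runJobs, nil
	}
	return nil, fmt.Errorf("unknown workload %q", name)
}

func main() {
	run, err := workloadFor(os.Getenv("WORKLOAD"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	run()
}
```

With this shape, the local dev environment can still run every workload from one image while production keeps them in separate pods.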
Cloud-native Infrastructure
All our services are containerized and run on two Kubernetes clusters, one for production and one for staging, set up and operated with kops on our AWS accounts. Stateful servers like our services’ Postgres databases and Redis are not managed via Kubernetes but provisioned straight in Amazon RDS or Elasticache.
The staging cluster is identical to the prod one, except it runs fewer replicas and slightly less powerful versions of the service pods.
We do Continuous Deployment: whenever we merge a feature branch to master, our CI builds a new Docker image and deploys it to staging then production with the help of helm.
We also have a script that lets us deploy a service’s Git branch to staging to test changes ahead of merging them. But the next merge on master will trigger a deploy that will override it, so prod and staging always end up running the same code again.
Our secrets are stored in Vault. Our monitoring and alerting are handled by Prometheus, Grafana and Loki. GitOps-deployed Terraform templates describe our cloud resource provisioning (RDS databases, S3 buckets) as well as our k8s and GitHub teams and permissions.
APIs: gRPC and Protobuf
You may have noticed one thing missing in my list of features for our service chassis: an http server.
By default, all our APIs are served over Google’s gRPC and Protobuf, rather than the more common http/json.
gRPC serializes the API request and response payloads using a binary format called Protobuf, which forces us to define our APIs’ request and response formats in `.proto` files using Protobuf’s Interface Description Language.
So when we want to create a new endpoint, the first step is always to make a Pull Request to our `proto` GitHub repository, defining a new `rpc` line for the new endpoint, along with the types of its request and response formats.
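Such a Pull Request might add a few lines along these lines. All the names below (package, service, messages) are made up for illustration, not taken from the actual `proto` repo:

```protobuf
syntax = "proto3";

package jobteaser.jobs.v1; // hypothetical package name

// Hypothetical endpoint definition: a new rpc line plus its
// request and response message types.
service JobService {
  rpc SearchJobs(SearchJobsRequest) returns (SearchJobsResponse);
}

message SearchJobsRequest {
  string query = 1;
  int32 page_size = 2;
}

message SearchJobsResponse {
  repeated Job jobs = 1;
}

message Job {
  string id = 1;
  string title = 2;
}
```

Once merged, code generation produces the client and server stubs for both the Ruby and Go services.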
Benefits
The main perks of this setup come from the centralized `proto` repo holding all the API definitions:
- it gives us a centralized catalog documenting all our API endpoints, necessarily up-to-date
- every API change starts with a discussion of the interface in a dedicated Pull Request that both front and backend devs can review
- linters on the Proto repo catch any breaking change, forcing us into proper API versioning practices and enforcing naming conventions
We also get performance benefits from gRPC’s persistent http2 connections and Protobuf’s small binary payloads.
Costs
The Protobuf step adds a bit of work to every API change: you need to make a PR to the `proto` repo, wait for the CI to generate the `proto-ruby` and `proto-go` library releases, then upgrade the libs in the client and server services.
You also pay a small exoticism tax: gRPC is not as widespread as REST or GraphQL, which means you don’t get quite as much tooling and integrations. There’s a bit of an onboarding curve for newcomers. And off-the-shelf components do not always offer a gRPC interface so you can’t always preserve its performance benefits across your whole infrastructure.
Finally, gRPC complicates load-balancing. Persistent http2 connections are tricky to load-balance, and support in infrastructure tools like Kubernetes services has been slow to catch up. So far we’ve resorted to DNS-based client-side load-balancing via Kubernetes headless services, which doesn’t share the load optimally and requires retries to handle failovers. We’re considering adopting a service mesh to support more advanced gRPC load-balancing, among other features, but it comes at the cost of more complexity.
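The client-side approach boils down to resolving the headless service name, which returns one A record per ready pod, and rotating over the resulting addresses. A minimal sketch of that rotation, with hardcoded addresses standing in for the DNS lookup:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// picker does naive round-robin over the pod addresses returned by a
// headless-service DNS lookup. This is a sketch of the client-side
// load-balancing described above, not the chassis' actual code.
type picker struct {
	addrs []string
	next  uint64
}

// pick returns the next address in round-robin order; atomic so it is
// safe to call from concurrent request goroutines.
func (p *picker) pick() string {
	n := atomic.AddUint64(&p.next, 1)
	return p.addrs[(n-1)%uint64(len(p.addrs))]
}

func main() {
	// In the cluster, addrs would come from resolving the headless
	// service name, e.g. net.LookupHost("my-service.ns.svc.cluster.local").
	p := &picker{addrs: []string{"10.0.0.1:50051", "10.0.0.2:50051"}}
	for i := 0; i < 3; i++ {
		fmt.Println("dialing", p.pick())
	}
}
```

The weaknesses mentioned above follow directly from this shape: the client only rebalances when it re-resolves DNS, so new pods are picked up late and failed pods need retries.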
The API Gateway
gRPC works great for communication between persistent services, but it wasn’t designed to work in the browser. We still want our frontend React Single-Page Applications (SPA) to query the gRPC API endpoints of our backend services using good old http/json requests.
We also need to think about access control. Some API endpoints should only be called inside our infrastructure and not be exposed to the world. Others must verify that the request comes from an authorized user.
So our Foundation team implemented an API Gateway in Go, soberly named `apigw`, built on top of the `go-service` chassis, and performing the following duties:
- Conversion of the incoming http/json requests of the frontend into a gRPC/Protobuf request that can be dispatched to the gRPC endpoint of a target service (and conversion of the resulting response from proto to json)
- Identification, by checking the incoming request’s session cookie against our authentication service, and getting the corresponding `user_id`
- Routing to the right service’s endpoint, by checking the request’s path against a whitelist of endpoints in config files, and performing a gRPC/proto request decorated with the `user_id` in the request’s metadata, so the target service can identify the user making the request.
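The routing step can be sketched as a whitelist lookup that yields a target gRPC method plus the metadata to attach. Everything here is illustrative: the paths, method names and map-based config are assumptions, not the actual `apigw` code.

```go
package main

import "fmt"

// route describes one whitelisted endpoint: which gRPC method a given
// HTTP path maps to. In the real apigw this comes from config files;
// the entries below are made up for illustration.
type route struct {
	service string // target service
	method  string // full gRPC method name
}

var whitelist = map[string]route{
	"/api/jobs/search": {service: "jobs", method: "/jobteaser.jobs.v1.JobService/SearchJobs"},
}

// dispatch rejects paths that are not whitelisted, and for known paths
// returns the target route plus the metadata the gateway would attach
// to the outgoing gRPC request (the authenticated user_id).
func dispatch(path, userID string) (route, map[string]string, error) {
	r, ok := whitelist[path]
	if !ok {
		return route{}, nil, fmt.Errorf("no route for %s", path)
	}
	return r, map[string]string{"user_id": userID}, nil
}

func main() {
	r, md, err := dispatch("/api/jobs/search", "42")
	if err != nil {
		panic(err)
	}
	fmt.Println(r.method, "user_id:", md["user_id"])
}
```

The whitelist doubles as the access-control boundary: any endpoint absent from the config files is simply unreachable from the outside.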
Like all our other services, the `apigw` needs the latest version of our proto library to send gRPC requests to the new endpoints we add. That part was automated: whenever we merge an update to our `proto` repo, our CI triggers an upgrade of the `apigw` with the newly generated release of the `proto-go` library and deploys it to production.
Product services and tech bricks
We tend to classify our services in two broad categories:
- Business services: contain business logic and data, roughly map to modules or domains of our product, built around one main business entity: one service for our job board, one for our career appointment booking module, one for our career event management platform…
- Tech bricks: utility services that hold little to no business logic, a bit like libraries that could almost be open-sourced, and typically serve as interfaces between our infrastructure and clients or external services.
Examples of tech bricks include:
- our API Gateway that handles incoming API requests from our clients
- our transactional email service that sends emails via an external service (Sendgrid) and handles templating, i18n, scheduling and retries, so we don’t have to integrate libs for those concerns into each service. Business services can just make gRPC calls to that service using a generic API
- our CMS service, which interfaces with a headless Content Management System (Kontent.ai) to expose articles produced by our internal teams of guidance experts in the Jobteaser product via gRPC APIs
Granularity
Finding the right seams along which to break down a system is one of the biggest challenges of a service-oriented architecture.
We’re not (yet) a tech giant, and despite all the tooling we deployed over the years, creating and maintaining a new service still involves a non-trivial amount of overhead for us, so we had to take a pragmatic approach to keep efficiently delivering value to our users.
That’s why we haven’t gone very micro with our services. A more accurate way to describe our architecture would be “Service-Oriented”, though “microservice” has come to convey the general idea more broadly. A lot of our services are actually small modular monoliths that include several submodules, and we’re intentionally staying very far from the oft-touted ideal of “services you can rewrite in 2 weeks”.
The biggest drawback is that we sometimes end up with several teams working at the same time on a given service, raising similar issues to those you may encounter on a monolith, though at a lesser scale and frequency. This seems like an acceptable trade-off at our current scale, though it may evolve in the future.
Integration: sync and async
We use the database-per-service pattern for isolation so each business service has its own database. That way, if one service is down, it shouldn’t keep the other modules from functioning. And each service can change its internal database schema without impacting the others, as long as we avoid breaking changes in the service’s APIs.
Then of course we often need to integrate data from several services to produce useful features, and that constitutes one of our main challenges.
Communication between services is achieved by a mix of synchronous gRPC APIs, and asynchronous communication via Kafka.
The messages we push into Kafka to propagate changes are also encoded using Protobuf, which provides an API contract for those async messages.
So you can think of gRPC as the service’s synchronous API, and Kafka as its asynchronous API, and the two combined form each service’s public API.
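An event published to Kafka therefore gets a Protobuf definition in the same repo as the gRPC endpoints. A hypothetical example (the package and field names are illustrative, not from the actual `proto` repo):

```protobuf
syntax = "proto3";

package jobteaser.jobs.events.v1; // hypothetical package name

import "google/protobuf/timestamp.proto";

// Hypothetical event published to Kafka when a job posting goes live.
// Consumers decode it with the same generated proto-go / proto-ruby
// code they already use for the gRPC APIs.
message JobPublished {
  string job_id = 1;
  string title = 2;
  google.protobuf.Timestamp published_at = 3;
}
```

Because the event schema lives in the central `proto` repo, the same linting and review process that guards the synchronous APIs also guards the asynchronous ones.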
But integration is such a vast topic that we’ll dig into it in a subsequent post.
Stay tuned!