Microservices at gutefrage.net — Part 2 — Continuous Integration and Deployment

Across our projects, and especially in our services, continuous integration and deployment are 100% automated. Today I’m going to show you how.

First of all, let’s define what our CI/CD has to fulfill:

  • Each service must be independently deployable
  • Every change (git push) must be automatically built and deployed
  • We work only on master to avoid merge/integration hell
  • Tests must be run
  • Changes must be tested automatically for backwards compatibility

Our build steps are tied together into a pipeline. As an example, let us consider the pipeline of our “question service”.

Trigger

Every “git push” to “origin” in the question service’s Git repository triggers this pipeline. All our pipelines are configured in Jenkins.

Step 1 — build question service

This step checks out the code, compiles it, runs the unit tests and packages it with all its dependencies as a runnable jar. If everything works and all tests pass, it uploads two jar files to our software repository: “questionservice-vXXX.jar” (the service) and “questionservice-integrationtests-vXXX.jar” (the integration tests). The later steps need both jars.

Step 2 — deploy canary non live

This step downloads the “questionservice-vXXX.jar” and deploys it to our production Mesos cluster, marked as “canary — not live”, so it doesn’t get live traffic. If the service starts and passes the health checks (status 200 on GET /admin/healthcheck), the step succeeds.
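A minimal sketch of what such a health check gate could look like in Java. Only the path /admin/healthcheck and the expected status 200 come from the description above; the class name, the timeouts, the number of attempts and the pause between attempts are assumptions made for illustration, not our actual deployment code.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.function.IntSupplier;

// Hypothetical helper for illustration only.
final class CanaryHealthCheck {

    // Returns the HTTP status of GET <baseUrl>/admin/healthcheck,
    // or -1 if the request fails (e.g. the service is not up yet).
    static int probe(String baseUrl) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                    URI.create(baseUrl + "/admin/healthcheck").toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(2000); // assumed timeout in milliseconds
            conn.setReadTimeout(2000);
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            return -1;
        }
    }

    // Polls the probe until it reports 200 or the attempts are used up.
    static boolean waitUntilHealthy(IntSupplier statusProbe, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (statusProbe.getAsInt() == 200) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
        return false;
    }
}
```

A deployment script could then call something like `CanaryHealthCheck.waitUntilHealthy(() -> CanaryHealthCheck.probe(canaryUrl), 30, 1000)` and fail the step if the result is false.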

Step 3 — integration tests question service

This step downloads and executes “questionservice-integrationtests-vXXX.jar” with a configuration which tells the tests to use the running canary instance (from Step 2) of the question service. The integration tests treat the service as a black box. Their goal is to hit every endpoint and every external dependency, like the database or another service which is used to handle the request, at least once. The tests run against the live database, but with a dedicated test “vertical”.

Verticals: We have several pages which behave similarly but cover different themes, like gutefrage (good question), autofrage (car question) or finanzfrage (finance question). All these verticals are handled by the same service, and each request contains the vertical. For the integration tests we have a vertical called testfrage (test question).

Step 4 — regression tests

This step ensures that no incompatible changes are made, which means that every “user” of the service (which might be another service or a UI) can still make its requests. To test this, we simply run the latest stable integration tests of all dependent components. These tests are also started with a configuration which ensures that the canary instance (again from Step 2) of the question service is used.
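The idea of this step can be sketched in a few lines of Java. This is only an illustration of the control flow: the class name, the jar names in the usage note and the `runTests` callback (which would start a test jar against a given URL) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Illustrative sketch, not our actual Jenkins job.
final class RegressionStep {

    // Runs each consumer's integration-test jar against the canary and
    // returns the jars whose tests failed (an empty list means compatible).
    static List<String> failedSuites(List<String> consumerTestJars,
                                     String canaryUrl,
                                     BiPredicate<String, String> runTests) {
        List<String> failed = new ArrayList<>();
        for (String jar : consumerTestJars) {
            if (!runTests.test(jar, canaryUrl)) {
                failed.add(jar);
            }
        }
        return failed;
    }
}
```

If the returned list is not empty, the change has broken at least one consumer and the pipeline stops before the canary gets live traffic.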

How do we deploy an incompatible change? We have to do it in 3 steps. It’s a bit more work, but it doesn’t have to be done often.

  1. Duplicate the endpoint which needs an incompatible change and make the change on the duplicate. Mark the old one as deprecated.
  2. Update every client so that it uses the new endpoint.
  3. Remove the old endpoint.
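As an illustration of step 1, here is a small Java sketch. The interface, the method names and the example change (tags returned as a list instead of a comma-separated string) are made up for this example and are not the real API of the question service.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical API, for illustration only.
interface QuestionApi {

    // Old endpoint: stays untouched until every client has migrated.
    @Deprecated
    String getTags(long questionId);

    // Duplicate with the incompatible change: a list instead of one string.
    List<String> getTagsV2(long questionId);
}

final class InMemoryQuestionApi implements QuestionApi {

    @Override
    @Deprecated
    public String getTags(long questionId) {
        // The old endpoint can delegate to the new one while both exist.
        return String.join(",", getTagsV2(questionId));
    }

    @Override
    public List<String> getTagsV2(long questionId) {
        return Arrays.asList("cars", "engine"); // fixed sample data
    }
}
```

While both endpoints exist, old and new clients keep working, so the regression tests of Step 4 stay green during the whole migration. Only after the last client has switched is `getTags` removed.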

Step 5 — deploy canary live

Now the canary instance gets live traffic and we monitor some metrics like error rate, CPU usage and memory usage. If the metrics get too bad, the deployment fails automatically; otherwise it continues.

Caution: After this step, still only one instance with the new version is in production; all the other instances run the previous version.
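The metric check of this step boils down to comparing the canary’s metrics against limits. A minimal sketch in Java, where the class name and all threshold values are assumptions chosen for illustration (the real limits depend on the service):

```java
// Hypothetical gate for illustration; the thresholds are assumed values.
final class CanaryGate {

    static final double MAX_ERROR_RATE = 0.01;   // at most 1% failed requests
    static final double MAX_CPU_USAGE = 0.80;    // at most 80% CPU
    static final double MAX_MEMORY_USAGE = 0.90; // at most 90% memory

    // True if the canary may proceed to the full rollout.
    static boolean passes(double errorRate, double cpuUsage, double memoryUsage) {
        return errorRate <= MAX_ERROR_RATE
                && cpuUsage <= MAX_CPU_USAGE
                && memoryUsage <= MAX_MEMORY_USAGE;
    }
}
```

A variation would be to compare the canary with the instances still running the old version instead of using fixed limits.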

Step 6 — deploy production

Now, finally, the configured number of instances with the new version is started on the production cluster. Once all instances are started and their health checks are OK, the instances with the old version are shut down.
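The rule "shut down the old version only once every new instance is healthy" can be sketched like this in Java. The class and method names are hypothetical, and keeping the old instances running when a health check fails is an assumption of this sketch.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch of the switch-over decision.
final class ProductionRollout {

    // Returns the instances that may be shut down: all old instances if every
    // new instance is healthy, otherwise none (assumed failure behaviour).
    static List<String> instancesToShutDown(List<String> oldInstances,
                                            List<String> newInstances,
                                            Predicate<String> isHealthy) {
        for (String instance : newInstances) {
            if (!isHealthy.test(instance)) {
                return Collections.emptyList(); // keep serving with the old version
            }
        }
        return oldInstances;
    }
}
```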

Step 7 — upload stable integration tests

After everything is done and working, we mark the new version as the “current stable” and upload the “questionservice-integrationtests-vXXX.jar”, so it can be used for regression tests.
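Taken together, the seven steps form a chain in which each step only runs if all previous ones succeeded. In our setup Jenkins does this wiring; the following Java sketch only illustrates the control flow, and the `Pipeline` class and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BooleanSupplier;

// Illustrative sketch of a fail-fast pipeline.
final class Pipeline {

    // LinkedHashMap keeps the steps in the order they were added.
    private final Map<String, BooleanSupplier> steps = new LinkedHashMap<>();

    Pipeline step(String name, BooleanSupplier action) {
        steps.put(name, action);
        return this;
    }

    // Runs the steps in order and stops at the first failure.
    // Returns the name of the failed step, or empty if all steps passed.
    Optional<String> firstFailedStep() {
        for (Map.Entry<String, BooleanSupplier> step : steps.entrySet()) {
            if (!step.getValue().getAsBoolean()) {
                return Optional.of(step.getKey());
            }
        }
        return Optional.empty();
    }
}
```

With this shape, a failing "deploy canary non live" step means that the integration tests and everything after them are never started.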

Stop! Working just on master and every push goes immediately to production? Am I crazy?

One important thing I haven’t mentioned yet is feature toggling. A feature toggle is basically an if-else statement whose branch is selected by configuration.

    if (featureToggles.isActive("MY-NEW-FEATURE")) {
        // new code
    } else {
        // old code
    }

We build every new feature wrapped in a feature toggle which is inactive by default, so if someone pushes unfinished code, it is deployed to production but is not live. Once development is finished, we can switch the toggle to “active” by changing the configuration. This also allows us to roll out the feature step by step, for example:

  • Active only for the internal office IP
  • Active only for a specific user group
  • Active for a particular percentage of the users
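A minimal sketch of how such rollout stages could be evaluated, in Java. The `RolloutRule` class, its fields and the hash-based bucketing are assumptions for illustration and not the toggle library we actually use.

```java
import java.util.Set;

// Hypothetical rollout rule for a single feature toggle.
final class RolloutRule {

    private final Set<String> allowedIps;    // e.g. the internal office IP
    private final Set<String> allowedGroups; // e.g. a specific user group
    private final int percentage;            // 0..100 percent of all users

    RolloutRule(Set<String> allowedIps, Set<String> allowedGroups, int percentage) {
        this.allowedIps = allowedIps;
        this.allowedGroups = allowedGroups;
        this.percentage = percentage;
    }

    // The feature is active if any of the three conditions matches.
    boolean isActiveFor(String clientIp, String userGroup, String userId) {
        return allowedIps.contains(clientIp)
                || allowedGroups.contains(userGroup)
                || bucketOf(userId) < percentage;
    }

    // Maps a user to a stable bucket in 0..99, so the same user always
    // gets the same answer for a given percentage.
    static int bucketOf(String userId) {
        return Math.floorMod(userId.hashCode(), 100);
    }
}
```

Each rollout stage is then just a configuration change: first only the office IP, then a user group, then a growing percentage, until the rule is replaced by a plain "active".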

Once the feature has run stably in production for a while, it’s very important to clean up the feature toggle, which means removing the toggle and the obsolete code.

Summary

Finally, let’s check whether we achieved our goals:

  • Each service is automatically built, deployed and tested independently
  • Backwards compatibility is tested with regression tests
  • By using feature toggles we are able to work only on master and continuously integrate our changes

This approach works very well for us and we are able to deliver our changes in very small portions. If you have any questions, feel free to leave a comment.