Why we chose to try a Service Mesh on OpenShift

Irori
Nov 11, 2022

This is the first part of a series of three blog posts about a Service Mesh in the making. The first post is about the Why: a lightweight look at the traits that made us consider using a Service Mesh. The second is about the How: a simple walkthrough of what was needed for a POC implementation. The last is about the experience: did it provide value? Was it easy?

The good old environment

The first time I ran into the concept of Service Mesh was a few years back. At the time I was working for a customer running a stable but not-so-flexible environment. We had a number of microservices distributed over a few Linux VMs behind load balancers. I will not go into the details, but to give the general idea:

  • Fairly modern application architecture, with the exception of some rarely changed relics
  • No containers, no Kubernetes, just plain Java applications running as services on Linux
  • Outsourced infrastructure with week-long latency on change requests
  • Homegrown API gateway based on Nginx for OAuth validation, access logging and routing

The thought of the Service Mesh was born

The main reason we started to dive into Service Mesh resources was the aim of decoupling everything related to communication from the applications. Instead of constantly working with keystores and settings in the applications, we could offload that onto the mesh. Our own API gateway was very flexible, but the config files were growing large and hard to manage. At the time, we never found a viable way to benefit from a Service Mesh on the infrastructure we had. The Service Mesh, or more specifically Istio, would remain a dream for some time. Meanwhile, we kept it on our radar as a promising technology and made mental notes on areas where we could benefit from it, in case the dream ever came true.

OpenShift and the Service Mesh

As time went by, a decision was made to build a brand new Red Hat OpenShift platform in the cloud and migrate the majority of the applications to it. This made the idea of Istio a lot more promising. Istio, or a slightly altered version of it, is included as part of the OpenShift platform under the name OpenShift Service Mesh, but for simplicity we will stick with the name Istio for the remainder of this post.

Anyway, let’s not get ahead of ourselves. The transformation from the bare-metal environment to OpenShift provides a lot of new functionality in itself. Is Istio really needed?

Let’s look at a few different use cases and compare Istio with OpenShift as well as with the legacy solution. A little disclaimer, though: this does not cover all the reasoning behind our decisions, and it is far from the only way of solving these problems, but it covers the small set of Istio features that was important to us.

Exposing APIs

In the old solution, we managed templates in Ansible that allowed us to add new services quite easily. We kept extending those templates to support configurable security checks and header management as we needed, all provided by OpenResty, an Nginx-based distribution.

In vanilla OpenShift, the easiest way was to just expose a service by adding a Route to the Router, but that is rather blunt and does not come with the same flexibility as our Ansible templates. We didn’t see the Router as a great option, at least not by itself.
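For reference, exposing a service through the Router boils down to a single Route object. A minimal sketch, with hypothetical names and hostname:

```
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: orders                       # hypothetical example service
  namespace: demo
spec:
  host: orders.apps.example.com      # hostname served by the OpenShift Router
  to:
    kind: Service
    name: orders                     # the Service to expose
  port:
    targetPort: 8080
  tls:
    termination: edge                # terminate TLS at the Router
```

This exposes the whole service under the hostname; path selection, header handling and token validation have to live somewhere else.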

Adding Istio would give us a broad feature set where most of the Ansible magic would be included. What could not be solved in Istio was probably better placed elsewhere anyway. It would be easy to control which paths of an application to expose, and under what hostname. It also turned out to be easy to do basic token validation.
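To give a feel for what that looks like, here is a minimal sketch of how such an exposure could be declared. All names, hostnames and the identity provider are made up for the example, and in OpenShift Service Mesh the ingress gateway is in turn typically exposed through a Route:

```
# Accept traffic for a given hostname on the mesh ingress gateway
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: orders-gateway
  namespace: demo
spec:
  selector:
    istio: ingressgateway            # default ingress gateway label
  servers:
    - hosts:
        - orders.apps.example.com
      port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: orders-cert  # TLS secret available to the gateway
---
# Only expose selected paths of the application under that hostname
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
  namespace: demo
spec:
  hosts:
    - orders.apps.example.com
  gateways:
    - orders-gateway
  http:
    - match:
        - uri:
            prefix: /api/orders      # only this path is exposed
      route:
        - destination:
            host: orders.demo.svc.cluster.local
            port:
              number: 8080
---
# Basic JWT validation at the mesh edge
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: orders-jwt
  namespace: istio-system            # applied to the ingress gateway workload
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
    - issuer: https://idp.example.com
      jwksUri: https://idp.example.com/.well-known/jwks.json
```

Note that a RequestAuthentication on its own only rejects invalid tokens; requiring a token to be present at all is done with an accompanying AuthorizationPolicy.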

Managing outbound traffic

For security reasons you probably want to limit outbound traffic and block unknown destinations. If your external destinations have fairly static IPs, this is easily done in the firewall. However, nowadays we use more and more SaaS in the cloud, and it is not uncommon for them to have IP ranges that change over time. Keeping track of those is cumbersome. Although not watertight, one way of mitigating this is to whitelist approved DNS names through a proxy.

In our old solution, we had to go through a transparent proxy for outbound traffic, where hosts needed to be whitelisted first. Whitelisting was pretty quick but required restarting the proxy nodes. The proxies themselves were allowed to communicate with any host on standard HTTPS ports. One of the problems with this was that we needed to configure the applications with proxy directives. Often that is as easy as an environment variable or a system property, but sometimes it required code changes or was not possible at all. We had some fallback solutions for those cases as well, but not pretty ones.

In OpenShift we would have to use the same pattern as today, but reimplement the proxy solution in our own pods and configure specific egress IPs that could be allowed external access. Unfortunately, we would still have the problem of configuring the applications with proxy directives.
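To illustrate that burden, here is a sketch of the kind of proxy directives each workload would still need, expressed as environment variables on a hypothetical Deployment:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
  namespace: demo
spec:
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: registry.example.com/demo/orders:1.0.0   # hypothetical image
          env:
            - name: HTTPS_PROXY      # honored by many runtimes and HTTP libraries
              value: http://egress-proxy.demo.svc.cluster.local:3128
            - name: NO_PROXY         # keep cluster-internal traffic off the proxy
              value: .svc.cluster.local,.cluster.local
          # Java workloads would instead need -Dhttps.proxyHost/-Dhttps.proxyPort,
          # and some applications cannot be pointed at a proxy at all.
```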

Using Istio, we can force all traffic through the Service Mesh and require all external hosts to be registered in the mesh in order to be reachable. It is easy and built in. The best part, though, is that it is transparent to the workloads. This works because the Istio init container sets up iptables rules that redirect all inbound and outbound pod traffic to the istio-proxy running as a sidecar.
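A sketch of what that registration could look like: with the mesh-wide outbound traffic policy set to REGISTRY_ONLY, only hosts declared in a ServiceEntry are reachable. The host below is just an example:

```
# Declare an approved external destination; with outboundTrafficPolicy
# set to REGISTRY_ONLY, everything not registered like this is blocked.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-payment-api
  namespace: demo
spec:
  hosts:
    - api.payment-provider.example.com
  location: MESH_EXTERNAL              # the destination lives outside the mesh
  resolution: DNS
  ports:
    - number: 443
      name: tls
      protocol: TLS
```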

Observability

In our old solution, we used standard tooling for aggregating logs, combined with metrics, to provide observability. We used a managed service for log aggregation that collected all logs matching a certain path regex from our nodes, and we ran our own Prometheus/Grafana with a giant scraping config to collect metrics. We were doing OK with this setup, although it was a bit brittle in terms of configuration. One thing we repeatedly realized we were missing, but never got around to implementing, was tracing.

In OpenShift, logging and metrics gathering became easy since both are bundled with the platform. Setting up Elasticsearch for logging and collecting metrics with Prometheus/Grafana is done in no time. Before containers, our applications always produced two different log files: an application log in one format and an access log in another. In OpenShift, log collection is done by tailing stdout, meaning that we would need to combine the access and application logs, or ship the access log some other way. Metrics can be scraped with a simple configuration. Although OpenShift gives us some great tooling, we still need to create our own dashboards and so on to quickly get an overview of things. We had lots of dashboards for individual services and applications, but apart from access logs we didn’t have much in the way of communication metrics.
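As an example of that simple configuration: with the Prometheus Operator that OpenShift monitoring builds on, scraping a workload is a matter of a small ServiceMonitor, assuming user workload monitoring is enabled. The service name and port here are hypothetical:

```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: orders
  namespace: demo
spec:
  selector:
    matchLabels:
      app: orders          # matches the Service exposing the metrics port
  endpoints:
    - port: metrics        # named port on the Service
      path: /metrics
      interval: 30s
```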

Istio to the rescue. With Istio we get communication observability of our own, both in terms of where metrics and logs are collected and where they are sent. The Istio proxy sidecar collects a number of metrics and can also produce an access log. Metrics are sent to an Istio-specific Prometheus/Grafana with preconfigured dashboards, although that part is optional. One of the main differences from before is that all metrics and access logs are formatted the same way, regardless of how the underlying applications are built.

Another thing that makes Istio stand out compared to our other alternatives is built-in tracing with Jaeger. A simple Jaeger deployment can be configured, and tracing spans are automatically collected from the mesh. The real benefit of tracing may not come until the workloads are instrumented to propagate the trace ID downstream, but the basic level provided by Istio still tells you a lot.

Last but not least, Istio comes with a really handy UI called Kiali. It uses the capabilities of the service mesh to create a visual representation of how the different workloads connect to each other. Each connection shows basic status and throughput, making it very easy to identify problems. Apart from viewing the mesh in Kiali, it is also possible to use it to configure the mesh, although I prefer to do that through GitOps.
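As an illustration, on newer Istio versions mesh-wide access logging and the tracing sample rate can be switched on declaratively with the Telemetry API; on older releases the same thing is typically done through the mesh-wide configuration instead. The namespace and sample rate below are example values:

```
# Mesh-wide defaults: Envoy access logs on, 10% trace sampling
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system   # the control plane namespace of the mesh
spec:
  accessLogging:
    - providers:
        - name: envoy       # built-in Envoy access log provider
  tracing:
    - randomSamplingPercentage: 10.0
```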

Summary

The transformation from our old environment to OpenShift was in itself a big step. It opened the doors to a lot of new possibilities. When it comes to the application platform, we strive for ease of use for the teams using it. Ideally, neither the team developing the application nor the team managing the platform should need to do a lot of manual work to make new features available. With OpenShift we can use Operators to declaratively configure very complex applications, and we can use ArgoCD to make it even easier with GitOps. But we still need good observability to know when things have broken down or, even better, when they are nearing the point of breaking down. Most of the features that made us interested in Istio would be possible to solve in OpenShift alone, but Istio contains so much out of the box that it would take a good bit of effort to build it ourselves. In the end, we decided to go with Istio.

If you know Istio, you probably know that this is just a very small part of its feature set. There are other highly interesting features in Istio that we might want to try in the future. But that might be the topic of a future post.

Don’t miss the next part where we talk about how to perform a basic Istio install.

Author:
Daniel Oldgren
Solution Architect

Irori

We have a passion for innovation that we achieve through continuously exploring promising integration technologies.