Our journey with Apache Kafka — from initial bumps and bruises to the backbone of our event-driven architecture

Jordan Violet
SailPoint Engineering Blog
11 min read · Oct 1, 2020

Authors: Varun Kashyap and Sam Bryan

You live in a world where SaaS has successfully been deployed to the cloud. You aren’t racing to get there anymore and you aren’t slowly breaking apart your monolith applications. You probably have a few dozen continuously deploying microservices with their own databases. Congratulations! So why won’t your team lead stop talking about messaging queues and processing events instead of requests?

If you’re anything like me, you didn’t really know what this meant the first (or 5th) time you heard it. What is an event? Isn’t the client hitting an endpoint an event? If so, what isn’t an event? And if mysterious “events” are driving our system instead of the customer, then why would anyone want to use our product?

Events seem like the kind of thing that should already be built into our vast array of microservices and, the good news is, they probably are to some extent. If you’re aggregating logs in Kibana or auditing user actions then you are already processing events. The question you should be asking yourself isn’t “How can I make some events so they can drive my architecture?” but rather “Can I add some value by doing more than just saving my events?”.

WHY KAFKA?

This question, along with mounting frustrations with the Redis™* software message handler, led us to Apache Kafka® software. If you don’t already know about Kafka, you should probably study up for two reasons: first, you will probably be asked about it in your next job interview, and second, it’s pretty awesome. This article won’t dive into the depths of how it works, but it’s essentially a “pub/sub” messaging queue. Instead of sending messages to every consumer, you simply post (publish) your message under a certain topic and any interested parties listening (subscribing) to that topic can hear it. It also has some nice features that set it apart:

  • Can handle large amounts of data in the messages
  • Able to retain the messages for a long period of time
  • Data partitioning and replication among servers
  • Supports parallelism
  • Built-in retries
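To make the pub/sub model concrete, here is a toy in-memory sketch of the idea: it is not real Kafka (no broker, no partitions, no persistence), just the publish/subscribe contract in a few lines.

```python
# A toy in-memory illustration of pub/sub: every subscriber to a topic
# receives every message published to that topic. Not real Kafka.
from collections import defaultdict

class ToyBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Unlike a point-to-point queue, every subscriber sees the message.
        for callback in self.subscribers[topic]:
            callback(message)

broker = ToyBroker()
received = []
broker.subscribe("identity-events", received.append)
broker.subscribe("identity-events", lambda m: received.append(m.upper()))
broker.publish("identity-events", "identity-created")
# Both subscribers saw the single published message.
```

Real Kafka adds durability, partitioning and consumer groups on top of this contract, which is exactly what the feature list above is about.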

It’s not the perfect solution for all the messaging problems in the world but you might ask yourself…

STEP ONE — SHOULD I USE KAFKA TO REPLACE MY CURRENT MESSAGING SYSTEM?

Like I mentioned before, we were using the Redis software message handler but we were pushing it a little too hard and it had a tendency to crash due to the size of messages we were asking it to handle.

Our Certifications team then implemented Kafka. Conveniently, we were trying to solve an engineering problem that required a good messaging queue. Generating a Certification Campaign requires creating objects for every manager in the company to review, with objects for every employee they are reviewing and every piece of access for each employee. If that sounds like a headache… you have no idea.

Making all these objects could easily tie up a thread for hours, even days for our largest customers. The solution was to use Kafka to parallelize the creation process of these objects. Essentially, when a REST call was made to kick off the creation, our Certifications microservice would shotgun 1000 messages to Kafka. Each one contained the information needed to create an (ideally) small number of objects. Once all the messages had been consumed (a problem worth explaining in another article), a finalization process was executed to wrap things up and the objects became available to the user. Simple enough, right? Yes and no.
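The fan-out step above can be sketched as a simple chunking function. The names (`build_messages`, `CHUNK_SIZE`) and the chunk size are ours for illustration, not SailPoint's production code:

```python
# Sketch of the fan-out: instead of building every review object on one
# thread, split the campaign's work into many small message payloads.
CHUNK_SIZE = 50  # target objects per message; assumed value, tune per workload

def build_messages(work_items, chunk_size=CHUNK_SIZE):
    """Partition a campaign's work items into independent message payloads."""
    return [
        {"campaign_id": "c-1", "items": work_items[i:i + chunk_size]}
        for i in range(0, len(work_items), chunk_size)
    ]

# 10,000 manager/employee/access work items become 200 independent messages,
# each small enough to finish well within a consumer's processing timeout.
messages = build_messages(list(range(10_000)))
```

The point of keeping each payload small is in the lessons below: a message that unexpectedly expands into far more objects than planned is exactly the one that blows past the processing timeout.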

For certification campaigns that were small enough, the process worked like a charm and cut our processing time by a factor of 10. But shoving that many messages into the queue all at once had serious implications. What’s worse, there were times when a single message was actually responsible for creating more objects than we anticipated. And thus, the following lessons were learned the hard way:

  • Kafka has a default timeout on processing a message. If you can’t handle it fast enough, you shouldn’t handle it at all (or increase the timeout).
  • Kafka can dish out messages fast; sometimes too fast. Set up too many consumers all hitting the same database and you’re gonna have a bad time.
  • Throwing hundreds or thousands of messages on the queue is great if all you want to do is consume them (and do nothing else). If you want to handle other messages too, you need to think about fairness algorithms.
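One simple way to think about the fairness point above: if each tenant's backlog is kept separate, a round-robin interleave keeps one huge campaign from starving everyone else. This is a sketch of the idea, not our production algorithm:

```python
# Round-robin merge of per-tenant backlogs, so a tenant with one message
# isn't stuck behind a tenant with thousands. Illustrative only.
from itertools import chain, zip_longest

def fair_interleave(*backlogs):
    """Interleave backlogs one item at a time; short backlogs drop out early."""
    _SENTINEL = object()
    merged = chain.from_iterable(zip_longest(*backlogs, fillvalue=_SENTINEL))
    return [m for m in merged if m is not _SENTINEL]

big = [("tenant-a", i) for i in range(4)]
small = [("tenant-b", 0)]
order = fair_interleave(big, small)
# tenant-b's single message is processed second, not fifth.
```

In practice the same effect can also be achieved at the Kafka level by partitioning work so that no one producer can monopolize all the consumers.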

The last point was an ongoing battle for some time. Not only would other Kafka messages have to “wait in line” to get processed, but we only had one microservice dedicated to all things Certifications. This meant that if all threads were busy churning through Kafka messages, a REST call could easily be ignored until it timed out. What’s worse is that in the microservice world, one instance can be serving multiple customers, meaning if one customer in the region is choking the service, other customers can suffer.

Naturally, every technical problem has a technical solution and the kinks were worked out. But you, all-knowing smarter-than-us reader, might be thinking “Is Kafka really built for that kind of use case? Couldn’t you get the same result by building a queue into your service?” Maybe. But we did it and it works great now so don’t be snarky. The first outing is never easy but with experience comes knowledge and opportunity.

STEP TWO — SYNCHRONIZING DATA WITH ASYNCHRONOUS MESSAGING

If you’ve used our flagship product, IdentityNow, you will have used the search bar.

The search bar allows admins to search for anything identity-related in their organization — employees, access, roles, etc. Once a query is run and identities/access are returned, you can use the results to do other things in the app like create a Certification campaign or a SOD (Separation of Duties) policy.

As you might guess, our Search capability leverages Elasticsearch. There’s a good chance that you use Elasticsearch at your job to aggregate logs to Kibana and there are a lot of good resources out there on how to use Kafka for this as well.

The challenge with using Elasticsearch to surface real-time data to users is that we are essentially duplicating our data in two places. As software engineers, we are relentlessly trained not to duplicate our data. It almost always turns into a huge headache and another drop in the tech-debt bucket. But as software engineers, we also often forget that training for complicated technical reasons like “Yeah but… what if I want to search for data REALLY FAST?”

So, with our priorities in order, enter our new friend Kafka.

Let’s first address the semantics. The building blocks in Kafka are messages and topics. To maintain parity with Redis and the code that understands Redis, we had to establish a concept of events in Kafka. So how do you take, say, a message on a Kafka topic and declare that it represents such-and-such event? Well, we created our own paradigm by establishing a layer between Kafka and our services: our very own framework, (drumroll!) IRIS. Built on top of Kafka, IRIS adds the wisdom of events to our Kafka implementation, as the name suggests, and allows a seamless transition from Redis to Kafka in terms of business semantics. I can create an event, and IRIS will apply a schema to it and publish it to the topic. What’s more, this abstraction is future-proof: if we adopt a smarter event stream in the future (cue the future with flying cars), IRIS will be able to adapt to it, making our transition simpler.

So we’ve figured out why we needed Kafka, and how to make it work with our existing processes. What’s next?

First, we find all the places in our code that can create, update or delete an identity. Next, we simply post (publish) to Kafka any time we see one of these events and let Search listen (subscribe). Practically speaking, this means that Search is up to date with our identity database typically in less than a minute, and within five minutes at most. This gap between data being reflected in our database versus in Search is the only downside of the implementation. It’s small enough to almost never bite us in the butt but large enough to spur lively “source of truth” debates in the office. But yay! We are finally using events! Are they “driving” our architecture? Sure, it’s a step in the right direction. A very large number of events are being broadcast and consumed in a scalable, asynchronous manner that doesn’t hang up our REST endpoints.
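The pattern above can be sketched in a few lines: every code path that mutates an identity also publishes an event, and Search consumes that event to stay in sync. Here `publish()` just appends to a list; in production it would be a Kafka producer, and the function and event names are ours for illustration:

```python
# Sketch of "write to the source of truth, then broadcast the change".
events = []

def publish(topic, event):
    # Stand-in for a Kafka producer send; illustrative only.
    events.append((topic, event))

def update_identity(identity_id, changes, db):
    # 1. Apply the change to the identity database (the source of truth).
    db[identity_id] = {**db.get(identity_id, {}), **changes}
    # 2. Publish the change so any subscriber (Search, SOD, ...) can react.
    publish("identity-events",
            {"type": "IdentityChanged", "id": identity_id, "changes": changes})

db = {}
update_identity("id-42", {"department": "Finance"}, db)
```

Because the event goes to a topic rather than a specific service, adding a new consumer later (as SOD does below) requires no change to this producing code.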

STEP THREE — SO YOU’RE SAYING MY AWESOME REST MICRO-SERVICES SHOULD BE MAKING FEWER REST CALLS?

Even in the most event-driven architecture, REST calls will never become obsolete. The goal is simply to allow the messaging queue to manage events in such a way that they are consumed in a timely, not debilitating, manner. It allows engineers to take a hard look at the calls that are being made and ask if they could possibly be made asynchronous. Unless the user is specifically waiting for a request to finish, there’s a good chance it can be thrown on the queue instead of making a synchronous call.

An example of Kafka evolving to its final form is how we use it for SOD (Separation of Duties) policies. A quick rundown of SOD — admins can define policies like “No employee should have access to both Accounts Payable and Accounts Receivable”. Then the system flags any such “violations” and surfaces them.

The same paradigm we constructed for Search can be used to feed data into the SOD engine. Any time an identity’s access is changed, a check is run to see if any SOD violations have been found. In fact, the groundwork had already been laid with identity create, change and delete events being published for Search. SOD simply registers as a consumer to those messages and bam! Job done!

But how can this be taken a step further? Events can be more than just “data changed in this other part of the app”. An event can also be published for Access Requests — intercepting the violating access before it’s provisioned. An event can also represent a moment in time, like the end of the quarter. This allows the system to carry out actions like creating a recurring Certification Campaign containing all the current policy violations in the system. What’s more, the SOD engine can publish events to its own topic letting the world know that a policy violation has been found. This makes it easy for the AI/ML engine to subscribe and track risky behavior. Once you start hammering with events, everything looks like a nail.

STEP 4 — WITH GREAT POWER COMES GREAT RESPONSIBILITY!

Now that we have covered the power of Kafka and the ease of scaling and throughput it provides, let’s talk about the responsibility of configuring it right.

Kafka has been designed to be highly configurable, which allows it to suit a wide variety of use cases. At the same time, getting the nuances of configuration right can make your application perform better, while improper configuration can make it sluggish. You need to get the balance right. More often than not, the default configuration will suffice. When it doesn’t, answering a few simple questions can go a long way in helping you find the right fit.

How many partitions does a topic need? The number of partitions is the key to parallelization in Kafka. If you create too many partitions, your consumers may soon get overloaded. If you have too few partitions, you’ll have consumers chilling in Hawaii on vacation with a drink with a little umbrella in it! Ideally you want the number of partitions to equal the number of consumers in a consumer group. Within a consumer group, consumers split the partitions among themselves. Hence, adding more consumers to a group can enhance performance, while adding more consumer groups does not affect it.

So how many partitions do I really need? It depends on your throughput expectations and the speed of reads from a partition. This can be evaluated as # Partitions = Desired Throughput / Partition Speed.
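The formula above, worked through with assumed numbers: if we need 300 MB/s of aggregate consumer throughput and a single consumer can drain a partition at 25 MB/s, we need at least 12 partitions. The figures are illustrative, not measurements from our clusters:

```python
# Partitions = ceil(desired throughput / per-partition consumer speed).
import math

def partitions_needed(desired_throughput_mb_s, partition_speed_mb_s):
    """Minimum partition count to sustain the desired aggregate throughput."""
    return math.ceil(desired_throughput_mb_s / partition_speed_mb_s)

print(partitions_needed(300, 25))  # → 12
```

Rounding up matters: with 11 partitions, the twelfth "slot" of demand has nowhere to go, since one partition is consumed by at most one consumer in a group.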

Each partition maps to one consumer in a consumer group. So what happens if a consumer crashes or becomes unavailable? Kafka keeps track of healthy consumers by asking them to send heartbeats, and tuning how often a consumer sends them (and how long the broker waits before giving up on one) helps performance. Set the timeout too low and Kafka will assume a healthy consumer is down and rebalance its partitions among the remaining consumers. Set it too high and Kafka won’t notice when a consumer actually crashes or becomes unavailable. So how should we set this configuration? A rule of thumb is to ensure max.poll.interval.ms is less than request.timeout.ms (i.e. you want your consumer to respond before the request times out). Another is to put short-lived consumers in their own consumer group, so that when they crash they don’t cause rebalancing for an otherwise healthy group. You also need to be conscious of consumers that maintain state, as rebalancing will force them to drop that state and start afresh from the new data. For such consumers, it can make sense to use higher request timeout and polling interval thresholds.
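The rules of thumb above can be expressed as a small config sketch (keys in kafka-python's snake_case style; the values are illustrative defaults, not recommendations for every workload):

```python
# Consumer timing config sketch; values are assumptions for illustration.
consumer_config = {
    "heartbeat_interval_ms": 3_000,    # how often the consumer sends heartbeats
    "session_timeout_ms": 10_000,      # how long the broker waits before declaring it dead
    "max_poll_interval_ms": 300_000,   # max gap between poll() calls before a rebalance
    "request_timeout_ms": 305_000,     # client-side request timeout
}

# Rule of thumb from above: the consumer should give up on a slow batch
# (max poll interval) before its requests start timing out.
assert consumer_config["max_poll_interval_ms"] < consumer_config["request_timeout_ms"]
# Heartbeats must fire several times within the session timeout, or the
# broker will rebalance a perfectly healthy consumer's partitions away.
assert consumer_config["heartbeat_interval_ms"] < consumer_config["session_timeout_ms"]
```

A stateful consumer, per the advice above, would push `max_poll_interval_ms` (and with it `request_timeout_ms`) higher to trade slower failure detection for fewer state-destroying rebalances.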

There are more such aspects of configuring Kafka, which depend on the use case and the load the application expects. A review of your data durability, throughput and idempotency requirements will answer most of these questions for you. To get the most out of Kafka, it’s important to revisit those areas and evaluate what strategy works and what could go better.

For IdentityNow, we have learned from our mistakes, and answering the questions above has gotten us to a more stable and performant event-driven architecture. When we started off with Kafka, we didn’t account for the impact of partition rebalancing on throughput, or for how often we were actually rebalancing. Taking a closer look at what was causing frequent rebalancing led us to configure our clusters more robustly and with more informed settings. Remember that rebalancing triggers a new leader election, which also adds latency; reducing rebalancing and leader elections by correctly configuring our partitions and consumers allowed us to improve that latency as well.

We also had to take a look at how we were creating topics. Instead of creating topics per tenant, which leads to partitions for a given topic per tenant, we decided to configure fewer topics sized to the infrastructure architecture. This allowed us to address the number of open server files and replication latency, and to better utilize partitions. Mapping the number of partitions to expected throughput allowed us to scale performance horizontally by adding an appropriate number of consumers to a group, which led to more balanced clusters, improved performance and higher throughput.

STEP 5 — PROFIT

Kafka is a gift that keeps on giving. Once you implement it, you’ll be surprised how often you find yourself saying “Is this going to scale well? Oh yeah, Kafka is handling the messages so it’ll be fine.” It allows events to drive workflows instead of making the user worry about it and it can even help you parallelize some big jobs.

It was probably a year between “Let’s try this out” and “Let’s Kafka all the things” but the journey was well worth it. Take a look at your current stack and learn from our experience. A good messaging system is like a new mattress — you’ll be surprised at what you were missing and your whole day will just run a little smoother.

Footnotes

Kafka is a trademark of the Apache Software Foundation, registered in the U.S. https://www.apache.org/foundation/marks/list/#registered

Kibana is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.

Elasticsearch is a trademark of Elasticsearch B.V., registered in the U.S. and in other countries.

*Redis is a trademark of Redis Labs Ltd. Any rights therein are reserved to Redis Labs Ltd. Any use by SailPoint is for referential purposes only and does not indicate any sponsorship, endorsement or affiliation between Redis and SailPoint.
