Making a resilient service using Akka.NET Remote + Kinesis

A success (fun) story of Akka.Remote and Kinesis

Something went wrong in production the other day :-(, and the system came back by itself.

Sample Code First: https://github.com/zdjohn/akka-remote-kcl-exapmle

A very brief intro to Kinesis

A Kinesis stream is used to collect and process large streams of data records in real time.

Streaming data is data that is generated continuously by thousands of data sources, which typically send their records simultaneously and in small sizes (on the order of kilobytes).

Kinesis has some key characteristics:

  • It can handle a HUGE throughput
  • It can be consumed by multiple applications simultaneously
  • It can retain your events for up to 7 days, and they can be replayed from a given checkpoint

How KCL and .NET work together:

Based on the KCL .NET client sample from https://github.com/awslabs/amazon-kinesis-client-net

The KCL MultiLangDaemon talks to the record processor over STDIN and STDOUT, which is what makes KCL language agnostic.

This is not intuitive at first sight. The C# processor is launched by the worker inside the Java process, as a child process. If the Java process hangs or exits, or the worker reaches the end of the current shard, it kills the child processor with it.

The reverse also holds. Your ingestion service likely depends on other micro-services and on the data stores it connects to. When it hits internal errors, it throws exceptions (and once the child processor shuts down, KCL shuts down after it). Bad (malformed) event records can interrupt good ones in unexpected ways.

When your service needs to talk to multiple micro-services to reconcile records (say, hundreds of millions of events a day), scenarios like the above become pain points.
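To make the STDIN/STDOUT hand-off concrete, here is a minimal sketch of a child processor loop. It is written in Python only to keep it short and library-free; it is not the C# sample's code. The message shapes (an "action" field, a "status" reply with "responseFor") are loosely modelled on the multilang protocol and should be treated as assumptions, and `handle_message`, `run` and `on_records` are hypothetical names.

```python
import io
import json


def handle_message(message, on_records):
    """Dispatch one daemon message and build the status reply."""
    action = message["action"]
    if action == "processRecords":
        on_records(message["records"])
    # "initialize" and "shutdown" need no extra work in this sketch.
    return {"action": "status", "responseFor": action}


def run(stdin, stdout, on_records):
    """Read one JSON message per line until shutdown or end of input."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        reply = handle_message(message, on_records)
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # the parent is waiting on the pipe, so do not buffer
        if message["action"] == "shutdown":
            break
```

In a real child process the loop would be driven by `run(sys.stdin, sys.stdout, ...)`. Because the parent owns both pipes, the child lives and dies with it, which is the coupling described above.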

Here is what I’d like to have:

1. I want to process as many records as possible in parallel (as close to real time as possible).

2. Once the ingestion service receives records, update the Kinesis checkpoint accordingly, so progress is tracked.

3. When the Java KCL shuts down for whatever reason, the ingestion service should not be affected (records already received or in flight continue to be processed without interruption).

4. When the ingestion service is not healthy, shut down KCL completely. No more records are pulled down from the Kinesis stream until the health check turns healthy again.

5. Once the ingestion service is back, restart KCL and resume processing records from the last checkpoint.

Splitting the responsibilities with Akka.NET Remote

So here is what I ended up doing:

1. A dedicated processor pulls the records down from the KCL MultiLangDaemon, then sends them to the ingestion service (acting like a proxy). This way KCL only affects the C# processor and nothing else.

2. The ingestion service is a constantly running Windows service. It contains all the business logic for how records should be handled.

3. A health checker actor constantly checks the health status of the ingestion service. Incoming messages are stashed if the system is unhealthy.

4. The processor receives the ingestion service's health status every time it sends a list of records. So:
If the returned health status is false, the processor kills itself.
If the ack message times out, the processor kills itself.
Once the processor is killed, the MultiLangDaemon shuts down as a consequence.

5. When the ingestion service returns to a healthy status, it invokes the KCL MultiLangDaemon, which in turn restarts the processor.

From the KCL client's perspective, this is a fresh start. KCL workers resume from the last checkpoint.
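The processor's rule in step 4 (checkpoint on a healthy ack, exit on an unhealthy ack or a timeout) can be sketched as below. This is an illustration in Python under stated assumptions, not the sample repository's code: `send`, `checkpoint`, `Ack` and `ProcessorShutdown` are hypothetical names, `send` is assumed to raise `TimeoutError` when no ack arrives in time, and the 5 second timeout is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Ack:
    healthy: bool  # health status reported by the ingestion service


class ProcessorShutdown(Exception):
    """Signals that the processor should exit, taking the KCL daemon with it."""


def forward_batch(records, send, checkpoint, timeout_seconds=5.0):
    """Forward one batch and checkpoint only after a healthy ack."""
    try:
        ack = send(records, timeout_seconds)
    except TimeoutError:
        raise ProcessorShutdown("ack timed out") from None
    if not ack.healthy:
        raise ProcessorShutdown("ingestion service reported unhealthy")
    if records:
        # Progress is recorded only for batches the service accepted.
        checkpoint(records[-1]["sequenceNumber"])
```

Since the checkpoint never moves past a batch that was not acknowledged as healthy, a restarted KCL worker picks up from the last accepted batch.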

How to make your service self-aware

One great feature of Akka is that actors can have multiple states.

As mentioned above, inside the ingestion actor system there are 2 types of actors:

Data processing actors, which handle all the complex business logic. They have two states: healthy and pending.

Every time a batch of records is received, an ack message is sent back to the C# processor (the sender). The processor calls the KCL checkpoint if the ack status is healthy.

If the actors are in the pending state, incoming messages are stashed. The processor receives an unhealthy reply, which triggers the remote processor to shut down. So you don’t receive further Kinesis records while the system is unhealthy.
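The healthy/pending behaviour can be modelled without any actor library, as in the Python sketch below. It imitates the idea of switching behaviour and stashing; it does not use the Akka.NET API. `IngestionActor`, `on_batch` and `on_health_changed` are hypothetical names, and replaying the stash on recovery is an assumption (the text above only says messages are stashed).

```python
from collections import deque


class IngestionActor:
    """Toy model of a data processing actor with healthy and pending states."""

    def __init__(self, handle_batch):
        self._handle_batch = handle_batch
        self._stash = deque()
        self.healthy = True

    def on_health_changed(self, healthy):
        """React to a health status event from the health checker."""
        self.healthy = healthy
        if healthy:
            # Assumption: stashed batches are replayed in arrival order.
            while self._stash:
                self._handle_batch(self._stash.popleft())

    def on_batch(self, batch):
        """Process or stash a batch, then ack with the current health."""
        if self.healthy:
            self._handle_batch(batch)
        else:
            self._stash.append(batch)
        return {"healthy": self.healthy}
```

The ack is sent in both states. In the pending state it carries `healthy: False`, which is what lets the remote processor decide to shut itself down.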

Then there is the health checker actor, a scheduler. It checks the system at a set interval (e.g. pinging 3rd-party APIs, data store connections, etc.). Of course, you can define your own rules for what “healthy” means, e.g. idle time.

The health checker has 3 states:

“Green”: the system is healthy, and other actors are notified that it is OK to process records.

“Red”: the system is unhealthy, and other actors are notified to stash new incoming messages.

“Yellow”: when the system starts or recovers from the “Red” state, the health checker turns itself “Yellow” first. While in “Yellow”, it resets the KCL processor, then transitions into the “Green” state. The ingestion service is now ready for records to stream down from KCL.
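The three states form a small state machine, sketched here in Python under stated assumptions. `HealthChecker`, `tick`, `probe`, `restart_kcl` and `publish` are hypothetical names; `tick` stands in for one scheduled check. Two transitions are my own guesses rather than something the text specifies: a failed probe while “Yellow” goes to “Red”, and “Red” to “Green” takes two ticks (one to reach “Yellow”, one to restart KCL).

```python
from enum import Enum


class Health(Enum):
    YELLOW = "yellow"
    GREEN = "green"
    RED = "red"


class HealthChecker:
    """Toy model of the health checker; starts in YELLOW."""

    def __init__(self, probe, restart_kcl, publish):
        self._probe = probe              # returns True when dependencies look fine
        self._restart_kcl = restart_kcl  # launches the KCL daemon and processor
        self._publish = publish          # broadcasts the health status to actors
        self.state = Health.YELLOW

    def tick(self):
        """Run one scheduled check and apply the resulting transition."""
        healthy = self._probe()
        if self.state is Health.RED:
            if healthy:
                self.state = Health.YELLOW
        elif not healthy:
            # GREEN or YELLOW with a failed probe: tell actors to stash.
            self.state = Health.RED
            self._publish(False)
        elif self.state is Health.YELLOW:
            self._restart_kcl()
            self.state = Health.GREEN
            self._publish(True)
        return self.state
```

Keeping the KCL restart inside the “Yellow” to “Green” transition means the daemon is only ever started by a component that has just confirmed the system is healthy.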

So the sequence goes as follows:

  1. The ingestion service starts with the health checker in the “Yellow” state.
  2. If the system is healthy, the health checker invokes the external KCL MultiLangDaemon, which invokes the C# processor. After that, the health checker turns itself “Green”.
  3. The processor sends Kinesis records to the ingestion service.
  4. The ingestion service receives the records and replies with a healthy status.
  5. The processor calls the KCL checkpoint after the healthy response is received.
  6. From this point, the healthy loop runs between (#3) and (#5).
  7. When a service failure happens, the health checker notices that the system is unhealthy. It turns itself “Red” and emits the unhealthy status via the EventStream.
  8. Actors listen for health status events and switch to the pending state once an unhealthy status is received.
  9. The C# processor continues sending records to the ingestion service.
  10. This time an unhealthy status is returned to the processor. The processor exits, which causes the KCL MultiLangDaemon to shut down. No more records are sent to the ingestion service from this point on.
  11. The ingestion service stays in “Red” while the health checker continues checking the health status.
  12. Finally, the system recovers and the health checker turns itself “Yellow”.

Now the ingestion service is in the same state as in #1, and so the cycle begins again.

KCL continues from the last checkpoint and starts pulling records down via the processor. Delayed records are sent to the ingestion service, which eventually catches up with the real-time stream.

Other Benefits

It is worth pointing out that this “Lizard’s Tail” pattern is NOT limited to Kinesis.

When working with distributed systems and micro-services architectures, one thing that often happens to DevOps is cascading alerts. A single service failure can trigger a chain reaction and cause multiple services to scream at the same time. This not only raises the stress level but, making things worse, adds a lot of noise too.

Self-aware services not only make for a resilient system, but also make our lives easier in the maze of the micro-services world.

By intentionally designing a fragile component into your application architecture and making it fail fast, you protect the system as a whole.

Sample Code: https://github.com/zdjohn/akka-remote-kcl-exapmle

For a NuGet package for KCL, see: https://github.com/awslabs/amazon-kinesis-client-net/issues/14