How we built a better, unified logging platform

Emily McMahon
Remitly
Feb 22, 2019


Before Unified Event Logging (UEL), we had multiple methods for recording events in our products, and the data was stored in just as many locations. Most commonly, we used our production RDBMS for event logging, so the production database naturally took a performance hit for recording all this event data.

For most event logging, we don’t need the value adds of an RDBMS: transactions, mutating rows, and low latency. Instead, we want higher throughput. Moreover, as we split our products into individual services (each potentially with its own storage system), analytics stops scaling because the data becomes impossible to discover.

Without UEL, every event-producing service would need its own integration with every consumer of its data, an O(n²) explosion of point-to-point integrations. UEL avoids that explosion: each service integrates once, with UEL, which handles discovery and fan-out, giving us an O(n) system.

For every service, there are cross-cutting concerns that are interested in the same stream of data — an event log of “what is happening” within that service. In particular, the service publishing the data

  • doesn’t need to block on delivery and consumption of that data
  • shouldn’t have to make code changes to add more consumers for the same data stream
  • shouldn’t take the performance hit from handling the fan-out itself

Thus, UEL meets the desirable points stated above: it publishes streams of events in a central place for discoverability while reducing the amount of integration work required of dev teams. Concerns that can be handled asynchronously are handled that way through very low-overhead SDKs, mitigating the performance cost of additional service integrations. We also shift the burden onto an external system, reducing the load on our production OLTP system.

Architecture

At a high level, all our hosts send data to a centralized ingestion point. This collector validates the events against their declared schemas and then fans them out to downstream consumers. A periodic job uses those same schemas to create Redshift tables and does an incremental load.
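
To make the table-creation step concrete, here’s a rough sketch of how a periodic job might turn a flat Avro schema into a Redshift table and load a new batch of events. The type mapping, S3 layout, schema and bucket names, and IAM role are illustrative assumptions, not UEL’s actual implementation.

```python
# Hypothetical sketch: derive a Redshift table from a flat Avro record schema
# and load new data incrementally. Names and layout are assumptions.
import json

AVRO_TO_REDSHIFT = {
    "string": "VARCHAR(MAX)",
    "int": "INTEGER",
    "long": "BIGINT",
    "float": "REAL",
    "double": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
}

def create_table_ddl(avro_schema: dict) -> str:
    """Build CREATE TABLE IF NOT EXISTS DDL from a flat Avro record schema."""
    columns = []
    for field in avro_schema["fields"]:
        ftype = field["type"]
        # Nullable fields are typically unions like ["null", "string"].
        if isinstance(ftype, list):
            ftype = next(t for t in ftype if t != "null")
        columns.append(f'{field["name"]} {AVRO_TO_REDSHIFT[ftype]}')
    return (
        f'CREATE TABLE IF NOT EXISTS uel.{avro_schema["name"]} '
        f'({", ".join(columns)});'
    )

def incremental_copy_sql(schema_name: str, batch_prefix: str) -> str:
    """COPY only the latest batch of validated events from S3 into Redshift."""
    return (
        f"COPY uel.{schema_name} "
        f"FROM 's3://uel-validated-events/{batch_prefix}/' "
        f"IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader' "
        f"FORMAT AS AVRO 'auto';"
    )

if __name__ == "__main__":
    schema = json.loads(open("checkout_completed.avsc").read())
    print(create_table_ddl(schema))
    print(incremental_copy_sql(schema["name"], "2019/02/22"))
```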

The DLQ is actually several SQS queues. One is a true dead-letter queue for events that fail very early in validation (typically events we cannot even parse due to clobbering). Additionally, we have an invalid queue per team for events that are parseable but don’t validate against their declared schema. Keeping one queue per team makes it easy for our users to fix and re-drive their failed events.
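
Here’s a simplified sketch of that routing logic, assuming fastavro for validation and boto3 for SQS; the queue URLs, registry shape, and helper names are made up for illustration and aren’t the collector’s real internals.

```python
# Simplified collector-side routing: parse failures go to the true DLQ,
# schema-validation failures go to the owning team's invalid queue.
import json

import boto3
from fastavro.validation import validate

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/uel-dlq"                 # assumed
INVALID_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/uel-invalid-{team}"  # assumed

def fan_out(event: dict) -> None:
    """Placeholder for the real fan-out to downstream consumers."""

def route_event(raw_message: str, registry: dict) -> None:
    """registry maps schema name -> {"team": ..., "avro": <flat Avro schema>}."""
    try:
        event = json.loads(raw_message)
    except ValueError:
        # Unparseable (e.g. clobbered) messages go straight to the true DLQ.
        sqs.send_message(QueueUrl=DLQ_URL, MessageBody=raw_message)
        return

    entry = registry[event["schema_name"]]
    if not validate(event["payload"], entry["avro"], raise_errors=False):
        # Parseable but invalid: park it on the owning team's queue for re-drive.
        sqs.send_message(
            QueueUrl=INVALID_URL.format(team=entry["team"]),
            MessageBody=raw_message,
        )
        return

    fan_out(event)
```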

On the hosts, every UEL SDK logs to rsyslog. These messages are periodically gathered by heka and shipped off to the collector.
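A minimal sketch of what a fire-and-forget SDK emit might look like, using Python’s standard syslog handler. The event envelope, logger name, and socket path are assumptions rather than the actual SDK; the point is that the call writes locally and returns without blocking on network delivery.

```python
# Minimal sketch of a fire-and-forget UEL-style SDK emit: serialize the event
# and hand it to the local rsyslog daemon, then return immediately. Shipping
# to the collector happens out-of-process.
import json
import logging
import logging.handlers
import time

_logger = logging.getLogger("uel")
_logger.setLevel(logging.INFO)
_logger.addHandler(
    logging.handlers.SysLogHandler(address="/dev/log")  # local rsyslog socket
)

def emit(schema_name: str, payload: dict) -> None:
    """Log one event locally; never blocks on delivery or consumption."""
    _logger.info(json.dumps({
        "schema_name": schema_name,
        "emitted_at": time.time(),
        "payload": payload,
    }))

emit("checkout_completed", {"team": "payments", "amount_cents": 1999})
```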

For client-side events we obviously cannot use heka & rsyslog. Instead, our apps periodically send batches of events to an ingestion point, which acts as a middleman where we can strip out any malicious events before forwarding the rest to the collector.
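
For illustration, a toy version of that middleman might look something like the following. The framework choice, screening rules, route, and collector URL are all assumptions; the real ingestion point’s checks are more involved.

```python
# Hypothetical client-event ingestion endpoint: accept a batch, drop anything
# that looks malformed or suspicious, and forward the rest to the collector.
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
COLLECTOR_URL = "https://collector.internal.example.com/events"  # assumed

def is_acceptable(event) -> bool:
    """Very rough screening; a stand-in for the real checks."""
    return (
        isinstance(event, dict)
        and "schema_name" in event
        and len(str(event)) < 64 * 1024  # reject absurdly large payloads
    )

@app.route("/v1/events", methods=["POST"])
def ingest():
    batch = request.get_json(force=True) or []
    accepted = [e for e in batch if is_acceptable(e)]
    if accepted:
        requests.post(COLLECTOR_URL, json=accepted, timeout=5)
    return jsonify({"accepted": len(accepted), "rejected": len(batch) - len(accepted)})
```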

The Events

We knew we wanted the data to end up in our analytics warehouse. To facilitate this, we require a schema for every event. That schema is used to

  • validate the events
  • create tables in our warehouse that we periodically load the events into

Every UEL event includes its schema name. The schemas themselves are described using Avro with one major modification: we introduced a composition feature to encourage schema re-use.

Schema composition posed a challenge as it’s much easier to develop tooling for flat data representations. So we took the extra step of defining our schemas in one language, and then transpiling them to a flat Avro Record.

We made small modifications for usability: allowing CSON and JSON (although the CSON is transpiled into JSON) and enforcing the inclusion of a team field (so we can track down the owning team on validation failures).
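
To make composition concrete, here’s a toy sketch of transpiling a composed schema into a flat Avro record. The compose directive, the shared sub-schema, and the flattening rules are invented for illustration; they are not UEL’s real schema language.

```python
# Toy transpiler: inline the fields of composed sub-schemas to produce one
# flat Avro record. The "compose" directive and these schemas are invented.
SHARED_SCHEMAS = {
    "device_info": {
        "fields": [
            {"name": "device_id", "type": "string"},
            {"name": "os_version", "type": "string"},
        ]
    }
}

def flatten(composed: dict) -> dict:
    """Expand 'compose' references into a flat Avro record schema."""
    fields = []
    for name in composed.get("compose", []):
        fields.extend(SHARED_SCHEMAS[name]["fields"])
    fields.extend(composed["fields"])
    return {"type": "record", "name": composed["name"], "fields": fields}

checkout_completed = {
    "name": "checkout_completed",
    "compose": ["device_info"],
    "fields": [
        {"name": "team", "type": "string"},  # required so we can find the owner
        {"name": "amount_cents", "type": "long"},
    ],
}

print(flatten(checkout_completed))
```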

Data Loss

One of the biggest concerns moving from a transactional database to an event stream was data loss.

To measure this, every Remitly host regularly emits “heartbeat” events. These heartbeat events are emitted with a nonce that allows us to infer our data loss by inspecting for missing events.

Note that this gives us an estimate rather than an exact figure: if we’re missing events after the biggest nonce, we’ll never know. That situation is indistinguishable from the heartbeater dying or a service restart, so it’s a bit tricky to detect. Until we have reason to suspect that this is a problem, we’re fine with this estimate.
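
As a rough illustration of the measurement, here’s a sketch that estimates loss from gaps in each host’s heartbeat sequence. It assumes the nonce is a per-host monotonically increasing counter, which is one plausible design rather than a description of our exact implementation.

```python
# Estimate data loss from heartbeat events, assuming each host emits a
# monotonically increasing nonce. Anything missing between the smallest and
# largest observed nonce counts as lost; losses after the largest nonce are
# invisible, which is the blind spot described above.
from collections import defaultdict

def loss_rate(heartbeats: list[tuple[str, int]]) -> float:
    """heartbeats: (host_id, nonce) pairs observed in the warehouse."""
    seen = defaultdict(set)
    for host, nonce in heartbeats:
        seen[host].add(nonce)

    expected = lost = 0
    for nonces in seen.values():
        span = max(nonces) - min(nonces) + 1
        expected += span
        lost += span - len(nonces)
    return lost / expected if expected else 0.0

# e.g. host "web-1" never delivered nonce 3:
print(loss_rate([("web-1", 1), ("web-1", 2), ("web-1", 4), ("web-2", 1)]))
```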

Our loss rate of 0.0004% is well within our tolerance for most use cases. We’ve devised other methods for systems that cannot tolerate a lost event, such as financial event logging.

Challenges

UEL requires very little operational support, especially relative to its business impact. Our biggest challenges stem from the fact that it’s so stable: on the rare occasions when something does break, we have to delve into codebases we haven’t touched in months or years.
