How important is the data you’re losing?

Lewis Moore
Published in Trail Blog
Jul 1, 2016 · 6 min read

Who decides what data is valuable for your business? Product Management? The Sales team? The Directors?

How about the Developers? Surely not… They’re the ones implementing the data model, and the app works fine! It displays all the data fine, so you must be collecting everything you need, right?

It might shock you to learn just how much valuable data you can miss using a traditional relational data model. Technical decisions that seem like a no-brainer when made can have a long-lasting impact. If a user changes their name in your system, do you keep a record of the previous version? If a user removes a couple of items from their basket just before checking out, do you know what those items were? If a Todo is reprioritised and rescheduled, can those changes be reverted? Losing this kind of data can affect your ability to produce reports on user behaviour, tailor your experience to each user, reproduce bugs and much more.

At Trail, we’re just starting to dive into a review of our Domain Model and the way in which we store our data. As part of this, we want to ensure we’re capturing everything. Enter, Event Sourcing.

Events are exciting, right?

“It might shock you to learn just how much valuable data you’re losing”

Trail is a daily checklist to manage operations — users work through a series of Tasks each day. A key part of our system is the management of these Tasks, and reporting on who completed which ones and when. Our data is stored in a relational database, and a Task is modelled like this:

Task {
id: 524,
title: 'Check Fire Extinguishers',
completed_at: null,
completed_by: null,
...
}

When a user completes a Task, we mutate the current state of that task:

Task {
id: 524,
title: 'Check Fire Extinguishers',
completed_at: '2016-02-27 21:42:31 +0000',
completed_by: 'Peter Bishop',
...
}
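In plain Ruby, that mutation looks something like the sketch below (a stand-in Struct rather than our real ActiveRecord model, with the field names from the example above):

```ruby
# A minimal stand-in for the relational Task model -- just the
# fields shown in the example, not our actual model class.
Task = Struct.new(:id, :title, :completed_at, :completed_by)

# Completing a Task mutates current state in place; whatever was in
# completed_at/completed_by before is simply overwritten and lost.
def complete(task, user)
  task.completed_at = Time.now.utc
  task.completed_by = user
end

task = Task.new(524, 'Check Fire Extinguishers', nil, nil)
complete(task, 'Peter Bishop')
```

The overwrite is the crux: once `complete` runs a second time, the first completion is gone.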

Great, looking good so far. We can now display the fact that the Task has been completed in the UI, when it was completed, and who by. If a member of our Business team wants to generate a report on when Tasks are being completed, they can do so. If a customer wants to generate a report on which member of their team is completing the most Tasks, they can do that too.

But wait! Peter made a mistake — he didn’t follow the new guidelines for checking the Fire Extinguishers. The Task will need to be completed again, and this time, William’s going to do it. So he uncompletes (if you’ll excuse the false antonym) the Task:

Task {
id: 524,
title: 'Check Fire Extinguishers',
completed_at: null,
completed_by: null,
...
}

… and then completes it again:

Task {
id: 524,
title: 'Check Fire Extinguishers',
completed_at: '2016-02-27 22:01:47 +0000',
completed_by: 'William Bell',
...
}

Ok, so the Task now reflects the correct current state — it was completed by William just after 10pm. So what’s wrong? Well, let’s go back to those use cases from before. A member of our Business team or one of our customers wants to generate a report on Task completion. This data model will technically still generate those basic reports — but it completely misses the fact that the Task was completed once before! What if they now want a report on how many Tasks are being completed once, then uncompleted and completed again by different users? There’s no way to generate that report using this model; the data is lost.

Obviously there are ways to solve this problem within the relational data model — for example, by maintaining past completed_at and completed_by values within the model. Another option is to enable an audit log on your database models (using something like the PaperTrail gem if you have a Rails app) — which is what we’ve done for now to ensure we don’t lose this important history. However, these options are really just a band-aid. They don’t solve the root cause — and further band-aids will need to be applied for every situation we come across like this. A more permanent solution for all cases would be much better.
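To make the band-aid concrete, here’s a hand-rolled version of the first option — pushing old values onto a history array before each mutation. This is purely illustrative (PaperTrail does something more general, recording versions in a separate table):

```ruby
# Task model with a bolted-on, per-field audit trail. The history
# array is the band-aid: it exists only to recover overwritten values.
Task = Struct.new(:id, :title, :completed_at, :completed_by, :history)

def record_history(task)
  task.history << { completed_at: task.completed_at,
                    completed_by: task.completed_by }
end

def complete(task, user, at)
  record_history(task)
  task.completed_at = at
  task.completed_by = user
end

def uncomplete(task)
  record_history(task)
  task.completed_at = nil
  task.completed_by = nil
end

task = Task.new(524, 'Check Fire Extinguishers', nil, nil, [])
complete(task, 'Peter Bishop', '2016-02-27 21:42:31 +0000')
uncomplete(task)
complete(task, 'William Bell', '2016-02-27 22:01:47 +0000')
# task.history now remembers that Peter completed it first
```

Note how the audit logic has to be threaded through every mutating method — forget one, and that data is lost again. That’s the root cause this approach doesn’t fix.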

Event Sourcing

Event Sourcing is an entirely lossless way of modelling data. Instead of storing and mutating models that represent current state, every action or modification of data is recorded as a single event with a timestamp. By replaying events, ‘current state’ can be deterministically calculated at any point in history.

Let’s take a look at how the Task data from earlier could be modelled using Event Sourcing:

CreateTaskEvent     { id: 524, title: 'Check Fire Extinguishers' }
CompleteTaskEvent   { id: 524, completed_by: 'Peter Bishop' }
UncompleteTaskEvent { id: 524, uncompleted_by: 'William Bell' }
CompleteTaskEvent   { id: 524, completed_by: 'William Bell' }

So we start with a Task as before, but this time rather than creating and storing a Task model, an Event is recorded to signify that a Task has been created. Peter completes the Task, and another Event is recorded. When William later uncompletes the Task and completes it again — you guessed it — two more Events are recorded. No data is ever deleted or modified — new Events are just recorded.
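Deriving current state from those Events is just a left fold over the stream. A minimal Ruby sketch (event names and fields follow the example above; the reduce logic is an assumption about how a real handler might look):

```ruby
# The event stream for Task 524, in the order it was recorded.
events = [
  { type: :create_task,     id: 524, title: 'Check Fire Extinguishers' },
  { type: :complete_task,   id: 524, completed_by: 'Peter Bishop' },
  { type: :uncomplete_task, id: 524, uncompleted_by: 'William Bell' },
  { type: :complete_task,   id: 524, completed_by: 'William Bell' }
]

# Replay: fold each event into the accumulated state. Stopping the
# fold early gives you the state at any point in history.
state = events.reduce({}) do |task, event|
  case event[:type]
  when :create_task
    { id: event[:id], title: event[:title], completed_by: nil }
  when :complete_task
    task.merge(completed_by: event[:completed_by])
  when :uncomplete_task
    task.merge(completed_by: nil)
  end
end
```

After the fold, `state` shows William as the completer — but unlike the mutable model, Peter’s earlier completion is still sitting in the stream, available to any report that wants it.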

This comes with a number of benefits:

  • You can generate Reports on any piece of historical data — nothing’s lost
  • You can generate time-based Reports, such as ‘users who completed all tasks within 5 minutes of logging out’
  • You can step through Events to see what precise state your app was in at any point in time
  • By storing all User actions as Events, you can easily reproduce any exact User scenario that led to a reported bug at a given time

So that all sounds great, doesn’t it? But you may be thinking: how do you query the data? Stepping through thousands (or millions) of events each time you want to check how many Users you have, or get the name of a Task, obviously doesn’t scale.

CQRS and Projections

The CQRS (Command Query Responsibility Segregation) pattern can be applied to help solve this problem. While it’s not a requirement for implementing Event Sourcing, and it has a range of other uses, the pattern fits naturally with event-based programming models. So, if you take Event Sourcing to be the Command part of the pattern, then Projections can be employed as the Query part.

Projections are a rolling snapshot of current state, generating whatever view of the data you need from the Event stream. They simply play back all Events from the beginning and build a picture of state as they go. Once up to date, they can listen for new Events to ensure they stay that way. If there’s ever a problem with the state, it can be re-generated by walking the stream of Events again. To speed up that process, Snapshots can be saved periodically so that the Projection only has to start from the most recent Snapshot to regenerate the current state.
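A toy Projection might look like the sketch below — it folds events into a queryable state, tracks its position in the stream, and can resume from a Snapshot instead of replaying from the beginning. (The class and event names are assumptions carried over from the earlier example; a real projection would persist its state and subscribe to new events.)

```ruby
# A minimal Projection over the Task event stream.
class TaskProjection
  attr_reader :state, :position

  def initialize(state = {}, position = 0)
    @state = state        # task id => current attributes
    @position = position  # index of the next event to apply
  end

  def apply(event)
    case event[:type]
    when :create_task
      @state[event[:id]] = { title: event[:title], completed_by: nil }
    when :complete_task
      @state[event[:id]][:completed_by] = event[:completed_by]
    when :uncomplete_task
      @state[event[:id]][:completed_by] = nil
    end
    @position += 1
  end

  # Rebuild from a Snapshot: start from its state and apply only
  # the events recorded after it, rather than the whole stream.
  def self.from_snapshot(snapshot, stream)
    projection = new(snapshot[:state], snapshot[:position])
    stream.drop(snapshot[:position]).each { |e| projection.apply(e) }
    projection
  end
end
```

Querying is then just reading `projection.state` — the latest Snapshot plus any events applied since.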

To query your data, simply query your Projection’s current state! This will be the latest Snapshot plus any events since then.

Even better — you can have as many Projections as you need feeding off a single Event stream. Designing a simple new app that only requires a small portion of your data? Write a new Projection to only save that data from the Event stream, and use the state generated by it for your new app.

So, how important is the data you’re not capturing? We hope this prompts some thought around the potential business value that a pattern like Event Sourcing could provide. The review of Domain and Data models we’re currently working through was inspired by Domain-driven Design, Greg Young’s excellent talk on CQRS and Event Sourcing and a range of other resources. We’re looking forward to exploring Event Sourcing and the rest of these topics further, and applying them to the work we have on the roadmap here at Trail. We’d love to hear your thoughts or experiences with any of them!
