Self Contained Systems Part 4— Data SLAs

Philip Borlin
Jan 24, 2018

Imagine you are using an app, and in the upper right-hand corner there is a measurement of your usage of the service. Next you click through to the billing tab and see a different number of usage units! This is the nightmare scenario of the data replication world, and it is very confusing for users. I want to address how to reason about data replication and how to avoid this scenario.

Eventual Consistency

Data replication works in the world of eventual consistency. That means that eventually all parts of the system will tend towards being consistent with each other. This concept can be a bit daunting to those coming from a Relational Database Management System (RDBMS) world.

In the RDBMS world you start with a single server which has no replication delay, so there are no problems, right? While it is true the database has no replication delay, I would argue we should take a step back. The UI the user is working in is really the first step. Let’s say the user has created some new data, e.g. added a new address. We can reason about the propagation delay it takes to get the newly created address into our RDBMS. Hopefully we can measure this in low double-digit milliseconds, but that really depends on the user’s network latency (we will assume you have a robust network on your side). So for 10ms or so our database is out of sync with the intent of the user.

But we don’t think about this delay. We usually treat it as instantaneous, where instantaneous really means faster than an acceptable threshold. This is a powerful concept because we can now reason about the definition of instantaneous in terms of a Service Level Agreement (SLA). We can say an update is instantaneous if it takes less than a certain amount of time. In fact we can have cascading definitions:

  • <10ms Instantaneous
  • >10ms, <50ms Fast
  • >50ms, <100ms Slow
  • >100ms Unacceptable

These are example numbers and are very context dependent. If we wanted to reason about webpage speeds instead, we might say:

  • <50ms Instantaneous
  • >50ms, <250ms Fast
  • >250ms, <1000ms Slow
  • >1000ms Unacceptable
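
To make the cascading definitions concrete, here is a minimal TypeScript sketch that classifies a measured latency against a set of thresholds. The names (SlaThresholds, classify, dataWriteSla, webPageSla) are made up for illustration, and the numbers are just the example values from the two lists above. The lists do not say which tier a value exactly on a boundary belongs to, so the sketch assumes it falls into the slower tier.

```typescript
// Hypothetical names; thresholds are the example numbers from the lists above.
type Tier = "Instantaneous" | "Fast" | "Slow" | "Unacceptable";

interface SlaThresholds {
  instantaneousMs: number; // strictly below this: Instantaneous
  fastMs: number;          // strictly below this: Fast
  slowMs: number;          // strictly below this: Slow, otherwise Unacceptable
}

const dataWriteSla: SlaThresholds = { instantaneousMs: 10, fastMs: 50, slowMs: 100 };
const webPageSla: SlaThresholds = { instantaneousMs: 50, fastMs: 250, slowMs: 1000 };

// Assumption: a latency exactly on a boundary lands in the slower tier.
function classify(latencyMs: number, sla: SlaThresholds): Tier {
  if (latencyMs < sla.instantaneousMs) return "Instantaneous";
  if (latencyMs < sla.fastMs) return "Fast";
  if (latencyMs < sla.slowMs) return "Slow";
  return "Unacceptable";
}
```

The same 40ms measurement is "Fast" under the first set of thresholds and "Instantaneous" under the webpage set, which is the whole point: the label depends on the context you are measuring in.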

Data SLAs

If you ask how fast something needs to happen, your users will almost always say it needs to be instantaneous, but I find that is rarely true. For example, I think it is widely believed that when a user signs up for a service, a reasonable expectation is that they can use the service immediately! But that is not really what happens. The user signs up and you send them an email containing a link that they have to click in order to “activate” their account. How long did it take for that email to get there? No less than seconds, and in the throes of SMTP land it can take much longer. As a developer that means you may have seconds (a computational eternity) to get their account setup information propagated.

If you can frame data SLAs in terms of the screens your users are working in, then you can get more honest answers. If the same piece of data is served by two SCSs (thus requiring a propagation), each of which shows that data on a different screen, then you can look at how long it will take the 99th percentile user to get from one screen to the other. If the screens are adjacent then maybe you talk in single-digit seconds. If it requires a complex workflow to get from one screen to the other then your SLA can be higher.
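
Here is a rough TypeScript sketch of how you might turn that into a number, assuming you already record how long users take to get from the first screen to the second. The function names are hypothetical. I am also making an interpretation here: I read "the 99th percentile user" as the user who is faster than 99% of everyone else, so the sketch takes the 1st percentile of the transition times. If you read it differently, pass a different percentile.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("need at least one sample");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Hypothetical helper: the replication budget is the time the fastest
// 1% of users need to move between the two screens (an assumption about
// what "99th percentile user" means, see the note above).
function replicationBudgetMs(transitionTimesMs: number[]): number {
  return percentile(transitionTimesMs, 1);
}
```

If replication between the two SCSs reliably finishes inside that budget, almost nobody arrives at the second screen early enough to see the stale value.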

Multiple SCSs on the same screen

If you have a component-based UI where two different SCSs display data on the same screen, you may need to be a bit creative. Our initial example may fall into that category, with one SCS displaying your header and a different SCS showing the billing information. In this case there are a few things you can do.

One strategy would be to abstract some of the data into an API. This can be particularly useful for data in your header or sidebar. You may have quick recommendation data, or other common data you want your users to have at their fingertips in those areas, and then have other screens that show that data in a more powerful way. The disadvantage to this strategy is that you are making two calls to the backend to get the same pieces of data.

Another strategy is to set up a simple data replication on the front end. You can set this up by using a pubsub type event emitter in the browser. There are several libraries out there and I am loath to suggest one because the entire JavaScript landscape may change by the time you read this. You can pull the data down once and publish a message with that data that can be read by multiple SCSs. The disadvantage is that we have argued in the past about the power of the single data pull (per SCS) to get us exactly the data that we need. There is a good chance that the data in the header is a simpler form than the data you need for the more full-featured main part of the screen. Also, please do not grab data from the memory that another SCS owns. This would break encapsulation and make it hard (impossible?) to independently deploy the two SCSs.
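
To show the shape of the pubsub idea without picking a library, here is a tiny hand-rolled sketch in TypeScript. Everything in it (UsageBus, UsageSnapshot, headerState, billingState) is hypothetical and the values are made up. Each subscriber receives its own copy of the message and stores what it needs in its own state, so neither SCS reaches into memory the other one owns.

```typescript
// Hypothetical message shape for the usage example.
interface UsageSnapshot {
  units: number;
  asOf: string;
}

type UsageHandler = (snapshot: UsageSnapshot) => void;

class UsageBus {
  private handlers: UsageHandler[] = [];

  // Returns a function that removes the subscription again.
  subscribe(handler: UsageHandler): () => void {
    this.handlers.push(handler);
    return () => {
      this.handlers = this.handlers.filter((h) => h !== handler);
    };
  }

  publish(snapshot: UsageSnapshot): void {
    // Hand every subscriber its own shallow copy so no two SCSs share an object.
    for (const handler of this.handlers.slice()) {
      handler({ ...snapshot });
    }
  }
}

// Each SCS owns its own state and only learns about usage through the bus.
const usageBus = new UsageBus();
const headerState = { units: 0 };
const billingState = { units: 0 };

usageBus.subscribe((s) => { headerState.units = s.units; });
const stopBilling = usageBus.subscribe((s) => { billingState.units = s.units; });

// Whoever pulled the data down publishes it once; both SCSs see the same number.
usageBus.publish({ units: 42, asOf: "2018-01-24T00:00:00Z" });
```

The bus is the only thing the two SCSs share, and it carries messages rather than references, which keeps them independently deployable.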

You can also add a web socket strategy to your page. In this case each SCS loads its data separately and a web socket pushes updates so that the data becomes eventually consistent. In some cases this may fix the problem before the user even notices. The disadvantage is that this may be a heavier weight solution than you need.
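
As a sketch of the client side of this, here is a small TypeScript example. It assumes the server pushes JSON messages that carry a version number along with the usage value; that version field is my assumption, not something the strategy requires, but it lets the view ignore stale or duplicate messages. All names are hypothetical, and the socket is reduced to the one method the sketch needs so it can be exercised without a server.

```typescript
// Hypothetical message pushed over the socket.
interface UsageUpdate {
  version: number; // assumed to increase with every change on the server
  units: number;
}

// The only part of a socket this sketch relies on.
interface MessageSource {
  addEventListener(type: "message", listener: (event: { data: string }) => void): void;
}

class UsageView {
  private current: UsageUpdate;

  constructor(initial: UsageUpdate) {
    this.current = initial;
  }

  // Applies an update only if it is newer than what is already shown.
  apply(update: UsageUpdate): boolean {
    if (update.version <= this.current.version) return false; // stale or duplicate
    this.current = update;
    return true;
  }

  getUnits(): number {
    return this.current.units;
  }
}

// Wires the view to the socket: every pushed message nudges the view
// towards the latest value, which is all eventual consistency asks for.
function keepInSync(socket: MessageSource, view: UsageView): void {
  socket.addEventListener("message", (event) => {
    const update = JSON.parse(event.data) as UsageUpdate;
    view.apply(update);
  });
}
```

In a browser the socket would be a standard WebSocket connected to whatever endpoint your backend exposes. Each SCS would create its own view from its own initial data pull and then call keepInSync on it.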

Lastly, you can just brand the data differently so users do not expect the two numbers to match. For instance, you may want to call the data in your header “quick picks” and on a fuller screen experience call it “recommendations”.

Conclusion

Instantaneous data is a fiction, and when people use the term they are really talking about data that is faster than a context-dependent SLA. In general you can reason about the acceptability of data replication speeds by framing the discussion in terms of SLAs. Even if you cannot meet your data replication SLAs, there are ways to present data to the user in such a way that it doesn’t matter.

Continue Reading in Part 5 — Ownership of Writes
