A Holly, Jolly System Of Record

Ian Varley
Salesforce Engineering
7 min read · Dec 15, 2016

The Architecture Files, Episode 6

Ho ho ho, readers! I hope those of you in the northern hemisphere of the planet are enjoying the portion of our orbital journey approaching perihelion. Happy holidays!

You’re reading the Salesforce Architecture Files, a regular series here on Medium about our technology and architecture, and related topics. Our goal is to dig a little bit deeper than the average white paper, and talk about not just what technologies we build, but why we build them the way we do.

Last time, we talked about transactionality, and how it affects data in complex systems. In that article, referring to ACID transactions, I said:

“This behavior is partly what allows our customers to use Salesforce as a system of record for their most critical data.”

Partly! But not completely. It turns out, to be a system of record (SOR), there’s more work to do.

What’s a “system of record”? It’s the authoritative source for a given piece of information. Your one true copy. Your sine qua non. Your all-eggs-in-one basket. If anything happened to it, it would definitely ruin Christmas (or the winter holiday of your choice).

We can’t control whether people actually use Salesforce as a system of record (that’s up to them). But many do, and it’s our job to make it worthy of that designation. And to do that, we ensure the system upholds 4 key properties. It has to be:

  • Durable
  • Correct
  • Restorable
  • Disaster-Ready

Note: In this post, we’re not going to dig into the specific details of the technologies we use here at Salesforce for these purposes. These details vary pretty widely from system to system, and anyway, that’s too much detail for a post like this. (But if you’re keenly interested in those details, let us know, and we’ll do a follow up.)

1. Durable

This might sound obvious, but: if you accept some data, that data needs to stick around, no matter what. It can’t melt like Frosty when the sun comes out.

Now, of course, nobody goes around building systems that lose data on purpose, when everything is going well. The problem is … problems. Failures, of hardware and software. In a large scale, distributed world, the only thing you can really count on is that failures will happen, all the time.

So, here’s the rule: to be truly durable, no single component failure can be allowed to cause data loss. That includes: a single server, a single disk, a single rack, a single network switch … even a single data center. All of these things can (and do) fail, and you’re not ready to be an SOR if any one of those failures would cause you to lose data.

This implies that you absolutely cannot keep just one copy of the data, whether in memory or on disk. Nor can you even rely on multiple copies on the same machine. You have to expect that any component can go south, at any time.

And this is not trivial … especially if you want the system to scale to store massive amounts of data. When the numbers get large, it matters how many copies of the data you make. If you’re building a new SOR-worthy system, you need to make informed decisions about how data is physically stored, failure conditions, efficiency, et cetera. This means you should probably bone up on your distributed system design, eh? (And, by the way, a good way to do that is by following the excellent series of daily academic CS paper reviews over at The Morning Paper.)

A key word in the above description is “accept”. There will always be times when something (a network error, a solar flare, etc.) prevents you from getting some data in the first place. There’s nothing you can do about that. But when you send an ACK to the end user (“Thanks, yo, we got it!”), then from that moment on, it must be durable.

At Salesforce, every SOR-worthy system we run — relational databases, big data systems like HBase, and even certain uses of stream processing systems like Kafka that function as part of our data storage architecture — must store data durably, even in the face of individual component failure.
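To make that “ACK means durable” contract concrete, here’s a minimal sketch using the Apache Kafka producer API. (The topic name and broker addresses are placeholders, not our actual config.) The key setting is acks=all: the client doesn’t treat a write as successful until every in-sync replica has persisted it, so no single broker failure can eat an acknowledged record.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all" = the leader waits for every in-sync replica to persist the
        // record before acknowledging it back to us.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retries are safe: idempotence prevents duplicate writes on retry.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("critical-events", "order-42", "accepted");
            // send() is asynchronous; get() blocks until the broker's ack
            // arrives. Only after this returns may we tell the user "we got it!"
            producer.send(record).get();
        }
        // If get() threw instead, no ack was given, so no durability promise
        // was ever made to the user -- fail loudly, don't pretend it's stored.
    }
}
```

One design note: pair acks=all with a topic-level min.insync.replicas of at least 2 on the broker side, so that “all” is guaranteed to mean more than one physical copy.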

2. Correct

Everyone knows that magic dust is what makes reindeer fly, right?

Well, you know what another word for magic dust is? Cosmic radiation. And you know what cosmic radiation does to data on disks? Bad things.

Thus, when you accept data into a system of record, it’s not enough to just keep it; you have to keep the right data, and you have to make sure it doesn’t get altered or corrupted over time. At scale, this can be tricky.

This is why mature storage systems make heavy use of internal mechanisms to ensure data correctness: parity bits, checksums, and similar methods. When you write some data and slap a CRC on it, you can later verify that the bits you have are the bits you wrote. (Or, if they’re not, you at least know about it, and can use one of your other copies of the data to fix the problem.)

This is an integral part of every SOR-worthy storage system at Salesforce. As an example, in our big data storage systems (Apache HBase), the underlying file system (HDFS) uses CRC-32 checksums on each copy of the data to ensure that if corruption happens on any block, the system can use one of the other replicas within the cluster to “heal” that corrupted block.
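Here’s a toy version of that checksum-on-write, verify-on-read pattern, using Java’s built-in CRC-32. (An illustration of the idea, mind you, not HDFS’s actual implementation.)

```java
import java.util.zip.CRC32;

// A toy block that remembers a checksum of its contents at write time,
// and refuses to hand back bits that no longer match it.
public class ChecksummedBlock {
    private final byte[] data;
    private final long storedCrc;

    public ChecksummedBlock(byte[] data) {
        this.data = data.clone();
        this.storedCrc = crcOf(this.data); // computed once, at write time
    }

    // Returns the data only if its bits still match the write-time checksum.
    public byte[] read() {
        if (crcOf(data) != storedCrc) {
            // In a replicated store, this is the moment you'd fetch a healthy
            // replica and overwrite ("heal") the corrupted copy.
            throw new IllegalStateException("Corruption detected; read another replica");
        }
        return data.clone();
    }

    private static long crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }
}
```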

Cosmic rays are real, but they’re obviously not the only cause of incorrect data. A bug in code is much more effective at corruption than cosmic radiation is! Internal data correctness checks are therefore only part of the solution. So in addition, your data also needs to be …

3. Restorable

Have you seen the movie “It’s A Wonderful Life”? In it, an angel named Clarence gives the main character, George Bailey, a glimpse of a world where he’d never been born, by wiping out his entire existence. This bleak vision is enough to snap him out of his yuletide depression. But it would have been a very different story if Clarence (who, let it be said, is only an angel 2nd class) had been unable to successfully restore the real state of the world, with George in it, from his backup!

The point is, things go wrong. Logic errors, mistaken assumptions, code bugs, human errors — the list goes on. To mix holiday movie metaphors: sometimes you shoot your eye out.

Thus, another requirement of an SOR-worthy store is that you have to be able to restore it. I say “restore”, rather than “back up”, because it honestly doesn’t matter how “up” it’s “backed”; what actually matters is that you can restore it. (As they say, “A backup you haven’t restored from isn’t a backup.”)

Backup isn’t a simple process, especially as data gets big, but it’s crucial. Imagine the worst possible bug, one that literally rm -rf’s every scrap of data in your database in a single instant. Could you recover?

Salesforce makes offline backup and restore a key part of every SOR-worthy storage system. For example, all the data in our large file storage system, Fileforce, is backed up regularly using an external disk-based system that optimizes the storage, while still allowing us to access older snapshots of the system’s state.
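In the spirit of “a backup you haven’t restored from isn’t a backup,” here’s a hypothetical restore drill, sketched in Java. Everything in it is a stand-in for your real tooling: the paths are made up, and the copy-based “restore” step would be your actual restore procedure. The shape is what matters: restore into a scratch area, then prove the restored bytes match.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// A hypothetical restore drill: a backup only "counts" once you've
// proven you can actually restore from it.
public class RestoreDrill {
    public static void verify(Path original, Path backup, Path scratchDir) throws Exception {
        // 1. Restore the backup into a scratch location. (This sketch just
        //    copies a file; in real life this is your restore tooling.)
        Path restored = scratchDir.resolve("restored.db");
        Files.copy(backup, restored);

        // 2. Compare a digest of the restored data against the source copy.
        //    Any mismatch means the backup cannot be trusted.
        if (!MessageDigest.isEqual(digest(original), digest(restored))) {
            throw new IllegalStateException("Restore drill FAILED: backup does not match source");
        }
        System.out.println("Restore drill passed: backup is actually restorable.");
    }

    private static byte[] digest(Path p) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return md.digest(Files.readAllBytes(p));
    }
}
```

The copy-and-compare above is deliberately naive; the point is that the drill runs end to end, on a schedule, and screams the moment restore stops working.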

4. Disaster-Ready

You know the story of Rudolph the Red-Nosed Reindeer, right? He’s a plucky young Rangifer tarandus, ostracized because of his glowing proboscis, who later turns out to save the day by guiding Santa’s sleigh through a major meteorological disaster. You might say that Rudolph is Santa’s “disaster plan”.

(Now, if that’s your disaster plan, you might want to re-think it.)

Part of being SOR-worthy is that you must plan for the loss of an entire data center or site. This means that the data must be continuously copied to one or more additional geographic locations (often called the “DR”, or Disaster Recovery, sites). The details vary greatly depending on your specific data storage architecture and consistency requirements, but the basic fact is the same: ship the data somewhere else.

This is a common industry requirement at this point; any SOR-worthy store needs the ability to switch sites quickly (a short RTO, or Recovery Time Objective) and with little to no data loss (a short RPO, or Recovery Point Objective).

At Salesforce, our DR sites contain not only a full replica of all the hardware needed to run our production systems, but a near-real-time copy of the data, too. And we spend a lot of time making sure we can safely, quickly switch over to it in an emergency (aka “Site Switching”).
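One small, concrete way to think about RPO: at any moment, your effective RPO is roughly the replication lag between the newest write on the primary and the newest write applied at the DR site. Here’s a hypothetical monitor in that vein; the two timestamps are stand-ins for whatever your replication pipeline actually exposes.

```java
import java.time.Duration;
import java.time.Instant;

// A hypothetical RPO monitor. If the primary site vanished right now,
// everything written after the DR site's last applied write is at risk;
// that gap is the effective RPO at this moment.
public class RpoMonitor {
    public static Duration currentLag(Instant lastPrimaryWrite, Instant lastDrApply) {
        return Duration.between(lastDrApply, lastPrimaryWrite);
    }

    public static void main(String[] args) {
        Instant primary = Instant.parse("2016-12-15T10:00:05Z"); // newest write at primary
        Instant dr      = Instant.parse("2016-12-15T10:00:02Z"); // newest write applied at DR

        Duration lag = currentLag(primary, dr);
        System.out.println("Replication lag (effective RPO): " + lag.toMillis() + " ms");

        // Alert when lag drifts past whatever RPO target you've committed to.
        if (lag.compareTo(Duration.ofSeconds(30)) > 0) {
            System.out.println("ALERT: replication lag exceeds our RPO target");
        }
    }
}
```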

So, these are a few of the core properties you need around your data, to consider yourself worthy of being a System Of Record.

That’s not the end of the story, of course; your entire software system (not just the part that stores data) needs other properties, like high availability. And it goes without saying: a good system is a secure system. Salesforce puts customer trust above all other values, and if you run a system that stores data for people, you ought to, too.

Stewardship of data is a serious responsibility. If you’re building a system that plays that role, think through how you’ve covered all the possibilities, so your days can be merry and bright.

And may all your SORs be tight.

Does building resilient, badass systems appeal to you? Come do it at the world’s most awesomest enterprise cloud company.

Want more architecture files in your stocking? Head to episode #7.
