When Scylla, our database vendor, gave us an award for “Fastest time to production,” we were very proud of what we had achieved, but not surprised. We pride ourselves on having an Agile culture at Meshify, and we embraced modern Agile/DevOps practices to minimize, wherever possible, the operational costs of a high-performance database like Scylla. I gave a talk on this, where I emphasized what this meant in practice (using pre-configured and pre-tuned instances to allow for rapid node spinups).
But we began doing that almost a year ago, so I want to follow up with some updates on lessons learned from a year in production.
In that time we’ve:
- Upscaled our production cluster once. This took about a day (node spinup, data transfer, and node decommission)
- Moved from one set of nodes to another once (moving to a different AWS account as part of our migration to Kubernetes).
- Replaced nodes multiple times. Following the DevOps philosophy of “cattle, not pets,” when a Scylla node stops performing, we prefer replacing the node to troubleshooting it.
Here are some of our top lessons from that time:
Understanding the tradeoffs of your architecture:
Key to our disaster recovery strategy is that our platform remains operational, with degraded functionality, given only a valid copy of our schema (no data from our Scylla database is required). This means that in a worst-case scenario (so far only exercised during internal Disaster Recovery/Business Continuity tests), we can spin up a cluster, give it valid schema, and our primary alarms will fire.
Backups don’t matter, restores matter:
A well-followed backup schedule isn’t enough if you aren’t regularly validating and testing your backups.
We regularly restore and validate our backups, so here are some of the Scylla-specific things we have discovered.
The Scylla restore is topology dependent. Specifically, going from a 3-node to a 5-node configuration via restore is impossible. You must restore to the same number of nodes you had at backup time; then you can scale the cluster up or down to change your topology.
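In our restore tooling, it pays to fail fast on this before any data moves. Here is a minimal sketch of that pre-flight check; the function name and structure are illustrative, not part of Scylla's tooling:

```python
# Hypothetical pre-restore sanity check: Scylla snapshot restores are
# topology dependent, so the restore target must have the same node
# count as the cluster the backup was taken from.

def check_restore_topology(backup_node_count: int, target_node_count: int) -> None:
    """Fail fast if the restore target doesn't match the backup topology."""
    if backup_node_count != target_node_count:
        raise ValueError(
            f"Backup was taken on {backup_node_count} nodes; "
            f"restore target has {target_node_count} nodes. "
            "Restore to the original node count, then scale the cluster."
        )

# A 3-node backup must land on a 3-node cluster first...
check_restore_topology(3, 3)

# ...and only after the restore can you scale out to 5 nodes.
try:
    check_restore_topology(3, 5)
except ValueError as err:
    print(err)
```

Scaling the cluster after the restore completes is then an ordinary Scylla topology change.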
The Scylla restore process is “live”:
Once a Scylla cluster has valid schema, it can accept writes.
The Scylla restore process is entirely upserts, which is extremely powerful:
In practice this means that you can restore a backup into a production system without “losing” any of the data that is being written during the restore, or is already present.
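A toy model makes the upsert semantics concrete. This is not Scylla code, just a sketch of last-write-wins merging by write timestamp, which is why live writes survive a concurrent restore:

```python
# Toy model of an upsert-style restore (illustrative, not Scylla internals):
# each cell carries a write timestamp, and the restore merges backup data
# into the live table with last-write-wins semantics.

def upsert(table: dict, key: str, value: str, ts: int) -> None:
    """A write wins only if its timestamp is newer than the existing cell's."""
    if key not in table or ts > table[key][1]:
        table[key] = (value, ts)

def restore(live: dict, backup: dict) -> None:
    """Replay every backup cell as an upsert into the live table."""
    for key, (value, ts) in backup.items():
        upsert(live, key, value, ts)

backup = {"sensor-1": ("20C", 100), "sensor-2": ("21C", 100)}

live = {}
upsert(live, "sensor-1", "25C", 200)   # a write arriving mid-restore
restore(live, backup)

assert live["sensor-1"] == ("25C", 200)  # the newer live write survives
assert live["sensor-2"] == ("21C", 100)  # the missing row is restored
```

The restore fills in whatever is missing and never clobbers newer data — exactly what you want when restoring into a system that is still taking writes.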
Scylla point in time recovery:
Any recovery strategy requires a method for point-in-time recovery, even if only as a sanity check (the data really looks the same today as it did yesterday). The downside of Scylla’s “upsert” restores is that point-in-time recovery requires a fresh schema (or the table to be truncated before the restore). The best way we have found to do this is to spin up an equally sized cluster (3 smaller nodes).
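The same toy model shows why the truncate (or fresh schema) is necessary: because the restore is all upserts, rows written after the backup was taken survive it, so the result is not the state at backup time. A sketch, with plain dicts standing in for tables:

```python
# Why point-in-time recovery needs a truncate (or fresh schema) first.
# Plain dicts stand in for tables here; this is a model, not Scylla code.

backup = {"sensor-1": "20C"}        # state at backup time

live = dict(backup)
live["sensor-2"] = "30C"            # row written *after* the backup

restored = dict(live)
restored.update(backup)             # upsert-style restore over live data
assert restored != backup           # sensor-2 leaked through the restore

live.clear()                        # truncate first...
live.update(backup)                 # ...then restore
assert live == backup               # true point-in-time state
```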
Features won’t always be ready:
We only upscaled our Scylla production cluster because a new feature (per-user service level agreements) wasn’t ready. We are moving data to a data warehouse, which is a textbook use case for a reduced performance priority (computers have more patience than humans, so making a computer wait for data so that a human gets their data faster is usually the right decision).
We have no regrets, because we needed the performance then, and we could validate and test something we understand (more performance through additional hardware did cost us in AWS fees, but the decreased risk of using features that we were comfortable with and had heavily tested in staging was worth it).
Scylla is fast:
Here is how Scylla performs right now, as I’m writing these words:
You still need enough speed for a “worst case scenario”.
This might be scaling 101, but when deciding how much performance you need in a database, you need to be able to handle your full traffic when an unplanned hardware failure happens and while you are running another intensive operation (in the case of Scylla, that would be the anti-entropy repair, followed by our nightly backup).
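The worst-case arithmetic is simple enough to sketch. The numbers and the 30% maintenance overhead below are illustrative assumptions, not our production figures:

```python
# Back-of-the-envelope capacity check: the cluster must absorb full
# traffic with one node down while repair and backup run concurrently.
# All numbers are illustrative assumptions.

def required_node_capacity(peak_ops: int, nodes: int,
                           failed_nodes: int = 1,
                           maintenance_overhead: float = 0.30) -> float:
    """Ops/sec each surviving node must sustain in the worst case."""
    surviving = nodes - failed_nodes
    per_node = peak_ops / surviving
    return per_node * (1 + maintenance_overhead)

# 90k ops/sec peak on a 3-node cluster, one node down, 30% repair/backup
# overhead -> each surviving node needs roughly 58.5k ops/sec of headroom.
print(required_node_capacity(90_000, 3))
```

If that per-node number exceeds what your hardware can deliver, you are sized for the happy path, not the worst case.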
Here is what our HW load looks like during a nightly backup:
You’ll note that our requests served remain static (IoT sensors don’t care that it’s the middle of the night, because they don’t sleep).
And let’s look at our performance during that time:
The biggest takeaway is that Scylla’s command queueing systems work as advertised: pushing the nodes to 100 percent load allows the Meshify platform to work as intended, without unacceptable latency, failed reads, or failed writes.
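The intuition behind that behavior can be sketched with a toy priority queue. This is inspired by, but much simpler than, Scylla's actual scheduler (which uses weighted shares rather than strict priority, so background work is never fully starved): at 100% utilization, foreground requests are dispatched ahead of background maintenance, so client latency stays bounded while repair and backup soak up the leftover capacity.

```python
# Toy sketch of priority queueing under full load (not Scylla's scheduler):
# every tick dispatches one request, and queued foreground work always
# beats queued background work.
from collections import deque

foreground = deque(f"read-{i}" for i in range(3))    # client reads
background = deque(f"repair-{i}" for i in range(3))  # repair/backup work

order = []
while foreground or background:
    queue = foreground if foreground else background
    order.append(queue.popleft())

# Foreground drains first even though background was queued the whole time.
print(order)
```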
Sam Kenkel is a DevOps Lead at Meshify.
Meshify is an Austin, TX-based IoT hardware & software company for the insurance industry providing innovative IoT solutions with a focus on simple installation, actionable insights, and cost-effectiveness.