
Scylla: A year in production

Meshify
Jul 19, 2019 · 5 min read

When Scylla, our database vendor, gave us an award for “Fastest time to production,” we were very proud of what we had achieved, but not surprised. We pride ourselves on having an Agile culture at Meshify, and we embraced the idea that, to minimize the operational costs of a high-performance database like Scylla wherever possible, we should adopt modern Agile/DevOps practices. I gave a talk on this, where I explained what this meant in practice: using pre-configured and pre-tuned instances to allow for rapid node spinups.

But we began doing that almost a year ago, so I want to follow up with some updates on lessons learned from a year in production.

In that time we’ve:

  • Upscaled our production cluster once. This took about a day (node spinup, data transfer, and node decommission).
  • Moved from one set of nodes to another once (moving to a different AWS account as part of our migration to Kubernetes).
  • Replaced nodes multiple times. In keeping with the DevOps philosophy of “cattle not pets”, when a Scylla node stops performing, we prefer replacing the node to troubleshooting it.

Here are some of our top lessons from that time.

Understanding the tradeoffs of your architecture:

Key to our disaster recovery strategy is that our platform stays operational, with degraded functionality, as long as Scylla has a valid copy of our schema; none of the data in our Scylla database is required. This means that in a worst-case scenario (so far only exercised during internal disaster recovery and business continuity tests), we can spin up a new cluster, give it a valid schema, and our primary alarms will still fire.

Backups don’t matter, restores matter:

A well-followed backup schedule isn’t enough unless you also validate and test your backups regularly.

We regularly restore our backups and validate them, so here are some of the Scylla-specific things we have discovered.

The Scylla restore is topology dependent. Specifically, going from a 3-node to a 5-node configuration via restore is impossible. You must restore to the same number of nodes you had at backup time; after that, you can scale your cluster up or down to change your topology.
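As a rough illustration of that rule, here is a minimal Python sketch of a restore planner. The function name `plan_restore` and the step wording are hypothetical, not part of any Scylla tooling; the sketch only encodes the ordering described above: restore onto the original node count first, then resize.

```python
from typing import List


def plan_restore(backup_nodes: int, desired_nodes: int) -> List[str]:
    """Return the ordered steps to get from a backup to the desired cluster size.

    Hypothetical helper: it never restores straight onto a different node
    count, because the restore must match the topology the backup was taken on.
    """
    if backup_nodes < 1 or desired_nodes < 1:
        raise ValueError("node counts must be positive")

    # Step 1 is always a restore onto the same number of nodes as the backup.
    steps = [f"restore backup onto a {backup_nodes}-node cluster"]

    # Step 2 (optional) changes the topology only after the restore is done.
    if desired_nodes > backup_nodes:
        steps.append(
            f"add {desired_nodes - backup_nodes} node(s) to reach {desired_nodes}"
        )
    elif desired_nodes < backup_nodes:
        steps.append(
            f"decommission {backup_nodes - desired_nodes} node(s) to reach {desired_nodes}"
        )
    return steps


if __name__ == "__main__":
    for step in plan_restore(backup_nodes=3, desired_nodes=5):
        print(step)
```

For a backup taken on 3 nodes and a target of 5, the plan is a 3-node restore followed by adding 2 nodes.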

The Scylla restore process is “live”:

Once a Scylla cluster has a valid schema, it can accept writes, even while a restore is still in progress.

The Scylla restore process is entirely upserts, which is extremely powerful:

In practice this means that you can restore a backup into a production system without “losing” any of the data that is being written during the restore, or is already present.
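To see why an upsert-based restore does not clobber newer data, here is a small Python model. It is a simplified sketch, not Scylla code: it assumes restored cells keep their original write timestamps and that conflicts are resolved by last-write-wins on that timestamp. The names `upsert`, `restore`, and the sensor keys are hypothetical.

```python
from typing import Dict, Tuple

# A cell is modeled as (write timestamp, value). Real cells carry more
# metadata; this is deliberately minimal.
Cell = Tuple[int, str]


def upsert(table: Dict[str, Cell], key: str, timestamp: int, value: str) -> None:
    """Write a cell, keeping whichever version has the newer timestamp."""
    current = table.get(key)
    # Simplification: on a timestamp tie the existing cell is kept.
    if current is None or timestamp > current[0]:
        table[key] = (timestamp, value)


def restore(live: Dict[str, Cell], backup: Dict[str, Cell]) -> None:
    """Replay every backed-up cell into the live table as an upsert."""
    for key, (timestamp, value) in backup.items():
        upsert(live, key, timestamp, value)


if __name__ == "__main__":
    backup = {"sensor-1": (100, "old reading"), "sensor-2": (90, "only in backup")}
    live = {"sensor-1": (200, "new reading")}  # written after the backup was taken

    restore(live, backup)
    print(live)
```

In this model the newer `sensor-1` value survives the restore, while `sensor-2`, which existed only in the backup, is brought back.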

Scylla point in time recovery:

Any recovery strategy requires a method for point-in-time recovery, even if just as a sanity check (does the data really look the same today as it did yesterday?). The downside of Scylla’s “upsert” restores is that point-in-time recovery requires a fresh schema (or a table that can be truncated before the restore). The best way we have found to do this is to spin up a separate cluster with the same node count (3 smaller nodes).

Features won’t always be ready:

We only upscaled our Scylla production cluster because a new feature (per-user service level agreements) wasn’t ready. We are moving data to a data warehouse, which is a textbook use case for a reduced performance priority: computers have more patience than humans, so making a computer wait for data so that a human gets their data faster is usually the right decision.

We have no regrets, because we needed the performance then, and we could validate and test something we understand. More performance through additional hardware did cost us in AWS fees, but the decreased risk of relying on features that we were comfortable with and had heavily tested in staging was worth it.

Scylla is fast:

Here is how Scylla performs right now, as I’m writing these words:

Image for post

You still need enough speed for a “worst case scenario”.

This might be scaling 101, but when deciding how much performance you need in a database, you need to be able to handle your full traffic even when an unplanned hardware failure happens while you are running another intensive operation (in the case of Scylla, that would be the anti-entropy repair, followed by our nightly backup).
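The arithmetic behind that sizing rule can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name, the idea that load spreads evenly across surviving nodes, and the example numbers are hypothetical, not measurements from our cluster.

```python
def required_per_node_capacity(
    peak_ops_per_sec: float,
    nodes: int,
    failed_nodes: int = 1,
    background_share: float = 0.25,
) -> float:
    """Ops/sec each node must sustain in the worst case.

    Assumes traffic spreads evenly over the surviving nodes and that a
    background task (repair or backup) consumes `background_share` of each
    node's capacity at the same time.
    """
    surviving = nodes - failed_nodes
    if surviving < 1:
        raise ValueError("at least one node must survive")
    if not 0 <= background_share < 1:
        raise ValueError("background_share must be in [0, 1)")

    # Traffic per surviving node, scaled up for the capacity lost to the
    # background operation.
    return peak_ops_per_sec / surviving / (1 - background_share)


if __name__ == "__main__":
    # Hypothetical numbers: 30,000 ops/sec peak, 3 nodes, 1 node lost,
    # a quarter of each node busy with repair or backup.
    print(required_per_node_capacity(30_000, nodes=3))
```

With those made-up inputs, each node needs headroom for 20,000 ops/sec, double the 10,000 ops/sec it serves on a quiet day with all three nodes healthy.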

Here is what our HW load looks like during a nightly backup:

Image for post

You’ll note that our requests served remain static (IoT sensors don’t care that it’s the middle of the night, because they don’t sleep).

And let’s look at our performance during that time:

Image for post

The biggest takeaway is that Scylla’s command queueing systems work as advertised: even with the nodes pushed to 100 percent load, the Meshify platform works as intended, without unacceptable latency, failed reads, or failed writes.

Find out more about Scylla by visiting their website, or join the Scylla users Slack and look for me there.

If you’re interested in using Scylla as part of a global IoT platform, we’re hiring.


Sam Kenkel is a DevOps Lead at Meshify.

Meshify is an Austin, TX-based IoT hardware & software company for the insurance industry providing innovative IoT solutions with a focus on simple installation, actionable insights, and cost-effectiveness.
