How we develop apps that rely on databases in a Kubernetes workflow
Recently, our team switched to Kubernetes for all of our hosted services. There are many reasons we made the move, but here are a few of them:
- Ability to write code that is not specific to any cloud provider
- Improved ability to scale based on demand
- Reduced cost compared to our existing solution
Furthermore, containerising our applications made us think very carefully about dividing up our application into microservices that are all stateless. This means they can be easily scaled or replaced by Kubernetes without downtime. Stateless code is great, but it’s likely you’ll end up needing to store state somewhere, and that traditionally takes shape as a database. We had a think about databases in Kubernetes and suddenly the story became less clear.
What’s the state of production?
In the past, we’ve always deployed our databases in a cloud provider’s managed PaaS offering. This was great because we didn’t have to worry about provisioning and mounting disks, managing backups, configuring monitoring, allocating the right amount of compute or RAM to the underlying machine, etc.
However, we asked ourselves whether databases made sense in the Kubernetes cluster alongside the rest of the application*. We quickly realised we’d be sacrificing all of the niceties of managed services for the benefit of… well, what? We weren’t really sure what putting our database in a container gave us. Yes, it removes our reliance on a managed database service, but that was never really a problem anyway, since all of our code worked through connection strings. It just so happened that the connection string pointed at a cloud provider’s service. To our code, a database is a database, wherever it lives.
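To illustrate the connection-string point, here is a minimal Python sketch. It assumes the connection string arrives in an environment variable called DATABASE_URL and uses a PostgreSQL-style URL; both the variable name and the example hosts are our own placeholders, not something the services described here necessarily use.

```python
import os
from urllib.parse import urlsplit

# Hypothetical variable name: an assumption made for this sketch.
ENV_VAR = "DATABASE_URL"


def database_target(env=os.environ):
    """Return (host, port, database name) parsed from the connection string."""
    parts = urlsplit(env[ENV_VAR])
    return parts.hostname, parts.port, parts.path.lstrip("/")


if __name__ == "__main__":
    # The application code is identical in both cases; only the value changes.
    managed = {ENV_VAR: "postgresql://app:secret@db.example.com:5432/orders"}
    local = {ENV_VAR: "postgresql://app:secret@localhost:5432/orders"}
    print(database_target(managed))
    print(database_target(local))
```

Nothing in the function knows or cares whether the host is a managed service or a container on a laptop.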
Because our team was developing software to validate product/market fit, we quickly decided we didn’t have the capacity or desire to manage all of the backups, monitoring and compute provisioning ourselves. So we settled on using a Kubernetes external service to point to a cloud provider’s managed service in production.
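One way to set this up is a Service of type ExternalName, which gives the managed database a stable DNS name inside the cluster. The sketch below builds such a manifest in Python and prints it as JSON. The service name and hostname are placeholders, and we are assuming an ExternalName Service here; other arrangements are possible.

```python
import json


def external_database_service(name, managed_host):
    """Build a Service manifest that aliases an in-cluster name to an external host."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {"type": "ExternalName", "externalName": managed_host},
    }


if __name__ == "__main__":
    # Placeholder values: substitute your own service name and managed hostname.
    manifest = external_database_service("orders-db", "orders.db.example.com")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the output could be piped to `kubectl apply -f -`. Applications would then connect to `orders-db` and never see the provider's hostname.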
During development, we follow the same approach as Kelsey Hightower:
Databases add some extra complexity, though.
We don’t want to make each developer install the required database server on their machine. That makes getting up and running slower and also consumes resources unnecessarily.
PaaS databases weren’t a perfect solution either. You still need to worry about manual operational tasks like creating new databases, rolling back migrations, resetting the database when you do something wrong, etc. There’s also the concern of data size and time constraints. If I want significant amounts of data to replicate production use cases, then copying or generating that data is going to take a long time.
Containers to the rescue?
We decided to investigate whether database containers could solve the problems described above.
On paper it looked great. We could spin up the database server, grab a connection string and connect to it in our development environment. When we were done, we could delete the container and start from scratch again next time.
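As a rough sketch of that loop, assuming Docker and the official PostgreSQL image (the article itself does not name a database engine), the following Python builds the command to start a throwaway container, the command to stop it, and the connection string to use in between. The container name, password and port are placeholders.

```python
import shlex


def postgres_container(name, password, host_port=5432, image="postgres:16"):
    """Return (start command, stop command, connection string) for a disposable database."""
    start = [
        "docker", "run", "--detach", "--rm",
        "--name", name,
        "--env", f"POSTGRES_PASSWORD={password}",  # required by the official image
        "--publish", f"{host_port}:5432",
        image,
    ]
    # Because of --rm, stopping the container also deletes it and its data.
    stop = ["docker", "stop", name]
    url = f"postgresql://postgres:{password}@localhost:{host_port}/postgres"
    return start, stop, url


if __name__ == "__main__":
    start, stop, url = postgres_container("dev-db", "devpass", host_port=55432)
    print(shlex.join(start))
    print(url)
    print(shlex.join(stop))
```

The commands are only printed here; they could be handed to `subprocess.run` once you are happy with them.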
Trouble in paradise
Spinning up database containers is a great way to get a development environment up quickly. However, there’s an annoying consideration that isn’t usually a problem for applications: state. Here are a few things we thought about:
- I need my database to start with the schema in place
- I might need some data in there when it starts, to make my development and testing workflow easier
- I might want that data to persist after I stop debugging
- I might want to start from a known checkpoint with data that is known to reproduce a bug or scenario
- I might want to switch between branches, and I expect my data to stay with the branch so I can come back to it later
- I might want to run some tests on production-like data to make sure that my changes work as expected
- I might want to spin up a database with production-like data to test my latest application changes, and then tear it down immediately afterwards
- I don’t want any of these operations to take a long time
- I don’t want to take up GBs of space on my local machine
Some of these things can be solved more easily than others. Initialising the schema can be done with manual scripts or automated tools as part of the container start-up process (this can be achieved with jobs in Kubernetes, for example). Persisting the data after you stop debugging is easy if you don’t restart the container every debug cycle, or if you use mounted volumes (although you’re then no longer starting from a baseline each time you debug, as you would with an application).
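To make the schema step concrete, here is a minimal sketch of an idempotent start-up script of the kind a Kubernetes job or container entrypoint might run. It uses Python's built-in sqlite3 module purely as a stand-in so the example is self-contained; the migrations and the schema_migrations bookkeeping table are hypothetical, and a real project would more likely use an established migration tool.

```python
import sqlite3

# Hypothetical migrations, applied in order.
MIGRATIONS = [
    ("001_create_customers",
     "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    ("002_add_email",
     "ALTER TABLE customers ADD COLUMN email TEXT"),
]


def apply_migrations(conn, migrations=MIGRATIONS):
    """Apply migrations that are not yet recorded; return the names applied in this run."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    applied = []
    for name, sql in migrations:
        if name in done:
            continue  # already applied on an earlier start-up
        conn.execute(sql)
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
        applied.append(name)
    conn.commit()
    return applied


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(apply_migrations(conn))  # first run applies everything
    print(apply_migrations(conn))  # second run has nothing left to do
```

Because the script records what it has already applied, it is safe to run on every start, whether the container is fresh or has a mounted volume with existing data.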
What about switching branches and production-like data? You might be able to achieve some of this with extra effort involving writing your own scripts or provisioning system.
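As an example of the kind of script you might end up writing, the sketch below derives a stable database name from a git branch name, so each branch can keep its own data. The naming scheme is entirely our own invention for illustration, and the 63-character cap assumes PostgreSQL's default identifier limit.

```python
import hashlib
import re

MAX_LEN = 63  # PostgreSQL's default identifier limit; an assumption about the target database


def database_name_for_branch(app, branch):
    """Map a branch name to a deterministic, database-safe name."""
    slug = re.sub(r"[^a-z0-9]+", "_", branch.lower()).strip("_")
    # A short hash keeps names distinct when different branches produce the same slug.
    digest = hashlib.sha1(branch.encode("utf-8")).hexdigest()[:8]
    prefix = f"{app}_{slug}"[: MAX_LEN - len(digest) - 1]
    return f"{prefix}_{digest}"


if __name__ == "__main__":
    print(database_name_for_branch("orders", "feature/ADD-Invoices"))
    print(database_name_for_branch("orders", "main"))
```

This only solves naming. Creating, seeding and cleaning up those per-branch databases is still work you would have to build and maintain yourself.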
But suddenly you’re using your time to solve a database development problem, instead of whatever business problem your application as a whole is designed to solve.
This is perhaps best expressed by Sarah Wells from the Financial Times at KubeCon 2018 EU, when discussing why they moved to Kubernetes from their own in-house systems:
“The FT is not a cluster orchestration company, we are a news organisation. And that’s where we should focus our innovation.”
Bridging the gap
We believe that development shouldn’t be blocked by the database. Containers go some way to solving this problem, but the issue of state still gets in the way.
In the Foundry team at Redgate, we’ve felt the pain of this frequently enough that we decided to investigate how we might solve these problems. The solution is starting to take shape in what we’re calling Project Spawn. With Project Spawn, we’re able to provision database environments with data rapidly. We don’t have to worry about allocating huge amounts of disk space locally, or worry about installing database servers anywhere. We define a database environment via a config file, and request it using the command line.
Do you recognise any of the problems discussed in this article? How are you solving them right now? How much time and energy do you spend working on these issues?
We’d love to talk to you. Leave a comment on this post, or contact email@example.com if you’d like to speak with us directly. We’re working with a group of people to validate Project Spawn, so if you’d like to get involved then get in touch!
*Using StatefulSets and Persistent Volumes in Kubernetes makes this much easier. However, we feel it is still more cost- and time-effective to use a managed database PaaS offering, and to spend the time you save on building the app that solves your core business problem. Until Kubernetes can offer feature parity with existing database PaaS offerings, we think databases are best placed outside the cluster.