On Kafka, Clouds and The Phoenix Project

Gwen Shapira

I’ve finally read the seminal DevOps book The Phoenix Project, after numerous people recommended it. It is a quick and fun read, and if you have even a tiny bit of IT ops experience, it will definitely resonate and maybe even be helpful. The book is very noticeably from 2013 — the virtualization state of the art is VMs, not containers. But this doesn’t detract even a bit — we now have (much) better technology to implement more or less the same thing.

There were two paragraphs that stood out to me more than any other in the book. The first is regarding their new agile project “Unicorn”:

At the time the book was written, Apache Kafka had just started gaining popularity outside of LinkedIn, and Kafka Connect and Debezium would not be a thing for at least two more years. But this is still very visibly one of Apache Kafka's top use cases. Both the benefits and the concerns are spot on. But there is one thing they totally missed:

The migration is not a one-time thing. They need the process and infrastructure to continuously pull changes out of the production databases into Unicorn, to avoid using outdated information. This could be done with batch ETL. But Unicorn was celebrated for the ability to run multiple experiments a day, so a continuous CDC process using Kafka and Debezium would be more appropriate. Over time the Unicorn database will continue to evolve and may diverge further from the production schemas. This is a good thing, but the team may need to adopt KSQL to transform the data on the fly.
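As a sketch of what that CDC pipeline might look like, here is a minimal Python example that builds a Debezium source connector configuration and a KSQL transformation. The connector name, hostnames, database, table, and column names are all hypothetical; the configuration keys follow Debezium's MySQL connector conventions.

```python
import json

# Hypothetical Debezium MySQL source connector config. The hosts, names,
# and tables are made up for illustration; the keys follow Debezium's
# MySQL connector configuration.
connector_config = {
    "name": "prod-orders-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "prod-db.internal",  # hypothetical host
        "database.port": "3306",
        "database.server.name": "prod",           # used as topic prefix
        "table.include.list": "sales.orders",     # hypothetical table
    },
}

# A hypothetical KSQL statement that reshapes the change stream on the
# fly, so Unicorn's schema can diverge from the production schema.
ksql_transform = """
CREATE STREAM unicorn_orders AS
  SELECT order_id, customer_id, total_cents / 100.0 AS total_dollars
  FROM prod_sales_orders
  EMIT CHANGES;
"""

# In a real deployment this JSON would be POSTed to the Kafka Connect
# REST API; here we just render it.
print(json.dumps(connector_config, indent=2))
```

The point is that once the connector is registered, the pipeline keeps running: every committed change in the production table flows into a Kafka topic, and the KSQL stream reshapes it continuously rather than in nightly batches.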

Worth noting that this pattern has a very close relative called Mainframe Offloading: the same approach, used when the legacy database you are escaping happens to be a mainframe. This variant tends to be very popular because it also brings significant cost savings (due to the way mainframes are priced), and it has been done successfully many times by various finserv companies.

The second paragraph that stood out to me is about their cloud migration effort:

Every description and technical decision in the book is realistic, except this one. I suspect that, like Brent, the authors always wanted to try a cloud migration.

Earlier in the book, Brent explains how the single task "provision a server" actually includes many steps and many back-and-forth discussions. Clearly he thinks that doing this in the cloud is easy! And in a sense he is right: the delays that happen on premises, like waiting for the network team to modify a firewall, would be a snap in the cloud. Unfortunately, he'll discover many fun new things that will take him more than the two weeks he had to perform this task.

I *think* he expected to adapt whatever scripts create the VMs so they produce AMIs instead, and then automate the deployment of machines from those AMIs using the cloud provider's APIs. This in itself isn't trivial, but he forgot:

  1. Which machines would he use? It sounds like they did some benchmarking of their software. The choice of machine depends on the location of the bottlenecks (CPUs, storage, network, memory), on the specifics of the workload, and on the specifics of the cloud provider's specs (What does "up to" mean? What does burst balance mean for us?). There are many cost-performance trade-offs and optimizations to make; we typically allocate a few weeks to this effort alone.
  2. Which database would he use? Maybe the exact database he uses on premises exists in the cloud; in 2013 it probably didn't. If it exists, then he just needs to worry about another data pipeline. If it doesn't, he needs to pick a close cloud relative and deal with the implications of every place where it wasn't close enough, or he'll have to deploy his existing database in the cloud, including sizing and monitoring.
  3. Do you keep the data in the cloud long term? If you do, you need to budget for storage. If you don’t, you need to plan for ingress time.
  4. Testing pipeline: the whole cloud migration started because the project took too long to run in the test environment, and they want test and production to be identical. So they need their entire CI/CD pipeline to provision and test in the cloud.
  5. Monitoring: the chances that his existing on-prem monitoring systems can also monitor his cloud applications and hosts are rather low. Also, since he plans on deploying the app on demand, the automation will need to deploy the monitors with each run and collect the logs and statistics back after the job is done.
  6. Security: the book dismissed it in a one-liner, but the dependency on the firewall is still there. Since it is a marketing project, let's assume PII was already cleaned and obfuscated. He'll definitely need to figure out how to tie the production network to a cloud VPC, and he'll need to figure out access controls. Not a huge deal, but more work.
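To make point (1) concrete, here is a toy Python sketch of the kind of cost-performance comparison that benchmarking feeds into. The instance names, prices, and throughput numbers are entirely made up; the point is only the shape of the trade-off, not any real provider's pricing.

```python
# Hypothetical benchmark results: instance type -> (hourly cost in USD,
# measured throughput in requests/sec). All numbers are invented for
# illustration; real choices need real benchmarks on the real workload.
candidates = {
    "general.large":  (0.10, 1200),
    "compute.xlarge": (0.20, 2600),
    "memory.xlarge":  (0.25, 2400),
}

def cost_per_million(hourly_cost, req_per_sec):
    """Cost per million requests: one simple lens on the trade-off."""
    requests_per_hour = req_per_sec * 3600
    return hourly_cost / requests_per_hour * 1_000_000

# Rank the candidates by cost-efficiency.
ranked = sorted(candidates.items(),
                key=lambda kv: cost_per_million(*kv[1]))

for name, (cost, rps) in ranked:
    print(f"{name}: ${cost_per_million(cost, rps):.3f} per 1M requests")
```

With these made-up numbers the biggest machine is not the cheapest per unit of work, which is exactly why the benchmarking step can't be skipped: the answer changes with the workload, the bottleneck, and the price sheet.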

This is just off the top of my head. I’m probably missing quite a lot since other than (1) and a bit of (5), none of this is my actual job.

Other than being slightly overly optimistic about the cloud, the book is absolutely amazing.

What I’d really love is to find an equivalent book written from the VP of Engineering’s POV. Other than creating a small team of experts and running bi-weekly sprints, how did he adapt to the agile and DevOps world? If someone writes that book, I’m looking forward to the part where the team adopts serverless. Of all the technologies introduced after the book was written, it is the only one I don’t quite see how to fit in.

