Dear deployment diary, serverless is f**king hard

I’m a software developer. I work for a large software company with hundreds of employees located in different time zones. It’s a big mess. And yet, the company is doing great and moving forward at a fast pace.

Like many other tech companies, it was all seeded with a small server that, over the years, grew bigger and bigger.

Software products, by their nature, develop like trees.

You seed them by planting your server and exposing it to the outside world — that little seedling needs to be kept in the right conditions to keep it alive. The CPU% should remain low, the incoming traffic shouldn’t be blocked and, of course, someone should spray all the bugs. 
As the sprints pass, more and more features are added to the little plant as it starts becoming a real tree: many branches, each with different little fruits, and new user interfaces. To keep it alive, much more work is required now. Every fruit is susceptible to a different type of bug, so many more areas of the plant have to be constantly monitored. 
By now, many more people are involved in the life of the tree. At some point, the tree just can’t take it anymore. It’s far too big, monolithic, and old. Most likely, the tree can no longer meet the growing demand for its fruits.

IT NEEDS TO SCALE TO BECOME AN ORCHARD!

But you can’t just cut one large tree into many smaller ones. So you plant a new tree and hope that the new tree will help produce new features while also helping the older tree with its own features. Obviously, just like you did last time, you’ll choose the most genetically advanced seed out there.

Things move fast in the seeding industry.

Ten years back, when you planted your first tree, you used an Apache Tomcat server with an Oracle DB as your seed and wrote most of your business logic to run on that server.

Five years ago, you started thinking about splitting the big server into a few EC2 instances and Docker containers and delegating some of the work to third-party SaaS solutions. And of course, Node.js now looked like a good option.

Three years back, you wondered if you could use Kubernetes to manage the mess of containers you had.

And finally, last year you started thinking about running your new features using a serverless app, the state-of-the-art system that will let you grow faster and become way more reliable and maintainable, all for a “fraction” of a fee to AWS, Azure or GCP.

But here’s the catch:

all the people who took care of your tree when it was just a seedling are new to the new tree. So the folks in the DEV org, who know how to create new features and new bugs in a big J2EE server, now have to write their business logic in the form of API gateways, Lambdas, and streams. And to make things simpler, they store their data in services like DynamoDB and S3.
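Concretely, a feature that used to be a method deep inside the big server now lives as a standalone handler. A minimal sketch, assuming Node.js (which the company was already eyeing) and the API Gateway proxy-integration event shape — the function and data here are illustrative, not anyone's real API:

```javascript
// Hypothetical Lambda handler for a GET /fruits/{id} route behind API Gateway.
// The event follows the API Gateway proxy integration format; names are made up.
const handler = async (event) => {
  const id = event.pathParameters && event.pathParameters.id;
  if (!id) {
    // API Gateway turns this into an HTTP 400 response.
    return { statusCode: 400, body: JSON.stringify({ error: "missing id" }) };
  }
  // In a real app you'd look the item up in DynamoDB here via the AWS SDK;
  // this sketch just returns a canned record.
  return {
    statusCode: 200,
    body: JSON.stringify({ id, name: "apple" }),
  };
};

exports.handler = handler;
```

Note what's missing compared to the old world: no servlet container, no routing table, no shared process — each route is its own deployable unit, which is exactly why the glue around them multiplies.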

That big old server has code added to it by everyone. If you’re lucky, the code gets tested after every merge, but that’s not always the case. Because the server is so big, it takes time to validate it and make sure nothing broke. So every few months, a new version of it is born.

With the new technology, new expectations emerge.

Now, after learning (the hard way) about the technical debt in the old server, and with the frustration of not being able to move faster, DEV wants to establish a CI/CD pipeline to make execution more efficient.

What’s new for your OPS?

As for your OPS folks, they used to deal with EC2 commands and run their deployment script that calls DEV’s installer. They used to monitor your servers and would call you up at night if anything went south. If you’re lucky, OPS had set up a central logging system to help you troubleshoot issues. In great companies, a monitoring system is set up to alert and help identify problems upfront, and OPS takes care of that too. But now, with the new technology, it’s a mess. There are just too many icons on the AWS services menu. It all translates into a big ball of mud: a mix of Helm charts, Chef recipes, Puppet manifests, Bash files, Python scripts, and all sorts of CloudFormation yaml files.

And we’ve reached the boiling point…

The business logic that was coded in Java in the form of REST endpoints is now split into API gateways glued to a bunch of Lambdas. What used to be stored in a self-contained folder of classes is now described in CloudFormation yaml files and a bunch of jars stored at some S3 address.

Wait, what? CloudFormation templates are OPS assets. Aren’t they?

Well, kind of. Some of the content of these files holds critical information that OPS uses to keep production systems alive. Assets like roles and policies help different tenants protect their internal and external boundaries. With S3, OPS can back up data for disaster recovery. Networking, machines, and clusters are managed through these files as well.
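The mixed ownership is visible inside a single template. A hedged sketch of what one of these CloudFormation yaml files might look like — resource names, the bucket, and the table are all illustrative — where the function definition is DEV’s business logic while the role and its policy are exactly the kind of asset OPS guards:

```yaml
# Illustrative CloudFormation fragment -- names and S3 locations are made up.
Resources:
  FruitFunction:                      # DEV asset: the business logic
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs18.x
      Role: !GetAtt FruitFunctionRole.Arn
      Code:
        S3Bucket: my-artifacts-bucket
        S3Key: fruits/handler.zip

  FruitFunctionRole:                  # OPS asset: who may touch what
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal: { Service: lambda.amazonaws.com }
            Action: sts:AssumeRole
      Policies:
        - PolicyName: FruitTableReadOnly
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action: dynamodb:GetItem
                Resource: !Sub arn:aws:dynamodb:${AWS::Region}:${AWS::AccountId}:table/Fruits
```

One file, two owners: change the handler and you’re doing DEV work; change the policy and you’re doing OPS work. That’s the turf war in a nutshell.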

So who should be responsible for gluing all the little pieces in a serverless app together?

Can OPS do it? Of course, they can! However, it means they’ll practically become DEV members. Is that something they want to do? Not sure. Also, it isn’t clear whether DEV teams will accept them as members.

Can DEV do it? Yes. That’s an easy one. It’s just a different language from Java, but it’s a language. However, can DEV deploy it in a production account? Probably yes, but that means they’re setting themselves up to do more OPS-related work, which some DEV folks may not want to do. Also, OPS won’t like the idea of letting DEV run freely in the well-kept production environment.

The holy grail

If we could only give DEV more freedom while still guaranteeing our tree won’t die…

The world is not there yet, but it seems we’re moving in that direction. In general, AWS, Azure, and GCP are working to enable DEV to do more with self-service solutions. To add to that, a few startups are also trying to address that specific problem.
Meanwhile, DEV will have to coordinate every little move they make with OPS.
So this, my friends, is the story of why serverless is so fucking hard.