In my current role, one of my teams has spent the better part of two years digging out of a hole created by another team that poorly designed some fairly critical applications dealing with platform services -- meaning functionality that is upstream of many other microservices. A large part of the problem, I believe, was that the former team fell victim to pre-optimization. I know hindsight comes easily now that we’re living with the consequences and constant issues of the ecosystem of services we inherited. Still, I believe some of the problems we’ve had to solve could have been prevented.
Pre-optimization is the idea that we must solve all problems at once, before we even encounter them. It’s almost as if, as developers, we should have a crystal ball that predicts every problem we will ever encounter in the lifetime of an application.
Examples and Lessons Learned From Them
Some of the things I’ve seen teams pre-optimize include:
- Predicting auto-scaling requirements by prematurely scaling up. One of the most common issues I have seen is that teams provision the biggest CPU/memory EC2 combination they can get away with, based on how legacy applications performed on-premise. The problem is, as we monitored these new services, we noticed that only about 2% of each instance’s CPU was ever used. These very large instances were expensive to operate, but seemingly did little to contribute to the performance of the service. Lesson learned: It’s sometimes better to scale out with smaller instances than to scale up prematurely.
- Conversely, I have also found teams focusing on only one thing to optimize. One of the applications we inherited had scaling policies in place built on assumptions about system load drawn from the legacy system’s past data. Unfortunately, while the team had accounted for scaling on the service side, they had not accounted for performance issues on the database side. Within hours of people logging into the system, the database was overwhelmed, and we spent almost an entire day figuring out a DB scaling policy on the fly (in production) to resolve the issue. Lesson learned: Don’t optimize based solely on how the legacy systems you’re modeling from worked. Always test and verify.
- Another pre-optimization activity I’ve found is figuring out all your data models up front and prematurely normalizing data. When I was still a dev, I recall conversations on my team while the data models were being laid out. The team spent almost two sprints (1.5 sprints longer than they should have, in my mind) arguing over what the data models should look like, in order to accommodate future growth and store data they believed our users would be interested in. As someone who had come from an e-commerce shop with years of building back-end services, I knew we were treading on shaky ground, because my experience suggested that we can never predict all the data requirements ahead of time, but we can make our data contracts extensible so that we’re not constantly making contract-breaking changes. But, what do I know ¯\_(ツ)_/¯? Unfortunately, I was in the minority and the loudest voices on the team won. Fast forward to when we shipped: the data models were so complex and hard to change that whenever we made adjustments, we had to require everyone downstream to modify and redeploy their code to accommodate our changes. Data queries were also inefficient because the tables were overly normalized instead of flatter. Lesson learned: Sometimes a simple design can be more efficient. Normalization can always be done later and is not always the answer.
- I have also seen teams build bells and whistles into apps not only because they were cool and “bleeding edge”, but because they believed customers don’t know what they want, so the team would build something they surely would want. Teams end up building what some have called “someday features”. When one of the teams in the company set out to create a new reporting product, this was the goal the team marched towards. However, when they shipped it, the customers hated it. They felt that a product was delivered to them with features they didn’t ask for. Sadly, the basic functionality didn’t even work right. Today, our company is building the third iteration of that product with a third different team. Hopefully, the third time really is the charm. Lesson learned: Always ask the customer what they want before building something just because it’s “cool”.
- I have also seen teams never move forward with development because they’re stuck in “analysis paralysis”. All too often, because teams are trying to predict problems that have not yet occurred, they try to design such a fault-proof, everything-proof system that by the time they’re done, so much time has passed that they have failed to deliver a working product. I was on a project where this happened. The team spent so much time designing for and predicting potential issues that we delivered the product late, and once it was in the hands of the customer, it was clear we still hadn’t caught all the issues. Lesson learned: Just build based on the requirements you know today. Software should be fluid and built to tolerate modifications when necessary.
- Sometimes I look at some of the services we inherited, and for some of them there are probably at least 20 more endpoints than necessary. For example, is it really necessary to have an endpoint for every possible combination of data? I have seen implementations that use query parameters and a single exposed endpoint instead. The other argument is that maybe no one is ever going to use or need that data. I can’t count how many times my team has researched how often certain endpoints are called, and the answer was exactly zero times. Lesson learned: Keep in mind that there are multiple ways to build service endpoints, but try to comply with best practices, and only expose data when there’s a need for it.
- Finally, I’ve also seen teams build a UI for tooling when it doesn’t make sense. The questions to ask are: who is the UI for, and can they use a web service to get the functionality instead? Lesson learned: While it’s cool to build a UI to practice our full-stack development capabilities, sometimes the team should focus their energy on building other tooling instead (maybe monitoring, which is often an afterthought?). Abide by the YAGNI (You Ain’t Gonna Need It) principle.
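The extensible-contract point above (making data contracts tolerant of growth instead of over-designing the schema up front) can be sketched in Python. This is only an illustration; the `UserRecord` shape, field names, and `parse_user` helper are all hypothetical:

```python
# Hypothetical sketch of an extensible data contract. Instead of forcing
# every future field into a rigid, fully normalized schema, the payload
# keeps a small set of required fields plus an open "extras" map, so new
# data can be added without a contract-breaking change downstream.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class UserRecord:
    # Required, stable fields -- changing these would break consumers.
    user_id: str
    email: str
    # Open-ended fields -- new data lands here without forcing every
    # downstream team to modify and redeploy their code.
    extras: dict[str, Any] = field(default_factory=dict)


def parse_user(payload: dict) -> UserRecord:
    """Tolerant reader: take the fields we know, keep the rest in extras."""
    known = {"user_id", "email"}
    return UserRecord(
        user_id=payload["user_id"],
        email=payload["email"],
        extras={k: v for k, v in payload.items() if k not in known},
    )


# An older consumer still parses a newer payload that grew a field:
record = parse_user({"user_id": "u1", "email": "a@b.co", "last_login": "2024-01-01"})
print(record.extras["last_login"])  # the new field is carried along, not fatal
```

The design choice here is sometimes called the "tolerant reader" pattern: unknown fields are preserved rather than rejected, so the contract can grow without breaking existing consumers.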
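The endpoint-sprawl point above can also be sketched. Rather than one endpoint per combination of data, a single endpoint can filter via query parameters. The handler below simulates that idea with a plain function instead of a real web framework; the `/orders` resource and its fields are made up for illustration:

```python
# Hypothetical sketch: one "/orders" endpoint that filters via query
# parameters, instead of a separate endpoint for every field combination.

ORDERS = [
    {"id": 1, "status": "shipped", "region": "us-east"},
    {"id": 2, "status": "pending", "region": "us-west"},
    {"id": 3, "status": "shipped", "region": "us-west"},
]


def get_orders(params: dict) -> list[dict]:
    """Simulates GET /orders?status=...&region=...

    Any combination of filters is handled by the same endpoint, so adding a
    new filterable field doesn't mean adding a new endpoint.
    """
    results = ORDERS
    for key, value in params.items():
        results = [order for order in results if order.get(key) == value]
    return results


print(get_orders({"status": "shipped", "region": "us-west"}))  # only order 3 matches
```

In a real service you would still validate and whitelist the accepted parameters, but the surface area stays one endpoint instead of twenty.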
On the flip side of pre-optimization, there are other activities that engineering teams should do, but don’t, that contribute to software delivery failure.
- Do a build-vs-buy evaluation when it makes sense. A problem I have found with some teams is that they’re so hell-bent on building cool solutions that they never perform a build-versus-buy analysis on the product they’re delivering. My other team, which does DevOps enablement, had been looking into different feature-flagging solutions so that we could gate new features and encourage more frequent deploys without exposing our customers to the new code, should there be an issue. There are open-source and commercial solutions out there that do this very well. One of the other delivery teams decided they could build one in-house. However, once other teams started using it, it was clear that the team who designed and built it had not accounted for all use cases, and the app was definitely not built to scale to thousands of feature flags. In fact, scaling was such a problem that one day it caused a major outage, and no application that called it could even fall back to normal behavior, because the team hadn’t built in any circuit breaking or fall-through. Lesson learned: Don’t build something for a problem somebody else has already solved, and solved well.
- Don’t believe high test coverage is equivalent to a high-quality product. We have a QA sub-organization within engineering that has a separate reporting structure, and some of the rules they make are arbitrary. For example, they require 95% test coverage for all apps and services before teams can deploy. However, from what I’ve seen, people massage the test settings to achieve this goal, and in my entire career, increased test coverage (since it can be gamed) has never necessarily meant bug-free. I’m not sure where this illusion comes from. Lesson learned: Don’t get so absorbed in metrics that don’t make sense just to check a box. Identify the goal the metric is trying to achieve and focus on that instead.
- Do think about monitoring. Since one of my teams has SREs, monitoring is something we’ve promoted and pushed teams to do through our DevOps community of practice. We cannot emphasize enough how important it is for teams to create monitoring and alerting for their apps and services, so that our customers don’t see an issue before we do. It just makes us look bad when we fail to catch these issues first. Despite multiple communications around this, it still appalls me how many of the hundreds of apps/services we have in production today don’t even have a synthetic test or health check. Lesson learned: Creating monitors is an investment of time, for sure, but the cost of a customer finding a problem versus the team finding it first is high. The adage “an ounce of prevention is worth a pound of cure” applies here.
- Have a security-first mindset. Far too often, I see teams treat security as “technical debt” and put off addressing security violations. In a cloud environment, this is even more dangerous, because security exploits can easily become very public (e.g., accidentally leaking PII in a public S3 bucket) and very costly. Since one of my teams manages all of our AWS infrastructure, we have seen firsthand how costly a security breach is. Often, we end up killing the account and rebuilding everything, especially if it has critical production workloads. Lesson learned: Don’t think security exploits won’t eventually catch up with your team. It’s often a matter of “when”, not “if”.
- Do not create circular dependencies. I’m not even sure how this happens. I’ve seen teams design services that have to be deployed in a very specific order just to work. I’ve also seen chicken-and-egg situations where Service A (say, an authentication service) needs something from Service B (say, a service lookup), but Service B needs something from Service A (an authentication key) in order to work. This violates so many design principles and reeks of code smell. I’m not even sure how it got past architectural approval. However, it’s a pattern I’ve seen so often lately that I’m starting to wonder if I missed something along the way as software design has evolved over the years. Lesson learned: Don’t forget to use tried-and-true design patterns. They’re there for a reason. If something’s off, question the design, and go back to the drawing board if you have to.
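The feature-flag outage above came down to callers having no fall-through when the flag service died. A minimal sketch of the missing safety net, with an illustrative (not real) client and flag name, might look like this:

```python
# Hypothetical sketch of a feature-flag lookup that degrades gracefully when
# the flag service is unavailable -- the fall-through the in-house app lacked.

def fetch_flag_from_service(name: str) -> bool:
    # Stand-in for a network call to the flag service; here it always fails
    # so we can demonstrate the fallback path.
    raise ConnectionError("flag service unreachable")


def is_enabled(name: str, default: bool = False) -> bool:
    """Every lookup carries a safe default, so a flag-service outage
    degrades callers to known behavior instead of taking them down."""
    try:
        return fetch_flag_from_service(name)
    except Exception:
        return default


# Callers keep working during an outage, running the old code path:
if is_enabled("new-checkout-flow", default=False):
    print("new code path")
else:
    print("old code path")
```

Real feature-flag SDKs typically go further (cached last-known values, timeouts, circuit breakers), but the principle is the same: the flag service must never be a single point of failure for its callers.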
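The circular-dependency point is also easy to check mechanically. A sketch along these lines, assuming teams declare their service dependencies somewhere inspectable (the graph format and service names here are invented), could catch the chicken-and-egg case before it ships:

```python
# Hypothetical sketch: detecting a circular dependency between services with
# a depth-first search over a declared dependency graph.

def find_cycle(graph: dict) -> bool:
    """Return True if the dependency graph contains a cycle."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back-edge: we are still inside this node's subtree
        visiting.add(node)
        for dep in graph.get(node, []):
            if visit(dep):
                return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(visit(node) for node in graph)


# The chicken-and-egg case from above: auth needs lookup, lookup needs auth.
print(find_cycle({"auth-service": ["service-lookup"],
                  "service-lookup": ["auth-service"]}))  # True
print(find_cycle({"auth-service": ["config-service"],
                  "config-service": []}))  # False
```

A check like this can run in CI against declared dependencies, so a cycle becomes a failed build instead of an undeployable production topology.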
Crawl, Walk, Run
In some situations, some pre-optimization makes sense (for example, for code maintainability or portability). Avoiding pre-optimization is not an excuse to be lazy. But if we pre-optimize for something that will never happen, it’s just waste. Pre-optimization can also lead to design decisions that paint us into a corner we can never get out of. I believe that, sometimes, pre-optimization is anti-agile. Software delivery is meant to be an iterative process, especially for complex projects. I have seen delivery teams, especially those with very optimistic and very junior devs, bite off more than they can chew very early in a project. I don’t discount their enthusiasm, but this is where the value of experience really comes in, and sometimes some amount of tempering is necessary.
There is this idea of taking a “crawl, walk, run” approach to software development. It means first building a minimum viable product (MVP, the “crawl” phase). An MVP isn’t something developers just cobble together with spit and baling wire to please the product owner. I personally had always thought an MVP was just a working POC (proof of concept). However, what I’ve read is that an MVP is the smallest version of an actually useful product that meets the basic requirements customers wanted. The key to taking the MVP from good-enough to great is iteration: getting feedback, refactoring and fixing, then getting feedback again, all while keeping quality in mind. Quality must not slip during these develop/test iteration cycles. Creating an MVP can help teams fail fast, by identifying issues quickly and pivoting if they need to.
For the “walk” phase, teams can start “speeding up” by introducing performance optimizations (e.g., determining auto-scaling policies), cost optimizations (right-sizing resources), and overall application tuning, which may sometimes include re-architecting (e.g., going from an EC2-hosted app to a containerized app using a serverless solution).
Finally, for the “run” phase, teams can accelerate to as fast a pace as they’re comfortable with. Maybe this means deploying services multiple times a day. This is also typically when teams can start thinking about bigger problems like Disaster Recovery and running “Game Days”, exercises to determine what potential issues can occur if the worst-case scenario happens, a.k.a. Chaos Engineering. Teams can take note of Netflix’s and Amazon’s approaches to these, and see if they can leverage some of the lessons those companies have learned from these exercises.
Pre-optimization is not the “root of all evil”, as some may suggest. There are things you can and should pre-optimize, and things you cannot and should not. Pre-optimizing the wrong things can be very costly to a team and its ability to deliver good-quality, working software that customers rave about.