When is a feature done? When are you ready to deploy a new feature to production?

Daniel Moldovan · Published in DevOps Dudes · Sep 9, 2021 · 6 min read

[Image: SpongeBob SquarePants looking at a rainbow. Text states "code is complete. feature is done"]

You need to implement a new feature. You research and select suitable technologies and approaches. You start coding. The required functionality is implemented. You can run and demo the feature locally on your computer.

Is that it? Is the feature done? Can you deploy it in production? Can you move to the next feature?

Of course, it depends. Is the feature part of a university homework assignment? Then probably yes. Is the feature part of a larger software-as-a-service product that clients pay for? Then probably no.

Why can’t you just write the code on your computer and deploy it to production after it runs locally? What is especially different for software-as-a-service (SaaS) products?

If you know why and are curious about the conclusions, jump to the How to avoid such “success stories” section. If you still need to be convinced, let’s continue with the story below.

Releasing a new feature: a success story

Hint: not so much

You are a programmer in a company offering a subscription-based video streaming service. Clients pay a monthly fee to access your product.

A new feature needs to be implemented: a service that recommends to customers what content to watch.

Your team takes on the work to implement it. You spend time designing the flows and architecture. You research suitable recommendation algorithms. You test them and pick the ones to use. You write the code. It runs on your computer. That is it. After one month of work, the new recommendation service is ready. You declare it’s ready to go into production.

Release day

You deploy the new recommendation service in production. It starts and crashes. You check why. There is a problem communicating with another service used for listing videos. The video listing service has changed since you started writing your recommendation code. Your service is not passing all required video listing parameters. An easy fix. You make some quick changes to send the correct parameters. You deploy again.

The service starts. You open up your browser to check the recommendations. There are none. You debug. The recommendation service returns empty lists. You would like to know why. But the service does not have logging to help you trace the cause. Ok. You quickly add logging. You redeploy everything and watch the logs pour in.
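
The kind of logging that helps here records what the service asked for and what it got back. A minimal sketch in Python, where the recommendation handler and the video listing client are hypothetical stand-ins, not the real code:

```python
import logging

# One log line should tell you which client, which upstream call, and what came back.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("recommendation-service")

def recommend_for(client_id, video_listing_client):
    """Hypothetical recommendation handler with tracing-friendly logging."""
    log.info("building recommendations client_id=%s", client_id)

    videos = video_listing_client.list_videos()
    log.info("video listing returned %d items for client_id=%s",
             len(videos), client_id)

    if not videos:
        # Without this line, an empty recommendation list is silent and you are
        # left guessing whether the upstream call or the scoring failed.
        log.warning("empty video list from upstream client_id=%s", client_id)
        return []

    recommendations = [v for v in videos if v.get("score", 0) > 0.5]
    log.info("returning %d recommendations for client_id=%s",
             len(recommendations), client_id)
    return recommendations
```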

The problem is again related to the video listing service. The data it returns has a different format than you expect. Your code cannot process it to get the list of available videos. This time the fix is a bit more complicated. But you manage to implement it in a couple of hours. You deploy the new code to production. You check what the recommendation service returns for you. You get a list of videos. You declare victory.
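
Both of these surprises are contract problems that an integration test, run regularly against the video listing service (or a staging instance of it), would have caught before release day. A minimal sketch, with the staging URL, parameters, and field names all being hypothetical:

```python
import requests

# Hypothetical staging endpoint and required parameters; replace with the real contract.
VIDEO_LISTING_URL = "https://staging.example.com/videos"
REQUIRED_PARAMS = {"region": "eu", "page_size": 50}

def test_video_listing_contract():
    """Fails fast if the listing service rejects our parameters
    or changes the shape of its response."""
    response = requests.get(VIDEO_LISTING_URL, params=REQUIRED_PARAMS, timeout=5)

    # Catches the "missing required parameter" class of failure.
    assert response.status_code == 200, response.text

    payload = response.json()

    # Catches the "response format changed" class of failure:
    # assert only the fields the recommendation code actually reads.
    assert isinstance(payload.get("videos"), list)
    for video in payload["videos"]:
        assert "id" in video
        assert "title" in video
        assert "score" in video
```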

Release day: + 1

You log in and start to work on the design of your next feature. Slowly, client complaints start to come in. One or two every couple of hours. Client care is marking each complaint to be investigated by the engineering team. One such investigation request gets assigned to you. In the issue, the client reports that the video streaming service is sometimes slow to load. You check monitoring and see no obvious problems. You dismiss the issue as transient, probably due to intermittent network connectivity. You complete the day’s work and log off.

Release day: + 2

Early in the morning, you get a call. Client complaints are coming in from everywhere. All report very slow loading for the streaming service. All teams are to assemble and trace the root cause of the problem.

You quickly open your computer and connect to the chat room created for this event. It appears that the problem is connected to the new recommendation feature. The streaming service is slow to load as it is waiting for recommendations. But the new feature does not have detailed monitoring. You know the recommendation service is slow, but not why. Ok, you instrument the feature with metrics. You deploy to production and investigate further.
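
What "instrumenting with metrics" can look like in practice: timing the interesting pieces of code and exposing the measurements for the monitoring system to scrape. A minimal sketch using the Python Prometheus client, where the handler and its helper are hypothetical:

```python
from prometheus_client import Counter, Histogram, start_http_server

# How long the recommendation request takes end to end, and how long the
# suspicious part (scoring / matching) takes on its own.
REQUEST_LATENCY = Histogram(
    "recommendation_request_seconds",
    "Time spent building a recommendation response",
)
SCORING_LATENCY = Histogram(
    "recommendation_scoring_seconds",
    "Time spent scoring and matching videos to client preferences",
)
REQUEST_ERRORS = Counter(
    "recommendation_request_errors_total",
    "Recommendation requests that failed",
)

@REQUEST_LATENCY.time()
def handle_request(client_id, videos, preferences):
    """Hypothetical request handler, instrumented with metrics."""
    try:
        with SCORING_LATENCY.time():
            return score_and_match(videos, preferences)
    except Exception:
        REQUEST_ERRORS.inc()
        raise

def score_and_match(videos, preferences):
    # Stand-in for the real matching logic.
    return [v for v in videos if v.get("genre") in preferences]

if __name__ == "__main__":
    # Expose metrics on :8000/metrics for Prometheus (or similar) to scrape.
    start_http_server(8000)
```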

The added metrics provide much-needed visibility. The slowest piece of code is the one analyzing video scores and matching videos to client preferences. To improve the recommendations over time, you store in a database which recommended videos the clients actually play. After running for two days, the recommendation database holds a lot of data, making all database queries slow. The recommendation service was never performance tested before being deployed to production. And this time, you do not have a quick solution. You need to change the data you store in the database. Or choose a more scalable database solution.
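
This is exactly the class of problem a performance test against realistic data volumes is meant to surface before release. A minimal sketch, using an in-memory SQLite table as a stand-in for the real playback-history store, with the row count and latency budget being purely illustrative:

```python
import sqlite3
import time

ROWS = 1_000_000        # rough stand-in for two days of playback events
LATENCY_BUDGET_S = 0.3  # hypothetical per-query budget derived from the SLO

def seed(conn, rows):
    conn.execute(
        "CREATE TABLE playback (client_id INTEGER, video_id INTEGER, played_at REAL)"
    )
    conn.executemany(
        "INSERT INTO playback VALUES (?, ?, ?)",
        ((i % 10_000, i % 5_000, float(i)) for i in range(rows)),
    )

def test_recommendation_query_stays_within_budget():
    conn = sqlite3.connect(":memory:")
    seed(conn, ROWS)

    start = time.perf_counter()
    conn.execute(
        "SELECT video_id, COUNT(*) FROM playback "
        "WHERE client_id = ? GROUP BY video_id ORDER BY COUNT(*) DESC LIMIT 10",
        (42,),
    ).fetchall()
    elapsed = time.perf_counter() - start

    # Without an index (or a more scalable store), this blows the budget
    # once the table holds production-sized data.
    assert elapsed < LATENCY_BUDGET_S, f"query took {elapsed:.3f}s"
```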

A decision is made. The recommendation feature is disabled. The code is rolled back to a version before this feature existed. Clients can use the streaming service again. The problem is resolved.

Your team has worked for one month to write the code for the beautiful new feature. But the feature had to be disabled. After all that work, all you have accomplished is to cause some clients to cancel their subscriptions. And many others to start thinking about migrating to your competition.

How to avoid such “success stories”?

Especially for companies offering software as a service, client experience is paramount. If the client is not happy, the client does not pay. If the client does not pay, the company will fail.

So, what can we do? How can we avoid stories like the one above?

We can change the objectives. Instead of making feature delivery the objective, we make client experience the objective. Changing the objective means we include client experience in the whole feature delivery process. And work to ensure that for any new feature, we focus on achieving the best client experience.

How? We extend the definition of done. We no longer consider a feature done when the code is functionally complete.

Instead, we consider a feature done:

  • When the code has tests to catch issues. Both with itself, and in how it integrates with other libraries, components, or services.
  • When logging has been added to enable tracing the root cause of problems that occur at runtime.
  • When the code is instrumented with metrics for understanding its behavior at runtime.
  • When end-to-end tests are implemented to exercise production flows. To quickly discover bad deployments and other runtime problems.
  • When Service Level Objectives (SLOs) are defined for the new feature. Defining the acceptable behavioral limits for fulfilling the product SLA.
  • When the code is instrumented with metrics collecting the Service Level Indicators (SLIs). Monitoring if the service is within its SLOs at runtime (see the sketch after this list).
  • When the code has been performance tested against estimated production traffic levels over time. To ensure it will not buckle or break under pressure.
  • When alerting thresholds are defined over your SLIs. To get notified about production problems before clients even notice.
  • When there is documentation on how to debug, scale, and operate the new feature in production.
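
As promised above, a minimal sketch of what checking an SLI against an SLO can look like. The SLO values and the latency window below are purely hypothetical:

```python
# Hypothetical SLO: 99% of recommendation requests complete within 300 ms,
# measured over a rolling window.
SLO_TARGET_RATIO = 0.99
SLO_LATENCY_BUDGET_MS = 300

def latency_sli(latencies_ms):
    """SLI: fraction of requests that stayed within the latency budget."""
    if not latencies_ms:
        return 1.0
    within_budget = sum(1 for ms in latencies_ms if ms <= SLO_LATENCY_BUDGET_MS)
    return within_budget / len(latencies_ms)

def is_within_slo(latencies_ms):
    return latency_sli(latencies_ms) >= SLO_TARGET_RATIO

# Example: this window violates the SLO, so an alerting rule wired to this SLI
# should page the team before clients start calling support.
window = [120, 180, 950, 1400, 210, 90, 830, 400, 150, 175]
print(latency_sli(window), is_within_slo(window))
```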

Of course, with the extended definition of done, it becomes apparent that a single person will probably not be able to cover all of the above. So the focus moves from the individuals writing the code to the team owning the feature. And what does extending the definition of done bring us? Well, what negatively impacts client experience the most? Usually problems in production. Bringing client experience into focus throughout the feature delivery process will make us try to:

  • minimize the occurrence rate of problems
  • minimize problem detection time
  • minimize the impact of problems when they occur
  • minimize recovery time from problems.

Conclusion: what does this all mean?

Well, it means a lot more work has to be done before declaring victory and releasing a new feature in production. Work as important as writing the feature code itself. Testing, both functional and performance. Continuous integration and delivery mechanisms. Monitoring, logging, and alerting. Documentation and operating procedures.

After all, the only useful piece of software is the one behaving within SLAs and doing something useful for the clients.
