Don’t silence your production errors, solve them with elegance

Lucas Vinícius da Rosa
Published in Ship It!
9 min read · Sep 12, 2019

Hey, you! Don’t hide behind your monitor. I know you’ve been missing those annoying production environment errors in the Slack channel. I also know that life has been difficult in this information era of ours.

Too many integration points, too many APIs, and unformatted data flying around our noses. Eventually (or constantly) something changes here and there, and the data flow goes crazy again with new errors.

But don’t give up. Software thinkers of the universe, we are in the same galaxy boat!

Don’t worry about a thing ’cause every little thing is gonna be alright

In a software engineering routine, it is common to face software changes, be it in the application codebase itself or in the third-party APIs, assets, and libraries that make up a software product.

Here at Resultados Digitais (as would be the case of many scaling engineering teams), we integrate deeply with third-party APIs.

Not so long ago, we got caught off guard when the following error began to show up, and then to pile up steadily, in our Slack production errors channel.

RestClient::BadRequest: 400 Bad Request

An important API responsible for retrieving/importing new advertisement leads changed its permission scopes. As a side effect, many users who used to have their lead data automatically imported into our platform (RDSM, RD Station Marketing) were not able to do it anymore.

The more leads were submitted to us (via Webhook) from the Ads of users without permission, the more the pile of errors grew. In practical terms, whenever a new lead converted in some Ad, we got a 400 Bad Request from the API.

Style is everything, be elegant

I will be brief in this section. If we aim to successfully (elegantly) react to and solve application errors, some disciplines are essential:

  • Map new errors from the beginning;
  • Do not neglect long-standing errors just because they are known and nobody has acted on them;
  • Associate the error with an Issue in your repository;
  • Assign a reasonable/achievable SLA for the solving task;
  • Think first about the user's feelings and only then about the difficulty of the solution ahead.

It’s important to highlight the fact that we cannot control external changes. So we must adapt to them quickly and, why not, with elegance.

There aren’t always roses in the gardens of software engineering

Before we could dive into the solution design, it was necessary to understand the application paths that generated the error, and their nuances.

The 400 Bad Request was exploding (via Rollbar) in the Slack channel because our implementation had no handling for this error. And it gets worse. The third-party API response upon error was too generic, encapsulating three possible causes in the same error status code.

Solution design

The solution consists of two macro steps:

  1. Avoid exploding the BadRequest error (at Rollbar) upon lack of permission when retrieving a lead;
  2. Call the appropriate mechanism to register the error and instruct the user on how to solve the problem.

Given this, after some mind-blowing reflections in the shower, two approaches came up: the reactive and the preventive one.

The REACTIVE approach

The closest scenario to reality was based on a REACTIVE approach. We would react to any given 400 BadRequest, mapping the error state to a specific data structure attached to our user’s ads account.

We chose to name this particular data structure lead_ads_error and to store it as a jsonb field.

lead_ads_error JSON-based data structure
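As a rough illustration, here is a minimal sketch of what such a structure might hold. Only the two keys are taken from this article (they show up in the steps further down); the constant name and the example values are made up, and the real structure may carry more fields.

```ruby
require 'json'

# Hypothetical example of the lead_ads_error document.
# The two keys are the ones named in this article; the values are invented.
LEAD_ADS_ERROR_EXAMPLE = {
  leadgen_retrieval_permission: false,    # false while the permission is missing
  last_failed_leadgen_id: '1234567890'    # id of the last lead we failed to fetch
}.freeze

# Serialized the way it could be written to a jsonb column.
puts JSON.generate(LEAD_ADS_ERROR_EXAMPLE)
```

Keep in mind that once the hash goes through JSON the keys come back as strings, so code reading the column has to agree on one key style.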

The jsonb field datatype has been chosen based on the following criteria:

  1. Flexibility, since the field behaves like a document in a NoSQL database such as Mongo. This lets us handle new, dynamic errors from the third-party API when retrieving lead information.
  2. Efficiency, since PostgreSQL processes jsonb data faster than data stored in a text or json column.

Too much information? Don’t worry. Let’s depict the steps of this approach when the infamous BadRequest error occurs:

  1. Rescue the BadRequest exception and call the registration error service
  2. Set lead_ads_error[:leadgen_retrieval_permission] from the ads account to FALSE and record the last failed leadgen_id
  3. When loading the RDSM Dashboard, if permission is not in place, show the user the message with solution information
  4. Once the user applies the recommended solution, they are instructed to re-check the permission (by clicking on a button in the RDSM Dashboard solution message)
  5. Try to re-do the lead retrieval using the attribute (leadgen_id) obtained in step 2

Better safe than sorry (the PREVENTIVE approach)

Thinking further, there is another scenario in mind. It would be more desirable if we could, for instance, check the user account permissions for retrieving ads leads before the user activates the feature in our platform. The nature of this approach is inherently PREVENTIVE, as we anticipate the permission checking before the effective use of the feature.

Following this hypothesis, the steps below would address the solution:

  1. When the user is connecting the ads account to RDSM, RDSM asks the permission checking API endpoint whether lead retrieval access is granted
  2. Notify the user about the lack of permission for lead retrieval, if that is the case
  3. The user follows the recommendation and restores the permission on the third-party side
  4. The user goes back to the ads account configuration and performs step 1 again

Although the PREVENTIVE method sounds leaner, and it is, it has a dependency that our real scenario does not meet: the permission checking API endpoint. Unfortunately, the third-party API does not provide a way of retrieving the permission role of this specific lead retrieval endpoint.

The PREVENTIVE approach also does not handle the case of a permission changing after the user has configured the ads account in our platform. In a sweet and dreamy world where 0s and 1s would be more esteemed, we would have the PREVENTIVE and REACTIVE approaches combined.

As it is not the case, we do our best with the REACTIVE path and a smile on our faces.

Shut up and show me the code!

The reactive implementation goes through the following steps:

  1. Rescue the BadRequest exception and call the registration error service
RestClient’s find() implementation for lead retrieval
Specialized error rescue for RestClient::BadRequest exception
Instantiation and calling of the permission error register service
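The original code screenshots are not reproduced here, so what follows is only a minimal sketch of this step, not our production code. LeadFinderSketch, fetcher and on_permission_error are hypothetical names; fetcher stands in for the RestClient call and on_permission_error for the register service. A tiny stand-in for RestClient::BadRequest is defined so the snippet runs without the rest-client gem.

```ruby
# Stand-in so the sketch runs without the rest-client gem installed.
# The real gem raises RestClient::BadRequest on HTTP 400 responses.
unless defined?(RestClient::BadRequest)
  module RestClient
    class BadRequest < StandardError; end
  end
end

# Hypothetical finder. `fetcher` stands in for the HTTP call to the
# third-party API and `on_permission_error` for the register service.
class LeadFinderSketch
  def initialize(fetcher:, on_permission_error:)
    @fetcher = fetcher
    @on_permission_error = on_permission_error
  end

  # Returns the lead, or nil when the API answers 400 Bad Request.
  def find(leadgen_id)
    @fetcher.call(leadgen_id)
  rescue RestClient::BadRequest
    # Record the failure instead of letting it reach the error tracker.
    @on_permission_error.call(leadgen_id)
    nil
  end
end
```

Note that this sketch treats every 400 as a permission problem, which matches the reactive idea of reacting to any given BadRequest but cannot tell apart the causes the API hides behind that single status code.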

2. Set lead_ads_error[:leadgen_retrieval_permission] from the ads account to FALSE and record the last failed leadgen_id

Ads::LeadAdsPermissionErrorNotificationRegisterService

Permission error register service implementation
  • It receives the ads_account in the constructor
  • It registers the leadgen_id into lead_ads_error[:last_failed_leadgen_id]
  • It sets lead_ads_error[:leadgen_retrieval_permission] to FALSE
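The three bullets above can be sketched roughly like this. It is an illustration under assumptions, not the service itself: LeadAdsPermissionErrorRegisterSketch and AdsAccountStub are hypothetical names, and a plain Struct replaces the real ads account model (so nothing is persisted).

```ruby
# Hypothetical stand-in for the ads account model; the real one would be
# a database-backed record with a jsonb lead_ads_error column.
AdsAccountStub = Struct.new(:lead_ads_error)

# Sketch of a register service following the three bullets above.
class LeadAdsPermissionErrorRegisterSketch
  def initialize(ads_account)
    @ads_account = ads_account
  end

  # Stores the failed leadgen_id and flags the permission as missing,
  # keeping any other keys already present in the error state.
  def call(leadgen_id)
    current = @ads_account.lead_ads_error || {}
    @ads_account.lead_ads_error = current.merge(
      last_failed_leadgen_id: leadgen_id,
      leadgen_retrieval_permission: false
    )
    @ads_account
  end
end
```

Merging into the existing hash, instead of overwriting it, is a choice made for this sketch so that other keys in the document are not lost.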

3. When loading the RDSM Dashboard, if permission is not in place, show the user the message with solution information

Conditional view template rendering based on permission status
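The template itself is not reproduced here, but the condition it relies on could look like the helper below. This is a hedged sketch: lead_ads_permission_missing? is a hypothetical name, and it assumes the error state is a hash with symbol keys, as written elsewhere in this article.

```ruby
# Hypothetical view helper: true only when a permission failure was recorded.
# An absent error state, or an absent key, means there is nothing to show.
def lead_ads_permission_missing?(lead_ads_error)
  return false if lead_ads_error.nil?

  lead_ads_error[:leadgen_retrieval_permission] == false
end
```

A view would then wrap the solution message in a plain if around this predicate, so accounts that never failed see nothing new.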

4. Once the user applies the recommended solution, they are instructed to re-check the permission (by clicking on a button in the RDSM Dashboard solution message)

LeadAds error and recovery dialog

5. Try to re-do the lead retrieval using the attribute (leadgen_id) obtained in step 2

The back end receives the permission checking request via the ads controller’s #check_lead_ads_permission action/method.

AdsController#check_lead_ads_permission method implementation

LeadAdsPermissionErrorChecker

Permission error checker and state cleaner implementation
  • It receives the ads_account in the constructor
  • It checks the lead retrieval authorization (by retrying the retrieval with the last failed leadgen_id)
  • It cleans the lead ads error state if the retrieval permission is in place
  • It sends a successful event (SocialMediaMetric::LEAD_ADS_PERMISSION_RECOVERY) to the metrics monitoring system
  • It returns the checked status to the caller
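Here is a minimal sketch of a checker along the lines of the bullets above. It assumes a few things the article does not show: the ads account is a plain hash, finder returns the lead or nil while the third party still refuses the request, and metrics is any callable that receives an event name. The class name and RECOVERY_EVENT are placeholders, not the real SocialMediaMetric constant.

```ruby
# Hypothetical checker. `finder` returns the lead, or nil while the
# third party still refuses the request; `metrics` receives event names.
class LeadAdsPermissionCheckSketch
  RECOVERY_EVENT = :lead_ads_permission_recovery # placeholder event name

  def initialize(ads_account, finder:, metrics:)
    @ads_account = ads_account
    @finder = finder
    @metrics = metrics
  end

  # Returns true when the retry succeeds and the error state was cleared.
  def call
    leadgen_id = (@ads_account[:lead_ads_error] || {})[:last_failed_leadgen_id]
    # Nothing to retry with, so the permission cannot be confirmed.
    return false if leadgen_id.nil?
    # The retrieval still fails: keep the error state untouched.
    return false if @finder.call(leadgen_id).nil?

    @ads_account[:lead_ads_error] = {}
    @metrics.call(RECOVERY_EVENT)
    true
  end
end
```

The checker takes no input from the request: everything it needs comes from the ads account it was built with.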

Important solution engineering aspects

The implementation of the designed solution should also be elegant, if you'll permit me to repeat myself. To achieve this, let's take a look at five engineering aspects that drive us towards an exemplary concrete solution.

UX effectiveness

When communicating with the user, it is essential to have empathy. Put yourself in their shoes. Reading the message in the RDSM Dashboard, can you understand what the Lead Ads permission error is about and how to perform the steps to solve it?

Short copy and care not to expose too much technical information help the user get through the problem smoothly. If you have a designer on your team, leverage their knowledge of user perceptions and emotions.

At the end of the day, if your solution proves effective in the number of permission cases solved, know that the UX played a crucial role on the path to victory.

Scalability

Scaling is at the heart of a fast-growing organization's product, and handling errors is no different. Addressing one user's problem should mean addressing the problem for all users who meet the same scenario criteria.

We've already seen the particularities of the reactive and preventive approaches when it comes to the Lead Ads permission issue. Whichever approach we choose, the heuristic applied must guarantee that one user or tons of users can follow the solution path.

Security

The security of a web application is spread across all of its aspects. It is not a stage or a specific functionality. In each step of the solution design and implementation, keep track of the data input/output flows: where the data comes from and where it is going. Sanitize your database (or whatever interpreter is in place) queries and service calls.

Take our reactive solution implementation, for example. When the user clicks on the permission checking button, no ads account-specific data is passed to the back end. The necessary data used as input in the verification flow is obtained in the back end itself. So an arbitrary user could not force a check for another ads account.

It is also important to keep a close eye on new API endpoints, their resistance against multiple requests, the correctness of HTTP verbs and headers and, last but not least, the solution's side effects on the rest of the application.

Monitoring

If you have made it this far, you should have noted that at some points in the code we do some logging and metrics sending.

The SocialMediaMetric::LEAD_ADS_PERMISSION_RECOVERY event is sent to our metrics system every time a permission is successfully recovered and the lead_ads_error state of the ads account is cleared.

Dashboard for Lead Ads permission recovery success counting

We also measure the number of occurrences of the BadRequest error (although this is not so explicit in the code excerpts shown in this article), so the ratio of these two values gives us the solution's success rate:

Solution success rate equation
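The original equation image is not reproduced here. Read literally, the sentence above suggests something like the ratio below, where both symbols are names chosen for this sketch: the numerator counts LEAD_ADS_PERMISSION_RECOVERY events and the denominator counts BadRequest occurrences over the same period.

```latex
\text{success rate} = \frac{N_{\text{permission recoveries}}}{N_{\text{BadRequest errors}}}
```

Since one ads account can produce many BadRequest errors before a single recovery, this is one plausible reading and not necessarily the exact formula behind the dashboard.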

Enhancing (discovering new scenarios)

Let's admit it. It is impossible to predict everything (yet some engineers have prophetic dreams once in a while). As we rolled out the solution for a small number of users, a new scenario popped up.

The permission recovery heuristic relies on the last leadgen_id that failed to be imported into the platform. The problem with this is that if the lead is deleted on the third-party side, the permission revalidation will presumably fail (as that leadgen_id corresponds to a lead that doesn't exist anymore).

This scenario was not revealed until the users started to experiment with the solution. To fix it, one partial alternative is to register more than one failed lead (this is where the flexibility gained by making the lead_ads_error data structure a jsonb field pays off) and then iterate through them in the permission checking service. When the first permission check succeeds, the error state is cleared.
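A minimal sketch of that idea, under stated assumptions: the ads account is a plain hash, failed_leadgen_ids is a hypothetical key holding the recorded ids, and finder returns the lead or nil when the retrieval is still refused (or the lead no longer exists). None of these names come from our codebase.

```ruby
# Hypothetical re-check over several recorded leads.
# Returns true and clears the error state as soon as one retrieval succeeds.
def recheck_lead_ads_permission(ads_account, finder)
  ids = (ads_account[:lead_ads_error] || {}).fetch(:failed_leadgen_ids, [])
  # any? stops at the first id whose retrieval works.
  recovered = ids.any? { |id| !finder.call(id).nil? }
  ads_account[:lead_ads_error] = {} if recovered
  recovered
end
```

With this shape, a deleted lead only costs one wasted call, as long as at least one other recorded lead still exists on the third-party side.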

The downside of the above approach lies in cases where only one lead failed to be imported (1 Ad with just 1 lead conversion, for example). This takes us back to the original scenario. Maybe using another parameter/attribute besides the leadgen_id could enhance the permission checking heuristic. But this road goes on and on, my friend…

Leaving by the front door

If you made it this far, give yourself a hug and appreciate the beauty of not silencing your production errors. You get the opportunity to turn them into consistent product evolution.

Hope you enjoyed reading this article as much as I enjoyed writing it. Till next time, folks.

Lucas Vinícius da Rosa
Ship It!

Security Engineer, Ethical Hacker (CEH Master) and Independent (Portuguese) Literature Author