A close-up image of interlocking gears

Handling the Complexity of External Systems

How do you alleviate user pain points when the source of the pain is out of your control?

The pain of social network profile disconnections

Hootsuite offers users the convenience of being able to manage multiple social network profiles within a single platform, allowing them to do things like post to multiple networks at once and view analytics data for all their profiles in a single dashboard. The functionality we offer to our users relies on us having authorization to interact with social network APIs on behalf of their profiles.

Sometimes our authorization with a social network platform becomes invalid and we are no longer able to make the API requests we need to. To resolve this, we need to prompt the user to re-grant us authorization by triggering a “social profile disconnection”. This trigger flags the profile as invalid in our system and sends the user a notification to reconnect it by logging in to the social network platform so we can retrieve a new access token.

These disconnections are a major pain point for our users as they disrupt their ability to use their social network profiles in the product. A disconnected social network profile cannot be used in Hootsuite until the user reconnects it by logging in to the social network platform it belongs to. Any scheduled messages will fail to publish, analytics data will fail to refresh, direct messages won’t be received, and any other functionality that relies on making calls to the social network platform’s API will fail. Essentially, the profile is useless in the product until it’s reconnected.

A list of 2 social network profiles connected to Hootsuite. Both profiles are marked with a small red exclamation point to indicate they are disconnected and are displaying a blank image for their profile pictures.
Disconnected social network profiles are marked with a red exclamation mark. Note that we are unable to load the image for the profiles due to the lack of authorization with the social network API.

Sometimes a disconnection is necessary

When I joined Hootsuite as a software developer in May 2019, one of our main objectives was to reduce the number and frequency of these disconnections. The approach at the time was to review the error handling implemented in our services to find where errors were being handled incorrectly and triggering unnecessary disconnections.

During our investigations we discovered where we could make code changes to prevent incorrect disconnections; however, it became apparent that in most cases the disconnections were being triggered for valid reasons. If we changed the error handling to no longer trigger disconnections in such cases, the social network profile would appear to be connected to Hootsuite but any attempts to use it would result in errors due to the lack of proper authorization.

These valid disconnections happen for a variety of reasons, usually related to something outside of our direct control. For example, if the user changes their social network account’s password the access token we have stored will become invalidated, meaning their social network profile will be disconnected. Changes in a user’s permissions granted to social network profiles can also cause disconnections, such as if they lose their admin role on a shared profile. Even social network platform outages can result in unexpected API responses that will result in disconnections.

Discovering the “why”

How do you alleviate user pain points when the source of the pain is out of your control? We had to change our approach. The social networks have their own distinct systems that we only interact with via public APIs, and we need to minimize the amount of complexity we pass on from them to our own users. What we needed was a way to inform users why their social network profiles were disconnecting and tell them how they could stop it from happening again.

We decided to pivot towards a solution that focused on determining the root cause of a disconnection, the hypothesis being that we could educate users on the reason behind the disconnection to help them prevent it from happening again. When a social network profile is disconnected, we would take a look at the error response and translate it into a user-friendly, actionable error message. To do this effectively we had to account for a few challenges that came from external systems owned by the social networks and internal systems at Hootsuite:

  • Externally, each social network has its own unique set of error codes, messages, and formats for responses
  • Internally, not every service sends consistently formatted data to trigger a social network profile disconnection
  • The social networks often introduce new error codes and messages, so we needed to have a solution to quickly add new logic when new reasons for disconnections arose

Building the solution

A system diagram showing the flow of social profile disconnection events into the error parsing service, which stores them in the database. The frontend of Hootsuite requests disconnection reasons from the error parsing service which returns them to the frontend.
The new error parsing service stores the disconnections triggered by other services so that it can provide root cause information via its API.

Our solution was to create an error parsing service that would be responsible for storing the error responses that cause social network profile disconnections and interpreting their root causes. The service would expose an internal API to other services within Hootsuite so that they can request that information and use it to display error messages to the user. The final system works like this:

  1. Whenever a profile disconnects, store the error response from the social network
  2. An internal Hootsuite service sends a request to the new error parsing service to fetch the latest reason a social network profile disconnected
  3. The error parsing service queries the most recent disconnection information for the social network profile from its database
  4. The error parsing service runs the stored information through a series of parsers that we’ve defined based on the meaning of particular error codes and messages (e.g. we know that code 4700 from some Example Social Network means “the user changed their password”, so if that code appears in the error response, the service returns an identifier of “example-network.user-changed-password”)
  5. Return the root cause identifier to the caller of the API so that it can display the error message it determines to be relevant
A window titled “Reconnect your social accounts” that contains a list of disconnected social network profiles with buttons to reconnect them. The first profile is displaying a popup containing details on why it disconnected, explaining that it is due to the user revoking access for the social network API.
An example of displaying an error message to a user based on the root cause identifier for a disconnection.
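
The five steps above can be sketched in plain Scala. This is only an illustration under stated assumptions: the in-memory Store stands in for the service's database, and the names StoredDisconnection, Store, and latestReason are hypothetical rather than taken from the real service. Code 4700 and its identifier are the article's own illustrative example.

```scala
object DisconnectionLookupSketch {
  // Hypothetical shape of a stored disconnection event.
  final case class StoredDisconnection(
      profileId: String,
      network: String,
      errorCode: Int,
      occurredAtEpochMs: Long
  )

  // A parser returns Some(identifier) when it recognises the error, None otherwise.
  type Parser = StoredDisconnection => Option[String]

  // Step 4: code 4700 from the example network means the password changed.
  val passwordChanged: Parser = d =>
    if (d.network == "example-network" && d.errorCode == 4700)
      Some("example-network.user-changed-password")
    else None

  val parsers: List[Parser] = List(passwordChanged)

  // In-memory stand-in for the service's database (assumption).
  final class Store {
    private var events = Vector.empty[StoredDisconnection]

    // Step 1: store the error information whenever a profile disconnects.
    def save(d: StoredDisconnection): Unit = events = events :+ d

    // Step 3: most recent disconnection for a profile, if there is one.
    def latestFor(profileId: String): Option[StoredDisconnection] =
      events.filter(_.profileId == profileId).sortBy(_.occurredAtEpochMs).lastOption
  }

  // Steps 2 to 5: look up the latest event, try each parser lazily,
  // and return the first root cause identifier that matches.
  def latestReason(store: Store, profileId: String): Option[String] =
    store.latestFor(profileId).flatMap { d =>
      parsers.iterator.map(p => p(d)).collectFirst { case Some(id) => id }
    }
}
```

The caller receives only the identifier, so the wording of the message shown to the user stays the caller's decision.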

To deal with the differences between each social network’s error responses but still minimize duplicated code we created a generic representation of an “error parser” in the error parsing service. An “error parser” is a function that transforms an error response from a social network API into a root cause identifier for that same network. Once we have a parser type defined for a particular network we can write as many parsers as needed to cover the error codes and messages encountered in disconnections from that network.

Handling inconsistent data

A flow chart detailing the parsing workflow for the error parsing service. Error reasons are first checked to see if they are JSON formatted or plain strings before being run through the correct set of parsers until one is matched or no parsers are left. If a parser is matched a unique identifier for the root cause is returned, otherwise an error is returned to indicate the error reason could not be parsed.

Dealing with inconsistencies in the data sent by other Hootsuite services was a more challenging issue to solve. The flowchart above shows an overview of the parsing flow for social network error responses. The initial step is determining whether the error information is JSON formatted or a plain string.

Social network profile disconnections can be triggered by many services at Hootsuite. We discovered not all of the services were sending the same data when they did so. Not only were some services simply sending plain strings for the error information, but those that were sending JSON didn’t necessarily all include the same fields. We had to implement two sets of parsers: one for JSON and one for plain strings, and then define a set of required fields that we needed in order to parse error responses successfully.
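
A minimal sketch of the two parser sets, assuming the error information has already been decoded into either a set of JSON fields or a plain string. The field names "network" and "code", the "unparsed" result, and the matching rules are hypothetical placeholders; only code 4700 comes from the example used earlier in this article.

```scala
object ErrorInfoSketch {
  // Error information sent by other services, after decoding.
  sealed trait ErrorInfo
  final case class JsonErrorInfo(fields: Map[String, String]) extends ErrorInfo
  final case class PlainErrorInfo(message: String) extends ErrorInfo

  // Hypothetical required fields for JSON-formatted error information.
  val requiredFields: Set[String] = Set("network", "code")

  def missingFields(json: JsonErrorInfo): Set[String] =
    requiredFields -- json.fields.keySet

  type JsonParser = JsonErrorInfo => Option[String]
  type PlainParser = PlainErrorInfo => Option[String]

  val jsonParsers: List[JsonParser] = List(
    j =>
      if (j.fields.get("code").contains("4700"))
        Some("example-network.user-changed-password")
      else None
  )

  val plainParsers: List[PlainParser] = List(
    p =>
      if (p.message.toLowerCase.contains("password"))
        Some("example-network.user-changed-password")
      else None
  )

  // Pick the parser set that matches the format, then return the first match.
  // Left describes why no root cause identifier could be produced.
  def parse(info: ErrorInfo): Either[String, String] = info match {
    case j: JsonErrorInfo if missingFields(j).nonEmpty =>
      Left("missing required fields: " + missingFields(j).toList.sorted.mkString(", "))
    case j: JsonErrorInfo =>
      jsonParsers.iterator.map(f => f(j)).collectFirst { case Some(id) => id }.toRight("unparsed")
    case p: PlainErrorInfo =>
      plainParsers.iterator.map(f => f(p)).collectFirst { case Some(id) => id }.toRight("unparsed")
  }
}
```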

We collaborated with other teams to update their services to send consistently formatted events containing all the required data so that our parsers didn’t have to account for too many differences. However, due to resourcing and bandwidth issues, we needed an interim solution that would give us acceptable, though not complete, coverage.

Thankfully, every event we received contained enough information so that we could look up anything missing from our main database in order to run our parsing. Each time we need to fetch additional data we update a metric so the team can track down any services that are sending incomplete data.

A flow chart showing the disconnection event ingestion workflow. When an event is ingested it is checked to see if any required information is missing. If there is some missing data the Hootsuite main database is accessed to get the missing data before the event is stored. If the database is accessed, a metric is updated to track the number of events that are missing data. If the event is missing no data it is stored immediately.
If any required data is missing on a disconnection event we can look it up from our main database. Each time we do this we update a metric so we can keep track of how many incorrectly formatted events are being ingested and work to figure out why.
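
The ingestion flow can be sketched as follows. The event shapes, the lookupNetwork function standing in for the main database, and the AtomicLong standing in for a real metrics counter are all assumptions made for illustration. The sketch also treats the network as the only field that can be missing, which is a simplification.

```scala
import java.util.concurrent.atomic.AtomicLong

object IngestionSketch {
  // Hypothetical incoming event: the network may be absent.
  final case class IncomingEvent(
      profileId: String,
      network: Option[String],
      rawError: String
  )

  // Event with every required field present, ready to be stored.
  final case class CompleteEvent(profileId: String, network: String, rawError: String)

  // Stand-in for a metrics counter that tracks incomplete events.
  val backfilledEvents = new AtomicLong(0L)

  // Fill in missing data from the lookup and count each time that is needed.
  // Returns None when the lookup cannot supply the missing field either.
  def ingest(
      event: IncomingEvent,
      lookupNetwork: String => Option[String]
  ): Option[CompleteEvent] =
    event.network match {
      case Some(n) =>
        Some(CompleteEvent(event.profileId, n, event.rawError))
      case None =>
        backfilledEvents.incrementAndGet()
        lookupNetwork(event.profileId)
          .map(n => CompleteEvent(event.profileId, n, event.rawError))
    }
}
```

Counting at the point of the lookup keeps the metric tied to the events that actually needed repair, which is what makes it useful for tracking down the services that send incomplete data.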

Determining the root cause

Being able to quickly add new parsers was a key goal of our implementation of the service, as we’d often have new errors appear suddenly that we needed to help users understand. We put a lot of time and effort into designing the new service to be easily extended, and thanks to the abstractions provided by the Cats library for Scala we succeeded.

We defined a “disconnection reason parser” generically as a function that takes a raw error response for a particular social network and returns a root cause reason for that same network. With that definition in place we were able to use MonoidK to combine a set of parsers into one, so that an error reason is run through them and the result of the first match is returned.

This setup means that adding a new parser is as simple as defining a new function and adding it to a list of the existing ones. Once we determine the identifying characteristics of a new error reason (like the error code and message) we can write a new error parser function and add it to the chain of parsers for the social network it belongs to.

type ReasonParser[-A, +T] = A => Option[DisconnectionReason[T]]

private val passwordChangeParser: ReasonParser[SomeNetworkRawErrorType, SomeNetworkReasonType] = input => {
  if (input.code == 4700) Some(SomeNetwork.passwordChanged)
  else None
}

Each parser defines what network type it is for, in this case an example network called “SomeNetwork”. A parser takes in the raw error information and either returns Some root cause identifier or None to indicate it did not match the error. The typing of ReasonParser lets us enforce that a raw error reason for a particular network will only ever result in a root cause identifier for the same network.

private val passwordChangeParser: ReasonParser[SomeNetworkRawErrorType, SomeNetworkReasonType] = …
private val anotherParser = …
private val yetAnotherParser = …

val allParsers = List(passwordChangeParser, anotherParser, yetAnotherParser)
val composedParsers = allParsers.foldK

We define all the parsers we need for a network and then use Cats’ MonoidK to combine them with “foldK”. We can then pass the raw error response to “composedParsers”, which tries the parsers until one matches or all of them have been tried.
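
To make the "first match wins" behaviour concrete, here is a dependency-free sketch that does not use Cats at all. It folds the parsers together with Option's orElse, which is a stand-in for the MonoidK combination described above and not the service's actual code. DisconnectionReason, SomeNetworkRawError, SomeNetworkReasonType, the second parser, and the identifiers are hypothetical.

```scala
object ParserChainSketch {
  // Hypothetical stand-ins for the types used in this article.
  final case class DisconnectionReason[+T](identifier: String)
  sealed trait SomeNetworkReasonType
  final case class SomeNetworkRawError(code: Int, message: String)

  type ReasonParser[-A, +T] = A => Option[DisconnectionReason[T]]

  val passwordChanged: ReasonParser[SomeNetworkRawError, SomeNetworkReasonType] =
    in =>
      if (in.code == 4700) Some(DisconnectionReason("some-network.password-changed"))
      else None

  val accessRevoked: ReasonParser[SomeNetworkRawError, SomeNetworkReasonType] =
    in =>
      if (in.message.contains("revoked")) Some(DisconnectionReason("some-network.access-revoked"))
      else None

  // Try the first parser; the second only runs if the first returns None,
  // because Option.orElse takes its argument by name.
  def firstMatch[A, T](first: ReasonParser[A, T], second: ReasonParser[A, T]): ReasonParser[A, T] =
    input => first(input).orElse(second(input))

  // Identity element: a parser that never matches.
  def never[A, T]: ReasonParser[A, T] = _ => None

  // Fold a list of parsers into one, preserving list order.
  def combineAll[A, T](parsers: List[ReasonParser[A, T]]): ReasonParser[A, T] =
    parsers.foldRight(never[A, T])((p, acc) => firstMatch[A, T](p, acc))

  val composed: ReasonParser[SomeNetworkRawError, SomeNetworkReasonType] =
    combineAll[SomeNetworkRawError, SomeNetworkReasonType](List(passwordChanged, accessRevoked))
}
```

Because both type parameters are fixed by the list's element type, a parser written for a different network's raw error or reason type would not compile in this list, which mirrors the same-network guarantee described above.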

Monitor and measure

Being able to measure the error parsing system’s ability to help users keep their social network profiles connected was critical for the team. As part of our parsing implementations we use Prometheus to collect metrics, so we can monitor dashboards that track things like which error reasons are most common and what percentage of our error volume we are successfully parsing, and that help us find new error reasons.

Our team has created Service Level Objectives to define the maximum number of unparsed errors we deem acceptable for each social network platform we support, and has created alerts that will notify us of any violations. When we’re notified of an increase in unparsed errors we have defined processes to follow in order to determine the source of the increase, whether it’s a brand new error reason or a change to an existing one. We can then quickly deploy changes to update and add parsing to get back below our defined thresholds.
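
As a rough illustration of that kind of check, the sketch below computes the share of unparsed errors per network and lists the networks that exceed a threshold. The counts, the threshold values, and the names ParseCounts and violations are hypothetical, and expressing the objective as a ratio is an assumption. In practice the numbers would come from the collected metrics rather than from an in-memory map.

```scala
object UnparsedSloSketch {
  // Parsed and unparsed error counts for one network over some time window.
  final case class ParseCounts(parsed: Long, unparsed: Long) {
    def total: Long = parsed + unparsed

    // Share of errors that no parser matched; 0.0 when there were no errors.
    def unparsedRatio: Double =
      if (total == 0) 0.0 else unparsed.toDouble / total
  }

  // Networks whose unparsed ratio is above their configured maximum.
  // Networks without a configured maximum are never reported.
  def violations(
      counts: Map[String, ParseCounts],
      maxUnparsedRatio: Map[String, Double]
  ): Set[String] =
    counts.toList.collect {
      case (network, c) if maxUnparsedRatio.get(network).exists(max => c.unparsedRatio > max) =>
        network
    }.toSet
}
```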

Results and moving forward

Thanks to the success of this project we’ve seen a significant decrease in both the volume of social network profile disconnections and the number of customer support tickets related to them. Users are no longer getting stuck in repeated disconnection patterns as much as they were before and are experiencing social profile disconnections less frequently overall.

Moving forward we are looking to add to the error parsing service to make it responsible for parsing more than just disconnection related errors, such as errors received due to issues that users can resolve without having to reconnect their social network profiles, and even errors that we often see when the social network platforms are experiencing issues. In the future we’d like to see this service become the single source of truth for the “why” behind all of the errors we receive from the social networks we support so that we can offer consistent and helpful messaging to our users when they’re experiencing problems.

Turner Vink

Software developer @Hootsuite, they/them, opinions my own, on Twitter @turnervink