Improving Product UX On-The-Fly With a Robust Error Classification System
Building an ETL platform with hundreds of integrations across different sources (MySQL, Postgres, SaaS providers) and destinations (Snowflake, BigQuery, Redshift) is complicated on many fronts. One such front is Error Handling.
Building a No-code platform means always putting the UX front and center of all our decisions. We strive to make our product as user-friendly as possible.
We should be able to handle each error differently and display an appropriate error message. It is not humanly possible to preemptively create a list of all possible errors, because each and every integration has a different system of error codes and messages. For example, when we pull data from a MySQL database, we get the following error message.
ERROR 1044 (42000): Access denied for user 'hevo'@'%' to database 'db_name'
whereas in Postgres, it’s
ERROR: permission denied for database "db_name"
Both errors are semantically equivalent and require the user to give us access. But because the error messages differ, we would have to modify code and handle each of them separately. In both scenarios, we ask the user on the UI to provide permissions, phrased in a way that makes it easy for them to understand what they can do.
It seems like the provided user does not have the right permissions. Please recheck the permissions provided. (link to our documentation)
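As a hedged illustration of the matching step, here is a minimal sketch of how integration-specific error strings could be mapped to one shared semantic error type. The rule list, pattern strings, and function names are illustrative assumptions, not Hevo's actual classification code.

```python
import re

# Illustrative rules: each maps a provider-specific error pattern to a
# shared semantic error type. Patterns here are assumptions for the sketch.
CLASSIFICATION_RULES = [
    (re.compile(r"ERROR 1044 \(42000\): Access denied"), "INSUFFICIENT_PRIVILEGES"),
    (re.compile(r"ERROR:\s+permission denied for database"), "INSUFFICIENT_PRIVILEGES"),
]

def classify(error_message: str) -> str:
    """Return the semantic error type for a raw error message."""
    for pattern, error_type in CLASSIFICATION_RULES:
        if pattern.search(error_message):
            return error_type
    return "UNCLASSIFIED"
```

With rules like these, both the MySQL and the Postgres messages above resolve to the same INSUFFICIENT_PRIVILEGES type, so one user-facing message can cover both.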
There are multiple places in our product where we need to handle errors.
To reduce code duplication, we first classify these errors by their semantic meaning. Continuing with the same example, both of the errors mentioned above would be mapped to the error type INSUFFICIENT_PRIVILEGES. We further categorize these error types into USER_ACTION_REQUIRED, INTERMITTENT, and INTERNAL for polling errors. Similarly, we have different behavioral logic for other components of our product.
This 2-level hierarchy allows us to identify the semantic meaning of the error message using error types (level 1) and decide how we want to handle it using categories (level 2).
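The 2-level hierarchy can be sketched as a simple lookup from error type to handling category. The category names come from the post itself; the specific mapping entries beyond INSUFFICIENT_PRIVILEGES are hypothetical examples.

```python
# Level 2 of the hierarchy: each semantic error type (level 1) maps to a
# handling category. Only INSUFFICIENT_PRIVILEGES is from the post; the
# other entries are illustrative assumptions.
ERROR_TYPE_TO_CATEGORY = {
    "INSUFFICIENT_PRIVILEGES": "USER_ACTION_REQUIRED",
    "CONNECTION_TIMEOUT": "INTERMITTENT",
    "UNEXPECTED_NULL": "INTERNAL",
}

def handling_category(error_type: str) -> str:
    # Unknown types default to INTERNAL so they surface to engineers.
    return ERROR_TYPE_TO_CATEGORY.get(error_type, "INTERNAL")
```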
- Every time we add a new error classification, we need to make code changes and deploy the app. Even with a rolling deployment, that means 20 minutes of sub-optimal throughput. Multiply that by the number of error classifications in a week and the amount of engineering effort required, and the cost starts to make itself clear.
- Most error classifications are simple string or regex checks that don’t require an engineer to make code changes. Product Managers already have context on the different errors that can happen, so they should also be able to add error classifications.
The first step is to define the framework, which means structuring the error classification process.
Each error classification requires 5 attributes.
- Regex or Text Classification — The regex or text that will match the error message.
- Error Type / Component of the product — (e.g. CONNECT, AUTOMAPPING). Additionally, the scope is defined by suffixing the exact part of the product: CONNECT_SOURCE applies to connection errors with sources, CONNECT_DESTINATION applies to all connection errors with destinations, and CONNECT_GLOBAL applies to all connection errors in the product.
- Error Entity — The specific part of the product (sub error type).
- Category Code — The category code for the classification like INSUFFICIENT_PRIVILEGES.
- Displayable Error Message — The error message that is displayed when the classification matches the error message.
This means that the following Error Classification will apply to all connect errors.
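To make the five attributes concrete, here is a hypothetical classification record for a global connect error. The field names and the regex are illustrative stand-ins, not the actual stored schema.

```python
# Hypothetical classification record with the five attributes described
# above. Field names, pattern, and entity value are assumptions.
classification = {
    "pattern": r"permission denied|Access denied",   # regex/text to match
    "error_type": "CONNECT_GLOBAL",                  # applies to all connect errors
    "error_entity": "DATABASE",                      # sub error type
    "category_code": "INSUFFICIENT_PRIVILEGES",      # level-2 handling category
    "display_message": (
        "It seems like the provided user does not have the right "
        "permissions. Please recheck the permissions provided."
    ),
}
```

Because the error type is CONNECT_GLOBAL, a record like this would match connect errors from any source or destination.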
Once the process was formalized, we needed a way to add classifications while the app is running. An obvious way would be to write an API on the service that accepts classifications and stores them in a MySQL table.
But everything we do here at Hevo needs to account for distributed systems and efficiency. In this case, we have multiple clusters (India, Asia, USA, Europe), each with an elastic number of machines. Therefore, all of those machines should have the latest copy of the error classifications as soon as possible.
There are 2 broad approaches to syncing information in a distributed environment.
In a pull-based system, each server polls the external service regularly and pulls the data if anything has changed.
In a push-based system, the external service keeps track of every server and sends them the info when it detects that there are updates.
There are numerous well-known examples of both approaches. However, for us, one particular downside of a push-based approach is that keeping track of a large number of ephemeral nodes might not be a good idea.
We decided to go primarily pull-based, with some elements of push. We were able to achieve this by understanding our unique situation and introducing an intermediate cache for each environment.
Let’s break down the above diagram by starting with the Error Classification Store.
The Error Classification Store is hosted on Westeros; it contains all of the error classifications and sports a comprehensive API that validates and manages them.
Westeros is an internal global service that helps our main apps with any information that needs to be stored and provided. For example, Billing information.
Each cluster has its own Redis cache, where all of the error classifications are cached and used by all of the nodes in that cluster. Every 30 minutes, the machines check the timestamp value on Redis to see whether the cached value has been updated. If (now - timestamp) > 30 minutes, one node picks up the duty of checking the Error Classification Store for updated classifications and refreshing the cached value. This avoids unnecessarily polling the Error Classification Store, as we need Westeros to remain highly available. By using Redis, which has sub-millisecond latency, we can support error classifications cost-effectively and at scale. Additionally, we keep another layer of caching in the app's memory to avoid hitting Redis on every lookup as well. We can do this because we know the number of error classifications added per week is low.
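The two cache layers described above can be sketched roughly as follows. The `redis` and `store` objects stand in for the per-cluster Redis cache and the Error Classification Store API; their interfaces are assumptions made for the sketch, not Hevo's real clients.

```python
import time

REFRESH_SECONDS = 30 * 60  # re-check every 30 minutes

class ClassificationCache:
    """Sketch of the in-memory layer over the per-cluster Redis cache.

    `redis` is assumed to expose get_timestamp()/get_rules()/set_rules();
    `store` is assumed to expose fetch_rules(). Both are illustrative.
    """

    def __init__(self, redis, store):
        self._redis = redis
        self._store = store
        self._rules = None
        self._local_checked_at = 0.0

    def get_rules(self):
        now = time.time()
        # In-memory layer: skip Redis entirely while the local copy is fresh.
        if self._rules is not None and now - self._local_checked_at < REFRESH_SECONDS:
            return self._rules
        # Redis layer: if the cluster-wide copy is stale, refresh it from the
        # store (in production only one node would win this duty).
        if now - self._redis.get_timestamp() > REFRESH_SECONDS:
            self._redis.set_rules(self._store.fetch_rules(), timestamp=now)
        self._rules = self._redis.get_rules()
        self._local_checked_at = now
        return self._rules
```

Note that in the real system some coordination (e.g., a lock or compare-and-set on the timestamp key) would ensure only one node refreshes Redis at a time; that detail is elided here.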
One issue with this approach is that we potentially have a 30-minute latency for when we add an error classification and when it starts being applied to all of the errors. To mitigate this, we applied a little bit of a push-based approach as users would add or edit classifications only using Alfred. This constraint allows us to send a message to each environment whenever the user adds or edits a classification.
Alfred is an extremely useful tool we utilize for various tasks that interact with the main app. It allows us to configure Pipelines and teams, check billing info, debug issues, and add Error Classifications.
This is efficient as we only invalidate the cache in each environment when there is a change. Moreover, we can do this comfortably because there is only one way to upsert an error classification (through Alfred), and we only need to keep track of the clusters (India, Asia, US) which are static.
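The push step can be sketched as a small fan-out on upsert. The cluster list and the `save`/`notify` callbacks are illustrative stand-ins for the Error Classification Store write and the per-cluster cache invalidation call.

```python
# Hypothetical sketch: since all upserts go through Alfred and the cluster
# list is static, the store can notify every cluster on each change instead
# of waiting for the next 30-minute poll. Names below are assumptions.
CLUSTERS = ["india", "asia", "us"]

def upsert_classification(rule, save, notify):
    save(rule)                # persist in the Error Classification Store
    for cluster in CLUSTERS:  # static list, so fan-out is trivial
        notify(cluster)       # e.g., invalidate that cluster's Redis cache
```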
- Allows us to improve the user-friendliness of the product without code changes and redeployment.
- Reduces the burden on Support — errors with clean, actionable messages reduce the number of support tickets opened just to understand what an error means.
- Enables Product Managers to clean the error data for analytics and avoids the need for any coding.
Additionally, Alfred enforces validation of each error classification and maintains an audit trail via our Maker-Checker system.
- Right now, we only support CONNECT and AUTOMAPPING errors, but this can easily be extended to support POLL and LOAD errors too.
- Although the error classifications can be added dynamically, the error categories still require code changes. This is not really a problem as we don’t add new categories regularly, and the list of categories we already have has been built over the years.
- The displayable error message is currently static. But we can power more advanced displayable messages by using regex group captures.
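As a hedged sketch of that last direction, a regex with named capture groups could feed details from the raw error into the displayed message. The pattern, template, and function name below are illustrative, not a planned implementation.

```python
import re

# Illustrative: a classification pattern with a named group, plus a message
# template that interpolates the captured value.
pattern = re.compile(r'permission denied for database "(?P<db>\w+)"')
template = "The connecting user lacks permissions on database '{db}'. Please grant access."

def render(error_message):
    """Return a dynamic displayable message, or None if no match."""
    match = pattern.search(error_message)
    if match is None:
        return None
    return template.format(**match.groupdict())
```

Applied to the Postgres error shown earlier, this would surface the actual database name in the user-facing message.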
Thanks to Kaushtub Rawat & Umesh for their efforts on the Alfred side of this.
The world of data is changing and is never going to be the same. If you think it is worth being a part of our mission and working on challenges like this, give us a buzz at email@example.com.