Challenges in Managing Hundreds of Automation Workflows

Fairuz A Hirzani
Jun 15, 2023 · 10 min read


What I’ve learned from managing production errors might help operation teams keep up with the constant stream of new automation.

Automate and Chill

A few years ago, when I was managing in-house monitoring tools, I remember setting our team’s chat group icon to an image I found on Google that said, “Automate and Chill.” It was a catchy phrase that felt glorious after we had finished redesigning our monitoring system. The project successfully shortened the process from writing custom code in the tools to simply adding a few configurations. While “chill” is not exactly what we intended to achieve, reducing tedious tasks is something we were striving for. It gave us more time to work on fun tasks and explore new tools.

However, when a company assigned me to manage an automation platform and build automation for business users (primarily IT Ops and the Service Desk), the satisfaction of deploying automation felt different. I was faced with an interesting challenge: at that time, our L1 operation team had difficulty handling failed jobs and complaints from business users. Ironically, we often needed to perform manual tasks just to keep the automation running.

the happy side & the angry side

The more automation we deployed, the more manual tasks the operation team had to perform. This was concerning, since it did not incentivize the automation team to create more automation. Quite the opposite.

This situation led us to make several improvements in how we develop automation. In addition, the experience helped me understand a few concepts that may help operation teams manage production automation at scale.

Pain Points in Managing Failed Workflows

Naturally, without proper design and governance, creating jobs or automation workflows in most automation tools is similar to creating independent jobs in a crontab. Each developer focuses only on their own script, without necessarily knowing what the other scripts are doing. Furthermore, the way developers solve problems varies, and that variation creates complexity, especially when managing failed workflows in production environments.

An automated process typically runs for at least two minutes and is programmed to execute each step sequentially. It essentially replicates the step-by-step process from the existing manual runbook.
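To make that sequential nature concrete, here is a minimal sketch in Python (purely illustrative and not from our actual platform, which is a low-code tool; the step and request names are made up): each step runs in order, exactly as the manual runbook lists them.

```python
# Hypothetical runbook-style workflow: steps run strictly in order,
# mirroring the manual runbook they replaced.

def validate_input(request):
    print("validating request for", request["user"])

def create_account(request):
    print("creating account for", request["user"])

def assign_permissions(request):
    print("assigning role", request["role"])

def notify_user(request):
    print("notifying", request["user"])

def run_workflow(request, steps):
    for step in steps:
        step(request)  # an unhandled error here stops every remaining step

run_workflow(
    {"user": "jdoe", "role": "reporting"},
    [validate_input, create_account, assign_permissions, notify_user],
)
```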

When the automation encounters an unexpected error while executing one of the steps, it stops. The process is not continued and remains unprocessed until the operation team looks at it. In this situation, users see their request stuck in progress far longer than it should be, and most of the time they will not resubmit the same request for a retry, simply because they already submitted it once, which is understandable.

managing failed workflow

The team involved in resolving the issue is often the developer who created that specific workflow. They usually rerun the process manually or retry it in the development tools with debugging mode turned on. It is very time-consuming, and it certainly affects their productivity. On the other hand, the standby L1 team struggled to resolve issues, primarily because of the limitations of the troubleshooting and retry runbooks. When there is no standardized pattern for building each automation, troubleshooting and resolving an issue can be frustrating because each process behaves differently. For example,

  • if error X occurred in process A, then execute runbook #1
  • if error X occurred in process B, then execute runbook #2
  • if error Y occurred in process A, then execute runbook #3
  • if error X occurred in process C, then execute runbook #4

Imagine having hundreds of processes and dozens of error cases: there would be far too many runbooks to define. Therefore, we redefined the development standard and design pattern for building new automation workflows to minimize this hassle, mainly by standardizing the retry mechanism.
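One way to see why this does not scale: think of the runbook lookup as a table keyed by (process, error). Every new process or new error type can add rows, so the table grows roughly with processes × error types. A rough sketch with made-up names:

```python
# Hypothetical runbook lookup table: one entry per (process, error) pair.
RUNBOOKS = {
    ("process_A", "error_X"): "runbook #1",
    ("process_B", "error_X"): "runbook #2",
    ("process_A", "error_Y"): "runbook #3",
    ("process_C", "error_X"): "runbook #4",
    # with hundreds of processes and dozens of error cases, this table
    # heads toward thousands of entries to write and keep up to date
}

def runbook_for(process: str, error: str) -> str:
    return RUNBOOKS.get((process, error), "no runbook defined: escalate")
```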

The New Retry System

Generally, errors fall into two categories: retry-able and non-retry-able. An example of a retry-able error is a timeout, whereas non-retry-able errors include a service being down, an undefined response, and so on. Most of the time, it’s the first category, so it made sense for us to focus on that type of error first. Moreover, encountering unexpected errors is inevitable in any system, so an effective way to handle that condition is necessary.
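As a rough illustration (the error labels below are examples, not an exhaustive list from our platform), the split can be expressed as a simple predicate:

```python
# Illustrative only: which error types are worth a quick retry.
RETRYABLE = {"timeout"}                                  # transient; a retry usually succeeds
NON_RETRYABLE = {"service down", "undefined response"}   # needs a human or a code patch

def is_retryable(error_type: str) -> bool:
    # unknown error types are treated as non-retry-able until classified
    return error_type in RETRYABLE
```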

Illustration of retry-able and non-retry-able error

First, the operation team defined what an ideal retry mechanism looks like. We came up with a retry runbook that can be applied to all processes. It looks more or less like this (sketched in code after the illustration below):

  • if there is an error, then “click retry” (it doesn’t have to be a click, but it is a few-seconds activity)
  • if successful, then done
  • if not successful, then identify the obstacles
  • if obstacles can be removed, then retry again
  • if obstacles can’t be identified or removed, then escalate
The ops team demands a standard retry process
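In code form, that generic runbook is roughly the loop below. It is a sketch of the human procedure, and the helper functions are placeholders standing in for manual actions or platform calls.

```python
# Placeholder helpers standing in for manual actions or platform calls.
def retry(request) -> bool:
    """The 'click retry' action on the platform; returns True on success."""
    return False

def identify_obstacle(request):
    """Check logs, input files, backend status; None if nothing obvious."""
    return None

def remove_obstacle(obstacle) -> bool:
    return False

def escalate(request) -> str:
    return "escalated"

def handle_failed_request(request) -> str:
    if retry(request):                       # a few seconds of work, not a full rerun
        return "done"
    obstacle = identify_obstacle(request)
    if obstacle is not None and remove_obstacle(obstacle):
        return "done" if retry(request) else escalate(request)
    return escalate(request)                 # unknown obstacle: hand it to the developers
```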

After that, the development team created a development standard that could satisfy the operation team’s requirements. The requirements translate, more or less, into several design decisions, such as:

  • keep track of the status of each execution in the queue, and roll back the status if a retry is necessary
  • standardize the file and folder structure (input, process, output)
  • identify the critical steps that must not execute twice, and apply validation to avoid duplicate execution

As a result, we built an internal queueing system that tracks the request list, its status, and other execution details. The status represents the state of the transaction, that is, which step the request is currently in. We defined three conditions: ready, in progress, and finished. When the system receives a request, it sets the status to “ready,” which means the transaction is waiting to be picked up; it then moves to “in progress” and, finally, “finished.” Whenever a retry-able error occurs, the operation team can change the transaction status so the system retries the process. Triggering the retry takes only a few seconds, and the workflow reruns in the background, so we can run multiple workflows simultaneously.
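A stripped-down version of that queue could look like the sketch below (our real implementation lives inside the automation platform; the table and column names here are assumptions):

```python
import sqlite3

# Minimal queue sketch: one row per request, with the status moving
# ready -> in progress -> finished.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE queue (
        request_id  TEXT PRIMARY KEY,
        payload     TEXT,
        status      TEXT DEFAULT 'ready',
        retry_count INTEGER DEFAULT 0,
        updated_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

def pick_up_next():
    """Worker side: claim the oldest request that is waiting."""
    row = db.execute(
        "SELECT request_id FROM queue WHERE status = 'ready' "
        "ORDER BY updated_at LIMIT 1"
    ).fetchone()
    if row:
        db.execute(
            "UPDATE queue SET status = 'in progress' WHERE request_id = ?", row
        )
    return row

def manual_retry(request_id: str):
    """Operator side: the 'few seconds' retry is just a status rollback."""
    db.execute(
        "UPDATE queue SET status = 'ready' "
        "WHERE request_id = ? AND status = 'in progress'",
        (request_id,),
    )
```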

In addition, folder and file naming conventions were standardized across all processes. This was beneficial for troubleshooting, mainly when a non-retry-able error occurred because something was wrong with the input files (sometimes we missed cases in input file validation) or the backend system produced unexpected results.
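As an illustration of what the convention buys us (the actual folder names we use may differ), every workflow gets the same predictable layout, so whoever troubleshoots always knows where to look first:

```python
from pathlib import Path

# Illustrative layout only: every workflow gets the same three folders.
BASE = Path("/automation")

def workflow_dirs(workflow_name: str) -> dict:
    root = BASE / workflow_name
    return {
        "input":   root / "input",    # files exactly as received
        "process": root / "process",  # intermediate files while a run is active
        "output":  root / "output",   # final results handed back or archived
    }

# e.g. workflow_dirs("reset_password")["input"] -> /automation/reset_password/input
```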

Lastly, during the development phase, the development team needs to identify whether any step cannot tolerate duplicate input. If such a step exists, the developer must define validation logic to avoid executing the same transaction more than once. After all, we need this restriction anyway, not only because of the retry mechanism. One example implementation: each request gets a unique identifier that the automation tracks for a defined period, and all processed files are reset whenever the first step of the process is initiated. Eventually, this became a standard test case during the testing phase.
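A hedged sketch of that duplicate-execution guard, assuming a simple in-memory ledger of processed identifiers with a retention window (in practice this would live alongside the queue, not in memory):

```python
import time

# Sketch of a duplicate-execution guard for steps that must not run twice.
PROCESSED: dict = {}                 # request_id -> time it was executed
RETENTION_SECONDS = 24 * 60 * 60     # assumed "defined period" to remember an identifier

def already_executed(request_id: str) -> bool:
    now = time.time()
    # forget identifiers that are older than the retention window
    for rid, ts in list(PROCESSED.items()):
        if now - ts > RETENTION_SECONDS:
            del PROCESSED[rid]
    return request_id in PROCESSED

def execute_critical_step(request_id: str) -> str:
    if already_executed(request_id):
        return "skipped: duplicate request"
    PROCESSED[request_id] = time.time()
    # ... call the step that cannot tolerate a second execution ...
    return "executed"
```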

While not all workflows comply with this new standard (especially older automation workflows), applying these changes to every new automation we build and to our high-transaction workflows has reduced the complexity of the retry process. It also decreases the unnecessary urgent tasks assigned to the development team. The operation team can manage production issues much faster, while the development team can focus on more productive work, such as continuously applying patches to tackle non-retry-able errors.

Nevertheless, this improvement only addresses the retry process (restoration). There is more to do, including defining more runbooks for handling workflow obstacles and reducing the number of requests that need to be retried in the first place (prevention). For example:

  • Plan the server and platform capacity.
  • Improve monitoring observability for faster troubleshooting.
  • Perform a logic update for a missing error handling.
  • Optimize inefficient logic that consumes excessive processing time and resources.

Should This Retry Be Automated?

While our new retry process takes only a few seconds to trigger, it is still manual. The retry also depends on how fast our team reacts to alert events, and the standby team may miss some alerts, so this retry process is less effective during a major incident. Thus, automating this retry system is part of our plan.

Illustration of automated retry process

Look at the illustration above: the car that falls off the road is a failed request. The idea is to place a trampoline underneath, so that whenever a car falls, it automatically bounces back to the end of the queue. This solution is the most feasible for us, considering we already have a manually triggered retry process; we only need to figure out how the system can trigger it automatically.

To put it into practice, we need to build a new workflow that rolls the queue status back from in progress to ready. This flow will be triggered whenever a request fails at any workflow step. In addition, it needs a counter that checks whether the request has already been retried more times than a predefined threshold. When that happens, the status is set to on hold and the operation team is notified to take a look manually, because it is most likely a non-retry-able error.
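Building on the earlier queue sketch (which already carries a retry_count column), the rollback workflow could look roughly like this; the threshold value and the notification call are assumptions:

```python
MAX_RETRIES = 3  # assumed threshold before we stop bouncing the request back

def on_workflow_failure(db, request_id: str):
    """Triggered whenever a request fails at any step of the workflow."""
    row = db.execute(
        "SELECT retry_count FROM queue WHERE request_id = ?", (request_id,)
    ).fetchone()
    retries = row[0] if row else 0
    if retries >= MAX_RETRIES:
        # most likely a non-retry-able error: park it and call a human
        db.execute(
            "UPDATE queue SET status = 'on hold' WHERE request_id = ?",
            (request_id,),
        )
        notify_operations(request_id)
    else:
        # bounce the request back onto the queue for another automatic attempt
        db.execute(
            "UPDATE queue SET status = 'ready', retry_count = retry_count + 1 "
            "WHERE request_id = ?",
            (request_id,),
        )

def notify_operations(request_id: str):
    print(f"request {request_id} put on hold after {MAX_RETRIES} retries")
```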

Fortunately, the tool we are using has a feature called “compensated flow,” which is triggered whenever a workflow fails. However, as we explored this feature, we found it has a few limitations, so we need to adjust our initial ideas. Let’s see how it goes.

Make It Sustainable

So far, I have described technical solutions to technical challenges. However, to make them sustainable and adaptable to change, we must also set up non-technical measures such as business processes, policies, and guidance.

As you may have noticed, the solution I described earlier does not solve non-retry-able errors. The applicable solution is to define a new runbook for each type of error or to apply a permanent fix. Because of that, the feedback loop between operations and development needs to be maintained. This is already a common spirit and practice in our organization, especially from my current managers, who constantly push insight and feedback from the operation team to the development team.

In addition, one potential approach is to assign part of the development team to focus on applying regular patches to production workflows (let me call it Team X). This approach is heavily inspired by some of my leaders’ decisions when I was involved part-time in the pilot implementation of RPA. At that time, we faced similar challenges.

Team X would coordinate closely with the operation team to analyze the most frequent production issues, act as the on-call team that receives escalations from operations, and ensure the other development teams follow standard best practices. Furthermore, they would lead the development team in keeping the development guideline up to date, since they gain the most operational exposure and have the specific technical skills to develop automation workflows.

I imagine the team’s communication would be like this

Theoretically, this team composition would likely improve the team’s dynamics. Some of the dynamics I picture are as follows:

  • Team X should share one goal with the other development teams: to benefit the company by enabling more automation. However, Team X should not be responsible for delivering automation to users; that remains the main task of the other development members.
  • Team X’s interest would be in delivering good-quality, robust (less error-prone) automation workflows by continuously updating the development guidelines and making sure the rest of the development team follows them. They would receive incentives for doing so and be disincentivized when they do not. Remember that this team would be on call during production issues.
  • If the operation team wants fewer issues in production, they need to give Team X feedback constantly. This dynamic would also encourage Team X to proactively ask the operation team for that feedback.

This approach is something I am personally planning to implement. Even though we are not using standard DevOps practices and tools like most software development does (I have yet to find DevOps tools applicable to our automation platform), this approach shares some of the cultural aspects of DevOps: it promotes collaboration, shared responsibility, continuous improvement, and operational excellence among all team members.

Key Takeaway

Managing failed automation workflows in production can be challenging because of the sheer amount of business logic and the variety of workflow designs across all automated processes. Here are some of the things we can do to make it more manageable:

  • Identify shared patterns across automation workflows, then standardize them so they behave more predictably in production.
  • Establish standard naming conventions, development guidelines, and design patterns for automation workflows.
  • Set up the required processes, policies, or teams so the development playbook (guideline) can continuously adapt to change.

It’s a Learning Process

This is only a fraction of the challenges organizations may encounter when implementing automation at scale. In another post, I will share an interesting story about why most process automation tools are low-code platforms.

What are your thoughts on this story? Let me know if there is something I missed or something I should consider. I am not yet an expert in this area; I am just sharing my story on my personal blog. Formulating ideas through writing helps me think critically and remember what I have learned. Therefore, this story is part of my learning process and, hopefully, can give you some value too.

