How We Built the New Builder

Tray.io Product Manager Sheree Lim and Tech Lead Allen Evans give us a rundown of one of our recently released major features: the new workflow builder.

Sheree Lim
Building Tray.io
8 min read · Apr 1, 2022


A recent screenshot of our workflow builder and the beloved rainbow connector snake. Notice the “Back to old builder” button in the bottom right-hand corner? We have since removed it, because our customers liked the new builder so much!

The workflow builder is the foundation of the Tray.io product; it is how our users build end-to-end integrations by connecting together services and using the Tray visual programming language to abstract typically complex engineering concepts into their integrations.

The workflow canvas sits at the core of the builder, and its code base has been evolving over 5+ years.

It’s mainly responsible for two key builder features:

  • layout — determine where steps will be rendered based on the flow of the workflow (e.g.: sequences, branches, loops)
  • interaction — allow individual steps to be selected, and respond to hovers, clicks, and drags.

Over time, features have been added to the canvas without close attention to its performance, and coupling has increased, making it harder and harder to reason about the canvas in isolation, and to build features in a stable way.

On the performance side, this coupling has also increased memory demands and computational requirements of our users’ environments, all the while reducing the responsiveness of the UI and our ability to provide smooth and satisfying experiences.

Whilst investigating how we might go about refactoring the canvas, it quickly became apparent that this was going to require a full rebuild of the workflow builder.

There comes a point when you need to decide whether to keep adding to what you have, or to make a long-term investment and rebuild what’s already there.

So how did we go about rebuilding it?

In short, we started by:

  • instrumenting the current canvas and measuring its performance
  • debating alternative technical approaches to refactoring / solving pain points:
      • can we replace the current layout engine?
      • can we reduce the amount of computation / render cycles required to view and interact with steps?
  • reach two or more distinct approaches, and pick the two most likely to succeed
  • build a PoC of both approaches, and measure their performance against the current benchmarks
  • pick the winner based on performance, ease of use, and extensibility criteria
  • implement a production grade solution following the learnings of the PoC
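The measuring steps above could be sketched roughly as follows. This is a minimal harness under stated assumptions, not the tooling we actually used: `benchmark`, `LayoutFn`, and the injectable `now` clock are hypothetical names. In a browser you would likely pass `() => performance.now()` as the clock for sub-millisecond resolution.

```typescript
// Hypothetical benchmark harness; all names here are illustrative assumptions.
type LayoutFn = (stepCount: number) => void;

interface BenchmarkResult {
  stepCount: number;
  medianMs: number;
}

// Median is less sensitive than the mean to one-off pauses (e.g. garbage collection).
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Runs `layout` several times per workflow size and records the median duration.
// `now` is injectable so the harness can be driven by any clock.
function benchmark(
  layout: LayoutFn,
  stepCounts: number[],
  runs: number = 5,
  now: () => number = () => Date.now()
): BenchmarkResult[] {
  return stepCounts.map((stepCount) => {
    const samples: number[] = [];
    for (let i = 0; i < runs; i++) {
      const start = now();
      layout(stepCount);
      samples.push(now() - start);
    }
    return { stepCount, medianMs: median(samples) };
  });
}
```

Running the same harness against both PoCs and the current canvas, with the same workflow sizes, gives like-for-like numbers to compare.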

Following that, we could build features on top of the new canvas!

We were aiming for meaningful improvements:

  • allow workflows with hundreds of connectors to be rendered smoothly
  • decrease the rendering time by at least 66%
  • allow for super-wide (branching), super-long (sequencing), and super-deep (inner loops, inner branches, etc.) workflows
  • get as close as possible to 60fps when repainting the canvas

A new layout engine for a new visual builder

At the core of Tray.io’s visual workflow builder is a component called the “layout engine”. Its purpose is to take a workflow and return a set of xy coordinates for each step within the workflow.

Figure 1. Data processing flow

The problem with the v1 layout engine is that the time it takes to calculate these coordinates increases exponentially relative to the number of steps within a workflow. A simple 10-step workflow takes an average of 24ms to perform the layout calculation, whilst a 100-step workflow takes an eye-watering 1,100ms to complete the same calculation.

Whilst the layout engine calculation is being performed, the frontend UI freezes, because the calculation blocks the main processing thread in the browser. The result feels clunky, and is certainly not the experience we want to deliver to our users.
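To put those timings in terms of the 60fps target, here is a small worked calculation. The function name `framesBlocked` is ours for illustration; the only inputs are the 24ms and 1,100ms measurements quoted above.

```typescript
const TARGET_FPS = 60;

// Time available per frame at the target rate, in milliseconds (about 16.7ms at 60fps).
const frameBudgetMs = 1000 / TARGET_FPS;

// Number of frame slots consumed while a synchronous task of `taskMs`
// occupies the main thread and nothing can be repainted.
function framesBlocked(taskMs: number, fps: number = TARGET_FPS): number {
  return Math.ceil((taskMs * fps) / 1000);
}

console.log(frameBudgetMs.toFixed(1)); // 16.7
console.log(framesBlocked(24));        // 2  (10-step workflow on v1)
console.log(framesBlocked(1100));      // 66 (100-step workflow on v1)
```

Even the 10-step case overruns a single frame, and the 100-step case holds up more than a second's worth of repaints.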

Why is it so slow?

After doing some digging around in the code, we identified that the application was doing multiple data transformations and manipulations to create an input structure suitable to be passed into a third-party library that could generate these coordinates for us. Worse still, the third-party library didn’t quite position the steps in a way that aligned with our UX designs, so additional transformations had to be applied to its output before it could be used for plotting steps on the screen. This is the root cause of the exponential slowdown.

How can we make it faster?

One option was to move the layout engine calculation off the main thread of the browser and onto a background worker thread. This would stop the UI from freezing whilst the calculation was being performed, but it didn’t address the issue of latency. For example, a user would still have to wait over one second to see a new step appear after adding it to the workflow. This solution would have been a sticking plaster at best, so instead we took the plunge: we decided to look at the problem again, break it down, and implement our own custom layout engine that doesn’t slow down exponentially.
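For readers unfamiliar with the worker option, it might look something like the sketch below. The message shapes, `computeLayout`, `handleLayoutRequest`, the file name, and `render` are all hypothetical; the browser wiring is shown in comments because it only runs in a browser.

```typescript
// Hypothetical message shapes; not an actual Tray.io protocol.
interface LayoutRequest {
  requestId: number;
  stepIds: string[];
}

interface LayoutResponse {
  requestId: number;
  positions: Record<string, { x: number; y: number }>;
}

// Placeholder for the expensive calculation: stacks steps in a single column.
function computeLayout(stepIds: string[]): LayoutResponse["positions"] {
  const positions: LayoutResponse["positions"] = {};
  stepIds.forEach((id, index) => {
    positions[id] = { x: 0, y: index * 80 };
  });
  return positions;
}

// Worker-side handler, kept as a plain function so it can be tested outside a browser.
function handleLayoutRequest(request: LayoutRequest): LayoutResponse {
  return {
    requestId: request.requestId,
    positions: computeLayout(request.stepIds),
  };
}

// Inside the worker file (browser only):
//   self.onmessage = (event: MessageEvent<LayoutRequest>) => {
//     self.postMessage(handleLayoutRequest(event.data));
//   };
//
// On the main thread (browser only; `render` is a hypothetical callback):
//   const worker = new Worker(new URL("./layout.worker.js", import.meta.url), { type: "module" });
//   worker.onmessage = (event: MessageEvent<LayoutResponse>) => render(event.data.positions);
//   worker.postMessage({ requestId: 1, stepIds: ["trigger", "step-1"] });
```

The main thread stays responsive, but the response still only arrives once the calculation has finished, which is the latency problem described above.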

Understanding the data

The first challenge was to understand the data. Our workflow comes from our servers in a format that is optimised for storage. Many of the transformations described above were due to this format. It was clear that the data needed to be transformed, but it should only be transformed once. To achieve this, the workflow graph data structure was created to provide an in-memory efficient data structure optimised for rendering, see figure 2.

Figure 2. A bi-directional graph representing a workflow.

With the new data structure in place, we now had a way to optimally query the graph to be able to answer questions such as “tell me which steps are children of this parent step”.
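As a rough illustration of the idea, a bi-directional graph can be built from a flat list in a single pass and then queried cheaply. The `StoredStep` shape (a flat list with `parentId`) is an assumption made for this sketch, as are the class and method names; the real storage format and graph are more involved.

```typescript
// Hypothetical storage-style record: a flat list where each step names its parent.
interface StoredStep {
  id: string;
  parentId: string | null;
}

interface GraphNode {
  id: string;
  parentId: string | null; // link upwards
  childIds: string[];      // links downwards
}

// Built once from the stored format, then queried many times during rendering.
class WorkflowGraph {
  private nodes = new Map<string, GraphNode>();

  constructor(steps: StoredStep[]) {
    for (const step of steps) {
      this.nodes.set(step.id, { id: step.id, parentId: step.parentId, childIds: [] });
    }
    for (const step of steps) {
      if (step.parentId !== null) {
        this.getNode(step.parentId).childIds.push(step.id);
      }
    }
  }

  // "Tell me which steps are children of this parent step."
  childrenOf(id: string): string[] {
    return [...this.getNode(id).childIds];
  }

  parentOf(id: string): string | null {
    return this.getNode(id).parentId;
  }

  private getNode(id: string): GraphNode {
    const node = this.nodes.get(id);
    if (!node) throw new Error(`Unknown step: ${id}`);
    return node;
  }
}

const graph = new WorkflowGraph([
  { id: "trigger", parentId: null },
  { id: "branch", parentId: "trigger" },
  { id: "left", parentId: "branch" },
  { id: "right", parentId: "branch" },
]);
console.log(graph.childrenOf("branch")); // [ 'left', 'right' ]
console.log(graph.parentOf("left"));     // branch
```

Because both directions are stored, each lookup is a map access rather than a scan over the whole workflow.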

The next task was to use this graph to generate the corresponding xy coordinates for steps within the workflow.

Calculating the coordinates

Creating a performant algorithm from scratch to generate these coordinates proved quite tricky. Previously this had been handled by our third-party library, but that was no longer an option. We created a couple of variations of the algorithm and compared how they performed, both in terms of time and UX alignment.

The algorithm that won out in the end used a concept of recursive nested rendering. The idea behind this algorithm is to break down the workflow into the simplest structures (see figure 3), calculate the position of steps within those structures and then recursively apply the same approach to more complex parent structures.

Figure 3. Nested branch workflow identifying encapsulated layout structures

In the example above, we calculate the positions for steps in structure [3] first, then [2] and finally [1]. At the most basic level, the algorithm can be described as follows:

Figure 4. Simplified overview of the algorithm implemented to calculate the xy coordinates for a workflow
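One way to picture recursive nested layout in code is the measure-then-place sketch below. It is not the production algorithm: the `Structure` type, the fixed step size, and the gap are assumptions chosen to keep the example short. Innermost structures are sized first, and each parent is then sized and positioned from its children.

```typescript
// Hypothetical model: a workflow is a tree of steps, vertical sequences, and side-by-side branches.
type Structure =
  | { kind: "step"; id: string }
  | { kind: "sequence"; children: Structure[] }
  | { kind: "branch"; children: Structure[] };

interface Size { width: number; height: number }
interface Point { x: number; y: number }

// Assumed fixed dimensions, purely for illustration.
const STEP_WIDTH = 100;
const STEP_HEIGHT = 60;
const GAP = 20;

const sum = (values: number[]): number => values.reduce((a, b) => a + b, 0);

// Pass 1: size the innermost structures first, then their parents.
function measure(node: Structure, sizes: Map<Structure, Size>): Size {
  let size: Size;
  if (node.kind === "step") {
    size = { width: STEP_WIDTH, height: STEP_HEIGHT };
  } else {
    const childSizes = node.children.map((child) => measure(child, sizes));
    const gaps = GAP * Math.max(0, childSizes.length - 1);
    const widths = childSizes.map((s) => s.width);
    const heights = childSizes.map((s) => s.height);
    size = node.kind === "sequence"
      ? { width: Math.max(0, ...widths), height: sum(heights) + gaps }
      : { width: sum(widths) + gaps, height: Math.max(0, ...heights) };
  }
  sizes.set(node, size);
  return size;
}

// Pass 2: walk back down, giving each child a top-left origin inside its parent.
function place(node: Structure, origin: Point, sizes: Map<Structure, Size>, out: Map<string, Point>): void {
  if (node.kind === "step") {
    out.set(node.id, origin);
    return;
  }
  const own = sizes.get(node)!;
  let offset = 0;
  for (const child of node.children) {
    const childSize = sizes.get(child)!;
    if (node.kind === "sequence") {
      // Stack vertically, centring each child horizontally within the sequence.
      place(child, { x: origin.x + (own.width - childSize.width) / 2, y: origin.y + offset }, sizes, out);
      offset += childSize.height + GAP;
    } else {
      // Lay branch arms out side by side.
      place(child, { x: origin.x + offset, y: origin.y }, sizes, out);
      offset += childSize.width + GAP;
    }
  }
}

function layout(root: Structure): Map<string, Point> {
  const sizes = new Map<Structure, Size>();
  measure(root, sizes);
  const positions = new Map<string, Point>();
  place(root, { x: 0, y: 0 }, sizes, positions);
  return positions;
}
```

For a sequence of a trigger, a two-armed branch, and a final step, this places the trigger at (60, 0), the arms at (0, 80) and (120, 80), and the final step at (60, 160). Each structure is visited once per pass, so the work grows in proportion to the number of steps.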

Layout engine v2

With the new layout engine built using the workflow graph data structure and the recursive layout algorithm, the only thing left to do was to profile how well it performed compared to the v1 layout engine.

The chart below (figure 5) shows the performance gains achieved with the v2 layout engine compared to v1. By taking the time to fully understand the problem, cutting away duplicated nested data transformations and implementing a custom algorithm to align steps based on the requirements set out by our UX team, we developed a very performant layout engine.

Figure 5. Like-for-like comparison between the v1 and v2 layout engines*. Lower is better.

What is really encouraging about the v2 layout engine is that the time taken to calculate the coordinates for a workflow grows linearly with the number of steps, whereas v1 suffers from exponential slowdown.

With customers using Tray.io to create increasingly imaginative and complex workflows, the v2 layout engine is ready to meet those needs.

Challenges

The sheer size of this initiative was quite intimidating at first. By the end, we’d worked through 15 different epics and over 250 Jira tickets.

The first challenge to overcome was the existing architecture debt. The old architecture didn’t scale because it was built by hacking stuff together (hello, startup) without a plan or a solid API design for adding additional features. There were also a lot of unknown unknowns as well as undocumented features and behaviour. This meant we had to take a giant step back and understand what we wanted for the new builder, and how we could build it to scale for the future.

We encountered many problems that aren’t commonly seen in other products (layout engineering, for example), so engineers were sometimes doing things for the first time, with few resources online to draw inspiration from.

One of the most challenging parts of the rebuilding process was integrating all the different components of the builder back together. Each component needed refactoring, integrating into builder v2 and testing.

The final challenge: testing. There are an infinite number of possible workflow structures, and we had one QA engineer… (plus one stolen QA engineer at some point). There was also a lack of existing meaningful tests to confirm that everything still worked the way it should. We ended up getting pretty creative as a team in order to cover as many use cases as possible: creating our own complex workflows, reaching out to customers who had exceptionally large workflows, and getting as many people involved internally as we could.

The entire rebuilding process took just nine months, which is quite some feat for a team of only three full-time engineers (and a special shoutout to the engineers we “borrowed”).

As a result of this hard work, the performance of workflows with 100 steps improved by 11,000%! Pretty good going if you ask us!

Part of the enjoyment of this process was the variety and complexity of challenges we faced, which are quite different from other products you might work on as an engineer.

Since then, we’ve made UX improvements almost every week, which was simply not possible in the old builder. In our weekly “Customer love” session where we watch customer calls, we’ve been able to see all these improvements in action, as well as see where we can continue to improve.
