How We Blew The Bloody Doors Off

The Ongoing Scalability Story at Tray.io

Tray Builders
Building Tray.io
Oct 15, 2021

When your customer base and team both double in a year, it’s time to scale. Here’s how our Engineering team kept up with Tray.io’s growth. 🚀

Grew the team

An industry-changing product can only be built by a world-class team. The first step in keeping up with growing customer demand was accelerating the growth of our engineering team. We’re close to doubling our London-based R&D team since the start of 2020!

In July 2020, just 62 engineers made up the D half of our R&D team.

Just 15 months later, we’re a team of 111 and are actively hiring (visit our careers page to learn more about the 26 current openings)!

Improved our product offering

Tray is a visual workflow builder powered by APIs and “connectors”.

These connectors are built, updated, and managed by our Connectivity Squad, and this team now proudly owns more than 600 connectors. This time last year, we had about 500. Some of the most challenging, high-profile connectors we’ve built recently include the Netsuite Connector, the Python Script Helper and the JDBC Connector.

Producing and maintaining this many connectors comes with its own challenges, so we asked the squad’s engineering manager, Mike Massari, about them.

Mike said:

Every integration is different and when building a connector we need to study the services and adapt to their unique traits and complexities. After a connector is built, we are always mindful that APIs are constantly changing and the 600+ that Tray connects to are no different. We constantly do our best to keep our connectors up to date and performant!

The team has also built a number of internal tools to help with the building and maintenance of Connectors. According to Mike, the connector-CLI may be the most important of these. It is a command-line ‘Swiss army knife’ tool that has improved the day-to-day connector building experience massively.

There are more internal tools in development geared towards automation and integration of processes and code checks.

In order to support 600 connectors across hundreds of services, we’ve naturally had to scale our team and our processes. We now have more than 30 engineers working on Connectivity. The team is still evolving their ways of working as they adapt to a growing team and connector offering.

As Mike put it:

Since early 2020, we have dedicated delivery bandwidth depending on the specific connector type and task to complete. This gives us the flexibility of working on multiple different streams of work, while keeping consistent standards and high quality throughout all our teams.

Stabilized and scaled the backend

The Executions Squad, fearlessly led by engineering manager Danny Yates, is at the heart of the Tray system — it’s a large scale, distributed state machine that is responsible for triggering and running customers’ Workflows, maintaining workflow logic, handling errors, and calling on the 600+ Connectors.

With nearly 600 customers and as many connectors, Tray.io is processing a LOT of tasks, or steps, within tens of thousands of customer workflows. Some key points:

  • We process in the region of half a BILLION tasks every day
  • This is 8x the number of tasks in Jan 2020
  • At any given second, Tray handles tens of thousands of workflow steps.

The graph below shows the real growth trajectory of tasks since January 2020:

Scaling this quickly wasn’t easy, and the Executions Squad faced several challenges along the way. But all good developers like a challenge!

According to Danny, over the course of 2020, the team made key improvements to stability and scalability to address three main challenges:

  1. Calling on fragile internal systems. If those systems crashed, executions could stop. A lot of work has gone into hardening our systems (and more is ongoing) so that the majority of executions can continue without referencing downstream systems when those systems are unavailable.
  2. The “noisy neighbor effect.” In the past we have had occasional issues with some ‘rogue’ customer workflows consuming excess resources and impacting their own and other customers’ workflows. To remedy this we have built a concurrency limiter. Each org in Tray now has a powerful and efficient queue system to throttle runaway workflows and help organizations protect their production workflows.
  3. Technical debt. It’s inevitable in any high-growth company after many years of building a complex product. We’re always striving to make our codebase easier to work with, especially as our team scales. Our team has dedicated a lot of time this year to improving and stabilizing our codebase, which has led to fewer bugs and the ability to deliver new features faster and with more confidence.
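To make the concurrency limiter from point 2 a little more concrete, here is a minimal Python sketch of per-organization throttling. It is not Tray's implementation: the class name `OrgConcurrencyLimiter`, its `submit` and `finish` methods, the org and step names, and the cap of 2 are all hypothetical, chosen only to show the general idea that each org gets its own queue, so a runaway workflow waits behind its own backlog instead of taking slots from other customers.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class OrgConcurrencyLimiter:
    """Caps how many steps each organization may run at once.

    Hypothetical sketch: steps over the cap wait in that org's own FIFO
    queue, so one org's backlog never occupies another org's slots.
    """

    def __init__(self, max_running_per_org: int) -> None:
        if max_running_per_org < 1:
            raise ValueError("max_running_per_org must be at least 1")
        self.max_running_per_org = max_running_per_org
        self.running: Dict[str, int] = defaultdict(int)
        self.waiting: Dict[str, Deque[str]] = defaultdict(deque)

    def submit(self, org_id: str, step_id: str) -> bool:
        """Return True if the step may start now, False if it was queued."""
        if self.running[org_id] < self.max_running_per_org:
            self.running[org_id] += 1
            return True
        self.waiting[org_id].append(step_id)
        return False

    def finish(self, org_id: str) -> Optional[str]:
        """Record that one step finished; return the next step to start, if any."""
        if self.running[org_id] == 0:
            raise RuntimeError(f"no running steps recorded for {org_id}")
        if self.waiting[org_id]:
            # Hand the freed slot straight to the oldest queued step,
            # so the running count stays the same.
            return self.waiting[org_id].popleft()
        self.running[org_id] -= 1
        return None


# Assumed scenario: one org floods the system while another submits one step.
limiter = OrgConcurrencyLimiter(max_running_per_org=2)
noisy_started = [limiter.submit("org-noisy", f"step-{i}") for i in range(5)]
quiet_started = limiter.submit("org-quiet", "step-0")
next_step = limiter.finish("org-noisy")
```

In this scenario only the first two steps from the noisy org start immediately; the other three wait in that org's queue, and the quiet org's step still starts right away. A production limiter would also need to handle concurrency, persistence, and fairness across many workers, which this single-process sketch leaves out.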

It doesn’t end there. The Executions Squad keeps improving as it grows. In Danny’s words:

We’re continuously evolving our processes to enable our engineers to deliver at pace. For example, we are evolving toward a continuous delivery model, which itself comes with challenges, like releasing the software whilst it’s doing tens of thousands of things a second, without impacting its ability to do those things.
