Smart pipelining — reactive approach to computation scheduling

Dusan Zamurovic
Published in Alter Method
6 min read · Jan 21, 2019

Many tech-eons ago (several human years now), my colleagues and I started developing an in-house big data platform. The goal was to ignite the company’s digital transformation and enable people to harness the power of data.
It was a greenfield project. There was nothing except the task, the company’s AWS account and our imagination. Now, from a safe distance, I can say that was all we needed.

Being a team of experienced people, we knew that what we started with was not going to be what we would end up with.

One can be an amazing visionary but can foresee business requirements only to a certain extent.

There are always surprises along the way.
And those are usually unpleasant ones.

So we took the time and let the product evolve. Shape itself, if you will.

Phase one — linear computations

At first, business requirements were simple and so was the initial set of computations. It was, roughly speaking, a dozen daily ETL processes.
Each of those computations could live on its own, taking input and producing output without colliding with any other.

Since everything was that simple, we chose the AWS Data Pipeline service as the backbone of our platform. It was easy to set up and a convenient way for us to start.

The AWS Data Pipeline service provides a number of “jobs”, or activities, such as a shell activity (running anything that can be run in a shell), a SQL activity, a Redshift copy activity (for copying data to/from the AWS Redshift warehouse) and some others. It was more than we needed.
It also lets you schedule the running times of the pipelines, so it was very easy to configure a job to run at 2 AM every day, for example. We call these pipelines scheduled or hardcoded (our internal term).
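To make that concrete, here is a rough sketch of how such a scheduled pipeline can be defined through boto3. The pipeline ID, object names, command and start time are invented for illustration, and the compute resource (runsOn or workerGroup) that a real definition needs is omitted for brevity:

import boto3

datapipeline = boto3.client("datapipeline")

# One daily schedule plus one shell activity that runs an ETL script.
# IDs, the command and the start time are invented; a real definition also
# needs a compute resource (runsOn or workerGroup), omitted here for brevity.
datapipeline.put_pipeline_definition(
    pipelineId="df-EXAMPLE",  # created beforehand with create_pipeline
    pipelineObjects=[
        {
            "id": "DailySchedule",
            "name": "DailySchedule",
            "fields": [
                {"key": "type", "stringValue": "Schedule"},
                {"key": "period", "stringValue": "1 day"},
                {"key": "startDateTime", "stringValue": "2019-01-21T02:00:00"},
            ],
        },
        {
            "id": "NightlyEtl",
            "name": "NightlyEtl",
            "fields": [
                {"key": "type", "stringValue": "ShellCommandActivity"},
                {"key": "command", "stringValue": "spark-submit /home/hadoop/etl.py"},
                {"key": "schedule", "refValue": "DailySchedule"},
            ],
        },
    ],
)
datapipeline.activate_pipeline(pipelineId="df-EXAMPLE")

A definition like this is what we mean by a hardcoded pipeline: the 2 AM start time is baked into the schedule object.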

It felt like we had made a good choice — we had job orchestration and scheduling at the same time.

Idempotence was something we insisted on from the very beginning.

Having a set of batch computations, we had to be sure that the results would not change with every rerun. Doubling, tripling or quadrupling the number of output rows when rerunning a job is not acceptable in any of our cases.
To that end, our Hive tables are partitioned and the partitions are overwritten on each computational run. Something similar applies to our Redshift tables: we rely on RedshiftCopyActivity’s insertMode property and its OVERWRITE_EXISTING value.
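As an illustration of what that idempotence looks like in practice (table, column and data-node names below are invented), the Hive side boils down to rewriting exactly one date partition per run, and the Redshift copy declares OVERWRITE_EXISTING so a rerun replaces rows instead of duplicating them:

# Hive side: every run rewrites exactly one date partition, so a rerun
# replaces the previous result instead of appending to it.
# Table and column names are invented for the example.
run_date = "2019-01-20"
hive_query = f"""
    INSERT OVERWRITE TABLE analytics.daily_events
    PARTITION (dt = '{run_date}')
    SELECT user_id, event_type, COUNT(*) AS events
    FROM raw.events
    WHERE dt = '{run_date}'
    GROUP BY user_id, event_type
"""

# Redshift side: the RedshiftCopyActivity pipeline object carries
# insertMode = OVERWRITE_EXISTING, so rows that already exist are replaced.
# The data-node IDs are placeholders.
redshift_copy_fields = [
    {"key": "type", "stringValue": "RedshiftCopyActivity"},
    {"key": "insertMode", "stringValue": "OVERWRITE_EXISTING"},
    {"key": "input", "refValue": "S3InputDataNode"},
    {"key": "output", "refValue": "RedshiftTableDataNode"},
]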

At the end of phase one, we had a set of linear, independent computations, all idempotent and all scheduled at the corresponding moments of the day.

Phase two — dependencies entangled


Things started to get more and more complex very soon. Our once simple computations grew and started developing dependencies on one another. What was once a set of independent jobs became a dependency chain, then a dependency tree, then a dependency… thing.

Computation A had to run before computations B and C, but C also needed computation D to finish. Computation B maybe needed D, E and F. And so it went.

Since all the jobs were scheduled and we knew their running times, we were able to estimate their end times and schedule the ones that should come after.
The level of wrongness of this approach is astonishing.

Imagine job B depending on job A — it uses the output of A as its input.
Job A starts at 1 AM and takes, let’s say, one hour to finish.
In that case, job B could be scheduled for 2 AM and everything should be fine.
But what if, for some reason, job A has much more data to process than usual? Let’s say there was an aggressive marketing campaign the day before that drove more traffic, or simply, over time, the business is doing well and the data size keeps increasing.
Job B would start before job A had produced complete results, and B would work on incomplete input.
One could tweak the scheduled times for each computation from time to time and fix the issue, but this is not how it should be done. It is not flexible and doesn’t scale well as the number of computations increases.

At that moment, we knew we hadn’t implemented the optimal solution, but we still moved on with a defensive mechanism: each computation had to sleep for some time before actually starting, giving its dependencies a chance to finish if they were still running.
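In code, that defensive mechanism amounts to little more than the following sketch (the grace period and the job function are placeholders):

import time

# Crude grace period, tuned by hand per computation, in the hope that
# upstream jobs have finished by the time it expires.
GRACE_PERIOD_SECONDS = 30 * 60

def run_with_grace_period(compute):
    time.sleep(GRACE_PERIOD_SECONDS)  # wait blindly, whether dependencies are done or not
    compute()                         # then run the actual ETL step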

At the end of phase two, we had a complex set of entangled jobs, building a branching chain of dependencies. Computations were able to wait for their dependencies to finish, but not intelligently — if something took unusually long to finish, the whole structure would produce wrong results.

Phase three — race with time


After applying the fix and postponing the same problem until an unknown moment in the future, we knew that we had to tackle this challenge once and for all. In addition to everything described in phase two, it was time to integrate a consumer of the results into the whole computational chain.
The company already had a BI solution in place and we wanted it to import the data our batch jobs produced.

Because of the way the BI tool works, we had to do that before a certain moment during the night — if we failed to do so, the BI solution would import incomplete results. People looking at BI graphs and widgets would see no difference — they would be looking at incomplete data without knowing it was incorrect. Imagine the confusion of seeing acquisition numbers go down in a single day while you know there were no actions that would cause that behavior. Some seriously wrong decisions could be made based on incomplete and incorrect data. Nobody wants that to happen.

We realized we were fighting a losing battle with Time.
Along with the problems of dependencies and job scheduling, Time had us surrounded from both sides. Computations had to wait for a day to end, in order to have the data complete, and had to finish processing before a certain point in the night, to have the results ready for the BI tool.

This is when we realized we had to make a twist in our approach — we couldn’t beat Time with time. But we could beat it if we stopped caring about it.

We went for something we now call “a reactive approach”. When we really want to brag, we call it smart scheduling (although it is a natural thing to arrive at, nothing really super-smart).
How does it work? It is easy to explain in a few steps, with a small sketch of the idea right after the list…

  • Each computation “knows” which computations it depends on
  • Each computation fires a notification when it has finished
  • Each computation waits for notifications from all of its dependencies before starting
  • Importing data into BI is done in the same manner, only when everything before it has finished
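Here is a minimal sketch of that idea, assuming one SNS topic for the “finished” notifications and an SQS queue per computation that collects them. This is just one possible shape of it, not necessarily our exact implementation, and every name below (topic ARN, queue URL, job names) is a placeholder:

import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Placeholder ARN/URL: one topic for "finished" events, one queue per computation.
FINISHED_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:computation-finished"
QUEUE_URL_C = "https://sqs.eu-west-1.amazonaws.com/123456789012/computation-c"

def notify_finished(job_name):
    """Fire a notification saying this computation has finished."""
    sns.publish(TopicArn=FINISHED_TOPIC_ARN, Message=json.dumps({"job": job_name}))

def wait_for_dependencies(queue_url, dependencies):
    """Block until a 'finished' message has arrived for every dependency."""
    pending = set(dependencies)
    while pending:
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            # Without raw message delivery, SNS wraps the payload in a "Message" field.
            payload = json.loads(body["Message"]) if "Message" in body else body
            pending.discard(payload.get("job"))
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

def run_computation_c():
    # Placeholder for the actual ETL work of computation C.
    pass

# Computation C "knows" it depends on A and D, waits for them, runs, then
# announces that it has finished so that its own dependents can start.
wait_for_dependencies(QUEUE_URL_C, ["A", "D"])
run_computation_c()
notify_finished("C")

With this in place, wall-clock time drops out of the picture: whether job A takes one hour or three, B simply starts when A announces it is done, and the BI import is just the last node reacting in the same dependency graph.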

Wrap up

There are some details of this approach that I haven’t mentioned so far, and they are very important.
For example, one might ask why we didn’t go for one of the existing scheduler and orchestrator tools.
They weren’t mature when we started this whole story.
And we wanted to own the tool.
And we didn’t want to have any infrastructure for it.

That’s it, you read it right. Our scheduler implementation doesn’t require any infrastructure of its own.
By choosing AWS services cleverly, we managed to keep its expenses down to $0 per month.

More on that in a future post soon. I hope I’ve got you intrigued.


Dusan Zamurovic
Alter Method

Now a data engineer and systems architect; formerly a software consultant, a body to be leased and an engineering manager. Simply put, I love to code.