A Football is Better than a Bag of Marbles
Let’s chat for a minute about our technology stack at Handy. Like any online platform, we use a lot of different techniques, tools, and services to keep everything running smoothly for the rest of the organization. I’ve written about our stack before here and here, but the thing is, those posts hid a lot of complexity. Like, a lot.
For example, one part of our stack (inserting logs into Hive) was implemented with a complex Logstash configuration copying each file into HDFS, then an Airflow task to copy those files into directories organized by logging channel, then about 200 discrete Airflow tasks to kick off another 200 discrete YARN applications, with callbacks on those tasks. (They were the same application in the sense that the same code was running, but different applications in that each was a separate YARN “process”.)
We now have the same pipeline functionality implemented with a much simpler Fluentd configuration (bonus: Fluentd is way more reliable than Logstash in lots of ways) and one Spark application, which gets the data from S3, writes it to HDFS organized by channel, then inserts it into Hive. This is followed by some post-ingestion steps.
The big source of complexity here is the need to use a YARN application for ingestion, because we’re still working with JSON, which is schema-less. Nevertheless, with all of that, the whole process now fits in a diagram which doesn’t erase any steps.
Apart from the satisfaction of a much cleaner process, this has made our process much more reliable, and much cheaper:
- Airflow is not a high-throughput system. Though it’s cool, Airflow is (in the words of Maxime Beauchemin himself) not designed to run many small tasks. With this new design, we cut out a lot of the tasks that Airflow had to handle. This means we can use a much smaller Celery cluster and suffer less if we fall behind for any reason (e.g. loss of YARN capacity).
- With only one YARN application per hour, we don’t run the risk of Celery boxes running out of memory, and we can make better use of resources in our YARN cluster. The short version is that a threaded Spark application can take better advantage of YARN + Spark capacity scheduling, and achieve better throughput (more tasks can run at once and long tasks don’t block shorter tasks).
- We took the opportunity to right-size our boxes and configure YARN capacity scheduling. This means our Hive work can run close to the data, and Spark can run on spot workers. This is much cheaper and uses fewer machines. Overall, we cut our AWS bill by 75%.
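To make the "one threaded application instead of 200 small ones" idea from the list above a bit more concrete, here is a minimal sketch in plain Python. It is not our actual ingestion code: `ingest_channel`, `ingest_all`, and `max_workers` are hypothetical names, and the per-channel function is a stand-in for the real Spark work (read from S3, write to HDFS, insert into Hive). The point is only the shape: one process submits every channel to a thread pool, so short channels are not stuck waiting behind long ones.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_channel(channel):
    # Hypothetical stand-in for the per-channel work. In a real Spark
    # application this would trigger Spark jobs for one logging channel.
    return f"ingested {channel}"


def ingest_all(channels, ingest=ingest_channel, max_workers=8):
    # Submit every channel to one shared pool inside a single application,
    # then collect each result as soon as its channel finishes.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ingest, channel): channel for channel in channels}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(ingest_all(["web", "api", "jobs"]))
```

In a Spark driver, jobs submitted from separate threads can run concurrently, which is what lets the scheduler interleave long and short channels instead of running them strictly one after another.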
Our ingestion time with threading tends to max out at about 20 minutes under this new setup. Previously, it would top out at 55 minutes. Hive queries run substantially faster. Or at least, we think they do. We didn’t measure before-and-after, but anecdotally they feel much snappier.
Alas, nothing is free. By rolling up a lot of our workflow into a single Spark application, we lose the fine-grained insight into what succeeded and what didn’t. Without that insight we have to decide whether or not to retry the ingestion task as a whole. We are currently building out features in our ingestion application to integrate with Airflow and use it to track which channels were successfully ingested, and which were not.
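As a rough illustration of the kind of per-channel tracking we mean, here is a small sketch. It assumes a hypothetical `ingest` callable that raises an exception when a channel fails; `run_channels` and the channel names are made up for the example, and reporting the outcome back to Airflow is left out entirely.

```python
def run_channels(channels, ingest):
    # Run each channel independently and record its outcome, so one bad
    # channel does not hide the success of all the others.
    succeeded = []
    failed = {}
    for channel in channels:
        try:
            ingest(channel)
            succeeded.append(channel)
        except Exception as exc:  # keep going; remember why it failed
            failed[channel] = str(exc)
    return succeeded, failed


if __name__ == "__main__":
    def flaky_ingest(channel):
        # Hypothetical ingest function that fails for one channel.
        if channel == "payments":
            raise RuntimeError("malformed JSON")

    ok, bad = run_channels(["web", "payments", "jobs"], flaky_ingest)
    print("succeeded:", ok)
    print("failed:", bad)
```

With a record like this, a retry only needs to cover the channels in the failed set rather than the whole ingestion run.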
Another important component of making this work — which is also only possible because we consolidated our tasks — is that our tasks can now be adaptive. In this case, when an ingestion task is run outside of the normal window for its Airflow DAG run, the task can skip itself and Airflow can catch up to run the task that should be running in that slot. To avoid data or work loss, that task then has to look back (using Airflow’s record of what last succeeded) to see which time slots it should be running.
This again is a win, because Airflow (by way of logic in our custom operator) ensures that there’s never a gap in our ingested data, and we don’t have to rely on retrying small slivers of tasks. Instead, the work gets rolled up in a bigger chunk that can be executed more efficiently by Spark + YARN, as described above.
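The skip-and-look-back behavior can be sketched in a few lines of plain Python. This is an illustration under assumptions, not our custom operator: it assumes hourly slots, it takes the last successful slot as an argument rather than querying Airflow for it, and the `grace` window that decides when a run counts as "outside its normal window" is a made-up parameter.

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)


def should_skip(slot_start, now, interval=HOUR, grace=HOUR):
    # A run is considered stale (and skips itself) if it starts later than
    # the end of its slot plus a grace period. `grace` is an assumption.
    return now > slot_start + interval + grace


def slots_to_ingest(last_success, current_slot, interval=HOUR):
    # Every slot after the last successful one, up to and including the
    # current slot, so skipped runs never leave a gap in the data.
    slots = []
    slot = last_success + interval
    while slot <= current_slot:
        slots.append(slot)
        slot += interval
    return slots


if __name__ == "__main__":
    last_ok = datetime(2020, 1, 1, 0)
    current = datetime(2020, 1, 1, 3)
    for slot in slots_to_ingest(last_ok, current):
        print(slot.isoformat())
```

A run that is on time after a backlog picks up every slot since the last success in one go, which is the bigger chunk of work that Spark + YARN handle well.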
There’s a clear benefit to reducing the complexity of any system; a single football spirals smoothly through the air, while a bag of marbles, well, doesn’t. A simpler system is easier to understand, runs more efficiently, and makes the finance team smile by cutting operating costs. Of course, there are always going to be compromises, and those are decisions you’ll have to make with your team. This arrangement works for us, and everyone seems pretty happy about it!
Are you interested in making our stack run more efficiently and doing other cool things at Handy? Check out our careers page to see if anything catches your eye.