Serving growing user needs with automated tooling
Running a Big Data platform can offer a great deal of value to users, but only if that value is simple and easy to access. At Criteo, we’ve found that making our massive data sets accessible to our growing and geographically distributed sales user base has come with certain challenges, not least the need to scale to ever-greater volumes of daily requests.
In order to provide the level of real-time services our users require, regardless of their location or time zone, automation has been essential. We’ve embraced this challenge by developing and rolling out a set of ‘self-service’ tools for automating user requests.
In creating these tools, we had to think along three dimensions: purpose, delivery and anticipated benefits. This meant focusing on why our users need these tools, how we can make using them as easy and intuitive as possible, and what outcomes we could expect both for the user and for our time-scarce engineering team.
We began this journey from the perspective of purpose, looking at what specific functions our users required from our Hadoop cluster. We identified three main functions: profiling tasks, analysing performance in real time and synchronising data between clusters.
It was with these goals in mind that we created our customised tools and services for users:
Garmadon is our live container metrics and events collector service, which allows for real-time introspection within the Hadoop cluster. It is based on a Java agent that runs on the various YARN components (containers, ResourceManager, NodeManagers) and produces events. Those events are then sent to other systems (ElasticSearch, DrElephant, HDFS…) where users can view them.
Here is the architecture schema:
For real-time visualisation, a set of Grafana dashboards is provided, and some job heuristics have been added to the DrElephant service for performance monitoring and tuning. This means that users can monitor container usage (vcore/memory), JVM metrics, GC events, HDFS latency and the number of HDFS calls, and obtain detailed metrics for Spark.
This is useful because it gives our users feedback about container behaviour, detects jobs that are affecting the NameNodes, enables better capacity planning and exposes data lineage, so we can identify which jobs create a dataset and which ones consume it.
This tool is available on our GitHub.
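To make the data lineage idea concrete, here is a minimal sketch of how producers and consumers of a dataset could be derived from a stream of file-system events. The `FsEvent` type, its fields and the job names are hypothetical and chosen for illustration; they are not Garmadon's actual event schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FsEvent:
    """Hypothetical, simplified HDFS access event (not Garmadon's real schema)."""
    application: str  # job that issued the call
    path: str         # dataset path that was touched
    action: str       # "WRITE" or "READ"


def build_lineage(events):
    """Group events into {path: {"producers": set, "consumers": set}}."""
    lineage = defaultdict(lambda: {"producers": set(), "consumers": set()})
    for event in events:
        if event.action == "WRITE":
            lineage[event.path]["producers"].add(event.application)
        elif event.action == "READ":
            lineage[event.path]["consumers"].add(event.application)
        # Any other action is ignored in this sketch.
    return dict(lineage)


# Illustrative events only; the job names and path are made up.
events = [
    FsEvent("daily_aggregation", "/data/sales/daily", "WRITE"),
    FsEvent("report_builder", "/data/sales/daily", "READ"),
    FsEvent("forecast_model", "/data/sales/daily", "READ"),
]
lineage = build_lineage(events)
```

With a mapping like this, answering "which jobs break if this dataset is late?" becomes a lookup of the dataset's consumers.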
OOPS is our live profiling service, enabling users to profile any JVM program running as a YARN workload. It relies on async-profiler to perform the profiling and on FlameGraph to draw SVG visualisations of the stack traces.
The service allows users to pinpoint which part of their code consumes too much CPU or allocates too many objects. Here is an example of a CPU profiling graph displayed by OOPS, which shows the impact of reading JSON on a mapper:
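As a rough illustration of the data behind a flame graph, the sketch below collapses sampled stack traces into the "folded" text format that FlameGraph consumes (frames joined by `;`, followed by a sample count) and finds the frame that was most often on CPU. The frame names and sample data are invented for the example; this is not OOPS's own code.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled stacks (outermost frame first) into 'a;b;c' -> count."""
    return Counter(";".join(frames) for frames in samples)


def self_time_by_frame(folded):
    """Count samples by leaf frame, i.e. the code that was actually on CPU."""
    totals = Counter()
    for stack, count in folded.items():
        leaf = stack.rsplit(";", 1)[-1]
        totals[leaf] += count
    return totals


# Hypothetical samples: three of four land in JSON parsing.
samples = [
    ["Mapper.run", "Mapper.map", "JsonParser.parse"],
    ["Mapper.run", "Mapper.map", "JsonParser.parse"],
    ["Mapper.run", "Mapper.map", "JsonParser.parse"],
    ["Mapper.run", "Mapper.map", "Context.write"],
]
folded = fold_stacks(samples)
hottest_frame, hottest_count = self_time_by_frame(folded).most_common(1)[0]

# Emit one "stack count" line per distinct stack, as in the folded format.
for stack, count in sorted(folded.items()):
    print(stack, count)
```

In a flame graph, the width of each box is proportional to these counts, which is why a dominant JSON parsing frame stands out visually.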
Mumak is our service for copying and synchronising HDFS data between Hadoop clusters. It relies on Finatra for the web server and on DistCp to perform the copies.
In our implementation we have added the notion of a logical copy, which represents the full set of data to copy or synchronise and is split into multiple hard copies (DistCp jobs). This makes the duration of each DistCp copy predictable and lets the user relaunch the copy on only a subset of the data in case of failure.
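The logical copy idea can be sketched as follows. The article does not describe how Mumak actually splits the data, so the size-based greedy packing here is an assumption, as are the function names, paths and sizes. The generated commands use the standard `hadoop distcp <sources...> <destination>` form and are only built, not executed.

```python
def split_into_chunks(files, max_bytes):
    """Greedily pack (path, size) pairs into chunks of at most max_bytes.

    A single file larger than max_bytes ends up in a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


def distcp_command(sources, destination):
    """Build (but do not run) a DistCp invocation for one hard copy."""
    return ["hadoop", "distcp", *sources, destination]


# Hypothetical source files with sizes in arbitrary units.
files = [
    ("hdfs://cluster-a/data/part-0", 40),
    ("hdfs://cluster-a/data/part-1", 50),
    ("hdfs://cluster-a/data/part-2", 30),
    ("hdfs://cluster-a/data/part-3", 90),
]
destination = "hdfs://cluster-b/data"

chunks = split_into_chunks(files, max_bytes=100)
commands = [distcp_command(chunk, destination) for chunk in chunks]

# If only the second hard copy failed, only that one needs to be relaunched.
failed_indexes = [1]
to_relaunch = [commands[i] for i in failed_indexes]
```

Because every chunk is bounded in size, each hard copy should take a comparable amount of time, and a failure costs at most one chunk's worth of rework.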
Mumak ensures data is available on multiple clusters for resiliency. It also enables users to duplicate the data for test purposes or for specific incident analysis.
Together, these tools offer a comprehensive suite of functionality for our users to carry out tasks on the cluster.
To make these tools as accessible as possible, we had to think about how they were to be delivered to users. We wanted the onboarding process to be simple so that, once deployed, we wouldn’t have additional work in responding to queries or handling user frustrations.
We first decided to create web-based user interfaces (UIs) for OOPS and Mumak, and utilise existing dashboards for Garmadon available via open source, so that users would have easy access to the services. The UIs we developed were designed to be intuitive and directive to increase the appeal of the tools. We then developed supporting documentation that walked the user through each of the tools from start to finish and offered troubleshooting advice. On top of that, we provided comprehensive user training that offered hands-on experience of using the tools, under the direction and supervision of our engineering team.
Using these three steps, we’ve made the tools user-friendly and reduced instances of drop-off or repeated requests for support.
Deploying these tools has transformed the way our users interact with our data, delivering on several anticipated benefits both for the user and for our engineering team.
For our users, the experience is now faster and free of friction. They can access the cluster at any time of the day or night, without being limited by their time zone or working hours. Additionally, they can resolve issues autonomously and immediately, without having to submit a ticket and wait for their query to work its way through the support queue.
For the engineering team, the tools have cultivated a better working experience. Previously, a significant portion of engineer time (approximately 40 minutes per day, per engineer) was consumed by resolving elementary issues. Now that users can resolve these issues themselves, engineers have been able to reallocate that time to other projects. Furthermore, the automation of the platform means the team can scale Hadoop operations without an exponential rise in user requests.
Creating a collaborative environment
At Criteo, we are well aware of the crucial role our Hadoop cluster plays in our sales operations. It’s a powerful platform that gives us a competitive advantage and helps cement our place as a market leader. For that potential to be truly realised, it’s important that we maximise the productivity of those using the platform. By building our suite of tools, each with its own specific aims, we’re satisfied that our users are able to execute their tasks independently and with the level of insight they need to maximise their results.
The tooling has evolved our engineers’ role from simple service providers to a more consultative position, where they no longer have to resolve each and every issue encountered on the platform. This has also provided the motivation to start developing further tools that give users greater control over their applications.
More broadly, these tools reflect our culture of continual service enhancement, demonstrating the benefits of taking a more strategic and long-term perspective on how the platform can give us a competitive advantage.
Want to join the community? Attend our Infra Tech Talks or check our open roles!
Thanks to Nicolas Fraison and Anthony Rabier for making this article happen.