We use Apache Oozie to schedule jobs on one of our Hadoop clusters. Some of these jobs deliver business-critical data to customers and internal users, and need various types of monitoring or alerting to notify our team of problems. This post discusses one such system of jobs, and two types of monitoring: Oozie SLAs, and the use of Oozie Decision Control Nodes.
The Optimizely analytics backend processes around 3 billion events per day, using many types of open source software including extensive use of the Hadoop ecosystem. At one end of the data ingest pipelines, events are stored in the Avro file format on disk in large collections organized by date. We mine this raw event data for metrics and insights with a variety of tools, including batch computations with Hadoop MapReduce.
A typical Hadoop batch job counts monthly unique visitors across customer accounts. When designing the system to do this work, we considered the following characteristics of these jobs:
- Not real time: the jobs only need to run every few hours
- Recurrent: the jobs run on cron-style schedules, including hourly and daily
- Large amounts of data to process: de-duplication of various unique IDs requires looking back at many days' worth of data
- Business critical: customers and internal users need regular, accurate results
These characteristics make such jobs good candidates for Oozie workflows composed of Hadoop MapReduce, Pig, file system, and other actions, all scheduled with Oozie coordinators.
Monitoring job success and run time with Oozie SLAs
The first type of monitoring we set up for these jobs is Oozie's very useful built-in SLA system, which allows you to trigger alerts if a job:
- Does not start at the expected time.
- Does not end at the expected time.
- Does not run for the expected duration.
In addition, any job with an SLA triggers an alert when it does not end successfully.
Since our jobs may take hours to run, we set fairly wide SLA windows for start, end, and duration, and so far have primarily been alerted in cases of job failure.
Here’s an example coordinator with SLA:
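The sketch below shows what such a coordinator might look like, assuming Oozie 4.x with the `uri:oozie:sla:0.2` SLA schema; the app name, dates, workflow path, SLA windows, and contact address are all illustrative placeholders, not our actual configuration:

```xml
<coordinator-app name="daily-uniques-coord" frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2016-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4"
                 xmlns:sla="uri:oozie:sla:0.2">
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
    <sla:info>
      <!-- Anchor the SLA clock to this action's scheduled run time -->
      <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
      <!-- Generous windows, since these jobs can legitimately run for hours -->
      <sla:should-start>${30 * MINUTES}</sla:should-start>
      <sla:should-end>${6 * HOURS}</sla:should-end>
      <sla:max-duration>${6 * HOURS}</sla:max-duration>
      <!-- Fire alerts on a missed start, missed end, or overlong run -->
      <sla:alert-events>start_miss,end_miss,duration_miss</sla:alert-events>
      <sla:alert-contact>data-team@example.com</sla:alert-contact>
    </sla:info>
  </action>
</coordinator-app>
```

The `should-start`, `should-end`, and `max-duration` offsets are measured from the nominal time, so widening them is how you avoid false alarms on long-running jobs.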
The above is configured to email a team alias if the job goes out of its SLA. For a more critical job that should never fail, you can add multiple email alert contacts (including a pager email address).
Monitoring file size in HDFS with Oozie Decision Control Nodes
A few months ago, we encountered a bug where an error in the job scheduling caused some jobs to run at incorrect times, leading to incomplete results being written to unusually small files on HDFS. Oozie SLAs didn’t catch this, because the jobs were not failing or running outside of their scheduled times (since the schedule itself was wrong).
When designing regression tests for this issue, one solution we settled on was monitoring the size of the output files. To ensure that these files were reasonably large, we used Oozie’s Decision Control Node feature. This type of Oozie action allows you to use control-flow statements in your workflows: for example, you can split a workflow into two execution paths based on the results of a previous job, or the size of an existing file.
Here's a complete example workflow which triggers alerts if the total size of files in a directory is too small:
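A minimal sketch of such a workflow follows; it relies on Oozie's built-in `fs:dirSize` EL function, which returns the total size in bytes of the files directly under a directory. The workflow name, the `outputDir` and `minExpectedBytes` properties, and the email address are placeholders for illustration:

```xml
<workflow-app name="check-output-size-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="check-output-size"/>

  <!-- Decision node: route based on the total size of the output files -->
  <decision name="check-output-size">
    <switch>
      <case to="alert-small-output">
        ${fs:dirSize(outputDir) lt minExpectedBytes}
      </case>
      <!-- Output is large enough: nothing to do -->
      <default to="end"/>
    </switch>
  </decision>

  <!-- Email the team, then fail the workflow so the run is marked KILLED -->
  <action name="alert-small-output">
    <email xmlns="uri:oozie:email-action:0.1">
      <to>data-team@example.com</to>
      <subject>Output too small in ${outputDir}</subject>
      <body>Total size of files in ${outputDir} was below ${minExpectedBytes} bytes.</body>
    </email>
    <ok to="fail-small-output"/>
    <error to="fail-small-output"/>
  </action>

  <kill name="fail-small-output">
    <message>Output in ${outputDir} was unexpectedly small.</message>
  </kill>

  <end name="end"/>
</workflow-app>
```

Ending the alert path in a `kill` node, rather than `end`, makes the undersized output visible as a failed workflow run, so it also shows up through the SLA alerting described above.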
Thanks for reading!
We hope this post is helpful in your own adventures with Oozie and Hadoop.
If you’re interested in this type of work, we’re hiring all types of engineers.
Thanks to Alex Milstead for extensive input on the first draft of this post.