Apache Oozie Monitoring

Rania ZYANE
Jan 24, 2022

For our Datalake project, we have been using Apache Oozie for years. It has been very useful for creating, managing and orchestrating a number of data processing pipelines: a complex ingestion and processing workflow of jobs scheduled hourly, daily, weekly and monthly, with multiple actions and stages.

Briefly: our ingestion workflows, around 230 in total, run daily at specific frequencies (cron schedules) in order to qualify and absorb data received from various sources and transfer it to our datastore (Apache Hive). Processing workflows, on the other hand, are scheduled at different times throughout the day in order to optimize cluster resources, especially those that take a couple of hours to run.

The objective of this article is to share our team’s experience with Apache Oozie Monitoring. It is not intended to explain how Oozie works, nor to present what a Datalake is.

We have learned to deal with Apache Oozie's annoying aspects, particularly its complexity combined with the number of requests we receive. The tool has shown great maturity, consistency and scalability in running multiple workflows concurrently, along with the ability to model dependencies and complex pipelines.

Here are some easy Oozie tips and tricks:

  • Dependency management: dependencies between workflows can be easily handled with sub-workflows, joins and forks. Datasets can also be used as witnesses, so that a workflow only starts once they exist
  • Notifications: always add notifying actions (email-action) to workflows in order to be alerted when a workflow fails
  • Always set the coordinator's time parameters: start and end time
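As a sketch of the last two tips, a coordinator can declare a dataset as a prerequisite and must always carry explicit start/end times. All names, paths and dates below are illustrative, not taken from our project:

```xml
<!-- Illustrative coordinator: waits for a daily dataset before triggering
     its workflow, and has explicit start/end times. -->
<coordinator-app name="daily-ingest-coord" frequency="${coord:days(1)}"
                 start="2022-01-01T02:00Z" end="2023-01-01T02:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="raw-input" frequency="${coord:days(1)}"
             initial-instance="2022-01-01T02:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/raw/${YEAR}${MONTH}${DAY}</uri-template>
      <!-- the workflow only starts once this flag file exists -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="raw-input">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/apps/ingest-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```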

Yet we have faced many challenges with Apache Oozie, since ingestion is the silent budget killer of every Datalake project. As the Datalake grows, the number of Oozie workflows keeps increasing, and the follow-up and tracking of these workflows became a daily pain point. Unfortunately, the Oozie UI is very poor in terms of monitoring and alerting:

  • No alerts are issued for TIMEOUT execution of a coordinator.
  • Debugging: Oozie scheduling can be very delicate when its scheduling system is used without understanding its mechanics.

Ensuring a healthy data delivery pipeline largely depends on the health of its workflows. Monitoring these workflows correctly is crucial, since they are the first links in our production chain.

Monitoring jobs with Oozie SLA

How bad is waking up to a failing data pipeline? Now imagine missing a critical one that needed immediate attention because of its direct impact on data reliability.

Tracking workflows daily minimizes the risk of bad data ingestion and consequently preserves data consistency. Bad data that goes unnoticed can be costly once it reaches the business. Overall, data reliability helps teams trust their data and catch issues early on.

Understanding not only the impact but also the cause of bad data is absolutely necessary. Data Observability is the go-to process for better data reliability: it goes beyond monitoring by offering more context around metrics, providing a deeper view of data operations, and indicating whether a fix is needed.

Along with monitoring, Data Observability can be seen as a bunch of activities that help engineers understand the health and state of data :

  • Alerting — both for expected events and anomalies
  • Tracking — ability to set and track specific events
  • Comparisons — monitoring over time, with alerts for anomalies
  • Analysis — automated issue detection that adapts to your pipeline and data health
  • Logging
  • SLA Tracking

This article will be restricted to monitoring Oozie workflows time- and size-wise.

The first monitoring solution we implemented was Oozie's built-in SLA system. A service level agreement (SLA) is an agreement/contract between a service provider and a customer that identifies both the services required and the expected quality of service in measurable terms. In Oozie, time-related SLAs make more sense than traditional SLAs. They let you check the requirements of a running Oozie job and, consequently, raise an alert when a job:

  • Did not end successfully
  • Took longer than the expected duration
  • Did not start at the expected nominal time
  • Did not end by the expected end time
  • Took less than or exactly the expected duration
  • Started at the expected start time
  • Ended before or at the expected end time

The main idea behind SLA is to check whether a job has MET or MISSed a specific time commitment. These states are easily managed by Oozie, and notifications can be sent by email as alerts.

Oozie SLA: events and statuses

There are several ways to visualise or retrieve SLAs:

  • The SLA tab in the Oozie UI, introduced in oozie-4.x
  • The REST API, to query SLA summaries
  • JMS messages sent to a configured JMS provider for instantaneous tracking
  • An Instrumentation.Counter entry, accessible via the REST API, that reflects the number of all SLA-tracked external entities. The name of this counter is sla-calculator.sla-map.
  • The oozie command line: oozie sla -filter jobid=<jobID>, which is deprecated and won't give the expected results
  • Oozie SLA in Hue

Oozie offers to define, store and track SLAs for its entities: Oozie coordinators, Oozie workflows and Oozie actions. The specification comes as part of the application and is expressed, using XML Schema, as a fragment of an XML document.

To enable SLA tracking on a job, the namespace "uri:oozie:sla:0.2" should be declared in the application XML along with "uri:oozie:workflow:0.5". The SLA section is defined in the Schemas Appendix and contains the tags below:

  • nominal-time: the time relative to which the job's SLAs will be calculated. Since Oozie workflows are generally aligned with synchronous data dependencies, this nominal time can be parameterized to receive the value of your coordinator's nominal time: ${coord:nominalTime()}.
  • should-start: relative to nominal-time, the duration (in MINUTES, HOURS or DAYS) within which a job should start running to meet the SLA.
  • should-end: relative to nominal-time, the duration (in MINUTES, HOURS or DAYS) within which a job should finish running to meet the SLA.
  • max-duration: the maximum amount of time (in MINUTES, HOURS or DAYS) a job should run; it is optional.
  • alert-contact: a list of email addresses to which the alerts should be sent.
  • alert-events: a list of events for which an email should be sent: start_miss, end_miss, duration_miss and their *_met counterparts. It is optional and only applies when alert-contact is given.

Adding SLA requirements to the entity's definition itself makes for simpler reading, because both the SLA and the execution logic live in the same document.

SLA_EVENT navigation

The SLA_EVENT table contains one record per SLA-tracked invocation of an action in an Oozie workflow or coordinator. It holds columns describing the SLA, including the SLA timestamp and the tags discussed above, except for should-start and should-end. These two elements are not mapped directly in the table; instead they feed into two computed columns:

  • expected_start = nominal-time + should-start
  • expected_end = nominal-time + should-end
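As a quick sanity check of this arithmetic (the nominal time and offsets below are made up for the example), GNU date can reproduce the two computed columns:

```shell
# Hypothetical nominal time; assume should-start = 10 minutes
# and should-end = 60 minutes.
nominal="2022-01-24T08:00Z"

# expected_start = nominal-time + should-start
date -u -d "$nominal + 10 minutes" +"%FT%H:%MZ"   # 2022-01-24T08:10Z

# expected_end = nominal-time + should-end
date -u -d "$nominal + 60 minutes" +"%FT%H:%MZ"   # 2022-01-24T09:00Z
```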

The sla_id field is a foreign key to COORD_JOBS and COORD_ACTIONS, the tables that store temporal and state information about coordinators and their actions. Likewise, the COORD_ACTIONS table has foreign keys to WF_JOBS and WF_ACTIONS, which contain all the temporal and state characteristics of workflow actions.

Consequently, SLA_EVENTS.sla_id allows linking an SLA event to its coordinator and workflow jobs.

Making this work under Oozie 5.x requires some light configuration:

1. Changes in oozie-site.xml :
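The gist originally embedded here is not reproduced; as a sketch based on the Oozie SLA monitoring documentation, the relevant oozie-site.xml properties look like the following (adapt the service list to your distribution's defaults):

```xml
<!-- Enable the SLA service and event handling. The listener list is a
     sketch: SLAEmailEventListener is only needed for email alerts. -->
<property>
  <name>oozie.services.ext</name>
  <value>
    org.apache.oozie.service.EventHandlerService,
    org.apache.oozie.sla.service.SLAService
  </value>
</property>
<property>
  <name>oozie.service.EventHandlerService.event.listeners</name>
  <value>
    org.apache.oozie.sla.listener.SLAJobEventListener,
    org.apache.oozie.sla.listener.SLAEmailEventListener
  </value>
</property>
```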

2. SLA definition in Workflow
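The original gist is not reproduced here; as an illustrative sketch (app name, contact address and durations are hypothetical), a workflow-level SLA is declared by adding the sla namespace and an sla:info block after the end node:

```xml
<workflow-app name="ingest-wf" xmlns="uri:oozie:workflow:0.5"
              xmlns:sla="uri:oozie:sla:0.2">
  <start to="ingest"/>
  <action name="ingest">
    <!-- ... action definition goes here ... -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Ingestion failed</message></kill>
  <end name="end"/>
  <!-- SLA for the whole workflow, relative to the nominal time -->
  <sla:info>
    <sla:nominal-time>${nominal_time}</sla:nominal-time>
    <sla:should-start>${10 * MINUTES}</sla:should-start>
    <sla:should-end>${30 * MINUTES}</sla:should-end>
    <sla:max-duration>${30 * MINUTES}</sla:max-duration>
    <sla:alert-events>start_miss,end_miss,duration_miss</sla:alert-events>
    <sla:alert-contact>team@example.com</sla:alert-contact>
  </sla:info>
</workflow-app>
```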

3. SLA definition in Coordinator
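Again the embedded gist is not reproduced; as a hypothetical sketch, a coordinator-level SLA goes inside the action element, after the workflow element, and typically takes the coordinator's nominal time as reference:

```xml
<coordinator-app name="ingest-coord" frequency="${coord:minutes(5)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4"
                 xmlns:sla="uri:oozie:sla:0.2">
  <action>
    <workflow>
      <app-path>${nameNode}/apps/ingest-wf</app-path>
    </workflow>
    <!-- SLA relative to each action's nominal time -->
    <sla:info>
      <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
      <sla:should-end>${1 * MINUTES}</sla:should-end>
      <sla:alert-events>start_miss,end_miss,duration_miss</sla:alert-events>
      <sla:alert-contact>team@example.com</sla:alert-contact>
    </sla:info>
  </action>
</coordinator-app>
```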

The Oozie workflow is then run as usual using the command line :

oozie job -oozie $oozie_server -config $path_to_workflow/job.properties \
-D start=$(date -u +"%FT%H:%MZ") \
-D end=$(date -u -d "+ 100 month" +"%FT%H:%MZ") \
-D frequency="$freq" \
-D nameNode=$nameNode \
-D jobTracker=$jobTracker \
-D jdbcURL=$jdbc \
-D jdbcPrincipal=$principal \
-D metastore=$metauri \
-D timezone=WET \
-D env=$env \
-run

For our project, we decided to use Hue as an Oozie UI alternative, given the range of visualisations and functions its latest version offers on Cloudera Data Platform. As mentioned before, the Oozie UI unfortunately never gained ground: all job management actually happens on the command line, and the default UI is read-only.

On the other hand, Hue provides an elaborate list of features that made it easy to give up the Oozie UI:

  • JobBrowser : Actions, Workflows, Coordinators, Bundles and SLAs dashboards
  • Workflow Editor : workflow creation and scheduling
  • Kill, suspend, and re-run jobs from the UI
  • One click access to Oozie logs or MapReduce launcher logs
  • HDFS Browser: an HDFS file viewer
  • Built with standard, current web technologies: filtering, sorting, progress bars and XML highlighting for both browsers

Speaking of Oozie SLAs: to let Hue access them, the Oozie service should be selected in the Hue configuration:

Hue : enable Oozie Service

Hue then provides two types of SLA views :

  • Main SLA Tab for searching and graphing for all defined SLAs on Oozie jobs
  • Individual SLA tab for workflow/coordinator

Real World Example :

After configuring and enabling SLA on Oozie and Oozie on Hue, comes the implementation part.

1. Definition of the SLA on a coordinator action :

As seen in coord.xml, the sla:0.2 section is added.

The nominal time is the value returned by the function coord:nominalTime(), which resolves to the coordinator action's creation (materialization) datetime. For scheduled coordinators, coord:nominalTime() is always the coordinator start datetime plus a multiple of the coordinator frequency.

As for the should-end tag, it has been set to 1 minute. This means the workflow job is expected to end by the action's nominal time + 1 minute; otherwise the SLA will be MISSed.

Also, SLA events will be triggered only for : start_miss, end_miss, duration_miss.

2. Overview on the Workflow :

The workflow job used in this example is a simple shell action that executes a script that sleeps 2 minutes before writing to an HDFS file.

You have surely noticed it!

expected_end_time = nominal_time + 1 minute

actual_end_time ≈ nominal_time + 120 seconds (sleep) + ε

This Oozie job will certainly miss the given SLA, yet it will execute successfully.

To verify this hypothesis, we change should-end and max-duration to 10 minutes (or, alternatively, modify script.sh).

The new coordinator defines SLA for all types of events : start_miss, end_miss, duration_miss, start_met, end_met, duration_met.
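The modified sla:info block would then look roughly like this (a sketch; the contact address is hypothetical, and note that per the Oozie documentation email alerting is primarily intended for the *_miss events):

```xml
<sla:info>
  <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
  <sla:should-end>${10 * MINUTES}</sla:should-end>
  <sla:max-duration>${10 * MINUTES}</sla:max-duration>
  <sla:alert-events>start_miss,end_miss,duration_miss,start_met,end_met,duration_met</sla:alert-events>
  <sla:alert-contact>team@example.com</sla:alert-contact>
</sla:info>
```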

After launching the Oozie job once every 5 minutes over a 20-minute window, we got these results:

1. SLA calculation has not yet started

SLA tab view on Hue — NOT STARTED

2. SLA calculation is in progress as the Workflow job started :

SLA tab view on Hue — IN_PROGRESS

The first execution (@1) MET the SLA, since its actual end time was before the expected one and its actual duration (10000 ms) was less than expected (60000 ms).

3. SLA calculation finished :

SLA tab view on Hue — DONE

Now that the SLA calculation has ended for all the coordinator executions, it is clear that the second and third workflows missed the SLA: their actual end times were past the expected end time, and their actual durations (127000 ms and 130000 ms) were longer than expected (60000 ms).

Note that the job's status is always SUCCEEDED even when the SLA is not MET. That said, the SLA definition does not impact job execution; it only tracks it.

Another way to access these results is via Oozie's REST API, in JSON format, using several types of filters:

  • id of the workflow job, workflow action or coordinator action
  • parent_id — Parent id of the workflow job, workflow action or coordinator action
  • nominal_after and nominal_before — Start and End range for nominal time of the workflow or coordinator.
  • bundle — Bundle Job ID or Bundle App Name. Fetches SLA information for actions of all coordinators in that bundle.
  • event_status — event status such as START_MET/START_MISS/DURATION_MET/DURATION_MISS/END_MET/END_MISS
  • sla_status — sla status such as NOT_STARTED/IN_PROCESS/MET/MISS
  • job_status — job status such as CREATED/STARTED/SUCCEEDED/KILLED/FAILED
  • app_type — application type such as COORDINATOR_ACTION/COORDINATOR_JOB/WORKFLOW_JOB/WORKFLOW_ACTION
  • user_name — the username of the user who submitted the job
  • created_after and created_before — Start and End range for created time of the workflow or coordinator.
  • expectedstart_after and expectedstart_before — Start and End range for expected start time of the workflow or coordinator.
  • expectedend_after and expectedend_before — Start and End range for expected end time of the workflow or coordinator.
  • actualstart_after and actualstart_before — Start and End range for actual start time of the workflow or coordinator.
  • actualend_after and actualend_before — Start and End range for actual end time of the workflow or coordinator.
  • actual_duration_min and actual_duration_max — Min and Max range for actual duration (in milliseconds)
  • expected_duration_min and expected_duration_max — Min and Max range for expected duration (in milliseconds)

Here’s an example using parent_id filter :

Request :

curl -i -X GET --negotiate -u : "oozie_server:11000/oozie/v2/sla?filter=parent_id=0000178-210914145432787-oozie-oozi-C"

Result :

Monitoring jobs with Oozie Decision Control Nodes

Another feature Oozie provides alongside SLA is control over HDFS files using decision control nodes.

Since we have a heavy load of ingestion workflows, one of the quality checks applied to incoming data flows is size monitoring.

Oozie SLA only alerts on time-related anomalies. Receiving an empty data file is unusual and must be watched for, but it is not caught by Oozie SLA.

A decision node allows a workflow to choose its next execution path. It is very similar to a switch-case statement: a list of predicate-transition pairs plus a default transition.

Here's an example of how a decision node can be used to control data file sizes:
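The example originally embedded here is not reproduced; as an illustrative sketch (paths, node names and the minBytes parameter are hypothetical), the decision node uses the HDFS EL function fs:fileSize, which returns a size in bytes:

```xml
<!-- Route on the size of an ingested HDFS file. -->
<decision name="check-input-size">
  <switch>
    <!-- empty or suspiciously small file: alert instead of processing -->
    <case to="alert-empty-input">
      ${fs:fileSize(concat(inputDir, '/data.csv')) lt minBytes}
    </case>
    <default to="process-input"/>
  </switch>
</decision>
```

The alert-empty-input node would typically be an email-action, reusing the notification tip mentioned earlier.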

Conclusion

This article told the story of a Datalake project team. The growth of such projects demands a fine-grained control, tracking and monitoring mindset. Within a compartmentalized and Kerberized environment, there was little choice but to rely on the tools available. Over time, Apache Oozie has shown that it is definitely worth the pain. Along with its main functionality of creating and scheduling workflows, it offers features that ease the control and monitoring of these workflows; Oozie SLA and Oozie decision control nodes are two of them. The next step would be to enhance this monitoring system so that it covers every activity of Data Observability.
