Oozie : Scheduler for Hadoop

Neha Kumari
Jul 8, 2018


The demand for big data processing in today's world hardly needs mentioning. Many companies use big data analytics and learning to get closer to customer needs, for example through predictive analytics and personalised recommendations. Hadoop, an open-source software platform for scalable, distributed computing over large volumes of data, provides an efficient way to achieve this.

But most of the time it is not possible to do all the required processing or computation within a single MapReduce, Hive, Pig or Cascading job. Multiple jobs often need to be chained together, with each one producing and consuming intermediate data, and their flow of execution has to be coordinated. To solve this, Yahoo implemented a system in 2008 that can run multistage jobs consisting of MapReduce, Pig, etc., and named it Oozie (which in Burmese means elephant keeper). It was later open sourced in 2010.

Apache Oozie is a workflow scheduler for Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work, and it can also execute jobs that are scheduled to run periodically. An Oozie workflow is a collection of nodes, and in each node you specify what kind of operation you want to perform. Nodes can be of two types :

  • Control flow nodes : To control the workflow execution path
  • Action nodes : To trigger the processing/computation

Control flow nodes can be :

  • Start node
  • End node
  • Kill node
  • Decision node
  • Fork and Join node

Action nodes can be used to trigger MapReduce, HDFS, Java, Shell and various other kinds of jobs. Action nodes and control flow nodes are arranged by control dependency: a node runs only after the node that transitions to it has finished. This makes an Oozie workflow a DAG (directed acyclic graph) of nodes.
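To make the node types concrete, here is a minimal sketch in Python that generates a small workflow definition with a start node, one action node, a kill node and an end node. The element names follow the Oozie workflow XML format as I understand it. The function name, node names, schema version and HDFS path are placeholders chosen for illustration and should be adapted to your own setup.

```python
import xml.etree.ElementTree as ET


def build_workflow(app_name: str, output_dir: str) -> str:
    """Return a minimal Oozie-style workflow definition as an XML string."""
    # Schema version 0.5 is an assumption; use the version your Oozie server supports.
    root = ET.Element(
        "workflow-app",
        {"name": app_name, "xmlns": "uri:oozie:workflow:0.5"},
    )

    # Control flow node: where execution begins.
    ET.SubElement(root, "start", {"to": "make-output-dir"})

    # Action node: an HDFS (fs) action that creates a directory.
    action = ET.SubElement(root, "action", {"name": "make-output-dir"})
    fs = ET.SubElement(action, "fs")
    ET.SubElement(fs, "mkdir", {"path": output_dir})
    # Each action declares where to go on success and on failure.
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})

    # Control flow node: ends the workflow with an error.
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Action failed"

    # Control flow node: ends the workflow successfully.
    ET.SubElement(root, "end", {"name": "end"})

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "${nameNode}" is a workflow parameter that Oozie would resolve at run time.
    print(build_workflow("demo-wf", "${nameNode}/tmp/demo-output"))
```

Every transition (`to="..."`) points forward to another named node, which is what keeps the graph acyclic. A real workflow would usually have several actions, and possibly decision, fork and join nodes between start and end.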

There are two types of Oozie jobs:

  • Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time or by data availability (time/data driven).
  • Oozie Workflow jobs are DAGs that specify a sequence of actions to execute.

Figure: Sample DAG of an Oozie workflow

Advantages :

  1. Monitoring and SLAs
  2. Jobs can be triggered in both a time-driven and a data-driven fashion
  3. Parameterisation of workflow triggers
  4. Oozie is a highly scalable workflow management system that supports several thousand concurrent jobs.
  5. Oozie provides good visibility into jobs and coordinators via a simple UI.
  6. It is well suited to the Hadoop ecosystem, since it is primarily designed to handle Hadoop components.
  7. Oozie has great community support and rich documentation.

Apart from the UI, which gives visibility into jobs and coordinators, Oozie also provides command line utilities and exposes REST APIs to launch, control and monitor Oozie jobs. These are very helpful in day-to-day work with Oozie jobs. Below are a few commands and examples :

CLI :

  • To kill/suspend/resume a coordinator or a workflow job :
cli:~$ oozie job -[kill/suspend/resume] {job-id} -oozie http://gw-url/oozie/
  • To check logs/info of a job :
cli:~$ oozie job -[log/info] {job-id} -oozie http://gw-url/oozie/
  • To get the list of coordinators :
cli:~$ oozie jobs -oozie http://gw-url/oozie/ -jobtype coordinator
  • To get the list of running jobs :
cli:~$ oozie jobs -oozie http://gw-url/oozie/ -filter status=RUNNING
  • To get SLA of jobs :
cli:~$ oozie sla -oozie http://gw-url/oozie/ -len 4
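
The bracketed forms above stand for one concrete command per action. As a rough sketch, assuming the `oozie` client is installed and on the PATH, a small Python helper could build and run these commands from a script. The function names are made up for illustration, and `http://gw-url/oozie/` is the same placeholder gateway URL used in the commands above.

```python
import subprocess
from typing import List

OOZIE_URL = "http://gw-url/oozie/"  # placeholder gateway URL, as in the CLI examples

# Sub-commands of `oozie job` that take a single job id.
JOB_ACTIONS = {"kill", "suspend", "resume", "log", "info"}


def job_command(action: str, job_id: str, oozie_url: str = OOZIE_URL) -> List[str]:
    """Build the argument list for `oozie job -<action> <job-id>`."""
    if action not in JOB_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["oozie", "job", "-oozie", oozie_url, f"-{action}", job_id]


def run(command: List[str]) -> str:
    """Run the command and return its standard output (raises on a non-zero exit)."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run(job_command("info", job_id))` would print the same report as running `oozie job -info` by hand, where `job_id` is the id of one of your own workflow or coordinator jobs.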

REST APIs :

  • To post a job : POST http://gw-url/oozie/v1/jobs
  • To get all jobs which are in running status : GET http://gw-url/oozie/v2/jobs?filter=status%3DRUNNING
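
As an illustration of calling the REST API from code, here is a minimal Python sketch that builds the jobs URL and fetches the list of jobs in a given status. It assumes the jobs endpoint accepts a URL-encoded `filter` query parameter (for example `status=RUNNING`) and a `jobtype` parameter, which is how I read the Oozie Web Services API documentation. The function names are hypothetical, and `gw-url` is the same placeholder host as above.

```python
import json
import urllib.parse
import urllib.request


def jobs_url(base_url: str, status: str, jobtype: str = "wf", api_version: str = "v2") -> str:
    """Build the URL that lists jobs in the given status."""
    # urlencode escapes the "=" inside the filter value as %3D.
    query = urllib.parse.urlencode({"filter": f"status={status}", "jobtype": jobtype})
    return f"{base_url.rstrip('/')}/{api_version}/jobs?{query}"


def list_jobs(base_url: str, status: str) -> dict:
    """Fetch the job list and return the decoded JSON response."""
    with urllib.request.urlopen(jobs_url(base_url, status), timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call list_jobs() against a real Oozie server.
    print(jobs_url("http://gw-url/oozie/", "RUNNING"))
```

Keeping the URL construction separate from the network call makes it easy to check the request in isolation before pointing it at a live server.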

Popular alternatives to Oozie that are widely explored in the industry these days are:

  1. Apache Airflow
  2. Azkaban
  3. Luigi
