BT

Introduction to Oozie

Posted by Boris Lublinsky, Michael Segel on Jul 12, 2011 |

Tasks performed in Hadoop sometimes require multiple Map/Reduce jobs to be chained together to complete its goal. [1] Within the Hadoop ecosystem, there is a relatively new component Oozie [2], which allows one to combine multiple Map/Reduce jobs into a logical unit of work, accomplishing the larger task. In this article we will introduce Oozie and some of the ways it can be used.

What is Oozie ?

Oozie is a Java Web-Application that runs in a Java servlet-container - Tomcat and uses a database to store:

  • Workflow definitions
  • Currently running workflow instances, including instance states and variables

Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Direct Acyclic Graph), specifying a sequence of actions execution. This graph is specified in hPDL (a XML Process Definition Language).

hPDL is a fairly compact language, using a limited amount of flow control and action nodes. Control nodes define the flow of execution and include beginning and end of a workflow (start, end and fail nodes) and mechanisms to control the workflow execution path ( decision, fork and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task. Oozie provides support for the following types of actions: Hadoop map-reduce, Hadoop file system, Pig, Java and Oozie sub-workflow (SSH action is removed as of Oozie schema 0.2).

All computation/processing tasks triggered by an action node are remote to Oozie - they are executed by Hadoop Map/Reduce framework. This approach allows Oozie to leverage existing Hadoop machinery for load balancing, fail over, etc. The majority of these tasks are executed asynchronously (the exception is the file system action that is handled synchronously). This means that for most types of computation/processing tasks triggered by workflow action, the workflow job has to wait until the computation/processing task completes before transitioning to the following node in the workflow. Oozie can detect completion of computation/processing tasks by two different means, callbacks and polling. When a computation/processing tasks is started by Oozie, Oozie provides a unique callback URL to the task, the task should invoke the given URL to notify its completion. For cases that the task failed to invoke the callback URL for any reason (i.e. a transient network failure) or when the type of task cannot invoke the callback URL upon completion, Oozie has a mechanism to poll computation/processing tasks for completion.

Oozie workflows can be parameterized (using variables like ${inputDir} within the workflow definition). When submitting a workflow job values for the parameters must be provided. If properly parameterized (i.e. using different output directories) several identical workflow jobs can concurrently.

Some of the workflows are invoked on demand, but the majority of times it is necessary to run them based on regular time intervals and/or data availability and/or external events. The Oozie Coordinator system allows the user to define workflow execution schedules based on these parameters. Oozie coordinator allows to model workflow execution triggers in the form of the predicates, which can reference to data, time and/or external events. The workflow job is started after the predicate is satisfied.

It is also often necessary to connect workflow jobs that run regularly, but at different time intervals. The outputs of multiple subsequent runs of a workflow become the input to the next workflow. Chaining together these workflows result it is referred as a data application pipeline. Oozie coordinator support creation of such data Application pipelines.

Installing Oozie

Oozie can be installed, on existing Hadoop system, either from a tarball , RPM or Debian Package. Our Hadoop installation is Cloudera’s CDH3, which already contains Oozie. As a result, we just used yum to pull it down and perform the installation on an edge node[1].There are two components in Oozie distribution - Oozie-client and Oozie-server. Depending on the size of your cluster, you may have both components on the same edge server or on separate machines. The Oozie server contains the components for launching and controlling jobs, while the client contains the components for a person to be able to launch Oozie jobs and communicate with the Oozie server.

For more information on the installation process please using Cloudera distribution, visit the Cloudera website[2]

Note: In addition to the installation process, its recommended to add the shell variable OOZIE_URL

(export OOZIE_URL=http://localhost:11000/oozie)

to your .login, .kshrc, or the shell startup file of your choice.

A Simple Example

To show Oozie usage let us turn to a simple example. We have two Map/Reduce jobs[3] - one that is doing initial ingestion of the data and the second one merging data of the given type. The actual ingestion needs to execute initial ingestion and then merge data for two of the types - Lidar and Multicam. To automate this process we have created a simple Oozie Workflow (Listing 1)

<!--
Copyright (c) 2011 NAVTEQ! Inc. All rights reserved.
NGMB IPS ingestor Oozie Script
-->
<workflow-app xmlns='uri:oozie:workflow:0.1' name='NGMB-IPS-ingestion'>
 <start to='ingestor'/>
 <action name='ingestor'>
  <java>
   <job-tracker>${jobTracker}</job-tracker>
   <name-node>${nameNode}</name-node>
   <configuration>
    <property>
     <name>mapred.job.queue.name</name>
     <value>default</value>
    </property>
   </configuration>
   <main-class>com.navteq.assetmgmt.MapReduce.ips.IPSLoader</main-class>
                    <java-opts>-Xmx2048m</java-opts>
   <arg>${driveID}</arg>
  </java>
 <ok to="merging"/>
 <error to="fail"/>
 </action>
  <fork name="merging">
   <path start="mergeLidar"/>
   <path start="mergeSignage"/>
  </fork>
 <action name='mergeLidar'>
  <java>
   <job-tracker>${jobTracker}</job-tracker>
   <name-node>${nameNode}</name-node>
   <configuration>
    <property>
     <name>mapred.job.queue.name</name>
     <value>default</value>
    </property>
   </configuration>
   <main-class>com.navteq.assetmgmt.hdfs.merge.MergerLoader</main-class>
                    <java-opts>-Xmx2048m</java-opts>
   <arg>-drive</arg>
   <arg>${driveID}</arg>
   <arg>-type</arg>
   <arg>Lidar</arg>
   <arg>-chunk</arg>
   <arg>${lidarChunk}</arg>
  </java>
  <ok to="completed"/>
  <error to="fail"/>
 </action>
 <action name='mergeSignage'>
  <java>
   <job-tracker>${jobTracker}</job-tracker>
   <name-node>${nameNode}</name-node>
   <configuration>
    <property>
     <name>mapred.job.queue.name</name>
     <value>default</value>
    </property>
   </configuration>
   <main-class>com.navteq.assetmgmt.hdfs.merge.MergerLoader</main-class>
                    <java-opts>-Xmx2048m</java-opts>
   <arg>-drive</arg>
   <arg>${driveID}</arg>
   <arg>-type</arg>
   <arg>MultiCam</arg>
   <arg>-chunk</arg>
   <arg>${signageChunk}</arg>
  </java>
  <ok to="completed"/>
  <error to="fail"/>
 </action>
 <join name="completed" to="end"/>
 <kill name="fail">
  <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
 </kill>
 <end name='end'/>
</workflow-app> 

Listing 1: Simple Oozie Workflow

This workflow defines 3 actions - ingestor, mergeLidar and mergeSignage. Each action is implemented as a Map/Reduce[4] job. The workflow starts with the start node, which transfer control to the Ingestor action. Once the ingestor step completes, a fork control node [4] is invoked, which will start execution of mergeLidar and mergeSignage in parallel[5]. Once both actions complete, the join control node [5] is invoked[6]. On successful completion of join node, the control is passed to the end node, which ends the process.

Once workflow is created it has to be deployed correctly. A typical Oozie deployment is a HDFS directory, containing workflow.xml (Listing 1), config-default.xml and a lib subdirectory, containing jar files of classes used by workflow actions.

(Click on the image to enlarge it.)

Figure 1: Oozie deployment

A config-default.xml file is optional and typically contains workflow parameters common to all workflow instances. A simple example of config-default.xml is presented at Listing 2.

<configuration>
    <property>
        <name>jobTracker</name>
        <value>sachicn003:2010</value>
    </property>
    <property>
        <name>nameNode</name>
        <value>hdfs://sachicn001:8020</value>
    </property>
    <property>
        <name>queueName</name>
        <value>default</value>
    </property>
</configuration>

Listing 2: Config-default.xml

Once a workflow is deployed, an Oozie provides a command line utility [5] can be used to submit, start and manipulate a workflow. This utility typically runs on the edge node of Hadoop cluster[7] and requires a jobs properties file (see sidebar – configuring workflow properties) - Listing 3.

oozie.wf.application.path=hdfs://sachicn001:8020/user/blublins/workflows/IPSIngestion
jobTracker=sachicn003:2010
nameNode=hdfs://sachicn001:8020

Listing 3: Job properties file

With job properties in place, a command presented at Listing 4 can be used to run an Oozie workflow.

oozie job –oozie http://sachidn002.hq.navteq.com:11000/oozie/ -D driveID=729-pp00002-2011-02-08-09-59-34 -D lidarChunk=4 -D signageChunk=20 -config job.properties –run

Listing 4: Run workflow command

Configuring Workflow Properties

There is some overlap between config-default.xml, jobs properties file and job arguments that can be passed to Oozie as part of command line invocations. Although documentation does not describe clearly when to use which, the overall recommendation is as follows:

  • Use config-default.xml for defining parameters that never change for a given workflow
  • Use jobs properties for the parameters that are common for a given deployment of a workflow
  • Use command line arguments for the parameters that are specific for a given workflow invocation.

The way Oozie processes these three sets of the parameters is as follows:

  • Use all of the parameters from command line invocation
  • If ano of the parameters can not be resolved there, try to resolve them using job config
  • Once everything else fails, try using config-default.xml

 

An Oozie console (Figure 2) can be used to watch the progress and results of workflow execution.

(Click on the image to enlarge it.)

Figure 2: Oozie console

Oozie console can also be used to get details of the job execution, for example job log[8] (Figure 3)

(Click on the image to enlarge it.)

Figure 3: Oozie console – Job log

Programmatic Workflow invocation

Although command line interface, described above works well for the manual invocation of Oozie, it is sometimes is advantageous to invoke Oozie programmatically. This can be useful when Oozie worklows are either part of a specific application or a larger enterprise process. Such programmatic invocation can be implemented either using Oozie Web Services APIs [6] or Oozie Java client APIs [7]. A very simple Oozie java client invoking described above process is presented at Listing 5

package com.navteq.assetmgmt.oozie;

import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;
import org.apache.oozie.client.WorkflowJob;
import org.apache.oozie.client.WorkflowJob.Status;

public class WorkflowClient {

         private static String OOZIE_URL = "http://sachidn002.hq.navteq.com:11000/oozie/";
         private static String JOB_PATH = "hdfs://sachicn001:8020/user/blublins/workflows/IPSIngestion";
         private static String JOB_Tracker = "sachicn003:2010";
         private static String NAMENode = "hdfs://sachicn001:8020";

         OozieClient wc = null;

         public WorkflowClient(String url){

                         wc = new OozieClient(url);
                 }

                 public String startJob(String wfDefinition, List<WorkflowParameter> wfParameters)
                                                                                throws OozieClientException{

                          // create a workflow job configuration and set the workflow application path
                  Properties conf = wc.createConfiguration();
                  conf.setProperty(OozieClient.APP_PATH, wfDefinition);

                  // setting workflow parameters
                  conf.setProperty("jobTracker", JOB_Tracker);
                  conf.setProperty("nameNode", NAMENode);
                  if((wfParameters != null) && (wfParameters.size() > 0)){
                          for(WorkflowParameter parameter : wfParameters)
                           conf.setProperty(parameter.getName(), parameter.getValue());
                  }
                  // submit and start the workflow job
                  return wc.run(conf);
                 }

                 public Status getJobStatus(String jobID) throws OozieClientException{

                          WorkflowJob job = wc.getJobInfo(jobID);
                          return job.getStatus();
                 }

                 public static void main(String[] args) throws OozieClientException, InterruptedException{

                          // Create client
                          WorkflowClient client = new WorkflowClient(OOZIE_URL);
                          // Create parameters
                          List<WorkflowParameter> wfParameters = new LinkedList<WorkflowParameter>();
                          WorkflowParameter drive = new WorkflowParameter("driveID","729-pp00004-2010-09-01-09-46");
                          WorkflowParameter lidar = new WorkflowParameter("lidarChunk","4");
                          WorkflowParameter signage = new WorkflowParameter("signageChunk","4");
                          wfParameters.add(drive);
                          wfParameters.add(lidar);
                          wfParameters.add(signage);
                          // Start Oozing
                          String jobId = client.startJob(JOB_PATH, wfParameters);
                          Status status = client.getJobStatus(jobId);
                          if(status == Status.RUNNING)
                                     System.out.println("Workflow job running");
                          else
                                     System.out.println("Problem starting Workflow job");
                  }
       }

Listing 5: Simple Oozie java client

Here, first workflow client is initialized with Oozie server URL. Once initialization is complete, this client can be used to submit and start jobs (startJob method), get the status of the running job (getStatus method) and other operations.

Building java action, passing parameters to workflow

In our examples so far we have shown how to pass parameters to Java node using <arg> tag. Considering that java node is the main approach to introduce custom calculations to Oozie, it is also important to be able to pass data from Java node to Oozie.

According to java node documentation [3], the “capture-output” element can be used to propagate values from java node back into Oozie context. These values can then be accessed via EL-functions by other steps in the workflow. The return values need to be written out as a java properties format file. The name for such properties files can be obtained via a System property specified by the constant “JavaMainMapper.OOZIE_JAVA_MAIN_CAPTURE_OUTPUT_FILE” .A simple example demonstrating how to do this is presented at Listing 6.

package com.navteq.oozie;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Properties;

public class GenerateLookupDirs {

        /**
        * @param args
        */
        public static final long dayMillis = 1000 * 60 * 60 * 24;
        private static final String OOZIE_ACTION_OUTPUT_PROPERTIES = "oozie.action.output.properties";

        public static void main(String[] args) throws Exception {
                Calendar curDate = new GregorianCalendar();
                int year, month, date;
                String propKey, propVal;

                String oozieProp = System.getProperty(OOZIE_ACTION_OUTPUT_PROPERTIES);
                if (oozieProp != null) {
                         File propFile = new File(oozieProp);
                         Properties props = new Properties();

                         for (int i = 0; I < 8; ++i)
                         {
                                    year = curDate.get(Calendar.YEAR);
                                    month = curDate.get(Calendar.MONTH) + 1;
                                    date = curDate.get(Calendar.DATE);
                                    propKey = "dir"+i;
                                    propVal = year + "-" +
                                            (month < 10 ? "0" + month : month) + "-" +
                                            (date < 10 ? "0" + date : date);
                                    props.setProperty(propKey, propVal);
                                    curDate.setTimeInMillis(curDate.getTimeInMillis() - dayMillis);
                         }
                         OutputStream os = new FileOutputStream(propFile);
                         props.store(os, "");
                         os.close();
                 } else
                         throw new RuntimeException(OOZIE_ACTION_OUTPUT_PROPERTIES
                                        + " System property not defined");
        }
}

Listing 6: Passing parameters to Oozie

In this example we assume that there is a directory in HDFS for every day. So this class first gets the current date and then calculates previous 7 closest date (including current) and passes directories names back to Oozie.

Conclusion

In this article we have introduced Oozie – a workflow engine for Hadoop and provided a simple example of its usage. In the next article we will look at a more complex example, allowing us to discuss more Oozie features.

Acknowledgements

Authors are thankful to our Navteq colleague Gregory Titievsky for providing some of the examples.

About the Authors

Boris Lublinsky is principal architect at NAVTEQ, where he is working on defining architecture vision for large data management and processing and SOA and implementing various NAVTEQ projects. He is also an SOA editor for InfoQ and a participant of SOA RA working group in OASIS. Boris is an author and frequent speaker, his most recent book "Applied SOA".

Michael Segel has spent the past 20+ years working with customers identifying and solving their business problems. Michael has worked in multiple roles, in multiple industries. He is an independent consultant who is always looking to solve any challenging problems. Michael has a Software Engineering degree from the Ohio State University.


[1] An edge node is a machine that has the Hadoop libraries installed , yet is not part of the actual cluster. It is for applications that have the ability to connect to the cluster and host ancillary services and end user applications which directly access the cluster.

[2] See this link for Oozie installation. 

[3] The detail of these jobs are irrelevant for this article, so we do not present it here

[4] Map/Reduce jobs can be implemented in Oozie in two different ways – as true Map/Reduce action[2], where you specifiy Mapper and Reducer classes and their configuration or as java action [3] where you specify the class that is starting Map/Reduce job using Hadoop APIs. Because we already had all java mains that were using Hadoop APIs and were implementing some additional functionality, we have chosen the second approach.

[5] Oozie guarantees that both actions will be submitted to Job tracker in parallel. The actual parallelism in execution is outside of Oozie’ control and depends on requirements of the job, capacity of the cluster and scheduler used by your Map/Reduce installation.

[6] The role of the join action is to synchronize multiple threads of parallel execution started by fork actions. If all execution threads started by fork complete successfully, join action is waiting for all of them to complete. If at least on thread of execution fails, the kill node will “kill” remaining running threads.

[7] This node does not have to be the one, where Oozie is installed.

[8] While Oozie job log contains details of the workflow execution, in order to see details of action execution it is necessary to switch to Hadoop Map/Reduce Administration page.

Hello stranger!

You need to Register an InfoQ account or to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Still using XML? by Dean Wampler

Oozie has some nice capabilities, but it also has a few flaws that I've noticed from my own use of it.

1) XML is not a scripting language. Get over it.
2) They deprecated the use of calling out to native apps, like shell scripts and commands. It's annoying to write Java code for tasks that are best handled by shell commands.
3) The Web UI is way too "clicky". It takes 1/2 dozen or so clicks to get to task logs with the stack trace you need. Part of the blame here belongs to Hadoop's UI, however.

I think the community would benefit from management tools that leverage what the DevOps community is already using, like Chef, Puppet, etc., etc.

Re: Still using XML? by Michael Segel

While I don't disagree that Oozie is immature, I don't believe that you would want to use Chef or Puppet (your examples) for handling job flow control. Oozie fills the niche in being able to string multiple map/reduce jobs to complete a unit of work.

It is definitely primitive and there is room for improvement and we hope that our series of articles can help provide some basic ideas on how to make the most of Oozie. I agree with your #2 and #3 comments and I hope that in the future these issues will be better addressed.

I have spoken with some of the Cloudera folks and there are some planned improvements which should be available in their next release, CDH3u1.

Why another WF Engine? by Richard Clayton

Curious as to why you guys chose to create your own Workflow engine, vice adapting something like jBPM/Drools Flow or Activiti? Was there something missing from existing solutions? License issues?

Re: Why another WF Engine? by Boris Lublinsky

Well,
we did not - Yahoo/Apache/Cloudera did.
In reality, Oozie is very close to jBPM, but provides a very tight integration to Hadoop. Theoretically they could of just extend jBPM to provide Hadoop support, but the decided to extend the ideas and re-implement the engine

Re: Why another WF Engine? by Richard Clayton

Boris,

I'm sorry; I meant to imply why Yahoo, Apache and Cloudera did! I was curious since I'm working on an analytical problem right now that requires Workflow coordination between components in and out of Hadoop.

Thank you for your response.

Re: Why another WF Engine? by Boris Lublinsky

This is a slightly different question, for which there is quite simple answer. All of Oozie functionality is exposed as a set of REST services. Alternatively you can use java APIs wrapping REST services calls (see Listing 5) in the article. This makes invocation of Oozie from any workflow engine very straightforward. For a callback from Oozie back to the invoker you can use custom action (wait for the third installment). As a result, mu suggestion will be to use you favorite workflow engine inside the enterprise and integrate it to Oozie for orchestrating hadoop portion of your code

Re: Why another WF Engine? by Michael Segel

Richard,
While neither Boris or I can speak for Cloudera, etc...

Looking at jBPM, I don't believe that it is released under Apache License. License incompatibility is a fairly big issue.

Supporting compression was an issue until a snappy port was released under Apache license.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

7 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2013 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT