Oozie workflows are defined in HPDL (Hadoop Process Definition Language), a definition language that uses XML syntax to describe workflows of Hadoop jobs. Composing workflows in bare XML poses several difficulties:
- A workflow is a graph, or more precisely a DAG (Directed Acyclic Graph): a collection of nodes in which each node points to one or more subsequent nodes. The more complex the graph, the harder the workflow is to write and maintain in XML. Changing the graph structure by adding, deleting or moving nodes quickly becomes tedious and error-prone. Moreover, the sequences in the graph are often hard to follow in bare XML because of its inherently poor readability.
- Developing workflows is not an everyday task. The workflow and its constituent nodes have many attributes and parameters, and these can change from one Oozie release to the next. As a result, the syntax is hard to memorize, and developers end up consulting the bulky documentation repeatedly or copying and pasting from old workflows, which is an error-prone practice. This makes writing new workflows an unpleasant experience, especially for newcomers.
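To make the first point concrete, here is a minimal sketch of what inserting a single node between two existing actions involves in raw XML. The node names (`extract`, `validate`, `load`, `fail`) and the workflow name are hypothetical, and the action bodies are replaced by comments, so the fragment illustrates the transitions only and is not a deployable workflow.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.1" name="hypothetical-wf">
  <start to="extract"/>
  <action name="extract">
    <!-- action body omitted -->
    <!-- this transition pointed to "load" before the insertion
         and had to be re-pointed by hand -->
    <ok to="validate"/>
    <error to="fail"/>
  </action>
  <!-- newly inserted node: it must name its own successors -->
  <action name="validate">
    <!-- action body omitted -->
    <ok to="load"/>
    <error to="fail"/>
  </action>
  <action name="load">
    <!-- action body omitted -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Every edit of this kind touches at least two places in the file: the new node itself and each transition that should now lead to it. Nothing in the text ties the value of a `to` attribute to the `name` of its target, so a stale or misspelled reference is easy to miss when reading the XML.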
The Oozie Eclipse plugin is an editor for editing Apache Oozie workflows graphically inside Eclipse. As noted above, writing workflows in XML is tedious and the syntax is hard to memorize. With the graphical editor, writing a workflow becomes a matter of dragging and dropping nodes, connecting them with lines, and filling in property sheets. This reduces development time and lets developers focus on logic rather than syntax.
The graphical editor also makes workflows easier to read and maintain. Its ease of use and intuitiveness shorten the developer’s learning curve and the time to production.
Components of the Editor’s View
Workflow area: the white canvas on which workflows are drawn. A workflow is composed of workflow nodes connected by workflow connections.
Palette: holds nodes of all types along with the editing tools. There are three editing tools: the selection tool, the marquee selection tool, and the connection creation tool. Workflow nodes are grouped into three categories: control nodes, action nodes and extended action nodes.
Properties sheet: where the properties and attributes of the items selected in the workflow area are displayed and edited. Property names match their counterparts in the XML. However, there are some special properties:
- Position: an editor-specific node property that sets the node’s position in the workflow area.
- Type: a read-only node property that shows the node’s type in the properties sheet.
- Schema Version: a property of workflows and of extended action nodes. It specifies the XML schema version used for the workflow or the extended action node.
- XML: a special property of the Custom Action node. It specifies the XML content of the custom action node.
Outline page: provides a summary of the workflow’s content. There are two summary options: outline and overview. The outline displays a tree of the constituent nodes with their connections, while the overview displays a thumbnail of the workflow diagram. Both provide shortcuts for finding and accessing parts of the workflow.
Action bar: provides shortcuts for useful commands such as zooming, aligning nodes, grid and snapping options, and undo and redo.
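As an illustration of the Schema Version property described above, the sketch below shows where schema versions appear in workflow XML: as the version suffix of the `xmlns` namespace on the `workflow-app` element and on an extended action element. It assumes that the property maps onto these namespaces and that Hive counts among the extended action nodes. The workflow name, the node names and the script name `script.q` are chosen only for illustration.

```xml
<!-- Workflow schema version: the suffix of the workflow namespace -->
<workflow-app xmlns="uri:oozie:workflow:0.2" name="schema-version-sketch">
  <start to="hive-node"/>
  <action name="hive-node">
    <!-- An extended action carries its own namespace and version,
         independent of the workflow's schema version -->
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobtracker}</job-tracker>
      <name-node>${namenode}</name-node>
      <script>script.q</script>
    </hive>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <kill name="kill">
    <message>Hive action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Because the two versions are declared separately, a workflow and each of its extended actions can be validated against different schema releases.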
Example
To install the plugin and get started with it, the following posts on the plugin’s blog are helpful:
- Install Oozie Eclipse Plugin
- Getting started with Oozie Eclipse Plugin
- Editing workflows: the first steps
Here we recreate the fork-join example from the Apache Oozie documentation using the editor.
First, have a look at the generated XML file and try to work out what the workflow does overall.
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.1" name="example-forkjoinwf">
  <start to="firstjob"/>
  <action name="firstjob">
    <map-reduce>
      <job-tracker>${jobtracker}</job-tracker>
      <name-node>${namenode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.apache.hadoop.example.IdMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.apache.hadoop.example.IdReducer</value>
        </property>
        <property>
          <name>mapred.map.tasks</name>
          <value>1</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${input}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/usr/foo/${wf:id()}/temp1</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="fork"/>
    <error to="kill"/>
  </action>
  <fork name="fork">
    <path start="secondjob"/>
    <path start="thirdjob"/>
  </fork>
  <action name="secondjob">
    <map-reduce>
      <job-tracker>${jobtracker}</job-tracker>
      <name-node>${namenode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.apache.hadoop.example.IdMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.apache.hadoop.example.IdReducer</value>
        </property>
        <property>
          <name>mapred.map.tasks</name>
          <value>1</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>/usr/foo/${wf:id()}/temp1</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/usr/foo/${wf:id()}/temp2</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="join"/>
    <error to="kill"/>
  </action>
  <action name="thirdjob">
    <map-reduce>
      <job-tracker>${jobtracker}</job-tracker>
      <name-node>${namenode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.apache.hadoop.example.IdMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.apache.hadoop.example.IdReducer</value>
        </property>
        <property>
          <name>mapred.map.tasks</name>
          <value>1</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>/usr/foo/${wf:id()}/temp1</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/usr/foo/${wf:id()}/temp3</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="join"/>
    <error to="kill"/>
  </action>
  <join name="join" to="finaljob"/>
  <action name="finaljob">
    <map-reduce>
      <job-tracker>${jobtracker}</job-tracker>
      <name-node>${namenode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.apache.hadoop.example.IdMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.apache.hadoop.example.IdReducer</value>
        </property>
        <property>
          <name>mapred.map.tasks</name>
          <value>1</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>/usr/foo/${wf:id()}/temp2,/usr/foo/${wf:id()}/temp3</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${output}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
  </action>
  <kill name="kill">
    <message>Map/Reduce failed, error message[${wf:errorMessage()}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
<!--
<workflow>
  <node name="start" x="11" y="144"/>
  <node name="firstjob" x="126" y="144"/>
  <node name="fork" x="229" y="144"/>
  <node name="secondjob" x="338" y="64"/>
  <node name="thirdjob" x="338" y="232"/>
  <node name="join" x="461" y="144"/>
  <node name="finaljob" x="554" y="144"/>
  <node name="kill" x="653" y="232"/>
  <node name="end" x="653" y="144"/>
</workflow>-->
It is hard to work out what this workflow is supposed to do, especially for novice users. Usually, one would need a companion workflow diagram like the one in this article. Enough of the XML! In the editor, the same workflow looks like the following image:
The workflow structure is now crystal clear. It executes 4 MapReduce jobs in 3 steps: “firstjob”, then “secondjob” and “thirdjob” in parallel, and finally “finaljob”. A look at the properties sheet of each node shows that the output of the job(s) in one step is used as input for the job(s) in the next.
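The same structure can also be sketched in text by stripping the generated XML down to its transitions. The fragment below is only a reading aid derived from the listing above: each `map-reduce` body is replaced by a comment noting the job's input and output directories, so it is not a deployable workflow.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.1" name="example-forkjoinwf">
  <start to="firstjob"/>

  <!-- Step 1: reads ${input}, writes /usr/foo/${wf:id()}/temp1 -->
  <action name="firstjob">
    <ok to="fork"/>
    <error to="kill"/>
  </action>

  <!-- Step 2: two jobs run in parallel, both reading temp1 -->
  <fork name="fork">
    <path start="secondjob"/>
    <path start="thirdjob"/>
  </fork>
  <!-- secondjob writes /usr/foo/${wf:id()}/temp2 -->
  <action name="secondjob">
    <ok to="join"/>
    <error to="kill"/>
  </action>
  <!-- thirdjob writes /usr/foo/${wf:id()}/temp3 -->
  <action name="thirdjob">
    <ok to="join"/>
    <error to="kill"/>
  </action>
  <join name="join" to="finaljob"/>

  <!-- Step 3: reads temp2 and temp3, writes ${output} -->
  <action name="finaljob">
    <ok to="end"/>
  </action>

  <kill name="kill">
    <message>Map/Reduce failed, error message[${wf:errorMessage()}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Read this way, the chain of directories (`${input}` to `temp1`, `temp1` to `temp2` and `temp3`, then both to `${output}`) mirrors the three steps of the diagram.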
Here is a summary of the steps to create this example (assuming the editor has been installed successfully):
- From the menu bar, click File -> New -> Other…
- The “New” wizard opens. From the “Apache Oozie” category, choose the “Apache Oozie Workflow” wizard and press “Next >”.
- Choose the project and folder in which to place the workflow file, enter “example-forkjoinwf.xml” as the file name, then press “Finish”. A new workflow file is created and the editor opens with only the start and end nodes in the workflow area.
- From the palette, select the “MapReduce” node under the “Action Nodes” category and insert 4 nodes into the workflow area.
- Add one instance each of the “Fork”, “Join” and “Kill” nodes, which can be found in the palette under the “Control Nodes” category.
- Rename each node either by selecting it and changing the “Name” property in the “Properties” sheet, or by selecting it, pressing “F2” and editing the name in place.
- To connect the nodes, use the “Connection” creation tool from the palette: click an output terminal of a source node, then the input terminal of a target node. Hover over a terminal with the mouse pointer to see its name.
- Change the “Name” and “Schema Version” properties of the workflow by clicking on an empty spot in the workflow area and editing the corresponding entries in the “Properties” sheet:
- Populate the properties of each of the nodes “firstjob”, “secondjob”, “thirdjob”, “finaljob” and “kill”. For instance, the “Properties” sheet of “firstjob” should look like the following image:
- Save the file; the XML is generated automatically within the same file.
Note: You can view the above workflow XML in the Oozie editor by placing it in an XML file in one of the projects open in Eclipse, then right-clicking the file and choosing “Open With -> Oozie Workflow Editor”.
Conclusion
Adopting an easy-to-use graphical workflow editor such as this plugin is strongly recommended: it helps overcome the difficulties described above and brings the following benefits:
- Since a workflow is a graph, it is much more intuitive to edit it graphically. In practice, developers first draw the graph in their minds, on a piece of paper, on a board or in an external design tool, and then transcribe it into XML. The graphical tool does this transcription automatically and silently.
- Composing Oozie workflows becomes much simpler: a matter of dragging and dropping nodes, connecting them with lines, and filling in a sheet of properties.
- Development time is significantly reduced, leaving more time for the logic rather than the syntax. That in turn makes room for writing more complex workflows and advanced use cases.
- Workflows become easier to compose, read and maintain. Everyone, including less technical people, could become proficient at writing and reading workflows.
- As an Eclipse plugin, it is well integrated with the developer’s environment. Workflow files are kept side by side with the jobs they orchestrate, such as Hadoop jobs written in Java or Spark jobs written in, say, Scala, which are in turn developed in the same IDE, Eclipse. This also makes it easier to maintain all these resources in the same source control (e.g. the same GitHub repository).
About the Author
Ahmed Mahran leads the big data team at Seeloz Inc and Badr. He also contributes to the open source community as a committer at the Eclipse Foundation, and he is an innovator at Mashin, where he developed the Oozie Eclipse Plugin. Ahmed's LinkedIn profile: linkedin.com/in/ahmahran