Controlling Hadoop Jobs Using Oozie Tutorial | The Complete Guide

Last updated on 11th Aug 2022

About author

Dina Nath (Hadoop Application Developer)

Dina Nath has 7+ years of experience in Big Data, Hadoop, Python, Spark, Scala, Impala, SQL, and Hive. The author spends most of the time researching technology and startups.

Hortonworks Oozie

This tutorial covers some of the background and motivations that led to the creation of Oozie, explaining the challenges developers faced as they started building complex applications running on Hadoop.

It also introduces a simple Oozie application and covers the different Oozie releases, their main features, their timeline, compatibility considerations, and some interesting statistics from big Oozie deployments.

What is OOZIE?

Apache Oozie

It is a workflow scheduler for Hadoop.

It is a system that runs workflows of dependent jobs.

Users can define workflows as Directed Acyclic Graphs (DAGs) of actions, which can run in parallel or sequentially on Hadoop.

It consists of two parts:

Workflow engine:

The workflow engine stores and runs workflows composed of Hadoop jobs, e.g., MapReduce, Pig, and Hive.

Coordinator engine:

It runs workflow jobs based on predefined schedules and the availability of data.

Oozie is scalable and can maintain the timely execution of thousands of workflows in a Hadoop cluster.

Oozie is also very flexible.

One can easily start, stop, suspend and rerun jobs.

Oozie makes it simple to rerun failed workflows.

It is easy to appreciate how difficult it can be to catch up on jobs that were missed or failed because of downtime or failure.

It is even possible to skip a specific failed node.
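
The job-control operations described above correspond to options of the `oozie job` command. The following is a minimal sketch, not a definitive recipe. The helper function names, the `OOZIE_DRY_RUN` variable, and the job and node names are hypothetical and introduced only for illustration. By default the sketch only prints the commands it would run, so it can be tried without an Oozie server.

```shell
# run_oozie: print the command in dry-run mode (the default in this sketch);
# set OOZIE_DRY_RUN=0 to invoke the real oozie client instead.
run_oozie() {
    if [ "${OOZIE_DRY_RUN:-1}" = "1" ]; then
        echo "oozie $*"
    else
        oozie "$@"
    fi
}

# Suspend, resume, or kill a workflow job by ID.
suspend_job() { run_oozie job -suspend "$1"; }
resume_job()  { run_oozie job -resume "$1"; }
kill_job()    { run_oozie job -kill "$1"; }

# Rerun a job, re-executing only the nodes that failed.
rerun_failed_nodes() {
    run_oozie job -rerun "$1" -D oozie.wf.rerun.failnodes=true
}

# Rerun a job, skipping the comma-separated list of nodes given in $2.
rerun_skipping_nodes() {
    run_oozie job -rerun "$1" -D "oozie.wf.rerun.skip.nodes=$2"
}

# Hypothetical job ID and node name, used only to show the generated commands.
suspend_job example-job-id
rerun_skipping_nodes example-job-id cleanup-step
```

A rerun needs one of the two rerun properties shown here: `oozie.wf.rerun.failnodes` to retry what failed, or `oozie.wf.rerun.skip.nodes` to name the nodes to leave out.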

How does OOZIE work?

Oozie runs as a service in the cluster, and clients submit workflow definitions for immediate or later processing.

An Oozie workflow consists of action nodes and control-flow nodes.

An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig, or Hive job, importing data with Sqoop, or running a shell script or a program written in Java.

A control-flow node controls the workflow execution between actions by allowing constructs such as conditional logic, in which different branches may be followed depending on the result of an earlier action node.

The Start Node, End Node, and Error Node fall under this category of nodes.

The Start Node designates the start of the workflow job.

The End Node signals the end of the job.

The Error Node designates the occurrence of an error and the corresponding error message to be printed.

At the end of a workflow's execution, Oozie uses an HTTP callback to update the client with the workflow status.

Entry to or exit from an action node may also trigger the callback.
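
As a rough illustration of the callback mechanism, the sketch below writes two notification properties into a job properties file. It assumes the `oozie.wf.workflow.notification.url` and `oozie.wf.action.notification.url` properties and the `$jobId`, `$nodeName`, and `$status` tokens described in the Oozie workflow specification. The callback endpoint on `localhost:8080` is hypothetical.

```shell
# Write the notification settings to a temporary properties file.
# The quoted 'EOF' keeps the shell from expanding $jobId, $nodeName, $status;
# Oozie is expected to substitute those tokens itself when it calls the URL.
props="$(mktemp)"
cat > "$props" <<'EOF'
# Called when the workflow job changes status (hypothetical endpoint).
oozie.wf.workflow.notification.url=http://localhost:8080/oozie-callback?job=$jobId&status=$status
# Called when the workflow enters or exits an action node.
oozie.wf.action.notification.url=http://localhost:8080/oozie-callback?job=$jobId&node=$nodeName&status=$status
EOF

cat "$props"
```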

The control-flow nodes in Apache Oozie are listed below (the Error Node described above is the kill control node):

  • Start control node
  • End control node
  • Kill control node
  • Decision control node
  • Fork and Join control node
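
To make the node types concrete, here is a minimal sketch of a workflow definition with a start node, one action node, a kill node, and an end node. Decision and fork/join nodes are left out for brevity. The workflow name, node names, output path, and the `nameNode` parameter are hypothetical. The action uses Oozie's built-in `fs` action only because it is short.

```shell
# Write a small workflow definition to a temporary application directory.
# The quoted 'EOF' stops the shell from touching the ${...} expressions,
# which are Oozie EL expressions resolved by the Oozie server.
app_dir="$(mktemp -d)"
cat > "$app_dir/workflow.xml" <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.1" name="example-workflow">
  <!-- Start control node: names the first node to run. -->
  <start to="prepare-output"/>

  <!-- Action node: every action declares where to go on success and on error. -->
  <action name="prepare-output">
    <fs>
      <delete path="${nameNode}/user/${wf:user()}/example-output"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <!-- Kill control node: ends the job as KILLED and records a message. -->
  <kill name="fail">
    <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>

  <!-- End control node: the job finishes successfully. -->
  <end name="end"/>
</workflow-app>
EOF

cat "$app_dir/workflow.xml"
```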

Packaging and deploying an Oozie workflow application:

A workflow application includes the workflow definition and all the associated resources, such as MapReduce JAR files and Pig scripts. Applications need to follow a simple directory structure and are deployed to HDFS so that Oozie can access them.

The workflow definition file, workflow.xml, must be kept in the top-level directory (the parent directory named after the workflow).

The lib directory contains the JAR files with the MapReduce classes.
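
The layout described above can be sketched locally before anything is copied to HDFS. The directory name `example-workflow` and the JAR name are hypothetical, and the files are empty placeholders.

```shell
# Build the application directory locally: workflow.xml at the top level,
# JAR files under lib/.
app_dir="$(mktemp -d)/example-workflow"
mkdir -p "$app_dir/lib"
: > "$app_dir/workflow.xml"          # placeholder for the workflow definition
: > "$app_dir/lib/example-job.jar"   # placeholder for the MapReduce classes

# List the resulting files relative to the application directory.
(cd "$app_dir" && find . -type f | sort)
```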

A workflow application conforming to this layout can be built with any build tool, e.g., Ant or Maven.

The build output then needs to be copied to HDFS with a command such as:

% hadoop fs -put hadoop-examples/target/<name of workflow dir> <name of workflow>

Steps for Running an Oozie workflow job:

To run the workflow, we will use the Oozie command-line tool (a client program that communicates with the Oozie server).

1. Export the OOZIE_URL environment variable, which tells the oozie command which Oozie server to use (here we’re using one running locally):

% export OOZIE_URL="http://localhost:11000/oozie"

2. Run the workflow job:

% oozie job -config ch05/src/main/resources/max-temp-workflow.properties -run

The -config option refers to a local Java properties file containing definitions for the parameters in the workflow XML file, as well as oozie.wf.application.path, which tells Oozie the location of the workflow application in HDFS.
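
A properties file of this kind might look like the sketch below. Only `oozie.wf.application.path` comes from the text above. The `nameNode` and `jobTracker` parameters, the host and port values, and the `example-workflow` directory name are assumptions for a local single-node setup.

```shell
# Write an example properties file. The quoted 'EOF' leaves the ${...}
# references intact so that Oozie, not the shell, resolves them.
props="$(mktemp)"
cat > "$props" <<'EOF'
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/example-workflow
EOF

# property_value KEY FILE: print the value stored for KEY in a properties file.
property_value() {
    grep "^$1=" "$2" | cut -d= -f2-
}

property_value oozie.wf.application.path "$props"
```

Keeping cluster addresses in the properties file rather than in workflow.xml lets the same workflow definition run unchanged on different clusters.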

3. Get the status of the workflow job:

The status of a workflow job can be seen using the ‘job’ subcommand with the ‘-info’ option, specifying the job ID after ‘-info’:

% oozie job -info <job id>

The output shows the job status, for example RUNNING, KILLED, or SUCCEEDED.
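
In a script, the status can be pulled out of the `-info` output. The sketch below assumes that the output contains a line of the form `Status : <STATE>`. The sample text is abbreviated and hypothetical (the job ID and path are made up), so check it against the real output of your Oozie version.

```shell
# job_status: read `oozie job -info` output on stdin and print the job status.
job_status() {
    sed -n 's/^Status[[:space:]]*:[[:space:]]*\([A-Z_]*\).*$/\1/p' | head -n 1
}

# Abbreviated, hypothetical sample of what the -info output may look like.
sample_info='Job ID : 0000001-000000000000000-oozie-oozi-W
Workflow Name : example-workflow
App Path      : hdfs://localhost:8020/user/example/example-workflow
Status        : SUCCEEDED'

printf '%s\n' "$sample_info" | job_status
```

Against a live server the same function would be used as `oozie job -info <job id> | job_status`.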

4. The results of a successful workflow execution can be seen using a Hadoop command such as:

% hadoop fs -cat <location of result>

Why use Oozie?

The main purpose of using Oozie is to manage the different kinds of jobs being processed in a Hadoop system.

Dependencies between jobs are specified by a user in the form of Directed Acyclic Graphs.

Oozie consumes this information and takes care of executing the jobs in the correct order, as specified in the workflow.
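
The idea of deriving an execution order from declared dependencies can be tried with the standard `tsort` utility. This only illustrates ordering a DAG and says nothing about how Oozie implements it. The job names are hypothetical.

```shell
# Each line is "prerequisite dependent": the first job must finish
# before the second one may start.
dependencies='ingest clean
clean aggregate
aggregate report'

# tsort prints an order in which every job appears after its prerequisites.
printf '%s\n' "$dependencies" | tsort
```

For this linear chain the only valid order is ingest, clean, aggregate, report. With branching dependencies several orders can be valid, and independent branches are the ones a scheduler can run in parallel.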

That way, the user saves the time that would otherwise be spent managing the complete workflow.

Oozie has a provision to specify the frequency of execution of a particular job.
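
Execution frequency is declared in a coordinator application. The sketch below writes a minimal coordinator definition that would trigger a workflow once a day. The coordinator name, the start and end dates, and the `workflowAppUri` parameter (expected to be defined in the job properties file) are hypothetical.

```shell
# Write a minimal coordinator definition. The quoted 'EOF' keeps the shell
# from expanding ${...}; those expressions are resolved by Oozie.
coord_dir="$(mktemp -d)"
cat > "$coord_dir/coordinator.xml" <<'EOF'
<coordinator-app name="example-daily-coordinator"
                 frequency="${coord:days(1)}"
                 start="2022-08-11T00:00Z" end="2022-08-18T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <!-- HDFS path of the workflow application to run on each trigger. -->
      <app-path>${workflowAppUri}</app-path>
    </workflow>
  </action>
</coordinator-app>
EOF

cat "$coord_dir/coordinator.xml"
```

A coordinator job is submitted with the same `oozie job -config ... -run` command as a workflow, with the properties file pointing at the coordinator through `oozie.coord.application.path` instead of `oozie.wf.application.path`.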

Features of Oozie

  • It has a client API and a command-line interface that can be used to launch, control, and monitor jobs from a Java application.
  • Using its Web Service APIs, one can control jobs from anywhere.
  • It has a provision to execute jobs that are scheduled to run periodically.
  • It has a provision to send email notifications upon completion of jobs.

Conclusion

We had a look at what Apache Oozie is:

Apache Oozie is a web application written in Java that works with Apache Hadoop installations.

  • Apache Oozie is a job management tool for big data systems that works in the Hadoop ecosystem alongside other products such as YARN, MapReduce, and Pig.
  • Individual job tasks are grouped into logical work units by Apache Oozie.
  • It enables more advanced scheduling and work management.
  • Engineers may use Oozie to build complex data processes that make Hadoop operations easier to manage.

Apache Oozie is available under an Apache Software Foundation license and is part of the Hadoop toolset, which the community considers an open-source software system rather than a commercial, vendor-licensed system.

Since Hadoop has become so popular for analytics and other types of enterprise computing, tools like Oozie are increasingly being considered as solutions for data handling projects within enterprise IT.
