
Oozie Team

Quintin Donnelly edited this page Mar 7, 2019 · 5 revisions

Biforce Brand

Goals

Main

  • Schedule and automate the entire application

Iteration

  • Create workflow library for OLAP and Warehouse ETL
  • Create master workflows for setup and execution of Sqoop jobs
  • Migrate from development environment to EMR

Preparation

The main Oozie folder contains a shell script named setup.sh that creates the required HDFS directories and moves the necessary files into place, so that a test environment can be set up quickly. For further information, see the comments in the script itself.

Script Usage Procedure

  1. Ensure the Oozie folder is inside the environment in which you wish to use it.
  2. Navigate into the Oozie folder.
  3. From the CLI, run dos2unix setup.sh to convert the script's line endings from Windows format to Linux/Unix format.
  4. From the CLI, run sh setup.sh to execute the script.
  5. Once the script is done, complete the manual steps described at the end of it.

WARNING:

  • If you edit setup.sh outside of the testing environment, run dos2unix setup.sh again before running the script.
  • Be careful when moving or removing files from the Oozie folder, as the setup script expects them to be in their current location in order to successfully run.

Implementation

For clarification on any Oozie concepts, please refer to the Apache Oozie User Guide.

Sqoop jobs from ETL have been converted into Oozie workflows, which make up the Oozie library. In each workflow, the Sqoop command has been broken down into its individual arguments using <arg></arg> tags instead of a single <command></command> tag. Oozie splits the contents of <command> on every space, so any argument value that itself contains a space would be parsed incorrectly; <arg> values are passed to Sqoop unchanged.

Furthermore, secure connection arguments, such as usernames, passwords, and connection strings, have been moved to the biforce-setup.properties file, which is found in the setup sub-directory.
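As an illustration of the two points above, a Sqoop action written with <arg> tags and property references might look roughly like the sketch below. It is not taken from the Biforce library: the workflow name, table name, target directory, and every ${...} property (jobTracker, nameNode, connectString, dbUser, dbPassword, targetDir) are hypothetical placeholders assumed to be defined in a properties file, and the schema versions shown depend on the Oozie installation. A plain import is used here for brevity.

```xml
<workflow-app name="example-import" xmlns="uri:oozie:workflow:0.5">
    <start to="example-sqoop-import"/>

    <action name="example-sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.4">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- One token per arg tag; Oozie passes each value to Sqoop unchanged. -->
            <arg>import</arg>
            <!-- Connection details come from properties (hypothetical names). -->
            <arg>--connect</arg>
            <arg>${connectString}</arg>
            <arg>--username</arg>
            <arg>${dbUser}</arg>
            <arg>--password</arg>
            <arg>${dbPassword}</arg>
            <arg>--table</arg>
            <arg>EXAMPLE_TABLE</arg>
            <!-- The next value contains spaces; inside a command tag it would be split apart. -->
            <arg>--where</arg>
            <arg>UPDATED_AT IS NOT NULL</arg>
            <arg>--target-dir</arg>
            <arg>${targetDir}</arg>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Sqoop action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Keeping the connection values as ${...} references means the workflow XML itself holds no credentials, and the same file can be pointed at a different database by changing only the properties.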

Setup

This section of the library is meant to be run only once, to create the Sqoop import jobs on the Sqoop metastore. Run it after the setup script described under Preparation.

WARNING: Running the setup workflows will delete and re-create any existing Sqoop jobs on the metastore. This will wipe any preexisting metadata, such as values needed for incremental append.

biforce-setup.xml is the main workflow. It calls the Hive and warehouse sub-workflows to complete the setup process.

Relevant files:

  • biforce-setup.xml
  • biforce-setup.properties
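A main workflow that chains sub-workflows, in the way biforce-setup.xml is described as doing, could be sketched as follows. This is an assumption about the general shape only, not the contents of the actual file: the node names, the ${setupPath} property, and the choice and order of the two sub-workflows shown are hypothetical.

```xml
<workflow-app name="example-setup" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-setup"/>

    <!-- Run the Hive sub-workflow first (order is an assumption). -->
    <action name="hive-setup">
        <sub-workflow>
            <app-path>${setupPath}/create-hive-imports.xml</app-path>
            <!-- Pass the parent's properties down to the child workflow. -->
            <propagate-configuration/>
        </sub-workflow>
        <ok to="warehouse-setup"/>
        <error to="fail"/>
    </action>

    <!-- Then run the warehouse sub-workflow. -->
    <action name="warehouse-setup">
        <sub-workflow>
            <app-path>${setupPath}/create-warehouse-imports.xml</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Setup failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

With propagate-configuration present, values from the parent's properties file are visible in each sub-workflow, so connection settings only need to be defined once.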

OLAP

This section covers the workflows that deal with Hive and with the data used for Spark analysis.

Relevant files:

  • delete-hive-imports.xml
  • create-hive-imports.xml

Warehouse

This section contains the workflows used to obtain data that goes directly to storage.

Relevant files:

  • delete-warehouse-imports.xml
  • create-warehouse-imports.xml

Execution

This section contains the workflows to be called during actual Biforce operation. They execute the Sqoop jobs that were created in setup.

Relevant files:

  • execute-hive-imports.xml
  • execute-warehouse-imports.xml
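Executing a saved Sqoop job from a workflow could look something like the sketch below. It assumes a job named example_import already exists on the metastore; that job name, the workflow and node names, and the ${jobTracker}, ${nameNode}, and ${metastoreUrl} properties are hypothetical and do not come from the Biforce files.

```xml
<workflow-app name="example-execute-import" xmlns="uri:oozie:workflow:0.5">
    <start to="run-example-job"/>

    <action name="run-example-job">
        <sqoop xmlns="uri:oozie:sqoop-action:0.4">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Run the saved job stored on the shared metastore. -->
            <arg>job</arg>
            <arg>--meta-connect</arg>
            <arg>${metastoreUrl}</arg>
            <arg>--exec</arg>
            <arg>example_import</arg>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Execution failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Because the import definition lives in the saved job, the execution workflow only needs the job name and the metastore location.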

Incomplete

  • Create biforce-execution.xml workflow to call execute-hive-imports.xml and execute-warehouse-imports.xml sub-workflows.
  • Create biforce-execution.properties file for biforce-execution.xml
  • Migrate processes into EMR