
Oozie Team



Purpose of Oozie

We chose Oozie to automate the ETL process, the transfer of data into S3, and the remote connection to EMR that runs the Spark job. This streamlines the pipeline so that, as more data is gathered, minimal effort is required from the user to process it. We also incorporated Oozie coordinators so the entire workflow can be scheduled at any desired interval.

Set-up (ETL)

  1. Before running the Spark workflow, the MySQL JDBC connection driver must be installed. General instructions are at https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cdhigsqoopinstallation.html#topic13 and https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cdhooziesqoopjdbc.html. Make sure the MySQL JDBC driver is placed in /user/oozie/libext on the Hortonworks HDFS.

  2. Set the hortonPass parameter in the job.properties file to the password of the root user in your MySQL instance (a sketch of job.properties follows the command at the end of this list).

  3. Replace 'mySQLPW' in sqoop-job.txt with the MySQL root user's password.

  4. Run the command in sqoop-job.txt to create the Sqoop job that performs the join (a sketch of what such a job definition might look like also follows this list).

  5. Use the hdfs dfs -copyFromLocal command to copy workflow.xml to the HDFS directory of your choice; here we have it set to /user/root/HData.

  6. Run the following command while in the local directory that contains job.properties:

oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -D oozie.wf.application.path=/user/root/HData/workflow.xml -config job.properties -run
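
For reference, a minimal job.properties for this workflow might look like the sketch below. Only the hortonPass parameter and the application path come from this page; the host name, ports, and remaining property values are assumptions for the HDP sandbox and should be adjusted to your environment.

# Hedged sketch of a job.properties for this workflow (host and ports are assumptions)
nameNode=hdfs://sandbox-hdp.hortonworks.com:8020
jobTracker=sandbox-hdp.hortonworks.com:8050
oozie.use.system.libpath=true
oozie.wf.application.path=/user/root/HData/workflow.xml
# MySQL root password used by the Sqoop step (step 2 above)
hortonPass=yourMySQLRootPassword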
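
This page does not reproduce the contents of sqoop-job.txt; as a rough sketch only, a saved Sqoop job that stages a joined result into HDFS could be defined along these lines. The job name, database, tables, join query, and target directory are all hypothetical, and mySQLPW stands in for the real root password (step 3 above).

# Hypothetical Sqoop job definition; adjust connect string, query, and target dir
sqoop job --create biforce-join -- import \
  --connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/biforce \
  --username root --password mySQLPW \
  --query 'SELECT a.*, b.* FROM table_a a JOIN table_b b ON a.id = b.id WHERE $CONDITIONS' \
  --target-dir /user/root/HData/input -m 1
sqoop job --list

The Oozie Sqoop action in the workflow can then execute the saved job (sqoop job --exec biforce-join) each time the workflow runs.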

Set-up (Spark)

This Oozie workflow was run on Hortonworks.
To view the Hadoop JobHistory logs, you may want to log in as root.
For Spark2 compatibility, see: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bkspark-component-guide/content/choozie-spark-action.html#spark-config-oozie-spark2

  • Make a directory in HDFS (HData) that will hold your workflow and the input for your Spark action.
hdfs dfs -mkdir /user/root/HData
hdfs dfs -copyFromLocal workflow.xml HData/
hdfs dfs -copyFromLocal all-bible HData/
  • Make a directory in HDFS at the same location as the workflow.xml. Note: it must be named "lib", since the Spark action loads its application jar from there. (A sketch of the Spark action in workflow.xml follows the run command below.)
hdfs dfs -mkdir /user/root/HData/lib
hdfs dfs -copyFromLocal WordCountSpark.jar HData/lib
  • Use the following to run the Oozie workflow:
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -config <local path of job.properties> -run
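
The page does not include workflow.xml itself, so the following is only a hedged sketch of what the Spark action might look like; the workflow and action names, driver class, and argument layout are assumptions, while the jar under lib/, the all-bible input, and the local master follow the steps above.

<workflow-app name="spark-wordcount-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>local[*]</master>
            <name>WordCountSpark</name>
            <class>com.biforce.WordCountSpark</class>
            <jar>${nameNode}/user/root/HData/lib/WordCountSpark.jar</jar>
            <arg>${nameNode}/user/root/HData/all-bible</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>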

Oozie Coordinators

  • Make a directory in HDFS (CData) that will hold your coordinator.
hdfs dfs -mkdir /user/root/CData
hdfs dfs -copyFromLocal coordinator.xml CData/
  • Change the application path in your job.properties file to point to the coordinator instead of the workflow (i.e., replace oozie.wf.application.path with oozie.coord.application.path=/user/root/CData). A sketch of a minimal coordinator.xml follows the command below.
  • Use the same line as above to run the coordinator job:
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -config <local path of job.properties> -run
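
The coordinator simply wraps the workflow above and supplies the schedule. As a minimal, hedged sketch, a coordinator.xml that runs the workflow once a day could look like the following; the app name, start/end window, and frequency are placeholders and can be set to any desired interval.

<coordinator-app name="biforce-coord" frequency="${coord:days(1)}"
                 start="2019-01-01T00:00Z" end="2020-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${nameNode}/user/root/HData</app-path>
        </workflow>
    </action>
</coordinator-app>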

Results and Future Iterations

  • Fully automated ETL process within the workflow
  • Spark action runs locally on Hortonworks
  • Documentation and code for a working Oozie coordinator
  • Future: fully automate the entire process from Caliber to Redshift