Oozie Team
We decided to use Oozie to automate the ETL process, the transfer of the data into S3, and the remote connection to EMR that runs the Spark job. This streamlines the pipeline so that, as more data is gathered, minimal effort is required of the user to process it. We also incorporated Oozie coordinators so the entire workflow can be scheduled at any desired interval.
- Before running the Spark workflow, the MySQL JDBC driver must be installed. General instructions:
https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cdhigsqoopinstallation.html#topic13
https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cdhooziesqoopjdbc.html
Make sure the MySQL JDBC driver jar is placed in /user/oozie/libext on the Hortonworks HDFS.
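For example, assuming the driver jar is named mysql-connector-java.jar and sits in your local working directory (the jar name is illustrative):
hdfs dfs -mkdir -p /user/oozie/libext
hdfs dfs -copyFromLocal mysql-connector-java.jar /user/oozie/libext/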
- Change the hortonPass parameter in the job.properties file to the password of the root user in your MySQL instance.
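One way to do this from the shell (a sketch; assumes hortonPass appears on its own line in job.properties):
sed -i 's/^hortonPass=.*/hortonPass=<your-mysql-root-password>/' job.properties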
- Change 'mySQLPW' in sqoop-job.txt to the MySQL root user's password.
- Run the contents of sqoop-job.txt on the command line to register the Sqoop job for the join step.
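Assuming sqoop-job.txt contains a single sqoop job --create command (an assumption based on its name), one way to run it and confirm the saved job exists:
sh sqoop-job.txt
sqoop job --list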
- Use the -copyFromLocal command to migrate workflow.xml to a directory of your choice; here we have it set to /user/root/HData.
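For example, to create that directory and copy the workflow into it:
hdfs dfs -mkdir -p /user/root/HData
hdfs dfs -copyFromLocal workflow.xml /user/root/HData/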
- Run the following command while in the local directory that contains job.properties:
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -D oozie.wf.application.path=/user/root/HData/workflow.xml -config job.properties -run
This Oozie workflow was run on Hortonworks.
To view the Hadoop JobHistory logs, you may want to log in as root.
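To follow the workflow from the command line, the Oozie CLI can also report status and logs for the job ID printed after submission (<job-id> is a placeholder):
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -info <job-id>
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -log <job-id>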
For Spark2 compatibility: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bkspark-component-guide/content/choozie-spark-action.html#spark-config-oozie-spark2
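In short, once the spark2 sharelib has been created and populated as that guide describes, the Spark action is pointed at it through job.properties; a minimal sketch of that last step:
echo "oozie.action.sharelib.for.spark=spark2" >> job.properties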
- Make a directory in HDFS (HData) that will hold your workflow and the input for your spark action.
hdfs dfs -mkdir /user/root/HData
hdfs dfs -copyFromLocal workflow.xml HData/
hdfs dfs -copyFromLocal all-bible HData/
- Make a directory in HDFS at the same location as the workflow.xml.
- Note: it must be named "lib".
hdfs dfs -mkdir /user/root/HData/lib
hdfs dfs -copyFromLocal WordCountSpark.jar HData/lib
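To double-check the layout before submitting, list the directory recursively; workflow.xml, the all-bible input, and lib/WordCountSpark.jar should all be present:
hdfs dfs -ls -R /user/root/HData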
- Use the following to run the Oozie workflow:
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -config <local path of job.properties> -run
- Make a directory in HDFS (CData) that will hold your coordinator
hdfs dfs -mkdir /user/root/CData
hdfs dfs -copyFromLocal coordinator.xml CData/
- Change the application path in your job.properties file to point to the coordinator instead of the workflow.
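As a sketch, assuming standard Oozie property names, the change amounts to swapping the workflow path property for the coordinator one in job.properties:
oozie.coord.application.path=/user/root/CData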
- Use the same line as above to run the coordinator job.
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -config <local path of job.properties> -run
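To verify the coordinator is scheduled and to watch the workflow runs it launches (<coordinator-job-id> is a placeholder for the ID Oozie prints):
oozie jobs -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -jobtype coordinator
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -info <coordinator-job-id>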
- Fully automated ETL process of the workflow
- Spark action runs locally on Hortonworks
- Documentation and code for a working Oozie coordinator
- Future: fully automate the entire process from Caliber to Redshift