An ETL pipeline that extracts data from HDFS, transforms it using Spark, and writes the results back to HDFS.
This repository contains a Scala script that implements a simple data pipeline using Hadoop and Spark.
The purpose of this pipeline is to:
- Read data from HDFS.
- Perform JOIN operations using Spark.
- Perform data analysis and transformations.
- Write the processed data back to HDFS.
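
The steps above can be sketched as a small Spark application. This is a minimal illustration, not the contents of `pipeline2.scala`: the HDFS paths, file names, and the `customer_id` join key are assumptions, so adjust them to your data.

```scala
import org.apache.spark.sql.SparkSession

object Pipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HdfsEtlPipeline")
      .getOrCreate()

    // Extract: read two datasets from HDFS (paths are placeholders).
    val orders = spark.read.option("header", "true")
      .csv("hdfs:///data/orders.csv")
    val customers = spark.read.option("header", "true")
      .csv("hdfs:///data/customers.csv")

    // Transform: join the datasets and run a simple aggregation.
    val joined = orders.join(customers, Seq("customer_id"))
    val totals = joined.groupBy("customer_id").count()

    // Load: write the processed data back to HDFS.
    totals.write.mode("overwrite").parquet("hdfs:///output/customer_totals")

    spark.stop()
  }
}
```

When run inside `spark-shell`, the `SparkSession` is already available as `spark`, so the builder boilerplate can be dropped and only the read/join/write steps are needed.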
- Ensure you have Java, Hadoop, and Spark installed.
- Execute the Scala script (`pipeline2.scala`) to run the data pipeline:

  spark-shell -I "path/to/pipeline2.scala"

- Alternatively, depending on your environment, you can launch `spark-shell` and load the script from within the REPL:

  :load path/to/pipeline2.scala

  or package the code as an application and run it with `spark-submit`.