An ETL pipeline that extracts data from HDFS, transforms it using Spark, and writes the results back to HDFS.
This repository contains a Scala script that implements a simple data pipeline using Hadoop and Spark.
The purpose of this pipeline is to:
- Read data from HDFS.
- Perform JOIN operations using Spark.
- Perform data analysis and transformations.
- Write the processed data back to HDFS.
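
The steps above can be sketched as a small Spark application. This is a minimal illustration, not the contents of `pipeline2.scala`: the HDFS paths, file names, and the `customer_id` join key are assumptions, so adjust them to your data.

```scala
import org.apache.spark.sql.SparkSession

object Pipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HdfsEtlPipeline")
      .getOrCreate()

    // Extract: read two datasets from HDFS (paths are placeholders).
    val orders = spark.read.option("header", "true")
      .csv("hdfs:///data/orders.csv")
    val customers = spark.read.option("header", "true")
      .csv("hdfs:///data/customers.csv")

    // Transform: join the datasets and run a simple aggregation.
    val joined = orders.join(customers, Seq("customer_id"))
    val totals = joined.groupBy("customer_id").count()

    // Load: write the processed data back to HDFS.
    totals.write.mode("overwrite").parquet("hdfs:///output/customer_totals")

    spark.stop()
  }
}
```

When run inside `spark-shell`, the `SparkSession` is already available as `spark`, so the builder boilerplate can be dropped and only the read/join/write steps are needed.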
- Ensure you have Java, Hadoop, and Spark installed.
- Execute the Scala script (`pipeline2.scala`) to run the data pipeline:

  spark-shell -I "path/to/pipeline2.scala"

- Alternatively, depending on your environment, you can launch `spark-shell` and load the script from within the REPL:

  :load path/to/pipeline2.scala

  or package the code as an application and run it with `spark-submit`.