In this project, we introduce efficient data partitioning technique, where incoming batches are partitioned evenly for the map and reduce tasks.
The repository is built over Apache Spark v2.0.0. The prototype exposes the low level API of Spark and uses the runJobs
method of SparkContext. To try , first build Spark based on existing instructions. For example, using SBT we can run
./build/sbt package
You need to specify "PromptPartitioner" for the execution environment as follows:
sparkConf.setExecutorEnv("Partitioner","PromptPartitoner" )
When running multiple computations as part of one app then the number of mappers is automatically detected using the number of data blocks (i.e., partitions). However, you need to specify the number of reducers in your computation when initiating the PromptPartitioner object as follows:
val partitioner = new PromptPartitioner(numReducers)
Please check org.apache.spark.examples.PromptExample
for more details.
You can run the PromptWordCount
example with 4 cores for 10 batches with our proposed technique. Note that this example requires at least 4GB of memory on your machine.
./bin/run-example --master "local-cluster[4,1,1024]" org.apache.spark.examples.PromptWordCount
To compare this with existing Spark, we can run the same computation but with default Spark partitioner (time-based)
./bin/run-example --master "local-cluster[4,1,1024]" org.apache.spark.examples.StreamWordCount
The benefit from using our data partitioning technique is clear on larger clusters. Results from running the two stage query for different workloads and different batch interval sizes on Amazon EC2 cluster is presented in our paper.
The source code in this repository is a research prototype and only implements the data partitioning described in our paper. We are working on adding more features to our work.
- Ahmed S. Abdelhamid, Ahmed R. Mahmood, Anas Daghistani, Walid G. Aref, “Prompt: Dynamic Data Partitioning for Distributed Micro-batch Stream Processing Systems”, In proceedings of International Conference on Management of Data, June 14-19, 2020
If you have any question please feel free to send an email.
- Ahmed S. Abdelhamid samy@purdue.edu