SparkBWT is a tool for calculating the Burrows-Wheeler transform (BWT) on Apache Spark Framework.
The application has been developed using maven, the main languages are java and scala. To improve the performance it has been used C/C++ languages too, integrated through JNI. In the development of the application was used the framework Apache Spark.
The source code is the src folder. Inside there are the main folder that contains the code for the application and the test folder that contains the code for testing the classes in main folder.
In src/main we can find:
- java contains the JNI glue code and the code for CLI.
- native contains the native code, that is the c++ procedure for sorting based on Radix-Sort algorithm
- scala contains the implementation of the algorithm in Apache Spark..
The building of the project can be made automatically with maven, but this requires that the following tools are installed in the system:
make
g++
For building in Windows environment you have to use MinGW and CMake.
To build the project from command line:
git clone https://github.com/MR6996/spark-bwt
cd spark-bwt
mvn package -P [profile]
The profiles are window
and linux
depending on your operating system.
In the created /target
folder, we can find the jar
file needed to run the application (Should be named as spark-bwt.jar
).
The tool can be launched using the tool provided by default by Apache Spark spark-submit
. Can be used a YARN cluster and can be used the option parameters for configuration.
A typical usage is:
spark-submit [options] spark-bwt.jar <filename>
for help:
spark-submit spark-bwt.jar -h
The project is distributed under GPL v.3 License More info
[1] Mario Randazzo, Simona E. Rombo A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform