The goal of this project is to create a continuous feed of relevant world news based on tweets retrieved from the Twitter API. Because of the potentially large amount of data, the news feed is generated by a distributed system using Apache Spark (1.6.1), Scala (2.10.5), and Java (1.8). The project was developed in the context of the seminar "Mining Massive Datasets" by Master's students (Daniel Neuschäfer-Rube, Jaqueline Pollak, Benjamin Reißaus) at Hasso Plattner Institute (Potsdam, Germany). The following sections outline how to run the project; details on the algorithms can be found on the wiki pages.
## Set Up Amazon Cluster (Optional)

### Upload `java8_bootstrap.sh` to S3
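If the AWS CLI is set up, the upload can be done from the command line; `your-bucket` below is a placeholder for an existing bucket of yours:

```sh
# Copy the bootstrap script into an S3 bucket of your choice
aws s3 cp java8_bootstrap.sh s3://your-bucket/java8_bootstrap.sh
```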
### Create EMR Cluster

- click 'Go to advanced options'
- choose EMR release 4.7.1 with Spark, Hive, Hadoop, and Ganglia
- set Java 8 as default by adding a configuration under 'Edit software settings' (see the sketch below)
- machine type: m3.xlarge
- number of worker instances: 10
- choose `java8_bootstrap.sh` as script in bootstrap actions
- click 'Create cluster'
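For reference, roughly the same cluster can be created with the AWS CLI. This is only a sketch: it assumes the default EMR roles exist, and `your-bucket` and `your-keypair` are placeholders:

```sh
# Write the Java 8 configuration used under 'Edit software settings'
cat > java8-config.json <<'EOF'
[
  {"Classification": "hadoop-env", "Configurations": [
    {"Classification": "export",
     "Properties": {"JAVA_HOME": "/usr/lib/jvm/java-1.8.0"}}]},
  {"Classification": "spark-env", "Configurations": [
    {"Classification": "export",
     "Properties": {"JAVA_HOME": "/usr/lib/jvm/java-1.8.0"}}]}
]
EOF

# 11 instances = 1 master + 10 workers
aws emr create-cluster \
  --name "SparkTwitterClustering" \
  --release-label emr-4.7.1 \
  --applications Name=Spark Name=Hive Name=Hadoop Name=Ganglia \
  --instance-type m3.xlarge \
  --instance-count 11 \
  --use-default-roles \
  --ec2-attributes KeyName=your-keypair \
  --bootstrap-actions Path=s3://your-bucket/java8_bootstrap.sh \
  --configurations file://java8-config.json
```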
In the following, we describe how to run our application locally. However, the automation scripts listed at the end can be used to run everything on the Amazon cluster instead.
## Build Jar

```
mvn clean package
```
## Run Clustering With One Of Two Modes

Cluster tweets from file (generate yourself, or use this sample):

```
spark-submit --class de.hpi.isg.mmds.sparkstreaming.Main target/SparkTwitterClustering-jar-with-dependencies.jar -input [path to twitter.dat]
```

Cluster tweets from the Twitter API (requires `config.txt` in the resources folder):

```
spark-submit --class de.hpi.isg.mmds.sparkstreaming.Main target/SparkTwitterClustering-jar-with-dependencies.jar -source api
```
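The exact layout of `config.txt` is defined by the project sources and is not reproduced here. As a purely hypothetical illustration, Twitter API credential files typically hold the four OAuth values:

```
# Hypothetical layout -- check the project sources for the actual key names
consumerKey=...
consumerSecret=...
accessToken=...
accessTokenSecret=...
```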
## Merge Clustering Results

```
spark-submit --class de.hpi.isg.mmds.sparkstreaming.ClusterInfoAggregation target/SparkTwitterClustering-jar-with-dependencies.jar
```
## Visualize Clustering Results (Details)

- install node
- install the node packages in the `webapp` folder: `npm install`
- run the webserver: `node server.js`
- open a browser at http://localhost:3000/index
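The same steps as one shell session, assuming Node.js and npm are already installed:

```sh
cd webapp
npm install     # install the web app's dependencies
node server.js  # start the web server
# then open http://localhost:3000/index in a browser
```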
We have provided the following automation scripts in the folder `SparkTwitterClustering/utils/`:
- `buildAndCopyToRemote.sh`
  - actions:
    - build the fat jar `SparkTwitterClustering-jar-with-dependencies.jar` on the local machine
    - copy `SparkTwitterClustering-jar-with-dependencies.jar`, `runOnCluster.sh`, `driver_bootstrap.sh`, and `twitter.dat` to the remote server referenced by `MASTER_PUBLIC_DNS`
  - command: `./buildAndCopyToRemote.sh [MASTER_PUBLIC_DNS]`
  - execution machine: local
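  - example (the master address below is a made-up placeholder): `./buildAndCopyToRemote.sh ec2-54-93-212-18.eu-central-1.compute.amazonaws.com`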
- `driver_bootstrap.sh`
  - actions:
    - install git, tmux, zsh, oh-my-zsh
    - configure tmux
    - copy `twitter.dat` to HDFS
  - command: `./driver_bootstrap.sh`
  - execution machine: remote cluster driver
- `java8_bootstrap.sh`
  - actions:
    - install Java 8 and set up `JAVA_HOME` accordingly
  - command: none; supplied during cluster creation as a bootstrap action
  - execution machine: all cluster nodes
- `runOnCluster.sh`
  - actions:
    - run the Spark job with different batch sizes, numbers of executors, and numbers of cores
    - save the results in `runtime.csv` (see the sketch below)
  - command: `./runOnCluster.sh`
  - execution machine: remote cluster driver
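We do not reproduce `runOnCluster.sh` here; the following is only a minimal sketch of what such a benchmark loop can look like. `-batchSize` is a hypothetical stand-in for the project's actual batch-size parameter, and the CSV columns are illustrative:

```sh
# Minimal sketch in the spirit of runOnCluster.sh; the real script may differ.
JAR=SparkTwitterClustering-jar-with-dependencies.jar
echo "executors,cores,batchSize,seconds" > runtime.csv
for EXECUTORS in 2 4 8; do
  for CORES in 1 2 4; do
    for BATCH in 1000 10000; do
      START=$(date +%s)
      spark-submit --class de.hpi.isg.mmds.sparkstreaming.Main \
        --num-executors "$EXECUTORS" --executor-cores "$CORES" \
        "$JAR" -input twitter.dat -batchSize "$BATCH"   # -batchSize: hypothetical flag
      END=$(date +%s)
      echo "$EXECUTORS,$CORES,$BATCH,$((END - START))" >> runtime.csv
    done
  done
done
```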