The goal of this project is to create a continuous feed of relevant world news based on tweets retrieved from the Twitter API. Because of the potentially large amount of data, the news feed is generated by a distributed system using Apache Spark (1.6.1), Scala (2.10.5), and Java (1.8). The project was developed in the context of the seminar "Mining Massive Datasets" by Master's students (Daniel Neuschäfer-Rube, Jaqueline Pollak, Benjamin Reißaus) at Hasso Plattner Institute (Potsdam, Germany). The following sections outline how to run the project; details on the algorithms can be found on the wiki pages.
## Set Up Amazon Cluster (Optional)

### Upload `java8_bootstrap.sh` to S3
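If the AWS CLI is set up, the upload can be done from the command line; `your-bucket` below is a placeholder for an existing bucket of yours:

```sh
# Copy the bootstrap script into an S3 bucket of your choice
aws s3 cp java8_bootstrap.sh s3://your-bucket/java8_bootstrap.sh
```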
### Create EMR Cluster

- click 'Go to advanced options'
- choose EMR release 4.7.1 with Spark, Hive, Hadoop, and Ganglia
- set Java 8 as default by adding a configuration under 'Edit software settings' (see the sketch below)
- machine type: m3.xlarge
- number of worker instances: 10
- choose `java8_bootstrap.sh` as script in bootstrap actions
- click 'Create cluster'
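For reference, roughly the same cluster can be created with the AWS CLI. This is only a sketch: it assumes the default EMR roles exist, and `your-bucket` and `your-keypair` are placeholders:

```sh
# Write the Java 8 configuration used under 'Edit software settings'
cat > java8-config.json <<'EOF'
[
  {"Classification": "hadoop-env", "Configurations": [
    {"Classification": "export",
     "Properties": {"JAVA_HOME": "/usr/lib/jvm/java-1.8.0"}}]},
  {"Classification": "spark-env", "Configurations": [
    {"Classification": "export",
     "Properties": {"JAVA_HOME": "/usr/lib/jvm/java-1.8.0"}}]}
]
EOF

# 11 instances = 1 master + 10 workers
aws emr create-cluster \
  --name "SparkTwitterClustering" \
  --release-label emr-4.7.1 \
  --applications Name=Spark Name=Hive Name=Hadoop Name=Ganglia \
  --instance-type m3.xlarge \
  --instance-count 11 \
  --use-default-roles \
  --ec2-attributes KeyName=your-keypair \
  --bootstrap-actions Path=s3://your-bucket/java8_bootstrap.sh \
  --configurations file://java8-config.json
```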
In the following, we describe how to run our application locally. However, the automation scripts listed at the end can be used to run everything on the Amazon cluster instead.
## Build Jar

```
mvn clean package
```
## Run Clustering With One Of Two Modes

Cluster tweets from file (generate yourself, or use this sample):

```
spark-submit --class de.hpi.isg.mmds.sparkstreaming.Main target/SparkTwitterClustering-jar-with-dependencies.jar -input [path to twitter.dat]
```

Cluster tweets from the Twitter API (requires `config.txt` in the resources folder):

```
spark-submit --class de.hpi.isg.mmds.sparkstreaming.Main target/SparkTwitterClustering-jar-with-dependencies.jar -source api
```
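The exact layout of `config.txt` is defined by the project sources and is not reproduced here. As a purely hypothetical illustration, Twitter API credential files typically hold the four OAuth values:

```
# Hypothetical layout -- check the project sources for the actual key names
consumerKey=...
consumerSecret=...
accessToken=...
accessTokenSecret=...
```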
## Merge Clustering Results

```
spark-submit --class de.hpi.isg.mmds.sparkstreaming.ClusterInfoAggregation target/SparkTwitterClustering-jar-with-dependencies.jar
```
## Visualize Clustering Results (Details)

- install node
- install the node packages in the `webapp` folder: `npm install`
- run the webserver: `node server.js`
- open a browser at http://localhost:3000/index
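The same steps as one shell session, assuming Node.js and npm are already installed:

```sh
cd webapp
npm install     # install the web app's dependencies
node server.js  # start the web server
# then open http://localhost:3000/index in a browser
```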
We have provided the following automation scripts in the folder `SparkTwitterClustering/utils/`:
- `buildAndCopyToRemote.sh`
  - actions:
    - build the fat jar `SparkTwitterClustering-jar-with-dependencies.jar` on the local machine
    - copy `SparkTwitterClustering-jar-with-dependencies.jar`, `runOnCluster.sh`, `driver_bootstrap.sh`, and `twitter.dat` to the remote server referenced by `MASTER_PUBLIC_DNS`
  - command: `./buildAndCopyToRemote.sh [MASTER_PUBLIC_DNS]`
  - execution machine: local
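  - example (the master address below is a made-up placeholder): `./buildAndCopyToRemote.sh ec2-54-93-212-18.eu-central-1.compute.amazonaws.com`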
- `driver_bootstrap.sh`
  - actions:
    - install git, tmux, zsh, oh-my-zsh
    - configure tmux
    - copy `twitter.dat` to HDFS
  - command: `./driver_bootstrap.sh`
  - execution machine: remote cluster driver
- `java8_bootstrap.sh`
  - actions:
    - install Java 8 and set up `JAVA_HOME` accordingly
  - command: none; supplied during cluster creation as a bootstrap action
  - execution machine: all cluster nodes
- `runOnCluster.sh`
  - actions:
    - run the Spark job with different batch sizes, numbers of executors, and numbers of cores
    - save the results in `runtime.csv` (see the sketch below)
  - command: `./runOnCluster.sh`
  - execution machine: remote cluster driver
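We do not reproduce `runOnCluster.sh` here; the following is only a minimal sketch of what such a benchmark loop can look like. `-batchSize` is a hypothetical stand-in for the project's actual batch-size parameter, and the CSV columns are illustrative:

```sh
# Minimal sketch in the spirit of runOnCluster.sh; the real script may differ.
JAR=SparkTwitterClustering-jar-with-dependencies.jar
echo "executors,cores,batchSize,seconds" > runtime.csv
for EXECUTORS in 2 4 8; do
  for CORES in 1 2 4; do
    for BATCH in 1000 10000; do
      START=$(date +%s)
      spark-submit --class de.hpi.isg.mmds.sparkstreaming.Main \
        --num-executors "$EXECUTORS" --executor-cores "$CORES" \
        "$JAR" -input twitter.dat -batchSize "$BATCH"   # -batchSize: hypothetical flag
      END=$(date +%s)
      echo "$EXECUTORS,$CORES,$BATCH,$((END - START))" >> runtime.csv
    done
  done
done
```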