This is an assignment for the Big Data course at Roma Tre University.
This repo is based on the work reported in the paper "LSH Ensemble: Internet-Scale Domain Search".
To run this project you need:
- Python 3.6.9
- Hadoop 3.2.1
- Spark 3.0.0
- pip3 installed on your machine. To install pip3, run the following commands in a shell:
sudo apt update
sudo apt install python3-pip
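Before going further, it can help to confirm that the required tools are reachable from your shell. The snippet below is a minimal sketch, not part of this repo: the helper name have_cmd is hypothetical, and it assumes the hadoop and spark-submit launchers are on your PATH (on some installations they live only under $HADOOP_HOME/bin and $SPARK_HOME/bin).

```shell
# Report whether a command is available on PATH (POSIX sh compatible).
# have_cmd is a hypothetical helper name used only for this illustration.
have_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}

# Check each tool this project needs; keep going even if one is missing.
for tool in python3 pip3 hadoop spark-submit; do
    have_cmd "$tool" || true
done
```

For any tool reported as found, you can compare its version against the list above with python3 --version, hadoop version, and spark-submit --version.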
Start Hadoop: open a shell and run
$HADOOP_HOME/sbin/start-dfs.sh
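To check that HDFS actually came up, you can inspect the running Java processes with jps, which ships with the JDK. The sketch below assumes a single-node setup where start-dfs.sh launches a NameNode, a DataNode, and a SecondaryNameNode; your configuration may differ. The function name missing_hdfs_daemons is hypothetical and not part of this repo.

```shell
# Print each expected HDFS daemon that is absent from jps-style input on stdin.
# Assumption: a single-node setup that runs all three daemons locally.
missing_hdfs_daemons() {
    input=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode; do
        # -w matches whole words, so "NameNode" does not match "SecondaryNameNode"
        if ! printf '%s\n' "$input" | grep -qw "$daemon"; then
            echo "$daemon"
        fi
    done
}

# Pipe the real jps output through the check, if jps is installed.
if command -v jps >/dev/null 2>&1; then
    jps | missing_hdfs_daemons
fi
```

If nothing is printed, all three daemons were found. Any name that is printed is a daemon that did not appear in the jps output, in which case the Hadoop logs are the next place to look.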
Download this repo or clone it by running
git clone https://github.com/ebtelmarz/big_data_lsh_ensemble.git
Move inside the downloaded directory
cd big_data_lsh_ensemble/
Create a virtual environment
python3 -m venv my_env
source my_env/bin/activate
Execute the run.sh script by running
sh run.sh