This is a repository containing a MongoDB database benchmark, created for the Cloud Service Benchmarking (WiSe 21/22) course at TU Berlin. The goal of the benchmark is to answer the following research question:
RQ 1: How does the maximum throughput change while scaling out a MongoDB database containing a geospatial workload?
In more detail, we compare the maximum throughput of MongoDB databases containing geospatial data. Three scenarios, and thus three database architectures, are compared: without sharding, with 2 shards, and with 3 shards.
More about the benchmark itself, the SUT, background information, exact implementation details, and the analysis of the results can be found in the Benchmark Report (soon to come!).
When a benchmark is run, Terraform first deploys all VMs necessary for the database itself on GCP, and then deploys the benchmarking client on its own VM. The database is preloaded with the previously generated geospatial workload and, from the start of the benchmark, is queried with geospatial queries generated "on the go".
Every 30 seconds, 5 new client processes are created on the benchmarking client, simulating an increase in database reads. Additionally, every 30 seconds the latency of all requests sent during that period is checked. If more than 2% of those requests exceed the latency limit, or a timeout occurs, the benchmark ends.
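The geospatial queries generated during the benchmark could, for example, be MongoDB `$near` filters against a 2dsphere index. Below is a minimal, hypothetical sketch of building such a filter; the field name `location` and the default radius are assumptions, not necessarily what the benchmarking client uses:

```python
import random

def random_near_query(lon_range=(-180.0, 180.0), lat_range=(-90.0, 90.0),
                      max_distance_m=5000):
    """Build a MongoDB $near filter around a random point.

    GeoJSON uses [longitude, latitude] coordinate order.
    The 'location' field name is an assumption for illustration.
    """
    point = [random.uniform(*lon_range), random.uniform(*lat_range)]
    return {
        "location": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": point},
                "$maxDistance": max_distance_m,
            }
        }
    }
```

Such a filter could then be passed to a `find()` call on a collection with a 2dsphere index on the queried field.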
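The stop condition described above can be sketched as a small helper; the function name and the latency units are illustrative, not taken from the actual client code:

```python
def should_stop(latencies_ms, threshold_ms, max_violation_ratio=0.02):
    """Decide whether the benchmark should end after a 30-second interval.

    latencies_ms: latencies of all requests sent during that interval.
    Returns True if more than 2% of them exceeded the threshold,
    or if no requests completed at all (treated as a timeout).
    """
    if not latencies_ms:
        return True
    violations = sum(1 for lat in latencies_ms if lat > threshold_ms)
    return violations / len(latencies_ms) > max_violation_ratio
```

With this logic, the client keeps ramping up by 5 processes every 30 seconds until the check returns `True`.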
A results file containing the logs made throughout the whole benchmark is sent back to the host VM executing the benchmark. The results can be analyzed using the Jupyter Notebooks in the `data_analysis` folder.
Exact details on how to execute the benchmark yourself can be found below in the Execution section.
The results of the benchmark, including all plots, can be seen in the Benchmark Report (soon to come!). The plots based on the results can also be found in the `figures` folder.
- Log into your GCP account and create a new project called `mongodb-benchmark`. Use this project for all further instructions. Additionally, the account should have a resource quota of at least 60 total CPUs.
- Set up a host VM inside the Compute Engine and SSH into it. (Tested with an e2-standard-8 VM with default Ubuntu 20.04 LTS; no guarantees are given for other VM types (some might lack the resources to generate the workload) or other OSes.) If the host VM stops while generating the workload, it probably needs more RAM; in that case, one can simply download an already generated workload: https://tubcloud.tu-berlin.de/s/J4gyCJttfjFPgsW.
- Download all necessary packages: run `sudo apt-get update` and then `sudo apt-get install git make`.
- Install Terraform for Linux (follow: https://learn.hashicorp.com/tutorials/terraform/install-cli).
- Clone this repository (`sudo git clone https://github.com/Corgam/mongo_benchmark`).
- Create the GCP JSON credentials file (follow: https://cloud.google.com/iam/docs/creating-managing-service-account-keys) and save it as `credentials.json` in the root folder of the repository (where this `README.md` and `Makefile` are).
- Inside the repo's root folder, run `sudo make setup` to create the required SSH keys, generate the workload, and download all necessary Python packages. (The workload size can be changed in the `workload_generation/generation.py` file: change the `BIGGEST_POPULATION_RESTAURANTS` global constant to adjust the maximum possible number of restaurants per city.)
- Run the `sudo make mongo n=SHARDS_NUMBER` command to create MongoDB's VMs on GCP, with `n` equal to the number of shards the Mongo database will have (either `1`, `2`, or `3`). For example, `sudo make mongo n=2` will set up a Mongo database with 2 shards. (If Terraform prints an error and the deployment did not succeed, try again until Terraform reports success.)
- Wait until the VMs are ready.
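The workload generated during `sudo make setup` consists of geospatial data, with the number of restaurants per city capped by `BIGGEST_POPULATION_RESTAURANTS`. A hedged sketch of what such generation might look like — the function name, field names, and document shape are illustrative assumptions, not the actual `generation.py` code:

```python
import random

# Cap on restaurants per city; the real value lives in
# workload_generation/generation.py and may differ.
BIGGEST_POPULATION_RESTAURANTS = 1000

def generate_city_restaurants(city_center, city_radius_deg=0.1):
    """Generate a random number of restaurant documents (GeoJSON points)
    scattered around a city center given as (longitude, latitude)."""
    count = random.randint(1, BIGGEST_POPULATION_RESTAURANTS)
    lon, lat = city_center
    return [
        {
            "location": {
                "type": "Point",
                "coordinates": [
                    lon + random.uniform(-city_radius_deg, city_radius_deg),
                    lat + random.uniform(-city_radius_deg, city_radius_deg),
                ],
            }
        }
        for _ in range(count)
    ]
```

Documents of this shape can be bulk-inserted into MongoDB and indexed with a 2dsphere index to support geospatial queries.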
Before running the benchmark, make sure that all of Mongo's VMs are ready and can be connected to.
- Run the `sudo make benchmark` command to create a Benchmarking Client VM on GCP.
- Once the VM is ready, the benchmark will start immediately.
- Wait until the end of the benchmark (the progress can be seen in the terminal).
- The results file should be located in the root directory of this repository (named `Results_[TIME].txt`).
- Run `sudo make clean` to destroy all VMs on GCP.
- To analyze the results, one can use the Jupyter Notebooks located in the `data_analysis` folder. (The packages used in the Jupyter Notebooks will not be installed automatically; one needs to install them oneself.)
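The exact format of `Results_[TIME].txt` is defined by the benchmarking client. Purely as an illustration of the kind of processing the notebooks might do — assuming, hypothetically, one `timestamp,latency_ms` line per request with timestamps in Unix seconds — throughput per 30-second window could be computed like this:

```python
from collections import Counter

def throughput_per_window(lines, window_s=30):
    """Count requests per 30-second window and convert to requests/second.

    Assumes (hypothetically) one 'timestamp,latency_ms' line per request;
    the real results file format may differ.
    """
    counts = Counter()
    for line in lines:
        timestamp, _latency = line.strip().split(",")
        counts[int(float(timestamp)) // window_s] += 1
    return {window: n / window_s for window, n in counts.items()}
```

The maximum of these per-window values would then approximate the maximum throughput reached before the latency limit was exceeded.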