In this workshop (typically for Data Science and Big-Data sessions) we are going
to look at storing data in database systems, as opposed to storing them in text
files (e.g. .csv
).
The following assumes you have a base Linux (Ubuntu 22.04 LTS) installation. Other more recent versions may work too (let me know otherwise). If you are a windows fan, may I suggest you create a beefy VM (using VirtualBox or on the cloud) and install Ubuntu there.
You will need a ton of different software (configuration, etc.) for this workshop. At the very least you will need mysql, memcached, mongodb, cassandra, neo4j, and Python drivers for all of these. Normally, this would be a pain, but we will be using docker so everything is conveniently installed and set-up for you and placed in different docker containers (you're spoilt these days!).
The rest of the workshop (and practical) assumes you have docker installed. This can be achieved readily (on Ubuntu 22.04 LTS) with:
# Install dependencies
sudo apt-get install -y ca-certificates curl gnupg lsb-release
# Add repository key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Add repository for docker packages
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
# Update the apt package index
sudo apt-get update
# Install new packages
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
# Check docker service is running
sudo systemctl status docker
# Add docker role to your currrent user (to avoid sudo each time, but groups are updated at login)
sudo usermod -aG docker ${USER}
# Check all of the installed software works with:
sudo docker run hello-world
After adding the docker group to your user, you will need to log out and in
again to show the new group (check with the id
Bash command). An
alternative to logging in and out again is to use su - $USER
(this
effectively re-logs you).
The Ubuntu instructions to install docker where taken from the docker site.
Clone this repository using git clone git@github.com:jp-um/data_storage_workshop.git
. (If you do not have your ssh keys and/or account set up in github you can simply git clone https://github.com/jp-um/data_storage_workshop.git
instead). You should then run this workshop from the root directory of the repository. To install git on Ubuntu, simply sudo apt install git
.
We will use the official latest mysql containers for the relational databases part of the workshop.
This is easily achieved as follows:
# Get the latest (official) image for mysql
docker pull mysql/mysql-server:latest
# List all images available locally (you should see mysql)
docker images
# Start a container from the mysql/mysql-server:latest image
# You need to give this some time (takes longish to start)
docker run --name some-mysql -e MYSQL_ROOT_HOST=% -e MYSQL_ROOT_PASSWORD=my-secret-pwd -d mysql/mysql-server:latest
# This will tell you what is going on with your instance
docker logs some-mysql
# What are the docker containers running, you should see mysql
docker ps
# From the sql directory, import the structure of the database (and some
# data) from the insurance.sql script file.
docker exec -i some-mysql mysql -h0.0.0.0 -uroot -pmy-secret-pwd < sql/insurance.sql
# Use this if you want to run the client directly in the docker container
docker exec -it some-mysql mysql -h0.0.0.0 -uroot -pmy-secret-pwd
# Once in the mysql terminal, list all databases
show databases;
# Use our newly created db
use insurance;
# See the tables
show tables;
# See some data
select * from driver;
# To quit the mysql terminal
\q
Skip to the Jupyter Lab section, if you are only interested in running the mysql exercise.
During this workshop I will show you the basic concepts of NoSQL systems. We will use four NoSQL technologies, each representing a different data model -- Memcached, MongoDB, Cassandra, and Neo4j.
We will use (the latest) official docker instances to install the NoSQL database
engines. Follow the instructions presented in each section. Each of these steps
may take a few minutes. All the names of the containers will be prefixed with
some-
(this is an arbitrary decision).
To run memcached execute:
docker run --init --name some-memcache -p 21211:11211 -d memcached
-d
detaches the docker container once initialized.
To run mongodb execute:
docker run --name some-mongo -d mongo:4.4.6
Note that the version at the end (i.e. 4.4.6) is important, as the latest (5+) mongo image requires the CPU to have AVX support which may not be available if running via some virtualization software (e.g. Windows host with an Ubuntu Guest using VirtualBox).
To run cassandra execute:
docker run --name some-cassandra -d cassandra
To run neo4j execute:
docker run --name some-neo4j -e NEO4J_AUTH=neo4j/testtest -p 7474:7474 -p 7687:7687 -d neo4j
Note that the above switches off Neo4J authentication (by setting an environment variable in the container).
You should be able to see all running containers with docker ps
. Look at
the status column of the resulting output.
I have developed example usages of each NoSQL database, using Jupyter lab. This
is a browser-based IDE which uses Jupyter Notebooks. First, you should copy all
code (as Jupyter notebooks) from the sql
and nosql directories
in a local directory on your Ubuntu installation (alternatively it is easier to
git clone
this repository as suggested above). Remember to change the
/REPLACE/WITH/YOUR/LOCAL/PATH
path below. The last line --
jpebe/data_storage_workshop
-- identifies the image,
stored on docker hub, which we will be using. Do not change this.
docker run --init -ti --rm \
--name data-storage-workshop \
--link some-mysql:mysql \
--link some-memcache:memcache \
--link some-mongo:mongo \
--link some-cassandra:cassandra \
--link some-neo4j:neo4j \
-v /REPLACE/WITH/YOUR/LOCAL/PATH:/notebooks \
-p 8888:8888 \
jpebe/data_storage_workshop
This container is linked to all the containers we set up (and are running).
You should then be able to copy the Jupyter notebook URL from the terminal into
your browser (ctrl-click will open a browser automatically). Note that you may
need to change the hostname to 127.0.0.1
(the local host).
Explore and run the NoSQL and SQL notebooks, and understand each command does.
The databases we run within the containers have no volume mounted (from Ubuntu). This means that all data created during our exercises is stored in the container which gets lost when you stop running the container. If you want your data to persist, you have to mount volumes (on your Ubuntu installation) and use these as the data directory of the NoSQL engine. For example, to keep a copy of the neo4j data, simply run the docker container in the following manner:
docker run --name some-neo4j \
-e NEO4J_AUTH=neo4j/test \
-p 7474:7474 -p 7687:7687 \
-v $HOME/neo4j/data:/data \
-v $HOME/neo4j/logs:/logs \
-d neo4j
Where $HOME/neo4j/data
and $HOME/neo4j/logs
will be mounted in the
container under /data
and /logs
respectively.
Note that in the above workshop, changes to the jupyter notebooks will, however, be persisted as these are mounted from your own local directory.
To build this image from the dockerfile, all you need is (from the dockerfile directory)
docker build -t jpebe/data_storage_workshop .
You may want to reclaim some space on your Ubuntu installation. Remove all images and containers in the following way.
docker rm -f $(docker ps -a -q)
docker rmi -f $(docker images -a -q)
Pull requests (with fixes, even to this guide) will be recieved with appreciation and thanks.
I hope you enjoy this session!
Dr Jean-Paul Ebejer