This repository contains startup and configuration scripts to configure, deploy and run the DataGraft Platform as a multi-container Docker application using Docker Compose. Each release of a DataGraft Platform microservice is published as a pre-built Docker image on Docker Hub (https://hub.docker.com/u/datagraft/).
The DataGraft Platform consists of two main parts:
- DataGraft
- Grafterizer
Both parts are implemented as a microservice architecture consisting of a large number of sub-components, each running in its own Docker container. The DataGraft and Grafterizer user interfaces are integrated to provide a consistent user experience, and their microservices communicate with each other through REST. The individual components are illustrated in the figure below.
DataGraft consists of the following components:
- DataGraft Portal: The portal serves several functions. First, it provides the web-based front-end used by data publishers. Internally, it implements the data model and provides object-relational mapping between the model and the database back-end. It also handles communication with the database and manages the storage of uploaded files (Docker volume, or Amazon S3 in production). Finally, this component implements the connection to the data hosting and access services.
- DataGraft DBMS: This component represents the database management system (PostgreSQL) for the user data and asset catalogue. Data are stored separately (Docker volume, or Amazon RDS in production).
Grafterizer has the following sub-components:
- Grafterizer: Front-end component that implements the interactive GUI for data cleaning and data transformations.
- Grafterizer dispatch service: A server component for the Grafterizer front-end that handles request authentication on its behalf (in order to ensure security) and dispatches requests for input and output across the multiple services.
- Graftwerk: A sandboxed server component that executes the data cleaning and transformation scripts generated by the Grafterizer front-end on the input data sent by the dispatcher. Graftwerk uses a proprietary load-balancing component to distribute the traffic when many users use the transformation tool at the same time.
- Graftwerk cache: A FIFO cache service for the Grafterizer front-end requests to Graftwerk.
- Vocabulary manager: A simple RDF vocabulary management service for the imported vocabularies used in the RDF mapping in the front-end. It supports importing vocabularies and searching through their concepts.
- Jarfter: A web service component for compiling executable JARs for transformations generated by the Grafterizer front end.
This repository contains the following files:
- docker-compose.yml: Docker Compose definition file to run the DataGraft Platform as a multi-container Docker application
- startup.sh: startup script that should be executed the first time you run the DataGraft Platform in order to create an admin user account
To deploy and run the DataGraft Platform:
docker-compose pull
docker-compose up
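For longer-running deployments it can be convenient to start the services in the background and then check their state. The following is a minimal sketch and not part of this repository: the deploy function and the COMPOSE variable are hypothetical names introduced for illustration, while pull, up -d and ps are standard Docker Compose subcommands.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with this repository.
# COMPOSE can be overridden (e.g. COMPOSE="echo docker-compose")
# to print the commands instead of running them.
COMPOSE="${COMPOSE:-docker-compose}"

deploy() {
  # Fetch the pre-built images from Docker Hub.
  $COMPOSE pull || return 1
  # Start all services in detached (background) mode.
  $COMPOSE up -d || return 1
  # List the containers and their current state.
  $COMPOSE ps
}
```

After sourcing this snippet from the repository root, calling deploy runs the three steps in order and stops at the first one that fails.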
The default docker-compose.yml file is configured to deploy and run the DataGraft Platform on localhost using Postgres. Edit the environment variables for each service to change the deployment settings:
- POSTGRES_PASSWORD: password of the admin user (defined in startup.sh)
- POSTGRES_DB: name of database (defined in startup.sh)

- DATABASE_URL: URL of the Postgres database
- DATABASE_HOST: host of the Postgres database
- DATABASE_PASSWORD: password of the admin user (defined in startup.sh)
- RAILS_ENV: Rails environment [development | staging | test]
- SECRET_KEY_BASE: secret key base
- GRAFTERIZER_PUBLIC_PATH: URL of Grafterizer service
- GRAFTWERK_URI: URI of Graftwerk service
- DATAGRAFT_DEPLOY_HOST: host of the DataGraft Portal service
- DATAGRAFT_DEPLOY_PORT: port of the DataGraft Portal service

- COOKIE_STORE_SECRET: cookie store secret
- OAUTH2_CLIENT_ID: Grafterizer UID (retrieved when configuring/starting Grafterizer in DataGraft)
- OAUTH2_CLIENT_SECRET: Grafterizer secret key (retrieved when configuring/starting Grafterizer in DataGraft)
- GRAFTWERK_URI: URI of the Graftwerk service
- GRAFWERK_CACHE_URI: URI of the Graftwerk cache service
- DATAGRAFT_URI: public URI of the DataGraft Portal service
- CORS_ORIGIN: public URI of the backend server
- PUBLIC_CALLBACK_SERVER: same as DATAGRAFT_URI by default
- PUBLIC_OAUTH2_SITE: URI of the OAuth2 server
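Secret values such as SECRET_KEY_BASE and COOKIE_STORE_SECRET are typically set to long random strings. One possible way to generate such values is sketched below. It assumes that the services accept any sufficiently long random hexadecimal string, which this document does not confirm, and random_hex is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print N random bytes as 2*N hexadecimal characters.
random_hex() {
  head -c "$1" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}

# Assumed lengths; adjust to whatever the services actually require.
SECRET_KEY_BASE="$(random_hex 64)"
COOKIE_STORE_SECRET="$(random_hex 32)"

# Print the values so they can be copied into docker-compose.yml.
echo "SECRET_KEY_BASE=$SECRET_KEY_BASE"
echo "COOKIE_STORE_SECRET=$COOKIE_STORE_SECRET"
```

The printed values can then be pasted into the environment section of the corresponding service.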
The first time you run the DataGraft Platform you will need to create an admin user account:
startup.sh
Edit the startup.sh script if you want to change the default login name 'administrator@datagraft.net' and the default password 'password' for the admin user account. Make sure that the database setting 'datagraft-dev' (default) matches the environment settings in the docker-compose.yml file.
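A quick way to check that the database name matches is sketched below. It is only an illustration: db_name_matches is a hypothetical helper, and it assumes that docker-compose.yml sets the name on a POSTGRES_DB line written either as "POSTGRES_DB: name" or as "POSTGRES_DB=name".

```shell
#!/bin/sh
# Hypothetical helper: succeed if FILE sets POSTGRES_DB to the database name DB.
# Accepts both "POSTGRES_DB: name" and "POSTGRES_DB=name", optionally quoted.
db_name_matches() {
  file="$1"
  db="$2"
  grep -Eq "POSTGRES_DB[=:][[:space:]]*[\"']?${db}([^[:alnum:]_-]|\$)" "$file"
}

# Only run the check when executed from the repository root.
if [ -f docker-compose.yml ]; then
  db_name_matches docker-compose.yml datagraft-dev \
    || echo "docker-compose.yml does not use database 'datagraft-dev'" >&2
fi
```

If you change the database name in startup.sh, pass the new name as the second argument instead of datagraft-dev.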
The default docker-compose.yml file is configured to deploy and run the DataGraft Platform on localhost using Postgres. To configure it for cloud deployment on Amazon S3 with Amazon RDS, you will need to:
- Remove the database service from the docker-compose.yml file.
- Change the environment variables for the datagraft-portal service.
Instead of running the startup.sh script to add a Postgres administrator user, you need to set up an AWS Identity and Access Management (IAM) user for the DataGraft Platform.
Remove the database entry under services.
Remove dependencies on the Postgres service under links:
- database:database-host
Remove execution of the startup script under commands:
bash startup.sh
Remove the following environment variables that are used for Postgres:
- DATABASE_URL: URL of the Postgres database
- DATABASE_HOST: host of the Postgres database
- DATABASE_PASSWORD: password of the admin user (defined in startup.sh)
Add the following environment variables used for AWS S3 and RDS:
- AWS_RDS_DB_NAME: name of the database
- AWS_RDS_DB_USERNAME: name of the database user
- AWS_RDS_DB_PASSWORD: password for the database
- AWS_RDS_DB_HOST: RDS host running the database
- AWS_S3_BUCKET_NAME: S3 bucket name
- AWS_S3_ACCESS_KEY_ID: S3 access key ID
- AWS_S3_ACCESS_KEY_SECRET: S3 access key secret
- AWS_S3_REGION: S3 region
- SES_SMTP_USERNAME: SMTP username
- SES_SMTP_PASSWORD: SMTP password
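Before starting a cloud deployment it may help to verify that none of these values is missing. The sketch below assumes that you export the values in your shell and reference them from docker-compose.yml through Docker Compose variable substitution, rather than hard-coding them in the file; require_vars and DATAGRAFT_AWS_VARS are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: report every named variable that is unset or empty
# and return non-zero if at least one is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# The variables listed above for the AWS S3 and RDS configuration.
DATAGRAFT_AWS_VARS="AWS_RDS_DB_NAME AWS_RDS_DB_USERNAME AWS_RDS_DB_PASSWORD \
AWS_RDS_DB_HOST AWS_S3_BUCKET_NAME AWS_S3_ACCESS_KEY_ID \
AWS_S3_ACCESS_KEY_SECRET AWS_S3_REGION SES_SMTP_USERNAME SES_SMTP_PASSWORD"

# Example usage:
#   require_vars $DATAGRAFT_AWS_VARS && docker-compose up
```

The check only tests that each variable is non-empty; it does not validate the credentials themselves.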
To report bugs, ask questions or start discussions, please use the GitHub Issues feature.
- Nikolay Nikolov (main contact person)
- Bjørn Marius Von Zernichow
- Brian Elvesæter
- Titi Roman
- Steffen Dalgard
- Antoine Pultier
- Dina Sukhobok
- Christian Rotari
- Ana Tarita
- Nivethika Mahasivam
- Dennis Gan
- Håvard Heitlo Holm
Available under the Eclipse Public License (v1.0).