Demo of MIP Data Factory featuring Airflow

This is a demonstration of the MIP Data Factory focusing on its workflow application, Airflow.

The demonstration runs inside a Vagrant virtual machine and showcases ETL pipelines for medical data.

Installation

  • install Ansible version 2.2.0 or later. On Ubuntu, you can use the script ./common/scripts/bootstrap.sh

  • install VirtualBox version 5.0 or later

  • install Vagrant version 1.8.5 or later

  • install the vagrant-hostmanager plugin for Vagrant:

      vagrant plugin install vagrant-hostmanager
    
  • start the virtual machine with Vagrant. You will need at least 5 GB of RAM available for the VM.

      vagrant up
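
Before starting the VM, it can help to confirm that the installed tools meet the minimum versions listed above. The following is a minimal sketch, not part of the project: the helper names version_ge, check_tool and first_version are hypothetical, and the comparison assumes a `sort` that supports the `-V` (version sort) option, as GNU coreutils does.

```shell
#!/bin/sh
# Sketch: check the prerequisites listed above against their minimum versions.
# All function names here are illustrative and not provided by the project.

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME MINIMUM INSTALLED: report whether INSTALLED satisfies MINIMUM.
check_tool() {
    if [ -z "$3" ]; then
        echo "$1: not found (need $2 or later)"
        return 1
    fi
    if version_ge "$3" "$2"; then
        echo "$1 $3: ok"
    else
        echo "$1 $3: too old (need $2 or later)"
        return 1
    fi
}

# first_version: keep the first dotted version number found on standard input.
first_version() {
    grep -Eo '[0-9]+(\.[0-9]+)+' | head -n 1
}

status=0
check_tool ansible 2.2.0 "$(ansible --version 2>/dev/null | first_version)" || status=1
check_tool virtualbox 5.0 "$(VBoxManage --version 2>/dev/null | first_version)" || status=1
check_tool vagrant 1.8.5 "$(vagrant --version 2>/dev/null | first_version)" || status=1

if [ "$status" -ne 0 ]; then
    echo "Some prerequisites are missing or too old."
fi
```

The script only reports; it does not install or upgrade anything.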
    

Troubleshooting

VirtualBox is complaining that the installation is incomplete

After upgrading the Linux kernel in your system you may encounter this message when running a Vagrant command:

The provider 'virtualbox' that was requested to back the machine
'airflow' is reporting that it isn't usable on this system. The
reason is shown below:

VirtualBox is complaining that the installation is incomplete. Please
run `VBoxManage --version` to see the error message which should contain
instructions on how to fix this error.

To fix it, rebuild the VirtualBox kernel module with this command:

  sudo apt-get install --reinstall virtualbox-dkms linux-headers-generic

Usage

The virtual machine should start and install Airflow.

You can see Airflow running at localhost:14080

Marathon can be accessed on localhost:15080
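
To check from the host that both web interfaces are answering, a small probe such as the one below may be useful. It is a sketch under assumptions: the ports come from this README, but the plain `http` scheme, the 5-second timeout and the function names service_url and is_up are illustrative choices, and `curl` must be installed on the host.

```shell
#!/bin/sh
# Sketch: probe the Airflow and Marathon web UIs from the host machine.

# service_url NAME: print the URL of a known service.
# Ports are taken from this README; the http scheme is an assumption.
service_url() {
    case "$1" in
        airflow)  echo "http://localhost:14080" ;;
        marathon) echo "http://localhost:15080" ;;
        *)        echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# is_up URL: succeeds when the URL answers without an HTTP error within 5 seconds.
is_up() {
    curl --silent --fail --output /dev/null --max-time 5 "$1"
}

for service in airflow marathon; do
    url=$(service_url "$service")
    if is_up "$url"; then
        echo "$service is reachable at $url"
    else
        echo "$service is not reachable at $url (the VM may still be provisioning)"
    fi
done
```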

Testing

Example data is provided in the /data/demo folder inside the VM, but you need MATLAB installed in the virtual machine to execute the SPM 12 based preprocessing pipelines.

For developers

Sources for Airflow and related projects

Deployment Organisation License Management Continuous integration
ansible-airflow CHUV License
Data Factory Organisation License Planning Continuous integration
data-tracking CHUV License Codacy Badge CircleCI
mri-meta-db CHUV License Codacy Badge CircleCI
mri-preprocessing-pipeline CHUV License
airflow-imaging-plugins CHUV License Codacy Badge
data-factory-airflow-dags CHUV License Codacy Badge

Configuration for Ansible inventory

The Ansible inventory controls what software is installed and how it is configured.

It is organised by hosts (servers) and groups.

Here, we have the following organisation:

  • demo: the target host, running inside a Vagrant virtual machine
  • managed: a group containing demo, indicating that the server is managed by Ansible and should receive a default configuration and a set of base software packages
  • control: a group containing demo, indicating that this server is used to perform operations affecting the whole cluster (here we have a 'cluster' of one machine)
  • zookeeper, mesos-mixed: groups that are used to define where and how the Mesos stack is deployed
  • airflow: a group that is used to define which applications should be deployed by Marathon
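
The organisation above could be written down as an INI-style Ansible inventory along the following lines. This is only an illustration, not the project's actual inventory file: it is written to a temporary file, it assumes that every group contains the single demo host (the text states this only for managed and control), and the list_groups helper is hypothetical.

```shell
#!/bin/sh
# Sketch: an INI-style inventory with the host and groups described above.
# Hypothetical layout; the project's real inventory may differ.
inventory=$(mktemp)

cat > "$inventory" <<'EOF'
# demo is the only host; each group below is assumed to contain it.
[managed]
demo

[control]
demo

[zookeeper]
demo

[mesos-mixed]
demo

[airflow]
demo
EOF

# list_groups FILE: print the group names declared in an INI inventory.
list_groups() {
    sed -n 's/^\[\(.*\)\]$/\1/p' "$1"
}

list_groups "$inventory"
```

Listing the groups this way is just a quick sanity check that the file declares what the description says.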

Inventory configuration