
Commit 1ff02be: Merge pull request #84 from thehyve/separate-transformation

Pipeline refactoring

gijskant authored Oct 15, 2019
2 parents 1099f82 + 09d4b02
Showing 202 changed files with 849 additions and 4,881 deletions.
8 changes: 4 additions & 4 deletions .gitignore
# System Files
.DS_Store
Thumbs.db

.done-*

luigi.cfg
email_config.cfg
luigi-state.pickle
e2e-temp
2 changes: 1 addition & 1 deletion .travis.yml
install:
- pip install -r requirements/dev_requirements.txt

script:
- ./scripts/run_tests.sh

after_success:
- codecov
172 changes: 122 additions & 50 deletions README.md
# PMC conversion

[![Build status](https://travis-ci.org/thehyve/pmc-conversion.svg?branch=master)](https://travis-ci.org/thehyve/pmc-conversion/branches)
[![codecov](https://codecov.io/gh/thehyve/pmc-conversion/branch/master/graph/badge.svg)](https://codecov.io/gh)
[![license](https://img.shields.io/github/license/thehyve/pmc-conversion.svg?style=flat-square)](https://github.com//thehyve/pmc-conversion/blob/master/LICENSE)

Data transformation and loading pipeline. It uses the [Luigi](https://github.com/spotify/luigi) Python package for job handling
and the [python_csr2transmart](https://github.com/thehyve/python_csr2transmart) package for the transformation of Central Subject Registry data.

It loads data into the [tranSMART](https://github.com/thehyve/transmart-core) platform using the [transmart-copy](https://github.com/thehyve/transmart-core/tree/dev/transmart-copy) tool
and into [cBioPortal](https://github.com/cBioPortal/cbioportal) using the [cbioportalImporter.py](https://docs.cbioportal.org/5.1-data-loading/data-loading/data-loading-for-developers) script.

## Configuration

There are two types of configuration files:
- pipeline configuration: the [luigi.cfg](#luigi-configuration) and [email_config.cfg](#email-configuration) files, located in the repository root;
- [transformation configuration](https://github.com/thehyve/python_csr2transmart#usage) files,
located in `transformation_config_dir` (defined in [luigi.cfg](#luigi-configuration)).

### Luigi configuration

The Luigi configuration can be created by copying `luigi.cfg-sample`:

``` bash
cp luigi.cfg-sample luigi.cfg
```

Config options overview:

| Variable | Section | Default value | Description |
|---------------------------|--------------------------|------------------------------|-------------|
| logging_conf | core | logging.cfg | Name of the logging configuration file. |
| db_connection | task_history | sqlite://pmc-luigi-db.sqlite | Database to store the task history. |
| record_task_history | scheduler | True | Whether to store the task history (True or False). Requires db_connection under task_history to be set if True. |
| state_path | scheduler | luigi-state.pickle | Path to a pickle file with the current state of the pipeline. |
| remove_delay | scheduler | 86400 (1 day) | How long tasks are remembered in the task dashboard, in seconds. |
| drop_dir | GlobalConfig | /home/drop/drop_zone | Path to the drop zone directory. |
| data_repo_dir | GlobalConfig | /home/pmc/data | Path to the git repository for input and staging data. By default the pipeline creates the input, staging, load logs and intermediate file directories inside this repository. |
| working_dir | GlobalConfig | /home/pmc/working_dir | Directory used as working directory, similar to /tmp. |
| transformation_config_dir | GlobalConfig | /home/pmc/config | Directory with the configuration files required for the transformation. |
| load_logs_dir_name | GlobalConfig | load_logs | Name of the directory to store loading logs. |
| transmart_copy_jar | GlobalConfig | /home/pmc/libs | Location of the transmart-copy jar file used for loading data into tranSMART. |
| study_id | GlobalConfig | CSR_STUDY | ID of the study in tranSMART. |
| top_node | GlobalConfig | \Central Subject Registry\ | Name of the top ontology tree node displayed in tranSMART. |
| PGHOST | GlobalConfig | localhost | tranSMART database host. |
| PGPORT | GlobalConfig | 5432 | tranSMART database port. |
| PGDATABASE | GlobalConfig | transmart | tranSMART database name. |
| PGUSER | GlobalConfig | tm_cz | User for loading data into tranSMART. |
| PGPASSWORD | GlobalConfig | tm_cz | Password of that user. |
| transmart_loader | resources | 1 | Number of workers Luigi has access to. |
| keycloak_url | TransmartApiTask | | URL of the Keycloak instance used to get access to tranSMART, e.g. https://keycloak.example.com/auth/realms/transmart-dev |
| transmart_url | TransmartApiTask | | URL of the tranSMART API V2. |
| gb_backend_url | TransmartApiTask | | URL of the Glowing Bear Backend API. |
| client_id | TransmartApiTask | | Keycloak client ID. |
| offline_token | TransmartApiTask | | Offline token used to request an access token in order to communicate with the GB Backend and tranSMART REST APIs. |
| docker_image | CbioportalDataValidation | | Name of the docker image used for cBioPortal data validation. |
| docker_image | CbioportalDataLoading | | Name of the docker image used for cBioPortal data loading. |
| server_name | CbioportalDataLoading | | Name of the cBioPortal server. |
| PGADMINUSER | E2eTest | postgres | Test database user. |
| PGADMINPASSWORD | E2eTest | password | Test database user password. |
| parent_drop_dir | E2eTest | | Test data drop zone directory. |
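
A minimal `luigi.cfg` sketch based on the table above. The Keycloak realm URL follows the example from the table; the tranSMART URL, GB Backend URL, client ID and jar file name are hypothetical placeholders. See `luigi.cfg-sample` for the full set of options.

``` ini
[GlobalConfig]
drop_dir = /home/drop/drop_zone
data_repo_dir = /home/pmc/data
working_dir = /home/pmc/working_dir
transformation_config_dir = /home/pmc/config
# hypothetical jar location; point this at your transmart-copy jar
transmart_copy_jar = /home/pmc/libs/transmart-copy.jar
study_id = CSR_STUDY
top_node = \Central Subject Registry\
PGHOST = localhost
PGPORT = 5432
PGDATABASE = transmart
PGUSER = tm_cz
PGPASSWORD = tm_cz

[TransmartApiTask]
keycloak_url = https://keycloak.example.com/auth/realms/transmart-dev
transmart_url = https://transmart.example.com
gb_backend_url = https://gb-backend.example.com
client_id = transmart-client
offline_token = <offline token, see below>
```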

#### Offline token

The application requires an offline token, which it exchanges for an access token to communicate with tranSMART and the GB Backend.

Before using the command you have to substitute the words in uppercase with proper values.

The value of the `refresh_token` field in the response is the offline token.
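
As an illustration only, a request for an offline token against a Keycloak realm typically has this shape, where all uppercase values are placeholders to be replaced with your own:

``` bash
curl -s -X POST \
  'https://KEYCLOAK_HOST/auth/realms/REALM/protocol/openid-connect/token' \
  -d 'grant_type=password' \
  -d 'client_id=CLIENT_ID' \
  -d 'scope=offline_access' \
  -d 'username=USERNAME' \
  -d 'password=PASSWORD'
```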

### Email configuration

The email configuration can be created by copying `email_config.cfg-sample`:

``` bash
cp email_config.cfg-sample email_config.cfg
```

Config options overview:

| Variable | Section | Default value | Description |
|----------|---------|-----------------------------|---------------------------------------------------------------|
| log_file | global | python.log | Name of the logging file. |
| receiver | email | | Email address of the receiver; can be a comma-separated list. |
| sender | email | | Email address of the sender. |
| prefix | email | [CSR Data Loading Pipeline] | Prefix for the subject line of error emails. |
| port | smtp | 587 | Port to use for sending emails. |
| username | smtp | | Username for the email client; can be left empty when not needed. |
| password | smtp | | Password for the email client. |
| host | smtp | smtp.gmail.com | Host of the email client. |
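
A hypothetical `email_config.cfg` sketch matching the table above (addresses and credentials are made-up examples):

``` ini
[global]
log_file = python.log

[email]
receiver = datateam@example.com
sender = pipeline@example.com
prefix = [CSR Data Loading Pipeline]

[smtp]
host = smtp.gmail.com
port = 587
username = pipeline@example.com
password = secret
```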

## Input data
The clinical data files should be provided in the drop directory (`drop_dir` in `luigi.cfg`), together with a folder called `NGS` containing all omics data.
Each file in the drop directory has to be accompanied by a SHA-1 checksum file.

File naming convention:
- File name: `<filename>.<extension>`
- Sha1 file: `<filename>.<extension>.sha1`

E.g. `data.txt` has to have `data.txt.sha1` next to it, containing the SHA-1 hash of the data file.
Since SHA-1 hashes are 40 characters long, the rest of the file is ignored:

`1625be750dab24057c4c82d62d27298236ebb04c diagnosis.txt`
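
Such a checksum file can be generated with the standard `sha1sum` tool, whose output format matches the convention above:

``` bash
sha1sum diagnosis.txt > diagnosis.txt.sha1
```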

For more information, see the [CSR data model](https://github.com/thehyve/python_csr2transmart#data-model) description
and an example of [input data files](./test_data/dropzone/E2E_TEST_DATA).

## Usage

1. Make sure you have `luigi.cfg` and `email_config.cfg` properly configured (see the [configuration section](#configuration))
and that the input data is in the proper directory (see the [input data section](#input-data)).

2. Install dependencies:

The pipeline requires Python >= 3.6.

``` bash
python -m pip install -r requirements/requirements.txt
```

3. Start the Luigi daemon (the pipeline workers communicate with the daemon for work):

``` bash
luigid
```

4. Start the full pipeline:

``` bash
./scripts/run.sh
```

### Other available scripts

To load data into tranSMART only:

``` bash
./scripts/load_transmart_data.sh
```

To load data into cBioPortal only:

``` bash
./scripts/load_cbioportal_data.sh
```

To load data into both systems:

``` bash
./scripts/load_data.sh
```

The pipeline creates files whose names start with `.done-`.
Such a file is created for each successfully finished task of the pipeline.
To force tasks to be executed again, remove these files:
``` bash
./scripts/remove_done_files.sh
```

This also happens as part of the load scripts when you run a data load task only.
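
A minimal manual equivalent, assuming the marker files live in the repository root:

``` bash
rm -f .done-*
```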

## Test

To run e2e tests:
``` bash
./scripts/e2e_transmart_only.sh
```

To run other tests:
``` bash
./scripts/run_tests.sh
```

## License

Copyright (c) 2018, 2019 The Hyve B.V.

The PMC conversion pipeline is licensed under the MIT License. See the file [LICENSE](LICENSE).