This repository has been archived by the owner on Dec 18, 2023. It is now read-only.

Commit

Merge pull request #100 from thehyve/remove-cbioportal-tasks
Remove cBioPortal data loading tasks and configuration.
ewelinagr authored Apr 22, 2021
2 parents 6927801 + c6ec776 commit ef4ff85
Showing 12 changed files with 12 additions and 229 deletions.
55 changes: 8 additions & 47 deletions README.md
@@ -7,8 +7,7 @@
Data transformation and loading pipeline. It uses the [Luigi](https://github.com/spotify/luigi) Python package for job handling
and the [python_csr2transmart](https://github.com/thehyve/python_csr2transmart) package for the transformation of Central Subject Registry data.

It loads data to [tranSMART](https://github.com/thehyve/transmart-core) platform using [transmart-copy](https://github.com/thehyve/transmart-core/tree/dev/transmart-copy) tool
and to [cBioPortal](https://github.com/cBioPortal/cbioportal) using [cbioportalImporter.py](https://docs.cbioportal.org/5.1-data-loading/data-loading/data-loading-for-developers) script.
It loads data to [tranSMART](https://github.com/thehyve/transmart-core) platform using [transmart-copy](https://github.com/thehyve/transmart-core/tree/dev/transmart-copy) tool.

For production deployment instructions, start with the [deployment](#deployment) section.

@@ -49,16 +48,12 @@ Config options overview:
| PGDATABASE | GlobalConfig | transmart | tranSMART database name. |
| PGUSER | GlobalConfig | biomart_user | User to use for loading data to tranSMART. |
| PGPASSWORD | GlobalConfig | biomart_user | User password. |
| disable_cbioportal_task | LoadDataFromNewFilesTask | true | Skip loading data into cBioPortal. |
| transmart_loader | resources | 1 | Amount of workers luigi has access to. |
| keycloak_url | TransmartApiTask | https://keycloak.example.com/auth/realms/example | URL to Keycloak instance used to get access to tranSMART, e.g. https://keycloak.example.com/auth/realms/transmart-dev |
| transmart_url | TransmartApiTask | http://localhost:8081 | URL to tranSMART API V2. |
| gb_backend_url | TransmartApiTask | http://localhost:8083 | URL to Glowing Bear Backend API. |
| client_id | TransmartApiTask | transmart-client | Keycloak client ID. |
| offline_token | TransmartApiTask | | Offline token used to request an access token in order to communicate with Gb Backend and tranSMART REST APIs. |
| docker_image | CbioportalDataValidation | | Name of docker image to use during cBioPortal data validation. |
| docker_image | CbioportalDataLoading | | Name of docker image to use during cBioPortal data loading. |
| server_name | CbioportalDataLoading | | Name of the cBioPortal server. |
| offline_token | TransmartApiTask | | Offline token used to request an access token in order to communicate with Gb Backend and tranSMART REST APIs. |
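
As a minimal sketch of how the `PG*` options above are typically consumed, the snippet below resolves them from the environment with the defaults listed in the table. The helper `pg_connection_settings` is hypothetical, for illustration only, and is not part of the pipeline:

```python
import os

# Hypothetical helper: resolve tranSMART database settings from the
# environment, falling back to the defaults listed in the table above.
def pg_connection_settings(environ=None):
    env = os.environ if environ is None else environ
    return {
        'database': env.get('PGDATABASE', 'transmart'),
        'user': env.get('PGUSER', 'biomart_user'),
        'password': env.get('PGPASSWORD', 'biomart_user'),
    }
```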

#### Offline token

@@ -118,19 +113,12 @@ Config options overview:
### Transformation configuration
Configuration files for TranSMART and cBioPortal must be placed in `transformation_config_dir`.
Specifically, the following are expected:
Configuration files for TranSMART must be placed in `transformation_config_dir`.
Specifically, `sources_config.json` and `ontology_config.json`, described in [python_csr2transmart](https://github.com/thehyve/python_csr2transmart#usage).
- `sources_config.json` and `ontology_config.json`, described in [python_csr2transmart](https://github.com/thehyve/python_csr2transmart#usage);
the files reference the input data and need to be customized accordingly,
- `portal.properties` file for cBioPortal;
the file must match the mounted cBioPortal image version and server environment,
- `cbioportal_db_info` folder, containing configuration files for the cBioPortal database
(`cancertypes.json`, `genes.json`, `genesaliases.json`);
the files must match the mounted cBioPortal image version.
The files reference the input data and need to be customized accordingly.
Sample configuration files are provided in [test_data/test_data_NGS/config](https://github.com/thehyve/pmc-conversion/tree/master/test_data/test_data_NGS/config).
Be aware that the provided `portal.properties` is a minimal example, and must be replaced with a server-specific version to allow cBioPortal to run.
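
As a sketch of a pre-flight check for the two expected configuration files — the helper below is illustrative and not part of the pipeline — one could verify their presence and that they parse as JSON:

```python
import json
import os

# Files csr2transmart expects in transformation_config_dir.
REQUIRED_FILES = ('sources_config.json', 'ontology_config.json')


def check_transformation_config(config_dir):
    """Return a list of problems; an empty list means the directory looks usable."""
    problems = []
    for name in REQUIRED_FILES:
        path = os.path.join(config_dir, name)
        if not os.path.isfile(path):
            problems.append('missing: ' + name)
            continue
        try:
            with open(path) as f:
                json.load(f)
        except ValueError as e:
            problems.append('invalid JSON in {}: {}'.format(name, e))
    return problems
```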
## Input data
@@ -196,39 +184,12 @@ When starting the full pipeline, it executes the following tasks:
5. Calls the after_data_loading_update tranSMART API endpoint to clear and rebuild the application cache.
The tranSMART loading log is committed using git.
6. If cBioPortal task is not disabled:
1. Reads CSR files and transforms the data to patient and sample files to be imported into cBioPortal.
2. Validates created cBioPortal staging files with cBioPortal validator. To validate data,
the pipeline starts a Docker container using a pre-installed image (cbioportal-hg38:1.10.2).
In this container, it will run the cBioPortal validation code. The image contains specific configurations to connect
to the appropriate database. The progress is committed.
3. Loads the cBioPortal data, if data passes the validation. To load data,
the pipeline starts another Docker container using the same pre-installed image.
In this container, it will run the cBioPortal importer code.
After importing, the pipeline restarts the Docker container running the web server.
The progress is committed.
7. If not all tasks complete successfully, an email will be sent to the configured recipients,
6. If not all tasks complete successfully, an email will be sent to the configured recipients,
containing the full error report.
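
The sequencing of the steps above is driven by an explicit dependency tree: each task lists its `required_tasks`, mirroring the pattern in `luigi-pipeline/main.py`. A minimal, Luigi-free sketch of that wiring, using a simplified chain of the real task names, looks like:

```python
class Task:
    """Minimal stand-in for a pipeline task with explicit dependencies."""

    def __init__(self, name):
        self.name = name
        self.required_tasks = []


def build_task_dependency_tree():
    # Simplified chain: update files -> transform -> load -> API calls.
    update = Task('UpdateDataFiles')
    transform = Task('TransmartDataTransformation')
    transform.required_tasks = [update]
    load = Task('TransmartDataLoader')
    load.required_tasks = [transform]
    api_calls = Task('TransmartApiCalls')
    api_calls.required_tasks = [load]
    yield from (update, transform, load, api_calls)


order = [task.name for task in build_task_dependency_tree()]
```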
### Other available scripts
To load data to transmart only:
``` bash
./scripts/load_transmart_data.sh
```
To load data to cbioportal only:
``` bash
./scripts/load_cbioportal_data.sh
```
To load data to both systems:
To load data to TranSMART:
``` bash
./scripts/load_data.sh
```
@@ -247,7 +208,7 @@ To force execution of tasks again you need to remove these files:
### E2e tests
The `e2e_transmart_only` test will run all the pipeline tasks, except the cBioPortal part.
The `e2e_transmart_only` test will run all the pipeline tasks.
When running the test, data from the `drop_dir` directory configured in `luigi.cfg`
will be transformed and loaded to the currently configured tranSMART database.
This will also trigger the after_data_loading_update tranSMART API call.
6 changes: 3 additions & 3 deletions luigi-pipeline/__init__.py
@@ -1,5 +1,5 @@
from .main import LoadDataFromNewFilesTask, TransmartDataLoader, CbioportalDataLoading, \
Sources2CsrTransformation, GitCommit, UpdateDataFiles, TransmartDataTransformation, CbioportalDataTransformation, \
CbioportalDataValidation, TransmartApiCalls
from .main import LoadDataFromNewFilesTask, TransmartDataLoader, \
Sources2CsrTransformation, GitCommit, UpdateDataFiles, TransmartDataTransformation, \
TransmartApiCalls
import logging
logging.getLogger(__name__).addHandler(logging.NullHandler())
130 changes: 0 additions & 130 deletions luigi-pipeline/main.py
@@ -3,10 +3,8 @@
import os
import shutil
import threading
import time

import luigi
from csr2cbioportal import csr2cbioportal
from csr2transmart import csr2transmart
from sources2csr.sources2csr import sources2csr

@@ -17,8 +15,6 @@

logger = logging.getLogger('luigi')

CBIOPORTAL_DIR_NAME = 'cbioportal'


class GlobalConfig(luigi.Config):
drop_dir = luigi.Parameter(description='Directory files get uploaded to.')
@@ -53,28 +49,18 @@ def transmart_staging_dir(self):
def load_logs_dir(self):
return os.path.join(self.data_repo_dir, self.load_logs_dir_name)

@property
def cbioportal_staging_dir(self):
return os.path.join(self.data_repo_dir, 'staging', CBIOPORTAL_DIR_NAME)

@property
def transmart_load_logs_dir(self):
return os.path.join(self.load_logs_dir, 'transmart')

@property
def cbioportal_load_logs_dir(self):
return os.path.join(self.load_logs_dir, CBIOPORTAL_DIR_NAME)


config = GlobalConfig()
git_lock = threading.RLock()
repo = get_git_repo(config.data_repo_dir)

os.makedirs(config.input_data_dir, exist_ok=True)
os.makedirs(config.cbioportal_staging_dir, exist_ok=True)
os.makedirs(config.transmart_staging_dir, exist_ok=True)
os.makedirs(config.transmart_load_logs_dir, exist_ok=True)
os.makedirs(config.cbioportal_load_logs_dir, exist_ok=True)


def calc_done_signal_content(file_checksum_pairs):
@@ -140,20 +126,6 @@ def run(self):
config.top_node)


class CbioportalDataTransformation(BaseTask):
"""
Task to transform data from CSR intermediate files and NGS input study files to cBioPortal importer format
"""
def run(self):
clinical_input_file = os.path.join(config.working_dir)
ngs_dir = os.path.join(config.input_data_dir, 'NGS')
if not os.path.isdir(ngs_dir):
ngs_dir = None
csr2cbioportal.csr2cbioportal(input_dir=clinical_input_file,
ngs_dir=ngs_dir,
output_dir=config.cbioportal_staging_dir)


class TransmartDataLoader(ExternalProgramTask):
"""
Task to load data to tranSMART
@@ -206,88 +178,6 @@ def run(self):
reload_obj.scan_subscription_queries()


class CbioportalDataValidation(ExternalProgramTask):
"""
Task to validate data for cBioPortal
This requires:
1. Docker installed
2. cBioPortal image in Docker
3. pmc user added to group 'docker'
4. A running cBioPortal instance
5. A running cBioPortal database
"""

# Set specific docker image
docker_image = luigi.Parameter(description='cBioPortal docker image', significant=False)
std_out_err_dir = os.path.join(config.cbioportal_load_logs_dir, 'validation')

# Success codes for validation
success_codes = [0, 3]

def program_args(self):
# Directory and file names for validation
input_dir = config.cbioportal_staging_dir
report_dir = config.cbioportal_load_logs_dir
db_info_dir = os.path.join(config.transformation_config_dir, 'cbioportal_db_info')
properties_file = os.path.join(config.transformation_config_dir, 'portal.properties')
report_name = 'report_pmc_test_%s.html' % time.strftime("%Y%m%d-%H%M%S")

# Build validation command. No connection has to be made to the database or web server.
docker_command = 'docker run --rm -v %s:/cbioportal/portal.properties -v %s:/study/ -v %s:/cbioportal_db_info/ -v %s:/html_reports/ %s' \
% (properties_file, input_dir, db_info_dir, report_dir, self.docker_image)

python_command = 'validateData.py -s /study/ ' \
'-P /cbioportal/portal.properties ' \
'-p /cbioportal_db_info -html /html_reports/%s -v' \
% report_name
return [docker_command, python_command]


class CbioportalDataLoading(ExternalProgramTask):
"""
Task to load data to cBioPortal
This requires:
1. Docker installed
2. cBioPortal image in Docker
3. pmc user added to group 'docker'
4. A running cBioPortal instance
5. A running cBioPortal database
"""
std_out_err_dir = os.path.join(config.cbioportal_load_logs_dir, 'loader')

# Variables
docker_image = luigi.Parameter(description='cBioPortal docker image', significant=False)
server_name = luigi.Parameter(description='Server on which pipeline is running. If running docker locally, leave '
'empty. PMC servers: pmc-cbioportal-test | '
'pmc-cbioportal-acc | pmc-cbioportal-prod', significant=False)

def program_args(self):
# Directory and file names for validation
input_dir = config.cbioportal_staging_dir
properties_file = os.path.join(config.transformation_config_dir, 'portal.properties')

python_command = 'cbioportalImporter.py -s /study/'

# Check if cBioPortal is running locally or on other server
if self.server_name == "":
# Build import command for running the pipeline locally
docker_command = 'docker run --rm -v %s:/cbioportal/portal.properties -v %s:/study/ --net cbio-net %s' \
% (properties_file, input_dir, self.docker_image)

# Restart cBioPortal web server docker container on the local machine
restart_command = "&& docker restart cbioportal"
else:
# Build the import command for running the pipeline on the PMC staging server
docker_command = 'docker run --network="host" --rm -v %s:/cbioportal/portal.properties -v %s:/study/ -v /etc/hosts:/etc/hosts %s' \
% (properties_file, input_dir, self.docker_image)

# Restart cBioPortal web server docker container which runs on a different machine
restart_command = "&& ssh %s 'docker restart cbioportal'" % self.server_name
return [docker_command, python_command, restart_command]


class GitVersionTask(BaseTask):
commit_hexsha = luigi.Parameter(description='commit to come back to')
succeeded_once = False
@@ -307,7 +197,6 @@ def complete(self):


class LoadDataFromNewFilesTask(luigi.WrapperTask):
disable_cbioportal_task = luigi.BoolParameter(description='Skip loading data into cBioPortal.', default=False)

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
@@ -349,25 +238,6 @@ def _build_task_dependency_tree(self):
commit_transmart_load_logs.required_tasks = [transmart_api_task]
yield commit_transmart_load_logs

if not self.disable_cbioportal_task:
cbioportal_data_transformation = CbioportalDataTransformation()
cbioportal_data_transformation.required_tasks = [sources_to_csr_task]
yield cbioportal_data_transformation
cbioportal_data_validation = CbioportalDataValidation()
cbioportal_data_validation.required_tasks = [cbioportal_data_transformation]
yield cbioportal_data_validation
commit_cbio_staging = GitCommit(directory_to_add=config.cbioportal_staging_dir,
commit_message='Add cbioportal data.')
commit_cbio_staging.required_tasks = [cbioportal_data_validation]
yield commit_cbio_staging
cbioportal_data_loading = CbioportalDataLoading()
cbioportal_data_loading.required_tasks = [commit_cbio_staging]
yield cbioportal_data_loading
commit_cbio_load_logs = GitCommit(directory_to_add=config.cbioportal_load_logs_dir,
commit_message='Add cbioportal loading log.')
commit_cbio_load_logs.required_tasks = [cbioportal_data_loading]
yield commit_cbio_load_logs

def requires(self):
return self.tasks_dependency_tree

13 changes: 0 additions & 13 deletions luigi.cfg-sample
@@ -38,10 +38,6 @@ PGUSER=biomart_user
PGPASSWORD=biomart_user # CHANGE ME


[LoadDataFromNewFilesTask]
# cBioPortal task settings
disable_cbioportal_task=True

[resources]
# Resources are used to limit the number of concurrent tasks
transmart_loader=1
@@ -64,12 +60,3 @@ transmart_url=http://localhost:8081 # CHANGE ME
gb_backend_url=http://localhost:8083 # CHANGE ME
client_id=transmart-client
offline_token=<pmc-pipeline-user-offline-token> # CHANGE ME


[CbioportalDataValidation]
docker_image=cbioportal-image:3.0.6


[CbioportalDataLoading]
docker_image=cbioportal-image:3.0.6
server_name=pmc-cbioportal-test
2 changes: 1 addition & 1 deletion scripts/e2e_transmart_only.sh
@@ -3,4 +3,4 @@ set -e
set -x

$(dirname "$0")/remove_done_files.sh
python3 -m luigi --module luigi-pipeline LoadDataFromNewFilesTask --disable-cbioportal-task 'True'
python3 -m luigi --module luigi-pipeline LoadDataFromNewFilesTask
12 changes: 0 additions & 12 deletions scripts/load_cbioportal_data.sh

This file was deleted.

1 change: 0 additions & 1 deletion scripts/load_data.sh
@@ -5,4 +5,3 @@ set -x
SCRIPT_DIR=$(dirname $0)
cd ${SCRIPT_DIR}
./load_transmart_data.sh
./load_cbioportal_data.sh
2 changes: 0 additions & 2 deletions test_data/test_data_NGS/README.md
@@ -7,8 +7,6 @@ Two versions of the data are provided, to enable quick rounds of testing:
- `full_dataset`, and
- `alternative`, from which patient PAT2 has been removed.


The datasets also contain test data for cBioPortal in the NGS folder.
The data include the following samples:

- BIOS1T_BIOM1T (tumor) and BIOS1N_BIOM1N (normal) for patient PAT1,

This file was deleted.

This file was deleted.

This file was deleted.

17 changes: 0 additions & 17 deletions test_data/test_data_NGS/config/portal.properties

This file was deleted.
