OpenCGA Catalog

Overview

A genomic data analysis platform needs to keep track of different resources such as file metadata, sample annotations or jobs. OpenCGA Catalog aims to collect and integrate all the information needed for executing genomic analyses. This information is organized in nine main entities: users, studies, files, samples, datasets, cohorts, individuals, disease panels and jobs.

The main tasks of Catalog are to provide:

  • Authentication and authorization to the different resources.
  • A collaborative environment.
  • File audit to keep track of files and metadata.
  • Analysis and job tracking.
  • Sample, individual and cohort annotation.
  • Security.

All this information can be stored and retrieved using our Java and RESTful web services API.
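
As a quick illustration only, a metadata query over the RESTful API could look like the snippet below. The host, path and parameters are placeholders, not the exact endpoints of a given OpenCGA release; check the REST documentation of your installation for the real ones.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CatalogRestExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and session id: adapt them to the REST API
        // exposed by your OpenCGA installation.
        URL url = new URL("http://localhost:8080/opencga/webservices/rest/files/search?sid=SESSION_ID&studyId=1");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");

        // Print the JSON response containing the requested metadata.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```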

Data Models

This section describes the most relevant entities. For more detailed information about the data models, such as the Java source code, examples or the JSON Schemas, you can visit the OpenCGA Catalog Data Models page.

The most relevant entities in OpenCGA Catalog are listed below; a simplified sketch of how they relate follows the list:

  • User: Contains the data related to the user account.
  • Project: Contains the information of a project, which can cover as many related studies as necessary.
  • Study: Main working environment. Contains files, samples, individuals, jobs...
  • File: Information regarding a submitted or generated file.
  • Sample: Information regarding a sample. Closely related to the file entity.
  • Individual: Contains the information regarding the individual from whom the sample has been taken.
  • Cohort: Groups sets of samples with some common feature(s).
  • 🚧 Dataset: Groups sets of files.
  • Disease panel: Defines a disease panel containing the variants, genes and/or regions of interest.
  • Job: An analysis job launched using any of the files or samples.
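
As a rough, illustrative sketch only (field names and types are simplified placeholders, not the real Catalog data model; see the Data Models page for the actual Java classes and JSON Schemas), the relationship between some of these entities can be pictured as:

```java
import java.util.List;

// Simplified, illustrative entity classes; the real data models live in the
// OpenCGA source code and JSON Schemas referenced above.
class Project {
    long id;
    String name;
    List<Study> studies;   // a project covers as many related studies as needed
}

class Study {
    long id;
    String name;
    String uri;            // optional study location to synchronize
    List<File> files;
    List<Sample> samples;
    List<Job> jobs;
}

class File {
    long id;
    String name;
    String uri;            // physical location (file://, hdfs://, ...)
    String status;         // READY, MISSING, TRASHED, ...
}

class Sample {
    long id;
    String name;
    long individualId;     // the individual the sample was taken from
}

class Job {
    long id;
    String toolName;
    List<Long> inputFileIds;
}
```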

Catalog provides an authenticated environment. More details about the authentication system can be found here.

Catalog can also be a collaborative environment. It allows users to share their own data with other users, giving them permissions to create, read, modify, share and/or delete. This behaviour is explained in detail here.

File management

To preserve the consistency of the metadata over the set of files related to a project or a study, Catalog must have access to the files. Files are completely managed by Catalog. An extensive description of the Catalog file management can be found here.

This tracking can be done in several ways (a short sketch follows this list):

  • Copy or move the data into the "Catalog main directory" using the OpenCGA Catalog command line. From that moment on, the data is fully managed by Catalog (e.g., moving, renaming, ...).

  • Synchronizing a study directory. When creating a study, a study location URI can be defined. During the creation process, Catalog scans the folder and records the information to start tracking the files. Any modification of this directory (and its subdirectories) without notifying Catalog will leave the metadata in an inconsistent state.

  • Synchronizing a single file. The most flexible (and expensive) way of synchronizing data is single-file tracking. Given the URI of a file or directory, it can be linked into Catalog. The only restrictions of this way of synchronizing are that the file has to be accessible to Catalog and that, if the file is renamed, moved or modified, Catalog has to be notified (relink).
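
The following sketch only mirrors the three modes described above with hypothetical helper methods; none of these names come from the actual Catalog API or command line.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Illustrative helpers for the three tracking modes; not the Catalog API.
class TrackingModesSketch {

    // 1. Copy the data into the "Catalog main directory": from that moment on,
    //    the physical file is fully managed by Catalog (moves, renames, ...).
    Path copyIntoCatalog(Path source, Path catalogMainDir) throws IOException {
        Path target = catalogMainDir.resolve(source.getFileName());
        return Files.copy(source, target);
    }

    // 2. Synchronize a study directory defined at study creation time:
    //    scan the folder once and register every file found.
    void synchronizeStudyFolder(Path studyFolder) throws IOException {
        try (Stream<Path> paths = Files.walk(studyFolder)) {
            paths.filter(Files::isRegularFile).forEach(this::registerFileEntry);
        }
    }

    // 3. Link a single external file: only a file entry pointing to the
    //    external URI is created; Catalog must be notified (relink) on changes.
    void linkExternalFile(URI externalUri) {
        registerFileEntry(Paths.get(externalUri));
    }

    private void registerFileEntry(Path path) {
        System.out.println("Registering file entry for " + path);
    }
}
```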

To specify the physical location of a file, Catalog uses URIs. Through the different data managers (IOManagers), Catalog can manage resources under different file protocols (specified by the URI scheme): file:// (PosixIOManager), 🚧 hdfs:// (HdfsIOManager), 🚧 i:// (IRodsIOManager), ...
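
A minimal sketch of how a data manager could be selected from the URI scheme; the IOManager class names come from the list above, while the interface and dispatch logic are simplified assumptions:

```java
import java.net.URI;

// Illustrative dispatch by URI scheme; the interface and implementations are
// simplified placeholders, not the real IOManager API.
interface IOManager {
    boolean exists(URI uri);
}

class IOManagerFactory {
    IOManager get(URI uri) {
        String scheme = uri.getScheme() == null ? "file" : uri.getScheme();
        switch (scheme) {
            case "file":
                return new PosixIOManager();
            case "hdfs":
                return new HdfsIOManager();    // still under construction
            default:
                throw new UnsupportedOperationException("Unsupported URI scheme: " + scheme);
        }
    }
}

class PosixIOManager implements IOManager {
    public boolean exists(URI uri) {
        return new java.io.File(uri).exists();
    }
}

class HdfsIOManager implements IOManager {
    public boolean exists(URI uri) {
        throw new UnsupportedOperationException("HDFS access is not implemented in this sketch");
    }
}
```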

Read files

All files can be used as input for analysis tools. In addition, depending on some configurable thresholds, some basic read operations can be performed (a small sketch follows this list):

  • grep
  • head
  • tail
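
For example, a "head"-like read bounded by a configurable threshold could look like this sketch (illustrative only; the real threshold configuration and read operations are part of Catalog itself):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative "head" operation limited by a configurable maximum number of lines.
class HeadSketch {
    static List<String> head(Path file, int maxLines) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.limit(maxLines).collect(Collectors.toList());
        }
    }
}
```
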
File life cycle

Catalog will contain a file entry for each tracked file of the study. Every file entry in Catalog has a “status” field that represents the status of this tracking.

Several changes have been made to the file life cycle between v0.5.0 and v0.6.0. Since v0.6.0 is still under development, more changes may happen. The v0.6.0 statuses are listed below and summarized in a short sketch after the list.

v0.6.0
  • STAGED: The file entry has been created in Catalog, but the file is not ready yet; it has no related physical file.

  • READY: The file is correctly tracked.

  • MISSING: The file is not accessible from Catalog.

  • TRASHED: The file is marked for deletion and will be deleted by the daemon.

  • DELETED: The file no longer exists. It has been completely removed from the file system, but the file entry remains.

  • UNTRACKED: The file is in the study folder but does not have a file entry in Catalog. This is not a real Catalog status, since files in this "status" have no file entry in Catalog (and therefore no status field).
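
The v0.6.0 statuses above can be summarized with a simple enum (a sketch only; the actual values are stored in the file entry's status field):

```java
// Sketch of the v0.6.0 file statuses described above.
enum FileStatus {
    STAGED,    // file entry created in Catalog, file not ready yet
    READY,     // file correctly tracked
    MISSING,   // file no longer accessible from Catalog
    TRASHED,   // marked for deletion, waiting for the daemon
    DELETED    // removed from the file system, the entry remains
    // UNTRACKED is not modelled: such files have no entry (and no status) in Catalog
}
```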

Historical statuses, no longer in use:

  • INDEXING: The file is being indexed by OpenCGA-Storage. Removed in v0.6.0.
  • UPLOADING: The file is being uploaded. See “uploading files”. Replaced by STAGED.
  • UPLOADED: The file has been uploaded but not completely reassembled. Replaced by STAGED.
  • DELETING: The file is marked for deletion. Replaced by TRASHED.

🚧 The uploading process is designed to upload files using the REST API ...

Link external files

To synchronize a single file into Catalog without moving or copying it to a specific folder, the file has to be “linked” externally. With this action, the file entry will contain a URI to the external file. If the file pointed to by the URI disappears (because it is moved, renamed, or deleted), the linked file status will change to MISSING.

Deleting an external file will only delete the file entry, and will never delete the external file.

If an externally synchronized file is renamed or moved in Catalog, the original file name and location will not change.
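
A sketch of what an externally linked file entry could look like; the external flag and method names are hypothetical, and only the described behaviour (MISSING on lost tracking, entry-only deletion) comes from this page:

```java
import java.net.URI;

// Illustrative linked ("external") file entry: it only stores a URI to data
// that lives outside the Catalog-managed folders.
class LinkedFileEntry {
    URI uri;                       // points to the external file
    boolean external = true;       // hypothetical flag marking an external link
    String status = "READY";

    // If the external file is moved, renamed or deleted without relinking,
    // the entry becomes MISSING.
    void updateStatus(boolean accessible) {
        if (!accessible) {
            status = "MISSING";
        }
    }

    // Deleting an external file in Catalog removes only this entry;
    // the physical file is never touched.
}
```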

Check files in study folder

The study file checking performs two actions.

First, for each READY, MISSING and TRASHED file in the study, it checks whether the file is accessible. Each READY file that is no longer accessible will be set to MISSING, and each MISSING file that becomes accessible again will be set to READY.

The second operation is a scan of the study folder to find UNTRACKED files. These files will be reported, but not added to Catalog.
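
A minimal sketch of these two checks, reusing the simplified LinkedFileEntry and IOManagerFactory classes from the sketches above (again, illustrative, not the real implementation):

```java
import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative study check: flip READY/MISSING according to accessibility,
// then report files on disk that have no file entry (UNTRACKED).
class StudyCheckSketch {

    void checkStudy(List<LinkedFileEntry> entries, Set<URI> filesOnDisk, IOManagerFactory ioManagers) {
        // First action: check accessibility of the tracked files.
        for (LinkedFileEntry entry : entries) {
            boolean accessible = ioManagers.get(entry.uri).exists(entry.uri);
            if ("READY".equals(entry.status) && !accessible) {
                entry.status = "MISSING";      // the file tracking has been lost
            } else if ("MISSING".equals(entry.status) && accessible) {
                entry.status = "READY";        // the file is reachable again
            }
        }

        // Second action: report (but do not register) UNTRACKED files.
        Set<URI> trackedUris = entries.stream().map(e -> e.uri).collect(Collectors.toSet());
        filesOnDisk.stream()
                .filter(uri -> !trackedUris.contains(uri))
                .forEach(uri -> System.out.println("UNTRACKED: " + uri));
    }
}
```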

Resync study

If new files appear in the study folder, the whole study can be synchronized instead of adding each file manually, one by one. This action registers all the UNTRACKED files inside the study folder in Catalog.

Moving files
Index in OpenCGA-Storage
Deleting

The file deletion is done in two steps. First, the file is sent to the trash: it is renamed to .deleted_<date>_<file-name> and its status is set to TRASHED. The trash is an intermediate state where files can be recovered (#131) or purged. When the file is trashed, a deletion date can be specified.

The daemon works as a garbage collector, checking for each trashed file whether the deletion date has expired. When it has, the daemon removes the file from the file system and changes the file status to DELETED.
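
The two steps could be sketched as follows; the .deleted_<date>_<file-name> naming comes from the paragraph above, while the helper names and the date format are illustrative assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;

// Illustrative two-step deletion: first rename into the trash (status TRASHED),
// later let the daemon purge expired files (status DELETED).
class DeletionSketch {

    // Step 1: send the file to the trash by renaming it.
    Path trash(Path file) throws IOException {
        String trashedName = ".deleted_" + LocalDate.now() + "_" + file.getFileName();
        return Files.move(file, file.resolveSibling(trashedName));   // entry becomes TRASHED
    }

    // Step 2: the daemon works as a garbage collector over trashed files.
    void purgeIfExpired(Path trashedFile, LocalDate deletionDate) throws IOException {
        if (LocalDate.now().isAfter(deletionDate)) {
            Files.delete(trashedFile);                               // entry becomes DELETED
        }
    }
}
```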

🚧 A TRASHED file that loses its file tracking will change its status to DELETED. Deleting an external file will never delete the original file.

🚧 Optionally, a file can be locked to avoid deletions. To delete such files, the lock first has to be released (-DforceDelete=true).

Job tracker

Job life cycle

Implementation

Java and RESTful APIs (microservice in the future). MongoDB implementation (pluggable PostgreSQL) and collection schema.
