Merge pull request #87 from broadinstitute/contrib_workflow
reflect new contribution workflow
ErinWeisbart authored Jun 13, 2024
2 parents 39d248c + 0a45c59 commit 8fae7e6
Showing 2 changed files with 36 additions and 19 deletions.
43 changes: 26 additions & 17 deletions documentation/contributing_to_cpg.md
@@ -8,29 +8,35 @@ If you would like to contribute to our documentation, please make a [Pull Reques
We particularly welcome contributions to our list of [publications using data from the Cell Painting Gallery](publications.md) and to [workflows accessing the Cell Painting Gallery](workflows.md).

To ask a question that is not covered by our documentation, you are also welcome to create an [Issue](https://github.com/broadinstitute/cellpainting-gallery/issues) in the Cell Painting Gallery repository.
Please note that dataset-specific questions should be directed to the respective dataset repository linked in the [README](https://github.com/broadinstitute/cellpainting-gallery/README.md).
Please note that the Cell Painting Gallery is not a place to ask dataset-specific questions.
Instead, please direct such questions to the respective dataset repository linked in the [README](https://github.com/broadinstitute/cellpainting-gallery/README.md) or, if there is no dataset repository, to the authors of any associated publication.

## Contributing Data to the Gallery

Contributions can be in the form of complete datasets or additions to extant datasets (such as segmentations or deep-learning generated profiles).
Contributions can be in the form of complete datasets or additions to extant datasets (e.g. segmentations or deep-learning generated profiles).
Please contact @erinweisbart or @shntnu to initiate discussion of a data contribution.
After receiving approval, you will receive a detailed contribution workflow customized for your data contribution.

To be approved, your dataset must meet the following requirements:
For new datasets, please include the following details in your contact:

### Assay structure
1) assay used (standard Cell Painting or describe the variation. If you would like to contribute data from a derivative assay it must be useable for morphological profiling in that it stains/labels multiple cellular compartments/organelles.)
2) approximate data size
3) components you wish to contribute (all are described in [data structure](https://broadinstitute.github.io/cellpainting-gallery/data_structure.html)) (major components: `images`, `analysis`, `backend`, `load_data_csv`, `profiles`. optional components: `pipelines`, `qc`, etc.). Note that `metadata` is required.
4) institutional identifier to use for data (e.g. `broad`, `anonymous`)
5) suggested top level project tag. Typically this is a 1 word summary of the project (e.g. cpg0011-lipocyteprofiler, cpg0016-jump, cpg0022-cmqtl) and sometimes also includes the last name of the first author (e.g. cpg0010-caie-drugresponse, cpg0028-kelley-resistance, cpg0031-caicedo-cmvip)
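
As a rough aid when drafting a suggested tag, the example tags above can be turned into a pattern check. This is a minimal sketch: the pattern (`cpg`, four digits, then one or more hyphen-separated lowercase words) is inferred from the examples only and is an assumption, not an official rule, and `looks_like_project_tag` is a hypothetical helper name.

```python
import re

# Assumed pattern, inferred from the example tags in the list above:
# "cpg" + four digits + one or more "-word" parts in lowercase letters/digits.
PROJECT_TAG = re.compile(r"cpg\d{4}(-[a-z0-9]+)+")


def looks_like_project_tag(tag: str) -> bool:
    """Return True if the tag has the same shape as the example tags."""
    return PROJECT_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ["cpg0016-jump", "cpg0010-caie-drugresponse", "jump"]:
        print(tag, looks_like_project_tag(tag))
```

Because the numeric project identifier is assigned after approval, a check like this is only useful for confirming the overall shape of a tag, not for choosing the number.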

All datasets in the Cell Painting Gallery must be from a published Cell Painting Assay version or a close derivative.
If you would like to contribute data from a derivative assay it must be useable for morphological profiling in that it stains/labels multiple cellular compartments/organelles.
For existing datasets, please include the following details in your contact:

### Accompanying information
1) top-level project tag that your data corresponds to (e.g. `cpg0016-jump`)
2) approximate data size
3) components you wish to contribute

Any data contributions to Cell Painting Gallery must be accompanied by a pull request to the [Cell Painting Gallery repository](https://github.com/broadinstitute/cellpainting-gallery/) with updates to the README to add your dataset to [Available datasets](https://github.com/broadinstitute/cellpainting-gallery/README.md).
If your dataset is associated with a publication, please also edit [Publications](https://github.com/broadinstitute/cellpainting-gallery/docs/publications.md).
After approval, we will assign you a project identifier and create a new [Github Discussion](https://github.com/broadinstitute/cellpainting-gallery/discussions) to provide next steps and track data deposition.

## Preparing for data deposition

### Data structure
In preparation for transferring data, please perform all of the following steps:

#### Remove special characters in folder names
### Remove special characters in folder names

To the maximum extent possible, please avoid the following in your folder names

@@ -39,15 +45,18 @@ To the maximum extent possible, please avoid the following in your folder names

Please delete these characters if they are present in your folder names.
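
One way to plan this cleanup is to compute the renames before touching any folders. The sketch below is illustrative only: `plan_folder_renames` is a hypothetical helper, and the characters to delete are passed in by the caller because this sketch does not assume which characters are on the list above. The space used in the example is a placeholder, not a statement of the gallery's rules.

```python
def plan_folder_renames(folder_names, forbidden_chars):
    """Map each folder name that contains a forbidden character to its cleaned name.

    forbidden_chars is supplied by the caller; it should be filled in from
    the list of characters to avoid.
    """
    # str.maketrans with a third argument builds a table that deletes those characters.
    table = str.maketrans("", "", "".join(forbidden_chars))
    renames = {}
    for name in folder_names:
        cleaned = name.translate(table)
        if cleaned != name:
            renames[name] = cleaned
    return renames


if __name__ == "__main__":
    # Placeholder character set, for illustration only.
    print(plan_folder_renames(["Batch 1", "Batch2"], [" "]))
```

Reviewing the returned mapping before renaming makes it easy to spot two folders that would collapse to the same cleaned name.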

#### Prepare project-specific naming
### Prepare project-specific naming

Reference [data structure](data_structure.md) for comprehensive information on folder structure and naming.
Your data must strictly comply with the data structure we have laid out.
Additionally, it must include all metadata, unblinded.

If your new dataset is approved for inclusion, we will assign you a project identifier.
Your source name can be your institution or it can be anonymized.
### Validate your data

#### Validate your data
We are building a [data validator](http://github.com/broadinstitute/cpg/cpgdata) to check compliance with our required structure.
It is currently in alpha and for internal use, but we plan to develop it so that contributors can use it to validate their data before deposition.

Use our [data validator](http://github.com/broadinstitute.org/cpg) to validate that your data complies with our required structure.
### Create a pull-request

Any data contributions to Cell Painting Gallery must be accompanied by a pull request to the [Cell Painting Gallery repository](https://github.com/broadinstitute/cellpainting-gallery/) with updates to the README to add your dataset to [Available datasets](https://github.com/broadinstitute/cellpainting-gallery/README.md).
If your dataset is associated with a publication, please also edit [Publications](https://github.com/broadinstitute/cellpainting-gallery/docs/publications.md).
12 changes: 10 additions & 2 deletions documentation/data_structure.md
@@ -24,7 +24,8 @@ It can be anonymized (e.g. `s3://cellpainting-gallery/cpg0016-jump/` contains `s

Not all projects will have all parent structures.

The "completeness" of a project can be checked using this [data validator](https://github.com/broadinstitute/cpg).
The "completeness" of a project can be checked using our [data validator](https://github.com/broadinstitute/cpg).
Please note that it is in alpha and further functionality and documentation are under development.

## `images` folder structure

@@ -47,7 +48,7 @@ cellpainting-gallery/
```

Within the outer `images` folder, there are `YYYY_MM_DD_<batch-name>` subfolders for each batch.
Each batch folder should start with `YYYY_MM_DD` of the date that image acquisition started (or your best guess thereof).
Each batch folder typically starts with `YYYY_MM_DD` of the date that image acquisition started.
The rest of the batch folder name can be a simple ordinal (e.g. `YYYY_MM_DD_Batch1`) or more descriptive of its contents (e.g. `2020_01_02_TestPhalloidinConcentration`).
A single batch typically contains all of the plates that were imaged (or started acquisition) on that day.
However, for simplifying project tracking and analysis, sometimes plates imaged on the same day are divided into multiple batches where each batch is a different experimental condition (e.g. `2020_01_02_LowPhalloidin` and `2020_01_02_HighPhalloidin`).
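
The naming convention described above can be checked with a short parser. This is a minimal sketch, assuming the `YYYY_MM_DD_<batch-name>` form; `parse_batch_folder` is a hypothetical helper, not part of any Cell Painting Gallery tooling.

```python
import re
from datetime import datetime

# Date prefix (YYYY_MM_DD), an underscore, then a non-empty batch name.
BATCH_NAME = re.compile(r"(\d{4}_\d{2}_\d{2})_(.+)")


def parse_batch_folder(name):
    """Return (acquisition_date, batch_name), or None if the name does not fit."""
    match = BATCH_NAME.fullmatch(name)
    if match is None:
        return None
    try:
        # strptime rejects impossible dates such as month 13.
        date = datetime.strptime(match.group(1), "%Y_%m_%d").date()
    except ValueError:
        return None
    return date, match.group(2)


if __name__ == "__main__":
    for folder in ["2020_01_02_LowPhalloidin", "2020_13_02_Batch1", "Batch1"]:
        print(folder, parse_batch_folder(folder))
```

A `None` result only flags a folder for a closer look, since the date prefix is described as typical rather than guaranteed.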
@@ -294,6 +295,11 @@ In this example batch:
Within the `load_data_csv` folder is a folder for each batch and within each batch folder is a folder for each plate.
Within the plate folder there are typically two files: a `load_data.csv` for pipelines that do not use an illumination correction function and a `load_data_with_illum.csv` for pipelines that do use an illumination correction function. However, atypical workflows can have other arrangements, such as a separate CSV for each pipeline in the workflow.

The `load_data.csv` maps the actual file names and paths and their metadata (e.g. channel number, channel name) to the naming information passed to CellProfiler for running the images in a CellProfiler pipeline.
More information on `load_data.csv`'s and their contents is available in [CellProfiler documentation](https://cellprofiler-manual.s3.amazonaws.com/CellProfiler-4.2.6/modules/fileprocessing.html#loaddata).

Note that we do not currently enforce `load_data.csv` path requirements. Paths in a `load_data.csv` may be either S3 paths (`s3://cellpainting-gallery/`) or mounted paths (`/home/ubuntu/bucket/`), and they may match the files' current Cell Painting Gallery locations or their original locations before being transferred to the gallery.
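
Because path styles can vary, it can help to summarize which style a given `load_data.csv` uses before running a pipeline against it. The sketch below assumes the CellProfiler LoadData convention of `PathName_<channel>` columns; the helper names and the example rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def classify_path(path):
    """Label a path as an S3 path, a mounted (absolute local) path, or other."""
    if path.startswith("s3://"):
        return "s3"
    if path.startswith("/"):
        return "mounted"
    return "other"


def summarize_path_styles(csv_text):
    """Count path styles across all PathName_* columns of a load_data CSV."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column and column.startswith("PathName_"):
                counts[classify_path(value or "")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical two-row example; real files have many more columns.
    example = (
        "FileName_OrigDNA,PathName_OrigDNA\n"
        "a.tiff,s3://cellpainting-gallery/example\n"
        "b.tiff,/home/ubuntu/bucket/example\n"
    )
    print(summarize_path_styles(example))
```

A file that mixes styles, or that uses only mounted paths, may need its paths rewritten to match wherever the images are actually read from.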

```
└── load_data_csv
   └── 2021_04_26_Batch1
@@ -328,8 +334,10 @@ All datasets have at least `barcode_platemap.csv` and `PLATEMAP.txt` files.
Within `barcode_platemap.csv`, there are two columns: `Assay_Plate_Barcode` and `Plate_Map_Name`.

- `Assay_Plate_Barcode` matches the plate name used for analysis.
This may be a full string match to the plate names as acquired off the imager and stored in the `images` folder (e.g. `BR00117035__2021-05-02T16_02_51-Measurement1`) or it may be a truncation of the full string as long as it is still a unique identifier (e.g. `BR00117035`).
- `Plate_Map_Name` is the name of a platemap in the `platemaps/BATCH/platemap` folder.
There may be one-to-one or many-to-one correspondence between `Assay_Plate_Barcode` and `Plate_Map_Name`.
Platemap naming can vary greatly from dataset to dataset depending upon the source and their data tracking/naming conventions.
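
The relationship between `Assay_Plate_Barcode` and the plate names in `images` can be spot-checked with a few lines of code. This is a sketch under one stated assumption: a truncated barcode is treated as a prefix of the full plate name, which matches the example above but is not stated as a rule. `unmatched_barcodes` is a hypothetical helper.

```python
import csv
import io


def unmatched_barcodes(barcode_platemap_text, plate_folder_names):
    """Return barcodes that do not identify exactly one plate folder.

    Assumes a truncated barcode is a prefix of the full plate folder name.
    """
    reader = csv.DictReader(io.StringIO(barcode_platemap_text))
    required = {"Assay_Plate_Barcode", "Plate_Map_Name"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    unmatched = []
    for row in reader:
        barcode = row["Assay_Plate_Barcode"]
        hits = [p for p in plate_folder_names if p.startswith(barcode)]
        # Zero hits means no such plate; more than one means the barcode is not unique.
        if len(hits) != 1:
            unmatched.append(barcode)
    return unmatched


if __name__ == "__main__":
    # Hypothetical contents; the platemap name is made up.
    text = (
        "Assay_Plate_Barcode,Plate_Map_Name\n"
        "BR00117035,example_platemap\n"
        "BR00117036,example_platemap\n"
    )
    plates = ["BR00117035__2021-05-02T16_02_51-Measurement1"]
    print(unmatched_barcodes(text, plates))
```

Both rows above point to the same platemap, which illustrates the many-to-one correspondence that is allowed.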

Within `PLATEMAP.txt` there are at least `plate_map_name` and `well_position` columns, and there may be any number of additional metadata columns.
