Merge pull request #87 from broadinstitute/contrib_workflow

reflect new contribution workflow
broadinstitute · Jun 13, 2024 · 8fae7e6
2 parents 39d248c + 0a45c59

Showing 2 changed files with 36 additions and 19 deletions.
diff --git a/documentation/contributing_to_cpg.md b/documentation/contributing_to_cpg.md
@@ -8,29 +8,35 @@ If you would like to contribute to our documentation, please make a [Pull Reques
 We particularly welcome contributions to our list of [publications using data from the Cell Painting Gallery](publications.md) and to [workflows accessing the Cell Painting Gallery](workflows.md).
 
 To ask a question that is not covered by our documentation, you are also welcome to create an [Issue](https://github.com/broadinstitute/cellpainting-gallery/issues) in the Cell Painting Gallery repository.
-Please note that dataset-specific questions should be directed to the respective dataset repository linked in the [README](https://github.com/broadinstitute/cellpainting-gallery/README.md).
+Please note that the Cell Painting Gallery is not a place to ask dataset-specific questions.
+Instead, please direct such questions to the respective dataset repository linked in the [README](https://github.com/broadinstitute/cellpainting-gallery/README.md) or, if there is no dataset repository, to the authors of any associated publication.
 
 ## Contributing Data to the Gallery
 
-Contributions can be in the form of complete datasets or additions to extant datasets (such as segmentations or deep-learning generated profiles).
+Contributions can be in the form of complete datasets or additions to extant datasets (e.g. segmentations or deep-learning generated profiles).
 Please contact @erinweisbart or @shntnu to initiate discussion of a data contribution.
-After receiving approval, you will receive a detailed contribution workflow customized for your data contribution.
 
-To be approved, your dataset must meet the following requirements:
+For new datasets, please include the following details in your contact:
 
-### Assay structure
+1) assay used (standard Cell Painting, or describe the variation). If you would like to contribute data from a derivative assay, it must be usable for morphological profiling in that it stains/labels multiple cellular compartments/organelles.
+2) approximate data size
+3) components you wish to contribute, all of which are described in [data structure](https://broadinstitute.github.io/cellpainting-gallery/data_structure.html) (major components: `images`, `analysis`, `backend`, `load_data_csv`, `profiles`; optional components: `pipelines`, `qc`, etc.). Note that `metadata` is required.
+4) institutional identifier to use for data (e.g. `broad`, `anonymous`)
+5) suggested top-level project tag. Typically this is a one-word summary of the project (e.g. cpg0011-lipocyteprofiler, cpg0016-jump, cpg0022-cmqtl) and sometimes also includes the last name of the first author (e.g. cpg0010-caie-drugresponse, cpg0028-kelley-resistance, cpg0031-caicedo-cmvip).
 
-All datasets in the Cell Painting Gallery must be from a published Cell Painting Assay version or a close derivative.
-If you would like to contribute data from a derivative assay it must be useable for morphological profiling in that it stains/labels multiple cellular compartments/organelles.
+For existing datasets, please include the following details in your contact:
 
-### Accompanying information
+1) top-level project tag that your data corresponds to (e.g. `cpg0016-jump`)
+2) approximate data size
+3) components you wish to contribute
 
-Any data contributions to Cell Painting Gallery must be accompanied by a pull request to the [Cell Painting Gallery repository](https://github.com/broadinstitute/cellpainting-gallery/) with updates to the README to add your dataset to [Available datasets](https://github.com/broadinstitute/cellpainting-gallery/README.md).
-If your dataset is associated with a publication, please also edit [Publications](https://github.com/broadinstitute/cellpainting-gallery/docs/publications.md).
+After approval, we will assign you a project identifier and create a new [GitHub Discussion](https://github.com/broadinstitute/cellpainting-gallery/discussions) to provide next steps and track data deposition.
+
+## Preparing for data deposition
 
-### Data structure
+In preparation for transferring data, please perform all of the following steps:
 
-#### Remove special characters in folder names
+### Remove special characters in folder names
 
 To the maximum extent possible, please avoid the following in your folder names
 
@@ -39,15 +45,18 @@ To the maximum extent possible, please avoid the following in your folder names
 
 Please delete these characters if they are present in your folder names.
 
-#### Prepare project-specific naming
+### Prepare project-specific naming
 
 Reference [data structure](data_structure.md) for comprehensive information on folder structure and naming.
 Your data must strictly comply with the data structure we have laid out.
 Additionally, it must include all metadata, unblinded.
 
-If your new dataset is approved for inclusion, we will assign you a project identifier.
-Your source name can be your institution or it can be anonymized.
+### Validate your data
 
-#### Validate your data
+We are building a [data validator](http://github.com/broadinstitute/cpg/cpgdata) to check compliance with our required structure.
+It is currently in alpha and for internal use only, but we plan to develop it so that contributors can eventually use it to validate their data before deposition.
 
-Use our [data validator](http://github.com/broadinstitute.org/cpg) to validate that your data complies with our required structure.
+### Create a pull request
+
+Any data contributions to Cell Painting Gallery must be accompanied by a pull request to the [Cell Painting Gallery repository](https://github.com/broadinstitute/cellpainting-gallery/) with updates to the README to add your dataset to [Available datasets](https://github.com/broadinstitute/cellpainting-gallery/README.md).
+If your dataset is associated with a publication, please also edit [Publications](https://github.com/broadinstitute/cellpainting-gallery/docs/publications.md).
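
The top-level project tags quoted in the contribution guide above (e.g. `cpg0016-jump`, `cpg0010-caie-drugresponse`) share a visible shape: `cpg`, four digits, then one or two lowercase words joined by hyphens. The following is a minimal sketch of a check for that shape when suggesting a tag. The regular expression and the function name `looks_like_project_tag` are assumptions inferred only from those examples and are not an official rule; the project identifier itself is assigned by the maintainers after approval.

```python
import re

# Assumed pattern, inferred only from the example tags in the guide:
# "cpg" + four digits + "-" + one or two lowercase alphanumeric words.
PROJECT_TAG = re.compile(r"cpg\d{4}-[a-z0-9]+(?:-[a-z0-9]+)?")


def looks_like_project_tag(tag: str) -> bool:
    """Return True if `tag` has the same shape as the example project tags."""
    return PROJECT_TAG.fullmatch(tag) is not None
```

For instance, `looks_like_project_tag("cpg0016-jump")` is true, while a tag without the `cpg` prefix and four-digit number is rejected.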
diff --git a/documentation/data_structure.md b/documentation/data_structure.md
@@ -24,7 +24,8 @@ It can be anonymized (e.g. `s3://cellpainting-gallery/cpg0016-jump/` contains `s
 
 Not all projects will have all parent structures.
 
-The "completeness" of a project can be checked using thouris [data validator](https://github.com/broadinstitute/cpg).
+The "completeness" of a project can be checked using our [data validator](https://github.com/broadinstitute/cpg).
+Please note that it is in alpha and further functionality and documentation are under development.
 
 ## `images` folder structure
 
@@ -47,7 +48,7 @@ cellpainting-gallery/
 ```
 
 Within the outer `images` folder, there are `YYYY_MM_DD_<batch-name>` subfolders for each batch.
-Each batch folder should start with `YYYY_MM_DD` of the date that image acquisition started (or your best guess thereof).
+Each batch folder typically starts with `YYYY_MM_DD` of the date that image acquisition started.
 The rest of the batch folder name can be a simple ordinal (e.g. `YYYY_MM_DD_Batch1`) or more descriptive of its contents (e.g. `2020_01_02_TestPhalloidinConcentration`).
 A single batch typically contains all of the plates that were imaged (or started acquisition) on that day.
 However, for simplifying project tracking and analysis, sometimes plates imaged on the same day are divided into multiple batches where each batch is a different experimental condition (e.g. `2020_01_02_LowPhalloidin` and `2020_01_02_HighPhalloidin`).
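
The batch folder convention described above (`YYYY_MM_DD_<batch-name>`) can be split into its date and name parts programmatically. The sketch below is illustrative only: the names `BatchFolder` and `parse_batch_folder` are hypothetical and are not part of any Cell Painting Gallery tooling. Because batch folders only typically follow the convention, the function returns `None` rather than raising when a name does not fit.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class BatchFolder(NamedTuple):
    acquisition_start: date  # date encoded in the YYYY_MM_DD prefix
    batch_name: str          # remainder of the folder name


# A YYYY_MM_DD prefix, an underscore, then a non-empty batch name.
_BATCH_RE = re.compile(r"(\d{4}_\d{2}_\d{2})_(.+)")


def parse_batch_folder(name: str) -> Optional[BatchFolder]:
    """Split a batch folder name; return None if it does not follow the convention."""
    match = _BATCH_RE.fullmatch(name)
    if match is None:
        return None
    try:
        start = datetime.strptime(match.group(1), "%Y_%m_%d").date()
    except ValueError:
        # The prefix looked like a date but is not a valid calendar date.
        return None
    return BatchFolder(start, match.group(2))
```

With the example from the text, `parse_batch_folder("2020_01_02_LowPhalloidin")` yields the date 2020-01-02 and the batch name `LowPhalloidin`.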
@@ -294,6 +295,11 @@ In this example batch:
 Within the `load_data_csv` folder is a folder for each batch and within each batch folder is a folder for each plate.
 Within the plate folder there are typically two files: a `load_data.csv` for pipelines that do not use an illumination correction function and a `load_data_with_illum.csv` for pipelines that do use an illumination correction function. However, atypical workflows can have other arrangements, such as a separate CSV for each pipeline in the workflow.
 
+The `load_data.csv` maps the actual file names and paths and their metadata (e.g. channel number, channel name) to the naming information passed to CellProfiler for running the images in a CellProfiler pipeline.
+More information on `load_data.csv` files and their contents is available in [CellProfiler documentation](https://cellprofiler-manual.s3.amazonaws.com/CellProfiler-4.2.6/modules/fileprocessing.html#loaddata).
+
+Note that we do not currently enforce `load_data.csv` path requirements, so `load_data.csv` files may contain either S3 paths (`s3://cellpainting-gallery/`) or mounted paths (`/home/ubuntu/bucket/`). Those paths may match their current Cell Painting Gallery locations or their original locations before transfer to the gallery.
+
 ```
 └── load_data_csv
      └── 2021_04_26_Batch1
@@ -328,8 +334,10 @@ All datasets have at least `barcode_platemap.csv` and `PLATEMAP.txt` files.
 Within `barcode_platemap.csv`, there are two columns: `Assay_Plate_Barcode` and `Plate_Map_Name`.  
 
 - `Assay_Plate_Barcode` matches the plate name used for analysis.
+This may be a full string match to the plate names as acquired off the imager and stored in the `images` folder (e.g. `BR00117035__2021-05-02T16_02_51-Measurement1`) or it may be a truncation of the full string as long as it is still a unique identifier (e.g. `BR00117035`).
 - `Plate_Map_Name` is the name of a platemap in the `platemaps/BATCH/platemap` folder.
 There may be one-to-one or many-to-one correspondence between `Assay_Plate_Barcode` and `Plate_Map_Name`.
+Platemap naming can vary greatly from dataset to dataset depending upon the source and their data tracking/naming conventions.
 
 Within `PLATEMAP.txt` there are at least `plate_map_name` and `well_position` columns, and there may be any number of additional metadata columns.
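
The relationship between `Assay_Plate_Barcode` values and plate folder names described above can be sketched as follows. This is a minimal illustration, not project tooling: the function names are hypothetical, and it assumes that a truncated barcode is a prefix of the full plate name, as in the `BR00117035` example. A barcode that matches no folder, or more than one, is treated as unresolved.

```python
import csv
import io
from typing import Dict, Iterable, Optional


def read_barcode_platemap(text: str) -> Dict[str, str]:
    """Map each Assay_Plate_Barcode to its Plate_Map_Name from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Assay_Plate_Barcode"]: row["Plate_Map_Name"] for row in reader}


def match_plate_folder(barcode: str, plate_folders: Iterable[str]) -> Optional[str]:
    """Find the plate folder a barcode refers to.

    A full string match wins. Otherwise the barcode is assumed to be a prefix
    (truncation) of exactly one folder name; anything else returns None.
    """
    folders = list(plate_folders)
    if barcode in folders:
        return barcode
    candidates = [folder for folder in folders if folder.startswith(barcode)]
    return candidates[0] if len(candidates) == 1 else None
```

Because several barcodes may share one `Plate_Map_Name` (many-to-one), the dictionary returned by `read_barcode_platemap` is keyed by barcode rather than by platemap name.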