UK - Biobank New BIDS dataset #29

Open
mpompolas opened this issue Jan 22, 2021 · 7 comments

Comments

@mpompolas
Member

mpompolas commented Jan 22, 2021

ULTIMATE GOAL - Create a new REPO of UK-BioBank

For the purpose of this new BIDS dataset, we want to keep the final preprocessed files, and the derivatives that correspond to them (a gradient-corrected scan has a different segmentation than the original).

The new BIDS folder should appear as an identical copy of UK-Biobank (same number of files AND same LABELS) but within a different folder name: e.g. UK_BioBank_processed, and also have the derivatives that were manually checked.

BEFORE MANUAL CHECK

Sandrine's pipeline seems ready to go.
At this stage, I suggest we keep all the intermediate files so that potential problems are easy to identify. If space becomes an issue on Joplin, we can reevaluate and maybe do it in batches.

AFTER MANUAL CHECK

We should have files within the /UK_BioBank_processed/derivatives folder. Labels should be without the RPI, gradcorr, etc. suffixes, so in your code, when you add the suffix _manual, make sure you strip those off.
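A minimal sketch of the suffix stripping described above, assuming the processing suffixes are exactly the underscore-separated tokens RPI, r and gradcorr seen in the example filenames of this thread. The function name and the token list are hypothetical and are not taken from manual_correction.py.

```python
# Assumption: these are the only processing tokens that can appear in a name.
PROCESSING_TOKENS = ("RPI", "r", "gradcorr")
KNOWN_EXTENSIONS = (".nii.gz", ".nii", ".json")


def clean_label_name(filename: str) -> str:
    """Return the filename with the processing tokens removed."""
    # Split off the extension first so ".nii.gz" is never broken apart.
    for ext in KNOWN_EXTENSIONS:
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        stem, ext = filename, ""
    # Keep every underscore-separated token that is not a processing token.
    kept = [part for part in stem.split("_") if part not in PROCESSING_TOKENS]
    return "_".join(kept) + ext
```

With these assumptions, sub-1000032_T1w_RPI_r_gradcorr_seg-manual.nii.gz maps to sub-1000032_T1w_seg-manual.nii.gz. A bare token such as r could in principle be legitimate in another naming scheme, so the token list should be checked against the real filenames first.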

Regarding the anatomy files (not the derivatives), we want to keep the last file of the pre-processing only, with the same name as the original:
e.g. instead of sub-1000252_T2w_RPI_r_gradcorr.nii.gz, it should be sub-1000252_T2w.nii.gz.

This will make things very easy for later processing through the Ivadomed pipeline.
So to sum it up:

  1. Rename the reoriented/resampled file to what the original was,
  2. Delete the rest of the processing files (*RPI, *RPI_r_gradcorr, etc.).
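The two steps above could be sketched as follows. This is only an illustration under stated assumptions: the final preprocessed file is assumed to end in _RPI_r_gradcorr, the intermediates in _RPI and _RPI_r, and the function name and dry_run flag are hypothetical. It is not the code from the branch mentioned in this thread.

```python
from pathlib import Path
from typing import List

FINAL_SUFFIX = "_RPI_r_gradcorr"            # assumption: last preprocessing output
INTERMEDIATE_SUFFIXES = ("_RPI", "_RPI_r")  # assumption: files to delete


def finalize_anat(anat_dir: Path, base: str, ext: str = ".nii.gz",
                  dry_run: bool = True) -> List[str]:
    """Rename the final preprocessed image to the original name, drop the rest.

    Returns the list of planned (or performed) actions as strings.
    """
    actions = []
    final = anat_dir / f"{base}{FINAL_SUFFIX}{ext}"
    target = anat_dir / f"{base}{ext}"
    if not final.exists():
        raise FileNotFoundError(final)
    # Step 1: the final file takes the original name (overwriting any copy).
    actions.append(f"rename {final.name} -> {target.name}")
    if not dry_run:
        final.replace(target)
    # Step 2: delete the intermediate processing files.
    for suffix in INTERMEDIATE_SUFFIXES:
        path = anat_dir / f"{base}{suffix}{ext}"
        if path.exists():
            actions.append(f"delete {path.name}")
            if not dry_run:
                path.unlink()
    return actions
```

Running it first with the default dry_run=True prints nothing and changes nothing, it only returns the planned actions, which is a cheap way to review them before touching the data. In a git-annex repository the rename and delete would more likely go through git commands than through plain filesystem calls.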

NOTES

A few more files are needed for a complete BIDS folder: dataset_description.json and participants.json (you only have participants.tsv), and maybe a README.TXT as well(?). Just copy these from the original UK-BioBank dataset.
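As a rough sketch of that copy step, assuming both datasets are available as local folders: the helper below copies any of the listed top-level files that exist in the original dataset but are missing from the processed one. The function name and the default file list are hypothetical (the list mirrors the files named above).

```python
import shutil
from pathlib import Path
from typing import List, Sequence

# Assumption: top-level files to mirror, as named in this thread.
TOP_LEVEL_FILES = ("dataset_description.json", "participants.json",
                   "participants.tsv", "README.TXT")


def copy_missing_metadata(original: Path, processed: Path,
                          names: Sequence[str] = TOP_LEVEL_FILES) -> List[str]:
    """Copy top-level files that the processed dataset does not have yet."""
    copied = []
    for name in names:
        src, dst = original / name, processed / name
        # Never overwrite an existing file; skip files the original lacks.
        if dst.exists() or not src.exists():
            continue
        shutil.copy2(src, dst)
        copied.append(name)
    return copied
```

Existing files are left untouched on purpose, so a participants.tsv that was already regenerated for the processed dataset is not replaced by the original one.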

The preprocessing steps should be documented somewhere. The easiest place would be in dataset_description.json.
Document the git version of SpinalCordToolbox and the function calls that were used, with their parameters.
Another place could be the .json that is associated with each .nii.gz, but that is a bit more work.
There is also the gradcorr file that needs to be documented somehow. I don't have any input on that. As a start, maybe document which facility it came from(?)
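One possible way to record this in dataset_description.json, shown as a sketch only. It assumes the GeneratedBy field that the BIDS specification defines for derived datasets is an acceptable place for the tool name and version. The PreprocessingCommands key is a hypothetical, non-standard key invented for this example, and the version string and command lines are supplied by the caller, not looked up.

```python
import json
from pathlib import Path
from typing import List


def record_preprocessing(description_path: Path, sct_version: str,
                         commands: List[str]) -> dict:
    """Append tool provenance to dataset_description.json and return it."""
    description = json.loads(description_path.read_text())
    # GeneratedBy is a list of tool entries in the BIDS specification.
    description.setdefault("GeneratedBy", []).append(
        {"Name": "SpinalCordToolbox", "Version": sct_version}
    )
    # Hypothetical, non-standard key: the exact calls with their parameters.
    description["PreprocessingCommands"] = list(commands)
    description_path.write_text(json.dumps(description, indent=4) + "\n")
    return description
```

The gradcorr file could be noted the same way (for example its source facility as free text), but no field for that is assumed here.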

@jcohenadad
Member

Thank you for initiating this @mpompolas. A few clarifications:

  • the repo should be under git-annex (not on duke)
  • the repo name should be the same as the original (unprocessed) repo, with the added suffix: -processed

@sandrinebedard
Member

We should have files within the /UK_BioBank_processed/derivatives folder. Labels should be without the RPI, gradcorr, etc. suffixes, so in your code, when you add the suffix _manual, make sure you strip those off.

@mpompolas So I can add to this branch a modified version of my manual correction script, manual_correction.py, so that the output name of a manual correction would directly be, for example, sub-1000032_T1w_seg-manual.nii.gz instead of sub-1000032_T1w_RPI_r_gradcorr_seg-manual.nii.gz. Is that right?

@mpompolas
Member Author

the repo should be under git-annex (not on duke)
the repo name should be the same as the original (unprocessed) repo, with the added suffix: -processed

Thanks @jcohenadad, I just edited my instructions.

So I can add to this branch a modified version of my manual correction script, manual_correction.py, so that the output name of a manual correction would directly be, for example, sub-1000032_T1w_seg-manual.nii.gz instead of sub-1000032_T1w_RPI_r_gradcorr_seg-manual.nii.gz. Is that right?

@sandrinebedard Exactly. For creating this dataset, we will solely use code from this branch.

@sandrinebedard
Member

I had some thoughts about the datasets we want to create. We talked about the derivatives folder being only in the UK_BioBank_processed dataset. However, my pipeline for cord CSA takes as input the raw images as well as the manual segmentations and disc labels in the derivatives folder. So there will be a problem if the derivatives are only in UK_BioBank_processed.

Ideas:

  • I could modify my process_data.sh to take the new dataset as input, removing the resampling, reorientation and gradcorr steps, but we would have to create the dataset before I can run my pipeline
  • Would it be possible to have the same derivatives folder associated to both datasets or something like that?

@jcohenadad do you have some thoughts on this?

@jcohenadad
Member

@sandrinebedard good point.

I could modify my process_data.sh to take the new dataset as input, removing the resampling, reorientation and gradcorr steps, but we would have to create the dataset before I can run my pipeline

I would lean towards this approach. You could, e.g., break down your shell script and create a preprocess_data.sh that deals with gradcorr and resampling. That script could also deal with renaming (i.e. remove the suffix "_gradcorr_r", as we discussed), so that the output data is "clean" of suffixes and can be used as a "native" BIDS dataset for other projects (e.g. model training).

Would it be possible to have the same derivatives folder associated to both datasets or something like that?

I would advise against it. I'm afraid we would end up with out-of-sync derivatives (e.g. a segmentation manually corrected in dataset1 that we forgot to update in dataset2).

@mpompolas
Member Author

You could, e.g., break down your shell script and create a preprocess_data.sh that deals with gradcorr and resampling. That script could also deal with renaming (i.e. remove the suffix "_gradcorr_r", as we discussed), so that the output data is "clean" of suffixes and can be used as a "native" BIDS dataset for other projects (e.g. model training).

I agree with @jcohenadad on splitting the script into two parts.

Would it be possible to have the same derivatives folder associated to both datasets or something like that?

The idea is to completely separate the original from the preprocessed dataset. If we put segmentations from multiple datasets within the same folder (I assume you would differentiate them with a suffix), it will become complicated later on to tell which ones to use for training, since we tend to have a standardized suffix in all datasets: "_seg-manual", "_labels-disk-manual", etc.
This standardization will make things very easy when we need to select multiple datasets/BIDS folders as inputs for training.
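To illustrate why a standardized suffix helps, here is a small sketch that collects label files from several BIDS folders with a single pattern. It assumes only that each dataset keeps its labels somewhere under a derivatives folder and that they are .nii.gz files. The function name is hypothetical and this is not how Ivadomed itself loads data.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def find_labels(bids_roots: Iterable[Path],
                suffix: str = "_seg-manual") -> Dict[str, List[Path]]:
    """Map each dataset folder name to its label files with the given suffix."""
    found = {}
    for root in bids_roots:
        root = Path(root)
        derivatives = root / "derivatives"
        if not derivatives.is_dir():
            # Dataset without derivatives: report it with an empty list.
            found[root.name] = []
            continue
        # Search recursively so the sub-folder layout does not matter.
        found[root.name] = sorted(derivatives.rglob(f"*{suffix}.nii.gz"))
    return found
```

Because every dataset uses the same suffix, the same call works unchanged on any number of folders. Files that still carry a processing suffix before _seg-manual would also match the pattern, which is one more reason to strip those suffixes at creation time.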

@sandrinebedard
Member

I would lean towards this approach. You could, e.g., break down your shell script and create a preprocess_data.sh that deals with gradcorr and resampling. That script could also deal with renaming (i.e. remove the suffix "_gradcorr_r", as we discussed), so that the output data is "clean" of suffixes and can be used as a "native" BIDS dataset for other projects (e.g. model training).

@jcohenadad @mpompolas I agree, splitting the script seems like the best idea. I will get started on it!

3 participants