See release notes.
This is a repo to download HCP data subject by subject, pre-process it to extract time series data, and then delete the large image files, all in parallel using Python and R. To use this repo you will need:
- Anaconda and Python >= 3.6
- An account with the HCP database. In using this repository you agree to the Open Access Terms established by the Connectome Consortium.
- Amazon S3 access with the HCP database. We will use the `boto3` package for Amazon S3 access. See here on configuring your credentials.
- Workbench installed and the `wb_command` added to your PATH variable.
Exact instructions for doing the above have been purposely avoided because they are platform dependent. See below for a quick test to see whether you have succeeded.
Currently the repo downloads dense time series (dtseries) and dense labels (dlabels), and generates parcellated time series (ptseries) using Workbench. It then parses the CIFTI2-format parcellated time series to return a Python dictionary whose keys are ROI names and whose values are time series data.
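For concreteness, the parsed result for one subject is just an ordinary Python dictionary; the ROI names and numbers below are made up purely to illustrate its shape:

# Illustrative only: ROI names and values are hypothetical.
parsed = {
    'L_V1_ROI': [0.12, -0.03, 0.45, 0.18],   # one value per time point
    'R_V1_ROI': [0.08, 0.11, -0.20, 0.33],
}
print(parsed['L_V1_ROI'][:2])   # first two time points of one ROI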
Most of these instructions are for a Linux machine and have been tested out using Ubuntu 18.04 with Conda 4.6.7 and Python 3.7.
To set up the Anaconda environment, use the `environment.yml` file in the repo. The build numbers as well as version numbers have been excluded to let conda figure out what is best for your platform.
conda env create -f environment.yml
This is a short explanation of the inner workings of the code in this repository. There are two main files, `download_hcp.py` and `automate.py`.
- `download_hcp` is best thought of as a module that implements sub-functions for downloading, processing and cleaning up remainder files.
- The three main functions are `download_subject()`, `process_subject()` and `clean_subject()`.
- The `download_subject()` function does what it says: it downloads the data for a particular subject id like '100610'. But it also filters the downloads for what you need. Currently the filtration keywords are hardcoded in `download_subject()`. You need to have AWS S3 access/credentials and `boto3` installed for this function to work.
- We implement `process_subject()` to run the Workbench command (see the sketch after this list). It takes the dense time series (.dtseries) and a parcellation label file (.dlabel) as input and returns a list of output files. To call Workbench it uses the `subprocess` module. You need to have Workbench downloaded, installed, and its binaries added to your PATH for this to work.
- We implement `clean_subject()` to clean up the large downloaded files once we have generated the parcellated time series. Its input is a list of files to keep on disk. It should return nothing, but utilizing `map` for parallelizing means functions have to return something (see below).
- We also display disk usage statistics during runtime.
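To make the `process_subject()` step concrete, here is a minimal sketch of calling Workbench from Python via `subprocess`, assuming `wb_command` is on your PATH. The function name, file-name arguments, and the `COLUMN` direction are illustrative assumptions, not the exact command hardcoded in the repo:

import subprocess

def parcellate(dtseries_path, dlabel_path, ptseries_path):
    # Hypothetical wrapper: parcellate a dense time series with a dlabel file.
    # Follows `wb_command -cifti-parcellate <cifti-in> <cifti-label> <direction> <cifti-out>`.
    cmd = ['wb_command', '-cifti-parcellate',
           dtseries_path, dlabel_path, 'COLUMN', ptseries_path]
    subprocess.run(cmd, check=True)   # check=True raises if wb_command fails
    return [ptseries_path]            # process_subject() returns a list of output files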
`automate.py` calls the above functions using Python's parallelism-enabling modules.
If you have Amazon S3 and `boto3` set up correctly with your credentials, you should be able to activate your environment, fire up Python, and run
from download_hcp import *
dfiles = download_subject('100610')
and get no errors.
If you have workbench installed and correctly added to path, then in your conda environment, you should be able to fire up python and say
import subprocess
subprocess.run(['wb_command','-help'])
and get meaningful output.
Prof. YMB suggested that having large amounts of RAM, even with just a few cores, should allow for some parallelization: each of the `*_subject()` functions should be parallelizable using the `multiprocessing` package. This is easy a la functional programming!
- The `do_subject()` function chains together the above functions so that we can use the `multiprocessing.Pool.map()` function on our list of subject ids. The last function in the chain should return the final Python object to be stored on disk corresponding to each subject.
- We implement a `process_ptseries()` function that can be called by `clean_subject()`. This function takes the generated ptseries file in CIFTI2 format and returns a Python dictionary containing ROI names and related time series. It utilizes an R module under the hood, which should get automatically installed; a hedged sketch of this parsing step appears after the parallelization snippet below. The `clean_subject()` function, which originally had nothing to return, can now return this object so that `map` works. (Recall, `map` applies a function to each element of a list, and in particular can never change the length of a list.)
Note how `do_subject()` really only does:
clean_subject(idx, process_subject(*download_subject(idx)))
and parallelization only involves:
with mp.Pool(N) as pool:
    result = pool.map(do_subject, subject_ids)
where `N` is the number of parallel processes. That's so clean, even I am surprised that it worked out this way.
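For the curious, here is a minimal sketch of the idea behind `process_ptseries()` using rpy2 and the R `cifti` package. It is not the repo's exact code: the `'data'` element name, the matrix orientation, and the `roi_names` argument are assumptions (in the real code the ROI names are parsed from the CIFTI parcel axis):

import numpy as np
from rpy2.robjects import numpy2ri, r
from rpy2.robjects.packages import importr

numpy2ri.activate()   # return R matrices as numpy arrays

def parse_ptseries(ptseries_path, roi_names):
    # Sketch only: read the parcellated time series with R::cifti's read_cifti()
    # and repackage it as {roi_name: time_series}.
    cifti = importr('cifti')
    cii = cifti.read_cifti(ptseries_path)
    # Assumption: the parcel-by-time matrix lives in a list element named 'data';
    # whether rows are parcels or time points may need checking for your files.
    mat = np.asarray(r['as.matrix'](cii.rx2('data')))
    return {name: mat[i, :] for i, name in enumerate(roi_names)}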
Note: Since v2, the installation of `R::cifti` below no longer needs to be done manually. The module should automatically check whether it exists and, if not, install it for you.
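A minimal sketch of what such a check can look like with standard rpy2 utilities (not necessarily the repo's exact code):

import rpy2.robjects.packages as rpackages

def ensure_cifti():
    # Install R::cifti on first use if it is not already present.
    if not rpackages.isinstalled('cifti'):
        utils = rpackages.importr('utils')
        utils.chooseCRANmirror(ind=1)      # pick a CRAN mirror non-interactively
        utils.install_packages('cifti')
    return rpackages.importr('cifti')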
We utilize an R module in this repo. If you set up the environment using the provided .yml file and it worked without errors, you should be good. Otherwise, first install rpy2 into the conda environment you are using:
conda install rpy2
That should install the R packages needed to use R from within Python. Next, install the `cifti` package from CRAN:
# import rpy2's package module
import rpy2.robjects.packages as rpackages
# import R's utility package
utils = rpackages.importr('utils')
utils.install_packages('cifti')
It should prompt you to pick a CRAN server for the session. If the installation is successful, it should end with
.
.
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (cifti)
You can confirm successful installation by opening python and running:
from rpy2.robjects.packages import importr
importr('cifti')
which should return:
>>> importr('cifti')
rpy2.robjects.packages.Package as a <module 'cifti'>
You may have to install development packages on your system for xml2, etc. Just use `sudo apt-get install libxml2-dev` or whatever is missing.
Alternatively, you can set up the environment using the `environment.yml` file.