From 03b962a6e245e01d46c29c984bd941f21358dff4 Mon Sep 17 00:00:00 2001
From: Oufattole
Date: Sat, 14 Sep 2024 14:27:43 -0400
Subject: [PATCH] Possible expand_shards documentation typo

expand_shards should take in the same input as data.root in the documentation right?
---
 docs/source/usage.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/usage.md b/docs/source/usage.md
index 245fbb3..e04d9bc 100644
--- a/docs/source/usage.md
+++ b/docs/source/usage.md
@@ -212,7 +212,7 @@ aces-cli cohort_name="foo" cohort_dir="bar/" data.standard=meds data.path="baz.p
 A MEDS dataset can have multiple shards, each stored as a `.parquet` file containing subsets of the full dataset. We can make use of Hydra's launchers and multi-run (`-m`) capabilities to start an extraction job for each shard (`data=sharded`), either in series or in parallel (e.g., using `joblib`, or `submitit` for Slurm). To load data with multiple shards, a data root needs to be provided, along with an expression containing a comma-delimited list of files for each shard. We provide a function `expand_shards` to do this, which accepts a sequence representing `/`. It also accepts a file directory, where all `.parquet` files in its directory and subdirectories will be included.
 
 ```bash
-aces-cli cohort_name="foo" cohort_dir="bar/" data.standard=meds data=sharded data.root="baz/" "data.shard=$(expand_shards qux/#)" -m
+aces-cli cohort_name="foo" cohort_dir="bar/" data.standard=meds data=sharded data.root="baz/" "data.shard=$(expand_shards baz/)" -m
 ```
 
 ### ESGPT