This repo demonstrates how to set up CONDA environments for popular dataframe libraries and process large tabular data files.
It compares parallel and out-of-core reading and processing of large datasets on CPU and GPU; out-of-core here means handling data that are too large to fit into the computer's memory.
Dataframe Library | Parallel | Out-of-core | CPU/GPU | Evaluation
---|---|---|---|---
Pandas | no | no [1] | CPU | eager |
Dask | yes | yes | CPU | lazy |
Spark | yes | yes | CPU | lazy |
cuDF | yes | no | GPU | eager |
Dask-cuDF | yes | yes | GPU | lazy |
[1] Pandas can read data in chunks, but each chunk must be processed independently.
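To make the eager vs. lazy distinction concrete, here is a minimal sketch contrasting Pandas and Dask reads of a tab-separated file (the file name gene_info.tsv matches the dataset used later in this README; the #tax_id and GeneID column names are assumptions and may differ from what the notebooks use):

```python
import pandas as pd
import dask.dataframe as dd

# Eager (Pandas): the whole file is read into memory immediately.
df = pd.read_csv("gene_info.tsv", sep="\t")
counts_eager = df.groupby("#tax_id")["GeneID"].count()

# Lazy (Dask): read_csv only builds a task graph over ~64 MB chunks;
# nothing is read or computed until .compute() is called.
ddf = dd.read_csv("gene_info.tsv", sep="\t", blocksize="64MB")
counts_lazy = ddf.groupby("#tax_id")["GeneID"].count().compute()
```

Because Dask reads the file in independent chunks, the same code keeps working when the file is larger than memory, which is what "out-of-core" refers to in the table above.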
Prerequisites: Miniconda3 (lightweight, preferred) or Anaconda3, and Mamba
- Install Miniconda3
- Install Mamba:
conda install mamba -n base -c conda-forge
- Clone this git repository
git clone https://github.com/sbl-sdsc/df-parallel.git
- Create the CONDA environment
mamba env create -f df-parallel/environment.yml
- Activate the CONDA environment
conda activate df-parallel
- Launch Jupyter Lab
jupyter lab
- Deactivate the CONDA environment
conda deactivate
To remove the CONDA environment, run
conda env remove -n df-parallel
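Once the df-parallel environment is active (before deactivating or removing it), an optional sanity check can confirm that the CPU dataframe libraries are importable. This is a minimal sketch that assumes pandas, dask, and pyspark are among the packages installed by environment.yml:

```python
# Run inside the activated df-parallel environment (e.g., in Jupyter Lab).
import pandas
import dask
import pyspark

print("pandas :", pandas.__version__)
print("dask   :", dask.__version__)
print("pyspark:", pyspark.__version__)
```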
To launch Jupyter Lab on Expanse, use the galyleo script. Specify your ACCESS account number with the --account option. If you do not have an ACCESS account and an allocation on Expanse, you can apply through NSF's ACCESS program, or contact consult@sdsc.edu for a trial allocation.
- Clone this git repository
git clone https://github.com/sbl-sdsc/df-parallel.git
2a. Run on CPU (Pandas, Dask, and Spark dataframes):
galyleo launch --account <account_number> --partition shared --cpus 10 --memory 20 --time-limit 00:30:00 --conda-env df-parallel --conda-yml "${HOME}/df-parallel/environment.yml" --mamba
2b. Run on GPU (required for cuDF and Dask-cuDF dataframes):
galyleo launch --account <account_number> --partition gpu-shared --cpus 10 --memory 92 --gpus 1 --time-limit 00:30:00 --conda-env df-parallel-gpu --conda-yml "${HOME}/df-parallel/environment-gpu.yml" --mamba
After Jupyter Lab has been launched, run the notebook DownloadData.ipynb to create a dataset. In this notebook, specify the number of copies (ncopies) to be made of the original dataset to increase its size. By default, a single copy is created. After the dataset has been created, run the dataframe-specific notebooks. Note that the cuDF and Dask-cuDF dataframe libraries require a GPU.
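Conceptually, increasing the dataset size by ncopies amounts to appending repeated copies of the original file. The sketch below only illustrates that idea; the file names and the ncopies variable are hypothetical, and the actual DownloadData.ipynb implementation may differ:

```python
import shutil

ncopies = 4                          # hypothetical; the notebook's default is 1
src = "gene_info.tsv"                # original dataset
dst = "gene_info_large.tsv"          # enlarged dataset

with open(dst, "wb") as out:
    with open(src, "rb") as f:
        out.write(f.readline())      # write the header line once
        shutil.copyfileobj(f, out)   # first copy of the data rows
    for _ in range(ncopies - 1):     # append the remaining copies
        with open(src, "rb") as f:
            f.readline()             # skip the header
            shutil.copyfileobj(f, out)
```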
Results for running on an SDSC Expanse GPU node with 10 CPU cores (Intel Xeon Gold 6248, 2.5 GHz), 1 GPU (NVIDIA V100 SXM2, 32 GB), 92 GB of memory (DDR4 DRAM), and local storage (1.6 TB Samsung PM1745b NVMe PCIe SSD).
Datafile sizes (gene_info.tsv as of June 2022):
- Dataset 1: 5.4 GB (18 GB in Pandas)
- Dataset 2: 21.4 GB (4 x Dataset 1) (62.4 GB in Pandas)
- Dataset 3: 43.7 GB (8 x Dataset 1)
Dataframe Library | time (5.4 GB) (s) | time (21.4 GB) (s) | time (43.7 GB) (s) | Parallel | Out-of-core | CPU/GPU
---|---|---|---|---|---|---
Pandas | 56.3 | 222.4 | -- [2] | no | no | CPU |
Dask | 15.7 | 42.1 | 121.8 | yes | yes | CPU |
Spark | 14.2 | 31.2 | 56.5 | yes | yes | CPU |
cuDF | 3.2 | -- [2] | -- [2] | yes | no | GPU |
Dask-cuDF | 7.3 | 11.9 | 19.0 | yes | yes | GPU |
[2] Out of memory.
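For completeness, here is a minimal sketch of the kind of out-of-core GPU read that Dask-cuDF enables; the file and column names are illustrative and not taken from the benchmark notebooks:

```python
import dask_cudf

# Build a lazy task graph over the TSV; partitions are loaded into
# GPU memory only when compute() is called, so files larger than
# GPU memory can still be processed.
ddf = dask_cudf.read_csv("gene_info_large.tsv", sep="\t")

# Illustrative aggregation: row count per taxonomy id.
counts = ddf.groupby("#tax_id")["GeneID"].count().compute()
print(counts.head())
```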