diff --git a/AUTHORS.md b/AUTHORS.md index 32aeff8..8f79b3b 100755 --- a/AUTHORS.md +++ b/AUTHORS.md @@ -7,6 +7,9 @@ All contributing authors are listed in this file below. The repository history at https://github.com/ljwoods2/zarrtraj and the CHANGELOG show individual code contributions. +New contributors should add themselves to the end of this file AND to +the file CITATION.cff at the end of the top-level authors list. + ## Chronological list of authors +## [0.3.0] 2024-10-24 + +## Authors +- ljwoods2 + +## Added +- added CITATION.cff file (issue #69, PR #68) + ## [0.2.1] 2024-07-28 diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000..40ae659 --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,65 @@ +# This CITATION.cff file was generated with cffinit. +# Visit https://bit.ly/cffinit to generate yours today! + +cff-version: 1.2.0 +title: 'Zarrtraj: A Python package for streaming molecular dynamics trajectories from cloud services' +message: >- + If you use this software, please cite it using the + metadata from this file. 
+type: software +authors: + - given-names: Lawson + email: lawsonw84@gmail.com + family-names: Woods + orcid: 'https://orcid.org/0009-0003-0713-4167' + affiliation: >- + School of Computing and Augmented Intelligence, + Arizona State University, Tempe, Arizona, United + States of America + - given-names: Hugo + family-names: MacDermott-Opeskin + orcid: 'https://orcid.org/0000-0002-7393-7457' + affiliation: >- + Open Molecular Software Foundation, Davis, CA, United + States of America + email: hugomacdermott-opeskin@mdanalysis.org + - given-names: Edis + family-names: Jakupovic + orcid: 'https://orcid.org/0000-0001-8813-6356' + affiliation: >- + Center for Biological Physics, Arizona State + University, Tempe, AZ, United States of America + - given-names: Yuxuan + orcid: 'https://orcid.org/0000-0003-4390-8556' + family-names: Zhuang + affiliation: >- + Department of Computer Science, Stanford University, + Stanford, CA 94305, USA. + - given-names: Richard + orcid: 'https://orcid.org/0000-0002-3241-1846' + family-names: Gowers + name-particle: J + affiliation: Charm Therapeutics, London, United Kingdom + - given-names: Oliver + family-names: Beckstein + affiliation: >- + Center for Biological Physics, Arizona State + University, Tempe, AZ, United States of America + orcid: 'https://orcid.org/0000-0003-1340-0831' +identifiers: + - type: doi + value: 10.5281/zenodo.13887976 +repository-code: 'https://github.com/Becksteinlab/zarrtraj' +url: 'https://zarrtraj.readthedocs.io/en/latest/index.html' +abstract: >- + Zarrtraj is an MDAnalysis MDAKit for streaming H5MD and + ZarrMD trajectory files from cloud storage like AWS S3, + Google Cloud Buckets, and Azure Data lakes and Blob + Storage +keywords: + - streaming + - molecular-dynamics + - file-format + - mdanalysis + - zarr +license: GPL-3.0-or-later diff --git a/docs/source/index.rst b/docs/source/index.rst index 919eb21..8651a10 100755 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -16,6 +16,7 @@ This means 
users can interact with massive trajectory files without ever storing :caption: Contents: installation + yiip_example walkthrough api performance_considerations diff --git a/docs/source/yiip_example.rst b/docs/source/yiip_example.rst new file mode 100644 index 0000000..30a6a32 --- /dev/null +++ b/docs/source/yiip_example.rst @@ -0,0 +1,31 @@ +YiiP Protein Example +==================== + +To get started immediately with *Zarrtraj*, we have made the topology and trajectory of the +`YiiP protein in a POPC membrane `_ +publicly available for streaming. The trajectory is stored in the `zarrmd` format +for optimal streaming performance. + +To access the trajectory, follow this example: + +.. code-block:: python + + import zarrtraj + import MDAnalysis as mda + import fsspec + + + with fsspec.open("gcs://zarrtraj-test-data/YiiP_system.pdb", "r") as top: + + u = mda.Universe( + top, "gcs://zarrtraj-test-data/yiip.zarrmd", topology_format="PDB" + ) + + for ts in u.trajectory: + pass  # Do something with each timestep + + +While there is not yet an officially recommended way to access cloud-stored topologies, this +method of opening a Python `File`-like object from the topology URL in PDB format using +`FSSpec `_ +works with MDAnalysis 2.7.0. Check back later for further development! \ No newline at end of file diff --git a/joss_paper/RMSD.png b/joss_paper/RMSD.png new file mode 100644 index 0000000..7d5ec9c Binary files /dev/null and b/joss_paper/RMSD.png differ diff --git a/joss_paper/benchmark.png b/joss_paper/benchmark.png new file mode 100644 index 0000000..edafbd0 Binary files /dev/null and b/joss_paper/benchmark.png differ diff --git a/joss_paper/figure_1.ipynb b/joss_paper/figure_1.ipynb new file mode 100644 index 0000000..3bb22ff --- /dev/null +++ b/joss_paper/figure_1.ipynb @@ -0,0 +1,78 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Generate iteration speed figure for JOSS paper. 
Iteration times come from https://becksteinlab.github.io/zarrtraj/#/" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "\n", + "labels = ['SSD, XTC', 'SSD, H5MD', 'AWS S3, H5MD', 'SSD, ZarrMD', 'AWS S3, ZarrMD']\n", + "values = [1.49, 4.76, 10.30,3.10, 6.53] \n", + "colors = ['#009e73', '#e69f00', '#e69f00','#56b4e9', '#56b4e9']\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "plt.bar(labels, values, color=colors)\n", + "\n", + "\n", + "plt.title('Comparison of Trajectory Iteration Speed by Storage Medium')\n", + "plt.ylabel('Time (minutes)')\n", + "\n", + "\n", + "plt.xticks(rotation=45, ha='right')\n", + "plt.tight_layout()\n", + "\n", + "plt.savefig('benchmark.png')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "zarrtraj", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/joss_paper/figure_2.ipynb b/joss_paper/figure_2.ipynb new file mode 100644 index 0000000..9bf43e5 --- /dev/null +++ b/joss_paper/figure_2.ipynb @@ -0,0 +1,336 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Generate RMSD speed figure for JOSS paper. See \"zarrtra/data/write_aligned*.py\" scripts to see \n", + "trajectory writing code." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import MDAnalysis as mda\n", + "import zarrtraj\n", + "import os\n", + "\n", + "os.environ[\"S3_REGION_NAME\"] = \"us-west-1\"\n", + "os.environ[\"AWS_PROFILE\"] = \"sample_profile\"\n", + "\n", + "TOPOL = \"/scratch/ljwoods2/workspace/zarrtraj/zarrtraj/data/yiip_equilibrium/YiiP_system.pdb\"\n", + "H5MD_DISK_TRAJ = \"/scratch/ljwoods2/workspace/zarrtraj/zarrtraj/data/yiip_aligned_compressed.h5md\"\n", + "H5MD_S3_TRAJ = \"s3://zarrtraj-test-data/yiip_aligned_compressed.h5md\"\n", + "ZARRMD_TRAJ_DISK = \"/scratch/ljwoods2/workspace/zarrtraj/zarrtraj/data/yiip_aligned_compressed.zarrmd\"\n", + "ZARRMD_TRAJ_S3 = \"s3://zarrtraj-test-data/yiip_aligned_compressed.zarrmd\"\n", + "XTC_TRAJ = \"/scratch/ljwoods2/workspace/zarrtraj/zarrtraj/data/yiip_equilibrium/YiiP_system_90ns_center_aligned.xtc\"\n", + "\n", + "h5md_disk_u = mda.Universe(TOPOL, H5MD_DISK_TRAJ)\n", + "h5md_s3_u = mda.Universe(TOPOL, H5MD_S3_TRAJ)\n", + "zarrmd_disk_u = mda.Universe(TOPOL, ZARRMD_TRAJ_DISK)\n", + "zarrm_s3_u = mda.Universe(TOPOL, ZARRMD_TRAJ_S3)\n", + "xtc_u = mda.Universe(TOPOL, XTC_TRAJ)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "ename": "NameError", + "evalue": "name 'xtc_u' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[2], line 11\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[38;5;66;03m## Dask \u001b[39;00m\n\u001b[1;32m 6\u001b[0m \n\u001b[1;32m 7\u001b[0m \u001b[38;5;66;03m### XTC \u001b[39;00m\n\u001b[1;32m 9\u001b[0m start \u001b[38;5;241m=\u001b[39m time\u001b[38;5;241m.\u001b[39mtime()\n\u001b[0;32m---> 11\u001b[0m R \u001b[38;5;241m=\u001b[39m RMSD(xtc_u, xtc_u,\n\u001b[1;32m 12\u001b[0m 
select\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mbackbone\u001b[39m\u001b[38;5;124m\"\u001b[39m, \n\u001b[1;32m 13\u001b[0m )\n\u001b[1;32m 14\u001b[0m R\u001b[38;5;241m.\u001b[39mrun(backend\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mdask\u001b[39m\u001b[38;5;124m'\u001b[39m, n_workers\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m4\u001b[39m)\n\u001b[1;32m 16\u001b[0m stop \u001b[38;5;241m=\u001b[39m time\u001b[38;5;241m.\u001b[39mtime()\n", + "\u001b[0;31mNameError\u001b[0m: name 'xtc_u' is not defined" + ] + } + ], + "source": [ + "from MDAnalysis.analysis.rms import RMSD\n", + "import time\n", + "\n", + "\n", + "## Dask \n", + "\n", + "### XTC \n", + "\n", + "start = time.time()\n", + "\n", + "R = RMSD(xtc_u, xtc_u,\n", + " select=\"backbone\", \n", + ")\n", + "R.run(backend='dask', n_workers=4)\n", + "\n", + "stop = time.time()\n", + "\n", + "xtc_time_dask = stop-start\n", + "\n", + "### ZarrMD, disk\n", + "start = time.time()\n", + "\n", + "R = RMSD(zarrmd_disk_u, zarrmd_disk_u,\n", + " select=\"backbone\", \n", + ")\n", + "R.run(backend='dask', n_workers=4)\n", + "\n", + "stop = time.time()\n", + "\n", + "zarrmd_disk_time_dask = stop-start\n", + "\n", + "## H5MD, disk\n", + "\n", + "start = time.time()\n", + "\n", + "R = RMSD(h5md_disk_u, h5md_disk_u,\n", + " select=\"backbone\", \n", + ")\n", + "R.run(backend='dask', n_workers=4)\n", + "\n", + "stop = time.time()\n", + "\n", + "h5md_disk_time_dask = stop-start\n", + "\n", + "## ZarrMD, S3\n", + "start = time.time()\n", + "\n", + "R = RMSD(zarrm_s3_u, zarrm_s3_u,\n", + " select=\"backbone\", \n", + ")\n", + "R.run(backend='dask', n_workers=4)\n", + "\n", + "stop = time.time()\n", + "\n", + "zarrmd_s3_time_dask = stop-start\n", + "\n", + "## H5MD, S3\n", + "\n", + "start = time.time()\n", + "\n", + "R = RMSD(h5md_s3_u, h5md_s3_u,\n", + " select=\"backbone\", \n", + ")\n", + "R.run(backend='dask', n_workers=4)\n", + "\n", + "stop = time.time()\n", + "\n", + 
"h5md_s3_time_dask = stop-start\n", + "\n", + "## Serial\n", + "\n", + "## XTC\n", + "\n", + "start = time.time()\n", + "\n", + "R = RMSD(xtc_u, xtc_u,\n", + " select=\"backbone\", \n", + ")\n", + "R.run(backend='serial')\n", + "\n", + "stop = time.time()\n", + "\n", + "xtc_time_serial = stop-start\n", + "\n", + "\n", + "## ZarrMD, disk\n", + "start = time.time()\n", + "\n", + "R = RMSD(zarrmd_disk_u, zarrmd_disk_u,\n", + " select=\"backbone\", \n", + ")\n", + "R.run(backend='serial')\n", + "\n", + "\n", + "stop = time.time()\n", + "\n", + "zarrmd_disk_time_serial = stop-start\n", + "\n", + "## H5MD, disk\n", + "\n", + "\n", + "start = time.time()\n", + "\n", + "R = RMSD(h5md_disk_u, h5md_disk_u,\n", + " select=\"backbone\", \n", + ")\n", + "R.run(backend='serial')\n", + "\n", + "\n", + "stop = time.time()\n", + "\n", + "h5md_disk_time_serial = stop-start\n", + "\n", + "## ZarrMD, S3\n", + "\n", + "start = time.time()\n", + "\n", + "R = RMSD(zarrm_s3_u, zarrm_s3_u,\n", + " select=\"backbone\", \n", + ")\n", + "R.run(backend='serial')\n", + "\n", + "stop = time.time()\n", + "\n", + "zarrm_s3_time_serial = stop-start\n", + "\n", + "## H5MD, S3\n", + "\n", + "start = time.time()\n", + "\n", + "R = RMSD(h5md_s3_u, h5md_s3_u,\n", + " select=\"backbone\", \n", + ")\n", + "R.run(backend='serial')\n", + "\n", + "\n", + "stop = time.time()\n", + "\n", + "h5md_s3_time_serial = stop-start" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.5079891562461853\n", + "1.5919220209121705\n", + "4.785086027781168\n", + "15.081525770823161\n", + "0.8399416049321492\n", + "1.9880372802416484\n", + "5.266235820452372\n", + "19.71386777162552\n", + "2.5871417999267576\n", + "5.6665968219439184\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/tmp/ipykernel_2862458/849001616.py:51: UserWarning: set_ticklabels() should only be used with a fixed number 
of ticks, i.e. after set_ticks() or using a FixedLocator.\n", + " ax1.set_xticklabels(labels, rotation=45, ha='right')\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "import matplotlib.pyplot as plt\n", + "import matplotlib.ticker as ticker\n", + "\n", + "labels = [\n", + "'XTC, Dask', \n", + "'XTC, Serial', \n", + "'H5MD, Dask', \n", + "'H5MD, Serial', \n", + "'ZarrMD, Dask', \n", + "'ZarrMD, Serial', \n", + "# '',\n", + "'H5MD, Dask ', \n", + "'H5MD, Serial ', \n", + "'ZarrMD, Dask ', \n", + "'ZarrMD, Serial ']\n", + "\n", + "values = [\n", + "# XTC, disk\n", + "30.479349374771118 / 60.0,\n", + "95.51532125473022 / 60.0,\n", + "# H5MD, disk\n", + "287.1051616668701 / 60.0,\n", + "904.8915462493896 / 60.0,\n", + "# ZarrMD, disk\n", + "50.396496295928955 / 60.0,\n", + "119.2822368144989 / 60.0,\n", + "# Sep\n", + "# 0,\n", + "# H5MD, S3\n", + "315.97414922714233 / 60.0,\n", + "1182.8320662975311 / 60.0,\n", + "# ZarrMD, S3\n", + "155.22850799560547 / 60.0,\n", + "339.99580931663513 / 60.0,\n", + "]\n", + "\n", + "for value in values:\n", + " print(value)\n", + "colors = [\n", + "'#009e73', '#009e73', \n", + "'#e69f00', '#e69f00', \n", + "'#56b4e9', '#56b4e9', \n", + "# 'none',\n", + "'#e69f00', '#e69f00', \n", + "'#56b4e9', '#56b4e9']\n", + "\n", + "\n", + "fig1, ax1 = plt.subplots(figsize=(12, 8))\n", + "ax1.bar(labels, values, color=colors)\n", + "ax1.set_xticklabels(labels, rotation=45, ha='right')\n", + "ax1.set_ylabel('Time (minutes)')\n", + "\n", + "# Axis 2 (labels)\n", + "ax2 = ax1.twiny()\n", + "ax2.spines[\"bottom\"].set_position((\"axes\", -0.20))\n", + "ax2.tick_params('both', length=0, width=0, which='minor')\n", + "ax2.tick_params('both', direction='in', which='major')\n", + "ax2.xaxis.set_ticks_position(\"bottom\")\n", + "ax2.xaxis.set_label_position(\"bottom\")\n", + "\n", + "ax2.set_xticks([0.0, 0.6, 1.0])\n", + "ax2.xaxis.set_major_formatter(ticker.NullFormatter())\n", + "ax2.xaxis.set_minor_locator(ticker.FixedLocator([0.3, 0.8]))\n", + 
"ax2.xaxis.set_minor_formatter(ticker.FixedFormatter(['Disk (SSD)', 'AWS S3']))\n", + "\n", + "\n", + "\n", + "plt.title('Comparison of RMSD Calculation Speed for Different Storage Backends, Trajectory Formats, and Execution Strategies')\n", + "\n", + "plt.tight_layout()\n", + "\n", + "plt.savefig('RMSD.png')\n", + "plt.show()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "zarrtraj", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/joss_paper/paper.bib b/joss_paper/paper.bib new file mode 100644 index 0000000..b742aae --- /dev/null +++ b/joss_paper/paper.bib @@ -0,0 +1,318 @@ +@article{FAIR:2019, + title = {Make scientific data FAIR}, + volume = {570}, + ISSN = {1476-4687}, + url = {http://dx.doi.org/10.1038/d41586-019-01720-7}, + DOI = {10.1038/d41586-019-01720-7}, + number = {7759}, + journal = {Nature}, + publisher = {Springer Science and Business Media LLC}, + author = {Stall, Shelley and Yarmey, Lynn and Cutcher-Gershenfeld, Joel and Hanson, Brooks and Lehnert, Kerstin and Nosek, Brian and Parsons, Mark and Robinson, Erin and Wyborn, Lesley}, + year = {2019}, + month = jun, + pages = {27–29} +} + +@misc{FoldingAtHome:2020, + title = {Foldingathome COVID-19 Datasets}, + url = {https://registry.opendata.aws/foldingathome-covid19}, + note = {Accessed: September 25, 2024} +} + +@article{GPCRmd:2019, + title = {Bringing Molecular Dynamics Simulation Data into View}, + volume = {44}, + ISSN = {0968-0004}, + url = {http://dx.doi.org/10.1016/j.tibs.2019.06.004}, + DOI = {10.1016/j.tibs.2019.06.004}, + number = {11}, + journal = {Trends in Biochemical Sciences}, + publisher = {Elsevier BV}, + author = {Hildebrand, Peter W. 
and Rose, Alexander S. and Tiemann, Johanna K.S.}, + year = {2019}, + month = nov, + pages = {902–913} +} + +@article{GPCRome:2020, + title = {GPCRmd uncovers the dynamics of the 3D-GPCRome}, + volume = {17}, + ISSN = {1548-7105}, + url = {http://dx.doi.org/10.1038/s41592-020-0884-y}, + DOI = {10.1038/s41592-020-0884-y}, + number = {8}, + journal = {Nature Methods}, + publisher = {Springer Science and Business Media LLC}, + author = {Rodríguez-Espigares, Ismael and Torrens-Fontanals, Mariona and Tiemann, Johanna K. S. and Aranda-García, David and Ramírez-Anguita, Juan Manuel and Stepniewski, Tomasz Maciej and Worp, Nathalie and Varela-Rial, Alejandro and Morales-Pastor, Adrián and Medel-Lacruz, Brian and Pándy-Szekeres, Gáspár and Mayol, Eduardo and Giorgino, Toni and Carlsson, Jens and Deupi, Xavier and Filipek, Slawomir and Filizola, Marta and Gómez-Tamayo, José Carlos and Gonzalez, Angel and Gutiérrez-de-Terán, Hugo and Jiménez-Rosés, Mireia and Jespers, Willem and Kapla, Jon and Khelashvili, George and Kolb, Peter and Latek, Dorota and Marti-Solano, Maria and Matricon, Pierre and Matsoukas, Minos-Timotheos and Miszta, Przemyslaw and Olivella, Mireia and Perez-Benito, Laura and Provasi, Davide and Ríos, Santiago and R. Torrecillas, Iván and Sallander, Jessica and Sztyler, Agnieszka and Vasile, Silvana and Weinstein, Harel and Zachariae, Ulrich and Hildebrand, Peter W. and De Fabritiis, Gianni and Sanz, Ferran and Gloriam, David E. 
and Cordomi, Arnau and Guixà-González, Ramon and Selent, Jana}, + year = {2020}, + month = jul, + pages = {777–787} +} + +@article{H5MD:2014, + title = {H5MD: A structured, efficient, and portable file format for molecular data}, + journal = {Computer Physics Communications}, + volume = {185}, + number = {6}, + pages = {1546-1553}, + year = {2014}, + issn = {0010-4655}, + doi = {https://doi.org/10.1016/j.cpc.2014.01.018}, + url = {https://www.sciencedirect.com/science/article/pii/S0010465514000447}, + author = {Pierre {de Buyl} and Peter H. Colberg and Felix Höfling}, + keywords = {Molecular simulation, HDF5}, + abstract = {We propose a new file format named “H5MD” for storing molecular simulation data, such as trajectories of particle positions and velocities, along with thermodynamic observables that are monitored during the course of the simulation. H5MD files are HDF5 (Hierarchical Data Format) files with a specific hierarchy and naming scheme. Thus, H5MD inherits many benefits of HDF5, e.g., structured layout of multi-dimensional datasets, data compression, fast and parallel I/O, and portability across many programming languages and hardware platforms. H5MD files are self-contained, and foster the reproducibility of scientific data and the interchange of data between researchers using different simulation programs and analysis software. In addition, the H5MD specification can serve for other kinds of data (e.g. 
experimental data) and is extensible to supplemental data, or may be part of an enclosing file structure.} +} + +@inproceedings{H5MDReader:2021, + address = {Austin, TX}, + title = {{MPI}-parallel {Molecular} {Dynamics} {Trajectory} {Analysis} with the {H5MD} {Format} in the {MDAnalysis} {Python} {Package}}, + url = {https://conference.scipy.org/proceedings/scipy2021/edis_jakupovic.html}, + doi = {10.25080/majora-1b6fd038-005}, + abstract = {Molecular dynamics (MD) computer simulations help elucidate details of the molecular processes in complex biological systems, from protein dynamics to drug discovery. One major issue is that these MD simulation files are now commonly terabytes in size, which means analyzing the data from these files becomes a painstakingly expensive task. In the age of national supercomputers, methods of parallel analysis are becoming a necessity for the efficient use of time and high performance computing (HPC) resources but for any approach to parallel analysis, simply reading the file from disk becomes the performance bottleneck that limits overall analysis speed. One promising way around this file I/O hurdle is to use a parallel message passing interface (MPI) implementation with the HDF5 (Hierarchical Data Format 5) file format to access a single file simultaneously with numerous processes on a parallel file system. Our previous feasibility study suggested that this combination can lead to favorable parallel scaling with hundreds of CPU cores, so we implemented a fast and user-friendly HDF5 reader (the H5MDReader class) that adheres to H5MD (HDF5 for Molecular Dynamics) specifications. We made H5MDReader (together with a H5MD output class H5MDWriter) available in the MDAnalysis library, a Python package that simplifies the process of reading and writing various popular MD file formats by providing a streamlined user-interface that is independent of any specific file format. 
We benchmarked H5MDReader's parallel file reading capabilities on three HPC clusters: ASU Agave, SDSC Comet, and PSC Bridges. The benchmark consisted of a simple split-apply-combine scheme of an I/O bound task that split a 90k frame (113 GiB) coordinate trajectory into chunks for processes, where each process performed the commonly used RMSD (root mean square distance after optimal structural superposition) calculation on their chunk of data, and then gathered the results back to the root process. For baseline performance, we found maximum I/O speedups at 2 full nodes, with Agave showing 20x, and a maximum computation speedup on Comet of 373x on 384 cores (all three HPCs scaled well in their computation task). We went on to test a series of optimizations attempting to speed up I/O performance, including adjusting file system stripe count, implementing a masked array feature that only loads relevant data for the computation task, front loading all I/O by loading the entire trajectory into memory, and manually adjusting the HDF5 dataset chunk shapes. We found the largest improvement in I/O performance by optimizing the chunk shape of the HDF5 datasets to match the iterative access pattern of our analysis benchmark. With respect to baseline serial performance, our best result was a 98x speedup at 112 cores on ASU Agave. In terms of absolute time saved, the analysis went from 4623 seconds in the baseline serial run to 47 seconds in the parallel, properly chunked run. 
Our results emphasize the fact that file I/O is not just dependent on the access pattern of the file, but more so the synergy between access pattern and the layout of the file on disk.}, + urldate = {2021-07-05}, + booktitle = {Proceedings of the 20th {Python} in {Science} {Conference}}, + author = {Jakupovic, Edis and Beckstein, Oliver}, + editor = {Agarwal, Meghann and Calloway, Chris and Niederhut, Dillon and Shupe, David}, + year = {2021}, + pages = {40--48}, +} + +@INPROCEEDINGS{MDAKits:2023, + title = "{MDAKits}: A framework for {FAIR-compliant} molecular + simulation analysis", + booktitle = "Proceedings of the Python in Science Conference", + author = "Alibay, Irfan and Wang, Lily and Naughton, Fiona and Kenney, + Ian and Barnoud, Jonathan and Gowers, Richard and Beckstein, + Oliver", + publisher = "SciPy", + pages = "76--84", + year = 2023, + conference = "Python in Science Conference", + location = "Austin, Texas" +} + + +@InProceedings{MDAnalysis:2016, + author = { {R}ichard {J}. {G}owers and {M}ax {L}inke and {J}onathan {B}arnoud and {T}yler {J}. {E}. {R}eddy and {M}anuel {N}. {M}elo and {S}ean {L}. {S}eyler and {J}an {D}omański and {D}avid {L}. {D}otson and {S}ébastien {B}uchoux and {I}an {M}. {K}enney and {O}liver {B}eckstein }, + title = { {M}{D}{A}nalysis: {A} {P}ython {P}ackage for the {R}apid {A}nalysis of {M}olecular {D}ynamics {S}imulations }, + booktitle = { {P}roceedings of the 15th {P}ython in {S}cience {C}onference }, + pages = { 98 - 105 }, + year = { 2016 }, + editor = { {S}ebastian {B}enthall and {S}cott {R}ostrup }, + doi = { 10.25080/Majora-629e541a-00e } +} + + +@article{MDAnalysis:2011, + author = {Michaud-Agrawal, Naveen and Denning, Elizabeth J. and Woolf, Thomas B. 
and Beckstein, Oliver}, + title = {MDAnalysis: A toolkit for the analysis of molecular dynamics simulations}, + journal = {Journal of Computational Chemistry}, + volume = {32}, + number = {10}, + pages = {2319-2327}, + keywords = {molecular dynamics simulations, analysis, proteins, object-oriented design, software, membrane systems, Python programming language}, + doi = {https://doi.org/10.1002/jcc.21787}, + url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/jcc.21787}, + eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/jcc.21787}, + abstract = {Abstract MDAnalysis is an object-oriented library for structural and temporal analysis of molecular dynamics (MD) simulation trajectories and individual protein structures. It is written in the Python language with some performance-critical code in C. It uses the powerful NumPy package to expose trajectory data as fast and efficient NumPy arrays. It has been tested on systems of millions of particles. Many common file formats of simulation packages including CHARMM, Gromacs, Amber, and NAMD and the Protein Data Bank format can be read and written. Atoms can be selected with a syntax similar to CHARMM's powerful selection commands. MDAnalysis enables both novice and experienced programmers to rapidly write their own analytical tools and access data stored in trajectories in an easily accessible manner that facilitates interactive explorative analysis. MDAnalysis has been tested on and works for most Unix-based platforms such as Linux and Mac OS X. It is freely available under the GNU General Public License from http://mdanalysis.googlecode.com. © 2011 Wiley Periodicals, Inc. J Comput Chem 2011}, + year = {2011} +} + +@misc{MDDB:2024, + title={The need to implement FAIR principles in biomolecular simulations}, + author={Rommie Amaro and Johan Åqvist and Ivet Bahar and Federica Battistini and Adam Bellaiche and Daniel Beltran and Philip C. Biggin and Massimiliano Bonomi and Gregory R. 
Bowman and Richard Bryce and Giovanni Bussi and Paolo Carloni and David Case and Andrea Cavalli and Chie-En A. Chang and Thomas E. Cheatham III and Margaret S. Cheung and Cris Chipot and Lillian T. Chong and Preeti Choudhary and Gerardo Andres Cisneros and Cecilia Clementi and Rosana Collepardo-Guevara and Peter Coveney and Roberto Covino and T. Daniel Crawford and Matteo Dal Peraro and Bert de Groot and Lucie Delemotte and Marco De Vivo and Jonathan Essex and Franca Fraternali and Jiali Gao and Josep Lluís Gelpí and Francesco Luigi Gervasio and Fernando Danilo Gonzalez-Nilo and Helmut Grubmüller and Marina Guenza and Horacio V. Guzman and Sarah Harris and Teresa Head-Gordon and Rigoberto Hernandez and Adam Hospital and Niu Huang and Xuhui Huang and Gerhard Hummer and Javier Iglesias-Fernández and Jan H. Jensen and Shantenu Jha and Wanting Jiao and William L. Jorgensen and Shina Caroline Lynn Kamerlin and Syma Khalid and Charles Laughton and Michael Levitt and Vittorio Limongelli and Erik Lindahl and Kresten Lindorff-Larsen and Sharon Loverde and Magnus Lundborg and Yun Lyna Luo and Francisco Javier Luque and Charlotte I. Lynch and Alexander MacKerell and Alessandra Magistrato and Siewert J. Marrink and Hugh Martin and J. Andrew McCammon and Kenneth Merz and Vicent Moliner and Adrian Mulholland and Sohail Murad and Athi N. Naganathan and Shikha Nangia and Frank Noe and Agnes Noy and Julianna Oláh and Megan O'Mara and Mary Jo Ondrechen and José N. Onuchic and Alexey Onufriev and Silvia Osuna and Anna R. Panchenko and Sergio Pantano and Carol Parish and Michele Parrinello and Alberto Perez and Tomas Perez-Acle and Juan R. Perilla and B. Montgomery Pettitt and Adriana Pietropalo and Jean-Philip Piquemal and Adolfo Poma and Matej Praprotnik and Maria J. Ramos and Pengyu Ren and Nathalie Reuter and Adrian Roitberg and Edina Rosta and Carme Rovira and Benoit Roux and Ursula Röthlisberger and Karissa Y. Sanbonmatsu and Tamar Schlick and Alexey K. 
Shaytan and Carlos Simmerling and Jeremy C. Smith and Yuji Sugita and Katarzyna Świderek and Makoto Taiji and Peng Tao and Irina G. Tikhonova and Julian Tirado-Rives and Inaki Tunón and Marc W. Van Der Kamp and David Van der Spoel and Sameer Velankar and Gregory A. Voth and Rebecca Wade and Ariel Warshel and Valerie Vaissier Welborn and Stacey Wetmore and Chung F. Wong and Lee-Wei Yang and Martin Zacharias and Modesto Orozco}, + year={2024}, + eprint={2407.16584}, + archivePrefix={arXiv}, + primaryClass={q-bio.BM}, + url={https://arxiv.org/abs/2407.16584}, +} + +@article{MDsrv:2022, + author = {Kampfrath, Michelle and Staritzbichler, René and Hernández, Guillermo Pérez and Rose, Alexander S and Tiemann, Johanna K S and Scheuermann, Gerik and Wiegreffe, Daniel and Hildebrand, Peter W}, + title = "{MDsrv: visual sharing and analysis of molecular dynamics simulations}", + journal = {Nucleic Acids Research}, + volume = {50}, + number = {W1}, + pages = {W483-W489}, + year = {2022}, + month = {05}, + abstract = "{Molecular dynamics simulation is a proven technique for computing and visualizing the time-resolved motion of macromolecules at atomic resolution. The MDsrv is a tool that streams MD trajectories and displays them interactively in web browsers without requiring advanced skills, facilitating interactive exploration and collaborative visual analysis. We have now enhanced the MDsrv to further simplify the upload and sharing of MD trajectories and improve their online viewing and analysis. With the new instance, the MDsrv simplifies the creation of sessions, which allows the exchange of MD trajectories with preset representations and perspectives. An important innovation is that the MDsrv can now access and visualize trajectories from remote datasets, which greatly expands its applicability and use, as the data no longer needs to be accessible on a local server. 
In addition, initial analyses such as sequence or structure alignments, distance measurements, or RMSD calculations have been implemented, which optionally support visual analysis. Finally, based on Mol*, MDsrv now provides faster and more efficient visualization of even large trajectories compared to its predecessor tool NGL.}", + issn = {0305-1048}, + doi = {10.1093/nar/gkac398}, + url = {https://doi.org/10.1093/nar/gkac398}, + eprint = {https://academic.oup.com/nar/article-pdf/50/W1/W483/44375694/gkac398.pdf}, +} + + +@article {MDverse:2024, + article_type = {journal}, + title = {MDverse, shedding light on the dark matter of molecular dynamics simulations}, + author = {Tiemann, Johanna KS and Szczuka, Magdalena and Bouarroudj, Lisa and Oussaren, Mohamed and Garcia, Steven and Howard, Rebecca J and Delemotte, Lucie and Lindahl, Erik and Baaden, Marc and Lindorff-Larsen, Kresten and Chavent, Matthieu and Poulain, Pierre}, + editor = {Haider, Shozeb and Cui, Qiang}, + volume = 12, + year = 2024, + month = {aug}, + pub_date = {2024-08-30}, + pages = {RP90061}, + citation = {eLife 2024;12:RP90061}, + doi = {10.7554/eLife.90061}, + url = {https://doi.org/10.7554/eLife.90061}, + abstract = {The rise of open science and the absence of a global dedicated data repository for molecular dynamics (MD) simulations has led to the accumulation of MD files in generalist data repositories, constituting the \textit{dark matter of MD} — data that is technically accessible, but neither indexed, curated, or easily searchable. Leveraging an original search strategy, we found and indexed about 250,000 files and 2000 datasets from Zenodo, Figshare and Open Science Framework. With a focus on files produced by the Gromacs MD software, we illustrate the potential offered by the mining of publicly available MD data. 
We identified systems with specific molecular composition and were able to characterize essential parameters of MD simulation such as temperature and simulation length, and could identify model resolution, such as all-atom and coarse-grain. Based on this analysis, we inferred metadata to propose a search engine prototype to explore the MD data. To continue in this direction, we call on the community to pursue the effort of sharing MD data, and to report and standardize metadata to reuse this valuable matter.}, + keywords = {molecular dynamics, simulation, modeling, FAIR}, + journal = {eLife}, + issn = {2050-084X}, + publisher = {eLife Sciences Publications, Ltd}, +} + +@article{MLMDMethods:2023, +author = {Jackson, Nicholas E. and Savoie, Brett M. and Statt, Antonia and Webb, Michael A.}, +title = {Introduction to Machine Learning for Molecular Simulation}, +journal = {Journal of Chemical Theory and Computation}, +volume = {19}, +number = {14}, +pages = {4335-4337}, +year = {2023}, +doi = {10.1021/acs.jctc.3c00735}, + note ={PMID: 37489106}, +URL = { + https://doi.org/10.1021/acs.jctc.3c00735 +}, +eprint = { + https://doi.org/10.1021/acs.jctc.3c00735 +} +} + +@Article{NumPy:2020, + title = {Array programming with {NumPy}}, + author = {Charles R. Harris and K. Jarrod Millman and St{\'{e}}fan J. + van der Walt and Ralf Gommers and Pauli Virtanen and David + Cournapeau and Eric Wieser and Julian Taylor and Sebastian + Berg and Nathaniel J. Smith and Robert Kern and Matti Picus + and Stephan Hoyer and Marten H. van Kerkwijk and Matthew + Brett and Allan Haldane and Jaime Fern{\'{a}}ndez del + R{\'{i}}o and Mark Wiebe and Pearu Peterson and Pierre + G{\'{e}}rard-Marchant and Kevin Sheppard and Tyler Reddy and + Warren Weckesser and Hameer Abbasi and Christoph Gohlke and + Travis E. 
Oliphant}, + year = {2020}, + month = sep, + journal = {Nature}, + volume = {585}, + number = {7825}, + pages = {357--362}, + doi = {10.1038/s41586-020-2649-2}, + publisher = {Springer Science and Business Media {LLC}}, + url = {https://doi.org/10.1038/s41586-020-2649-2} +} + +@ARTICLE{PANGEO:2022, + AUTHOR={Stern, Charles and Abernathey, Ryan and Hamman, Joseph and Wegener, Rachel and Lepore, Chiara and Harkins, Sean and Merose, Alexander }, + + TITLE={Pangeo Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data Production}, + + JOURNAL={Frontiers in Climate}, + + VOLUME={3}, + + YEAR={2022}, + + URL={https://www.frontiersin.org/journals/climate/articles/10.3389/fclim.2021.782909}, + + DOI={10.3389/fclim.2021.782909}, + + ISSN={2624-9553}, + + ABSTRACT={

Pangeo Forge is a new community-driven platform that accelerates science by providing high-level recipe frameworks alongside cloud compute infrastructure for extracting data from provider archives, transforming it into analysis-ready, cloud-optimized (ARCO) data stores, and providing a human- and machine-readable catalog for browsing and loading. In abstracting the scientific domain logic of data recipes from cloud infrastructure concerns, Pangeo Forge aims to open a door for a broader community of scientists to participate in ARCO data production. A wholly open-source platform composed of multiple modular components, Pangeo Forge presents a foundation for the practice of reproducible, cloud-native, big-data ocean, weather, and climate science without relying on proprietary or cloud-vendor-specific tooling.

} +} + +@inproceedings{ParallelAnalysis:2010, + author = {Tu, Tiankai and Rendleman, Charles A. and Miller, Patrick J. and Sacerdoti, Federico and Dror, Ron O. and Shaw, David E.}, + title = {Accelerating parallel analysis of scientific simulation data via Zazen}, + year = {2010}, + publisher = {USENIX Association}, + address = {USA}, + abstract = {As a new generation of parallel supercomputers enables researchers to conduct scientific simulations of unprecedented scale and resolution, terabyte-scale simulation output has become increasingly commonplace. Analysis of such massive data sets is typically I/O-bound: many parallel analysis programs spend most of their execution time reading data from disk rather than performing useful computation. To overcome this I/O bottleneck, we have developed a new data access method. Our main idea is to cache a copy of simulation output files on the local disks of an analysis cluster's compute nodes, and to use a novel task-assignment protocol to co-locate data access with computation. We have implemented our methodology in a parallel disk cache system called Zazen. By avoiding the overhead associated with querying metadata servers and by reading data in parallel from local disks, Zazen is able to deliver a sustained read bandwidth of over 20 gigabytes per second on a commodity Linux cluster with 100 nodes, approaching the optimal aggregated I/O bandwidth attainable on these nodes. Compared with conventional NFS, PVFS2, and Hadoop/HDFS, respectively, Zazen is 75, 18, and 6 times faster for accessing large (1-GB) files, and 25, 13, and 85 times faster for accessing small (2-MB) files. 
We have deployed Zazen in conjunction with Anton--a special-purpose supercomputer that dramatically accelerates molecular dynamics (MD) simulations-- and have been able to accelerate the parallel analysis of terabyte-scale MD trajectories by about an order of magnitude.}, + booktitle = {Proceedings of the 8th USENIX Conference on File and Storage Technologies}, + pages = {10}, + numpages = {1}, + location = {San Jose, California}, + series = {FAST'10} +} + +@article{SharingMD:2019, +author = {Abraham, Mark and Apostolov, Rossen and Barnoud, Jonathan and Bauer, Paul and Blau, Christian and Bonvin, Alexandre M.J.J. and Chavent, Matthieu and Chodera, John and Čondić-Jurkić, Karmen and Delemotte, Lucie and Grubmüller, Helmut and Howard, Rebecca J. and Jordan, E. Joseph and Lindahl, Erik and Ollila, O. H. Samuli and Selent, Jana and Smith, Daniel G. A. and Stansfeld, Phillip J. and Tiemann, Johanna K.S. and Trellet, Mikael and Woods, Christopher and Zhmurov, Artem}, +title = {Sharing Data from Molecular Simulations}, +journal = {Journal of Chemical Information and Modeling}, +volume = {59}, +number = {10}, +pages = {4093-4099}, +year = {2019}, +doi = {10.1021/acs.jcim.9b00665}, + note ={PMID: 31525920}, +URL = { + https://doi.org/10.1021/acs.jcim.9b00665 +}, +eprint = { + https://doi.org/10.1021/acs.jcim.9b00665 +} +} + +@article{SplitApplyCombine:2011, + title={The Split-Apply-Combine Strategy for Data Analysis}, + volume={40}, + url={https://www.jstatsoft.org/index.php/jss/article/view/v040i01}, + doi={10.18637/jss.v040.i01}, + abstract={Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored. 
The paper includes two case studies showing how these insights make it easier to work with batting records for veteran baseball players and a large 3d array of spatio-temporal ozone measurements.}, + number={1}, + journal={Journal of Statistical Software}, + author={Wickham, Hadley}, + year={2011}, + pages={1–29} +} + +@article{YiiP:2019, +author = "Shujie Fan and Oliver Beckstein", +title = "{Molecular Dynamics trajectories of membrane protein YiiP}", +year = "2019", +month = "5", +url = "https://figshare.com/articles/dataset/Molecular_Dynamics_trajectories_of_membrane_protein_YiiP/8202149", +doi = "10.6084/m9.figshare.8202149.v1" +} + +@misc{Zarr:2024, + doi = {10.5281/ZENODO.3773449}, + url = {https://zenodo.org/doi/10.5281/zenodo.3773449}, + author = {Alistair Miles and jakirkham and M Bussonnier and Josh Moore and Dimitri Papadopoulos Orfanos and Davis Bennett and David Stansby and Joe Hamman and James Bourbeau and Andrew Fulton and Gregory Lee and Ryan Abernathey and Norman Rzepka and Zain Patel and Mads R. B. Kristensen and Sanket Verma and Saransh Chopra and Matthew Rocklin and AWA BRANDON AWA and Max Jones and Martin Durant and Elliott Sales de Andrade and Vincent Schut and raphael dussin and Shivank Chaudhary and Chris Barnes and Juan Nunez-Iglesias and shikharsg}, + title = {zarr-developers/zarr-python: v3.0.0-alpha}, + publisher = {Zenodo}, + year = {2024}, + copyright = {Creative Commons Attribution 4.0 International} +} + +@misc{Zstandard:2021, + series = {Request for Comments}, + number = 8878, + howpublished = {RFC 8878}, + publisher = {RFC Editor}, + doi = {10.17487/RFC8878}, + url = {https://www.rfc-editor.org/info/rfc8878}, + author = {Yann Collet and Murray Kucherawy}, + title = {{Zstandard Compression and the 'application/zstd' Media Type}}, + pagetotal = 45, + year = 2021, + month = feb, + abstract = {Zstandard, or "zstd" (pronounced "zee standard"), is a lossless data compression mechanism. 
This document describes the mechanism and registers a media type, content encoding, and a structured syntax suffix to be used when transporting zstd-compressed content via MIME. Despite use of the word "standard" as part of Zstandard, readers are advised that this document is not an Internet Standards Track specification; it is being published for informational purposes only. This document replaces and obsoletes RFC 8478.}, +} + + +@article{Liu:2010, + author = {Liu, Pu and Agrafiotis, Dimitris K and Theobald, Douglas L}, + journal = {J Comput Chem}, + month = {May}, + number = 7, + pages = {1561-1563}, + title = {Fast determination of the optimal rotational matrix for macromolecular superpositions}, + volume = 31, + year = 2010} diff --git a/joss_paper/paper.md b/joss_paper/paper.md new file mode 100644 index 0000000..e442bbe --- /dev/null +++ b/joss_paper/paper.md @@ -0,0 +1,230 @@ +--- +title: 'Zarrtraj: A Python package for streaming molecular dynamics trajectories from cloud services' +tags: + - streaming + - molecular-dynamics + - file-format + - mdanalysis + - zarr +authors: + - name: Lawson Woods + orcid: 0009-0003-0713-4167 + affiliation: [1, 2] + - name: Hugo MacDermott-Opeskin + orcid: 0000-0002-7393-7457 + affiliation: [3] + - name: Edis Jakupovic + affiliation: [4, 5] + orcid: 0000-0001-8813-6356 + - name: Yuxuan Zhuang + orcid: 0000-0003-4390-8556 + affiliation: [6, 7] + - name: Richard J Gowers + orcid: 0000-0002-3241-1846 + affiliation: [8] + - name: Oliver Beckstein + orcid: 0000-0003-1340-0831 + affiliation: [4, 5] +affiliations: + - name: School of Computing and Augmented Intelligence, Arizona State University, Tempe, Arizona, United States of America + index: 1 + - name: School of Molecular Sciences, Arizona State University, Tempe, Arizona, United States of America + index: 2 + - name: Open Molecular Software Foundation, Davis, CA, United States of America + index: 3 + - name: Center for Biological Physics, Arizona State University, Tempe, AZ, 
United States of America + index: 4 + - name: Department of Physics, Arizona State University, Tempe, Arizona, United States of America + index: 5 + - name: Department of Computer Science, Stanford University, Stanford, CA 94305, USA. + index: 6 + - name: Departments of Molecular and Cellular Physiology and Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA. + index: 7 + - name: Charm Therapeutics, London, United Kingdom + index: 8 +date: 23 October 2024 +bibliography: paper.bib +--- + +# Summary + +Molecular dynamics (MD) simulations provide a microscope into the behavior of +atomic-scale environments otherwise prohibitively difficult to observe. However, +the resulting trajectory data are too often siloed in a single institution's +HPC environment, rendering them unusable by the broader scientific community. +Additionally, it is increasingly common for trajectory data to be stored +entirely with a cloud storage provider, rather than at a traditional on-premises storage site. +*Zarrtraj* enables these trajectories to be read directly from cloud storage providers +like AWS, Google Cloud, and Microsoft Azure into MDAnalysis, a popular Python +package for analyzing trajectory data, providing a method to open up access to +trajectory data to anyone with an internet connection. Enabling cloud streaming +for MD trajectories facilitates replication of published analysis results, +analyses of large, conglomerate datasets from different sources, and training +machine learning models without downloading and storing trajectory data. + +# Statement of need + +The computing power in HPC environments has increased to the point where +running simulation algorithms is often no longer the constraint in +obtaining scientific insights from molecular dynamics trajectory data. +Instead, the ability to process, analyze, and share large volumes of data +now constrains research in this field [@SharingMD:2019]. 
+ +Other groups in the field recognize this same need for adherence to +FAIR principles [@FAIR:2019], including +MDsrv, a tool that can stream MD trajectories into a web browser for visual exploration [@MDsrv:2022], +GPCRmd, a web service that builds on MDsrv to provide a predefined set of analysis results and simple +geometric features for G-protein-coupled receptors [@GPCRmd:2019] [@GPCRome:2020], +MDDB (Molecular Dynamics Data Bank), an EU-scale +repository for bio-simulation data [@MDDB:2024], +and MDverse, a prototype search engine +for publicly-available GROMACS simulation data [@MDverse:2024]. + +While these efforts currently offer solutions for indexing, +searching, and visualizing MD trajectory data, the problem of distributing trajectories +in a way that enables *NumPy*-like slicing and parallel reading for use in arbitrary analysis +tasks remains. + +Although exposing download links on the open internet offers a simple solution to this problem, +on-disk representations of molecular dynamics trajectories often range in size +up to TBs in scale [@ParallelAnalysis:2010] [@FoldingAtHome:2020], +so a solution that avoids this +duplication of storage and the unnecessary download step would provide greater utility +for the computational molecular sciences ecosystem, especially if it +provides access to slices or subsampled portions of these large files. + +To address this need, we developed *Zarrtraj* as a prototype for streaming +trajectories into analysis software using an established trajectory +format. *Zarrtraj* extends MDAnalysis [@MDAnalysis:2016], a popular +Python-based library for the analysis of molecular simulation data in a wide +range of formats, to also accept remote file locations for trajectories instead +of local filenames. Instead of being integrated directly into MDAnalysis, +*Zarrtraj* is built as an external MDAKit [@MDAKits:2023] that automatically +registers its capabilities with MDAnalysis on import and thus acts as a plugin. 
+*Zarrtraj* enables streaming MD trajectories in the popular HDF5-based H5MD format [@H5MD:2014] +from AWS S3, Google Cloud Buckets, and Azure Blob Storage and Data Lakes without ever downloading them. +*Zarrtraj* relies on the *Zarr* [@Zarr:2024] package for +streaming array-like data from a variety of storage mediums and on [Kerchunk](https://github.com/fsspec/kerchunk), +which extends the capability of *Zarr* by allowing it to read HDF5 files. +*Zarrtraj* leverages *Zarr*'s ability to read a slice of a file and to read a +file in parallel, and it implements the standard MDAnalysis trajectory reader +API, which taken together make it compatible with analysis algorithms that use +the "split-apply-combine" parallelization strategy [@SplitApplyCombine:2011]. +In addition to the H5MD format, *Zarrtraj* can stream and write trajectories in +the experimental ZarrMD format, which ports the H5MD layout to the *Zarr* +file type. + +This work builds on the existing MDAnalysis `H5MDReader` +[@H5MDReader:2021], and uses *NumPy* [@NumPy:2020] as a common interface between MDAnalysis +and the file storage medium. *Zarrtraj* was inspired by, and made possible by, similar efforts in the +geosciences community to align data practices with FAIR principles [@PANGEO:2022]. + +With *Zarrtraj*, we envision research groups making their data publicly available +via a cloud URL so that anyone can reuse their trajectories and reproduce their results. +Large databases, like MDDB and MDverse, can expose a URL associated with each +trajectory in their databases so that users can make a query and immediately use the resulting +trajectories to run an analysis on the hits that match their search. Groups seeking to +collect a large volume of trajectory data to train machine learning models [@MLMDMethods:2023] can make use +of our tool to efficiently and inexpensively obtain the data they need from these published +URLs. 
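The "split-apply-combine" strategy referred to above can be sketched in a few lines. This is only an illustration under stated assumptions, not the *Zarrtraj* or MDAnalysis implementation: an in-memory *NumPy* array of random coordinates stands in for a sliceable remote trajectory, a thread pool stands in for a real parallel backend, and the function names `split_frames`, `radius_of_gyration_block`, and `split_apply_combine` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def split_frames(n_frames, n_blocks):
    """Split: partition frame indices 0..n_frames-1 into contiguous blocks."""
    return np.array_split(np.arange(n_frames), n_blocks)


def radius_of_gyration_block(positions, frame_indices):
    """Apply: per-frame radius of gyration (equal masses) for one block."""
    block = positions[frame_indices]  # shape (n_block_frames, n_atoms, 3)
    centered = block - block.mean(axis=1, keepdims=True)
    return np.sqrt((centered ** 2).sum(axis=2).mean(axis=1))


def split_apply_combine(positions, n_blocks=4):
    """Combine: concatenate per-block results back into frame order."""
    blocks = split_frames(positions.shape[0], n_blocks)
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        # pool.map returns results in the same order as the input blocks
        results = list(
            pool.map(lambda idx: radius_of_gyration_block(positions, idx), blocks)
        )
    return np.concatenate(results)


# Stand-in data (an assumption): 100 frames of 50 particles in 3D.
rng = np.random.default_rng(0)
fake_trajectory = rng.normal(size=(100, 50, 3))

rg_parallel = split_apply_combine(fake_trajectory, n_blocks=4)
rg_serial = radius_of_gyration_block(fake_trajectory, np.arange(100))
```

Because each block only needs its own slice of frames, the same pattern applies whenever the underlying storage can serve independent slices, which is the property the text attributes to *Zarr*-backed trajectories.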
+ +# Features and Benchmarks + +Once imported, *Zarrtraj* allows passing trajectory URLs just like ordinary files: +```python +import zarrtraj +import MDAnalysis as mda + +u = mda.Universe("topology.pdb", "s3://sample-bucket-name/trajectory.h5md") +``` + +Initial benchmarks show that *Zarrtraj* can iterate serially +through an AWS S3 cloud trajectory (loading one frame at a time into memory) +at roughly 1/2 to 1/3 the speed it can iterate through the same trajectory from disk and roughly +1/5 to 1/10 the speed it can iterate through the same trajectory on disk in XTC format (\autoref{fig:benchmark}). +However, this speed is influenced by network bandwidth, and +writing parallelized algorithms can offset this loss of speed, as in \autoref{fig:RMSD}. + +![Benchmarks performed on a machine with 2 Intel Xeon 2.00GHz CPUs, 32GB of RAM, and an SSD configured with RAID 0. The trajectory used for benchmarking was the YiiP trajectory from MDAnalysisData [@YiiP:2019], a 9000-frame (90ns), 111,815 particle simulation of a membrane-protein system. The original 3.47GB XTC trajectory was converted into an uncompressed 11.3GB H5MD trajectory and an uncompressed 11.3GB ZarrMD trajectory using the MDAnalysis `H5MDWriter` and *Zarrtraj* `ZarrMD` writers, respectively. The XTC trajectory was read using the MDAnalysis `XTCReader` for comparison. \label{fig:benchmark}](benchmark.png) + +![RMSD benchmarks performed on the same machine as \autoref{fig:benchmark}. The YiiP trajectory was aligned to the first frame as reference using `MDAnalysis.analysis.align.AlignTraj` and converted to compressed, quantized H5MD (7.8GB) and ZarrMD (4.9GB) trajectories. The RMSD calculation was performed using the development branch of MDAnalysis (2.8.0dev) with the "serial" and "dask" backends. See [this notebook](https://github.com/Becksteinlab/zarrtraj/blob/d4ab7710ec63813750d7224fe09bf5843e513570/joss_paper/figure_2.ipynb) for the full benchmark code. 
\label{fig:RMSD}](RMSD.png) + +*Zarrtraj* is capable of making use of *Zarr*'s powerful compression and quantization when writing ZarrMD trajectories. +The MDAnalysisData YiiP trajectory in ZarrMD format is reduced from 11.3GB uncompressed +to just 4.9GB after compression with the Zstandard algorithm [@Zstandard:2021] +and quantization to 3 digits of precision. See [performance considerations](https://zarrtraj.readthedocs.io/en/latest/performance_considerations.html) +for more details. + +# Example + +The YiiP membrane protein trajectory [@YiiP:2019] used for benchmarking in this +paper is publicly available for streaming from the Google Cloud Bucket +*gcs://zarrtraj-test-data/yiip.zarrmd*. The topology file in PDB format, which contains +information about the chemical composition of the system, can also be accessed +remotely from the same bucket (*gcs://zarrtraj-test-data/YiiP_system.pdb*) using +[fsspec](https://filesystem-spec.readthedocs.io/en/latest/), although this is +currently an experimental feature and details may change. + +In the following example (see also the [YiiP Example in the zarrtraj +docs](https://zarrtraj.readthedocs.io/en/latest/yiip_example.html)), we access +the topology file and the trajectory from the *gcs://zarrtraj-test-data* cloud +bucket. We initially create an `MDAnalysis.Universe`, the basic object in +MDAnalysis that ties static topology data and dynamic trajectory data together +and manages access to all data. 
We iterate through a slice of the trajectory, +starting from frame index 100 and skipping forward in steps of 20 frames: + +```python +import zarrtraj +import MDAnalysis as mda +import fsspec + +with fsspec.open("gcs://zarrtraj-test-data/YiiP_system.pdb", "r") as top: + u = mda.Universe(top, "gcs://zarrtraj-test-data/yiip.zarrmd", + topology_format="PDB") + + for timestep in u.trajectory[100::20]: + print(timestep) +``` + +Inside the loop over trajectory frames we print information for the current +frame `timestep`, although in principle any kind of analysis code can run here and +process the coordinates available in `u.atoms.positions`. + +The `Universe` object can be used as if the underlying trajectory file were a +local file. For example, we can use `u` from the preceding example with one of +the standard analysis tools in MDAnalysis, the calculation of the root mean +square distance (RMSD) after optimal structural superposition [@Liu:2010] in +the `MDAnalysis.analysis.rms.RMSD` class. In the example below we select only the +C$_\alpha$ atoms of the protein with an MDAnalysis selection. We run the +analysis with the `.run()` method while stepping through the trajectory at +increments of 100 frames. 
We then print the first and last data points from the +results array: + +```python +>>> import MDAnalysis.analysis.rms +>>> R = MDAnalysis.analysis.rms.RMSD(u, select="protein and name CA").run( + step=100, verbose=True) +100%|██████████████████████████████████████████| 91/91 [00:28<00:00, 3.21it/s] +>>> print(f"Initial RMSD (frame={R.results.rmsd[0, 0]:g}): " + f"{R.results.rmsd[0, 2]:.3f} Å") +Initial RMSD (frame=0): 0.000 Å +>>> print(f"Final RMSD (frame={R.results.rmsd[-1, 0]:g}): " + f"{R.results.rmsd[-1, 2]:.3f} Å") +Final RMSD (frame=9000): 2.373 Å +``` + +This example demonstrates that the *Zarrtraj* interface enables seamless use of +cloud-hosted trajectories with the standard tools that are either available +with MDAnalysis itself, through MDAKits [@MDAKits:2023] (see the [MDAKit +registry](https://mdakits.mdanalysis.org/mdakits.html) for available packages), +or any script or package that uses MDAnalysis for file I/O. + + +# Acknowledgements + +We thank Dr. Jenna Swarthout Goddard for supporting the GSoC program at MDAnalysis and +Dr. Martin Durant, author of Kerchunk, for helping refine and merge features in his upstream code base +necessary for this project. LW was a participant in the Google Summer of Code 2024 program. +Some work on *Zarrtraj* was supported by the National Science Foundation under grant number 2311372. 
+ +# References diff --git a/pyproject.toml b/pyproject.toml index 3996118..c3e52c4 100755 --- a/pyproject.toml +++ b/pyproject.toml @@ -106,3 +106,21 @@ line_length = 80 COLUMN_LIMIT = 80 INDENT_WIDTH = 4 USE_TABS = false + +classifiers = [ + 'Development Status :: 4 - Beta', + 'Intended Audience :: Science/Research', + 'License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)', + 'Operating System :: POSIX', + 'Operating System :: MacOS :: MacOS X', + 'Operating System :: Microsoft :: Windows', + 'Programming Language :: Python', + 'Programming Language :: Python :: 3.10', + 'Programming Language :: Python :: 3.11', + 'Programming Language :: Python :: 3.12', + 'Programming Language :: Python :: 3.13', + 'Topic :: Scientific/Engineering', + 'Topic :: Scientific/Engineering :: Bio-Informatics', + 'Topic :: Scientific/Engineering :: Chemistry', + 'Topic :: Software Development :: Libraries :: Python Modules', +] \ No newline at end of file