dantro, from data and dentro (Greek for tree), is a Python package that provides a uniform interface for hierarchically structured and semantically heterogeneous data.
It is built around three main features:
- data handling: loading heterogeneous data into a tree-like data structure and providing a uniform interface for it
- data transformation: performing arbitrary operations on the data, if necessary using lazy evaluation
- data visualization: creating a visual representation of the processed data
Together, these stages constitute a data processing pipeline: an automated sequence of predefined, configurable operations. Akin to a Continuous Integration pipeline, a data processing pipeline provides a uniform, consistent, and easily extensible infrastructure that contributes to more efficient and reproducible workflows. This can be beneficial especially in a scientific context, for instance when handling data that was generated by computer simulations.
dantro is meant to be integrated into projects and to be used to set up such a data processing pipeline.
It is designed to be easily customizable to the requirements of the project it is integrated into, even if the involved data is hierarchically structured or semantically heterogeneous.
Furthermore, it allows a configuration-based specification of all operations via YAML configuration files; the resulting pipeline can then be controlled entirely via these configuration files and without requiring code changes.
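To illustrate the idea of a configuration-controlled pipeline, here is a minimal, self-contained sketch in plain Python. It does not use dantro's API: all names (load_data, select, OPERATIONS, run_pipeline, and the keys in CONFIG) are hypothetical and chosen only for this example, and the configuration is written as a Python dictionary standing in for what would be read from a YAML file.

```python
from typing import Any, Callable, Dict, List

# NOTE: Illustrative sketch only. None of these names belong to dantro.


def load_data() -> Dict[str, Any]:
    """Stage 1 (data handling): build a small tree-like data structure."""
    return {
        "simulation": {
            "temperatures": [20.5, 21.0, 19.5],
            "meta": {"unit": "degC"},
        }
    }


def select(tree: Dict[str, Any], path: str) -> Any:
    """Walk the tree along a slash-separated path and return that node."""
    node = tree
    for key in path.split("/"):
        node = node[key]
    return node


# Registry of named operations that a configuration may refer to
OPERATIONS: Dict[str, Callable[..., Any]] = {
    "mean": lambda xs: sum(xs) / len(xs),
    "scale": lambda xs, factor: [x * factor for x in xs],
}


def transform(data: Any, steps: List[Dict[str, Any]]) -> Any:
    """Stage 2 (data transformation): apply the configured steps in order."""
    for step in steps:
        func = OPERATIONS[step["operation"]]
        data = func(data, **step.get("kwargs", {}))
    return data


def visualize(value: float, label: str) -> str:
    """Stage 3 (data visualization): here simply a formatted text line."""
    return f"{label}: {value:.2f}"


# Stand-in for the content of a YAML configuration file
CONFIG = {
    "select": "simulation/temperatures",
    "transform": [
        {"operation": "scale", "kwargs": {"factor": 2.0}},
        {"operation": "mean"},
    ],
    "plot": {"label": "mean scaled temperature"},
}


def run_pipeline(config: Dict[str, Any]) -> str:
    """Run all three stages as specified by the configuration."""
    tree = load_data()
    data = select(tree, config["select"])
    result = transform(data, config["transform"])
    return visualize(result, **config["plot"])


if __name__ == "__main__":
    print(run_pipeline(CONFIG))
```

Changing which data is selected, which operations are applied, or how the result is labelled only requires editing CONFIG, not the functions themselves.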
The dantro package is open source software released under the LGPLv3+ license (see copyright notice below).
It was developed alongside the Utopia project, but is an independent package.
We describe the motivation and scope of dantro in more detail in this publication in the Journal of Open Source Software.
For more information on the package, its features, philosophy, and integration, please visit its documentation at dantro.readthedocs.io.
If you encounter any issues with dantro or have suggestions or questions of any kind, please open an issue via the project page.
The dantro package is available on the Python Package Index and via conda-forge.
If you are unsure which installation method works best for you, we recommend using conda.
Note that, in order to make full use of dantro's features, it is meant to be integrated into your project and customized to its needs.
Basic usage examples and an integration guide can be found in the package documentation.
Installation via conda
As a first step, install Anaconda or Miniconda, if you have not already done so. You can then use the following command to install dantro and its dependencies:
$ conda install -c conda-forge dantro
Installation via pip
If you already have a Python installation on your system, you probably already have pip installed as well.
To install dantro and its dependencies, invoke the following command:
$ pip install dantro
In case the pip command is not available, follow these instructions to install it or switch to the conda-based installation.
Note that if you have both Python 2 and Python 3 installed, you might have to use the pip3 command instead.
dantro is implemented and tested for Python >= 3.8 and depends on the following packages:
Package Name | Minimum Version | Purpose |
---|---|---|
numpy | 1.21 | |
xarray | 0.16.2 | For labelled N-dimensional arrays |
dask | 2.10 | To work with large data |
toolz | 0.10 | For dask.delayed |
distributed | 2.10 | For distributed computing |
scipy | 1.7.3 | As engine for NetCDF files |
sympy | 1.7 | For symbolic math operations |
h5py | 3.6 | For reading HDF5 datasets |
matplotlib | 3.3 | For data visualization |
seaborn | 0.11 | For advanced data visualization |
networkx | 2.6 | For network visualization |
ruamel.yaml | 0.16.12 | For parsing YAML configuration files |
dill | 0.3.3 | For advanced pickling |
paramspace | 2.5.6 | For dictionary- or YAML-based parameter spaces |
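To check whether an existing environment satisfies these minimum versions, a small script along the following lines can help. This is a sketch using only the standard library (importlib.metadata, available from Python 3.8); the helper names are hypothetical, only a subset of the table is included, and the version parsing is deliberately simplified: it compares leading numeric components and ignores pre-release or local version suffixes.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Subset of the minimum versions listed in the table above
MINIMUM_VERSIONS: Dict[str, str] = {
    "numpy": "1.21",
    "xarray": "0.16.2",
    "h5py": "3.6",
    "matplotlib": "3.3",
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '1.21.0rc1' -> (1, 21, 0).

    Simplified on purpose: parsing stops at the first component that does
    not start with a digit, and suffixes such as 'rc1' are dropped.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings after padding them to equal length."""
    have, need = parse_version(installed), parse_version(minimum)
    length = max(len(have), len(need))
    have += (0,) * (length - len(have))
    need += (0,) * (length - len(need))
    return have >= need


def check_installed(package: str, minimum: str) -> Optional[bool]:
    """Return True/False for an installed package, None if it is missing."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return meets_minimum(installed, minimum)


if __name__ == "__main__":
    for name, minimum in MINIMUM_VERSIONS.items():
        status = check_installed(name, minimum)
        label = {True: "ok", False: "too old", None: "not installed"}[status]
        print(f"{name:12s} >= {minimum:8s} {label}")
```

When installing via pip or conda, these constraints are resolved automatically; such a check is mainly useful for inspecting a manually assembled environment.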
To install versions that are not on PyPI, pip allows specifying a URL to a git repository:
$ pip install git+<clone-link>@<some-branch-name>
Here, replace clone-link with the clone URL of this project and some-branch-name with the name of the branch that you want to install the package from (see the pip documentation for details).
Alternatively, omit the @ and everything after it to install from the repository's default branch.
If you do not have SSH keys available, use the HTTPS link.
If you would like to contribute to dantro (yeah!), you should clone the repository to a local directory:
$ git clone <clone-link>
For development purposes, it makes sense to work in a specific virtual environment for dantro and install dantro in editable mode:
$ python3 -m venv ~/.virtualenvs/dantro
$ source ~/.virtualenvs/dantro/bin/activate
(dantro) $ pip install -e ./dantro
For development purposes, the following additional packages are required.
Package Name | Minimum Version | Purpose |
---|---|---|
pytest | 3.4 | Testing framework |
pytest-cov | 2.5 | Coverage report |
tox | 3.1 | Test environments |
Sphinx | 4.* | Documentation generator |
sphinx-book-theme | 0.2.* | Modern sphinx theme |
pre-commit | 2.15 | For commit hooks |
black | 22.3.0 | For code formatting |
To install these development-related dependencies, enter the virtual environment, navigate to the cloned repository, and perform the installation using:
(dantro) $ cd dantro
(dantro) $ pip install -e .[dev]
Once these dependencies are installed, make sure to set up the git hook that allows pre-commit to run before making a commit:
(dantro) $ pre-commit install
The corresponding dependencies needed for the hooks will be installed automatically upon a first commit. For more information on commit hooks, see the commit hooks section below.
To assert correct functionality, tests are written alongside all features.
The pytest and tox packages are used as testing frameworks.
All tests are carried out for Python 3.7 through 3.10 using the GitLab CI/CD and the newest versions of all dependencies.
When merging to the master branch, dantro is additionally tested against the specified minimum versions.
Test coverage and pipeline status can be seen on the project page.
To run all defined tests, call:
(dantro) $ python -m pytest -v tests/ --cov=dantro --cov-report=term-missing
This also provides a coverage report, showing the lines that are not covered by the tests.
Alternatively, with tox, it is possible to select different Python environments for testing.
Provided that the corresponding interpreter is available, the tests for a specific environment can be carried out with the following command:
(dantro) $ tox -e py37
To build dantro's documentation locally via Sphinx, install the required dependencies and invoke the make doc command:
(dantro) $ cd doc
(dantro) $ make doc
You can then view the documentation by opening the doc/_build/html/index.html file.
Note: Sphinx is configured such that warnings will be regarded as errors, making detection of markup mistakes easier.
You can inspect the error logs gathered in the doc/build_errors.log file.
For Python-related Sphinx referencing errors, see the doc/.nitpick-ignore file for exceptions.
When developing dantro and pushing to the feature branch, the build:doc job of the CI pipeline additionally creates a documentation preview.
The result can either be downloaded from the job artifacts or viewed in the deployed GitLab environment.
Upon warnings or errors in the build, the job will exit with an orange warning sign.
You can inspect the build_errors.log file via the exposed CI artifacts.
To streamline dantro development, a number of automations are used which take care of code formatting and perform some basic checks.
These automations are managed by pre-commit and are run when invoking git commit (hence the name).
If these so-called hooks detect a problem, they will display an error and you will not be able to commit just yet.
Some of the hooks automatically fix the error (e.g. by removing whitespace), others require some manual action on your part.
Either way, you will have to stage these changes manually (using git add, as usual).
To check which changes were made by the hooks, use git diff.
Once you have applied the requested changes, invoke git commit anew.
This will again trigger the hooks but, with all issues resolved, the hooks should now all pass and lead you to the usual commit message prompt.
The most notable hooks are:
Both isort and black are configured in the pyproject.toml file.
For the other hooks' configuration, see .pre-commit-config.yaml.
All hooks are also being run in the GitLab CI/CD check:hooks job.
If you have trouble setting up the hooks or if they create erroneous results, please let us know.
If you use a zsh terminal (default for macOS users since Catalina) and try to install extra requirements like the test and/or documentation dependencies, you will probably get an error similar to zsh: no matches found: .[test_deps].
This can be fixed by escaping the square brackets, i.e. writing .\[test_deps\] or .\[doc_deps\].
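The error arises because zsh treats the square brackets as a glob pattern before pip ever sees the argument. If you script the installation from Python, one way to sidestep shell globbing altogether is to pass the arguments as a list, so that no shell interprets them. The following is a minimal sketch; the helper names are hypothetical, and the dev extra is the one used in the development installation command above.

```python
import subprocess
import sys
from typing import List


def pip_install_command(path: str = ".", extras: str = "dev") -> List[str]:
    """Build an editable-install command as an argument list.

    Because the arguments are passed as a list (no shell involved), the
    square brackets reach pip unchanged and need no escaping.
    """
    return [sys.executable, "-m", "pip", "install", "-e", f"{path}[{extras}]"]


def install(path: str = ".", extras: str = "dev") -> None:
    """Run the installation; raises CalledProcessError if pip fails."""
    subprocess.run(pip_install_command(path, extras), check=True)


if __name__ == "__main__":
    # Only show the command here; call install() to actually run it.
    print(" ".join(pip_install_command()))
```

On the command line itself, escaping or quoting the argument remains the simplest fix.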
dantro is licensed under the GNU Lesser General Public License Version 3 or any later version.
dantro -- a python package for handling and plotting hierarchical data
Copyright (C) 2018 – 2022 dantro developers
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
A copy of the GNU General Public License Version 3, and the GNU Lesser General Public License Version 3 extending it, is distributed with the source code of this program; see COPYING and COPYING.LESSER, respectively.
The copyright holders of dantro are collectively referred to as dantro developers in the respective copyright notices and disclaimers.
dantro has been developed by (in alphabetical order):
- Unai Fischer Abaigar
- Benjamin Herdeanu
- Daniel Lake
- Yunus Sevinchan
- Jeremias Traub
- Julian Weninger
Contact the developers via: dantro-dev@iup.uni-heidelberg.de