- Developer cheat sheet
- What contributions are most suitable for
datasalad
- Code organization
- Style guide
Hatch is used as a convenience solution for packaging and development tasks.
Hatch takes care of managing dependencies and environments, including the Python interpreter itself.
If not installed yet, installing via pipx is recommended (pipx install hatch
).
Below is a list of some provided convenience commands.
An accurate overview of provided convenience scripts can be obtained by running: hatch env show
.
All command setup can be found in pyproject.toml
, and given alternatively managed dependencies, all commands can also be used without hatch
.
hatch test [--cover]
There is also a setup for matrix test runs, covering all current Python versions:
hatch run tests:run [<select tests>]
This can also be used to run tests for a specific Python version only:
hatch run tests.py3.10:run [<select tests>]
hatch run docs:build
# clean with
hatch run docs:clean
hatch run types:check
Check commit messages for compliance with Conventional Commits
hatch run cz:check-commits
hatch run cz:show-changelog
hatch run cz:bump-version
The new version is determined automatically from the nature of the (conventional) commits made since the last release. A changelog is generated and committed.
In cases where the generated changelog needs to be edited afterwards (typos, unnecessary complexity, etc.), the created version tag needs to be advanced.
hatch build
hatch publish
This project's main purpose is to be a core library for DataLad. But it is just a library, not a component in the DataLad system. This means that library components cannot rely on a particular "DataLad setup". Neither the presence of DataLad datasets (rather than any Git/git-annex repository), nor a particular DataLad-enforced configuration, or dependencies, or runtime behavior can be assumed. If such assumptions are made, the respective code is likely not a suitable contribution.
It is a supported scenario to use this library outside the context of a DataLad package, and contributions must be evaluated with respect to implied addition of further software dependencies. Code that implements support for a particular (web)service and/or relies on a special-purpose support package is likely better contributed elsewhere (see below).
Code "for working with data" that has wide applicability, some previous real-world exposure, good tests and a light dependency footprint is a candidate for inclusion into the library.
Special interest, highly domain-specific functionality is likely better suited for a topical library of DataLad extension package.
Generic functionality for the DataLad system is best directed to the datalad
core package.
If in doubt, it is advisable to file an issue and ask for feedback before preparing a contribution.
In datasalad
, all code is organized in shallow sub-packages. Each sub-package is located in a directory within the datasalad
package.
Consequently, there are no top-level source files other than a few exceptions for technical reasons (__init__.py
, conftest.py
).
A sub-package contains any number of code files, and a tests
directory with all test implementations for that particular sub-package, and only for that sub-package. Other, deeper directory hierarchies are not to be expected.
There is no limit to the number of files. Contributors should strive for files with less than 500 lines of code.
Within a sub-package, code should generally use relative imports. The corresponding tests should also import the tested code via relative imports.
Code users should be able to import the most relevant functionality from the sub-package's __init__.py
. Only items importable from the sub-package's top-level are considered to be part of its "public" API. If a sub-module is imported in the sub-package's __init__.py
, consider adding __all__
to the sub-module to restrict wildcard imports from the sub-module, and to document what is considered to be part of the "public" API.
Sub-packages should be as self-contained as possible. This means that any organization principles like all-exceptions-go-into-a-single-location-in-datasalad do not immediately/necessarily apply. For example, each sub-package should define its special-purpose exceptions separately from others. When functionality is shared between sub-packages, absolute imports should be made.
A contribution must be complete with code, tests, and documentation.
datasalad
is a core library for a software ecosystem. Therefore, tests are essential. A high test-coverage is desirable. Contributors should aim for near-complete coverage (or better). Tests must be dedicated for the code of a particular contribution. It is not sufficient, if other code happens to also exercise a new feature.
Docstrings should be complete with information on parameters, return values, and exception behavior. Documentation should be added to and rendered with the sphinx-based documentation.
Commits and commit messages must be Conventional Commits. Their compliance is checked for each pull request. The following commit types are recognized:
feat
: introduces a new featurefix
: address a problem, fix a bugdoc
: update the documentationrf
: refactor code with no change of functionalityperf
: enhance performance of existing functionalitytest
: add/update/modify test implementationsci
: change CI setupstyle
: beautificationchore
: results of routine tasks, such as changelog updatesrevert
: revert a previous changebump
: version update
Any breaking change must have at least one line of the format
BREAKING CHANGE: <summary of the breakage>
in the body of the commit message that introduces the breakage. Breaking changes can be introduced in any type of commit. Any number of breaking changes can be described in a commit message (one per line). Breaking changes trigger a major version update, and form a dedicated section in the changelog.
Contributions submitted via a pull-request (PR), are expected to be a clear, self-describing series of commits. The PR description is largely irrelevant, and could be used for a TODO list, or conversational content. All essential information concerning the code and changes must be contained in the commit messages.
Commit series should be "linear", with individual commits being self-contained, well-described changes.
If possible, only loosely related changes should be submitted in separate PRs to simplify reviewing and shorten time-to-merge.
Long-standing, inactive PRs (draft mode or not) are frowned upon, as they drain attention. It is often better to close a PR and open a new one, once work resumes. Maintainers may close inactive PRs for this reason at any time.