Recommendation for packaging a package using standalone CUDA programs and large datafiles #417

tloredo · 2023-05-15T13:48:07Z

Hello, Mesoners-

I'm creating a Python package for a collection of software created by a colleague and his student comprising a few standalone C/CUDA programs (compiled with simple nvcc commands) that read and write data to local files, and Python drivers. The Python code is pure Python (not extensions) that runs the codes, reads the output, and provides the results to the user via NumPy arrays. The programs will not be used outside of the package. They use a large set of fixed data files.

First, how would you recommend handling the standalone programs? Can I build them via nvcc using the installer? And once built, where should they be put? I suppose they could be handled as Python scripts and installed somewhere in PATH. But since they are not intended for use outside of the package, that doesn't seem appropriate.

Second, how would you recommend handling the large data files (~100 MB)? My current plan is to mimic how some AstroPy packages handle such cases: The data are not included in the project's repo, but rather stored in a permanent, citable location (e.g., Dataverse or Zenodo). The package provides a function that fetches the data and stores it in a user-designated location (by default, a hidden directory in the home directory). A user-specific configuration file identifies the data location for subsequent runs. A good example is the dustmaps package (gregreen/dustmaps: "A uniform interface for a number of 2D and 3D maps of interstellar dust reddening/extinction"). The data installation functions can optionally be executed via setup.py; maybe Meson would use a different approach.
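
To make the configuration-file idea concrete, here is a minimal sketch using only the standard library. The file name `~/.mypkgrc`, the default directory `~/.mypkg`, and the `[data] dir` key are hypothetical placeholders, not names taken from dustmaps or any other package mentioned here.

```python
import configparser
from pathlib import Path

# Hypothetical locations; a real package would choose its own names.
DEFAULT_CONFIG = Path.home() / ".mypkgrc"
DEFAULT_DATA_DIR = Path.home() / ".mypkg"


def set_data_dir(data_dir, config_path=DEFAULT_CONFIG):
    """Record the user's chosen data directory in the config file."""
    # interpolation=None so that a '%' in a path is stored literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)  # a missing file is silently ignored
    if not parser.has_section("data"):
        parser.add_section("data")
    parser.set("data", "dir", str(Path(data_dir).expanduser()))
    with open(config_path, "w") as f:
        parser.write(f)


def get_data_dir(config_path=DEFAULT_CONFIG):
    """Return the configured data directory, or the hidden default."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)
    return Path(parser.get("data", "dir", fallback=str(DEFAULT_DATA_DIR)))
```

A fetch function would call `get_data_dir()` to decide where to store files, and a user would call `set_data_dir("/some/big/disk")` once to override the default.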

Any advice would be appreciated.

-Tom

rgommers · 2023-05-15T16:58:57Z

Maintainer

Hi @tloredo, thanks for your question and interest.

To answer this one first:

> My current plan is to mimic how some AstroPy packages handle such cases: The data are not included in the project's repo, but rather stored in a permanent, citable location [...]

This seems very reasonable. I'll note that scikit-learn, scikit-image and SciPy all have data loaders that work along similar lines. I believe scikit-learn has custom code for data set downloading, while scikit-image and SciPy both use https://github.com/fatiando/pooch as an optional dependency.

If the data is optional, I'd not add the option to do the data retrieval in the package build files. Rather, just let the user do import mypkg; mypkg.download_all_data() afterwards. Or provide a separate script if the data is needed outside of an installed package.
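
A rough sketch of what a `download_all_data()` function might look like, written with only the standard library so it runs without extra dependencies. pooch offers this kind of download, caching, and hash checking as a maintained library and would likely replace most of it. The `REGISTRY` contents, the `MYPKG_DATA` environment variable, and the `fetch` helper are all hypothetical.

```python
import hashlib
import os
import shutil
import urllib.request
from pathlib import Path

# Hypothetical registry: file name -> (URL, SHA-256 hex digest).
# A real package would list its archived files (e.g., on Zenodo) here.
REGISTRY = {}


def default_data_dir():
    """Hidden directory in the user's home, unless MYPKG_DATA overrides it."""
    return Path(os.environ.get("MYPKG_DATA", Path.home() / ".mypkg"))


def sha256sum(path):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(name, url, sha256, dest=None):
    """Download one file unless a verified copy already exists."""
    dest = Path(dest) if dest is not None else default_data_dir()
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / name
    if target.exists() and sha256sum(target) == sha256:
        return target  # already downloaded and intact
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    if sha256sum(partial) != sha256:
        partial.unlink()
        raise ValueError(f"checksum mismatch for {name}")
    partial.replace(target)  # only verified files get the final name
    return target


def download_all_data(registry=REGISTRY, dest=None):
    """Fetch every file in the registry and return the local paths."""
    return [fetch(name, url, digest, dest)
            for name, (url, digest) in registry.items()]
```

Downloading to a `.part` file and renaming it only after the checksum passes means an interrupted download never leaves a file that looks complete.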

If the data is non-optional, then I'd use Git LFS or similar and not provide a download function.

So either way: not as part of the Meson (or setup.py) build. If you do want that after all, you'd expose it as a build option (i.e., a CLI flag, as in https://mesonbuild.com/Build-options.html). In your meson.build file you can then use, for example, run_command to execute the script/code that downloads the data.

tloredo May 16, 2023
Author

Thank you, Ralf (@rgommers), for the very helpful tips and suggestions. I'm trying out a new copier template for this project, and the template developers also suggested investigating pooch for the data downloads. Though Git LFS does seem worth considering as well. The project has two packages (one relying on the other), both with large data files, and while most users will probably want to use the provided data, for at least one of the packages, the provided data may be optionally replaced with custom data, so LFS may not make sense for that one.

After posting, it occurred to me that the standalone programs are basically playing the role of dylibs here, so it might make sense to keep them in the package tree, which seems to be what you're suggesting.

> Distributing that via PyPI in a portable way with multi-platform support to third parties is perhaps more involved, but it sounds like this is only for internal usage, right?

I'm doing this mainly to simplify use of this other team's code by my team (we'd be installing on just a few machines). However, it will have a broader (but small) audience (astronomers working in a rather specific area—perhaps dozens of users). It would be helpful if it could be published to PyPI (source only). But I don't know if PyPI plays well with Git LFS. Perhaps pip installing via GitHub or a local clone or tarball is best.

In any case, you gave me enough help to get going—much appreciated!

rgommers May 16, 2023
Maintainer

> it occurred to me that the standalone programs are basically playing the role of dylibs here, so it might make sense to keep them in the package tree, which seems to be what you're suggesting.

Yes indeed, they're kinda like Python extension modules conceptually, if it's only used from your Python code.

> It would be helpful if it could be published to PyPI (source only). But I don't know if PyPI plays well with Git LFS. Perhaps pip installing via GitHub or a local clone or tarball is best.

Oh, source-only would be okay, it's wheels where you'd be in for a lot of pain. There is a problem though with the size of the sdist you'd upload, because indeed there is no Git involved anymore so unless you'd use something like pooch you'd have to include the data files. And that would inflate your sdist size to over the default limit for PyPI, which means you have to go ask for an exception, which can take a long time to be approved.

eli-schwartz May 16, 2023

Meson has builtin language support for cuda, so you could just use:

```meson
executable(
    'myprog',
    'myprog.cu',
    install: true,
    install_dir: py.get_install_dir() / 'pkgname' / 'desired-subdir'
)
```
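
On the Python side, a driver could then locate the installed program relative to the package directory and run it with `subprocess`. This is only a sketch under the assumption that the binary ends up at `pkgname/desired-subdir/myprog` as in the snippet above; the helper names `program_path` and `run_program` are hypothetical, and on Windows the file name would also need an `.exe` suffix.

```python
import subprocess
from pathlib import Path


def program_path(package_dir, name="myprog", subdir="desired-subdir"):
    """Locate a bundled executable inside the installed package tree."""
    path = Path(package_dir) / subdir / name
    if not path.is_file():
        raise FileNotFoundError(f"bundled program not found: {path}")
    return path


def run_program(package_dir, args=(), workdir=None,
                name="myprog", subdir="desired-subdir"):
    """Run the bundled program and return its standard output as text.

    The program reads and writes local files, so `workdir` sets the
    directory it runs in. A non-zero exit status raises CalledProcessError.
    """
    exe = program_path(package_dir, name, subdir)
    result = subprocess.run(
        [str(exe), *map(str, args)],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Inside the package itself, `package_dir` would typically be `Path(__file__).parent`, so the driver works wherever the package is installed without touching `PATH`.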

tloredo May 16, 2023
Author

Thanks, @eli-schwartz; I wasn't aware of that, and I appreciate the concrete example! —Tom

tloredo May 17, 2023
Author

> so unless you'd use something like pooch you'd have to include the data files. And that would inflate your sdist size to over the default limit for PyPI, which means you have to go ask for an exception, which can take a long time to be approved.

Thanks for that clarification—now the decision is more or less made for me to go with a data download approach. This is all really helpful input for a Meson (and PyPI) newbie!
