how to handle the large number of QA files for public release of the fastspecfit VACs #106
Comments
FYI, it is possible to embed PNG data directly into HTML without requiring a separate PNG file to exist on disk. So, e.g., you could keep the PNG data as N>>1 blobs in a format that is optimized for random access (e.g. HDF5) and then embed it into HTML dynamically generated by fastspecfit.desi.lbl.gov. Or store N>>1 HTML "files" as blobs in that format, if even the HTML part can be pre-generated.
Alternatively, IIRC @dstndstn had a clever trick for creating a disk image that appears as a single file to NERSC but could be mounted by a docker instance to see the N>>1 files within that disk image.
Completely separate from fastspecfit itself, it would be useful to work out an example recipe for the generic problem of how to serve O(millions) of pre-generated "files" without actually generating millions of files on disk. That is, generating them from scratch is too slow to do on the fly, but there are too many of them to keep on disk, so what's the most efficient way to cache and serve them for random access?
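To make the PNG-embedding idea above concrete, here is a minimal sketch using only the Python standard library. It is not the fastspecfit implementation: the flat packed buffer with an offset index is a stand-in for a random-access container such as HDF5, and the names `pack_blobs`, `read_blob`, and `png_to_img_tag`, the keys, and the fake PNG payloads are all hypothetical.

```python
import base64
import html
import io

def pack_blobs(blobs):
    """Concatenate {key: bytes} into one buffer and return (buffer, index).

    The index maps each key to (offset, size), so a single blob can be
    read back later without touching the others.
    """
    buf = io.BytesIO()
    index = {}
    for key, data in blobs.items():
        index[key] = (buf.tell(), len(data))
        buf.write(data)
    return buf.getvalue(), index

def read_blob(fileobj, index, key):
    """Random-access read of one blob from a seekable file object."""
    offset, size = index[key]
    fileobj.seek(offset)
    return fileobj.read(size)

def png_to_img_tag(png_bytes, alt=""):
    """Inline PNG bytes into an <img> tag as a base64 data URI."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f'<img alt="{html.escape(alt)}" src="data:image/png;base64,{b64}">'

# Made-up payloads: the PNG signature followed by placeholder bytes.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
blobs = {
    "target-1": PNG_MAGIC + b"fake-1",
    "target-2": PNG_MAGIC + b"fake-2",
}
packed, index = pack_blobs(blobs)
store = io.BytesIO(packed)  # in practice this would be one large file on disk
tag = png_to_img_tag(read_blob(store, index, "target-2"), alt="QA for target-2")
```

The web app would only need the index in memory (or in a small database) and one seek plus one read per request. Base64 inflates the payload by about a third, which is the cost of not serving the PNG as a separate file.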
For those of you with access to the NERSC users Slack space, there's a discussion here which @dstndstn initiated.
I was also reading about GNU tar's "--seek" option (assume the archive is seekable), which is supposed to allow faster extractions. It doesn't work on compressed tarballs. Maybe worth checking out, though tar is supposed to auto-detect seekability.
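A related approach, sketched here with Python's `tarfile` module rather than the tar command line, is to scan an uncompressed tarball once, record where each member's data starts, and then serve individual members with a seek and a read. This is only an illustration under stated assumptions: the helper names and the example file names are hypothetical, and, as with "--seek", it relies on the archive being uncompressed and seekable.

```python
import io
import tarfile

def build_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def index_tar(fileobj):
    """Scan the archive once and map member name -> (data offset, size)."""
    index = {}
    # mode "r:" means uncompressed only; offsets are meaningless otherwise.
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(fileobj, index, name):
    """Fetch one member by seeking straight to its data."""
    offset, size = index[name]
    fileobj.seek(offset)
    return fileobj.read(size)

# Made-up pages standing in for pre-generated QA files.
files = {
    "aaa/bbb/aaabbb.html": b"<html>one</html>",
    "aaa/ccc/aaaccc.html": b"<html>two</html>",
}
archive = io.BytesIO(build_tar(files))  # in practice: open("qa.tar", "rb")
index = index_tar(archive)
page = read_member(archive, index, "aaa/ccc/aaaccc.html")
```

The one-time scan still walks every header, so the index would need to be saved alongside the tarball for this to pay off across restarts.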
So the squashfs disk-image format might be an option too: there's an "unsquashfs" command that looks like it would let you use squashfs like tar, but presumably with indexing etc. built in! My guess is that you want a directory structure for this to work really well (aaa/bbb/aaabbb.html).
(I mean, just mounting the squashfs image would be much preferable and easier, since it lets the kernel do the work, but that would require some permissions changes from the Spin team, as discussed in the thread you mention above.)
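The aaa/bbb/aaabbb.html layout suggested above can be generated with a small helper like the one below. This is a sketch, not code from fastspecfit: the function name `sharded_path`, the fixed three-character shard width, and the example identifiers are assumptions made for illustration.

```python
from pathlib import PurePosixPath

def sharded_path(name, width=3, suffix=".html"):
    """Map an identifier to <first width chars>/<next width chars>/<name><suffix>.

    Splitting on leading characters keeps any single directory from
    holding millions of entries.
    """
    if len(name) < 2 * width:
        raise ValueError(f"identifier {name!r} is shorter than {2 * width} characters")
    return PurePosixPath(name[:width], name[width:2 * width], name + suffix)

# Hypothetical identifiers, used only to show the resulting layout.
example = str(sharded_path("aaabbb"))
longer = str(sharded_path("123456789"))
```

If the identifiers share a long common prefix, sharding on trailing characters or on a hash of the name would spread the files more evenly across directories.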
The squashfs experiment was very successful. The timings are for a second run of each program, i.e., with the disk cache hot.
@sbailey has argued that generating QA for every single DESI target in each public VAC is probably not sustainable (18M targets in Iron alone), so we need to get creative about how the QA can be generated and rendered in the public web-app (https://fastspecfit.desi.lbl.gov). @dstndstn has proposed using the Spin disks themselves to host the files (as a tarball?), but we'll need to look into the details and make sure that we get NERSC on board to help us come up with a (hopefully long-term) solution.