how to handle the large number of QA files for public release of the fastspecfit VACs #106

Open
moustakas opened this issue Feb 23, 2023 · 6 comments
Labels
must-do Must be addressed.

Comments

@moustakas
Member

@sbailey has argued that generating QA for every single DESI target in each public VAC is probably not sustainable (18M targets in Iron alone), so we need to get creative about how the QA is generated and rendered in the public web-app (https://fastspecfit.desi.lbl.gov). @dstndstn has proposed using the Spin disks themselves to host the files (as a tarball?), but we'll need to look into the details and get NERSC on board to help us come up with a (hopefully long-term) solution.

@sbailey
Contributor

sbailey commented Feb 27, 2023

FYI, it is possible to embed PNG data directly into HTML without requiring a separate PNG file to exist on disk. So, e.g., you could keep the PNG data as N>>1 blobs in a format that is optimized for random access (e.g. HDF5) and then embed them into HTML dynamically generated by fastspecfit.desi.lbl.gov. Or store N>>1 HTML "files" as blobs in that format if even the HTML part can be pre-generated.
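
As a rough illustration of the embedding idea above, a PNG held in memory (for example, a blob read from a random-access container) can be inlined into an `<img>` tag as a base64 data URI. The function name `png_to_img_tag` and the placeholder bytes are hypothetical; this is a minimal sketch under those assumptions, not a description of how the web-app works.

```python
import base64
import html

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_to_img_tag(png_bytes: bytes, alt: str = "") -> str:
    """Return an <img> tag with the PNG embedded as a base64 data URI."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG byte string")
    encoded = base64.b64encode(png_bytes).decode("ascii")
    safe_alt = html.escape(alt, quote=True)
    return f'<img alt="{safe_alt}" src="data:image/png;base64,{encoded}">'


# Stand-in for bytes fetched from a blob store (hypothetical); a real blob
# would be a complete PNG image rather than a signature plus filler.
fake_png = PNG_SIGNATURE + b"placeholder-bytes"
tag = png_to_img_tag(fake_png, alt="fastspec QA")
```

Base64 inflates the payload by roughly a third, so this trades some bandwidth for not needing a separate file per image.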

Alternatively, IIRC @dstndstn had a clever trick for creating a disk image that appears as a single file to NERSC but could be mounted by a docker instance to see the N>>1 files within that disk image.

Completely separate from fastspecfit itself, it would be useful to work out an example recipe for the generic problem of how to serve O(millions) of pre-generated "files" without actually creating millions of files on disk. That is, generating them from scratch is too slow to do on the fly, but there are too many of them to keep on disk as individual files, so what's the most efficient way to cache and serve them for random access?
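
One generic way to approach the problem posed above is a single "pack" file holding all blobs back to back, plus a small index mapping each name to its byte offset and length. The sketch below uses only the Python standard library; the file names (`qa.pack`, `qa.index.json`) and helper names are made up for illustration, and a JSON index is an assumption that would likely need replacing with something more compact at the scale of millions of entries.

```python
import json
import os
import tempfile


def write_pack(pack_path, index_path, items):
    """Concatenate blobs into one file and record (offset, length) per name."""
    index = {}
    with open(pack_path, "wb") as pack:
        for name, blob in items:
            index[name] = (pack.tell(), len(blob))
            pack.write(blob)
    with open(index_path, "w") as f:
        json.dump(index, f)


def read_blob(pack_path, index, name):
    """Seek straight to one blob instead of scanning the whole pack."""
    offset, length = index[name]
    with open(pack_path, "rb") as pack:
        pack.seek(offset)
        return pack.read(length)


with tempfile.TemporaryDirectory() as tmp:
    pack_path = os.path.join(tmp, "qa.pack")
    index_path = os.path.join(tmp, "qa.index.json")
    pages = [("a.html", b"<p>A</p>"), ("b.html", b"<p>B</p>")]
    write_pack(pack_path, index_path, pages)
    with open(index_path) as f:
        index = json.load(f)
    page_b = read_blob(pack_path, index, "b.html")
```

Only two files exist on disk no matter how many blobs are stored, and each lookup costs one seek and one read.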

@moustakas
Member Author

For those of you with access to the NERSC users Slack space, there's a discussion here which @dstndstn initiated:
https://nerscusers.slack.com/archives/C01LPA84AGM/p1677776147290869

@dstndstn

dstndstn commented Aug 7, 2023

I was also reading about GNU tar's "--seek" option (assume the archive is seekable), which is supposed to allow faster extraction. It doesn't work on compressed tarballs. Maybe worth checking out, though tar is supposed to auto-detect seekability.
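
A related approach, sketched here in Python rather than with the tar command line, is to scan an uncompressed tarball once, record where each member's data starts, and then read members directly with a seek. It relies on the documented `TarInfo.offset_data` and `TarInfo.size` attributes of the standard `tarfile` module. The helper names and the tiny in-memory archive are hypothetical, and like `--seek` this does not apply to compressed tarballs.

```python
import io
import tarfile


def build_tar_index(fileobj):
    """Scan an uncompressed tar once; map member name -> (data offset, size)."""
    fileobj.seek(0)
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        return {m.name: (m.offset_data, m.size) for m in tar if m.isfile()}


def read_member(fileobj, index, name):
    """Read one member's bytes directly, without walking the archive again."""
    offset, size = index[name]
    fileobj.seek(offset)
    return fileobj.read(size)


# Build a small in-memory tarball to stand in for the real archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in [("healpix/a.png", b"AAAA"), ("healpix/b.png", b"BBBBBB")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

index = build_tar_index(buf)
payload = read_member(buf, index, "healpix/b.png")
```

The one-time scan is still linear in the size of the archive, so the index would need to be saved and reused for this to pay off.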

@dstndstn

dstndstn commented Aug 7, 2023

So the squashfs disk-image format might be an option too. There's an "unsquashfs" command that looks like it would let you use squashfs like 'tar', but presumably with indexing etc. built in! My guess is that you want a directory structure for this to work really well (aaa/bbb/aaabbb.html).
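
To make the directory-structure suggestion concrete, here is a small sketch of a path builder for a layout like the one in the file paths quoted later in this thread (`healpix/sv1/dark/284/28475/...`). The function name `qa_relpath` is hypothetical, and the rule that the grouping directory is the healpix number integer-divided by 100 is an assumption inferred from that single example path.

```python
from pathlib import PurePosixPath


def qa_relpath(survey, program, healpix, targetid, ext="png"):
    """Build a sharded relative path so no single directory grows too large."""
    # Assumption: the grouping directory is healpix // 100 (e.g. 28475 -> 284).
    group = healpix // 100
    filename = f"fastspec-{survey}-{program}-{healpix}-{targetid}.{ext}"
    return PurePosixPath("healpix", survey, program, str(group), str(healpix), filename)


path = qa_relpath("sv1", "dark", 28475, 39628433029860119)
```

With two levels of fan-out, each directory holds a bounded number of entries, which keeps lookups by path short.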

@dstndstn

dstndstn commented Aug 7, 2023

(I mean, just mounting the squashfs image would be much preferable and easier -- let the kernel do the work! -- but that would require some permissions changes from the Spin team, as discussed in the thread you mention above.)

@dstndstn

dstndstn commented Aug 7, 2023

The squashfs experiment was very successful. These are timings for the second run of each program, i.e. with the disk cache hot:

> time tar xf /pscratch/sd/d/dstn/fastspecfit-fuji-v2.0-html-healpix-sv1-dark.tar --seek healpix/sv1/dark/284/28475/fastspec-sv1-dark-28475-39628433029860119.png

real	0m0.849s
user	0m0.423s
sys	0m0.423s

> time ./rdsquashfs -u dark/284/28475/fastspec-sv1-dark-28475-39628433029860119.png /pscratch/sd/d/dstn/fastspecfit-fuji-v2.0-html-healpix-sv1-dark.squashfs 
creating fastspec-sv1-dark-28475-39628433029860119.png

real	0m0.030s
user	0m0.012s
sys	0m0.014s
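
For anyone who wants to repeat this kind of warm-cache comparison, a minimal timing helper might look like the sketch below. It runs a command once to warm the cache and then reports the best wall-clock time over a few more runs. The helper name `time_warm` is hypothetical and the demo command is a trivial Python invocation, not the tar or rdsquashfs commands above. The last line just takes the ratio of the two "real" times quoted above, which comes to roughly 28x for this one file.

```python
import subprocess
import sys
import time


def time_warm(argv, repeats=3):
    """Run argv once to warm caches, then return the best wall time (seconds)."""
    subprocess.run(argv, check=True, capture_output=True)  # warm-up, not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True, capture_output=True)
        best = min(best, time.perf_counter() - start)
    return best


# Demo with a command that exists everywhere Python does.
elapsed = time_warm([sys.executable, "-c", "pass"])

# Ratio of the "real" times reported above (tar vs. rdsquashfs).
speedup = 0.849 / 0.030
```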

@moustakas moustakas added the must-do Must be addressed. label Dec 3, 2023