how to handle the large number of QA files for public release of the fastspecfit VACs #106

moustakas · 2023-02-23T16:51:20Z

@sbailey has argued that generating QA for every single DESI target in each public VAC is probably not sustainable (18M targets in Iron alone), so we need to get creative about how the QA is generated and rendered in the public web-app (https://fastspecfit.desi.lbl.gov). @dstndstn has proposed using the Spin disks themselves to host the files (as a tarball?), but we'll need to look into the details and make sure we get NERSC on board to help us come up with a (hopefully long-term) solution.

sbailey · 2023-02-27T18:04:38Z

FYI, it is possible to directly embed PNG data into html without requiring a separate png file to exist on disk. So e.g. you could keep the png data as N>>1 blobs in a format that is optimized for random access (e.g. hdf5) and then embed into html dynamically generated by fastspecfit.desi.lbl.gov. Or store N>>1 html "files" as blobs in that format if even the html part can be pre-generated.
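A minimal sketch of the embedding idea, assuming the PNG bytes have already been read from whatever blob store is chosen (the HDF5 lookup itself is not shown). The function names `png_data_uri` and `render_qa_page` are hypothetical and not part of fastspecfit:

```python
import base64
import html


def png_data_uri(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data: URI usable in an <img> src attribute."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def render_qa_page(target_name: str, png_bytes: bytes) -> str:
    """Build a small HTML page with the PNG embedded inline (no file on disk)."""
    title = html.escape(target_name)  # escape in case the name has <, >, & or quotes
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{title}</title></head>\n"
        f'<body><img alt="{title}" src="{png_data_uri(png_bytes)}"></body></html>\n'
    )
```

Base64 inflates the payload by about a third, so this trades some bandwidth for not needing a separate PNG file per target.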

Alternatively, IIRC @dstndstn had a clever trick for creating a disk image that appears as a single file to NERSC but could be mounted by a docker instance to see the N>>1 files within that disk image.

Completely separate from fastspecfit itself, it would be useful to work out an example recipe for the generic problem of how to serve O(millions) of pre-generated "files" without actually generating millions of files on disk. i.e. generating them from scratch is too slow to do on the fly, but there are too many of them to keep on disk, so what's the most efficient way to cache+serve them for random access?
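One possible shape for such a recipe, sketched with Python's built-in `sqlite3` module purely as an illustration: SQLite is not something proposed in this thread, and the table name `qa` and the functions `build_store` and `fetch` are hypothetical. The point is only that many small "files" can live as keyed blobs inside one file on disk and still be read at random:

```python
import sqlite3


def build_store(db_path, items):
    """Write (name, bytes) pairs into a single SQLite file, keyed by name."""
    con = sqlite3.connect(db_path)
    with con:  # commits on success, rolls back on error
        con.execute(
            "CREATE TABLE IF NOT EXISTS qa (name TEXT PRIMARY KEY, data BLOB NOT NULL)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO qa (name, data) VALUES (?, ?)", items
        )
    con.close()


def fetch(db_path, name):
    """Return the stored bytes for one name, or None if it is not present."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute("SELECT data FROM qa WHERE name = ?", (name,)).fetchone()
    finally:
        con.close()
    return None if row is None else bytes(row[0])
```

The primary key gives an indexed lookup, so a single fetch does not scan the other entries. Whether this scales to the sizes discussed here would need to be measured.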

moustakas · 2023-03-05T12:59:58Z

For those of you with access to the NERSC users Slack space, there's a discussion here which @dstndstn initiated--
https://nerscusers.slack.com/archives/C01LPA84AGM/p1677776147290869

dstndstn · 2023-08-07T12:20:54Z

I was also reading about GNU tar's "--seek" option (assume the archive is seekable), which is supposed to allow faster extraction. It doesn't work on compressed tarballs. Maybe worth checking out, though tar is supposed to auto-detect seekability.
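To illustrate why a seekable, uncompressed tarball allows fast reads, here is a rough sketch using Python's `tarfile` module. It is not what `--seek` does internally: it scans the archive once to record where each member's data starts, then reads a single member by seeking straight to that offset. The function names are hypothetical, and sparse members are not handled:

```python
import tarfile


def build_tar_index(tar_path):
    """Scan an uncompressed tar once; map member name -> (data offset, size)."""
    index = {}
    # Mode "r:" opens only uncompressed archives, where byte offsets are meaningful.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Read one member's bytes by seeking directly to its recorded offset."""
    offset, size = index[name]
    with open(tar_path, "rb") as fh:
        fh.seek(offset)
        return fh.read(size)
```

The index would have to be stored alongside the tarball (for millions of members, rebuilding it on each request would defeat the purpose).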

dstndstn · 2023-08-07T12:32:23Z

So the squashfs disk-image format might be an option too -- there's an "unsquashfs" command that looks like you could use squashfs like 'tar', but presumably with indexing etc definitely built in! My guess is that you want a directory structure for this to work really well (aaa/bbb/aaabbb.html).
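A small sketch of such a sharded layout. The `healpix // 100` grouping and the file-name pattern are assumptions inferred from the single example path in the timings in this thread (`sv1/dark/284/28475/fastspec-sv1-dark-28475-39628433029860119.png`), and `qa_relpath` is a hypothetical helper, not a fastspecfit function:

```python
from pathlib import PurePosixPath


def qa_relpath(survey, program, healpix, targetid, ext="html"):
    """Build a sharded relative path so no single directory holds too many files."""
    group = healpix // 100  # assumed grouping, e.g. 28475 -> 284
    name = f"fastspec-{survey}-{program}-{healpix}-{targetid}.{ext}"
    return PurePosixPath(survey, program, str(group), str(healpix), name)
```

For example, `qa_relpath("sv1", "dark", 28475, 39628433029860119, "png")` gives the relative path used in the squashfs timing test.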

dstndstn · 2023-08-07T12:33:10Z

(I mean, just mounting the squashfs image would be much preferable and easier -- let the kernel do the work! -- but that would require some permissions changes from the Spin team, as discussed in the thread you mention above.)

dstndstn · 2023-08-07T17:39:19Z

The squashfs experiment was very successful. These are timings for the second run of each command, i.e., with the disk cache hot:

> time tar xf /pscratch/sd/d/dstn/fastspecfit-fuji-v2.0-html-healpix-sv1-dark.tar --seek healpix/sv1/dark/284/28475/fastspec-sv1-dark-28475-39628433029860119.png

real	0m0.849s
user	0m0.423s
sys	0m0.423s

> time ./rdsquashfs -u dark/284/28475/fastspec-sv1-dark-28475-39628433029860119.png /pscratch/sd/d/dstn/fastspecfit-fuji-v2.0-html-healpix-sv1-dark.squashfs 
creating fastspec-sv1-dark-28475-39628433029860119.png

real	0m0.030s
user	0m0.012s
sys	0m0.014s

moustakas added the must-do ("Must be addressed.") label on Dec 3, 2023
