Suggestion for rearranging the paleo pism xarrays before writing to zarr #1

Open
jkingslake opened this issue Apr 2, 2021 · 6 comments

@jkingslake
Member

Hi @talbrecht,
I thought I would get your opinion on rearranging the xarrays slightly before you write them to zarr.

The code is here: https://gist.github.com/jkingslake/f974d22e1ea72b4a6f583581406f81d3

Essentially, it splits the data variables along the id dimension into 4 new dimensions that correspond to the four parameters in the ensemble.
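
For reference, the pattern is roughly this (a minimal sketch; the real code is in the gist above, and the parameter names other than par_ppq and par_esia are placeholders):

```python
import xarray as xr

# open the ensemble store (path is a placeholder)
ds = xr.open_zarr('present.zarr')

# the four varied parameters, stored as coordinates along the 'id' dimension
# (par_prec and par_visc are made-up names here)
params = ['par_ppq', 'par_esia', 'par_prec', 'par_visc']

# build a MultiIndex on 'id' from the parameter values, then unstack so that
# each parameter becomes its own dimension
ds_unstacked = ds.set_index(id=params).unstack('id')
```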

Let me know what you think.

@jkingslake jkingslake assigned jkingslake and talbrecht and unassigned jkingslake Apr 2, 2021
@jkingslake
Member Author

> So, should I upload the ensemble data again in the new form?

Yes, that would be great, thanks!

The other thing we might have to be careful about is chunk size. I have read that we should aim for chunks of around 100 MB to a few hundred MB to make computations on clusters most efficient. I think they are OK as they are, but maybe a little small (in present.zarr the main data arrays are chunked at ~500 kB).

I tried rechunking with xarray.DataArray.chunk, but this seemed to fail. My next attempt will be with [the rechunker package](https://rechunker.readthedocs.io/en/latest/), which is specifically designed for this. But maybe it also makes sense to see if you can write it to zarr with a better chunk size in the first place. I was trying the following chunk sizes with rechunker: {'time': 1, 'x': 381, 'y': 381, 'par_esia': 4}. Maybe you can try rechunking on your HPC before upload.
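
The rechunker workflow looks roughly like this (untested sketch; the store paths and the variable name 'thk' are placeholders, and the exact form of target_chunks depends on whether you pass an xarray Dataset or a zarr group, so check the rechunker docs):

```python
import gcsfs
import xarray as xr
from rechunker import rechunk

gcs = gcsfs.GCSFileSystem()
ds = xr.open_zarr(gcs.get_mapper('gs://ldeo-glaciology/paleo_ensemble/present.zarr'))

# desired chunks per data variable ('thk' is just an example name;
# the other parameter dimensions may need entries too)
target_chunks = {'thk': {'time': 1, 'par_esia': 4, 'y': 381, 'x': 381}}

# rechunker writes the result to a new store and needs a temporary store
plan = rechunk(
    ds,
    target_chunks=target_chunks,
    max_mem='1GB',
    target_store=gcs.get_mapper('gs://ldeo-glaciology/paleo_ensemble/present_rechunked.zarr'),
    temp_store=gcs.get_mapper('gs://ldeo-glaciology/paleo_ensemble/rechunk_tmp.zarr'),
)
plan.execute()
```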

@talbrecht
Collaborator

talbrecht commented Apr 8, 2021

I rearranged the ensemble data according to the 4 varied parameters and successfully re-chunked the arrays. But somehow the upload of the timeseries failed, see jupyter notebook 69548d1 (or see here).

@jkingslake
Member Author

jkingslake commented Apr 8, 2021

I wonder if it is related to already having zarrs with this name at this location in the bucket.

Does it work if you write to a new zarr directory, e.g. with
`mapper = gcs.get_mapper('gs://ldeo-glaciology/paleo_ensemble/'+mf+'_2.zarr')`?
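
Something like this (sketch, assuming `ds` is the rearranged dataset and the gcsfs filesystem has write access to the bucket):

```python
import gcsfs

gcs = gcsfs.GCSFileSystem()
mapper = gcs.get_mapper('gs://ldeo-glaciology/paleo_ensemble/' + mf + '_2.zarr')

# mode='w' creates/overwrites the store; consolidated=True makes re-opening faster
ds.to_zarr(mapper, mode='w', consolidated=True)
```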

@jkingslake
Member Author

...but I see that the other uploads worked ok and I can now see the data in the new 'unstacked' format, so maybe my suggestion above isn't the issue.

@talbrecht
Collaborator

talbrecht commented Apr 8, 2021

Right, I would guess so too, but I thought this repo space would then be wasted. With mode='w' it should actually overwrite existing folders. And would you say the chunk size is large enough now? I couldn't find out how to chunk in the to_zarr function.
Just give it a try!

@jkingslake
Member Author

Hi Torsten, sorry I lost track of this issue. I think the chunk size is OK now. Perhaps a little small, but that currently isn't a big deal because the whole dataset isn't very big. We might want to revisit the chunk size if we put up a fuller version of the ensemble.
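
On your question about chunking in to_zarr: as far as I know to_zarr doesn't take a chunks argument itself. The usual pattern is either to chunk the dataset before writing (each dask chunk becomes a zarr chunk) or to set per-variable chunks via the encoding argument. A rough sketch ('thk' is just an example variable name):

```python
# Option 1: chunk the dataset first; dask chunks become zarr chunks
ds = ds.chunk({'time': 1, 'x': 381, 'y': 381, 'par_esia': 4})
ds.to_zarr(mapper, mode='w', consolidated=True)

# Option 2: set the zarr chunks explicitly via encoding
# (the tuple must follow the variable's dimension order)
ds.to_zarr(mapper, mode='w',
           encoding={'thk': {'chunks': (1, 4, 381, 381)}})
```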

One small issue that I just realized is that the unstacking procedure we are using causes us to lose the attributes of par_ppq, par_esia, etc.

Saving the attributes of each of them before the unstack stage and then re-writing them once par_ppq, par_esia have become coordinates seems to work: https://gist.github.com/jkingslake/0946ae96f065f8def236867f6895b5c9
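
The core of it is just this (sketch; parameter names beyond par_ppq and par_esia are placeholders):

```python
params = ['par_ppq', 'par_esia', 'par_prec', 'par_visc']

# stash each parameter's attributes before set_index/unstack discards them
saved_attrs = {p: ds[p].attrs for p in params}

ds_unstacked = ds.set_index(id=params).unstack('id')

# restore the attributes now that the parameters are dimension coordinates
for p in params:
    ds_unstacked[p].attrs = saved_attrs[p]
```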
