Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

57 add documentation #71

Merged
merged 15 commits into from
May 6, 2024
Merged

57 add documentation #71

merged 15 commits into from
May 6, 2024

Conversation

vanlankveldthijs
Copy link
Contributor

To resolve issue #57

There are a few remaining issues to resolve.

  1. The pull request has 2 example notebooks: one with and one without storing and reloading the ordered data before performing the timing tests. To be determined which is better. This one should be called demo_order_stm.ipynb and the other removed.

  2. It seems like stmat.copy does not have the desired result: when I try to run STM operations on the cloned data, there seems to be some parts missing (I think it was the whole .stm part). Solved by loading the whole dataset twice, but that is a bit of an ugly fix.

  3. It seems that when using small chunk sizes, storing to zarr fails. Solved by storing using 1000 size chunks and rechunking after loading. This is a bit ugly.

  4. Possible issue with storing the ordered data to zarr when this needs to overwrite existing data. I think shutil.retree also sometimes gives issues (maybe when the is no directory).

  5. Timing tests using the prerequisite small chunk size (500 in this case) become very slow. Don't know whether we can really fix this though.
    Currently running the final timing tests on both notebooks. To be committed when done.

thijsvl added 9 commits March 26, 2024 13:52
Note that the example notebook is still a direct copy of the operations example notebook.
There are two example notebooks: with or without storing and reloading
the reordered STM before performing timing tests. To be determined
which should be used.
@rogerkuou
Copy link
Member

Thanks @vanlankveldthijs for the implementation. Following our discussion, the next steps can be:

  1. Demonstrate the performance enhancement from re-ordering in a separate md page.
  2. Elaborate on the factors affecting the re-ordering effect: 1) homogeneity of point distribution; 2) spatial spreadness of points. These two factors can influence the choice of chunksize and spatial closeness of points in the same chunk
  3. Only do performance comparision on subset, which should be enough
  4. Use %%timeit for a robust profiling
  5. Still keep an example notebook for the illustration of re-order operation (you can decide either include this in the existing notebooks or make a new one)

Attaching here two example notebooks I made for comparison: notebooks.zip

Copy link
Member

@rogerkuou rogerkuou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vanlankveldthijs, I also reviewed other changes in you implemented. In general, they are good. Just some minor comments:

  1. I suggested a change on .gitignore;
  2. The build workflows are failing because the unit tests are filling.
  3. The ruff workflow failed because some linting problem. Sarah have already fixed this in Add a function to enrich STM using data from another dataset #66. You do not need to worry about this.

.gitignore Outdated Show resolved Hide resolved
@vanlankveldthijs
Copy link
Contributor Author

Consolidated demo notebooks into one.

Fixed stm.copy issue (should've been stm.copy())

Changed demo notebook to only test subset operation and to test this on large and small chunks (relative to dataset) to see impact.

@vanlankveldthijs
Copy link
Contributor Author

Fixed pytests by checking for existence of time in chunksizes.

Please check whether implementation could be nicer (better python).

@rogerkuou
Copy link
Member

Hi @vanlankveldthijs, thanks for the nice work and sorry for taking so long to review. I quite like the new notebooks. I think it is good for merging now.

As I have mentioned in the previous comment, the ruff workflow should have been fixed by Sarah in #66.

Two warnings in the notebook are documented in #74 and #73. They are not relevant for this PR and can be fixed later

@rogerkuou rogerkuou merged commit 72cf071 into main May 6, 2024
14 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants