Index and other optimisations #922

Merged
merged 24 commits into master from indices_2024
Feb 24, 2024

shawnlaffan
Owner

A set of optimisations to several index and related calculations.

Summary of main changes:

  • Hierarchical calculations are now supported. This allows cluster tree calculations to build on the child results rather than rebuilding everything for each node.
  • Several endemism calculations now re-use results where the central and whole variants will be the same.
  • Significance assessments are faster.
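The hierarchical idea in the first bullet can be sketched roughly as follows. This is a minimal illustration in Python rather than the project's own code: the `Node` class, the `tip_labels` field and `label_counts_bottom_up` are hypothetical names, and a simple label count stands in for a real index. Each internal node's result is built by merging its children's already-computed results instead of re-collecting every descendant tip.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical tree node: tips carry label counts, internal nodes carry children.
    name: str
    children: list = field(default_factory=list)
    tip_labels: Counter = field(default_factory=Counter)


def label_counts_bottom_up(root):
    """Return {node name: Counter of labels}, computing children before parents."""
    results = {}

    def visit(node):
        if not node.children:
            # Tip: the result is just its own labels.
            results[node.name] = Counter(node.tip_labels)
            return
        total = Counter()
        for child in node.children:
            visit(child)
            # Re-use the child's stored result rather than walking its tips again.
            total.update(results[child.name])
        results[node.name] = total

    visit(root)
    return results
```

With this shape, the work at each internal node is proportional to the size of its children's results, which is where the saving over rebuilding from scratch would come from.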

Take advantage of the label hash global precalc,
and use hash aliases instead of refs.
Might as well avoid any recursion overheads.
And stop throwing errors when ref is undefined
in get_basedata_ref.
This way we avoid cloning basedata refs,
analysis args and the like.
No need to find the index names when they are
in the base_list_ref already.

Also use refaliasing to avoid some derefs
and declutter loop variables.
Passing in the base list allows fewer grep comparisons.
This makes a large difference when there are many lists
with many keys.
This allows future optimisations when
calculating indices for cluster trees.
This allows several indices to be optimised when
calculated for cluster nodes, providing they
are done starting from the tips.

PE has been optimised in this commit.
Avoids a lot of hash creation and deletion
with large datasets.
Use a treenode method that caches, rather
than repeatedly calling methods to get
the same answer.
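As a rough illustration of the caching approach, here is a Python sketch. The `TreeNode` class and its `terminal_names` method are hypothetical stand-ins, since the commit message does not name the method involved. The first call computes and stores the answer, and later calls return the stored value.

```python
class TreeNode:
    # Hypothetical node class, for illustration only.
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self._terminal_names = None  # cache slot, filled on first use

    def terminal_names(self):
        """Return the names of all tips below this node, caching the result."""
        if self._terminal_names is None:
            if not self.children:
                names = [self.name]
            else:
                names = [n for c in self.children for n in c.terminal_names()]
            self._terminal_names = tuple(names)
        return self._terminal_names
```

A cache like this assumes the tree is not modified after the first call. If it is, the stored values would need to be cleared.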
It is cleaner to pack the node and child names
in their own structure.  That also enables
later additions without adding yet more
top level arguments.
Use direct assignment if starting with empty list.
This can be a _very_ hot loop so even small
differences add up.
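The direct-assignment shortcut might look something like this in Python. It is a sketch only: `merge_counts` is a hypothetical helper, and the original change concerns Perl data structures, so any speed difference here is not guaranteed to match.

```python
def merge_counts(target, source):
    """Add the counts in source into target, in place, and return target."""
    if not target:
        # Nothing accumulated yet: copy the entries in bulk,
        # skipping the per-key lookup and addition.
        target.update(source)
    else:
        for key, value in source.items():
            target[key] = target.get(key, 0) + value
    return target
```

The bulk copy still leaves `source` untouched and separate from `target`, so later additions to the accumulator do not leak back into the input.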
These are cleared as we go to avoid leakage.
If the second neighbour set is empty then the
whole and central variants return the same results.
So short circuit in these cases.
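A simplified Python sketch of this short circuit follows. The function names and the range dictionaries are assumptions made for illustration, and the endemism formula is reduced to a sum of local/global range ratios. Here the central variant scores only the labels in the first neighbour set, while the whole variant scores the labels across both sets.

```python
def weighted_endemism(labels, local_range, global_range):
    """Sum of local/global range ratios over the given labels (simplified)."""
    return sum(local_range[lb] / global_range[lb] for lb in labels)


def endemism_central_and_whole(set1_labels, set2_labels, local_range, global_range):
    """Return (central, whole), re-using central when set 2 is empty."""
    central = weighted_endemism(set1_labels, local_range, global_range)
    if not set2_labels:
        # No second neighbour set: both variants cover the same labels,
        # so the second calculation can be skipped.
        return central, central
    whole = weighted_endemism(
        set(set1_labels) | set(set2_labels), local_range, global_range
    )
    return central, whole
```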
@shawnlaffan shawnlaffan merged commit 3c590de into master Feb 24, 2024
8 checks passed
@shawnlaffan shawnlaffan deleted the indices_2024 branch February 24, 2024 05:53