Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of spectral units #311

Merged
merged 12 commits into from
Jul 30, 2024
Merged

Improve handling of spectral units #311

merged 12 commits into from
Jul 30, 2024

Conversation

ajjackson
Copy link
Collaborator

@ajjackson ajjackson commented Jul 26, 2024

It looks like spectral intensities are sometimes handled in energy units (such as when imported from CASTEP DOS). If these are converted to/from their proper (reciprocal energy) units, this can cause divide-by-zero errors as pint performs a wavenumber <-> wavelength conversion and takes the reciprocal of the spectral intensities, including empty bins.

Here we introduce another mechanism for dealing with spectral units: a few extra unit conversions in a special "reciprocal_spectroscopy" context (which comes an "rs" alias for convenience). This stops Pint from choking on the 1/(1/cm) <-> 1/meV conversions.

The broadening system is made a bit more robust; previously the variable-width broadening assumed that things can always be converted to energy units.

Perhaps the most serious issue addressed here is the way the Spectrum data "getters" (hidden behind a @property interface) work.

The current spectrum getters create a unit conversion factor and then
multiply the raw array by it. There are two problems with this:

  • Multiplying any iterator by a Quantity or Unit leads to Pint
    checking through all the data for consistency. This can be quite
    expensive, and is unnecessary for these trusted array objects. When
    iterating over spectra yielded from a Spectrum1DCollection, the cost
    can become quite noticeable.

  • Some unit conversions in the spectroscopy context involve taking
    a reciprocal, e.g. conversion from cm -> 1/cm is allowed. But if
    this is done to a factor the data itself will not be inverted. If
    we start with a Spectrum1D with x_data in wavenumber (1/cm) then

    spectrum.x_data.to("cm")

    and

    spectrum.x_data_unit = "cm"
    spectrum.x_data

    will not give the same result; the second method leaves the actual
    values untouched (as cm / (1/cm) = 1).

To further improve these conversions I included some technically-redundant energy-wavenumber conversions in the "reciprocal_spectroscopy" context. With these available, .to(..., "rs") conversions can avoid needlessly taking reciprocals of the data and raising divide-by-zero errors. That should be more efficient, too, but it's not clear if that works in practice.

- Pint tends to choke on conversions between reciprocal energy and
  wavenumber. These don't come up too often but are an issue when
  dealing with histograms and spectra.

- Here we add a few more "definitions" for Pint to use in such cases.

- This creates a slightly scary range of paths around unit
  conversions. By putting them in a separate context, we can ensure
  they are only introduced when they are likely to be needed.
DOS units should be reciprocal of energy/frequency axis: this gives
dimensionless area (count) and scales properly with unit changes.

It looks like DOS was expressed in enegy/frequency units in part to
avoid some Pint conversion challenges. The new reciprocal_spectroscopy
context should allow those to be done more safely and directly.

There are still some more test failures to be examined and cleaned
up.
Copy link
Contributor

github-actions bot commented Jul 26, 2024

Test Results

    22 files  ±0      22 suites  ±0   29m 15s ⏱️ - 1m 15s
 1 048 tests ±0   1 042 ✅ ±0   6 💤 ±0  0 ❌ ±0 
10 420 runs  ±0  10 357 ✅ ±0  63 💤 ±0  0 ❌ ±0 

Results for commit 517e8d0. ± Comparison against base commit 75fe42e.

♻️ This comment has been updated with latest results.

- Broadening logic tries to convert to hartree and back again, but
  this behaves unpredictably (or otherwide badly) for non-energy
  units.

- Instead we can use the bin units; this still requires that the
  relevant parameters are dimensionally-consistent _with each other_.

- While working on this I noticed that very small bin
  values (i.e. when using large bin _units_) are incorrectly
  identified as having equal width due to the "atol" which is intended
  to stabilise comparisons near zero. Bin widths should always be
  finite, so we can make this stricter and always base agreement on
  the relative width.
@ajjackson ajjackson marked this pull request as ready for review July 29, 2024 16:02
The current spectrum getters create a unit conversion factor and then
multiply the raw array by it. There are two problems with this:

- Multiplying any iterator by a Quantity or Unit leads to Pint
  checking through all the data for consistency. This can be quite
  expensive, and is unnecessary for these trusted array objects. When
  iterating over spectra yielded from a Spectrum1DCollection, the cost
  can become quite noticeable.

- Some unit conversions in the spectroscopy context involve taking
  a reciprocal, e.g. conversion from cm -> 1/cm is allowed. But if
  this is done to a _factor_ the data itself will not be inverted. If
  we start with a Spectrum1D with x_data in wavenumber (1/cm) then

    spectrum.x_data.to("cm")

  and

    spectrum.x_data_unit = "cm"
    spectrum.x_data

  will not give the same result; the second method leaves the actual
  values untouched (as cm / (1/cm) = 1).
By providing Pint with "direct" conversion routes between wavenumber
and energy (in each direction and inversion), Pint avoids
unnecessarily taking the reciprocal of data and encountering
divide-by-zero/infinite-value situations.

It doesn't look like the existing divide-by-zero situations were
causing problems with correctness of results, but its nice to clear
the warnings (and save some CPU?) so that when we see this warning we
know it is worth investigating.
@ajjackson
Copy link
Collaborator Author

ajjackson commented Jul 30, 2024

I performed a little speed test of conversions in different contexts. With this definition file:

@context reciprocal_spectroscopy = rs
    1 / [frequency] <-> 1 / [length]: 1 / value / speed_of_light
    1 / [energy] -> 1 / [frequency]: planck_constant * value
    1 / [frequency] -> 1 / [energy]: value / planck_constant
    [length] -> 1 / [energy]: value / planck_constant / speed_of_light
    1 / [energy] -> [length]: value * planck_constant * speed_of_light
    1 / [length] -> [energy]: value * planck_constant * speed_of_light
    [energy] -> 1 / [length]: value / planck_constant / speed_of_light
@end

@context bracketed_spectroscopy = bs
    1 / [frequency] <-> 1 / [length]: 1 / (value * speed_of_light)
    1 / [energy] -> 1 / [frequency]: planck_constant * value
    1 / [frequency] -> 1 / [energy]: value / planck_constant
    [length] -> 1 / [energy]: value / (planck_constant * speed_of_light)
    1 / [energy] -> [length]: value * (planck_constant * speed_of_light)
    1 / [length] -> [energy]: value * (planck_constant * speed_of_light)
    [energy] -> 1 / [length]: value / (planck_constant * speed_of_light)
@end

and this speed test code:

import timeit

import numpy as np
from pint import UnitRegistry


ureg = UnitRegistry()
ureg.load_definitions("./definitions.txt")

sparse_data = np.random.random((10000, 200))
sparse_data[sparse_data > 0.5] = 0

sparse_spectrum = ureg.Quantity(sparse_data, "1/cm")

print("Default spectroscopy context:")
print(timeit.timeit("sparse_spectrum.to('meV', 'spectroscopy')", globals=globals(), number=1000))

print("Reciprocal spectroscopy context:")
print(timeit.timeit("sparse_spectrum.to('meV', 'reciprocal_spectroscopy')", globals=globals(), number=1000))

print("Reciprocal spectroscopy context with parenthesis groups:")
print(timeit.timeit("sparse_spectrum.to('meV', 'bracketed_spectroscopy')", globals=globals(), number=1000))

the output is

Default spectroscopy context:
/Users/adam.jackson/micromamba/envs/euphonic-np-test/lib/python3.10/site-packages/pint/facets/plain/quantity.py:1054: RuntimeWarning: divide by zero encountered in divide
  return self.__class__(other_magnitude / self._magnitude, 1 / self._units)
4.869501542067155
Reciprocal spectroscopy context:
3.7165777920745313
Reciprocal spectroscopy context with parenthesis groups:
2.6495610419660807

So we see that

  • .to() is quite expensive
  • there is almost a performance factor x2 gained by preparing appropriate definitions

Usually a factor x2 is not a big thing to scream about, but this is so slow that it is a credible bottleneck and worth sorting out.

That being said, things narrow a lot when replacing .to() with .ito() to convert in-place (which has a 100x impact in this example!), so we should look for ways to benefit from that inside Euphonic/Abins:

Default spectroscopy context:
/Users/adam.jackson/micromamba/envs/euphonic-np-test/lib/python3.10/site-packages/pint/facets/plain/quantity.py:1054: RuntimeWarning: divide by zero encountered in divide
  return self.__class__(other_magnitude / self._magnitude, 1 / self._units)
0.04182825004681945
Reciprocal spectroscopy context:
0.03280979208648205
Reciprocal spectroscopy context with parenthesis groups:
0.03438079194165766

Performance tests show this gives a significant improvement for calls
to `.to()` and slightly worse performance for (the much-faster) `.ito()`.
@ajjackson
Copy link
Collaborator Author

I think it could be worth refactoring how spectra work; currently they store a numpy array and unit information, and dynamically construct a Quantity when properties are accessed. This is quite expensive, and it would be simpler to store a full Quantity. The current behaviour of "lazily" dealing with unit changes could be retained by deferring an ito() to the property getter.

Presumably some features work directly on the private array, and these would have to deal with an extra call to magnitude. It seems unlikely that this performance benefit outweight the downsides, but we can do some benchmarking if unsure.

Anyway, this can all be a separate PR!

@ajjackson
Copy link
Collaborator Author

ajjackson commented Jul 30, 2024

@mducle @oerc0122 This could use another set of eyes before merging. In principle it is straightforward stuff, but messing around with unit conversions always scares me a bit. Are there any additional sanity checks or stress tests we can try in order to make sure we didn't accidentally give permission to do anything wrong?

Copy link
Collaborator

@oerc0122 oerc0122 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good if it works, just a couple of stylistic questions.

CHANGELOG.rst Show resolved Hide resolved
euphonic/validate.py Outdated Show resolved Hide resolved
Reviewer found the ``_ =`` distracting, but it is nice to clarify here
that we don't care about the return value. Switching to ``.ito()``
also achieves that (and comes with a miniscule performance advantage.)
I hope to use toolz more in future, for now it is just needed for this
one test!
@ajjackson ajjackson added the enhancement New feature or request label Jul 30, 2024
@ajjackson ajjackson enabled auto-merge (squash) July 30, 2024 16:02
@ajjackson ajjackson merged commit 994a4e4 into master Jul 30, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants