Improve handling of spectral units #311

ajjackson · 2024-07-26T16:04:21Z

It looks like spectral intensities are sometimes handled in energy units (such as when imported from CASTEP DOS). If these are converted to/from their proper (reciprocal energy) units, this can cause divide-by-zero errors as pint performs a wavenumber <-> wavelength conversion and takes the reciprocal of the spectral intensities, including empty bins.

Here we introduce another mechanism for dealing with spectral units: a few extra unit conversions in a special "reciprocal_spectroscopy" context (which comes an "rs" alias for convenience). This stops Pint from choking on the 1/(1/cm) <-> 1/meV conversions.

The broadening system is made a bit more robust; previously the variable-width broadening assumed that things can always be converted to energy units.

Perhaps the most serious issue addressed here is the way the Spectrum data "getters" (hidden behind a @property interface) work.

The current spectrum getters create a unit conversion factor and then
multiply the raw array by it. There are two problems with this:

Multiplying any iterator by a Quantity or Unit leads to Pint
checking through all the data for consistency. This can be quite
expensive, and is unnecessary for these trusted array objects. When
iterating over spectra yielded from a Spectrum1DCollection, the cost
can become quite noticeable.
Some unit conversions in the spectroscopy context involve taking
a reciprocal, e.g. conversion from cm -> 1/cm is allowed. But if
this is done to a factor the data itself will not be inverted. If
we start with a Spectrum1D with x_data in wavenumber (1/cm) then
```
spectrum.x_data.to("cm")
```
and
```
spectrum.x_data_unit = "cm"
spectrum.x_data
```
will not give the same result; the second method leaves the actual
values untouched (as cm / (1/cm) = 1).

To further improve these conversions I included some technically-redundant energy-wavenumber conversions in the "reciprocal_spectroscopy" context. With these available, .to(..., "rs") conversions can avoid needlessly taking reciprocals of the data and raising divide-by-zero errors. That should be more efficient, too, but it's not clear if that works in practice.

- Pint tends to choke on conversions between reciprocal energy and wavenumber. These don't come up too often but are an issue when dealing with histograms and spectra. - Here we add a few more "definitions" for Pint to use in such cases. - This creates a slightly scary range of paths around unit conversions. By putting them in a separate context, we can ensure they are only introduced when they are likely to be needed.

DOS units should be reciprocal of energy/frequency axis: this gives dimensionless area (count) and scales properly with unit changes. It looks like DOS was expressed in enegy/frequency units in part to avoid some Pint conversion challenges. The new reciprocal_spectroscopy context should allow those to be done more safely and directly. There are still some more test failures to be examined and cleaned up.

github-actions · 2024-07-26T16:24:34Z

Test Results

22 files ±0 22 suites ±0 29m 15s ⏱️ - 1m 15s
1 048 tests ±0 1 042 ✅ ±0 6 💤 ±0 0 ❌ ±0
10 420 runs ±0 10 357 ✅ ±0 63 💤 ±0 0 ❌ ±0

Results for commit 517e8d0. ± Comparison against base commit 75fe42e.

♻️ This comment has been updated with latest results.

- Broadening logic tries to convert to hartree and back again, but this behaves unpredictably (or otherwide badly) for non-energy units. - Instead we can use the bin units; this still requires that the relevant parameters are dimensionally-consistent _with each other_. - While working on this I noticed that very small bin values (i.e. when using large bin _units_) are incorrectly identified as having equal width due to the "atol" which is intended to stabilise comparisons near zero. Bin widths should always be finite, so we can make this stricter and always base agreement on the relative width.

The current spectrum getters create a unit conversion factor and then multiply the raw array by it. There are two problems with this: - Multiplying any iterator by a Quantity or Unit leads to Pint checking through all the data for consistency. This can be quite expensive, and is unnecessary for these trusted array objects. When iterating over spectra yielded from a Spectrum1DCollection, the cost can become quite noticeable. - Some unit conversions in the spectroscopy context involve taking a reciprocal, e.g. conversion from cm -> 1/cm is allowed. But if this is done to a _factor_ the data itself will not be inverted. If we start with a Spectrum1D with x_data in wavenumber (1/cm) then spectrum.x_data.to("cm") and spectrum.x_data_unit = "cm" spectrum.x_data will not give the same result; the second method leaves the actual values untouched (as cm / (1/cm) = 1).

By providing Pint with "direct" conversion routes between wavenumber and energy (in each direction and inversion), Pint avoids unnecessarily taking the reciprocal of data and encountering divide-by-zero/infinite-value situations. It doesn't look like the existing divide-by-zero situations were causing problems with correctness of results, but its nice to clear the warnings (and save some CPU?) so that when we see this warning we know it is worth investigating.

ajjackson · 2024-07-30T11:40:42Z

I performed a little speed test of conversions in different contexts. With this definition file:

@context reciprocal_spectroscopy = rs
    1 / [frequency] <-> 1 / [length]: 1 / value / speed_of_light
    1 / [energy] -> 1 / [frequency]: planck_constant * value
    1 / [frequency] -> 1 / [energy]: value / planck_constant
    [length] -> 1 / [energy]: value / planck_constant / speed_of_light
    1 / [energy] -> [length]: value * planck_constant * speed_of_light
    1 / [length] -> [energy]: value * planck_constant * speed_of_light
    [energy] -> 1 / [length]: value / planck_constant / speed_of_light
@end

@context bracketed_spectroscopy = bs
    1 / [frequency] <-> 1 / [length]: 1 / (value * speed_of_light)
    1 / [energy] -> 1 / [frequency]: planck_constant * value
    1 / [frequency] -> 1 / [energy]: value / planck_constant
    [length] -> 1 / [energy]: value / (planck_constant * speed_of_light)
    1 / [energy] -> [length]: value * (planck_constant * speed_of_light)
    1 / [length] -> [energy]: value * (planck_constant * speed_of_light)
    [energy] -> 1 / [length]: value / (planck_constant * speed_of_light)
@end

and this speed test code:

import timeit

import numpy as np
from pint import UnitRegistry


ureg = UnitRegistry()
ureg.load_definitions("./definitions.txt")

sparse_data = np.random.random((10000, 200))
sparse_data[sparse_data > 0.5] = 0

sparse_spectrum = ureg.Quantity(sparse_data, "1/cm")

print("Default spectroscopy context:")
print(timeit.timeit("sparse_spectrum.to('meV', 'spectroscopy')", globals=globals(), number=1000))

print("Reciprocal spectroscopy context:")
print(timeit.timeit("sparse_spectrum.to('meV', 'reciprocal_spectroscopy')", globals=globals(), number=1000))

print("Reciprocal spectroscopy context with parenthesis groups:")
print(timeit.timeit("sparse_spectrum.to('meV', 'bracketed_spectroscopy')", globals=globals(), number=1000))

the output is

Default spectroscopy context:
/Users/adam.jackson/micromamba/envs/euphonic-np-test/lib/python3.10/site-packages/pint/facets/plain/quantity.py:1054: RuntimeWarning: divide by zero encountered in divide
  return self.__class__(other_magnitude / self._magnitude, 1 / self._units)
4.869501542067155
Reciprocal spectroscopy context:
3.7165777920745313
Reciprocal spectroscopy context with parenthesis groups:
2.6495610419660807

So we see that

.to() is quite expensive
there is almost a performance factor x2 gained by preparing appropriate definitions

Usually a factor x2 is not a big thing to scream about, but this is so slow that it is a credible bottleneck and worth sorting out.

That being said, things narrow a lot when replacing .to() with .ito() to convert in-place (which has a 100x impact in this example!), so we should look for ways to benefit from that inside Euphonic/Abins:

Default spectroscopy context:
/Users/adam.jackson/micromamba/envs/euphonic-np-test/lib/python3.10/site-packages/pint/facets/plain/quantity.py:1054: RuntimeWarning: divide by zero encountered in divide
  return self.__class__(other_magnitude / self._magnitude, 1 / self._units)
0.04182825004681945
Reciprocal spectroscopy context:
0.03280979208648205
Reciprocal spectroscopy context with parenthesis groups:
0.03438079194165766

Performance tests show this gives a significant improvement for calls to `.to()` and slightly worse performance for (the much-faster) `.ito()`.

ajjackson · 2024-07-30T12:01:52Z

I think it could be worth refactoring how spectra work; currently they store a numpy array and unit information, and dynamically construct a Quantity when properties are accessed. This is quite expensive, and it would be simpler to store a full Quantity. The current behaviour of "lazily" dealing with unit changes could be retained by deferring an ito() to the property getter.

Presumably some features work directly on the private array, and these would have to deal with an extra call to magnitude. It seems unlikely that this performance benefit outweight the downsides, but we can do some benchmarking if unsure.

Anyway, this can all be a separate PR!

ajjackson · 2024-07-30T13:00:59Z

@mducle @oerc0122 This could use another set of eyes before merging. In principle it is straightforward stuff, but messing around with unit conversions always scares me a bit. Are there any additional sanity checks or stress tests we can try in order to make sure we didn't accidentally give permission to do anything wrong?

oerc0122

All looks good if it works, just a couple of stylistic questions.

CHANGELOG.rst

euphonic/validate.py

tests_and_analysis/test/euphonic_test/test_castep_reader.py

Reviewer found the ``_ =`` distracting, but it is nice to clarify here that we don't care about the return value. Switching to ``.ito()`` also achieves that (and comes with a miniscule performance advantage.)

I hope to use toolz more in future, for now it is just needed for this one test!

ajjackson added 2 commits July 25, 2024 16:54

ajjackson added 3 commits July 26, 2024 17:46

Fix spectrum unit checks, CASTEP DOS testing

0f6e69c

CHANGELOG update

029e617

ajjackson marked this pull request as ready for review July 29, 2024 16:02

ajjackson added 2 commits July 29, 2024 17:27

ajjackson added 2 commits July 30, 2024 12:43

Use parens to force reciprocal_spectroscopy application order

960ae06

Performance tests show this gives a significant improvement for calls to `.to()` and slightly worse performance for (the much-faster) `.ito()`.

Cleanup for linter

78961e0

CHANGELOG update

fa0e382

ajjackson requested review from mducle and oerc0122 July 30, 2024 12:56

oerc0122 approved these changes Jul 30, 2024

View reviewed changes

CHANGELOG.rst Show resolved Hide resolved

euphonic/validate.py Outdated Show resolved Hide resolved

tests_and_analysis/test/euphonic_test/test_castep_reader.py Outdated Show resolved Hide resolved

ajjackson added 2 commits July 30, 2024 16:24

Use a non-returning function for unit conversion check

890ac91

Reviewer found the ``_ =`` distracting, but it is nice to clarify here that we don't care about the return value. Switching to ``.ito()`` also achieves that (and comes with a miniscule performance advantage.)

Use toolz for a cleaner data transformation in test suite

517e8d0

I hope to use toolz more in future, for now it is just needed for this one test!

ajjackson added the enhancement New feature or request label Jul 30, 2024

ajjackson enabled auto-merge (squash) July 30, 2024 16:02

ajjackson merged commit 994a4e4 into master Jul 30, 2024
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of spectral units #311

Improve handling of spectral units #311

ajjackson commented Jul 26, 2024 •

edited

Loading

github-actions bot commented Jul 26, 2024 •

edited

Loading

ajjackson commented Jul 30, 2024 •

edited

Loading

ajjackson commented Jul 30, 2024

ajjackson commented Jul 30, 2024 •

edited

Loading

oerc0122 left a comment

Improve handling of spectral units #311

Improve handling of spectral units #311

Conversation

ajjackson commented Jul 26, 2024 • edited Loading

github-actions bot commented Jul 26, 2024 • edited Loading

Test Results

ajjackson commented Jul 30, 2024 • edited Loading

ajjackson commented Jul 30, 2024

ajjackson commented Jul 30, 2024 • edited Loading

oerc0122 left a comment

Choose a reason for hiding this comment

ajjackson commented Jul 26, 2024 •

edited

Loading

github-actions bot commented Jul 26, 2024 •

edited

Loading

ajjackson commented Jul 30, 2024 •

edited

Loading

ajjackson commented Jul 30, 2024 •

edited

Loading