-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve handling of spectral units #311
Conversation
- Pint tends to choke on conversions between reciprocal energy and wavenumber. These don't come up too often but are an issue when dealing with histograms and spectra. - Here we add a few more "definitions" for Pint to use in such cases. - This creates a slightly scary range of paths around unit conversions. By putting them in a separate context, we can ensure they are only introduced when they are likely to be needed.
DOS units should be reciprocal of energy/frequency axis: this gives dimensionless area (count) and scales properly with unit changes. It looks like DOS was expressed in enegy/frequency units in part to avoid some Pint conversion challenges. The new reciprocal_spectroscopy context should allow those to be done more safely and directly. There are still some more test failures to be examined and cleaned up.
- Broadening logic tries to convert to hartree and back again, but this behaves unpredictably (or otherwide badly) for non-energy units. - Instead we can use the bin units; this still requires that the relevant parameters are dimensionally-consistent _with each other_. - While working on this I noticed that very small bin values (i.e. when using large bin _units_) are incorrectly identified as having equal width due to the "atol" which is intended to stabilise comparisons near zero. Bin widths should always be finite, so we can make this stricter and always base agreement on the relative width.
The current spectrum getters create a unit conversion factor and then multiply the raw array by it. There are two problems with this: - Multiplying any iterator by a Quantity or Unit leads to Pint checking through all the data for consistency. This can be quite expensive, and is unnecessary for these trusted array objects. When iterating over spectra yielded from a Spectrum1DCollection, the cost can become quite noticeable. - Some unit conversions in the spectroscopy context involve taking a reciprocal, e.g. conversion from cm -> 1/cm is allowed. But if this is done to a _factor_ the data itself will not be inverted. If we start with a Spectrum1D with x_data in wavenumber (1/cm) then spectrum.x_data.to("cm") and spectrum.x_data_unit = "cm" spectrum.x_data will not give the same result; the second method leaves the actual values untouched (as cm / (1/cm) = 1).
By providing Pint with "direct" conversion routes between wavenumber and energy (in each direction and inversion), Pint avoids unnecessarily taking the reciprocal of data and encountering divide-by-zero/infinite-value situations. It doesn't look like the existing divide-by-zero situations were causing problems with correctness of results, but its nice to clear the warnings (and save some CPU?) so that when we see this warning we know it is worth investigating.
I performed a little speed test of conversions in different contexts. With this definition file:
and this speed test code: import timeit
import numpy as np
from pint import UnitRegistry
ureg = UnitRegistry()
ureg.load_definitions("./definitions.txt")
sparse_data = np.random.random((10000, 200))
sparse_data[sparse_data > 0.5] = 0
sparse_spectrum = ureg.Quantity(sparse_data, "1/cm")
print("Default spectroscopy context:")
print(timeit.timeit("sparse_spectrum.to('meV', 'spectroscopy')", globals=globals(), number=1000))
print("Reciprocal spectroscopy context:")
print(timeit.timeit("sparse_spectrum.to('meV', 'reciprocal_spectroscopy')", globals=globals(), number=1000))
print("Reciprocal spectroscopy context with parenthesis groups:")
print(timeit.timeit("sparse_spectrum.to('meV', 'bracketed_spectroscopy')", globals=globals(), number=1000)) the output is
So we see that
Usually a factor x2 is not a big thing to scream about, but this is so slow that it is a credible bottleneck and worth sorting out. That being said, things narrow a lot when replacing
|
Performance tests show this gives a significant improvement for calls to `.to()` and slightly worse performance for (the much-faster) `.ito()`.
I think it could be worth refactoring how spectra work; currently they store a numpy array and unit information, and dynamically construct a Quantity when properties are accessed. This is quite expensive, and it would be simpler to store a full Quantity. The current behaviour of "lazily" dealing with unit changes could be retained by deferring an Presumably some features work directly on the private array, and these would have to deal with an extra call to Anyway, this can all be a separate PR! |
@mducle @oerc0122 This could use another set of eyes before merging. In principle it is straightforward stuff, but messing around with unit conversions always scares me a bit. Are there any additional sanity checks or stress tests we can try in order to make sure we didn't accidentally give permission to do anything wrong? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good if it works, just a couple of stylistic questions.
Reviewer found the ``_ =`` distracting, but it is nice to clarify here that we don't care about the return value. Switching to ``.ito()`` also achieves that (and comes with a miniscule performance advantage.)
I hope to use toolz more in future, for now it is just needed for this one test!
It looks like spectral intensities are sometimes handled in energy units (such as when imported from CASTEP DOS). If these are converted to/from their proper (reciprocal energy) units, this can cause divide-by-zero errors as pint performs a wavenumber <-> wavelength conversion and takes the reciprocal of the spectral intensities, including empty bins.
Here we introduce another mechanism for dealing with spectral units: a few extra unit conversions in a special "reciprocal_spectroscopy" context (which comes an "rs" alias for convenience). This stops Pint from choking on the
1/(1/cm) <-> 1/meV
conversions.The broadening system is made a bit more robust; previously the variable-width broadening assumed that things can always be converted to energy units.
Perhaps the most serious issue addressed here is the way the Spectrum data "getters" (hidden behind a
@property
interface) work.The current spectrum getters create a unit conversion factor and then
multiply the raw array by it. There are two problems with this:
Multiplying any iterator by a Quantity or Unit leads to Pint
checking through all the data for consistency. This can be quite
expensive, and is unnecessary for these trusted array objects. When
iterating over spectra yielded from a Spectrum1DCollection, the cost
can become quite noticeable.
Some unit conversions in the spectroscopy context involve taking
a reciprocal, e.g. conversion from cm -> 1/cm is allowed. But if
this is done to a factor the data itself will not be inverted. If
we start with a Spectrum1D with x_data in wavenumber (1/cm) then
and
will not give the same result; the second method leaves the actual
values untouched (as cm / (1/cm) = 1).
To further improve these conversions I included some technically-redundant energy-wavenumber conversions in the "reciprocal_spectroscopy" context. With these available,
.to(..., "rs")
conversions can avoid needlessly taking reciprocals of the data and raising divide-by-zero errors. That should be more efficient, too, but it's not clear if that works in practice.