
Optimize Face Bounds Computation #877

Merged
merged 28 commits into main on Aug 9, 2024

Conversation

@philipc2 (Member) commented Aug 5, 2024

Closes #730

Overview

  • Added Numba-decorated versions of the NumPy functions (allclose, isclose, cross, dot) used by the helper functions that build the face bounds
  • Added a global option to enable or disable FMA (fused multiply-add) instructions

Expected Usage

import uxarray as ux

grid_path = "/path/to/grid.nc"
data_path = "/path/to/data.nc"

uxds = ux.open_dataset(grid_path, data_path)

# this is how you use this function
some_output = uxds.some_function()

# this is another way to use this function
other_output = uxds.some_function(some_param = True)
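
To make the placeholder above concrete, a minimal sketch of what accessing the new face bounds might look like (the bounds property name is shown for illustration and may not match the final API):

import uxarray as ux

grid_path = "/path/to/grid.nc"
data_path = "/path/to/data.nc"

uxds = ux.open_dataset(grid_path, data_path)

# access the per-face bounds through the grid object; property name is illustrative
face_bounds = uxds.uxgrid.bounds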

@philipc2 added the run-benchmark (Run ASV benchmark workflow) label Aug 5, 2024
@philipc2 self-assigned this Aug 5, 2024

github-actions bot commented Aug 5, 2024

ASV Benchmarking

Benchmark Comparison Results

Benchmarks that have improved:

| Change | Before [1790e0e] | After [2678371] | Ratio | Benchmark (Parameter) |
|--------|------------------|-----------------|-------|-----------------------|
| - | 13.8±0.03s | 1.44±0.02s | 0.1 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc')) |
| - | 2.09±0.01s | 186±5ms | 0.09 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc')) |
| - | 19.5±0.08s | 1.69±0.01s | 0.09 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc')) |
| - | 74.6±0.4ms | 7.81±0.05ms | 0.1 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc')) |
| - | 401M | 347M | 0.87 | mpas_ocean.Integrate.peakmem_integrate('480km') |

Benchmarks that have stayed the same:

| Change | Before [1790e0e] | After [2678371] | Ratio | Benchmark (Parameter) |
|--------|------------------|-----------------|-------|-----------------------|
|  | 1.63±0.01s | 1.64±0.01s | 1.00 | import.Imports.timeraw_import_uxarray |
|  | 621±5ms | 639±7ms | 1.03 | mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('120km') |
|  | 39.1±0.4ms | 39.3±0.8ms | 1.00 | mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('480km') |
|  | 1.83±0.06ms | 1.60±0.2ms | ~0.87 | mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('120km') |
|  | 471±7μs | 478±6μs | 1.01 | mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('480km') |
|  | 1.18±0μs | 1.21±0μs | 1.03 | mpas_ocean.ConstructTreeStructures.time_ball_tree('120km') |
|  | 292±1ns | 305±6ns | 1.04 | mpas_ocean.ConstructTreeStructures.time_ball_tree('480km') |
|  | 778±3ns | 834±3ns | 1.07 | mpas_ocean.ConstructTreeStructures.time_kd_tree('120km') |
|  | 279±1ns | 287±2ns | 1.03 | mpas_ocean.ConstructTreeStructures.time_kd_tree('480km') |
|  | 393M | 397M | 1.01 | mpas_ocean.GeoDataFrame.peakmem_to_geodataframe('120km', False) |
|  | 383M | 385M | 1.00 | mpas_ocean.GeoDataFrame.peakmem_to_geodataframe('120km', True) |
|  | 355M | 354M | 1.00 | mpas_ocean.GeoDataFrame.peakmem_to_geodataframe('480km', False) |
|  | 354M | 354M | 1.00 | mpas_ocean.GeoDataFrame.peakmem_to_geodataframe('480km', True) |
|  | 1.21±0s | 1.22±0.01s | 1.01 | mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', False) |
|  | 58.5±0.6ms | 58.5±0.4ms | 1.00 | mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', True) |
|  | 97.4±0.3ms | 94.7±0.4ms | 0.97 | mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', False) |
|  | 5.30±0.1ms | 5.41±0.1ms | 1.02 | mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', True) |
|  | 266M | 266M | 1.00 | mpas_ocean.Gradient.peakmem_gradient('120km') |
|  | 243M | 244M | 1.00 | mpas_ocean.Gradient.peakmem_gradient('480km') |
|  | 2.68±0.02ms | 2.64±0.01ms | 0.98 | mpas_ocean.Gradient.time_gradient('120km') |
|  | 293±4μs | 285±3μs | 0.97 | mpas_ocean.Gradient.time_gradient('480km') |
|  | 365M | 365M | 1.00 | mpas_ocean.Integrate.peakmem_integrate('120km') |
|  | 186±10ms | 177±1ms | 0.95 | mpas_ocean.Integrate.time_integrate('120km') |
|  | 11.8±0.1ms | 12.1±0.02ms | 1.02 | mpas_ocean.Integrate.time_integrate('480km') |
|  | 351±3ms | 347±3ms | 0.99 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'exclude') |
|  | 349±3ms | 347±2ms | 0.99 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'include') |
|  | 350±0.5ms | 353±4ms | 1.01 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'split') |
|  | 21.8±0.2ms | 22.4±0.3ms | 1.03 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'exclude') |
|  | 22.9±0.2ms | 22.2±0.2ms | 0.97 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'include') |
|  | 22.2±0.2ms | 22.4±0.3ms | 1.01 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'split') |
|  | 54.9±0.05ms | 55.0±0.2ms | 1.00 | mpas_ocean.RemapDownsample.time_inverse_distance_weighted_remapping |
|  | 44.3±0.2ms | 44.3±0.2ms | 1.00 | mpas_ocean.RemapDownsample.time_nearest_neighbor_remapping |
|  | 360±1ms | 361±1ms | 1.00 | mpas_ocean.RemapUpsample.time_inverse_distance_weighted_remapping |
|  | 264±0.9ms | 264±0.9ms | 1.00 | mpas_ocean.RemapUpsample.time_nearest_neighbor_remapping |
|  | 234M | 239M | 1.02 | quad_hexagon.QuadHexagon.peakmem_open_dataset |
|  | 232M | 239M | 1.03 | quad_hexagon.QuadHexagon.peakmem_open_grid |
|  | 7.10±0.2ms | 7.02±0.2ms | 0.99 | quad_hexagon.QuadHexagon.time_open_dataset |
|  | 5.99±0.02ms | 6.03±0.04ms | 1.01 | quad_hexagon.QuadHexagon.time_open_grid |

Benchmarks that have got worse:

| Change | Before [1790e0e] | After [2678371] | Ratio | Benchmark (Parameter) |
|--------|------------------|-----------------|-------|-----------------------|
| + | 243M | 376M | 1.55 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc')) |
| + | 303M | 377M | 1.25 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc')) |
| + | 321M | 381M | 1.19 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc')) |
| + | 318M | 378M | 1.19 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc')) |

@philipc2 mentioned this pull request Aug 6, 2024
@philipc2 (Member Author) commented Aug 7, 2024

@hongyuchen1030

For the first step in optimizing, I've looked at decorating some of the commonly called functions with Numba and used a Numba implementation of isclose. This looks to be about a 15-20% performance improvement to begin with.

Next step will be to eliminate any unnecessary coordinate conversion calls, such as _xyz_to_lonlat_rad

This might look like us providing both the latlon_rad and cartesian points in the function call to avoid needing to convert. Something like:

 def _point_within_gca(pt_latlon,
                       pt_xyz,
                       gca_a_latlon,
                       gca_a_xyz,
                       gca_b_latlon,
                       gca_b_xyz,
                       is_directed=False):

Hopefully, once we've moved these conversions out of each function call, we can see if the remaining logic can be decorated with Numba.
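
As a rough sketch of the decorating approach (illustrative only, with simplified tolerance handling; not the exact helpers in this PR):

import numpy as np
from numba import njit

ERROR_TOLERANCE = 1.0e-15  # placeholder value for illustration

@njit(cache=True)
def isclose_nb(a, b, atol=ERROR_TOLERANCE):
    # element-wise |a - b| <= atol without NumPy's Python-level overhead
    return np.abs(a - b) <= atol

@njit(cache=True)
def allclose_nb(a, b, atol=ERROR_TOLERANCE):
    return np.all(np.abs(a - b) <= atol)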

@philipc2 (Member Author) commented Aug 7, 2024

@hongyuchen1030

if np.allclose(angle, np.pi, rtol=0, atol=ERROR_TOLERANCE):

I had a question about this and other parts of the code where the Cartesian coordinates are converted to latlon in radians. Are we able to use the node_lon and node_lat that are part of the Grid (in degrees) instead of needing to run the conversion? Additionally, is radians here just a preference or can we use degrees?

@hongyuchen1030 (Contributor) commented Aug 7, 2024

Next step will be to eliminate any unnecessary coordinate conversion calls, such as _xyz_to_lonlat_rad

This might look like us providing both the latlon_rad and cartesian points in the function call to avoid needing to convert. Something like:

 def _point_within_gca(pt_latlon,
                       pt_xyz,
                       gca_a_latlon,
                       gca_a_xyz,
                       gca_b_latlon,
                       gca_b_xyz,
                       is_directed=False):

I have also been considering this design today. It will be highly beneficial to call the coordinate conversion only once, both for performance and accuracy.

We can extract all functions that take node coordinates as input and ensure that both Cartesian and spherical coordinates are required as parameters.

I had a question about this and other parts of the code where the Cartesian coordinates are converted to latlon in radians. Are we able to use the node_lon and node_lat that are part of the Grid (in degrees) instead of needing to run the conversion? Additionally, is radians here just a preference or can we use degrees?

That is another reason why I have been advocating for better documentation and conventions for the coordinate system.

Currently, we have the following coordinate systems:

  1. Spherical coordinates in degrees, for users.
  2. Spherical coordinates in radians, because the trigonometric functions in NumPy (np.sin, np.cos, etc.) operate in radians, along with some other reasons we settled on long ago.
  3. Cartesian coordinates with normalization.
  4. Cartesian coordinates without normalization (we still haven't confirmed whether the unnormalized node data in MPAS and other grids equal the normalized coordinates multiplied by the Earth's radius).

Before we attempt to rewrite everything, we must ensure which coordinate systems we are using across the entire project.
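
For reference, the kind of conversion being discussed can be sketched as follows (assuming unit-sphere input and a [0, 2π) longitude convention; the actual _xyz_to_lonlat_rad helper may differ):

import numpy as np

def xyz_to_lonlat_rad(x, y, z):
    # assumes the point is already normalized onto the unit sphere
    lon = np.arctan2(y, x) % (2.0 * np.pi)   # wrap longitude to [0, 2*pi)
    lat = np.arcsin(np.clip(z, -1.0, 1.0))   # clip guards against round-off
    return lon, lat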

@philipc2 (Member Author) commented Aug 7, 2024

I have also been considering this design today. It will be highly beneficial to call the coordinate conversion only once, both for performance and accuracy.

Agreed, was thinking of this as well.

That is another reason why I have been advocating for better documentation and conventions for the coordinate system.

We have some documentation here in our user guide. https://uxarray.readthedocs.io/en/latest/user-guide/representation.html

Any suggestions on how we could improve this?

Currently, we have the following coordinate systems:

Wondering if it would make sense to include a node_lon_rad and node_lat_rad extension to how we represent our coordinates. I've seen some other grids store the coordinates in both degrees and radians, so this might be a worthwhile extension (and would help with some of the conversions we are doing).

Hopefully the normalization inconsistencies will be resolved with #878
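
For concreteness, caching radian copies alongside the degree coordinates could be as simple as the sketch below (node_lon_rad and node_lat_rad are the proposed names, not existing attributes):

import numpy as np
import uxarray as ux

uxgrid = ux.open_grid("/path/to/grid.nc")

# hypothetical cached radian variants, computed once from the degree coordinates
node_lon_rad = np.deg2rad(uxgrid.node_lon.values)
node_lat_rad = np.deg2rad(uxgrid.node_lat.values)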

@hongyuchen1030 (Contributor)

We have some documentation here in our user guide. https://uxarray.readthedocs.io/en/latest/user-guide/representation.html

Any suggestions on how we could improve this?
[screenshot of the coordinate representation documentation]

  1. I understand we already specify the convention underneath; however, the node_lat has the description specifying the units while others do not. It would be great to keep them uniform. Additionally, placing the note before the detailed description would enhance clarity.

  2. Just a note: I think we didn't use the convention of [-180, 180] until we updated the coordinates conversion. The old conversion function used the convention of [0, 360). All my algorithms are based on [0, 2π) as well. I highly recommend using [0, 2π) or [0, 360) since it provides an easier way to handle the antimeridian faces. This also necessitates a thorough re-check after any major functions are updated.

  3. It might also be beneficial to specify that we have a unique definition for the antimeridian points and the pole points.

Wondering if it would make sense to include a node_lon_rad and node_lat_rad extension to how we represent our coordinates. I've seen some other grids store the coordinates in both degrees and radians, so this might be a worthwhile extension (and would help with some of the conversions we are doing).

I think what we've discovered so far is that the degrees are solely for the users, while all calculations use radians since the numpy trigonometry functions operate in radians. I'm always open to utilizing storage space to reduce time complexity.

@philipc2 (Member Author) commented Aug 7, 2024

I understand we already specify the convention underneath; however, the node_lat has the description specifying the units while others do not. It would be great to keep them uniform. Additionally, placing the note before the detailed description would enhance clarity.

That's a good catch!

Just a note: I think we didn't use the convention of [-180, 180] until we updated the coordinates conversion. The old conversion function used the convention of [0, 360). All my algorithms are based on [0, 2π) as well. I highly recommend using [0, 2π) or [0, 360) since it provides an easier way to handle the antimeridian faces. This also necessitates a thorough re-check after any major functions are updated.

We use -180 to 180 since it makes certain visualization tasks (i.e. projections and the use of the antimeridian package) simpler. Most of our datasets are also already in the -180 to 180 range.

I think what we've discovered so far is that the degrees are solely for the users, while all calculations use radians since the numpy trigonometry functions operate in radians. I'm always open to utilizing storage space to reduce time complexity.

This isn't fully the case. Many of our internal functions (e.g., BallTree) use degrees.

numpy trigonometry functions operate in radians

One option could be to use SciPy's degree-based implementations, but I'm not sure how much this implementation differs from NumPy's:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.sindg.html#scipy.special.sindg

@philipc2 (Member Author) commented Aug 7, 2024

@hongyuchen1030

Would you mind explaining this code snippet for me?

d_a_max = (
    np.clip(d_a_max, 0, 1)
    if np.isclose(d_a_max, [0, 1], atol=ERROR_TOLERANCE).any()
    else d_a_max
)

@hongyuchen1030 (Contributor)

@hongyuchen1030

Would you mind explaining this code snippet for me?

d_a_max = (
    np.clip(d_a_max, 0, 1)
    if np.isclose(d_a_max, [0, 1], atol=ERROR_TOLERANCE).any()
    else d_a_max
)

Sure! Would you like to hop on a quick call?

@hongyuchen1030 (Contributor)

We use -180 to 180 since it makes certain visualization tasks (i.e., projections and the use of the antimeridian package) simpler. Most of our datasets are also already in the -180 to 180 range.

I understand the benefits of using the -180 to 180 range for visualization tasks and consistency with existing datasets. My algorithms do not expose the longitude range to users, nor do they modify existing longitudes, so this shouldn't pose a significant issue. However, I wanted to bring this up because if we require functions to take node_lat and node_lon from the grid, we may need to quickly check if longitude conversion is necessary. This note could be helpful for troubleshooting future bugs.

This isn't fully the case. Many of our internal functions (e.g., BallTree) use degrees.

Is this because we directly use the grid.node_lat and grid.node_lon without calling the numpy trigonometric functions?

One option could be to use Scipy's degree implementations, but I'm not sure how much this implementation differs from Numpy's.

I compared the source code for Scipy's sindg and Numpy's sin. Here’s what I found:

  1. Numpy’s sin is a wrapper for the compiled c_sin, with the source code for s_sin available here.
  2. Scipy’s sindg was implemented recently in this .h file, and it still converts some variables into radians.

I haven't delved deeply into the internal workings of both functions, but we could perform a simple benchmark to compare their running times and compatibility if needed.
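
A quick comparison along these lines would do (illustrative only; the array size and repeat count are arbitrary, and timings will vary by machine):

import timeit
import numpy as np
from scipy.special import sindg

deg = np.random.uniform(-180.0, 180.0, 1_000_000)

t_np = timeit.timeit(lambda: np.sin(np.deg2rad(deg)), number=20)
t_sp = timeit.timeit(lambda: sindg(deg), number=20)

print(f"np.sin(np.deg2rad(x)):  {t_np:.3f}s")
print(f"scipy.special.sindg(x): {t_sp:.3f}s")

# agreement check between the two paths
print(np.max(np.abs(np.sin(np.deg2rad(deg)) - sindg(deg))))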

@philipc2 (Member Author) commented Aug 7, 2024

@hongyuchen1030

After some further optimizations, this is the point that I've gotten to. Looks like the pyfma implementations still contribute to a pretty significant slowdown, taking up the majority of the execution time.

FMA Disabled

~5.73s Execution Time
[profiler screenshot]

FMA Enabled

~14.1s Execution Time

[profiler screenshot]

Do you think anything from here could possibly act as an alternative to using pyfma?

https://docs.scipy.org/doc/scipy/reference/linalg.blas.html

@philipc2 (Member Author) commented Aug 8, 2024

A few more refinements, and with FMA disabled, I was able to get down to 3.74s.

[profiler screenshot]

@philipc2 (Member Author) commented Aug 8, 2024

[benchmark results screenshot]

From the benchmarks run above, it looks like almost an 8-10x improvement in performance across the board with these changes.

I was also able to compute the bounds on a 30km MPAS atmosphere grid (655k faces, 1,310,000 nodes) in about 40 seconds, which is a significant improvement and wouldn't have been possible without these optimizations.

@hongyuchen1030 (Contributor)

@hongyuchen1030

After some further optimizations, this is the point that I've gotten to. Looks like the pyfma implementations still contribute to a pretty significant slowdown, taking up the majority of the execution time.

FMA Disabled

~5.73s Execution Time

FMA Enabled

~14.1s Execution Time


Do you think anything from here could possibly act as an alternative to using pyfma?

https://docs.scipy.org/doc/scipy/reference/linalg.blas.html

I've been giving this case a lot of thought, considering my research findings and the intricacies of floating point operations.

Given the technical details involved with floating point operations, I won't delve into too much detail to avoid overwhelming you. Here's a summary of my thoughts:

  1. Maintain Current Approach: We can keep the current implementation as it is, but set is_FMA_enable to False globally and use the basic np.cross. We'll document this choice, informing users that they can opt for a more accurate method by enabling FMA.
  2. Adjust Error Tolerance: The immediate task would be to reset the error tolerance for intersection point snapping. This warrants a separate issue to be opened specifically for this. And since we are not allowed to use any compiled files in our project code base, we cannot do much about it now.

Rationale:

  1. Algorithm Accuracy: While CDO uses Kahan's Algorithm with FMA, my research indicates there's a more accurate method available. This method works with and without FMA, though the non-FMA version requires more flops. We can explore this choice further as we develop in Python.
  2. Floating Point Operations: Standard floating point operations generally produce acceptable results. We can monitor their reliability by calculating the condition number. I'm working on an adaptive algorithm to address this in the future.
  3. Task Complexity: As you might have noticed, all these floating point operations in Python inevitably lead to compiled C. Although all these packages use the same compiled C function, their binding and wrapping are different, making them difficult to connect properly.

Conclusion:

Since FMA isn't currently ready for the Python package, we can focus on the general case first and handle the ill-conditioned cases later. The package should primarily use normal floating point operations, and when an ill-conditioned case arises, it's not overly time-consuming to apply the more accurate operations to this subset.

If you have any questions or need further clarification, please don't hesitate to ask. I'm more than happy to discuss and explain everything.
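
For reference, the FMA-based path under discussion (Kahan-style difference of products via pyfma) next to the plain np.cross fallback might be sketched as follows (all names, including the global flag, are illustrative rather than the actual API in this PR):

import numpy as np
import pyfma

ENABLE_FMA = False  # hypothetical global toggle, disabled by default

def _diff_of_products(a, b, c, d):
    # Kahan's algorithm for a*b - c*d: one extra fma recovers the rounding
    # error of the c*d term, greatly reducing cancellation error
    w = c * d
    err = pyfma.fma(-c, d, w)
    diff = pyfma.fma(a, b, -w)
    return diff + err

def cross_fma(a, b):
    x = _diff_of_products(a[1], b[2], a[2], b[1])
    y = _diff_of_products(a[2], b[0], a[0], b[2])
    z = _diff_of_products(a[0], b[1], a[1], b[0])
    return np.array([x, y, z])

def cross(a, b):
    # accurate-but-slower FMA path when enabled, plain NumPy otherwise
    return cross_fma(a, b) if ENABLE_FMA else np.cross(a, b)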

@philipc2 (Member Author) commented Aug 8, 2024

Maintain Current Approach: We can keep the current implementation as it is, but set is_FMA_enable to False globally and use the basic np.cross. We'll document this choice, informing users that they can opt for a more accurate method by enabling FMA.

I think this option makes the most sense to me and would also fit the expectations of our users.

And since we are not allowed to use any compiled files in our project code base, we cannot do much about it now.

I wouldn't say that we aren't allowed; it's more that we haven't explored it yet. It might be worth considering.

I appreciate the write-up and am glad that we're making some good improvements to both the performance and accuracy of our operators.

@philipc2 changed the title from "DRAFT: Optimize Face Bounds Computation" to "Optimize Face Bounds Computation" Aug 9, 2024
@philipc2 linked an issue Aug 9, 2024 that may be closed by this pull request
@philipc2 added the scalability (Related to scalability & performance efforts) label Aug 9, 2024
@philipc2 added this to the Scalability & Performance milestone Aug 9, 2024
@philipc2 marked this pull request as ready for review August 9, 2024 05:57
@philipc2 (Member Author) commented Aug 9, 2024

Merging this so that we can continue our work with #785

Overall, I was able to get about a 10-12x performance improvement by using Numba without major rewrites to the algorithm. There are still some more optimizations we could do, but they would require significant rewrites, which should not be a priority right now.

@philipc2 merged commit 1a68daa into main Aug 9, 2024
15 of 16 checks passed
@hongyuchen1030 (Contributor)

Merging this so that we can continue our work with #785

Overall, I was able to get about a 10-12x performance improvement by using Numba without major rewrites to the algorithm. There are still some more optimizations we could do, but they would require significant rewrites, which should not be a priority right now.

Thank you very much for your excellent work! I will open another issue and a PR about our rewriting decisions for the helper functions.

@philipc2 (Member Author) commented Aug 9, 2024

Merging this so that we can continue our work with #785
Overall, I was able to get about a 10-12x performance improvement by using Numba without major rewrites to the algorithm. There are still some more optimizations we could do, but they would require significant rewrites, which should not be a priority right now.

Thank you very much for your excellent work! I will open another issue and a PR about our rewriting decisions for the helper functions.

Sounds good!
