The three factors contributing to the 1000x (10x10x10) speedup #8
Effective instruction pipelining, branch prediction, and very few L1/L2 cache misses are large factors in why our SIMD forward kinematics and collision checking (FK/CC) kernels are fast; these all result from the code generated by the tracing compilation process. However, algorithmic changes to how motions between two configurations are validated provide the other "10x" factors (although, to be clear, they are not each a 10x speed-up). These changes are all oriented around determining whether a motion between two configurations is invalid (i.e., whether there exists a configuration along the edge between them that is in collision) as quickly as possible, that is, with the least computational effort. These changes are:
These components work in tandem with the SIMD-parallel FK/CC kernel and combine to provide the total observed speed-up. We plan to provide detailed ablations of our approach in a future paper that will show where all the speed-ups come from. Also, feel free to ask any other questions about the work.
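To make the "invalidate the motion with the least effort" idea concrete, here is a minimal Python sketch (not the authors' code; `interpolate`, `validate_motion`, and the batch size are all hypothetical names and choices) of checking the configurations along an edge in fixed-size batches, the way SIMD lanes evaluate several configurations per kernel call, and exiting as soon as any configuration in a batch is in collision:

```python
def interpolate(q_a, q_b, t):
    """Linearly interpolate between configurations q_a and q_b at t in [0, 1]."""
    return [a + t * (b - a) for a, b in zip(q_a, q_b)]

def validate_motion(q_a, q_b, in_collision, resolution=64, batch=8):
    """Return True if every checked configuration along the edge is collision-free.

    Configurations are evaluated in batches of `batch` lanes (standing in for
    one SIMD FK/CC call); validation exits early on the first colliding batch.
    """
    ts = [i / (resolution - 1) for i in range(resolution)]
    for start in range(0, resolution, batch):
        lanes = [interpolate(q_a, q_b, t) for t in ts[start:start + batch]]
        # One "SIMD call" checks all lanes at once; any hit invalidates the edge.
        if any(in_collision(q) for q in lanes):
            return False
    return True
```

The early exit is where the "as fast as possible" framing pays off: an invalid edge costs only as many batches as it takes to find the first collision, not a full sweep of the edge.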
I'm Hirokazu Ishida; I spoke with you twice at ICRA 2024 to discuss your work, @zkingston. During the poster session, I asked about the 1000x speedup of your algorithm, considering that SIMD itself contributes at most about 10x. I initially considered emailing, but decided that a GitHub discussion would be the best place to share our conversation with others.
If I recall correctly, you explained that SIMD is indeed only one of three factors, each contributing roughly a 10x speedup, for about 1000x overall. Due to my limited English proficiency, I couldn't fully grasp the explanation, but I remember you mentioned "parallelism" in the context of the other two factors, which left me wondering how the term "parallelism" could appear outside a SIMD context.
Using the non-SIMD "parallelism" as a clue, I did some research and realized that CPU instruction pipelining must be the key. I now understand that the "parallelism" you mentioned might refer to CPU pipelining. Since branch misprediction is one of the main causes of pipeline stalls, it makes sense to reduce branching at compile time, which is emphasized in your paper.
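As a small illustration of the branch-reduction idea (my own hypothetical example in Python, not code from the paper): a data-dependent `if` inside a hot loop is hard for the branch predictor when the data is irregular, whereas using the comparison result directly as a 0/1 value compiles, in a language like C++, to a compare-and-add with no conditional jump in the loop body:

```python
def count_hits_branchy(distances, threshold):
    # Data-dependent branch inside the loop: whether it is taken
    # depends on each element, so prediction can fail often.
    count = 0
    for d in distances:
        if d < threshold:
            count += 1
    return count

def count_hits_branchless(distances, threshold):
    # The comparison result is consumed as 0/1 arithmetic instead of
    # steering control flow; no per-element conditional jump is needed.
    return sum(int(d < threshold) for d in distances)
```

Both functions compute the same count; the difference only matters for the compiled machine code, which is why eliminating branches at code-generation time fits naturally with a tracing compiler.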
Based on this, my current understanding is that the 10 x 10 x 10 factors can be attributed to:
Also, in the FK, the tracing compiler contributes to all three elements.
I would greatly appreciate it if you could confirm whether my understanding of these three key attributes is accurate.
P.S. I'm deeply grateful for your generous responses to my questions at ICRA, despite my poor language skills. I believe this research will change the landscape of the motion planning community, and I'm eagerly awaiting your follow-up papers!