The three factors contributing to the 1000x (10x10x10) speedup #8
Effective instruction pipelining, branch prediction, and very few L1/L2 cache misses are large factors in why our SIMD forward kinematics and collision checking (FK/CC) kernels are fast; these all result from the code generated by the tracing compilation process. However, algorithmic changes to how motions between two configurations are validated provide the other "10x" factors (although, to be clear, they are not each a 10x speed-up). These changes are all oriented around determining whether a motion between two configurations is invalid (i.e., whether there exists a configuration along the edge between them that is in collision) as quickly as possible, that is, with the least computational effort. These changes are:
These components work in tandem with the SIMD-parallel FK/CC kernel and combine to provide the total observed speed-up. We plan to provide detailed ablations of our approach in a future paper that will show where all the speed-ups come from. Also, feel free to ask any other questions about the work.
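To make the "invalidate the motion with the least effort" idea concrete, here is a minimal Python sketch (not the authors' code; `interpolate`, `validate_motion`, and the batch size are all hypothetical names and choices) of checking the configurations along an edge in fixed-size batches, the way SIMD lanes evaluate several configurations per kernel call, and exiting as soon as any configuration in a batch is in collision:

```python
def interpolate(q_a, q_b, t):
    """Linearly interpolate between configurations q_a and q_b at t in [0, 1]."""
    return [a + t * (b - a) for a, b in zip(q_a, q_b)]

def validate_motion(q_a, q_b, in_collision, resolution=64, batch=8):
    """Return True if every checked configuration along the edge is collision-free.

    Configurations are evaluated in batches of `batch` lanes (standing in for
    one SIMD FK/CC call); validation exits early on the first colliding batch.
    """
    ts = [i / (resolution - 1) for i in range(resolution)]
    for start in range(0, resolution, batch):
        lanes = [interpolate(q_a, q_b, t) for t in ts[start:start + batch]]
        # One "SIMD call" checks all lanes at once; any hit invalidates the edge.
        if any(in_collision(q) for q in lanes):
            return False
    return True
```

The early exit is where the "as fast as possible" framing pays off: an invalid edge costs only as many batches as it takes to find the first collision, not a full sweep of the edge.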
I'm Hirokazu Ishida; I spoke with you twice at ICRA 2024 to discuss your work, @zkingston. During the poster session, I asked about the 1000x speedup of your algorithm, considering that SIMD itself contributes at most about 10x. I initially considered emailing, but decided that a GitHub discussion would be the best place to share our conversation with others.
If I recall correctly, you explained that SIMD is indeed only one of three factors, each contributing roughly a 10x speedup, for about 1000x overall. Due to my limited English proficiency, I couldn't fully grasp the explanation, but I remember you mentioned "parallelism" in the context of the other two factors, which left me wondering how the term "parallelism" could appear outside a SIMD context.
Using the non-SIMD "parallelism" as a clue, I did some research and realized that CPU instruction pipelining must be the key. I now understand that the "parallelism" you mentioned might refer to CPU pipelining. Since branch misprediction is one of the main causes of pipeline stalls, it makes sense to reduce branching at compile time, which is emphasized in your paper.
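As a small illustration of the branch-reduction idea (my own hypothetical example in Python, not code from the paper): a data-dependent `if` inside a hot loop is hard for the branch predictor when the data is irregular, whereas using the comparison result directly as a 0/1 value compiles, in a language like C++, to a compare-and-add with no conditional jump in the loop body:

```python
def count_hits_branchy(distances, threshold):
    # Data-dependent branch inside the loop: whether it is taken
    # depends on each element, so prediction can fail often.
    count = 0
    for d in distances:
        if d < threshold:
            count += 1
    return count

def count_hits_branchless(distances, threshold):
    # The comparison result is consumed as 0/1 arithmetic instead of
    # steering control flow; no per-element conditional jump is needed.
    return sum(int(d < threshold) for d in distances)
```

Both functions compute the same count; the difference only matters for the compiled machine code, which is why eliminating branches at code-generation time fits naturally with a tracing compiler.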
Based on this, my current understanding is that the 10 x 10 x 10 factors can be attributed to:
Also, in the FK, the tracing compiler contributes to all three elements.
I would greatly appreciate it if you could confirm whether my understanding of these three key attributes is accurate.
P.S. I'm deeply grateful for your generous responses to my questions at ICRA, despite my poor language skills. I believe this research will change the landscape of the motion planning community, and I'm eagerly awaiting your follow-up papers!