A single factor that controls clang's optimization of tuple_cat #14
Apologies for not mentioning this beforehand - I actually already knew this. I had removed the forwarding behaviour from the very first version (my first personal implementation) during the migration and saw that it caused this change in behaviour. The main reason I didn’t worry too much is that less trivial uses of tuple_cat fail to optimise away for either GCC or clang anyway. My demo code just happened to be the perfect storm to make further (better) revisions to the code seem like regressions 😅 If you pull up the first Godbolt link I sent, in #10, and make either of the a or b tuples have 4 elements rather than 3, GCC will fail to compile it down to nothing there too… even with small tuple_cat uses. It’s really strange! (Apologies for the slapdash response again, I’m replying on my phone!)
Update:
Godbolt link: link to godbolt example
Code used to demonstrate this:
// If you uncomment this line and set the value manually,
// either gcc or clang will produce suboptimal assembly.
// #define TUPLET_CAT_BY_FORWARDING_TUPLE 0
#include <tuplet/tuple.hpp>
using namespace tuplet;
#include <string_view>
int main(int argc, char **argv) {
    auto a = tuple{ 1, 2.f, 3.0, std::string_view("Hello"), nullptr };
    auto b = tuple{ '4', "5", argc };
    auto big = tuple_cat
    (
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b
    );
    // This is argc + argc + 1
    return get<7>(big) + get<399>(big) + get<392>(big);
}
Update 2: I wrote some code to compare tuplet to std::tuple. GCC can also optimize
#if USE_STD
#include <tuple>
using std::tuple;
#else
#include <tuplet/tuple.hpp>
using tuplet::tuple;
#endif
auto main(int argc, char **argv)
    -> int
{
    auto a = tuple{ 1, 2.f, 3.0, 6 };
    auto b = tuple{ '4', "5", argc };
    auto big = tuple_cat
    (
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b
    );
    return get<6>(big);
}
This is the command to test tuplet:
time g++ test.cpp -std=c++20 -O2 -c -o test-tuplet.o
This is the command to test std::tuple:
time g++ test.cpp -std=c++20 -O2 -c -o test-std.o -DUSE_STD=1
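The return expressions in both listings depend on where each element lands in the concatenated tuple. In the first listing every (a, b) pair contributes 5 + 3 = 8 elements, so index i corresponds to position i % 8 within a pair: 7 and 399 hit b's argc, and 392 hits a's leading 1. In the second listing a pair has 7 elements, so index 6 is argc. Here is a small sketch of the same arithmetic, assuming std::tuple instead of tuplet and only two pairs instead of fifty; `sum_like_example` is a hypothetical name used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <tuple>
#include <type_traits>
#include <utility>

// Same element types as `a` and `b` in the first listing, on std::tuple.
using A = std::tuple<int, float, double, std::string_view, std::nullptr_t>;
using B = std::tuple<char, const char*, int>;

// Two (a, b) pairs are enough to show the repeating layout.
using Cat = decltype(std::tuple_cat(std::declval<A>(), std::declval<B>(),
                                    std::declval<A>(), std::declval<B>()));

static_assert(std::tuple_size_v<Cat> == 16);
// Position 7 within a pair is b's last element (argc in the listing).
static_assert(std::is_same_v<std::tuple_element_t<7, Cat>, int>);
static_assert(std::is_same_v<std::tuple_element_t<15, Cat>, int>);
// Position 0 within a pair is a's first element (the literal 1).
static_assert(std::is_same_v<std::tuple_element_t<8, Cat>, int>);

// Hypothetical helper mirroring the first listing's return expression,
// with indices 7, 15 and 8 standing in for 7, 399 and 392.
inline int sum_like_example(int argc) {
    auto a = std::tuple{1, 2.f, 3.0, std::string_view("Hello"), nullptr};
    auto b = std::tuple{'4', "5", argc};
    auto big = std::tuple_cat(a, b, a, b);
    return std::get<7>(big) + std::get<15>(big) + std::get<8>(big);
}
```

With this layout, `sum_like_example(n)` evaluates to n + n + 1, which matches the "argc + argc + 1" comment in the first listing.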
NB: This factor appears to apply only to pathological cases. In the test case we were invoking tuple_cat on 100 tuples; the factor doesn't appear to make a difference when dealing with 30 tuples or fewer.
@mechacrash For a while, one thing that frustrated me was that in your initial code example (posted on godbolt), clang was able to optimize out tuple_cat and reduce everything down to a single instruction, but with the implementation that was actually created in the PR, clang wasn't able to do that. The implementations looked very similar, and for the longest time I couldn't figure out why. (GCC was able to do it in both cases.)
I should make it clear: my frustration was not with you, but rather with the capricious nature of Clang's optimizer.
I ended up probing the godbolt code until I identified the factor that broke clang's optimizing capabilities, and it turns out that a single type signature is responsible.
See a working sample here: https://godbolt.org/z/axYhsj7T4
If you change -DTUPLET_CAT_BY_FORWARDING_TUPLE=0 from 0 to 1, it will alter that type signature, and clang will no longer be able to optimize out tuple_cat. This macro controls a single line of code.
If the macro is set to zero (or isn't present), then the big tuple will take all the arguments and pass them by value, but otherwise, it'll pass them by reference.
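To make the by-value versus by-reference distinction concrete, here is a minimal sketch using std::tuple. It is not tuplet's actual code, and the line the macro controls is not reproduced here; `bundle_by_value`, `bundle_by_reference` and `bundles_agree` are hypothetical names chosen for illustration. The only difference between the two helpers is the element type of the intermediate tuple that holds the arguments.

```cpp
#include <cassert>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical helper: the intermediate tuple owns a copy of every argument.
template <class... Ts>
constexpr auto bundle_by_value(Ts&&... ts) {
    return std::tuple<std::decay_t<Ts>...>(std::forward<Ts>(ts)...);
}

// Hypothetical helper: the intermediate tuple only holds references
// (a "forwarding tuple"), so nothing is copied at this stage.
template <class... Ts>
constexpr auto bundle_by_reference(Ts&&... ts) {
    return std::forward_as_tuple(std::forward<Ts>(ts)...);
}

// Builds both bundles from the same lvalues and compares one element.
inline bool bundles_agree(int x) {
    std::tuple<int, char> a{x, 'a'};
    std::tuple<double> b{2.5};

    auto by_val = bundle_by_value(a, b);
    auto by_ref = bundle_by_reference(a, b);

    using ByVal = std::tuple<std::tuple<int, char>, std::tuple<double>>;
    using ByRef = std::tuple<std::tuple<int, char>&, std::tuple<double>&>;
    static_assert(std::is_same_v<decltype(by_val), ByVal>);
    static_assert(std::is_same_v<decltype(by_ref), ByRef>);

    // Both bundles expose the same values; only the storage differs.
    return std::get<0>(std::get<0>(by_val)) == std::get<0>(std::get<0>(by_ref));
}
```

Both versions produce the same values, so any difference in the generated assembly comes from how well the optimizer sees through the extra level of references, not from the semantics.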
Another interesting factor: Clang can still optimize out tuple_cat for less-pathological cases. For example, if you're only tuple_cat'ing 30 tuples, clang can still do the optimization, independent of the value of the flag. This leads me to believe that using a forwarding tuple interferes with one optimization pass, but for simpler cases a different optimization pass is able to do the job.
See a less pathological example here: https://godbolt.org/z/s7sPf1Gad
@rileylev I'm tagging you just b/c you might be interested too.