A single factor that controls clang's optimization of tuple_cat #14

codeinred · 2021-10-12T06:30:22Z

codeinred
Oct 12, 2021
Maintainer

NB: This factor appears to apply only to pathological cases. In the test case we invoked tuple_cat on 100 tuples; the factor doesn't appear to make a difference when dealing with 30 tuples or fewer.

@mechacrash For a while, one thing that frustrated me was that in your initial code example (posted on godbolt), clang was able to optimize out tuple_cat and reduce everything down to a single instruction, but in the implementation that was actually created in the PR, clang wasn't able to do that. The implementations looked very similar, and for the longest time I couldn't figure out why. (GCC was able to do it in both cases.)

I should make it clear: my frustration was not with you, but rather with the capricious nature of Clang's optimizer.

I ended up probing the godbolt code until I identified the factor that broke clang's optimizing capabilities, and it turns out that a single type signature is responsible.

See a working sample here: https://godbolt.org/z/axYhsj7T4

If you change the compiler flag from -DTUPLET_CAT_BY_FORWARDING_TUPLE=0 to -DTUPLET_CAT_BY_FORWARDING_TUPLE=1, it will alter that type signature, and clang will no longer be able to optimize out tuple_cat. This macro controls a single line of code:

template <base_list_tuple... T>
constexpr auto tuple_cat(T&&... ts) {
    if constexpr (sizeof...(T) == 0) {
        return tuple<>();
    } else {
        // It turns out that passing these by value generates better assembly
        // on Clang and identical assembly on GCC. Everything should get
        // constructed in place, with the move being optimized out.
#if TUPLET_CAT_BY_FORWARDING_TUPLE
        using big_tuple = tuple<T&&...>;
#else
        using big_tuple = tuple<std::decay_t<T>...>;
#endif
        using outer_bases = base_list_t<big_tuple>;
        constexpr auto outer = detail::get_outer_bases(outer_bases {});
        constexpr auto inner = detail::get_inner_bases(outer_bases {});
        return detail::cat_impl(
            big_tuple {static_cast<T&&>(ts)...},
            outer,
            inner);
    }
}

If the macro is set to zero (or isn't defined), big_tuple stores a decayed copy of every argument, so the arguments are passed along by value. Otherwise, big_tuple stores references to the arguments, so they are passed along by reference.
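To make the difference between the two signatures concrete, here is a minimal sketch that spells out the two big_tuple types for one lvalue argument and one rvalue argument. It uses std::tuple as a stand-in so that it compiles on its own, and the alias names forwarding_tuple_t and by_value_tuple_t are hypothetical, not part of tuplet.

```cpp
#include <tuple>
#include <type_traits>

// Hypothetical aliases (not part of tuplet). They mirror the two branches
// of the #if in tuple_cat, with std::tuple standing in for tuplet::tuple.
template <class... T>
using forwarding_tuple_t = std::tuple<T&&...>;             // flag == 1

template <class... T>
using by_value_tuple_t = std::tuple<std::decay_t<T>...>;   // flag == 0

// With a forwarding pack `T&&... ts`, T deduces to `U&` for an lvalue
// argument and to plain `U` for an rvalue argument.
using small = std::tuple<int, float>;
using lvalue_T = small&;  // what T would be for a named tuple
using rvalue_T = small;   // what T would be for a temporary

// Forwarding tuple: reference collapsing keeps lvalues as `U&` and turns
// rvalues into `U&&`, so big_tuple only holds references.
static_assert(std::is_same_v<
    forwarding_tuple_t<lvalue_T, rvalue_T>,
    std::tuple<small&, small&&>>);

// By-value tuple: std::decay_t strips the references, so big_tuple holds
// its own copy of every argument.
static_assert(std::is_same_v<
    by_value_tuple_t<lvalue_T, rvalue_T>,
    std::tuple<small, small>>);
```

So the only thing the flag changes is whether the temporary handed to cat_impl holds references or objects; everything downstream of it is the same code.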

Another interesting factor: Clang can still optimize out tuple_cat for less-pathological cases. For example, if you're only tuple_cat'ing 30 tuples, clang can still do the optimization, independent of the value of the flag. This leads me to believe that using a forwarding tuple interferes with one optimization pass, but for simpler cases a different optimization pass is able to do the job.

See a less pathological example here: https://godbolt.org/z/s7sPf1Gad

@rileylev I'm tagging you just b/c you might be interested too.

gatchamix · 2021-10-12T06:40:01Z

gatchamix
Oct 12, 2021
Collaborator

Apologies for not mentioning this beforehand - I actually already knew this. I had removed the forwarding behaviour from the original original - my first personal implementation - during the migration and saw that it caused this change in behaviour.

The main reason I didn’t worry too much is because less trivial uses of tuple_cat fail to optimise away for either GCC or clang anyway. My demo code just happened to be the perfect storm to make further (better) revisions to the code seem like regressions 😅

If you pull up the first Godbolt link I sent, in #10 , and make either of the a or b tuples have 4 elements rather than 3, GCC will fail to compile down to nothing there too… even with small tuple_cat uses. It’s really strange!

(Apologies for the slapdash response again, I’m replying on my phone!)

4 replies

codeinred Oct 12, 2021
Maintainer Author

I didn't realize you already knew this! I want to be clear, I wasn't frustrated with you at all, just with the capricious nature of clang's optimizer.

I didn't realize that making a or b have 4 elements would affect optimization, BUT I have some good news: when a and b have different numbers of elements, GCC is still able to optimize things when TUPLET_CAT_BY_FORWARDING_TUPLE=1.

Here's an example where a has 4 elements and b has 3 elements, but GCC optimizes it: https://godbolt.org/z/34qa3MfEq

That being said, clang requires the opposite. It's able to optimize things down, but only when TUPLET_CAT_BY_FORWARDING_TUPLE=0: https://godbolt.org/z/MczfTzenv

gatchamix Oct 12, 2021
Collaborator

Yeah, it’s really weird what will trip up a compiler! I tried all sorts of things (different flags and different compiler revisions) expecting it to be better… but still the same outcome.

Hopefully the optimisation routines can be fine tuned in the future to allow this to work more often!

codeinred Oct 12, 2021
Maintainer Author

I hope so too. In the meantime, I think I'm going to have TUPLET_CAT_BY_FORWARDING_TUPLE be something that's turned off for clang and turned on for GCC, so (hopefully) we'll always get decent optimizing behavior.
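One way such a per-compiler default could be written is sketched below. This is an assumption about the shape of the check, not necessarily what ends up in tuplet. The fallback of 0 for compilers other than clang and GCC is also an assumption; it simply matches the by-value behavior used when the macro isn't defined.

```cpp
// Sketch only: keep any value the user supplied (for example via
// -DTUPLET_CAT_BY_FORWARDING_TUPLE=0), otherwise pick one per compiler.
#ifndef TUPLET_CAT_BY_FORWARDING_TUPLE
#  if defined(__clang__)
// Clang also defines __GNUC__, so __clang__ has to be tested first.
#    define TUPLET_CAT_BY_FORWARDING_TUPLE 0
#  elif defined(__GNUC__)
#    define TUPLET_CAT_BY_FORWARDING_TUPLE 1
#  else
// Assumption: fall back to passing by value on other compilers.
#    define TUPLET_CAT_BY_FORWARDING_TUPLE 0
#  endif
#endif

static_assert(TUPLET_CAT_BY_FORWARDING_TUPLE == 0 ||
                  TUPLET_CAT_BY_FORWARDING_TUPLE == 1,
              "TUPLET_CAT_BY_FORWARDING_TUPLE must be 0 or 1");
```

Wrapping the default in #ifndef keeps the flag overridable from the command line, which is what the godbolt examples above rely on.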

codeinred Oct 12, 2021
Maintainer Author

(Unfortunately MSVC never optimizes this well, but like, at least GCC and Clang optimize it well)

codeinred · 2021-10-12T07:50:24Z

codeinred
Oct 12, 2021
Maintainer Author

Update: TUPLET_CAT_BY_FORWARDING_TUPLE is now set to the value that works best for each compiler (0 for clang, 1 for GCC). This ensures that tuple_cat can be compiled down to nothing, even in pathological cases.

Godbolt link: link to godbolt example

Code used to demonstrate this:

// If you uncomment this line and set the value manually,
// either gcc or clang will produce suboptimal assembly.

// #define TUPLET_CAT_BY_FORWARDING_TUPLE 0
#include <tuplet/tuple.hpp>
using namespace tuplet;

#include <string_view>

int main(int argc, char **argv) {
    auto a = tuple{ 1, 2.f, 3.0, std::string_view("Hello"), nullptr };
    auto b = tuple{ '4', "5", argc };

    auto big = tuple_cat
    (
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b
    );
    
    // This is argc + argc + 1
    return get<7>(big) + get<399>(big) + get<392>(big);
}
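
The "argc + argc + 1" comment can be checked with a little index arithmetic: a has 5 elements and b has 3, so every a, b pair contributes 8 elements, and the 100 arguments produce 400 elements in total. The sketch below uses a hypothetical helper, locate, which is not part of tuplet and only models this specific a, b, a, b, ... argument list.

```cpp
#include <cstddef>

// Hypothetical helper, not part of tuplet. It maps a flat index into the
// concatenated tuple back to (which argument, which element inside it),
// assuming the arguments alternate a (5 elements) and b (3 elements).
struct position {
    std::size_t outer;  // index of the argument passed to tuple_cat
    std::size_t inner;  // index of the element within that argument
};

constexpr position locate(std::size_t flat) {
    constexpr std::size_t size_a = 5, size_b = 3;
    std::size_t pair = flat / (size_a + size_b);  // which a, b pair
    std::size_t rest = flat % (size_a + size_b);  // offset inside the pair
    return rest < size_a ? position{2 * pair, rest}
                         : position{2 * pair + 1, rest - size_a};
}

// get<7>: third element of the first b, which is argc.
static_assert(locate(7).outer == 1 && locate(7).inner == 2);
// get<392>: first element of the last a, which is the int 1.
static_assert(locate(392).outer == 98 && locate(392).inner == 0);
// get<399>: third element of the last b, which is argc again.
static_assert(locate(399).outer == 99 && locate(399).inner == 2);
```

The three elements therefore add up to argc + 1 + argc, which matches the comment in the example.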

Update 2: I wrote some code to compare tuplet to std::tuple. GCC can also generate optimal assembly for std::tuple and std::tuple_cat, but it takes roughly 50 times longer to compile the std::tuple version than the tuplet::tuple version (37 seconds vs 0.69 seconds on my laptop). Clang is unable to generate optimal assembly for std::tuple but does produce optimal assembly for tuplet.

#if USE_STD
#include <tuple>
using std::tuple;
#else 
#include <tuplet/tuple.hpp>
using tuplet::tuple;
#endif

auto main(int argc, char **argv)
-> int
{
    auto a = tuple{ 1, 2.f, 3.0, 6 };
    auto b = tuple{ '4', "5", argc };

    auto big = tuple_cat
    (
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b,
        a, b, a, b, a, b, a, b, a, b
    );

    return get<6>(big);
}

This is the command to test tuplet:

time g++ test.cpp -std=c++20 -O2 -c -o test-tuplet.o

This is the command to test std::tuple:

time g++ test.cpp -std=c++20 -O2 -c -o test-std.o -DUSE_STD=1

0 replies
