reduce.html

<!doctype html>
<html>
<meta charset='utf-8'>

<title>reduce results</title>
<script src='https://cdn.plot.ly/plotly-latest.min.js'></script>
<script type='text/javascript' src='scripts/bench.js'></script>

</html>

<body>
  <h1> reduce</h1>

  `std::reduce/std::transform_reduce` are two algorithms consistenlty vectorized
  by clang<br />
  (regardless of the shape of the call - it can be written by hand or via
  std::accumulate). <br />
  Because of this there is no scalar baseline.
  <br /> <br />
  There are two interesting variations on reduce:<br />
  a) We are reducing to the same type as the element type. <br />
  b) We are reducing to a bigger type to deal with a potential overflow.
  <br />

  <h2> Reducing to the same type </h2>

  This is relatively straight forward.<br />
  The best results is achived when unrolling 4 times. <br />
  In this case we are consistenlty massively winning - at most 5.6 times for
  1000 bytes of chars. <br />
  I don't know why the big difference in behaviour of `std::reduce` - the
  assembly looks <a href="https://godbolt.org/z/tskjMs">identical</a> <br />
  and I'm controlling for code alignment. <br />
  The only idea I have is that punishment for loop peeling for chars is this
  big, <br />
  I have seen mutliple times quite poor perfromance of scalar code for chars.
  <br /><br />
  With respect to code alignment, `unsq_eve` version cares very little - at most
  15% for 1000 bytes of chars. <br />
  `std::reduce` - not so lucky - at 10k swings up to 44% and this does not seem
  to be the scalar code's code fault. <br />

  <h4> reducing to the same type, data </h4>
  <div id='reduce_same_type'></div>
  <script>dynamicEntryPoint('reduce_same_type', {
      name: 'sum',
      size: 'x',
      algorithm: 'selection',
      type: 'selection',
      sum_type: 'selection',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['eve::algo::reduce/char/char', 'eve::algo::reduce/short/short', 'eve::algo::reduce/int/int',
      'std::reduce/char/char', 'std::reduce/short/short', 'std::reduce/int/int']);
  </script>

  <h4> reducing to the same type, code alignment, unsq_eve </h4>
  <div id='reduce_same_type_code_alignment_unsq_eve'></div>
  <script>dynamicEntryPoint('reduce_same_type_code_alignment_unsq_eve', {
      name: 'sum',
      size: 'x',
      algorithm: 'selection',
      type: 'selection',
      sum_type: 'selection',
      time: 'y',
      padding: 'minmax',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['eve::algo::reduce/char/char', 'eve::algo::reduce/short/short', 'eve::algo::reduce/int/int']);
  </script>

  <h4> reducing to the same type, code alignment, `std::reduce` </h4>
  <div id='reduce_same_type_code_alignment_std_reduce'></div>
  <script>dynamicEntryPoint('reduce_same_type_code_alignment_std_reduce', {
      name: 'sum',
      size: 10000,
      algorithm: 'selection',
      type: 'selection',
      sum_type: 'selection',
      time: 'y',
      padding: 'minmax',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['std::reduce/char/char', 'std::reduce/short/short', 'std::reduce/int/int']);
  </script>

  <h2> Reducing to a different type </h2>

  When reducing to a different type, we need to somehow convert from the <br />
  array type, to the type we want to do our operations in. <br />
  `std::reduce` for this generates a really nice <a
    href="https://godbolt.org/z/fXLh8T">assembly</a> <br />
  (chars reducing to shorts).

  <br /><br />
  ```<br />
  vpmovsxbw ymm4, xmmword ptr [rdi + rdx] <br />
  vpaddw ymm0, ymm0, ymm4<br />
  vpmovsxbw ymm4, xmmword ptr [rdi + rdx + 16]<br />
  vpaddw ymm1, ymm1, ymm4<br />
  vpmovsxbw ymm4, xmmword ptr [rdi + rdx + 32]<br />
  ```
  <br /><br />

  This is essentially: `_mm_cvtepi8_epi16 ` called directly on the address -
  <br /><br />

  I do the same trick in eve by loading a smaller `eve::wide` and <br />
  then calling `eve::convert` on it.

  We end up winning primarally because of not peeling the loops. <br />

  <h3> reducing 40 bytes </h3>

  On 40 bytes not peeling loops gives the most effect.<br />
  adding chars to `short` is 3.3 times faster and adding to `int` ~30%. <br />
  adding shorts to `int` is 1.7 times faster. <br />

  <h4> reducing chars 40 bytes data </h4>

  <div id='reduce_chars_40_bytes'></div>
  <script>dynamicEntryPoint('reduce_chars_40_bytes', {
      name: 'sum',
      size: 40,
      algorithm: 'selection',
      type: 'selection',
      sum_type: 'x',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['std::reduce/char', 'eve::algo::reduce/char']);
  </script>

  <h4> reducing shorts 40 bytes data </h4>

  <div id='reduce_shorts_40_bytes'></div>
  <script>dynamicEntryPoint('reduce_shorts_40_bytes', {
      name: 'sum',
      size: 40,
      algorithm: 'selection',
      type: 'selection',
      sum_type: 'x',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['std::reduce/short', 'eve::algo::reduce/short']);
  </script>

  <h3> reducing 1000 bytes </h3>

  On 1000 bytes not peeling loops still has an effect, though not that big.
  <br />
  adding chars to `short` is 1.5 times faster and adding to `int` is roughly the
  same.
  <br />
  adding shorts to `int` is 1.13 times faster.
  <br /> <br />

  <h4> reducing chars 1000 bytes data </h4>

  <div id='reduce_chars_1000_bytes'></div>
  <script>dynamicEntryPoint('reduce_chars_1000_bytes', {
      name: 'sum',
      size: 1000,
      algorithm: 'selection',
      type: 'selection',
      sum_type: 'x',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['std::reduce/char', 'eve::algo::reduce/char']);
  </script>

  <h4> reducing shorts 1000 bytes data </h4>

  <div id='reduce_shorts_1000_bytes'></div>
  <script>dynamicEntryPoint('reduce_shorts_1000_bytes', {
      name: 'sum',
      size: 1000,
      algorithm: 'selection',
      type: 'selection',
      sum_type: 'x',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['std::reduce/short', 'eve::algo::reduce/short']);
  </script>

  <h3> reducing 10000 bytes </h3>

  On 10'000 bytes loop peeling stops being important.<br /> <br />

  <h4> reducing chars 1000 bytes data </h4>

  <div id='reduce_chars_10000_bytes'></div>
  <script>dynamicEntryPoint('reduce_chars_10000_bytes', {
      name: 'sum',
      size: 10000,
      algorithm: 'selection',
      type: 'selection',
      sum_type: 'x',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['std::reduce/char', 'eve::algo::reduce/char']);
  </script>

  <div id='reduce_shorts_10000_bytes'></div>
  <script>dynamicEntryPoint('reduce_shorts_10000_bytes', {
      name: 'sum',
      size: 10000,
      algorithm: 'selection',
      type: 'selection',
      sum_type: 'x',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['std::reduce/short', 'eve::algo::reduce/short']);
  </script>

  <h2> Total benchmark </h2>
  <div id='total_benchmark'></div>
  <script>dynamicEntryPoint('total_benchmark', {
      name: 'sum',
      size: 'x',
      algorithm: 'selection',
      type: 'selection',
      sum_type: 'selection',
      time: 'y',
      padding: 'min',
      group: 'intel_9700K_avx2_bmi',
      percentage: 100,
    }, ['reduce']);
  </script>

</body>