src/data.yml

machines:
  - name: Apple Macbook Air M2
    specs: |
      CPU: Apple M2
      RAM: LPDDR5 16GB
      Compiler: Homebrew Clang 17.0.6
      Kernel: Darwin Kernel Version 22.2.0: Fri Nov 11 02:06:26 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T8112
      ... more to come
  - name: AMD Ryzen 9
    specs: |
      CPU: AMD Ryzen 9 7900X 12-Core Processor
      RAM: 2x DDR5 16GB
      Compiler: GCC 13.2.0-23ubuntu4
      Kernel: Linux 6.8.0-41-generic
      ... more to come

questions:
  - title: Memory vs Compute
    topic: Memory
    code1: |
      struct Foo {
        int values[30];
        int cached_bar{0};

        int bar() {
          if (cached_bar == 0) {
            for (int i = 0; i < 30; ++i) {
              cached_bar += values[i] * i;
            }
          }

          return cached_bar;
        }
      };

      std::vector<Foo> arr(1'000'000);

      for (auto &foo : arr) {
        Use(foo.bar());
      }
    code2: |
      struct Foo {
        int values[30];

        int bar() {
          int res{0};
          for (int i = 0; i < 30; ++i) {
            res += values[i] * i;
          }
          return res;
        }
      };

      std::vector<Foo> arr(1'000'000);

      for (auto &foo : arr) {
        Use(foo.bar());
      }
    answer: 2
    faster_factor: 3
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/vs_memory.cpp
    machine: Apple Macbook Air M2
    explain: |
      TODO: say that memory is more common to be the issue

      In this example, the actual factor that cause majority of slow down is due to the
      write instruction. TODO go to next example.

      References:

      - TODO: find something
  - title: TLB Miss
    wip: true
    topic: Memory
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    machine: Apple Macbook Air M2
    explain: |
      References:

      - [Translation Lookaside Buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)
  - title: Zero Initialized Array
    topic: Memory
    code1: |
      std::vector<int> arr;
      arr.resize(100'000'000);
      std::fill(arr.begin(), arr.end(), 0);

      # Only measure this part
      for (auto i = 0u; i < arr.size(); ++i) {
        arr[i] = i;
      }
    code2: |
      std::vector<int> arr;
      arr.resize(100'000'000);

      # Only measure this part
      for (auto i = 0u; i < arr.size(); ++i) {
        arr[i] = i;
      }
    answer: 1
    faster_factor: 1.15
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/page.cpp
    machine: Apple Macbook Air M2
    explain: |
      First time you access a page, it will trigger a minor page fault, that could be
      a bit slower. Zero-initialize the memory region will trigger the minor page
      fault, to make subsequent access faster.

      Beside `std::fill`, you can use `bzero` or `explicit_bzero`.

      References:

      - [Minor Page Fault](https://en.wikipedia.org/wiki/Page_fault#Minor)
  - title: Cache vs Memory
    topic: Memory
    code1: |
      constexpr int STRIDE = 1;

      for (auto i = 0u; i < arr.size(); ++i) {
        arr[i] = (i + STRIDE) % arr.size();
      }

      # Measure below part only
      int sum{0};
      int p{0};
      for (auto i = 0u; i < arr.size(); ++i) {
        sum += arr[p];
        p = arr[p];
      }
    code2: |
      constexpr int STRIDE = 4096;

      for (auto i = 0u; i < arr.size(); ++i) {
        arr[i] = (i + STRIDE) % arr.size();
      }

      # Measure below part only
      int sum{0};
      int p{0};
      for (auto i = 0u; i < arr.size(); ++i) {
        sum += arr[p];
        p = arr[p];
      }
    answer: 1
    faster_factor: 14.7
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/cache.cpp
    machine: Apple Macbook Air M2
    explain: |
      There are actually a lot of stuff happening here, but the main gist is most of data
      access in the first code is done in cache, while most in 2nd thread have to go to
      memory (cache miss).

      You can use `perf` to see cache miss counter.

      - TODO: perf example here

      References:
      
      - [Measuring Cache Latencies - StackOverflow](https://stackoverflow.com/questions/21369381/measuring-cache-latencies)
  - title: Prefetch
    topic: Memory
    code1: |
      constexpr int STRIDE = 4096;

      for (auto i = 0u; i < arr.size(); ++i) {
        arr[i] = (i + STRIDE) % arr.size();
      }

      # Measure below part only
      int sum{0};
      int p{0};
      for (auto i = 0u; i < arr.size(); ++i) {
        __builtin_prefetch(&arr[(p + 1 * STRIDE) % arr.size()], 0, 0);
        __builtin_prefetch(&arr[(p + 2 * STRIDE) % arr.size()], 0, 0);

        sum += arr[p];
        p = arr[p];
      }
    code2: |
      constexpr int STRIDE = 4096;

      for (auto i = 0u; i < arr.size(); ++i) {
        arr[i] = (i + STRIDE) % arr.size();
      }

      # Measure below part only
      int sum{0};
      int p{0};
      for (auto i = 0u; i < arr.size(); ++i) {
        sum += arr[p];
        p = arr[p];
      }
    answer: 1
    faster_factor: 1.25
    explain: |
      If you know the memory address that you are going to access, or are likely be
      accessed, you can instruct CPU to prefetch them. That would help to reduce the
      wait if memory access is needed.

      Note that CPU also has a prefetcher, so don't try to do its work, only do this
      if you think the prefetcher cannot predict the next address.

      References:

      - [Cache Prefetching](https://en.wikipedia.org/wiki/Cache_prefetching)
  - title: False Sharing
    topic: Memory
    code1: |
      struct Foo {
        int x;
        int y;
        int z;
      };
      std::vector<Foo> arr(100'000);

      # Run in 2 threads
      for (int i = thread_idx; i < arr.size(); i += 2) {
        arr[i].x = i;
        arr[i].y = arr.size() - i;
        arr[i].z = arr.size() + i;
      }
    code2: |
      struct alignas(64) Foo {
        int x;
        int y;
        int z;
      };
      std::vector<Foo> arr(100'000);

      # Run in 2 threads
      for (int i = thread_idx; i < arr.size(); i += 2) {
        arr[i].x = i;
        arr[i].y = arr.size() - i;
        arr[i].z = arr.size() + i;
      }
    answer: 2
    faster_factor: 1.2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/false_sharing.cpp
    explain: |
      False sharing happens when 2 cores read/write to different variables in the same cache line. In the example,
      Foo struct only has 3 integers, which is smaller than 1 cache line (64 bytes). Therefore, multiple instances of Foo will
      fit into a single cache line, causing the possibility for 2 cores to read/write at the same cache line.

      Adding alignment for Foo struct so that 1 cache line only has 1 Foo instance (at the cost of more 
      and make the program run faster. Alternatively, you can divide the work between 2 threads differently (e.g. 0-size/2 to
      1 thread, and the rest to another thread).

      To measure effect of false sharing, you can use [perf c2c](https://man7.org/linux/man-pages/man1/perf-c2c.1.html).

      References:

        - [Wiki](https://en.wikipedia.org/wiki/False_sharing)
        - [docs.kernel.org](https://docs.kernel.org/kernel-hacking/false-sharing.html)
  - title: True Sharing
    topic: Memory
    wip: true
    code1: |
      constexpr int N = 128;
    code2: |
      constexpr int N = 128;
    answer: 2
    faster_factor: 3.4
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/cache.cpp
    explain: |
      References:

        - [Wiki](https://en.wikipedia.org/wiki/Cache_placement_policies)
  - title: Mutex
    topic: Memory
    code1: |
      std::vector<std::thread> threads;
      std::atomic<int> sum = 0;

      for (int i = 0; i < 4; ++i) {
        threads.emplace_back([&arr, i, &sum]() {
          int partial_sum = 0;
          auto start = i * arr.size() / 4;
          auto end = (i + 1) * arr.size() / 4;
          for (int i = start; i < end; ++i) {
            partial_sum += arr[i];
          }

          sum += partial_sum;
        });
      }
    code2: |
      std::vector<std::thread> threads;
      std::atomic<int> sum = 0;

      for (int i = 0; i < 4; ++i) {
        threads.emplace_back([&arr, i, &sum]() {
          auto start = i * arr.size() / 4;
          auto end = (i + 1) * arr.size() / 4;
          for (int i = start; i < end; ++i) {
            sum += arr[i];
          }
        });
      }
    answer: 1
    faster_factor: 145.8
    machine: AMD Ryzen 9
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/lock.cpp
    explain: |
      Lock-free atomic operation does not mean it is free. Under the hood, works need to be done whenever 2 cores read/write to the same cache line.

      In the 2nd example, the contention is much higher since we read/modify sum on every iteration, while in the first one, it is only done 4 times, 1 in each thread. That's to show atomic, if not used properly could lead to degrade in performance.

      To detect cache line contention, you can use `perf c2c`.

      References:
      
      - [perf c2c man page](https://man7.org/linux/man-pages/man1/perf-c2c.1.html)
  - title: Cache Associativity
    topic: Memory
    code1: |
      constexpr int N = 64;

      for (auto j = 0u; j < 1024 / N; ++j) {
        for (auto i = 0u; i < N; ++i) {
          sum += arr1[i] + arr2[i];
        }
      }
    code2: |
      constexpr int N = 128;

      for (auto j = 0u; j < 1024 / N; ++j) {
        for (auto i = 0u; i < N; ++i) {
          sum += arr1[i] + arr2[i];
        }
      }
    answer: 2
    faster_factor: 3.4
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/cache.cpp
    explain: |
      As cache needs to be fast, it needs to be simple. And one way to do that is that each address can only be fit into certain cache entry.

      That means even if the cache itself is not full, but there are conflicts (e.g. a lot of address have the same last k-bits) there could be degrade in performance.

      References:

        - [Wiki](https://en.wikipedia.org/wiki/Cache_placement_policies)
        - [Cache Associativity](https://en.algorithmica.org/hpc/cpu-cache/associativity/)
        - [Cache Associativity](https://csillustrated.berkeley.edu/PDFs/handouts/cache-3-associativity-handout.pdf)
  - title: Cache Bank Conflict
    topic: Memory
    wip: true
    code1: |
      constexpr int N = 1;

      for (auto j = 0u; j < 1024 / N; ++j) {
        for (auto i = 0u; i < N; ++i) {
          sum += arr1[i] + arr2[i];
        }
      }
    code2: |
      constexpr int N = 2;

      for (auto j = 0u; j < 1024 / N; ++j) {
        for (auto i = 0u; i < N; ++i) {
          sum += arr1[i] + arr2[i];
        }
      }
    answer: 1
    faster_factor: 33.7
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/cache.cpp
    explain: |
      TODO
  - title: Memory Bank Conflict
    topic: Memory
    wip: true
    code1: |
      constexpr int N = 1;
    code2: |
      constexpr int N = 2;
    answer: 1
    faster_factor: 33.7
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/cache.cpp
    explain: |
      TODO
  - title: TLB Shoot Down
    topic: Memory
    wip: true
    code1: |
      for (int i = 0; i < 10'000'000; ++i) {
        sum += a[i % 1'000'000];
      }
    code2: |
      # Thread 1
      for (int i = 0; i < 1'000'000; ++i) {
        sum += a[i % 1'000'000];
      }

      # Thread 2
      while (++cnt) {
        if (cnt % 2 == 0)
          x = new int[10000];
        else
          delete [] x;
      }
    answer: 1
    faster_factor: 2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/tlb_shootdown.cpp
    explain: |
      TODO
  - title: Paused
    topic: Memory
    wip: true
    code1: |
      a
    code2: |
      b
    answer: 1
    faster_factor: 2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/tlb_shootdown.cpp
    explain: |
      TODO
  - title: Misalign
    wip: true
    topic: Memory
    code1: |
      a
    code2: |
      b
    answer: 1
    faster_factor: 2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/tlb_shootdown.cpp
    explain: |
      TODO
  - title: Recharge
    wip: true
    topic: Memory
    code1: |
      a
    code2: |
      b
    answer: 1
    faster_factor: 2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/tlb_shootdown.cpp
    explain: |
      TODO
  - title: Mooore Write
    wip: true
    topic: Memory
    code1: |
      a
    code2: |
      b
    answer: 1
    faster_factor: 2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/write.cpp
    explain: |
      TODO
  - title: Non Temporal Write
    wip: true
    topic: Memory
    code1: |
      a
    code2: |
      b
    answer: 1
    faster_factor: 2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/memory/write.cpp
    explain: |
      TODO

  - title: Sorted Array
    topic: CPU
    code1: |
      std::vector<int> arr(100'000);
      # then fill with random value in [0, 256]

      for (auto i = 0u; i < 100'000; ++i) {
        if (arr[i] >= 128)
          sum += arr[i];
      }
    code2: |
      std::vector<int> arr(100'000);
      # then fill with random value in [0, 256]

      # do not benchmark this sort function
      std::sort(arr.begin(), arr.end())

      for (auto i = 0u; i < 100'000; ++i) {
        if (arr[i] >= 128)
          sum += arr[i];
      }
    answer: 2
    faster_factor: 9.6
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/sorted_array.cpp
    explain: |
      With sorted array, the condition `arr[i] >= 128` become easy to predicted (always false at the beginning, and then
      become always true). Without sorted array, that is harder to predict.

      CPU can run multiple instructions at the same time, but branch prevents it from happenning. Therefore, CPU try to
      predict result of the branch and execute instructions based on that prediction to keep the utilization high.

      References:

        - [Wiki](https://en.wikipedia.org/wiki/Branch_predictor)
        - [Why is processing a sorted array faster than processing an unsorted array?](https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array)
  - title: Modulo
    topic: CPU
    code1: |
      std::array<int, 5> modulos{11, 107, 1013, 19211, 81727};

      for (auto i = 0u; i < arr.size(); ++i) {
        sum += arr[i] % modulos[i * 5 / arr.size()];
      }
    code2: |
      for (auto i = 0u; i < arr.size(); ++i) {
        switch (i * 5 / arr.size()) {
          case 0:
            sum += arr[i] % 11;
            break;
          case 1:
            sum += arr[i] % 107;
            break;
          case 2:
            sum += arr[i] % 1013;
            break;
          case 3:
            sum += arr[i] % 19211;
            break;
          case 4:
            sum += arr[i] % 81727;
            break;
        }
      }
    answer: 2
    faster_factor: 1.3
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/modulo.cpp
    explain: |
      Integer modulo (and division) operation in CPU is slow. But there is a trick to convert
      [division by a constant into multiplication](https://en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant).

      Beware of the branching penalty though. The above example works because the branch is easily predicted.

      In case where you don't know the divisor in compile time, but you know same value is gonna be used multiple times,
      you can use something like [libdivide](https://libdivide.com/)

      References:
      
      - [Division Algorithm - Wiki](https://en.wikipedia.org/wiki/Division_algorithm)
  - title: Power Of Two
    topic: CPU
    code1: |
      int foo(int x) {
        return x % 3; 
      }
    code2: |
      int foo(int x) {
        return x % 128; 
      }
    answer: 2
    faster_factor: 1.2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/power_of_two.cpp
    explain: |
      Arithmetic operation with power of two can usually be converted into bit shift, or mask
      operation, which is quite cheap

      - x * 2^n == x << n
      - x % 2^n == x & (1 << n - 1)

  - title: Dependency
    topic: CPU
    code1: |
      for (auto i = 0u; i < 90; ++i) {
        sum *= 10;
        sum += a[i];
      }
    code2: |
      for (auto i = 0u; i < 90; i += 3) {
        sum *= 1000;
        sum += a[i] * 100;
        sum += a[i + 1] * 10;
        sum += a[i + 2];
      }
    answer: 2
    faster_factor: 1.3
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/dependency.cpp
    explain: |
      The first code has to be run sequentially pretty much, since the add operation
      must be done after * 10 operation. While in the 2nd code, there are 3 add
      operation that can be done in any order, thus enable more parallelism in CPU.

      References:

      - [Superscalar Processor - Wiki](https://en.wikipedia.org/wiki/Superscalar_processor)
  - title: False Dependency
    topic: CPU
    wip: true
    code1: |
      #include
    code2: |
      #include
    answer: 2
    faster_factor: 2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/sorted_array.cpp
    explain: |
      TODO
  - title: No Branch
    topic: CPU
    code1: |
      for (auto i = 0u; i < arr.size(); ++i) {
        if (arr[i] > 128)
          sum += arr[i];
      }
    code2: |
      for (auto i = 0u; i < arr.size(); ++i) {
        sum += arr[i] * (arr[i] > 128);
      }
    answer: 2
    faster_factor: 9
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/sorted_array.cpp
    explain: |
      Eventhough there are a lot more instructions in the 2nd example, eliminating
      branch make it faster due to higher of parallelism achieved on the CPU.

      Part of a reason that compiler does not optimize the first example is because
      it is not obvious the branchless edition is faster. As you see in `Sorted Array`,
      branch one, with a clean pattern, will outperform the branchless one, due to extra
      computation it needs.

      References:

      - [Branchless Programming - Algorithmica](https://en.algorithmica.org/hpc/pipelining/branchless/)
  - title: No Branch 2
    topic: CPU
    code1: |
      for (auto i = 0u; i < arr.size(); ++i) {
        if (arr[i] > 128)
          sum += arr[i];
      }
    code2: |
      int sum[2] = {0, 0};
      for (auto i = 0u; i < arr.size(); ++i) {
        sum[arr[i] > 128] += arr[i];
      }
      // The answer is in sum[true]
    answer: 2
    faster_factor: 1.83
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/sorted_array.cpp
    explain: |
      Similar to [No Branch](/q/no_branch.html). This is a difference approach, which
      is more general. Instead of having branch, you store the compute result in both
      branch, and discard the incorrect one at the end.
  - title: Well Predicted
    topic: CPU
    code1: |
      // The array is sorted beforehand
      for (auto i = 0u; i < arr.size(); ++i) {
        if (arr[i] > 128)
          sum += arr[i];
      }
    code2: |
      int sum[2] = {0, 0};
      for (auto i = 0u; i < arr.size(); ++i) {
        sum[arr[i] > 128] += arr[i];
      }
      // The answer is in sum[true]
    answer: 1
    faster_factor: 5.35
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/sorted_array.cpp
    explain: |
      TODO
  - title: Register Spill
    topic: CPU
    code1: |
      // arr.size == 1'000'000
      constexpr int N = 4;

      for (auto i = 0u; i < arr.size(); i += N) {
        for (int j = 0; j < N; ++j) {
          sum += arr[i + j] * (i + j);
        }
      }
    code2: |
      // arr.size == 1'000'000
      constexpr int N = 8;

      for (auto i = 0u; i < arr.size(); i += N) {
        for (int j = 0; j < N; ++j) {
          sum += arr[i + j] * (i + j);
        }
      }
    answer: 1
    faster_factor: 3.7
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/register.cpp
    explain: |
      There are limited amount of registers in CPU. So if the calculation is large amount
      (in the sense that you need to have a lot intermediate values stored, some will have
      to go to CPU cache.

      In these 2 examples, the inner loops will be unrolled, as N is small enough. And with
      larger N, the amount of intermediate values for these calculation grow, leading to
      downgrade performance.

      Note that this specific example depends heavily on whether the compiler will unroll the 2nd
      loop, and how many loop will it unroll. GCC 13 seems to handle this better, and does not
      fully unroll all 8 loops in this case.

      References:
      
      - [Register Allocation - Wiki](https://en.wikipedia.org/wiki/Register_allocation)
      - [Register Renaming - Wiki](https://en.wikipedia.org/wiki/Register_renaming)
  - title: Jump
    topic: CPU
    code1: |
      for (auto i = 0u; i < 100'000; ++i) {
        asm("nop");
        asm("nop");
        asm("nop");
        ...
        // 15 times
      }
    code2: |
      for (auto i = 0u; i < 100'000; ++i) {
        asm("nop");
        asm("nop");
        asm("nop");
        ...
        // 14 times
      }
    answer: 1
    faster_factor: 1.16
    machine: AMD Ryzen 9
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/jump.cpp
    explain: |
      32 bytes aligned cmp is significiantly slower. looks like to be on the frontend part.
      TODO

      References:
  - title: Pointer Chasing
    topic: CPU
    wip: true
    code1: |
      #include
    code2: |
      #include
    answer: 2
    faster_factor: 2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/sorted_array.cpp
    explain: |
      TODO
  - title: Store-To-Load Forwarding
    topic: CPU
    code1: |
      struct Elem {
        int32_t part1{0};
        int32_t part2{0};
      
        void Store(int32_t x) {
          part1 = 0;
          part2 = x;
        }
      
        int64_t Load() {
          return *reinterpret_cast<int64_t*>(this);
        }
      };

      volatile uint16_t x;

      for (auto &elem : arr) {
        elem.Store(x);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        sum += elem.Load();
      }
    code2: |
      struct Elem {
        int32_t part1{0};
        int32_t part2{0};
      
        void Store(int32_t x) {
          *reinterpret_cast<int64_t*>(this) = x;
        }
      
        int64_t Load() {
          return *reinterpret_cast<int64_t*>(this);
        }
      };

      volatile uint16_t x;

      for (auto &elem : arr) {
        elem.Store(x);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        sum += elem.Load();
      }
    answer: 2
    faster_factor: 2.4
    machine: AMD Ryzen 9
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/store_forwarding.cpp
    explain: |
      Even L1 cache is slow sometimes. For example, if you write to a memory address and then load it again later, if the data is in store buffer inside CPU, we don't need to wait for them to commit to L1 or load from L1 again. This optimization in CPU is call store-to-load forwarding. There are certain restriction on when it can be performed, as it need to be done very fast, please check references for more details.

      You can use `perf` to measure this.

          $ perf list
          ls_stlf
            [Store-to-load-forward (STLF) hits]

          $ 'bazel-bin/cpu/store_forwarding --benchmark_filter=BM_StoreForwarding<Foo1> --benchmark_min_time=100x':
          410,195,410      L1-dcache-loads                                                       
          100,300,709      ls_stlf

          $ 'bazel-bin/cpu/store_forwarding --benchmark_filter=BM_StoreForwarding<Foo2> --benchmark_min_time=100x':
          209,280,067      L1-dcache-loads                                                       
          200,245,122      ls_stlf

      Note that above benchmark works because:

      - It rely on a fact the compiler cannot optimize the `Store` function to store 64 bytes directly.
      - We add a memory fence to prevent compiler from reordering the write before read. Removing that, the 2nd code is still faster, but because of different reason.

      References:
      
      - [Store forwarding by example - easyperf](https://easyperf.net/blog/2018/03/09/Store-forwarding)
  - title: Async
    topic: CPU
    wip: true
    code1: |
      #include
    code2: |
      #include
    answer: 2
    faster_factor: 2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/sorted_array.cpp
    explain: |
      TODO


  - title: Inline vs Function Call
    topic: C++
    code1: |
      __attribute__(always_inline) int foo(int x, int y) {
        return x + y;
      }
    code2: |
      __attribute__(noinline) int foo(int x, int y) {
        return x + y;
      }
    answer: 1
    faster_factor: "4.7"
    machine: AMD Ryzen 9
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpp/function_call.cpp
    explain: |
      [Inlining function](https://en.wikipedia.org/wiki/Inline_expansion) could be faster because it:

      - Eliminate the cost of calling function. The cost will making some changes to some register, pushing thing to stack, and then a jump on the code.
      - Enable more optimization on the caller since the compiler has more info on how a function works. Depending on the cases, this would be the larger benefit.

      The drawback, however, is larger code size, and that could lead to more icache miss.

      For a function to be inlined in C++, it needs to be defined on the header, or link time optimization is enabled.

      Compiler on average can decide on which function, or where a function should be inlined better than human, especially with profile guided optimization. However, there are cases that you might want to use these attribute manually, e.g. if you know for sure a function is always in a critical path, and it will not lead to code bloat; or if the path that you want to optimize for is not hot (executed less frequently compared to other paths).

      Beside these attributes, there are others inline-related one that could be helpful: flatten

      Note that you might see different results on Clang, due to how these attributes are interpreted.

      References:

      - [GCC Common Function Attribute](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html)
      - [clang ignoring attribute noinline](https://stackoverflow.com/questions/54481855/clang-ignoring-attribute-noinline)
      - [GCC Link Time Optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization)
      - [LLVM Link Time Optimization](https://llvm.org/docs/LinkTimeOptimization.html)
      - [Profile-Guided Optimization](https://en.wikipedia.org/wiki/Profile-guided_optimization)

  - title: Function Call vs Virtual Function Call
    topic: C++
    wip: true
    code1: |
      #include
    code2: |
      #include
    answer: 2
    faster_factor: "?"
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpp/function_call.cpp
    explain: |
      TODO

  - title: Different Type of Static Variable
    topic: C++
    code1: |
      struct Foo {
        int Add(int x) {
          static int a = rand() % 5;
          return a += x;
        }
      };
    code2: |
      struct Foo {
        static inline int a = rand() % 5;

        int Add(int x) {
          return a += x;
        }
      };
    answer: 2
    faster_factor: "1.05"
    machine: AMD Ryzen 9
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/static_variable.cpp
    explain: |
      Static member variable will be initialized when the application start. Static local variable, however,
      will be initialized first time when it is called. That mean having an extra check in runtime to figure
      out whether the variable is initialized or not.

      References:

      - TODO

  - title: Signed vs Unsigned
    topic: C++
    code1: |
      // arr is an array of uint64_t
      int sum = 0;
      for (auto elem : arr) {
        sum += elem / 1234;
      }
    code2: |
      // arr is an array of int64_t
      int sum = 0;
      for (auto elem : arr) {
        sum += elem / 1234;
      }
    answer: 1
    faster_factor: "1.2"
    machine: AMD Ryzen 9
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/signed.cpp
    explain: |
      There are certain optimization that can only optimized for [unsigned integer divison](https://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html); making it faster in this case.

  - title: Signed vs Unsigned
    topic: C++
    wip: true
    code1: |
    code2: |
    answer: 1
    faster_factor: "1.2"
    machine: AMD Ryzen 9
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/signed.cpp
    explain: |
      Signed integer is overflow, while unsigned integer are not. So in some cases, the compiler has more constraints to work with, thus enable more optimization.

  - title: Restrict
    topic: C++
    code1: |
      void Add2Elems(int * __restrict a, int * __restrict b, int x) {
        for (int i = 0; i < 50; ++i) {
          a[i] += x;
          b[i] += x;
        }
      }
    code2: |
      void Add2Elems(int *a, int *b, int x) {
        for (int i = 0; i < 50; ++i) {
          a[i] += x;
          b[i] += x;
        }
      }
    answer: 1
    faster_factor: "1.9"
    machine: AMD Ryzen 9
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/alias.cpp
    explain: |
      `restrict` keyword let the compiler know these 2 pointers will not point to the same address. That allows compiler to assume independence between operations of `a` and `b`, leading to more optimized version.

      References:

      - [Restrict - Wiki](https://en.wikipedia.org/wiki/Restrict)
  - title: Shared Pointer
    topic: C++
    wip: true
    code1: |
      struct Foo {
    code2: |
      struct Foo {
    answer: 2
    faster_factor: "?"
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/static_variable.cpp
    explain: |
      TODO

  - title: Small
    topic: C++
    code1: |
      for (int j = 0; j < 23; ++j) {
        std::string s;
        for (auto i = 0u; i < 22; ++i)
          s += 'a';
      }
    code2: |
      for (int j = 0; j < 22; ++j) {
        std::string s;
        for (auto i = 0u; i < 23; ++i)
          s += 'a';
      }
    answer: 1
    faster_factor: 2
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpp/string.cpp
    explain: |
      Small String Optimization (SSO) is a way to avoid heap allocation by reusing the
      space for metadata (e.g. pointer, capacity, size) in case the length of the
      string is small.

      For Clang, the small string size is capped at 22 bytes (at the time of writing)

      References:

      - [An informal comparison of the three major implementations of std::string](https://devblogs.microsoft.com/oldnewthing/20240510-00/?p=109742)
  - title: Chaining Function
    topic: C++
    code1: |
      struct Context {
        // complex class
      };
      
      void Step1(Context &ctx) {
        // complex function
      }
      
      void Step2(Context &ctx) {
        // complex function
      }

      for (auto &context : contexts) {
        Step1(context);
        Step2(context);
        Step1(context);
        Step2(context);
      }
    code2: |
      for (auto &context : contexts) {
        Step1(context);
      }
      for (auto &context : contexts) {
        Step2(context);
      }
      for (auto &context : contexts) {
        Step1(context);
      }
      for (auto &context : contexts) {
        Step2(context);
      }
    answer: 2
    faster_factor: "1.39"
    machine: AMD Ryzen 9
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/function_chain.cpp
    explain: |
      TODO: Code is too long

      There are various factors that affect the performance of this:

      - icache miss: the 2nd variant will have less icache missed, since we execute a smaller code at a time
      - dcache miss: depending on the Context struct, and these Step1, Step2 function; doing them all in 1 pass could save on the time to load the context member variable or related variables in Step1, Step2 functions
      - optimization across functions: in the first variant, multiple functions are called at one. There could be optimization that make the speed of 4 functions faster than running each 1 individually.
      - vectorization: in the 2nd variant, it is easier to vectorize each functions

      References:

  - title: ABI
    topic: C++
    wip: true
    code1: |
      struct Foo {
    code2: |
      struct Foo {
    answer: 2
    faster_factor: "?"
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/static_variable.cpp
    explain: |
      TODO

  - title: Batch Processing
    topic: C++
    wip: true
    code1: |
      struct Foo {
    code2: |
      struct Foo {
    answer: 2
    faster_factor: "?"
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/static_variable.cpp
    explain: |
      TODO

  - title: pow
    topic: C++
    wip: true
    code1: |
      std::pow(3);
    code2: |
      std::pow(4);
    answer: 2
    faster_factor: "?"
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpp/math.cpp
    explain: |

      References:

      - [slowpow](https://entropymine.com/imageworsener/slowpow/)

  - title: Loop Reordering
    topic: C++
    wip: true
    code1: |
      struct Foo {
    code2: |
      struct Foo {
    answer: 2
    faster_factor: "?"
    machine: Apple Macbook Air M2
    code_url: https://github.com/hgminh95/fast/blob/main/bench/cpu/static_variable.cpp
    explain: |
      TODO

  - title: Moving Data
    topic: GPU
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO
  - title: Coalesced Access
    topic: GPU
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO
  - title: Misaligned Access
    topic: GPU
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO
  - title: The Swizzle Operator
    topic: GPU
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO
  - title: Kernel Fusion
    topic: GPU
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO
  - title: Kernel Fission
    topic: GPU
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO

  - title: Pipelining
    topic: RTL
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO
  - title: Removing Pipeline Register
    topic: RTL
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO
  - title: Register Layers
    topic: RTL
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO
  - title: Register Balancing
    topic: RTL
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO
  - title: Control-based Logic Reuse
    topic: RTL
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |
      TODO
  - title: Priority Encoders
    topic: RTL
    wip: true
    code1: |
      #include cpp
      int main() {
      }
    code2: |
      #include cpp
      int main() {
      }
    answer: 1
    faster_factor: 2
    explain: |