
Commit

deploy: 634f5c7
wsmoses committed Jun 7, 2023
1 parent 55c348f commit b66ff7c
Showing 3 changed files with 22 additions and 8 deletions.
getting_started/CUDAGuide/index.html: 10 changes (7 additions, 3 deletions)
@@ -31,7 +31,7 @@
<span class=n>printf</span><span class=p>(</span><span class=s>&#34;%f %f</span><span class=se>\n</span><span class=s>&#34;</span><span class=p>,</span> <span class=n>d_x</span><span class=p>,</span> <span class=n>d_y</span><span class=p>);</span>

<span class=p>}</span>
-</code></pre></div><p>A one-liner compilation of the above using Enzyme:</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>clang test2.cpp -Xclang -load -Xclang /path/to/ClangEnzyme-11.so -O2 -fno-vectorize -fno-unroll-loops
+</code></pre></div><p>A one-liner compilation of the above using Enzyme:</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>clang test2.cpp -Xclang -load -Xclang /path/to/ClangEnzyme-11.so -O2
</code></pre></div><h2 id=cuda-example>CUDA Example&nbsp;<a class=headline-hash href=#cuda-example></a></h2><p>When porting the above code, there are some caveats to be aware of:</p><ol><li>CUDA 10.1 is the latest CUDA version supported by LLVM 11 at the time of writing (Jan/20/2021).</li><li><code>__enzyme_autodiff</code> should only be invoked on <code>__device__</code> code, not <code>__global__</code> kernel code. <code>__global__</code> kernels may be supported in the future.</li><li><code>--cuda-gpu-arch=sm_xx</code> is usually needed because the default <code>sm_20</code> is unsupported by modern CUDA versions.</li></ol><div class=highlight><pre class=chroma><code class=language-cpp data-lang=cpp><span class=cp>#include</span> <span class=cpf>&lt;stdio.h&gt;</span><span class=cp>
</span><span class=cp></span>
<span class=kt>void</span> <span class=n>__device__</span> <span class=nf>foo_impl</span><span class=p>(</span><span class=kt>double</span><span class=o>*</span> <span class=n>x_in</span><span class=p>,</span> <span class=kt>double</span> <span class=o>*</span><span class=n>x_out</span><span class=p>)</span> <span class=p>{</span>
@@ -94,8 +94,12 @@
<span class=n>printf</span><span class=p>(</span><span class=s>&#34;%f %f</span><span class=se>\n</span><span class=s>&#34;</span><span class=p>,</span> <span class=n>host_d_x</span><span class=p>,</span> <span class=n>host_d_y</span><span class=p>);</span>

<span class=p>}</span>
-</code></pre></div><p>For convenience, a one-liner compilation step is (against sm_70):</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>clang test3.cu -Xclang -load -Xclang /path/to/ClangEnzyme-11.so -O2 -fno-vectorize -fno-unroll-loops -fPIC --cuda-gpu-arch<span class=o>=</span>sm_70 -lcudart -L/usr/local/cuda-10.1/lib64
-</code></pre></div><p>Note that this procedure (using ClangEnzyme as opposed to LLVMEnzyme manually) may not properly nest Enzyme between optimization passes and may impact performance in unintended ways.</p><h2 id=heterogeneous-ad>Heterogeneous AD&nbsp;<a class=headline-hash href=#heterogeneous-ad></a></h2><p>It is often desirable to take derivatives of programs that run in part on the CPU and in part on the GPU. By placing a call to <code>__enzyme_autodiff</code> in a GPU kernel like above, one can successfully take the derivative of GPU programs. Similarly, one can use <code>__enzyme_autodiff</code> within CPU programs to differentiate programs which run entirely on the CPU. Unfortunately, differentiating functions that call GPU kernels requires a bit of extra work (shown below) &ndash; largely to work around the lack of support within LLVM for modules with multiple architecture targets.</p><p>To successfully differentiate across devices, we will use Enzyme on the GPU to export the augmented forward pass and reverse pass of the kernel being called, and then use Enzyme&rsquo;s custom derivative support to import that derivative function into the CPU code. This then allows Enzyme to differentiate any CPU code that also calls the kernel.</p><p>Suppose we have a heterogeneous program like the following:</p><div class=highlight><pre class=chroma><code class=language-cpp data-lang=cpp>
+</code></pre></div><p>For convenience, a one-liner compilation step is (against sm_70):</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>clang test3.cu -Xclang -load -Xclang /path/to/ClangEnzyme-11.so -O2 --cuda-gpu-arch<span class=o>=</span>sm_70 -lcudart -L/usr/local/cuda-10.1/lib64
+</code></pre></div><p>Note that this procedure (using ClangEnzyme instead of running LLVMEnzyme manually) inserts Enzyme at a specific location in LLVM&rsquo;s optimization pipeline. The default ordering should be reasonable; however, the precise ordering of optimization passes may
+<a href=https://proceedings.mlsys.org/paper/2020/file/4e732ced3463d06de0ca9a15b6153677-Paper.pdf>impact performance</a>
+. If you suspect that a performance issue is due to optimization ordering, please
+<a href=https://github.com/EnzymeAD/Enzyme/issues/new>open an issue</a>
+.</p><h2 id=heterogeneous-ad>Heterogeneous AD&nbsp;<a class=headline-hash href=#heterogeneous-ad></a></h2><p>It is often desirable to take derivatives of programs that run in part on the CPU and in part on the GPU. By placing a call to <code>__enzyme_autodiff</code> in a GPU kernel like above, one can successfully take the derivative of GPU programs. Similarly, one can use <code>__enzyme_autodiff</code> within CPU programs to differentiate programs which run entirely on the CPU. Unfortunately, differentiating functions that call GPU kernels requires a bit of extra work (shown below) &ndash; largely to work around the lack of support within LLVM for modules with multiple architecture targets.</p><p>To successfully differentiate across devices, we will use Enzyme on the GPU to export the augmented forward pass and reverse pass of the kernel being called, and then use Enzyme&rsquo;s custom derivative support to import those derivative functions into the CPU code. This then allows Enzyme to differentiate any CPU code that also calls the kernel.</p><p>Suppose we have a heterogeneous program like the following:</p><div class=highlight><pre class=chroma><code class=language-cpp data-lang=cpp>
<span class=c1>// GPU Kernel
</span><span class=c1></span><span class=n>__global__</span>
<span class=kt>void</span> <span class=nf>collide</span><span class=p>(</span><span class=kt>float</span><span class=o>*</span> <span class=n>src</span><span class=p>,</span> <span class=kt>float</span><span class=o>*</span> <span class=n>dst</span><span class=p>)</span> <span class=p>{</span>
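The CUDA one-liner in the hunk above hardcodes `sm_70` and the CUDA 10.1 library directory. Below is a minimal sketch, assuming a POSIX shell, of a hypothetical helper `enzyme_cuda_cmd` (not part of Enzyme). It prints the same command with the source file, plugin path, GPU architecture, and CUDA install directory turned into parameters. The defaults mirror the guide's values, and the `-Xclang -load` plugin syntax assumes an LLVM 12 or earlier toolchain, such as the guide's LLVM 11 setup.

```sh
#!/bin/sh
# Hypothetical helper (not part of Enzyme): print the CUDA one-liner from the
# CUDA guide with its hardcoded values turned into parameters.
#   $1  CUDA source file, e.g. test3.cu
#   $2  path to ClangEnzyme-<VERSION>.so (legacy -Xclang -load syntax, LLVM <= 12)
#   $3  GPU architecture, default sm_70 (the default sm_20 is unsupported by modern CUDA)
#   $4  CUDA install directory, default /usr/local/cuda-10.1
enzyme_cuda_cmd() {
  src=$1
  plugin=$2
  arch=${3:-sm_70}
  cuda_dir=${4:-/usr/local/cuda-10.1}
  printf '%s\n' "clang $src -Xclang -load -Xclang $plugin -O2 --cuda-gpu-arch=$arch -lcudart -L$cuda_dir/lib64"
}

# Example: print the command for an sm_60 GPU before running it.
#   enzyme_cuda_cmd test3.cu /path/to/ClangEnzyme-11.so sm_60
```

Printing the command instead of running it lets you review the flags first. Choose an `sm_xx` value that both your GPU and your installed CUDA version support.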
getting_started/Faq/index.html: 15 changes (11 additions, 4 deletions)
@@ -5,10 +5,17 @@
<a href=/getting_started/UsingEnzyme/>here</a>
. The flag disables preprocessing optimizations that Enzyme runs and tends to reduce these errors. If this doesn&rsquo;t work, check the following.</p><h3 id=illegal-typeanalysis-on-llvm-10>Illegal TypeAnalysis on LLVM 10+&nbsp;<a class=headline-hash href=#illegal-typeanalysis-on-llvm-10></a></h3><p>There is a
<a href="https://bugs.llvm.org/show_bug.cgi?id=47612">known bug</a>
-in an existing LLVM optimization pass (SROA) that will incorrectly generate type information from a memcpy. This bug has been fixed in LLVM 13.</p><h3 id=opt-cant-find--enzyme-option>opt can&rsquo;t find -enzyme option&nbsp;<a class=headline-hash href=#opt-cant-find--enzyme-option></a></h3><p>When you run opt as follows:
-<code>opt input.ll -load=/path/to/LLVMEnzyme-VERSION.so -enzyme -o output.ll -S</code>
-opt reports an error: <code>opt: unknown pass name 'enzyme'</code></p><p>This may be because you are using LLVM 13 or a higher version, which has switched to the new pass manager pipeline, and we haven&rsquo;t yet turned that on within Enzyme.
-For now, you can tell LLVM to use the old pass manager by adding the <code>--enable-new-pm=0</code> flag to opt or <code>-flegacy-pass-manager</code> to clang.</p><h3 id=unreachable-executed-gvn-error>UNREACHABLE executed (GVN error)&nbsp;<a class=headline-hash href=#unreachable-executed-gvn-error></a></h3><p>Until June 2020, LLVM&rsquo;s existing GVN pass had a bug handling invariant.load&rsquo;s that would cause it to crash. These tend to be generated a lot by Enzyme for better optimization. This was reported
+in an existing LLVM optimization pass (SROA) that will incorrectly generate type information from a memcpy. This bug has been fixed in LLVM 13.</p><h3 id=opt-cant-find--enzyme-option>opt can&rsquo;t find -enzyme option&nbsp;<a class=headline-hash href=#opt-cant-find--enzyme-option></a></h3><p>When you run opt as follows:</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>opt input.ll -load<span class=o>=</span>/path/to/LLVMEnzyme-VERSION.so -enzyme -o output.ll -S
+</code></pre></div><p>opt reports an error: <code>opt: unknown pass name 'enzyme'</code></p><p>This may be because you are using LLVM 13 or higher, which switched to the new pass manager by default. On LLVM versions that still include the legacy pass manager (up to and including LLVM 15),
+you can tell opt to use the old pass manager by adding the <code>--enable-new-pm=0</code> flag. Alternatively, you can use the new pass manager, which requires the following syntax
+for opt:</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>opt input.ll -load-pass-plugin<span class=o>=</span>/path/to/LLVMEnzyme-VERSION.so -passes<span class=o>=</span>enzyme -o output.ll -S
+</code></pre></div><p>Similarly, clang uses a different syntax to load plugins under the new pass manager. On LLVM versions up to and including LLVM 12, where the legacy pass manager is the default, a plugin is loaded as follows:</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>clang -Xclang -load -Xclang /path/to/ClangEnzyme-VERSION.so code.c
+</code></pre></div><p>On LLVM 13 and above, the new pass manager is the default. If you would like to keep using the old pass manager, you also need to provide
+either <code>-fno-experimental-new-pass-manager</code> on LLVM 13 and 14, or <code>-flegacy-pass-manager</code> on LLVM 15. LLVM 16 and above dropped support for the legacy pass
+manager entirely.</p><p>To use the new pass manager, Clang requires the following command-line arguments:</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>clang -fpass-plugin<span class=o>=</span>/path/to/ClangEnzyme-VERSION.so code.c
+</code></pre></div><p>If you are using CMake, Enzyme exports a special <code>ClangEnzymeFlags</code> target which will automatically add the correct flags. See
+<a href=https://github.com/EnzymeAD/Enzyme/blob/main/enzyme/test/test_find_package/CMakeLists.txt#L14>here</a>
+for an example.</p><h3 id=unreachable-executed-gvn-error>UNREACHABLE executed (GVN error)&nbsp;<a class=headline-hash href=#unreachable-executed-gvn-error></a></h3><p>Until June 2020, LLVM&rsquo;s existing GVN pass had a bug handling <code>invariant.load</code> instructions that would cause it to crash. Enzyme generates these frequently to enable better optimization. This was reported
<a href="https://bugs.llvm.org/show_bug.cgi?id=46054">here</a>
and resolved in master. Options for resolving this include updating to a later version of LLVM with the fix, or disabling the creation of <code>invariant.load</code> instructions.</p><h2 id=other>Other&nbsp;<a class=headline-hash href=#other></a></h2><p>If you have an issue not resolved here, please open an issue on
<a href=https://github.com/EnzymeAD/Enzyme>GitHub</a>
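The per-version plugin rules in the FAQ hunk above can be condensed into a small lookup. Below is a minimal sketch, assuming a POSIX shell. The function name `enzyme_clang_flags` and its `legacy` argument are hypothetical conveniences and are not part of Enzyme or Clang. The sketch encodes only the flags the FAQ lists:

- `-Xclang -load -Xclang` up to LLVM 12.
- `-fpass-plugin=` for the new pass manager on LLVM 13 and above.
- `-fno-experimental-new-pass-manager` (LLVM 13 and 14) or `-flegacy-pass-manager` (LLVM 15) to keep the legacy pass manager.

```sh
#!/bin/sh
# Hypothetical helper (not part of Enzyme or Clang): print the clang flags that
# load the ClangEnzyme plugin, following the per-version rules in the FAQ.
#   $1  LLVM/Clang major version, e.g. 12, 14, 16
#   $2  path to ClangEnzyme-<VERSION>.so
#   $3  optional: "legacy" to keep the old pass manager on LLVM 13-15
enzyme_clang_flags() {
  major=$1
  plugin=$2
  mode=${3:-new}
  legacy_load="-Xclang -load -Xclang $plugin"

  if [ "$major" -le 12 ]; then
    # LLVM <= 12: the legacy pass manager is the default.
    printf '%s\n' "$legacy_load"
  elif [ "$mode" != legacy ]; then
    # LLVM >= 13: the new pass manager is the default.
    printf '%s\n' "-fpass-plugin=$plugin"
  elif [ "$major" -le 14 ]; then
    # LLVM 13-14: opt back into the legacy pass manager.
    printf '%s\n' "-fno-experimental-new-pass-manager $legacy_load"
  elif [ "$major" -eq 15 ]; then
    # LLVM 15: the flag was renamed.
    printf '%s\n' "-flegacy-pass-manager $legacy_load"
  else
    # LLVM >= 16 removed the legacy pass manager entirely.
    printf 'error: LLVM %s has no legacy pass manager\n' "$major" >&2
    return 1
  fi
}

# Example: clang $(enzyme_clang_flags 15 /path/to/ClangEnzyme-15.so legacy) code.c
```

The flags come back as one string. The example substitutes them unquoted so the shell splits them into separate arguments, which assumes the plugin path contains no spaces. CMake users can rely on the exported `ClangEnzymeFlags` target instead.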
getting_started/UsingEnzyme/index.html: 5 changes (4 additions, 1 deletion)
@@ -73,7 +73,10 @@
<span class=p>}</span>
</code></pre></div><p>The generated gradient has been inlined and entirely simplified to return the input times two.</p><p>We can then compile this to a final binary as follows:</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>clang output_opt.ll -o a.exe
</code></pre></div><p>For convenience, we can combine the final optimization and compilation to a binary into one command as follows.</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>clang output.ll -O3 -o a.exe
-</code></pre></div><h2 id=advanced-options>Advanced options&nbsp;<a class=headline-hash href=#advanced-options></a></h2><p>Enzyme has several advanced options that may be of interest.</p><h3 id=performance-options>Performance options&nbsp;<a class=headline-hash href=#performance-options></a></h3><h4 id=disabling-preprocessing>Disabling Preprocessing&nbsp;<a class=headline-hash href=#disabling-preprocessing></a></h4><p>The <code>enzyme-preopt</code> option disables the preprocessing optimizations run by the Enzyme pass, except for the absolute minimum necessary.</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>$ opt input.ll -load<span class=o>=</span>./Enzyme/LLVMEnzyme-7.so -enzyme -enzyme-preopt<span class=o>=</span><span class=m>1</span>
+</code></pre></div><p>Moreover, using Enzyme&rsquo;s clang plugin, we can automate the entire AD and compilation process in a single command. We recommend using the clang plugin by default, as it improves the user experience and enables various performance options by default. However, the example above is still useful for understanding how Enzyme works on LLVM.</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>clang test.c -Xclang -load -Xclang /path/to/Enzyme/enzyme/build/Enzyme/ClangEnzyme-&lt;VERSION&gt;.so -o a.exe
+</code></pre></div><p>Note that each version of Clang/LLVM has slightly different command-line flags for specifying plugins. See
+<a href=/getting_started/Faq/>the FAQ</a>
+for more information.</p><h2 id=advanced-options>Advanced options&nbsp;<a class=headline-hash href=#advanced-options></a></h2><p>Enzyme has several advanced options that may be of interest.</p><h3 id=performance-options>Performance options&nbsp;<a class=headline-hash href=#performance-options></a></h3><h4 id=disabling-preprocessing>Disabling Preprocessing&nbsp;<a class=headline-hash href=#disabling-preprocessing></a></h4><p>Setting the <code>enzyme-preopt</code> option to 0 disables the preprocessing optimizations run by the Enzyme pass, except for the absolute minimum necessary.</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>$ opt input.ll -load<span class=o>=</span>./Enzyme/LLVMEnzyme-7.so -enzyme -enzyme-preopt<span class=o>=</span><span class=m>1</span>
$ opt input.ll -load<span class=o>=</span>./Enzyme/LLVMEnzyme-7.so -enzyme -enzyme-preopt<span class=o>=</span><span class=m>0</span>
</code></pre></div><h4 id=forced-inlining>Forced Inlining&nbsp;<a class=headline-hash href=#forced-inlining></a></h4><p>The <code>enzyme-inline</code> option forcibly inlines all subfunction calls. The <code>enzyme-inline-count</code> option limits the number of calls inlined by this utility.</p><div class=highlight><pre class=chroma><code class=language-sh data-lang=sh>$ opt input.ll -load<span class=o>=</span>./Enzyme/LLVMEnzyme-7.so -enzyme -enzyme-inline<span class=o>=</span><span class=m>1</span>
$ opt input.ll -load<span class=o>=</span>./Enzyme/LLVMEnzyme-7.so -enzyme -enzyme-inline<span class=o>=</span><span class=m>1</span> -enzyme-inline-count<span class=o>=</span><span class=m>100</span>
