diff --git a/content/2-gpu-ecosystem.rst b/content/2-gpu-ecosystem.rst index 82e4cb1..6b6e059 100644 --- a/content/2-gpu-ecosystem.rst +++ b/content/2-gpu-ecosystem.rst @@ -188,7 +188,6 @@ Apart from what was presented above there are many others tools and features pro ROCm ^^^^ - ROCm is an open software platform allowing researchers to tap the power of AMD accelerators. The ROCm platform is built on the foundation of open portability, supporting environments across multiple accelerator vendors and architectures. In some way it is very similar to CUDA API. @@ -220,7 +219,6 @@ oneAPI supports multiple programming models and programming languages. It enable Overall, Intel oneAPI offers a comprehensive and unified approach to heterogeneous computing, empowering developers to optimize and deploy applications across different architectures with ease. By abstracting the complexities and providing a consistent programming interface, oneAPI promotes code reusability, productivity, and performance portability, making it an invaluable toolkit for developers in the era of diverse computing platforms. - Differences and similarities ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/content/3-gpu-problems.rst b/content/3-gpu-problems.rst index 1f1cefa..b6f5592 100644 --- a/content/3-gpu-problems.rst +++ b/content/3-gpu-problems.rst @@ -45,6 +45,7 @@ Specifically, you can expect good performance on GPUs for: - **Big data analytics**: Clustering, classification, regression, etc. - **Graphics rendering**: Original use-case for GPUs. + What are GPUs not good for -------------------------- @@ -78,6 +79,7 @@ Some types of problems that do not fit well on a GPU include: can be a limiting factor. If a problem requires a large amount of memory or involves memory-intensive operations, it may not be well-suited for a GPU. 
+ Examples of GPU acceleration ---------------------------- @@ -145,20 +147,28 @@ To give a flavor of what type of performance gains we can achieve by porting a c - 5.910 ms - ~550x / ~27x + Electronic structure calculations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -VASP is a popular software package used for electronic structure calculations. The figures below show the speedup observed in a recent benchmark study on the Perlmutter and Cori supercomputers, along with an analysis of total energy usage. +VASP is a popular software package used for electronic structure calculations. +The figures below show the speedup observed in a recent benchmark study on the +Perlmutter and Cori supercomputers, along with an analysis of total energy usage. .. figure:: img/problems/vasp-gpu.png :align: center - VASP GPU speedup for benchmark Si128 acfdtr. The horizontal axis shows the number of nodes, and the vertical axis shows the GPU speedup of VASP (Time(CPU)/Time(GPU)). (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs). + VASP GPU speedup for benchmark Si128 acfdtr. The horizontal axis shows the number + of nodes, and the vertical axis shows the GPU speedup of VASP (Time(CPU)/Time(GPU)). + (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs). .. figure:: img/problems/vasp-energy.png :align: center - Total energy usage comparison when running VASP on Perlmutter and Cori. The vertical axis shows the energy used by VASP benchmark jobs on Perlmutter GPUs (blue bars), CPUs (red bars), Cori KNL (yellow bars), and Cori Haswell (green bars) in ratio to the Cori Haswell usage. (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs) + Total energy usage comparison when running VASP on Perlmutter and Cori. The vertical + axis shows the energy used by VASP benchmark jobs on Perlmutter GPUs (blue bars), + CPUs (red bars), Cori KNL (yellow bars), and Cori Haswell (green bars) relative + to the Cori Haswell usage (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs). 
@@ -173,10 +183,11 @@ Fock matrix whose elements are given by: .. math:: F_{\alpha \beta} = H^{\textrm{core}}_{\alpha \beta} + \sum_{\gamma \delta}D_{\gamma \delta} \left [ (\alpha \beta|\gamma \delta) - \frac{1}{2} (\alpha \delta|\gamma \beta) \right ], -The first term is related to the one electron contributions and the second term is related to the -electron repulsion integrals (ERIs), in parenthesis, weighted by the by the density matrix -:math:`D_{\gamma \delta}`. One of the most expensive parts in the solution of the Hartree-Fock equations is the -processing (digestion) of the ERIs, one algorithm to do this task is as follows: +The first term is related to the one-electron contributions and the second term is +related to the electron repulsion integrals (ERIs), in parentheses, weighted by +the density matrix :math:`D_{\gamma \delta}`. One of the most expensive parts in +the solution of the Hartree-Fock equations is the processing (digestion) of the ERIs; +one algorithm for this task is as follows: .. figure:: img/problems/hartree-fock-algorithm.png :width: 200 @@ -216,7 +227,14 @@ preserve the site for future researchers to gain critical insights and contribut Techniques such as Markov Chain Monte Carlo (MCMC) sampling have proven to be invaluable in studies that delve into human behavior or population dynamics. MCMC sampling allows researchers to simulate and analyze complex systems by iteratively sampling from a Markov chain, enabling the exploration of high-dimensional parameter spaces. This method is particularly useful when studying human behavior, as it can capture the inherent randomness and interdependencies that characterize social systems. By leveraging MCMC sampling, researchers can gain insights into various aspects of human behavior, such as decision-making, social interactions, and the spread of information or diseases within populations. 
-By offloading the computational workload to GPUs, researchers can experience substantial speedup in the execution of MCMC algorithms. This speedup allows for more extensive exploration of parameter spaces and facilitates the analysis of larger datasets, leading to more accurate and detailed insights into human behavior or population dynamics. Examples of studies done using these methods can be found at the `Center for Humanities Computing Aarhus `__ (CHCAA) and `Interacting Minds Centre `__ (IMC) at Aarhus University. +By offloading the computational workload to GPUs, researchers can experience substantial +speedup in the execution of MCMC algorithms. This speedup allows for more extensive +exploration of parameter spaces and facilitates the analysis of larger datasets, +leading to more accurate and detailed insights into human behavior or population +dynamics. Examples of studies done using these methods can be found at the +`Center for Humanities Computing Aarhus `__ (CHCAA) and +`Interacting Minds Centre `__ (IMC) at Aarhus University. + Exercises --------- diff --git a/content/4-gpu-concepts.rst b/content/4-gpu-concepts.rst index d097116..22864f6 100644 --- a/content/4-gpu-concepts.rst +++ b/content/4-gpu-concepts.rst @@ -95,11 +95,11 @@ Data parallelism can usually be explored by the GPUs quite easily. The most basic approach would be finding a loop over many data elements and converting it into a GPU kernel. If the number of elements in the data set is fairly large (tens or hundred of thousands elements), the GPU should perform quite well. Although it would be odd to expect absolute maximum performance from such a naive approach, it is often the one to take. Getting absolute maximum out of the data parallelism requires good understanding of how GPU works. - -Another type of parallelism is a task parallelism. -This is when an application consists of more than one task that requiring to perform different operations with (the same or) different data. 
-An example of task parallelism is cooking: slicing vegetables and grilling are very different tasks and can be done at the same time. -Note that the tasks can consume totally different resources, which also can be explored. +Another type of parallelism is task parallelism. This is when an application consists +of more than one task, each requiring different operations on (the same or) +different data. An example of task parallelism is cooking: slicing vegetables and grilling +are very different tasks and can be done at the same time. Note that the tasks can consume +totally different resources, which can also be exploited. .. admonition:: In short :class: dropdown @@ -114,10 +114,16 @@ Note that the tasks can consume totally different resources, which also can be e - Task parallelism involves multiple independent tasks that perform different operations on the same or different data. - Task parallelism involves executing different tasks concurrently, leveraging different resources. + GPU Execution Model ------------------- -In order to obtain maximum performance it is important to understand how GPUs execute the programs. As mentioned before a CPU is a flexible device oriented towards general purpose usage. It's fast and versatile, designed to run operating systems and various, very different types of applications. It has lots of features, such as better control logic, caches and cache coherence, that are not related to pure computing. CPUs optimize the execution by trying to achieve low latency via heavy caching and branch prediction. +In order to obtain maximum performance it is important to understand how GPUs execute +programs. As mentioned before, a CPU is a flexible device oriented towards +general-purpose usage. It's fast and versatile, designed to run operating systems and +various, very different types of applications. It has many features, such as better +control logic, caches and cache coherence, that are not related to pure computing. 
CPUs optimize +the execution by trying to achieve low latency via heavy caching and branch prediction. .. figure:: img/concepts/cpu-gpu-highway.png :align: center Cars and roads analogy for the CPU and GPU behavior. The compact road is analogous to the CPU (low latency, low throughput) and the broader road is analogous to the GPU (high latency, high throughput). -In contrast the GPUs contain a relatively small amount of transistors dedicated to control and caching, and a much larger fraction of transistors dedicated to the mathematical operations. Since the cores in a GPU are designed just for 3D graphics, they can be made much simpler and there can be a very larger number of cores. The current GPUs contain thousands of CUDA cores. Performance in GPUs is obtain by having a very high degree of parallelism. Lots of threads are launched in parallel. For good performance there should be at least several times more than the number of CUDA cores. GPU :abbr:`threads` are much lighter than the usual CPU threads and they have very little penalty for context switching. This way when some threads are performing some memory operations (reading or writing) others execute instructions. +In contrast, GPUs contain a relatively small number of transistors dedicated to +control and caching, and a much larger fraction of transistors dedicated to +mathematical operations. Since the cores in a GPU were designed just for 3D graphics, +they can be made much simpler and there can be a very large number of them. +Current GPUs contain thousands of CUDA cores. Performance in GPUs is obtained by +having a very high degree of parallelism: many threads are launched in parallel. +For good performance there should be at least several times more threads than CUDA cores. 
+GPU :abbr:`threads` are much lighter than the usual CPU threads and they have very little +penalty for context switching. This way, when some threads are performing memory +operations (reading or writing), others execute instructions. + CUDA Threads, Warps, Blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~ -In order to perform some work the program launches a function called *kernel*, which is executed simultaneously by tens of thousands of :abbr:`threads` that can be run on GPU cores parallelly. GPU threads are much lighter than the usual CPU threads and they have very little penalty for context switching. By "over-subscribing" the GPU there are threads that are performing some memory operations (reading or writing), while others execute instructions. +In order to perform some work the program launches a function called *kernel*, which +is executed simultaneously by tens of thousands of :abbr:`threads` that can run on +GPU cores in parallel. GPU threads are much lighter than the usual CPU threads and +they have very little penalty for context switching. By "over-subscribing" the GPU +there are threads that are performing some memory operations (reading or writing), +while others execute instructions. .. figure:: img/concepts/thread-core.jpg :align: center :scale: 40 % -Every :abbr:`thread` is associated with a particular intrinsic index which can be used to calculate and access memory locations in an array. Each thread has its context and set of private variables. All threads have access to the global GPU memory, but there is no general way to synchronize when executing a kernel. If some threads need data from the global memory which was modified by other threads the code would have to be splitted in several kernels because only at the completion of a kernel it is ensured that the writing to the global memory was completed. +Every :abbr:`thread` is associated with a particular intrinsic index which can be +used to calculate and access memory locations in an array. 
Each thread has its +context and set of private variables. All threads have access to the global GPU +memory, but there is no general way to synchronize when executing a kernel. If some +threads need data from the global memory which was modified by other threads, the +code has to be split into several kernels, because only at the completion +of a kernel is it ensured that the writes to the global memory have completed. -Apart from being much light weighted there are more differences between GPU threads and CPU threads. GPU :abbr:`threads` are grouped together in groups called :abbr:`warps`. This done at hardware level. +Apart from being much more lightweight, there are more differences between GPU threads +and CPU threads. GPU :abbr:`threads` are grouped together in groups called :abbr:`warps`. +This is done at the hardware level. .. figure:: img/concepts/warp-simt.jpg :align: center :scale: 40 % -All memory accesses to the GPU memory are as a group in blocks of specific sizes (32B, 64B, 128B etc.). To obtain good performance the CUDA threads in the same warp need to access elements of the data which are adjacent in the memory. This is called *coalesced* memory access. - - -On some architectures, all members of a :abbr:`warp` have to execute the -same instruction, the so-called "lock-step" execution. This is done to achieve -higher performance, but there are some drawbacks. If an **if** statement -is present inside a :abbr:`warp` will cause the warp to be executed more than once, -one time for each branch. When different threads within a single :abbr:`warp` -take different execution paths based on a conditional statement (if), both -branches are executed sequentially, with some threads being active while -others are inactive. On architectures without lock-step execution, such -as NVIDIA Volta / Turing (e.g., GeForce 16xx-series) or newer, :abbr:`warp` -divergence is less costly. - -There is another level in the GPU :abbr:`threads` hierarchy. 
The :abbr:`threads` are grouped together in so called :abbr:`blocks`. Each block is assigned to one Streaming Multiprocessor (SMP) unit. A SMP contains one or more SIMT (single instruction multiple threads) units, schedulers, and very fast on-chip memory. Some of this on-chip memory can be used in the programs, this is called :abbr:`shared memory`. The shared memory can be used to "cache" data that is used by more than one thread, thus avoiding multiple reads from the global memory. It can also be used to avoid memory accesses which are not efficient. For example in a matrix transpose operation, we have two memory operations per element and only can be coalesced. In the first step a tile of the matrix is saved read a coalesced manner in the shared memory. After all the reads of the block are done the tile can be locally transposed (which is very fast) and then written to the destination matrix in a coalesced manner as well. Shared memory can also be used to perform block-level reductions and similar collective operations. All threads can be synchronized at block level. Furthermore when the shared memory is written in order to ensure that all threads have completed the operation the synchronization is compulsory to ensure correctness of the program. - - +All memory accesses to the GPU memory are done as a group in blocks of specific sizes (32B, +64B, 128B, etc.). To obtain good performance, the CUDA threads in the same warp need to +access elements of the data which are adjacent in memory. This is called *coalesced* memory access. + +On some architectures, all members of a :abbr:`warp` have to execute the same instruction, +the so-called "lock-step" execution. This is done to achieve higher performance, +but there are some drawbacks. If an **if** statement is present inside a :abbr:`warp`, it +will cause the warp to be executed more than once, one time for each branch. 
+When different threads within a single :abbr:`warp` take different execution paths +based on a conditional statement (if), both branches are executed sequentially, +with some threads being active while others are inactive. On architectures without +lock-step execution, such as NVIDIA Volta / Turing (e.g., GeForce 16xx-series) or +newer, :abbr:`warp` divergence is less costly. + +There is another level in the GPU :abbr:`threads` hierarchy. The :abbr:`threads` are +grouped together in so-called :abbr:`blocks`. Each block is assigned to one Streaming +Multiprocessor (SMP) unit. An SMP contains one or more SIMT (single instruction multiple +threads) units, schedulers, and very fast on-chip memory. Some of this on-chip memory +can be used in programs; this is called :abbr:`shared memory`. The shared memory +can be used to "cache" data that is used by more than one thread, thus avoiding multiple +reads from the global memory. It can also be used to avoid memory accesses which are not +efficient. For example, in a matrix transpose operation we have two memory operations +per element and only one of them can be coalesced. In the first step a tile of the matrix +is read in a coalesced manner into the shared memory. After all the reads of the block are done +the tile can be locally transposed (which is very fast) and then written to the destination +matrix in a coalesced manner as well. Shared memory can also be used to perform block-level +reductions and similar collective operations. All threads can be synchronized at block level. +Furthermore, after the shared memory is written, synchronization is compulsory to ensure +that all threads have completed the operation and to guarantee correctness of the program. .. figure:: img/concepts/block-smp.jpg :align: center :scale: 40 % -Finally, a block of threads can not be splitted among SMPs. For performance blocks should have more than one :abbr:`warp`. 
The more warps are active on an SMP the better is hidden the latency associated with the memory operations. If the resources are sufficient, due to fast context switching, an SMP can have more than one block active in the same time. However these blocks can not share data with each other via the on-chip memory. - - -To summarize this section. In order to take advantage of GPUs the algorithms must allow the division of work in many small subtasks which can be executed in the same time. The computations are offloaded to GPUs, by launching tens of thousands of threads all executing the same function, *kernel*, each thread working on different part of the problem. The threads are executed in groups called *blocks*, each block being assigned to a SMP. Furthermore the threads of a block are divided in *warps*, each executed by SIMT unit. All threads in a warp execute the same instructions and all memory accesses are done collectively at warp level. The threads can synchronize and share data only at block level. Depending on the architecture, some data sharing can be done as well at warp level. - -In order to hide latencies it is recommended to "over-subscribe" the GPU. There should be many more blocks than SMPs present on the device. Also in order to ensure a good occupancy of the CUDA cores there should be more warps active on a given SMP than SIMT units. This way while some warps of threads are idle waiting for some memory operations to complete, others use the CUDA cores, thus ensuring a high occupancy of the GPU. - -In addition to this there are some architecture-specific features of which the developers can take advantage. :abbr:`Warp`-level operations are primitives provided by the GPU architecture to allow for efficient communication and synchronization within a warp. They allow :abbr:`threads` within a warp to exchange data efficiently, without the need for explicit synchronization. 
These warp-level operations, combined with the organization of threads into blocks and clusters, make it possible to implement complex algorithms and achieve high performance on the GPU. The cooperative groups feature introduced in recent versions of CUDA provides even finer-grained control over thread execution, allowing for even more efficient processing by giving more flexibility to the thread hierarchy. Cooperative groups allow threads within a block to organize themselves into smaller groups, called cooperative groups, and to synchronize their execution and share data within the group. +Finally, a block of threads cannot be split among SMPs. For performance, blocks should +have more than one :abbr:`warp`. The more warps are active on an SMP, the better the +latency associated with the memory operations is hidden. If the resources are sufficient, +due to fast context switching, an SMP can have more than one block active at the same time. +However, these blocks cannot share data with each other via the on-chip memory. + +To summarize this section: in order to take advantage of GPUs, the algorithms must allow +the division of work into many small subtasks which can be executed at the same time. +The computations are offloaded to GPUs by launching tens of thousands of threads +all executing the same function, the *kernel*, each thread working on a different part of the problem. +The threads are executed in groups called *blocks*, each block being assigned to an SMP. +Furthermore, the threads of a block are divided into *warps*, each executed by a SIMT unit. +All threads in a warp execute the same instructions and all memory accesses are done +collectively at warp level. The threads can synchronize and share data only at block level. +Depending on the architecture, some data sharing can be done as well at warp level. + +In order to hide latencies it is recommended to "over-subscribe" the GPU. +There should be many more blocks than SMPs present on the device. 
+Also, in order to ensure good occupancy of the CUDA cores, there should be more +warps active on a given SMP than SIMT units. This way, while some warps of threads are +idle waiting for memory operations to complete, others use the CUDA cores, +thus ensuring a high occupancy of the GPU. + +In addition, there are some architecture-specific features of which developers +can take advantage. :abbr:`Warp`-level operations are primitives provided by the +GPU architecture to allow for efficient communication and synchronization within a warp. +They allow :abbr:`threads` within a warp to exchange data efficiently, without the need +for explicit synchronization. These warp-level operations, combined with the organization of +threads into blocks and clusters, make it possible to implement complex algorithms and +achieve high performance on the GPU. The cooperative groups feature introduced in +recent versions of CUDA provides even finer-grained control over thread execution, +allowing for even more efficient processing by giving more flexibility to the thread hierarchy. +Cooperative groups allow threads within a block to organize themselves into smaller +groups and to synchronize their execution and share data within the group. Below there is an example of how the threads in a grid can be associated with specific elements of an array - - .. figure:: img/concepts/indexing.png :align: center :scale: 40 % -The thread marked by orange color is part of a grid of threads size 4096. The threads are grouped in blocks of size 256. The "orange" thread has index 3 in the block 2 and the global calculated index 515. +The thread marked in orange is part of a grid of 4096 threads. +The threads are grouped in blocks of size 256. The "orange" thread has index 3 +in block 2 and thus the calculated global index 515. For a vector addition example this would be used as follow ``c[index]=a[index]+b[index]``. 
@@ -205,11 +270,15 @@ For a vector addition example this would be used as follow ``c[index]=a[index]+b - Thread indexing allows associating threads with specific elements in an array for parallel processing. - Terminology ----------- -At the moment there are three major GPU producers: NVIDIA, Intel, and AMD. While the basic concept behind GPUs is pretty similar they use different names for the various parts. Furthermore there are software environments for GPU programming, some from the producers and some from external groups all having different naming as well. Below there is a short compilation of the some terms used across different platforms and software environments. +At the moment there are three major GPU producers: NVIDIA, Intel, and AMD. +While the basic concept behind GPUs is pretty similar, they use different names for the various parts. +Furthermore, there are software environments for GPU programming, some from the producers +and some from external groups, all having different naming as well. +Below there is a short compilation of some terms used across different platforms and software environments. + Software ~~~~~~~~ diff --git a/content/index.rst b/content/index.rst index a7047af..cf6da31 100644 --- a/content/index.rst +++ b/content/index.rst @@ -106,8 +106,7 @@ Credits ------- Several sections in this lesson have been adapted from the following sources created by -`ENCCS `__ and `CSC `__, which are -all distributed under a +`ENCCS `__ and `CSC `__, which are all distributed under a `Creative Commons Attribution license (CC-BY-4.0) `__: - `High Performance Data Analytics in Python `__ @@ -117,8 +116,7 @@ all distributed under a The lesson file structure and browsing layout is inspired by and derived from `work `__ by `CodeRefinery `__ licensed under the `MIT license -`__. We have copied and adapted -most of their license text. +`__. We have copied and adapted most of their license text. Instructional Material