diff --git a/content/1-gpu-history.rst b/content/1-gpu-history.rst
index 2f71318..ac7df45 100644
--- a/content/1-gpu-history.rst
+++ b/content/1-gpu-history.rst
@@ -10,7 +10,7 @@ Why GPUs?
 
 .. objectives::
 
-   - Explain the historical development of microprocessors and how GPUs enable 
+   - Explain the historical development of microprocessors and how GPUs enable
      continued scaling in computational power
 
 .. instructor-note::
@@ -19,7 +19,7 @@ Why GPUs?
    - 0 min exercises
 
 Moore's law
 -----------
 
 It states that the number of transistors in a dense integrated circuit doubles about every two years.
@@ -33,28 +33,28 @@ Higher performance of a single node has to rely on its more complicated structur
    The evolution of microprocessors. The number of transistors per chip increase roughly every 2 years.
    However, it can no longer be explored by the core frequency due to the power consumption limits.
-   Before 2000, the increase in the single core clock frequency was the major source of the 
+   Before 2000, the increase in the single core clock frequency was the major source of the
    increase in the performance. Mid 2000 mark a transition towards multi-core processors.
 
 Increasing performance has been sustained with two main strategies over the years:
 
-    - Increase the single processor performance: 
+    - Increase the single processor performance:
     - More recently, increase the number of physical cores.
 
 
 Computing in parallel
 ---------------------
 
-The underlying idea of parallel computing is to split a computational problem into smaller 
-subtasks. Many subtasks can then be solved *simultaneously* by multiple processing units. 
+The underlying idea of parallel computing is to split a computational problem into smaller
+subtasks. Many subtasks can then be solved *simultaneously* by multiple processing units.
 
-.. figure:: img/history/compp.png
+.. figure:: img/history/parallel-computing.png
    :align: center
-   
-   Computing in parallel.
 
-How a problem is split into smaller subtasks strongly depends on the problem.
-There are various paradigms and programming approaches to do this. 
+   Serial processing and parallel computing.
+
+How a problem is split into smaller subtasks strongly depends on the problem.
+There are various paradigms and programming approaches to do this.
 
 
 Graphics processing units
@@ -67,18 +67,18 @@ But over the years, they were used more and more in HPC.
 GPUs are a specialized parallel hardware for floating point operations.
 They are basically co-processors (helpers) for traditional CPUs: CPU still controls the work flow
 but it delegates highly-parallel tasks to the GPU.
-GPUs are based on highly parallel architectures, which allows taking advantage of the 
+GPUs are based on highly parallel architectures, which allows them to take advantage of the
 increasing number of transistors.
 
 Using GPUs allows one to achieve extreme performance per node.
-As a result, the single GPU-equipped workstation can outperform small CPU-based clusters 
+As a result, a single GPU-equipped workstation can outperform small CPU-based clusters
 for some type of computational tasks. The drawback is: usually major rewrites of programs is required
 with an accompanying change in the programming paradigm.
 
..
callout:: Host vs device - GPU-enabled systems require a heterogeneous programming model that involves both - CPU and GPU, where the CPU and its memory are referred to as the host, + GPU-enabled systems require a heterogeneous programming model that involves both + CPU and GPU, where the CPU and its memory are referred to as the host, and the GPU and its memory as the device. .. figure:: img/history/CPU_and_GPU_separated.png @@ -92,7 +92,7 @@ A look at the Top-500 list The `TOP500 project `__ ranks and details the 500 most powerful non-distributed computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The snapshot below shows the top-5 HPC systems as of June 2023, where the columns show: -- **Cores** - Number of processors +- **Cores** - Number of processors - **Rmax** - Maximal LINPACK performance achieved - **Rpeak** - Theoretical peak performance - **Power** - Power consumption @@ -116,13 +116,13 @@ GPU computing can significantly accelerate many types of scientific workloads. Improved energy efficiency ^^^^^^^^^^^^^^^^^^^^^^^^^^ -Compared to CPUs, GPUs can perform more calculations per watt of power consumed, +Compared to CPUs, GPUs can perform more calculations per watt of power consumed, which can result in significant energy savings. This is indeed evident from the `Green500 list `__. -Cost-effectiveness +Cost-effectiveness ^^^^^^^^^^^^^^^^^^ -GPUs can be more cost-effective than traditional CPU-based systems for certain +GPUs can be more cost-effective than traditional CPU-based systems for certain workloads. @@ -132,17 +132,17 @@ Limitations and drawbacks Only for certain workloads ^^^^^^^^^^^^^^^^^^^^^^^^^^ -Not all workloads can be efficiently parallelized and accelerated on GPUs. -Certain types of workloads, such as those with irregular data access patterns or +Not all workloads can be efficiently parallelized and accelerated on GPUs. +Certain types of workloads, such as those with irregular data access patterns or high branching behavior, may not see significant performance improvements on GPUs. Steeper learning curve ^^^^^^^^^^^^^^^^^^^^^^ -Depending on the GPU programming API that you choose, GPU computing could -require specialized skills in GPU programming and knowledge of -GPU architecture, leading to a steeper learning curve compared to CPU programming. -Fortunately, if you study this training material closely you will become productive +Depending on the GPU programming API that you choose, GPU computing could +require specialized skills in GPU programming and knowledge of +GPU architecture, leading to a steeper learning curve compared to CPU programming. +Fortunately, if you study this training material closely you will become productive with GPU programming quickly! diff --git a/content/2-gpu-ecosystem.rst b/content/2-gpu-ecosystem.rst index 44af48a..82e4cb1 100644 --- a/content/2-gpu-ecosystem.rst +++ b/content/2-gpu-ecosystem.rst @@ -22,7 +22,7 @@ The GPU hardware and software ecosystem Overview of GPU hardware ------------------------ -.. figure:: img/hardware/CPUAndGPU.png +.. figure:: img/hardware/cpu-gpu.png :align: center A comparison of the CPU and GPU architecture. @@ -60,22 +60,22 @@ This allows to hide the memory operations: while some threads wait, others can c How do GPUs differ from CPUs? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -CPUs and GPUs were designed with different goals in mind. 
While the CPU -is designed to excel at executing a sequence of operations, called a thread, -as fast as possible and can execute a few tens of these threads in parallel, -the GPU is designed to excel at executing many thousands of them in parallel. -GPUs were initially developed for highly-parallel task of graphic processing -and therefore designed such that more transistors are devoted to data processing -rather than data caching and flow control. More transistors dedicated to -data processing is beneficial for highly parallel computations; the GPU can -hide memory access latencies with computation, instead of relying on large data caches -and complex flow control to avoid long memory access latencies, +CPUs and GPUs were designed with different goals in mind. While the CPU +is designed to excel at executing a sequence of operations, called a thread, +as fast as possible and can execute a few tens of these threads in parallel, +the GPU is designed to excel at executing many thousands of them in parallel. +GPUs were initially developed for highly-parallel task of graphic processing +and therefore designed such that more transistors are devoted to data processing +rather than data caching and flow control. More transistors dedicated to +data processing is beneficial for highly parallel computations; the GPU can +hide memory access latencies with computation, instead of relying on large data caches +and complex flow control to avoid long memory access latencies, both of which are expensive in terms of transistors. -.. list-table:: +.. list-table:: :widths: 100 100 :header-rows: 1 @@ -97,9 +97,9 @@ both of which are expensive in terms of transistors. GPU platforms ------------- -GPUs come together with software stacks or APIs that work in conjunction with the hardware and give a standard way for the software to interact with the GPU hardware. They are used by software developers to write code that can take advantage of the parallel processing power of the GPU, and they provide a standard way for software to interact with the GPU hardware. Typically, they provide access to low-level functionality, such as memory management, data transfer between the CPU and the GPU, and the scheduling and execution of parallel processing tasks on the GPU. They may also provide higher level functions and libraries optimized for specific HPC workloads, like linear algebra or fast Fourier transforms. Finally, in order to facilitate the developers to optimize and write correct codes, debugging and profiling tools are also included. +GPUs come together with software stacks or APIs that work in conjunction with the hardware and give a standard way for the software to interact with the GPU hardware. They are used by software developers to write code that can take advantage of the parallel processing power of the GPU, and they provide a standard way for software to interact with the GPU hardware. Typically, they provide access to low-level functionality, such as memory management, data transfer between the CPU and the GPU, and the scheduling and execution of parallel processing tasks on the GPU. They may also provide higher level functions and libraries optimized for specific HPC workloads, like linear algebra or fast Fourier transforms. Finally, in order to facilitate the developers to optimize and write correct codes, debugging and profiling tools are also included. 
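+
+To make this more concrete, below is a minimal sketch of the typical allocate-copy-launch-copy
+pattern that such an API exposes, written here against the CUDA runtime (the kernel, array size,
+and block size of 256 are arbitrary choices for illustration, not a recommended setup):
+
+.. code-block:: cpp
+
+   #include <cuda_runtime.h>
+   #include <vector>
+
+   // Kernel: executed on the GPU by many threads, one array element per thread.
+   __global__ void scale(float *x, float a, int n)
+   {
+       int i = blockIdx.x * blockDim.x + threadIdx.x;
+       if (i < n)
+           x[i] = a * x[i];
+   }
+
+   int main()
+   {
+       const int n = 1 << 20;
+       std::vector<float> h(n, 1.0f);                    // host (CPU) data
+
+       float *d = nullptr;
+       cudaMalloc((void **)&d, n * sizeof(float));       // memory management on the device
+       cudaMemcpy(d, h.data(), n * sizeof(float),
+                  cudaMemcpyHostToDevice);               // data transfer CPU -> GPU
+
+       scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);      // schedule parallel work on the GPU
+       cudaDeviceSynchronize();                          // wait for the kernel to finish
+
+       cudaMemcpy(h.data(), d, n * sizeof(float),
+                  cudaMemcpyDeviceToHost);               // data transfer GPU -> CPU
+       cudaFree(d);
+       return 0;
+   }
+
+The same pattern, with differently prefixed calls, appears in each of the vendor suites introduced below.
+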
-*NVIDIA*, *AMD*, and *Intel* are the major companies which design and produces GPUs for HPC providing each its own suite **CUDA**, **ROCm**, and respectively **oneAPI**. This way they can offer optimization, differentiation (offering unique features tailored to their devices), vendor lock-in, licensing, and royalty fees, which can result in better performance, profitability, and customer loyalty. +*NVIDIA*, *AMD*, and *Intel* are the major companies which design and produces GPUs for HPC providing each its own suite **CUDA**, **ROCm**, and respectively **oneAPI**. This way they can offer optimization, differentiation (offering unique features tailored to their devices), vendor lock-in, licensing, and royalty fees, which can result in better performance, profitability, and customer loyalty. There are also cross-platform APIs such **DirectCompute** (only for Windows operating system), **OpenCL**, and **SYCL**. .. admonition:: CUDA - In short @@ -132,7 +132,7 @@ There are also cross-platform APIs such **DirectCompute** (only for Windows oper - Debugging: ``roc-gdb`` command line tool - Facilitates debugging of GPU programs - Performance analysis: ``rocprof`` and ``roctracer`` tools - - Analyze and optimize program performance + - Analyze and optimize program performance - Supports various heterogenous programming models such as **HIP**, **OpenMP**, and **OpenCL** - Heterogeneous-Computing Interface for Portability (HIP) - Enables source portability for NVIDIA and AMD platforms, Intel in plan @@ -178,31 +178,31 @@ In addition to this are provided **nvc** (C11 compiler), **nvc++** (C++17 compil When programming mistakes are inevitable they have to be fixed as soon as possible. The CUDA toolkit includes the command line tool **cuda-gdb** which can be used to find errors in the code. It is an extension to GDB, the GNU Project debugger. The existing GDB debugging features are inherently present for debugging the host code, and additional features have been provided to support debugging CUDA device code, allowing simultaneous debugging of both GPU and CPU code within the same application. The tool provides developers with a mechanism for debugging CUDA applications running on actual hardware. This enables developers to debug applications without the potential variations introduced by simulation and emulation environments. -In addition to this the command line tool **compute-sanitizer** can be used to look exclusively for memory access problems: unallocated buffers, out of bounds accesses, race conditions, and uninitialized variables. +In addition to this the command line tool **compute-sanitizer** can be used to look exclusively for memory access problems: unallocated buffers, out of bounds accesses, race conditions, and uninitialized variables. Finally, in order to utilize the GPUs at maximum some performance analysis tools. NVIDIA provides NVIDIA Nsight Systems and NVIDIA Nsight Compute tools for helping the developers to optimize their applications. The former, NVIDIA Nsight Systems, is a system-wide performance analysis tool that provides detailed metrics on both CPU and GPU usage, memory bandwidth, and other system-level metrics. The latter, NVIDIA Nsight Compute, is a kernel-level performance analysis tool that allows developers to analyze the performance of individual CUDA kernels. It provides detailed metrics on kernel execution, including memory usage, instruction throughput, and occupancy. 
These tools have graphical which can be used for all steps of the performance analysis, however on supercomputers it is recommended to use the command line interface for collecting the information needed and then visualize and analyse the results using the graphical interface on personal computers. -Apart from what was presented above there are many others tools and features provided by NVIDIA. The CUDA eco-system is very well developed. +Apart from what was presented above there are many others tools and features provided by NVIDIA. The CUDA eco-system is very well developed. ROCm ^^^^ -ROCm is an open software platform allowing researchers to tap the power of AMD accelerators. -The ROCm platform is built on the foundation of open portability, supporting environments across multiple -accelerator vendors and architectures. In some way it is very similar to CUDA API. -It contains libraries, compilers, and development tools for programming and optimizing programs for AMD GPUs. +ROCm is an open software platform allowing researchers to tap the power of AMD accelerators. +The ROCm platform is built on the foundation of open portability, supporting environments across multiple +accelerator vendors and architectures. In some way it is very similar to CUDA API. +It contains libraries, compilers, and development tools for programming and optimizing programs for AMD GPUs. For debugging, it provides the command line tool ``rocgdb``, while for performance analysis ``rocprof`` and ``roctracer``. -In order to produce code for the AMD GPUs, one can use the Heterogeneous-Computing Interface for Portability (HIP). -HIP is a C++ runtime API and a set of tools that allows developers to write portable GPU-accelerated code for both NVIDIA and AMD platforms. -It provides the ``hipcc`` compiler driver, which will call the appropriate toolchain depending on the desired platform. -On the AMD ROCm platform, HIP provides a header and runtime library built on top of the HIP-Clang (ROCm compiler). -On an NVIDIA platform, HIP provides a header file which translates from the HIP runtime APIs to CUDA runtime APIs. -The header file contains mostly inlined functions and thus has very low overhead. +In order to produce code for the AMD GPUs, one can use the Heterogeneous-Computing Interface for Portability (HIP). +HIP is a C++ runtime API and a set of tools that allows developers to write portable GPU-accelerated code for both NVIDIA and AMD platforms. +It provides the ``hipcc`` compiler driver, which will call the appropriate toolchain depending on the desired platform. +On the AMD ROCm platform, HIP provides a header and runtime library built on top of the HIP-Clang (ROCm compiler). +On an NVIDIA platform, HIP provides a header file which translates from the HIP runtime APIs to CUDA runtime APIs. +The header file contains mostly inlined functions and thus has very low overhead. The code is then compiled with ``nvcc``, the standard C++ compiler provided with CUDA. -On AMD platforms, libraries are prefixed by ``roc``, which can be called directly from HIP. In order to make portable calls, -one can call the libraries using ``hip``-prefixed wrappers. These wrappers can be used at no performance cost and ensure that +On AMD platforms, libraries are prefixed by ``roc``, which can be called directly from HIP. In order to make portable calls, +one can call the libraries using ``hip``-prefixed wrappers. These wrappers can be used at no performance cost and ensure that HIP code can be used on other platforms with no changes. 
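+
+As a rough sketch of what this source portability looks like in practice (the kernel and the small
+helper function below are illustrative only, not taken from the ROCm documentation), the
+allocate-copy-launch pattern shown earlier for CUDA maps almost one-to-one onto the HIP runtime,
+and the same source can be compiled with ``hipcc`` for either an AMD or an NVIDIA backend:
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   __global__ void scale(float *x, float a, int n)
+   {
+       int i = blockIdx.x * blockDim.x + threadIdx.x;
+       if (i < n)
+           x[i] = a * x[i];
+   }
+
+   // h is assumed to point to n floats allocated on the host by the caller.
+   void scale_on_gpu(float *h, int n)
+   {
+       float *d = nullptr;
+       hipMalloc((void **)&d, n * sizeof(float));                    // cudaMalloc  -> hipMalloc
+       hipMemcpy(d, h, n * sizeof(float), hipMemcpyHostToDevice);    // cudaMemcpy  -> hipMemcpy
+       scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  // same launch syntax under hipcc
+       hipDeviceSynchronize();
+       hipMemcpy(h, d, n * sizeof(float), hipMemcpyDeviceToHost);
+       hipFree(d);
+   }
+
+Library calls follow the same naming idea: for example, a ``hipblas``-prefixed call can dispatch to
+rocBLAS on AMD hardware or to cuBLAS on NVIDIA hardware.
+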
Libraries included in the ROCm, are almost one-to-one equivalent to the ones supplied with CUDA. ROCm also integrates with popular machine learning frameworks such as TensorFlow and PyTorch and provides optimized libraries and drivers to accelerate machine learning workloads on AMD GPUs enabling the researchers to leverage the power of ROCm and AMD accelerators to train and deploy machine learning models efficiently. @@ -214,9 +214,9 @@ oneAPI **Intel oneAPI** is a unified software toolkit developed by Intel that allows developers to optimize and deploy applications across a variety of architectures, including CPUs, GPUs, and FPGAs. It provides a comprehensive set of tools, libraries, and frameworks, enabling developers to leverage the full potential of heterogeneous computing environments. With oneAPI, the developers can write code once and deploy it across different hardware targets without the need for significant modifications or rewriting. This approach promotes code reusability, productivity, and performance portability, as it abstracts the complexities of heterogeneous computing and provides a consistent programming interface based on open standards. The core of suite is **Intel oneAPI Base Toolkit**, a set of tools and libraries for developing high-performance, data-centric applications across diverse architectures. It features an industry-leading C++ compiler that implements SYCL, an evolution of C++ for heterogeneous computing. It includes the **Collective Communications Library**, the **Data Analytics Library**, the **Deep Neural Networks Library**, the **DPC++/C++ Compiler**, the **DPC++ Library**, the **Math Kernel Library**, the **Threading Building Blocks**, debugging tool **Intel Distribution for GDB**, performance analisis tools **Intel Adviser** and **Intel Vtune Profiler**, the **Video Processing Library**, **Intel Distribution for Python**, the **DPC++ Compatibility Tool**, the **FPGA Add-on for oneAPI Base Toolkit**, the **Integrated Performance Primitives**. -This can be complemented with additional toolkits. The **Intel oneAPI HPC Toolkit** contains **DPC++/C++ Compiler**, **Fortran** and **C++** Compiler Classic, debugging tools **Cluster Checker** and **Inspector**, **Intel MPI Library**, and performance analysis tool **Intel Trace Analyzer and Collector**. +This can be complemented with additional toolkits. The **Intel oneAPI HPC Toolkit** contains **DPC++/C++ Compiler**, **Fortran** and **C++** Compiler Classic, debugging tools **Cluster Checker** and **Inspector**, **Intel MPI Library**, and performance analysis tool **Intel Trace Analyzer and Collector**. -oneAPI supports multiple programming models and programming languages. It enables developers to write **OpenMP** codes targeting multi-core CPUs and Intel GPUs using the Classic Fortran and C++ compilers and as well **SYCL** programs for GPUs and FPGAs using the **DPC++** compiler. Initially, the **DPC++** compiler only targeted Intel GPUs using the **oneAPI Level Zero** low-level programming interface, but now support for NVIDIA GPUs (using CUDA) and AMD GPUs (using ROCm) has been added. +oneAPI supports multiple programming models and programming languages. It enables developers to write **OpenMP** codes targeting multi-core CPUs and Intel GPUs using the Classic Fortran and C++ compilers and as well **SYCL** programs for GPUs and FPGAs using the **DPC++** compiler. 
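+
+As a minimal sketch of the single-source SYCL model (written in the SYCL 2020 buffer-and-accessor
+style; the array size and the scaling operation are purely illustrative), a simple offloaded loop
+could look like this:
+
+.. code-block:: cpp
+
+   #include <sycl/sycl.hpp>
+   #include <vector>
+
+   int main()
+   {
+       std::vector<float> h(1024, 1.0f);
+       sycl::queue q;                              // default selector: CPU, GPU, or other device
+       {
+           sycl::buffer<float, 1> buf(h.data(), sycl::range<1>(h.size()));
+           q.submit([&](sycl::handler &cgh) {
+               sycl::accessor a(buf, cgh, sycl::read_write);
+               cgh.parallel_for(sycl::range<1>(h.size()),
+                                [=](sycl::id<1> i) { a[i] = 2.0f * a[i]; });
+           });
+       }   // buffer goes out of scope here: results are copied back to the host vector
+       return 0;
+   }
+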
Initially, the **DPC++** compiler only targeted Intel GPUs using the **oneAPI Level Zero** low-level programming interface, but now support for NVIDIA GPUs (using CUDA) and AMD GPUs (using ROCm) has been added. Overall, Intel oneAPI offers a comprehensive and unified approach to heterogeneous computing, empowering developers to optimize and deploy applications across different architectures with ease. By abstracting the complexities and providing a consistent programming interface, oneAPI promotes code reusability, productivity, and performance portability, making it an invaluable toolkit for developers in the era of diverse computing platforms. @@ -224,13 +224,13 @@ Overall, Intel oneAPI offers a comprehensive and unified approach to heterogeneo Differences and similarities ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -GPUs in general support different features, even among the same producer. In general newer cards come with extra -features and sometimes old features are not supported anymore. It is important when compiling to create binaries -targeting the specific architecture when compiling. A binary built for a newer card will not run on older devices, -while a binary build for older devices might not run efficiently on newer architectures. In CUDA the compute -capability which is targeted is specified by the ``-arch=sm_XY``, where ``X`` specifies the major architecture and it is between 1 and 9, and ``Y`` the minor. When using HIP on NVIDIA platforms one needs to use compiling option ``--gpu-architecture=sm_XY``, while on AMD platforms ``--offload-arch=gfxabc`` ( where ``abc`` is the architecture code such as ``90a`` for the MI200 series or ``908`` for MI100 series). -Note that in the case of portable (single source) programs one would specify ``openmp`` as well as target for -compilation, enabling to run the same code on multicore CPU. +GPUs in general support different features, even among the same producer. In general newer cards come with extra +features and sometimes old features are not supported anymore. It is important when compiling to create binaries +targeting the specific architecture when compiling. A binary built for a newer card will not run on older devices, +while a binary build for older devices might not run efficiently on newer architectures. In CUDA the compute +capability which is targeted is specified by the ``-arch=sm_XY``, where ``X`` specifies the major architecture and it is between 1 and 9, and ``Y`` the minor. When using HIP on NVIDIA platforms one needs to use compiling option ``--gpu-architecture=sm_XY``, while on AMD platforms ``--offload-arch=gfxabc`` ( where ``abc`` is the architecture code such as ``90a`` for the MI200 series or ``908`` for MI100 series). +Note that in the case of portable (single source) programs one would specify ``openmp`` as well as target for +compilation, enabling to run the same code on multicore CPU. @@ -282,9 +282,9 @@ Exercises .. solution:: - The correct answer is B). This is true because GPUs run many threads simultaneously on thousands of - cores, and with limited cache available, this can lead to the GPU running out of memory quickly if many - cores are trying to access the memory simultaneously. This is why data management and access patterns + The correct answer is B). This is true because GPUs run many threads simultaneously on thousands of + cores, and with limited cache available, this can lead to the GPU running out of memory quickly if many + cores are trying to access the memory simultaneously. 
This is why data management and access patterns are essential in GPU computing. .. keypoints:: diff --git a/content/3-gpu-problems.rst b/content/3-gpu-problems.rst index d40e0fc..1f1cefa 100644 --- a/content/3-gpu-problems.rst +++ b/content/3-gpu-problems.rst @@ -25,15 +25,15 @@ What are GPUs good for Answer from `Stack Exchange `__: - *From a metaphorical point of view, the GPU can be seen as a person lying on a bed - of nails. The person lying on top is the data and in the base of each nail there - is a processor, so the nail is actually an arrow pointing from processor to memory. - All nails are in a regular pattern, like a grid. If the body is well spread, - it feels good (performance is good), if the body only touches some spots of the + *From a metaphorical point of view, the GPU can be seen as a person lying on a bed + of nails. The person lying on top is the data and in the base of each nail there + is a processor, so the nail is actually an arrow pointing from processor to memory. + All nails are in a regular pattern, like a grid. If the body is well spread, + it feels good (performance is good), if the body only touches some spots of the nail bed, then the pain is bad (bad performance).* -GPU computing is well-suited to problems that involve large amounts of data parallelism. +GPU computing is well-suited to problems that involve large amounts of data parallelism. Specifically, you can expect good performance on GPUs for: - **Large-scale matrix and vector operations**: Common in machine learning, scientific computing, and image processing. @@ -48,44 +48,44 @@ Specifically, you can expect good performance on GPUs for: What are GPUs not good for -------------------------- -Not all programming problems can efficiently leverage the parallelism offered by GPUs. +Not all programming problems can efficiently leverage the parallelism offered by GPUs. Some types of problems that do not fit well on a GPU include: -- **Sequential tasks**: Problems that require a series of dependent steps, - where each step relies on the outcome of the previous step, are not well-suited - for parallel processing. Examples include recursive algorithms, certain dynamic +- **Sequential tasks**: Problems that require a series of dependent steps, + where each step relies on the outcome of the previous step, are not well-suited + for parallel processing. Examples include recursive algorithms, certain dynamic programming problems, and some graph traversal algorithms. -- **Fine-grained branching**: GPUs perform best when the code being executed across - different threads follows a similar control flow. When there is extensive - branching (i.e., many ``if`` statements) within a kernel or algorithm, performance +- **Fine-grained branching**: GPUs perform best when the code being executed across + different threads follows a similar control flow. When there is extensive + branching (i.e., many ``if`` statements) within a kernel or algorithm, performance may suffer due to the divergence in execution paths among the GPU threads. -- **Low arithmetic intensity**: GPUs excel at performing a large number of mathematical - operations quickly. If a problem has low arithmetic intensity (i.e., a low ratio of - arithmetic operations to memory accesses), the GPU may not be able to efficiently utilize +- **Low arithmetic intensity**: GPUs excel at performing a large number of mathematical + operations quickly. 
If a problem has low arithmetic intensity (i.e., a low ratio of + arithmetic operations to memory accesses), the GPU may not be able to efficiently utilize its computational power, leading to underperformance. -- **Small data sets**: If the problem involves a small data set that does not require significant - parallelism, using a GPU may not result in noticeable performance gains. In such cases, - the overhead of transferring data between the CPU and GPU, and the time spent initializing the GPU, +- **Small data sets**: If the problem involves a small data set that does not require significant + parallelism, using a GPU may not result in noticeable performance gains. In such cases, + the overhead of transferring data between the CPU and GPU, and the time spent initializing the GPU, may outweigh any potential benefits. -- **Limited parallelism**: Some algorithms have inherent limitations on the degree of parallelism that can be +- **Limited parallelism**: Some algorithms have inherent limitations on the degree of parallelism that can be achieved. In these cases, using a GPU may not lead to significant performance improvements. -- **Memory-bound problems**: GPUs generally have less memory available compared to CPUs, and their memory bandwidth - can be a limiting factor. If a problem requires a large amount of memory or involves memory-intensive operations, +- **Memory-bound problems**: GPUs generally have less memory available compared to CPUs, and their memory bandwidth + can be a limiting factor. If a problem requires a large amount of memory or involves memory-intensive operations, it may not be well-suited for a GPU. Examples of GPU acceleration ---------------------------- -To give a flavor of what type of performance gains we can achieve by porting a calculations to a GPU +To give a flavor of what type of performance gains we can achieve by porting a calculations to a GPU (if we're lucky!), let's look at a few case examples. .. discussion:: Effect of array size - + Consider the case of matrix multiplication in the Julia language: .. code-block:: julia @@ -103,11 +103,11 @@ To give a flavor of what type of performance gains we can achieve by porting a c @btime begin $A_d * $A_d; AMDGPU.synchronize() - end + end end - - How much faster do you think the GPU version is compared to running on a single CPU core? + - How much faster do you think the GPU version is compared to running on a single CPU core? - Julia automatically parallelises matrix multiplication over available CPU cores. Will the GPU version be faster than running on 64 cores? - Does the size of the array affect how much the performance improves? @@ -118,7 +118,7 @@ To give a flavor of what type of performance gains we can achieve by porting a c .. list-table:: GPU acceleration for matrix multiply in Julia :widths: 25 25 25 25 25 :header-rows: 1 - + * - Matrix size - 1 CPU core - 64 CPU cores @@ -140,7 +140,7 @@ To give a flavor of what type of performance gains we can achieve by porting a c - 866.348 μs - ~400x / ~35x * - (4096, 4096) - - 3.221 s + - 3.221 s - 159.563 ms - 5.910 ms - ~550x / ~27x @@ -150,12 +150,12 @@ Electronic structure calculations VASP is a popular software package used for electronic structure calculations. The figures below show the speedup observed in a recent benchmark study on the Perlmutter and Cori supercomputers, along with an analysis of total energy usage. -.. figure:: img/problems/vasp_gpu.png +.. figure:: img/problems/vasp-gpu.png :align: center VASP GPU speedup for benchmark Si128 acfdtr. 
The horizontal axis shows the number of nodes, and the vertical axis shows the GPU speedup of VASP (Time(CPU)/Time(GPU)). (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs). -.. figure:: img/problems/vasp_energy.png +.. figure:: img/problems/vasp-energy.png :align: center Total energy usage comparison when running VASP on Perlmutter and Cori. The vertical axis shows the energy used by VASP benchmark jobs on Perlmutter GPUs (blue bars), CPUs (red bars), Cori KNL (yellow bars), and Cori Haswell (green bars) in ratio to the Cori Haswell usage. (Recent unpublished benchmarks of VASP on NVIDIA A100 GPUs) @@ -169,16 +169,16 @@ Computational Chemistry A great deal of computational resources are spent in Quantum Chemical calculations which involve the solution of the Hartree-Fock eigenvalue problem, which requires the diagonalization of the Fock matrix whose elements are given by: - + .. math:: F_{\alpha \beta} = H^{\textrm{core}}_{\alpha \beta} + \sum_{\gamma \delta}D_{\gamma \delta} \left [ (\alpha \beta|\gamma \delta) - \frac{1}{2} (\alpha \delta|\gamma \beta) \right ], -The first term is related to the one electron contributions and the second term is related to the -electron repulsion integrals (ERIs), in parenthesis, weighted by the by the density matrix -:math:`D_{\gamma \delta}`. One of the most expensive parts in the solution of the Hartree-Fock equations is the +The first term is related to the one electron contributions and the second term is related to the +electron repulsion integrals (ERIs), in parenthesis, weighted by the by the density matrix +:math:`D_{\gamma \delta}`. One of the most expensive parts in the solution of the Hartree-Fock equations is the processing (digestion) of the ERIs, one algorithm to do this task is as follows: -.. figure:: img/concepts/algorithms.svg +.. figure:: img/problems/hartree-fock-algorithm.png :width: 200 :align: center @@ -186,35 +186,35 @@ processing (digestion) of the ERIs, one algorithm to do this task is as follows: This algorithm is suitable for GPUs as it involves many arithmetic operations. In addition to this, there are symmetries and properties of the integrals that could be used to rearrange the loops in -an efficient manner that fit GPU architectures. +an efficient manner that fit GPU architectures. Humanities ^^^^^^^^^^ -A brief introduction into some of the work that is being done in the humanities that can benefit from utilizing GPUs. +A brief introduction into some of the work that is being done in the humanities that can benefit from utilizing GPUs. **Language models and NLP (natural language processing)** -With the recent popularity of ChatGPT, the use of language models has come into the mainstream, -however such models have been used in the humanities many years already. One of the biggest goals of humanities -researchers is working with textual data which has increased exponentially over recent years due to the rise in -social media. Analyzing such textual data to gain insights into questions of sociology, linguistics and various -other fields have become increasingly reliant on using language models. Along with language models, +With the recent popularity of ChatGPT, the use of language models has come into the mainstream, +however such models have been used in the humanities many years already. One of the biggest goals of humanities +researchers is working with textual data which has increased exponentially over recent years due to the rise in +social media. 
Analyzing such textual data to gain insights into questions of sociology, linguistics and various +other fields have become increasingly reliant on using language models. Along with language models, the need for GPU access has become essential. **Archeology** -The field of archeology also makes use of GPUs in their 3D modelling -and rendering work. The biggest problem with archeological sites is that once they are excavated, -they are destroyed, so any researchers who aren't present at the site, would lose valuable insights into how -it looked when it was found. However, with recent developments in technology and accessibility to high-performance -computing, they are able to generate extremely detailed renderings of the excavation sites which act as a way to -preserve the site for future researchers to gain critical insights and contribute to the research. +The field of archeology also makes use of GPUs in their 3D modelling +and rendering work. The biggest problem with archeological sites is that once they are excavated, +they are destroyed, so any researchers who aren't present at the site, would lose valuable insights into how +it looked when it was found. However, with recent developments in technology and accessibility to high-performance +computing, they are able to generate extremely detailed renderings of the excavation sites which act as a way to +preserve the site for future researchers to gain critical insights and contribute to the research. **Cognitive Science** -Techniques such as Markov Chain Monte Carlo (MCMC) sampling have proven to be invaluable in studies that delve into human behavior or population dynamics. MCMC sampling allows researchers to simulate and analyze complex systems by iteratively sampling from a Markov chain, enabling the exploration of high-dimensional parameter spaces. This method is particularly useful when studying human behavior, as it can capture the inherent randomness and interdependencies that characterize social systems. By leveraging MCMC sampling, researchers can gain insights into various aspects of human behavior, such as decision-making, social interactions, and the spread of information or diseases within populations. +Techniques such as Markov Chain Monte Carlo (MCMC) sampling have proven to be invaluable in studies that delve into human behavior or population dynamics. MCMC sampling allows researchers to simulate and analyze complex systems by iteratively sampling from a Markov chain, enabling the exploration of high-dimensional parameter spaces. This method is particularly useful when studying human behavior, as it can capture the inherent randomness and interdependencies that characterize social systems. By leveraging MCMC sampling, researchers can gain insights into various aspects of human behavior, such as decision-making, social interactions, and the spread of information or diseases within populations. By offloading the computational workload to GPUs, researchers can experience substantial speedup in the execution of MCMC algorithms. This speedup allows for more extensive exploration of parameter spaces and facilitates the analysis of larger datasets, leading to more accurate and detailed insights into human behavior or population dynamics. Examples of studies done using these methods can be found at the `Center for Humanities Computing Aarhus `__ (CHCAA) and `Interacting Minds Centre `__ (IMC) at Aarhus University. @@ -238,12 +238,12 @@ Exercises .. solution:: - The right answer is option 3. 
GPUs do not handle recursion and branching as effectively as more + The right answer is option 3. GPUs do not handle recursion and branching as effectively as more data-heavy algorithms. .. keypoints:: - - GPUs excel in processing tasks with high data parallelism, such as large-scale matrix operations, Fourier transforms, and big data analytics. + - GPUs excel in processing tasks with high data parallelism, such as large-scale matrix operations, Fourier transforms, and big data analytics. - GPUs struggle with sequential tasks, problems with extensive control flow divergence, low arithmetic intensity tasks, small data sets, and memory-bound problems. diff --git a/content/4-gpu-concepts.rst b/content/4-gpu-concepts.rst index 5b2415f..d097116 100644 --- a/content/4-gpu-concepts.rst +++ b/content/4-gpu-concepts.rst @@ -13,7 +13,7 @@ GPU programming concepts .. objectives:: - Understand parallel computing principles and architectures. - - Differentiate data parallelism from task parallelism. + - Differentiate data parallelism from task parallelism. - Learn the GPU execution model. - Parallelize and execute work on GPUs. - Develop efficient GPU code for high performance. @@ -30,47 +30,47 @@ Different types of parallelism Distributed- vs. Shared-Memory Architecture ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Most of computing problems are not trivially parallelizable, which means that the subtasks -need to have access from time to time to some of the results computed by other subtasks. +Most of computing problems are not trivially parallelizable, which means that the subtasks +need to have access from time to time to some of the results computed by other subtasks. The way subtasks exchange needed information depends on the available hardware. -.. figure:: img/history/distributed_vs_shared.png +.. figure:: img/concepts/distributed-vs-shared.png :align: center - + Distributed- vs shared-memory parallel computing. -In a distributed memory environment each computing unit operates independently from the -others. It has its own memory and it **cannot** access the memory in other nodes. -The communication is done via network and each computing unit runs a separate copy of the -operating system. In a shared memory machine all computing units have access to the memory +In a distributed memory environment each computing unit operates independently from the +others. It has its own memory and it **cannot** access the memory in other nodes. +The communication is done via network and each computing unit runs a separate copy of the +operating system. In a shared memory machine all computing units have access to the memory and can read or modify the variables within. Processes and threads ~~~~~~~~~~~~~~~~~~~~~ -The type of environment (distributed- or shared-memory) determines the programming model. -There are two types of parallelism possible, process based and thread based. +The type of environment (distributed- or shared-memory) determines the programming model. +There are two types of parallelism possible, process based and thread based. .. figure:: img/history/processes-threads.png :align: center -For distributed memory machines, a process-based parallel programming model is employed. -The processes are independent execution units which have their *own memory* address spaces. -They are created when the parallel program is started and they are only terminated at the +For distributed memory machines, a process-based parallel programming model is employed. 
+The processes are independent execution units which have their *own memory* address spaces. +They are created when the parallel program is started and they are only terminated at the end. The communication between them is done explicitly via message passing like MPI. -On the shared memory architectures it is possible to use a thread based parallelism. -The threads are light execution units and can be created and destroyed at a relatively -small cost. The threads have their own state information but they *share* the *same memory* -address space. When needed the communication is done though the shared memory. +On the shared memory architectures it is possible to use a thread based parallelism. +The threads are light execution units and can be created and destroyed at a relatively +small cost. The threads have their own state information but they *share* the *same memory* +address space. When needed the communication is done though the shared memory. -Both approaches have their advantages and disadvantages. Distributed machines are -relatively cheap to build and they have an "infinite " capacity. In principle one could -add more and more computing units. In practice the more computing units are used the more -time consuming is the communication. The shared memory systems can achieve good performance -and the programming model is quite simple. However they are limited by the memory capacity -and by the access speed. In addition in the shared parallel model it is much easier to +Both approaches have their advantages and disadvantages. Distributed machines are +relatively cheap to build and they have an "infinite " capacity. In principle one could +add more and more computing units. In practice the more computing units are used the more +time consuming is the communication. The shared memory systems can achieve good performance +and the programming model is quite simple. However they are limited by the memory capacity +and by the access speed. In addition in the shared parallel model it is much easier to create race conditions. @@ -83,7 +83,7 @@ The units process the data by applying the same or very similar operation to dif A common example is applying a blur filter to an image --- the same function is applied to all the pixels on an image. This parallelism is natural for the GPU, where the same instruction set is executed in multiple :abbr:`threads`. -.. figure:: img/concepts/ENCCS-OpenACC-CUDA_TaskParallelism_Explanation.png +.. figure:: img/concepts/data-task-parallelism.png :align: center :scale: 40 % @@ -113,11 +113,11 @@ Note that the tasks can consume totally different resources, which also can be e - Data parallelism distributes data across computational units, processing them with the same or similar operations. - Task parallelism involves multiple independent tasks that perform different operations on the same or different data. - Task parallelism involves executing different tasks concurrently, leveraging different resources. - + GPU Execution Model ------------------- -In order to obtain maximum performance it is important to understand how GPUs execute the programs. As mentioned before a CPU is a flexible device oriented towards general purpose usage. It's fast and versatile, designed to run operating systems and various, very different types of applications. It has lots of features, such as better control logic, caches and cache coherence, that are not related to pure computing. CPUs optimize the execution by trying to achieve low latency via heavy caching and branch prediction. 
+In order to obtain maximum performance it is important to understand how GPUs execute the programs. As mentioned before a CPU is a flexible device oriented towards general purpose usage. It's fast and versatile, designed to run operating systems and various, very different types of applications. It has lots of features, such as better control logic, caches and cache coherence, that are not related to pure computing. CPUs optimize the execution by trying to achieve low latency via heavy caching and branch prediction. .. figure:: img/concepts/cpu-gpu-highway.png :align: center @@ -131,32 +131,32 @@ In contrast the GPUs contain a relatively small amount of transistors dedicated CUDA Threads, Warps, Blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~ -In order to perform some work the program launches a function called *kernel*, which is executed simultaneously by tens of thousands of :abbr:`threads` that can be run on GPU cores parallelly. GPU threads are much lighter than the usual CPU threads and they have very little penalty for context switching. By "over-subscribing" the GPU there are threads that are performing some memory operations (reading or writing), while others execute instructions. +In order to perform some work the program launches a function called *kernel*, which is executed simultaneously by tens of thousands of :abbr:`threads` that can be run on GPU cores parallelly. GPU threads are much lighter than the usual CPU threads and they have very little penalty for context switching. By "over-subscribing" the GPU there are threads that are performing some memory operations (reading or writing), while others execute instructions. -.. figure:: img/concepts/THREAD_CORE.png +.. figure:: img/concepts/thread-core.jpg :align: center :scale: 40 % -Every :abbr:`thread` is associated with a particular intrinsic index which can be used to calculate and access memory locations in an array. Each thread has its context and set of private variables. All threads have access to the global GPU memory, but there is no general way to synchronize when executing a kernel. If some threads need data from the global memory which was modified by other threads the code would have to be splitted in several kernels because only at the completion of a kernel it is ensured that the writing to the global memory was completed. +Every :abbr:`thread` is associated with a particular intrinsic index which can be used to calculate and access memory locations in an array. Each thread has its context and set of private variables. All threads have access to the global GPU memory, but there is no general way to synchronize when executing a kernel. If some threads need data from the global memory which was modified by other threads the code would have to be splitted in several kernels because only at the completion of a kernel it is ensured that the writing to the global memory was completed. -Apart from being much light weighted there are more differences between GPU threads and CPU threads. GPU :abbr:`threads` are grouped together in groups called :abbr:`warps`. This done at hardware level. +Apart from being much light weighted there are more differences between GPU threads and CPU threads. GPU :abbr:`threads` are grouped together in groups called :abbr:`warps`. This done at hardware level. -.. figure:: img/concepts/WARP_SMTU.png +.. figure:: img/concepts/warp-simt.jpg :align: center :scale: 40 % - - + + All memory accesses to the GPU memory are as a group in blocks of specific sizes (32B, 64B, 128B etc.). 
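+
+The two hypothetical copy kernels below differ only in how the intrinsic indices are combined into
+a global index; as explained next, this choice determines whether the threads of a warp touch
+neighbouring or widely separated memory locations, and therefore how many of these memory
+transactions are needed:
+
+.. code-block:: cpp
+
+   // Consecutive threads read consecutive elements: a warp covers one contiguous chunk of memory.
+   __global__ void copy_coalesced(const float *in, float *out, int n)
+   {
+       int i = blockIdx.x * blockDim.x + threadIdx.x;
+       if (i < n)
+           out[i] = in[i];
+   }
+
+   // Consecutive threads read elements far apart: a warp touches many separate memory blocks.
+   __global__ void copy_strided(const float *in, float *out, int n, int stride)
+   {
+       int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
+       if (i < n)
+           out[i] = in[i];
+   }
+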
To obtain good performance the CUDA threads in the same warp need to access elements of the data which are adjacent in the memory. This is called *coalesced* memory access. -On some architectures, all members of a :abbr:`warp` have to execute the -same instruction, the so-called "lock-step" execution. This is done to achieve -higher performance, but there are some drawbacks. If an **if** statement -is present inside a :abbr:`warp` will cause the warp to be executed more than once, +On some architectures, all members of a :abbr:`warp` have to execute the +same instruction, the so-called "lock-step" execution. This is done to achieve +higher performance, but there are some drawbacks. If an **if** statement +is present inside a :abbr:`warp` will cause the warp to be executed more than once, one time for each branch. When different threads within a single :abbr:`warp` take different execution paths based on a conditional statement (if), both branches are executed sequentially, with some threads being active while -others are inactive. On architectures without lock-step execution, such +others are inactive. On architectures without lock-step execution, such as NVIDIA Volta / Turing (e.g., GeForce 16xx-series) or newer, :abbr:`warp` divergence is less costly. @@ -164,7 +164,7 @@ There is another level in the GPU :abbr:`threads` hierarchy. The :abbr:`threads` -.. figure:: img/concepts/BLOCK_SMP.png +.. figure:: img/concepts/block-smp.jpg :align: center :scale: 40 % @@ -173,7 +173,7 @@ There is another level in the GPU :abbr:`threads` hierarchy. The :abbr:`threads` Finally, a block of threads can not be splitted among SMPs. For performance blocks should have more than one :abbr:`warp`. The more warps are active on an SMP the better is hidden the latency associated with the memory operations. If the resources are sufficient, due to fast context switching, an SMP can have more than one block active in the same time. However these blocks can not share data with each other via the on-chip memory. -To summarize this section. In order to take advantage of GPUs the algorithms must allow the division of work in many small subtasks which can be executed in the same time. The computations are offloaded to GPUs, by launching tens of thousands of threads all executing the same function, *kernel*, each thread working on different part of the problem. The threads are executed in groups called *blocks*, each block being assigned to a SMP. Furthermore the threads of a block are divided in *warps*, each executed by SIMT unit. All threads in a warp execute the same instructions and all memory accesses are done collectively at warp level. The threads can synchronize and share data only at block level. Depending on the architecture, some data sharing can be done as well at warp level. +To summarize this section. In order to take advantage of GPUs the algorithms must allow the division of work in many small subtasks which can be executed in the same time. The computations are offloaded to GPUs, by launching tens of thousands of threads all executing the same function, *kernel*, each thread working on different part of the problem. The threads are executed in groups called *blocks*, each block being assigned to a SMP. Furthermore the threads of a block are divided in *warps*, each executed by SIMT unit. All threads in a warp execute the same instructions and all memory accesses are done collectively at warp level. The threads can synchronize and share data only at block level. 
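+
+As a minimal sketch of this block-level cooperation (assuming a block size of 256 threads, a power
+of two; combining the per-block results afterwards is left out), each block can load its elements
+into shared memory, synchronize, and reduce them to a single partial sum:
+
+.. code-block:: cpp
+
+   __global__ void block_sum(const float *in, float *partial, int n)
+   {
+       __shared__ float s[256];                  // shared memory, visible to this block only
+       int tid = threadIdx.x;
+       int i   = blockIdx.x * blockDim.x + tid;
+
+       s[tid] = (i < n) ? in[i] : 0.0f;
+       __syncthreads();                          // block-level synchronization
+
+       for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
+       {
+           if (tid < stride)
+               s[tid] += s[tid + stride];
+           __syncthreads();
+       }
+
+       if (tid == 0)
+           partial[blockIdx.x] = s[0];           // one partial result per block
+   }
+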
Depending on the architecture, some data sharing can be done as well at warp level. In order to hide latencies it is recommended to "over-subscribe" the GPU. There should be many more blocks than SMPs present on the device. Also in order to ensure a good occupancy of the CUDA cores there should be more warps active on a given SMP than SIMT units. This way while some warps of threads are idle waiting for some memory operations to complete, others use the CUDA cores, thus ensuring a high occupancy of the GPU. @@ -183,7 +183,7 @@ Below there is an example of how the threads in a grid can be associated with sp -.. figure:: img/concepts/Indexing.png +.. figure:: img/concepts/indexing.png :align: center :scale: 40 % @@ -239,9 +239,9 @@ Software | blockDim.\{x,y,z\} | get_local_size(\{0,1,2\}) | nd_item::get_local_range(\{2,1,0\}) [#syclindex]_ | +-------------------------+-------------------------+---------------------------+---------------------------------------------------+ -.. [#syclindex] In SYCL, the thread indexing is inverted. In a 3D grid, physically adjacent threads have consecutive X (0) index in CUDA, HIP, and OpenCL, but consecutive Z (2) index in SYCL. +.. [#syclindex] In SYCL, the thread indexing is inverted. In a 3D grid, physically adjacent threads have consecutive X (0) index in CUDA, HIP, and OpenCL, but consecutive Z (2) index in SYCL. In a 2D grid, CUDA, HIP, and OpenCL still has contiguous indexing along X (0) dimension, while in SYCL it is Y (1). - Same applies to block dimensions and indexing. + Same applies to block dimensions and indexing. Exercises @@ -277,7 +277,7 @@ Exercises d) Neither data nor task parallelism .. solution:: - + Correct answer: *b) Data Parallelism* .. challenge:: What is a kernel in the context of GPU execution? @@ -285,9 +285,9 @@ Exercises a) A specific section of the CPU used for memory operations. b) A specific section of the GPU used for memory operations. c) A type of thread that operates on the GPU. - d) A function that is executed simultaneously by tens of thousands of threads on GPU cores. + d) A function that is executed simultaneously by tens of thousands of threads on GPU cores. - .. solution:: + .. solution:: Correct answer: *d) A function that is executed simultaneously by tens of thousands of threads on GPU cores.* diff --git a/content/conf.py b/content/conf.py index 53d1379..ec727c3 100644 --- a/content/conf.py +++ b/content/conf.py @@ -17,11 +17,11 @@ # -- Project information ----------------------------------------------------- -project = "GPU programming: why, when and how?" 
+project = "Introduction to GPU Programming" copyright = "2023, The contributors" author = "The contributors" github_user = "ENCCS" -github_repo_name = "gpu-programming" # auto-detected from dirname if blank +github_repo_name = "intro-gpu-programming" # auto-detected from dirname if blank github_version = "main" conf_py_path = "/content/" # with leading and trailing slash @@ -174,9 +174,9 @@ def setup(app): app.add_directive(obj.cssname(), obj) app.add_role('abbr', AutoAbbreviation(), override=True) - + import os if os.environ.get('GITHUB_REF', '') == 'refs/heads/main': html_js_files = [ ('https://plausible.io/js/script.js', {"data-domain": "enccs.github.io/gpu-programming", "defer": "defer"}), - ] + ] diff --git a/content/img/concepts/block-smp.jpg b/content/img/concepts/block-smp.jpg new file mode 100644 index 0000000..23c81ba Binary files /dev/null and b/content/img/concepts/block-smp.jpg differ diff --git a/content/img/concepts/cpu-gpu-highway.png b/content/img/concepts/cpu-gpu-highway.png new file mode 100644 index 0000000..1c45869 Binary files /dev/null and b/content/img/concepts/cpu-gpu-highway.png differ diff --git a/content/img/concepts/data-task-parallelism.png b/content/img/concepts/data-task-parallelism.png new file mode 100644 index 0000000..769c854 Binary files /dev/null and b/content/img/concepts/data-task-parallelism.png differ diff --git a/content/img/concepts/distributed-vs-shared.png b/content/img/concepts/distributed-vs-shared.png new file mode 100644 index 0000000..e47c420 Binary files /dev/null and b/content/img/concepts/distributed-vs-shared.png differ diff --git a/content/img/concepts/indexing.png b/content/img/concepts/indexing.png new file mode 100644 index 0000000..6ddcb5d Binary files /dev/null and b/content/img/concepts/indexing.png differ diff --git a/content/img/concepts/processes-threads.png b/content/img/concepts/processes-threads.png new file mode 100644 index 0000000..58ef86c Binary files /dev/null and b/content/img/concepts/processes-threads.png differ diff --git a/content/img/concepts/thread-core.jpg b/content/img/concepts/thread-core.jpg new file mode 100644 index 0000000..eb70275 Binary files /dev/null and b/content/img/concepts/thread-core.jpg differ diff --git a/content/img/concepts/warp-simt.jpg b/content/img/concepts/warp-simt.jpg new file mode 100644 index 0000000..da06ce7 Binary files /dev/null and b/content/img/concepts/warp-simt.jpg differ diff --git a/content/img/hardware/cpu-gpu.png b/content/img/hardware/cpu-gpu.png new file mode 100644 index 0000000..d1559af Binary files /dev/null and b/content/img/hardware/cpu-gpu.png differ diff --git a/content/img/history/CPU_and_GPU_separated.png b/content/img/history/CPU_and_GPU_separated.png new file mode 100644 index 0000000..473dafd Binary files /dev/null and b/content/img/history/CPU_and_GPU_separated.png differ diff --git a/content/img/history/microprocessor-trend-data.png b/content/img/history/microprocessor-trend-data.png new file mode 100644 index 0000000..1243c6d Binary files /dev/null and b/content/img/history/microprocessor-trend-data.png differ diff --git a/content/img/history/parallel-computing.png b/content/img/history/parallel-computing.png new file mode 100644 index 0000000..dbda9ad Binary files /dev/null and b/content/img/history/parallel-computing.png differ diff --git a/content/img/history/top-5.png b/content/img/history/top-5.png new file mode 100644 index 0000000..b2244ef Binary files /dev/null and b/content/img/history/top-5.png differ diff --git 
a/content/img/problems/hartree-fock-algorithm.png b/content/img/problems/hartree-fock-algorithm.png new file mode 100644 index 0000000..008d251 Binary files /dev/null and b/content/img/problems/hartree-fock-algorithm.png differ diff --git a/content/img/problems/vasp-energy.png b/content/img/problems/vasp-energy.png new file mode 100644 index 0000000..afaf7f5 Binary files /dev/null and b/content/img/problems/vasp-energy.png differ diff --git a/content/img/problems/vasp-gpu.png b/content/img/problems/vasp-gpu.png new file mode 100644 index 0000000..b812824 Binary files /dev/null and b/content/img/problems/vasp-gpu.png differ diff --git a/content/index.rst b/content/index.rst index 0110d38..a7047af 100644 --- a/content/index.rst +++ b/content/index.rst @@ -51,9 +51,23 @@ mentioned above to a level which will enable you to quickly become a productive .. toctree:: :maxdepth: 1 - :caption: Prerequisites and Instructor’s guide + :caption: Prerequisites 0-setup + +.. toctree:: + :maxdepth: 1 + :caption: The lesson + + 1-gpu-history + 2-gpu-ecosystem + 3-gpu-problems + 4-gpu-concepts + +.. toctree:: + :maxdepth: 1 + :caption: Instructor’s guide + guide .. toctree::