
Architecture Background


To effectively use Spatter, users will need some familiarity with a few important computer architecture and organization topics, including but not limited to:

  1. What does the memory organization of a modern CPU and GPU look like?
    • Caches, on-board High Bandwidth Memories (HBM) and DRAM
  2. How do caches work and why are they different for CPU and GPU platforms?
  3. What does a CPU prefetcher do and how does it improve performance?
  4. What do we mean when we say memory accesses are coalesced, and how does this affect gather/scatter operations? (See the short sketch after this list.)
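
To make the gather/scatter terminology concrete, here is a minimal C sketch of the two access patterns that Spatter benchmarks. This is illustrative only, not Spatter's actual code; the array and index names are made up:

```c
#include <stddef.h>

/* Gather: read from scattered locations in sparse, selected by idx,
 * into a dense (contiguous) destination buffer. */
void gather(double *dense, const double *sparse, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dense[i] = sparse[idx[i]];
}

/* Scatter: write a dense buffer out to scattered locations in sparse,
 * again selected by idx. */
void scatter(double *sparse, const double *dense, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        sparse[idx[i]] = dense[i];
}
```

How well these loops perform depends heavily on the pattern stored in idx: a unit-stride or small-stride pattern keeps accesses within a few cache lines, while a random pattern touches many different cache lines and DRAM pages, which is exactly what the topics below help explain.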

Memory Organization of a Modern CPU

CPUs have larger, multi-level caches and rely on larger amounts of DRAM.

Block diagram of basic CPU memory system

CPU → L1 → L2 → (usually shared) L3 → DRAM

Memory Organization of a Modern GPU

GPUs have smaller caches and tend to have stacked HBMs.

GPU SM (core equivalent) → L1 → (sometimes) L2 → HBM

How Hardware Caches Work

Cache design and eviction policies, DRAM

  • Cache explained

    • Cache refers to high-speed data storage that typically represents a smaller subset of a larger set of data.
    • Cache is useful because it makes the retrieval of certain information extremely fast. The application does not need to go to primary storage every time if the data needed is found in the cache.
    • Cache allows for the continuous and efficient reuse of data.
  • CPU Cache

    • CPU cache refers to the cache memory that is used by the CPU.
    • CPU cache holds common data that the CPU will likely need to access repeatedly.
    • When the CPU needs data, it checks its cache first. If the data is not found in the cache, the CPU looks to the slower DRAM for it.
    • If the CPU can find what it needs in its cache, it will be able to perform faster.
    • Often, the CPU is so fast that it is limited by the rate at which it can retrieve data, not by its computational speed.
    • CPU cache uses SRAM (Static RAM), which is more expensive but much faster than DRAM (Dynamic RAM). When people talk about a computer's memory, they are typically referring to DRAM; they just call it RAM. Unlike DRAM, SRAM does not have to be constantly refreshed, which helps make it much faster.
    • There are three levels of CPU cache:
    • Level 1 Cache (Primary Cache) is located on the processor itself and runs at the same speed as the CPU. It is the fastest cache in the computer.
    • Level 2 Cache (External Cache) catches the data requests that the CPU could not satisfy from the Level 1 cache. If the CPU cannot find the data it needs in the Level 1 cache, it looks in the Level 2 cache.
    • Level 3 Cache catches data requests that were not satisfied by the Level 2 cache. If the data is not found in the Level 3 cache, the CPU looks to the slower DRAM for it.
    • In modern computers, all three levels of cache are located on the processor. Level 3 is referred to as shared cache because it is shared between the cores of the CPU. Each core typically has its own Level 1 and Level 2 cache.
  • Cache design and eviction policies

    • Important Cache terminology: Cache hit and Cache miss
    • A cache hit is when the data being requested is successfully retrieved from the cache.
    • A cache miss is when the requested data is not in the cache and must be retrieved from elsewhere (typically slower backing storage).
    • In general, data that is read very often but changed rarely should be stored in cache. (A short code sketch after this list illustrates the performance cost of cache misses versus hits.)
    • Cache data access strategies: Cache Aside (lazy loading), Read Through, Write Through, Write Around, Write Back
    • Cache Aside: If a cache hit occurs, the data is returned to the application that requested it. If a cache miss occurs, the application reads the data from the database, updates the cache with that new data, and then uses it.
    • Read Through: If a cache hit occurs, the data is returned to the application. If a cache miss occurs, the cache itself reads the data from the database, updates its own entry, and then returns the data.
    • Write Through: This strategy is for writing data, not reading it. When the application writes new or updated data, the write always goes to the cache first (the cache either updates an existing entry or creates a new one), and the cache then immediately writes the data through to the primary data storage.
    • Write Around: This technique is similar to Write Through, but data is written directly to permanent storage, bypassing the cache. This can reduce the cache being flooded with write operations that will not subsequently be re-read, but has the disadvantage that a read request for recently written data will create a cache miss and must be served from slower back-end storage, experiencing higher latency.
    • Write Back: Very similar to Write Through, but the primary data storage is not updated at the same time as the cache. Instead, the write completes once the cache is updated, and the data is written back to primary storage later (for example, when the entry is evicted or on a periodic schedule).
  • Eviction Policies: These are policies that keep the size of the cache in check. They help ensure that the cache does not exceed a maximum size limit. To keep the size under a certain threshold, existing elements in the cache are selectively removed. Below are some specific eviction policies for removing elements from the cache.

    • Least Recently Used (LRU): This eviction policy removes the values that were last used the longest time ago.
    • Least Frequently Used (LFU): This eviction policy removes the values that were accessed the fewest times.
    • Most Recently Used (MRU): This eviction policy removes the values that were used most recently.
    • Most Frequently Used (MFU): This eviction policy removes the values that were accessed the greatest number of times.
  • DRAM (Dynamic Random Access Memory):

    • Dynamic Random Access Memory (DRAM) is a type of memory that is used in computers and other digital electronic devices. It's called "dynamic" because it needs to be continually refreshed to retain the information stored in it. This is in contrast to Static Random Access Memory (SRAM), which does not need to be refreshed.
    • Each bit of data in a DRAM is stored as a presence or absence of electric charge in a single capacitor within an integrated circuit. The capacitor can either be charged or discharged; these two states are taken to represent the two values of a bit, conventionally called 0 and 1.
    • The electric charge in the capacitors slowly leaks away, so without intervention, the data on the chip would soon be lost. To prevent this, DRAM needs to be periodically refreshed, reading the charge levels and recharging capacitors as needed. This refresh process is the defining characteristic of dynamic random access memory.
    • DRAM is used for the main memory in computing devices because it is cheap and has a high density, meaning it can store a large amount of data in a small physical space. However, its need for constant refreshing does make it slower and less efficient than other types of memory like SRAM.
    • DRAM allows random access to any memory location, meaning that individual memory cells can be accessed in any order, without the need to sequentially read or write through the entire memory. This makes DRAM much faster than sequential storage options like magnetic tapes or optical disks.
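
As a rough demonstration of the cache-hit versus cache-miss behavior described above, the small C program below times a sequential traversal of a large array against a strided traversal that uses only one element per cache line before moving on. This is a sketch, not part of Spatter; the array size, the stride, and the use of POSIX clock_gettime are illustrative assumptions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16 Mi ints (~64 MiB), larger than a typical L3 cache */
#define STRIDE 16     /* 16 * sizeof(int) = 64 bytes, i.e. one cache line per access */

/* Sum every element of a[] exactly once, visiting them with the given stride. */
static double touch(const int *a, size_t stride)
{
    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < stride; s++)
        for (size_t i = s; i < N; i += stride)
            sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile long sink = sum;   /* keep the compiler from removing the loops */
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = (int)i;

    printf("sequential (stride 1):  %.3f s\n", touch(a, 1));
    printf("strided    (stride %d): %.3f s\n", STRIDE, touch(a, STRIDE));

    free(a);
    return 0;
}
```

Both loops perform the same number of additions, but the strided version is typically several times slower when compiled with optimizations (e.g. gcc -O2): the sequential loop gets mostly cache hits (with help from the prefetcher, discussed next), while the strided loop ends up reloading each cache line many times from L3 or DRAM.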

How Prefetchers Work

  • A prefetcher in computing is a component that helps speed up the data retrieval process. It's like a predictive tool that anticipates the data or instructions the processor (CPU) will need next and fetches it from the main memory to the cache in advance.

  • The main goal of prefetching is to hide the latency, or delay, of fetching data from the main memory. The idea is to have the data ready in the cache by the time the CPU needs it, so it can continue its work without having to wait.

  • Prefetching can be managed by hardware, software, or a combination of both. Hardware prefetchers are built into the CPU and operate automatically, while software prefetching involves using special instructions in the code to fetch data ahead of time. (A short sketch of software prefetching follows this list.)

  • While prefetching can greatly improve system performance by reducing wait times for data, it can be tricky to get right. The prefetcher needs to accurately predict which data will be needed, when it will be needed, and fetch it without using up too much memory bandwidth or cache space. If it fetches the wrong data or fetches it at the wrong time, it can actually slow things down instead of speeding them up.
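
As a simple illustration of software prefetching, the sketch below uses the __builtin_prefetch intrinsic available in GCC and Clang to request the data for a future loop iteration while the current one is being processed. This is only a sketch: the function name, the gather-style loop, and the prefetch distance of 8 iterations are illustrative assumptions, and the right distance in practice depends on the memory latency of the target machine.

```c
#include <stddef.h>

#define PREFETCH_DIST 8   /* how many iterations ahead to prefetch (machine dependent) */

/* Index-driven gather with an explicit software prefetch hint.
 * The hint asks the hardware to start loading the element we will need
 * a few iterations from now, in the hope of hiding DRAM latency. */
void gather_with_prefetch(double *dense, const double *sparse,
                          const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            /* arguments: address, 0 = prefetch for read, 1 = low temporal locality */
            __builtin_prefetch(&sparse[idx[i + PREFETCH_DIST]], 0, 1);
        dense[i] = sparse[idx[i]];
    }
}
```

Hardware prefetchers already handle simple sequential and strided patterns well, so explicit prefetch hints mainly pay off for irregular, index-driven accesses like this gather, and even then only when the index array is cheap to read ahead and the distance is tuned carefully; as noted above, a badly placed prefetch can waste bandwidth and evict useful data.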

How CPU and GPU Caches Differ

  • CPU and GPU: These are both types of "brains" in your computer. The CPU (Central Processing Unit) is like the main boss, who handles a wide variety of tasks and can switch between them quickly. The GPU (Graphics Processing Unit) is more like a specialist, who is really good at doing one type of job (like rendering video or playing games) but does it for a lot of data at once.

  • Cache: This is like a small memory bank where the CPU or GPU stores data that it needs to use often or imminently. It's faster to get data from the cache than it is to get it from the computer's main memory (RAM).

  • Cache Size and Structure: The CPU has a more complex and bigger cache because it needs to handle many different tasks and switch between them quickly. The GPU has a simpler, smaller cache because it's doing lots of similar operations all at once. It's like having a big, versatile toolbox (CPU) versus having a smaller box with lots of the same tool (GPU).

  • Cache Coherency: When different parts of the CPU are working on the same task, they need to share information. The CPU's cache is "coherent," meaning that it makes sure everyone has the same, up-to-date information. The GPU's cores usually work independently on different parts of data, so its cache doesn't have to be coherent -- each worker (core) has its own set of tools (cache).

  • Cache Usage: The CPU uses its cache to store data it will need soon to do its tasks quickly. The GPU uses its cache to store data that a lot of its cores will use at the same time, like when you're playing a game and the GPU needs to render lots of similar images.

  • So, in a nutshell, the CPU and GPU use their caches in different ways because they're designed to do different kinds of jobs: the CPU is the all-rounder, while the GPU is the specialist.

Memory Coalescing Support (GPUs, A64FX)

  • Memory coalescing is like organizing data into neat, easy-to-reach shelves. In powerful computer processors, like in GPUs and the A64FX chip, it's faster to grab data when it's lined up neatly in memory, rather than scattered around. "Memory coalescing support" means these processors have special ways to help line up the data for faster access.
  • If the data pieces are scattered around in different places in memory, the processor has to spend more time getting each piece of data separately, slowing things down. This is like having to run around the library to different shelves to get each book. So, when we talk about "memory coalescing support", it means the processor and/or its programming tools have ways to help arrange or access data so it's more like the books on one shelf, speeding up the work.
  • When a group of threads (basic units of execution in parallel processing) running on these cores tries to access memory simultaneously, it is most efficient if they access consecutive memory addresses. This is because memory is accessed in blocks or "words" of a certain size, not individual bytes. (The sketch after this list shows the difference between a coalesced access pattern and a scattered one.)
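
To make this concrete without introducing GPU code, the plain C sketch below models what the 32 threads of one GPU warp would each load in a single step. The warp size of 32 and the function and variable names are illustrative assumptions, not code from any particular GPU toolkit:

```c
#include <stddef.h>

#define WARP_SIZE 32   /* number of GPU threads that issue their loads together */

/* Coalesced: "thread" t reads data[base + t]. All 32 addresses fall in a
 * few consecutive cache lines / memory segments, so the hardware can
 * combine them into a small number of wide memory transactions. */
void warp_load_coalesced(double *out, const double *data, size_t base)
{
    for (size_t t = 0; t < WARP_SIZE; t++)
        out[t] = data[base + t];
}

/* Scattered: "thread" t reads data[idx[base + t]], a gather through an
 * index buffer. If the indices point all over memory, each of the 32
 * loads may require its own transaction, sharply reducing effective
 * bandwidth. */
void warp_load_scattered(double *out, const double *data,
                         const size_t *idx, size_t base)
{
    for (size_t t = 0; t < WARP_SIZE; t++)
        out[t] = data[idx[base + t]];
}
```

This is the distinction that Spatter's gather/scatter patterns exercise: the closer the index pattern is to unit stride, the more the accesses can be coalesced on GPUs, or handled efficiently by vector gather/scatter instructions on CPUs such as the A64FX (via SVE).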

Resources for Further Reading

Video Lectures
