1.1 Two Processors, One Problem

Every modern processor faces the same fundamental engineering constraint: the memory it can access fastest is the memory it has least of. This is not a GPU-specific problem or a CPU-specific problem. It is the memory wall — the defining constraint of computer architecture since the 1990s.

Both CPU and GPU solve it the same way: build a hierarchy of progressively slower, larger storage tiers, and create a software abstraction (virtual memory) that makes the limited fast tier appear unlimited to consumers.

This chapter maps the two hierarchies side by side — with real latency numbers — and shows why they inevitably arrive at identical software solutions.


1.2 The CPU Memory Hierarchy

┌─────────────────────────────────────────────────────────────────────┐
│ TIER          │ CAPACITY      │ LATENCY       │ MANAGED BY          │
├───────────────┼───────────────┼───────────────┼─────────────────────┤
│ Registers     │ ~1 KB         │ 0 cycles      │ Compiler / ISA      │
│ L1 Cache      │ 32–64 KB      │ ~4 cycles     │ Hardware (coherent) │
│ L2 Cache      │ 256 KB–1 MB   │ ~12 cycles    │ Hardware (coherent) │
│ L3 / LLC      │ 8–64 MB       │ ~40 cycles    │ Hardware (coherent) │
│ DRAM          │ 16–512 GB     │ ~100 ns       │ OS MM (page-based)  │
│ Swap (SSD)    │ TB-scale      │ ~100 µs       │ OS MM (swap subsys) │
│ Swap (HDD)    │ TB-scale      │ ~10 ms        │ OS MM (swap subsys) │
└───────────────┴───────────────┴───────────────┴─────────────────────┘

Key boundary: Between LLC and DRAM, hardware manages movement automatically (cache lines, inclusivity, coherence protocols). Between DRAM and swap, software manages movement — the OS kernel’s memory management subsystem.

The CPU MM’s job is precisely this: make 512 GB of DRAM appear sufficient for workloads that collectively request terabytes of address space, by paging data between DRAM and swap as needed.


1.3 The GPU Memory Hierarchy

┌─────────────────────────────────────────────────────────────────────┐
│ TIER          │ CAPACITY      │ LATENCY        │ MANAGED BY         │
├───────────────┼───────────────┼────────────────┼────────────────────┤
│ Registers     │ ~256 KB/SM    │ 0 cycles       │ Compiler / HW      │
│ Shared Mem/L1 │ 16–228 KB/SM  │ ~20 cycles     │ Programmer / HW    │
│ L2 Cache      │ 4–96 MB       │ ~200 cycles    │ Hardware           │
│ VRAM (HBM/G6X)│ 8–192 GB      │ ~400–800 cyc   │ Driver (TTM/GEM)   │
│ System Memory │ Host DRAM     │ ~1–5 µs (PCIe) │ Driver (TTM evict) │
│ Shmem/Swap    │ Disk-backed   │ ~100 µs+       │ Driver (TTM swap)  │
└───────────────┴───────────────┴────────────────┴────────────────────┘

Note: VRAM latency here means an L2-miss load from a GPU shader, not a DMA
transfer. System memory latency reflects a single GPU-initiated read crossing
the PCIe/CXL bus. Bulk DMA throughput is ~32 GB/s (PCIe 4.0 x16) to ~64 GB/s (PCIe 5.0 x16).

Key boundary: Between L2 and VRAM, GPU hardware manages movement (cache lines). Between VRAM and system memory/swap, software manages movement — the DRM/TTM subsystem in the kernel.

TTM’s job is precisely this: make 24 GB of VRAM appear sufficient for workloads that collectively allocate hundreds of gigabytes of buffer objects, by evicting data between VRAM and system memory as needed.


1.4 The Parallel, Visualized

          CPU World                          GPU World
    ┌────────────────────┐            ┌────────────────────┐
    │   "Fast Memory"    │            │   "Fast Memory"    │
    │   Physical DRAM    │            │      VRAM          │
    │  (limited, fast)   │            │  (limited, fast)   │
    └────────┬───────────┘            └────────┬───────────┘
             │ eviction/swap-out               │ eviction
             ▼                                  ▼
    ┌────────────────────┐            ┌────────────────────┐
    │  "Slow Overflow"   │            │  "Slow Overflow"   │
    │   Swap Device      │            │  System Memory     │
    │  (large, slow)     │            │  (large, slow*)    │
    └────────┬───────────┘            └────────┬───────────┘
             │                                 │ further eviction
             │                                 ▼
             │                        ┌────────────────────┐
             │                        │  "Cold Storage"    │
             │                        │  shmem/swap_storage│
             │                        │  (disk-backed)     │
             │                        └────────────────────┘
             │
    ┌────────┴───────────┐            ┌────────────────────┐
    │  Management Layer  │            │  Management Layer  │
    │  Linux MM          │            │  DRM TTM/GEM       │
    │  (kswapd, LRU,     │            │  (eviction, LRU,   │
    │   page tables)     │            │   GPU page tables) │
    └────────────────────┘            └────────────────────┘

    * "slow" = across PCIe bus, ~32–64 GB/s vs VRAM's ~1–5 TB/s

The bandwidth ratio is striking:

  • CPU: DRAM ~50–100 GB/s vs swap (NVMe) ~3–7 GB/s → ~15× gap
  • GPU: VRAM ~1–5 TB/s vs PCIe ~32–64 GB/s → ~30–80× gap

The GPU’s fast/slow bandwidth ratio is worse. A buffer that fits in VRAM runs at full memory bandwidth; once evicted to system memory, every access pays a 30–80× throughput penalty. The eviction algorithm is therefore more critical for GPU than CPU — a bad decision is punished harder.
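
The arithmetic behind these gaps can be sketched in a few lines of C. The helper names are hypothetical, and any figures passed in are the caller's assumptions; the ranges quoted above are rough orders of magnitude, not measurements.

```c
#include <assert.h>

/* Ratio of fast-tier to slow-tier bandwidth (both in GB/s). */
double bandwidth_gap(double fast_gbps, double slow_gbps)
{
    assert(fast_gbps > 0.0 && slow_gbps > 0.0);
    return fast_gbps / slow_gbps;
}

/* Milliseconds needed to stream size_mb megabytes at gbps GB/s.
 * Decimal units: 1 GB = 1000 MB, so MB / (GB/s) comes out in ms. */
double stream_time_ms(double size_mb, double gbps)
{
    assert(gbps > 0.0);
    return size_mb / gbps;
}
```

For example, bandwidth_gap(2000.0, 32.0) gives 62.5, and a 256 MB buffer that streams from 2 TB/s VRAM in about 0.13 ms needs 8 ms over a 32 GB/s link (stream_time_ms(256.0, 32.0)).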


1.5 The Software Consequence

Both hierarchies create the same software requirements:

1.5.1 Virtual Address Spaces (Indirection)

The processor cannot use physical DRAM/VRAM addresses directly for application-visible pointers, because:

  • Multiple consumers share the limited physical resource
  • Data moves between tiers (addresses would change)
  • Isolation between contexts requires private address spaces

CPU solution: Per-process virtual address space (mm_struct, page tables)
GPU solution: Per-context GPU virtual address space (drm_gpuvm, GPU page tables)
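
A toy model of this indirection, as a minimal sketch: one flat table per address space maps each virtual page to a (tier, frame) pair. Every name here (struct addr_space, map_page, translate) is made up for illustration; real CPU and GPU page tables are multi-level and walked by hardware.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12        /* 4 KiB pages */
#define TOY_NUM_VPAGES 16        /* toy address space: 16 virtual pages */

enum tier { TIER_NONE, TIER_FAST, TIER_SLOW };

struct pte {
    enum tier tier;              /* which tier holds the data right now */
    uint32_t  frame;             /* frame index within that tier */
};

struct addr_space {
    struct pte table[TOY_NUM_VPAGES];   /* one private table per process/context */
};

/* Point a virtual page at a (tier, frame) pair. */
void map_page(struct addr_space *as, uint32_t vpage, enum tier t, uint32_t frame)
{
    assert(vpage < TOY_NUM_VPAGES);
    as->table[vpage].tier = t;
    as->table[vpage].frame = frame;
}

/* Translate a virtual address; returns 0 on success, -1 if unmapped. */
int translate(const struct addr_space *as, uint64_t vaddr,
              enum tier *tier_out, uint64_t *paddr_out)
{
    uint64_t vpage = vaddr >> TOY_PAGE_SHIFT;

    if (vpage >= TOY_NUM_VPAGES || as->table[vpage].tier == TIER_NONE)
        return -1;
    *tier_out = as->table[vpage].tier;
    *paddr_out = ((uint64_t)as->table[vpage].frame << TOY_PAGE_SHIFT)
               | (vaddr & ((1u << TOY_PAGE_SHIFT) - 1));
    return 0;
}
```

Remapping virtual page 2 from a fast-tier frame to a slow-tier frame changes what translate() returns for 0x2010, but the virtual address held by the application stays the same. That is the property the three bullets above depend on.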

1.5.2 Demand Allocation (Laziness)

Allocating physical storage at virtual allocation time wastes the scarce resource:

  • Many allocations are never fully used
  • Upfront allocation blocks other consumers

CPU solution: handle_mm_fault() — allocate physical page on first access
GPU solution: ttm_tt_populate() — allocate pages on first GPU use; GEM fault handlers
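
The laziness can be illustrated with a userspace toy: the buffer's range exists as soon as it is created, but each backing page is allocated only on first touch. This is a sketch with hypothetical names (struct lazy_buf, lazy_access); it does not call handle_mm_fault() or ttm_tt_populate(), it only mimics the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096

/* A lazily backed buffer: the range exists immediately,
 * backing pages appear only when first touched. */
struct lazy_buf {
    size_t npages;
    size_t populated;            /* pages actually backed so far */
    unsigned char **pages;       /* NULL entry = not yet populated */
};

int lazy_init(struct lazy_buf *b, size_t npages)
{
    assert(npages > 0);
    b->pages = calloc(npages, sizeof(*b->pages));
    if (!b->pages)
        return -1;
    b->npages = npages;
    b->populated = 0;
    return 0;
}

/* "Fault handler": return the byte at offset, allocating its page on first use. */
unsigned char *lazy_access(struct lazy_buf *b, size_t offset)
{
    size_t idx = offset / TOY_PAGE_SIZE;

    assert(idx < b->npages);
    if (!b->pages[idx]) {                        /* first touch: populate */
        b->pages[idx] = calloc(1, TOY_PAGE_SIZE);
        if (!b->pages[idx])
            return NULL;
        b->populated++;
    }
    return &b->pages[idx][offset % TOY_PAGE_SIZE];
}

void lazy_free(struct lazy_buf *b)
{
    for (size_t i = 0; i < b->npages; i++)
        free(b->pages[i]);
    free(b->pages);
    b->pages = NULL;
    b->npages = b->populated = 0;
}
```

A 1024-page buffer created with lazy_init() costs only the pointer array up front; touching a single byte populates exactly one page.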

1.5.3 Capacity Management (Eviction)

When the fast tier is full and a new allocation arrives:

  • Select a victim from the fast tier (who to evict?)
  • Move victim’s data to the slow tier (how to preserve?)
  • Record that the data has moved (how to find it later?)
  • When the victim is needed again, bring it back (how to restore?)

CPU solution: LRU scan → write to swap → swap entry in PTE → do_swap_page()
GPU solution: LRU scan → DMA to system mem / write to shmem → TTM_TT_FLAG_SWAPPED → ttm_tt_swapin()
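
The four steps can be sketched as a small two-tier simulation. This is an illustrative toy, not kernel or TTM code: the struct and function names are hypothetical, the fast tier has only two slots, and the "slow tier" is just a second array.

```c
#include <assert.h>
#include <string.h>

#define FAST_SLOTS 2             /* capacity of the fast tier */
#define MAX_OBJS   8
#define OBJ_SIZE   16

enum where { NOWHERE, IN_FAST, IN_SLOW };

struct obj {
    enum where where;            /* step 3: record where the data lives */
    int slot;                    /* fast-tier slot while IN_FAST */
    unsigned long last_use;      /* LRU timestamp */
};

struct tiers {
    struct obj objs[MAX_OBJS];
    char fast[FAST_SLOTS][OBJ_SIZE];
    char slow[MAX_OBJS][OBJ_SIZE];   /* backing store, indexed by object id */
    int slot_owner[FAST_SLOTS];      /* object id per slot, -1 = free */
    unsigned long clock;
    unsigned evictions;
};

void tiers_init(struct tiers *t)
{
    memset(t, 0, sizeof(*t));
    for (int s = 0; s < FAST_SLOTS; s++)
        t->slot_owner[s] = -1;
}

/* Return a free slot, or evict the least recently used resident object. */
static int get_slot(struct tiers *t)
{
    int victim_slot = 0;

    for (int s = 0; s < FAST_SLOTS; s++) {
        if (t->slot_owner[s] < 0)
            return s;
        /* step 1: the victim is the resident object with the oldest timestamp */
        if (t->objs[t->slot_owner[s]].last_use <
            t->objs[t->slot_owner[victim_slot]].last_use)
            victim_slot = s;
    }
    int v = t->slot_owner[victim_slot];
    memcpy(t->slow[v], t->fast[victim_slot], OBJ_SIZE);   /* step 2: preserve */
    t->objs[v].where = IN_SLOW;                            /* step 3: record */
    t->evictions++;
    return victim_slot;
}

/* Access an object: returns a pointer to its data in the fast tier. */
char *obj_access(struct tiers *t, int id)
{
    assert(id >= 0 && id < MAX_OBJS);
    struct obj *o = &t->objs[id];

    if (o->where != IN_FAST) {
        int s = get_slot(t);
        if (o->where == IN_SLOW)                   /* step 4: restore */
            memcpy(t->fast[s], t->slow[id], OBJ_SIZE);
        else
            memset(t->fast[s], 0, OBJ_SIZE);       /* brand-new object */
        t->slot_owner[s] = id;
        o->slot = s;
        o->where = IN_FAST;
    }
    o->last_use = ++t->clock;
    return t->fast[o->slot];
}
```

Touching objects 0, 1, 2 in that order evicts object 0 when object 2 arrives; touching object 0 again restores its contents and evicts object 1, now the least recently used.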

1.5.4 Access Hints (Policies)

Userspace knows things the kernel doesn’t — which data is hot, which is dispensable:

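CPU solution: madvise(MADV_DONTNEED) or madvise(MADV_FREE), meaning “you can reclaim this without saving”
GPU solution: driver-specific madvise ioctls that mark a buffer purgeable (reported as DRM_GEM_OBJECT_PURGEABLE), meaning “you can free this without backing up”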
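
The CPU-side hint is easy to observe from userspace. The sketch below assumes Linux semantics for private anonymous mappings, where pages dropped by MADV_DONTNEED come back zero-filled on the next access; the function name is hypothetical and other operating systems may behave differently.

```c
#define _DEFAULT_SOURCE          /* for madvise() and MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a dirtied anonymous page reads back as zero after
 * MADV_DONTNEED, 0 if the old contents survived, -1 on error. */
int dontneed_discards_page(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    if (psz <= 0)
        return -1;
    size_t len = (size_t)psz;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    assert(p[0] == 0);                            /* fresh anonymous pages start zeroed */

    p[0] = 0xAB;                                  /* dirty the page */
    if (madvise(p, len, MADV_DONTNEED) != 0) {    /* hint: contents are disposable */
        munmap(p, len);
        return -1;
    }
    int discarded = (p[0] == 0);                  /* next access faults in a zero page */
    munmap(p, len);
    return discarded;
}
```

The kernel never wrote the 0xAB byte to swap: the hint told it the contents were dispensable, which is the same contract a purgeable GPU buffer offers.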


1.6 Latency Numbers That Explain Everything

Understanding why both systems work the same way requires understanding the latency ratios:

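┌───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ACCESS PATTERN               │ CPU                             │ GPU                              │
├──────────────────────────────┼─────────────────────────────────┼──────────────────────────────────┤
│ Hit in fast tier             │ ~100 ns (DRAM access)           │ ~400 ns (VRAM, L2 miss)          │
│ Miss → slow tier (bulk move) │ ~100 µs (swap-in 4 KB from SSD) │ ~1–10 µs (PCIe DMA of a BO page) │
│ Miss → cold storage          │ ~10 ms (swap-in from HDD)       │ ~100 µs (shmem read-back)        │
│ Penalty ratio (miss/hit)     │ ~1000× (SSD)                    │ ~10× (PCIe)                      │
└──────────────────────────────┴─────────────────────────────────┴──────────────────────────────────┘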

The GPU penalty ratio appears smaller, but the miss volume is far larger:
a single eviction moves an entire buffer object (often MB–GB), not a 4 KB
page. Total stall time = per-page latency × pages moved.

The CPU’s per-page penalty ratio is worse (1000× for SSD, 100,000× for HDD), which is why CPU MM uses extremely sophisticated LRU aging (multi-generational LRU, access bit scanning). The GPU’s per-access ratio is lower, but eviction operates on entire buffer objects — so a single bad eviction decision can stall the GPU for milliseconds. Both systems therefore need the same class of LRU-based policy to minimize misses.
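
The two quantities in this argument, the miss/hit ratio and the total stall, are simple enough to compute directly. The sketch below uses hypothetical helper names and applies the formula exactly as stated (per-page latency × pages moved), which is pessimistic because it assumes no pipelining of the transfers.

```c
#include <assert.h>
#include <stdint.h>

/* Miss/hit penalty ratio; both latencies in nanoseconds. */
double penalty_ratio(double miss_ns, double hit_ns)
{
    assert(hit_ns > 0.0);
    return miss_ns / hit_ns;
}

/* Total stall = per-page latency x pages moved, in nanoseconds.
 * Pessimistic: pages are assumed to move strictly one after another. */
uint64_t eviction_stall_ns(uint64_t bytes, uint64_t page_size, uint64_t per_page_ns)
{
    assert(page_size > 0);
    uint64_t pages = (bytes + page_size - 1) / page_size;   /* round up */
    return pages * per_page_ns;
}
```

With the table's numbers, penalty_ratio(100000.0, 100.0) is 1000 for an SSD swap-in. A 64 MiB buffer object at an assumed 1 µs per 4 KiB page gives eviction_stall_ns(64ull << 20, 4096, 1000) = 16,384,000 ns, roughly 16 ms, versus 100 µs for a single CPU page fault from SSD.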


1.7 The Three-Tier Model in Code

1.7.1 CPU: DRAM → Swap

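From include/linux/swapops.h: the kernel records where a swapped-out page lives in its non-present PTE and decodes it like this:

/* Decode the swap entry stored in a non-present PTE */
swp_entry_t entry = pte_to_swp_entry(pte);
unsigned int type = swp_type(entry);      /* which swap device */
pgoff_t offset = swp_offset(entry);       /* page offset within that device or file */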
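
To show what "device index + offset" packed into one word looks like, here is a userspace toy. The 5-bit/59-bit split and all toy_* names are assumptions made for illustration; the kernel's real swp_entry_t layout is architecture-specific and reserves additional PTE bits.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TYPE_BITS   5                          /* assumed: up to 32 swap devices */
#define TOY_OFFSET_BITS (64 - TOY_TYPE_BITS)
#define TOY_OFFSET_MASK ((UINT64_C(1) << TOY_OFFSET_BITS) - 1)

typedef struct { uint64_t val; } toy_swp_entry_t;

/* Pack (device index, page offset) into one word. */
toy_swp_entry_t toy_swp_entry(unsigned type, uint64_t offset)
{
    assert(type < (1u << TOY_TYPE_BITS));
    assert(offset <= TOY_OFFSET_MASK);
    toy_swp_entry_t e = { ((uint64_t)type << TOY_OFFSET_BITS) | offset };
    return e;
}

/* Device index lives in the top bits. */
unsigned toy_swp_type(toy_swp_entry_t e)
{
    return (unsigned)(e.val >> TOY_OFFSET_BITS);
}

/* Page offset lives in the remaining low bits. */
uint64_t toy_swp_offset(toy_swp_entry_t e)
{
    return e.val & TOY_OFFSET_MASK;
}
```

Packing device 3 and offset 0x12345 and then unpacking returns the same pair, which is all the fault path needs to find the page again.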

1.7.2 GPU: VRAM → System Memory → Shmem

From include/drm/ttm/ttm_placement.h:

#define TTM_PL_SYSTEM   0   /* System memory (CPU-accessible, GPU via PCIe) */
#define TTM_PL_TT       1   /* TT memory (system mem + GPU page table entry) */
#define TTM_PL_VRAM     2   /* Video RAM (fast, local to GPU) */

The placement definitions encode the hierarchy directly. And from ttm_resource_manager:

struct ttm_resource_manager {
    uint64_t size;                                /* Total size of this tier */
    struct list_head lru[TTM_MAX_BO_PRIORITY];    /* LRU lists for eviction */
    ...
};

Each memory tier has its own resource manager with its own LRU lists, much as the CPU side keeps separate LRU lists per NUMA node and per memory cgroup (struct lruvec).
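
A stripped-down model of such a per-tier manager is sketched below. The toy_* names are hypothetical, the lists are plain singly linked lists rather than the kernel's list_head, and scanning lower-numbered priority lists first is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PRIORITY 4

struct toy_bo {
    size_t size;
    struct toy_bo *next;                     /* next (more recently used) BO */
};

struct toy_manager {
    size_t size;                             /* total capacity of this tier */
    size_t used;
    struct toy_bo *head[TOY_MAX_PRIORITY];   /* oldest BO per priority */
    struct toy_bo *tail[TOY_MAX_PRIORITY];   /* newest BO per priority */
};

/* Append a BO at the most-recently-used end of its priority list. */
void toy_lru_add(struct toy_manager *m, struct toy_bo *bo, unsigned prio)
{
    assert(prio < TOY_MAX_PRIORITY);
    bo->next = NULL;
    if (m->tail[prio])
        m->tail[prio]->next = bo;
    else
        m->head[prio] = bo;
    m->tail[prio] = bo;
    m->used += bo->size;
}

/* Remove and return the eviction victim: the oldest BO in the lowest
 * non-empty priority list, or NULL if the tier holds nothing. */
struct toy_bo *toy_evict_first(struct toy_manager *m)
{
    for (unsigned p = 0; p < TOY_MAX_PRIORITY; p++) {
        struct toy_bo *bo = m->head[p];
        if (!bo)
            continue;
        m->head[p] = bo->next;
        if (!m->head[p])
            m->tail[p] = NULL;
        m->used -= bo->size;
        bo->next = NULL;
        return bo;
    }
    return NULL;
}
```

Giving each tier its own instance of this structure keeps victim selection local: pressure in the fast tier only ever scans the fast tier's lists.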


1.8 Why the GPU Has Three Tiers (Not Two)

The CPU model is simpler: DRAM or swap. The GPU has an extra level:

VRAM  →  System Memory (TTM_PL_TT/SYSTEM)  →  shmem (swap_storage)
 fast         medium (PCIe-accessible)           cold (disk-backed)

Why? Because GPU system memory isn’t quite swap:

  • System memory is still directly accessible by the GPU (DMA across PCIe through GART/GTT mappings, optionally behind an IOMMU)
  • A buffer in system memory can still be used — just slower
  • Only when system memory is also under pressure does TTM swap to shmem

This maps to the CPU’s NUMA tiering:

Fast DRAM (local node)  →  Slow DRAM (remote node)  →  Swap

In both cases, there’s a gradient of performance, not a binary fast/slow split.
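
One way to picture the gradient is a preference list: try the fastest tier first and fall back down the list when it is full. The sketch below is loosely modeled on that idea only; the TOY_PL_* constants mirror the ordering of the TTM_PL_* values quoted earlier, and every other name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Tier indices, mirroring the ordering of the TTM_PL_* values. */
enum { TOY_PL_SYSTEM = 0, TOY_PL_TT = 1, TOY_PL_VRAM = 2, TOY_PL_COUNT = 3 };

struct toy_tier {
    size_t capacity;
    size_t used;
};

/* Try each tier in preference order; return the first that fits, or -1. */
int toy_place(struct toy_tier tiers[TOY_PL_COUNT],
              const int *prefs, size_t nprefs, size_t size)
{
    for (size_t i = 0; i < nprefs; i++) {
        int t = prefs[i];

        assert(t >= 0 && t < TOY_PL_COUNT);
        if (tiers[t].capacity - tiers[t].used >= size) {
            tiers[t].used += size;
            return t;
        }
    }
    return -1;   /* nothing fits: the caller would have to evict and retry */
}
```

With preferences { VRAM, TT, SYSTEM }, a buffer lands in VRAM while there is room and degrades gracefully to slower tiers afterwards, instead of failing outright.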


1.9 When the Hierarchies Collapse: The Integrated GPU Case

Integrated GPUs (Intel, AMD APU, ARM Mali) share system memory with the CPU:

┌─────────────────────────────────────────┐
│     Shared Physical DRAM                │
│  ┌─────────────┐   ┌─────────────────┐  │
│  │  CPU Pages  │   │  GPU Buffers    │  │
│  │  (mm_struct)│   │  (GEM/TTM)      │  │
│  └─────────────┘   └─────────────────┘  │
└─────────────────────────────────────────┘
         │                    │
         └────── Swap ────────┘

Here the parallel is even more obvious: CPU and GPU are literally competing for the same physical pages, managed by the same kernel, with the same swap as overflow. The GPU’s drm_gem_lru_scan() participates in the same shrinker framework as CPU slab caches.


1.10 The “Same Problem, Same Solution” Principle

This pattern is often called convergent design, by analogy with convergent evolution in biology. When two systems solve the same problem under similar constraints, they converge on the same solution, not because one copied the other, but because the solution space is constrained.

The constraints are:

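┌──────────────────────────────────────────────────────────────────────────────────────────┐
│ CONSTRAINT                    │ CPU                      │ GPU                           │
├───────────────────────────────┼──────────────────────────┼───────────────────────────────┤
│ Limited fast memory           │ DRAM is finite           │ VRAM is finite                │
│ Larger address spaces         │ 48-bit VA > physical RAM │ GPU VA > VRAM                 │
│ Multiple competing consumers  │ Processes                │ Buffer objects / GPU contexts │
│ Data locality matters         │ NUMA effects             │ VRAM vs PCIe latency          │
│ Unpredictable access patterns │ General-purpose code     │ Varying workloads             │
└───────────────────────────────┴──────────────────────────┴───────────────────────────────┘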

Given these constraints, the solution must include:

  1. ✅ Indirection layer (page tables)
  2. ✅ Lazy allocation (demand paging)
  3. ✅ LRU-based eviction (reclaim)
  4. ✅ Overflow to slower storage (swap)
  5. ✅ Userspace hints (madvise)
  6. ✅ Pressure-responsive shrinking (shrinkers)

Both CPU MM and GPU TTM/GEM implement all six. Not because one was designed after the other (though chronologically TTM came after), but because the problem requires all six.


1.11 Historical Timeline

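┌───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ YEAR  │ CPU MM MILESTONE                │ GPU MM MILESTONE                                        │
├───────┼─────────────────────────────────┼─────────────────────────────────────────────────────────┤
│ 1969  │ Multics: demand paging          │                                                         │
│ 1979  │ BSD: mmap(), vfork()            │                                                         │
│ 1991  │ Linux 0.01: basic swap          │                                                         │
│ 1998  │ Linux 2.2: mature VM            │ DRI 1.0 (no memory management)                          │
│ 2004  │ Linux 2.6: rmap, O(1) scheduler │                                                         │
│ 2008  │                                 │ GEM introduced (shmem-backed objects)                   │
│ 2009  │                                 │ TTM introduced (full VM analogy emerges)                │
│ 2012  │                                 │ Radeon TTM eviction matures in production               │
│ 2014  │                                 │ nouveau (community reverse-eng) gains full TTM eviction │
│ 2017  │ HMM merged (4.14), for GPU use  │                                                         │
│ 2022  │ Multi-gen LRU (MGLRU)           │                                                         │
│ 2023  │                                 │ GPUVM merged (drm_gpuvm = GPU mm_struct)                │
│ 2024  │                                 │ drm_gpusvm (SVM layer using HMM)                        │
│ 2025+ │ Converging via HMM/SVM/CXL      │ Converging via HMM/SVM/CXL                              │
└───────┴─────────────────────────────────┴─────────────────────────────────────────────────────────┘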

The GPU memory subsystem recapitulated 40 years of CPU MM evolution in about 15 years — because it was solving the same problem and could learn from existing solutions.


1.12 What This Means For You — And What Comes Next

If you develop GPU drivers or compute runtimes, you already intuitively understand buffer placement, eviction cost, and working-set management. What this column will show you is that every one of those intuitions has a precise, well-studied counterpart in CPU MM — with decades of research, proven algorithms, and battle-tested kernel code you can learn from or call into directly.

Now that we see the parallel from 10,000 feet, we’ll zoom in level by level:

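┌──────────────────────────────────────────────────────────────────────────────────────┐
│ NEXT CHAPTERS         │ WHAT WE’LL EXPLORE                                           │
├───────────────────────┼──────────────────────────────────────────────────────────────┤
│ Part I (Ch. 2–5)      │ How both systems organize virtual address space              │
│ Part II (Ch. 6–9)     │ How both systems allocate physical storage                   │
│ Part III (Ch. 10–12)  │ How both systems map virtual to physical                     │
│ Part IV (Ch. 13–18)   │ How both systems evict under pressure (the deepest parallel) │
│ Part V (Ch. 19–22)    │ How both systems fault-in on demand                          │
│ Part VI (Ch. 23–25)   │ How both systems accept hints from userspace                 │
│ Part VII (Ch. 26–30)  │ How the two systems are merging                              │
│ Part VIII (Ch. 31–34) │ How to apply this knowledge in practice                      │
└───────────────────────┴──────────────────────────────────────────────────────────────┘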

Each chapter will start with what you know (the GPU side) and reveal its CPU mirror — building a complete mental model of both systems as one unified architecture.


Previous: Chapter 0 — GPU Memory Is Virtual Memory: Why This Column Exists
Next: Chapter 2 — The Process Address Space: mm_struct as the Original GPU Context
