Chapter 1: A Tale of Two Memory Hierarchies — CPU vs GPU at 10,000 Feet

This article compares the memory hierarchies of the CPU and the GPU. It points out that both face the same basic constraint: limited fast-memory capacity. To address this, both adopt a hierarchical storage…

DeeplyMind · Published 2026-06-01 05:00:00

1.1 Two Processors, One Problem

Every modern processor faces the same fundamental engineering constraint: the memory it can access fastest is the memory it has least of. This is not a GPU-specific problem or a CPU-specific problem. It is the memory wall — the defining constraint of computer architecture since the 1990s.

Both CPU and GPU solve it the same way: build a hierarchy of progressively slower, larger storage tiers, and create a software abstraction (virtual memory) that makes the limited fast tier appear unlimited to consumers.

This chapter maps the two hierarchies side by side — with real latency numbers — and shows why they inevitably arrive at identical software solutions.

1.2 The CPU Memory Hierarchy

┌─────────────────────────────────────────────────────────────────────┐
│ TIER          │ CAPACITY      │ LATENCY       │ MANAGED BY          │
├───────────────┼───────────────┼───────────────┼─────────────────────┤
│ Registers     │ ~1 KB         │ 0 cycles      │ Compiler / ISA      │
│ L1 Cache      │ 32–64 KB      │ ~4 cycles     │ Hardware (coherent) │
│ L2 Cache      │ 256 KB–1 MB   │ ~12 cycles    │ Hardware (coherent) │
│ L3 / LLC      │ 8–64 MB       │ ~40 cycles    │ Hardware (coherent) │
│ DRAM          │ 16–512 GB     │ ~100 ns       │ OS MM (page-based)  │
│ Swap (SSD)    │ TB-scale      │ ~100 µs       │ OS MM (swap subsys) │
│ Swap (HDD)    │ TB-scale      │ ~10 ms        │ OS MM (swap subsys) │
└───────────────┴───────────────┴───────────────┴─────────────────────┘

Key boundary: Between LLC and DRAM, hardware manages movement automatically (cache lines, inclusivity, coherence protocols). Between DRAM and swap, software manages movement — the OS kernel’s memory management subsystem.

The CPU MM’s job is precisely this: make 512 GB of DRAM appear sufficient for workloads that collectively request terabytes of address space, by paging data between DRAM and swap as needed.

1.3 The GPU Memory Hierarchy

┌─────────────────────────────────────────────────────────────────────┐
│ TIER          │ CAPACITY      │ LATENCY        │ MANAGED BY         │
├───────────────┼───────────────┼────────────────┼────────────────────┤
│ Registers     │ ~256 KB/SM    │ 0 cycles       │ Compiler / HW      │
│ Shared Mem/L1 │ 16–228 KB/SM  │ ~20 cycles     │ Programmer / HW    │
│ L2 Cache      │ 4–96 MB       │ ~200 cycles    │ Hardware           │
│ VRAM (HBM/G6X)│ 8–192 GB      │ ~400–800 cyc   │ Driver (TTM/GEM)   │
│ System Memory │ Host DRAM     │ ~1–5 µs (PCIe) │ Driver (TTM evict) │
│ Shmem/Swap    │ Disk-backed   │ ~100 µs+       │ Driver (TTM swap)  │
└───────────────┴───────────────┴────────────────┴────────────────────┘

Note: VRAM latency here means an L2-miss load from a GPU shader, not a DMA
transfer. System memory latency reflects a single GPU-initiated read crossing
the PCIe/CXL bus. Bulk DMA throughput is roughly 32–64 GB/s per direction
(PCIe 4.0 x16 to PCIe 5.0 x16).

Key boundary: Between L2 and VRAM, GPU hardware manages movement (cache lines). Between VRAM and system memory/swap, software manages movement — the DRM/TTM subsystem in the kernel.

TTM’s job is precisely this: make 24 GB of VRAM appear sufficient for workloads that collectively allocate hundreds of gigabytes of buffer objects, by evicting data between VRAM and system memory as needed.

1.4 The Parallel, Visualized

          CPU World                          GPU World
    ┌────────────────────┐            ┌────────────────────┐
    │   "Fast Memory"    │            │   "Fast Memory"    │
    │   Physical DRAM    │            │      VRAM          │
    │  (limited, fast)   │            │  (limited, fast)   │
    └────────┬───────────┘            └────────┬───────────┘
             │ eviction/swap-out               │ eviction
             ▼                                  ▼
    ┌────────────────────┐            ┌────────────────────┐
    │  "Slow Overflow"   │            │  "Slow Overflow"   │
    │   Swap Device      │            │  System Memory     │
    │  (large, slow)     │            │  (large, slow*)    │
    └────────┬───────────┘            └────────┬───────────┘
             │                                 │ further eviction
             │                                 ▼
             │                        ┌────────────────────┐
             │                        │  "Cold Storage"    │
             │                        │  shmem/swap_storage│
             │                        │  (disk-backed)     │
             │                        └────────────────────┘
             │
    ┌────────┴───────────┐            ┌────────────────────┐
    │  Management Layer  │            │  Management Layer  │
    │  Linux MM          │            │  DRM TTM/GEM       │
    │  (kswapd, LRU,     │            │  (eviction, LRU,   │
    │   page tables)     │            │   GPU page tables) │
    └────────────────────┘            └────────────────────┘

    * "slow" = across PCIe bus, ~32-64 GB/s vs VRAM's ~1-5 TB/s

The bandwidth ratio is striking:

CPU: DRAM ~50–100 GB/s vs swap (NVMe) ~3–7 GB/s → ~15× gap
GPU: VRAM ~1–5 TB/s vs PCIe ~32–64 GB/s → ~30–80× gap

The GPU’s fast/slow bandwidth ratio is worse. A buffer that fits in VRAM runs at full memory bandwidth; once evicted to system memory, every access pays a 30–80× throughput penalty. The eviction algorithm is therefore more critical for GPU than CPU — a bad decision is punished harder.
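
The ratios above are plain division, and it can help to see the arithmetic written out. The sketch below is only an illustration: the function names `bandwidth_gap` and `stream_time_ms` are hypothetical, and the inputs are the approximate figures quoted in this section, not measurements.

```c
#include <assert.h>

/* Ratio of fast-tier to slow-tier bandwidth (both in GB/s). */
double bandwidth_gap(double fast_gbps, double slow_gbps)
{
    assert(fast_gbps > 0.0 && slow_gbps > 0.0);
    return fast_gbps / slow_gbps;
}

/* Milliseconds needed to stream `bytes` at `gbps` (1 GB = 1e9 bytes). */
double stream_time_ms(double bytes, double gbps)
{
    assert(bytes >= 0.0 && gbps > 0.0);
    return bytes / (gbps * 1e9) * 1e3;
}
```

With the figures above, `bandwidth_gap(1000, 32)` gives 31.25 and `bandwidth_gap(5000, 64)` gives 78.125, which is where the 30–80× range comes from. Under the same assumptions, streaming 1 GB takes about 1 ms at 1 TB/s and about 31 ms at 32 GB/s.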

1.5 The Software Consequence

Both hierarchies create the same software requirements:

1.5.1 Virtual Address Spaces (Indirection)

The processor cannot use physical DRAM/VRAM addresses directly for application-visible pointers, because:

Multiple consumers share the limited physical resource
Data moves between tiers (addresses would change)
Isolation between contexts requires private address spaces

CPU solution: Per-process virtual address space (mm_struct, page tables)
GPU solution: Per-context GPU virtual address space (drm_gpuvm, GPU page tables)
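
Why indirection solves the "addresses would change" problem is easiest to see in a toy. The following is a minimal sketch of a one-level page table, not the layout used by `mm_struct` or `drm_gpuvm`; every name in it (`toy_page_table`, `toy_translate`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT    12
#define PAGE_SIZE     (1u << PAGE_SHIFT)
#define NUM_VPAGES    16
#define INVALID_FRAME UINT32_MAX

/* One-level toy page table: virtual page number -> physical frame number. */
struct toy_page_table {
    uint32_t frame[NUM_VPAGES];
};

/* Start with every virtual page unmapped. */
void toy_pt_init(struct toy_page_table *pt)
{
    for (int i = 0; i < NUM_VPAGES; i++)
        pt->frame[i] = INVALID_FRAME;
}

/* Point a virtual page at a physical frame (also used to re-point it). */
void toy_pt_map(struct toy_page_table *pt, uint32_t vpn, uint32_t pfn)
{
    assert(vpn < NUM_VPAGES);
    pt->frame[vpn] = pfn;
}

/* Translate va; returns 0 and fills *pa on success, -1 if unmapped. */
int toy_translate(const struct toy_page_table *pt, uint32_t va, uint64_t *pa)
{
    uint32_t vpn = va >> PAGE_SHIFT;
    uint32_t off = va & (PAGE_SIZE - 1);

    if (vpn >= NUM_VPAGES || pt->frame[vpn] == INVALID_FRAME)
        return -1;
    *pa = ((uint64_t)pt->frame[vpn] << PAGE_SHIFT) | off;
    return 0;
}
```

If virtual page 2 is mapped to frame 7, address 0x2010 translates to 0x7010. After the data migrates and the entry is re-pointed to frame 9, the same virtual address translates to 0x9010, and the application-visible pointer never changed.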

1.5.2 Demand Allocation (Laziness)

Allocating physical storage at virtual allocation time wastes the scarce resource:

Many allocations are never fully used
Upfront allocation blocks other consumers

CPU solution: handle_mm_fault() — allocate physical page on first access
GPU solution: ttm_tt_populate() — allocate pages on first GPU use; GEM fault handlers
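
The laziness can be modeled in a few lines. This is a toy simulation, not what `handle_mm_fault()` or `ttm_tt_populate()` actually do internally; the type `lazy_region` and both functions are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LAZY_PAGES 8

/* A reserved virtual range whose pages get backing only when touched. */
struct lazy_region {
    bool backed[LAZY_PAGES];  /* has a physical page been assigned? */
    size_t frames_in_use;     /* physical pages actually consumed */
    size_t faults;            /* first-touch faults served so far */
};

/* "Virtual allocation": reserves the range, consumes no physical pages. */
void lazy_reserve(struct lazy_region *r)
{
    *r = (struct lazy_region){0};
}

/* Access one page; returns true if this access had to allocate backing. */
bool lazy_touch(struct lazy_region *r, size_t page)
{
    assert(page < LAZY_PAGES);
    if (r->backed[page])
        return false;          /* already resident: no fault */
    r->backed[page] = true;    /* first touch: allocate now */
    r->frames_in_use++;
    r->faults++;
    return true;
}
```

Reserving eight pages and touching only pages 3 and 5 leaves `frames_in_use` at 2. The six untouched pages never cost any physical memory, so that memory stays available to other consumers.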

1.5.3 Capacity Management (Eviction)

When the fast tier is full and a new allocation arrives:

Select a victim from the fast tier (who to evict?)
Move victim’s data to the slow tier (how to preserve?)
Record that the data has moved (how to find it later?)
When the victim is needed again, bring it back (how to restore?)

CPU solution: LRU scan → write to swap → swap entry in PTE → do_swap_page()
GPU solution: LRU scan → DMA to system mem / write to shmem → TTM_TT_FLAG_SWAPPED → ttm_tt_swapin()
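
The four steps can be sketched as a tiny simulation. It assumes a fast tier with only two slots and a strict least-recently-used policy, and it models "moving data" as nothing more than updating a location field; all names (`toy_mm`, `toy_access`, `toy_pick_victim`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FAST_SLOTS 2
#define MAX_OBJS   4

enum tier { TIER_NONE = 0, TIER_FAST, TIER_SLOW };

struct toy_obj {
    enum tier where;         /* step 3: record of where the data lives */
    unsigned long last_use;  /* logical timestamp of the last access */
};

struct toy_mm {
    struct toy_obj obj[MAX_OBJS];
    size_t fast_used;        /* occupied slots in the fast tier */
    unsigned long clock;     /* logical time, bumped on every access */
    size_t evictions;
};

void toy_mm_init(struct toy_mm *mm)
{
    *mm = (struct toy_mm){0};
}

/* Step 1: pick the least recently used object in the fast tier, or -1. */
int toy_pick_victim(const struct toy_mm *mm)
{
    int victim = -1;

    for (int i = 0; i < MAX_OBJS; i++) {
        if (mm->obj[i].where != TIER_FAST)
            continue;
        if (victim < 0 || mm->obj[i].last_use < mm->obj[victim].last_use)
            victim = i;
    }
    return victim;
}

/* Access object `id`, making it resident in the fast tier first. */
void toy_access(struct toy_mm *mm, int id)
{
    assert(id >= 0 && id < MAX_OBJS);
    struct toy_obj *o = &mm->obj[id];

    if (o->where != TIER_FAST) {
        if (mm->fast_used == FAST_SLOTS) {
            int v = toy_pick_victim(mm);
            assert(v >= 0);
            mm->obj[v].where = TIER_SLOW;  /* steps 2 and 3: move, record */
            mm->fast_used--;
            mm->evictions++;
        }
        o->where = TIER_FAST;              /* step 4, or first allocation */
        mm->fast_used++;
    }
    o->last_use = ++mm->clock;
}
```

Accessing objects 0, 1, 0, 2 in that order evicts object 1, because object 0 was touched more recently. Accessing object 1 again then restores it and pushes out object 0.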

1.5.4 Access Hints (Policies)

Userspace knows things the kernel doesn’t — which data is hot, which is dispensable:

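CPU solution: madvise(MADV_DONTNEED) or madvise(MADV_FREE): “you can reclaim this without saving it”
GPU solution: driver-specific madvise ioctls (for example DRM_IOCTL_I915_GEM_MADVISE or DRM_IOCTL_MSM_GEM_MADVISE) mark a buffer as purgeable: “you can free this without backing it up”. Drivers report such buffers through the DRM_GEM_OBJECT_PURGEABLE status flag.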
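
The CPU-side hint can be observed directly from userspace. The sketch below assumes Linux and a private anonymous mapping, where pages dropped by `MADV_DONTNEED` read back as zeros on the next access. The function name `dontneed_demo` is hypothetical; the system calls are the standard ones.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill one anonymous page, tell the kernel it is dispensable, then read it.
 * Returns the first byte seen after the hint, or -1 on error.
 */
int dontneed_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    size_t len = (size_t)page;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);              /* page is now resident and dirty */
    assert(p[0] == 0xAB);

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    int first = p[0];                  /* faults in a fresh zero-filled page */
    munmap(p, len);
    return first;
}
```

On Linux this returns 0: the kernel discarded the contents without writing them anywhere, which is exactly the "reclaim without saving" contract. A purgeable GPU buffer makes the same trade, although the details of each driver's ioctl differ.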

1.6 Latency Numbers That Explain Everything

Understanding why both systems work the same way requires understanding the latency ratios:

Access Pattern	CPU	GPU
Hit in fast tier	~100 ns (DRAM access)	~400 ns (VRAM, L2 miss)
Miss → slow tier (bulk move)	~100 µs (swap-in 4 KB from SSD)	~1–10 µs (PCIe DMA of a BO page)
Miss → cold storage	~10 ms (swap-in from HDD)	~100 µs (shmem read-back)
Penalty ratio (miss/hit)	~1000× (SSD)	~10× (PCIe)

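The GPU penalty ratio appears smaller, but the miss volume is far larger:
a single eviction moves an entire buffer object (often MB–GB), not a 4 KB
page. To a first approximation, total stall time = per-page latency × pages
moved; pipelined DMA lowers this toward the bandwidth limit, but the cost
still scales with buffer size.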

The CPU’s per-page penalty ratio is worse (1000× for SSD, 100,000× for HDD), which is why CPU MM uses extremely sophisticated LRU aging (multi-generational LRU, access bit scanning). The GPU’s per-access ratio is lower, but eviction operates on entire buffer objects — so a single bad eviction decision can stall the GPU for milliseconds. Both systems therefore need the same class of LRU-based policy to minimize misses.
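
The ratios and the stall-time rule of thumb from this section reduce to integer arithmetic. The helper names below are hypothetical, and the serial model (one page at a time, no DMA pipelining) is a deliberately pessimistic assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Miss/hit penalty ratio; both latencies in nanoseconds. */
uint64_t penalty_ratio(uint64_t miss_ns, uint64_t hit_ns)
{
    assert(hit_ns > 0);
    return miss_ns / hit_ns;
}

/* Number of pages needed to hold `bytes`, rounding up. */
uint64_t pages_for(uint64_t bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

/* Serial model from the text: per-page latency x pages moved, in ns. */
uint64_t serial_stall_ns(uint64_t bytes, uint64_t page_size,
                         uint64_t per_page_ns)
{
    return pages_for(bytes, page_size) * per_page_ns;
}
```

Plugging in the table's numbers: `penalty_ratio(100000, 100)` is 1000 for SSD swap, `penalty_ratio(10000000, 100)` is 100000 for HDD swap, and `penalty_ratio(4000, 400)` is 10 for a PCIe move at 4 µs. A 64 MiB buffer object is 16384 pages of 4 KB, so at 1 µs per page the serial model gives 16,384,000 ns, about 16 ms of stall for one eviction.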

1.7 The Three-Tier Model in Code

1.7.1 CPU: DRAM → Swap

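From include/linux/swapops.h: once a page has been swapped out, its PTE no longer points at a page frame but stores a swap entry that records where the data went:

/* Decode the swap entry stored in a non-present PTE */
swp_entry_t entry = pte_to_swp_entry(pte);
unsigned int type = swp_type(entry);    /* which swap device or file */
pgoff_t offset = swp_offset(entry);     /* page offset within it */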
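
To make the "device index plus offset" idea concrete, here is a toy encoding that packs both into one 64-bit word. The bit widths and every `toy_` name are assumptions made for illustration; the real layout is architecture-specific and differs from this.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TYPE_BITS   5
#define TOY_OFFSET_BITS (64 - TOY_TYPE_BITS)
#define TOY_OFFSET_MASK ((UINT64_C(1) << TOY_OFFSET_BITS) - 1)

/* Toy swap entry: device index in the top bits, page offset below. */
typedef struct { uint64_t val; } toy_swp_entry_t;

/* Pack a device index and a page offset into one entry. */
toy_swp_entry_t toy_swp_entry(unsigned type, uint64_t offset)
{
    assert(type < (1u << TOY_TYPE_BITS));
    assert(offset <= TOY_OFFSET_MASK);
    toy_swp_entry_t e = { ((uint64_t)type << TOY_OFFSET_BITS) | offset };
    return e;
}

/* Recover the device index. */
unsigned toy_swp_type(toy_swp_entry_t e)
{
    return (unsigned)(e.val >> TOY_OFFSET_BITS);
}

/* Recover the page offset within the device. */
uint64_t toy_swp_offset(toy_swp_entry_t e)
{
    return e.val & TOY_OFFSET_MASK;
}
```

Encoding device 3 and offset 0x1234, then decoding, returns the same pair. The point is that a single word in the page table is enough to answer "how do I find it later?".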

1.7.2 GPU: VRAM → System Memory → Shmem

From include/drm/ttm/ttm_placement.h:

#define TTM_PL_SYSTEM   0   /* System memory (CPU-accessible, GPU via PCIe) */
#define TTM_PL_TT       1   /* TT memory (system mem + GPU page table entry) */
#define TTM_PL_VRAM     2   /* Video RAM (fast, local to GPU) */

The placement definitions encode the hierarchy directly. Each tier is then managed by a struct ttm_resource_manager (declared in include/drm/ttm/ttm_resource.h; abridged here):

struct ttm_resource_manager {
    uint64_t size;                                /* Total size of this tier */
    struct list_head lru[TTM_MAX_BO_PRIORITY];    /* LRU lists for eviction */
    ...
};

Each memory tier has its own resource manager with its own LRU lists, much as the CPU keeps per-node LRU lists (one lruvec per memory cgroup per NUMA node).
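
The array of LRU lists indexed by priority suggests a two-level victim choice: first by priority bucket, then by recency within the bucket. The sketch below assumes that lower-numbered priorities are scanned, and therefore evicted, first; that is my reading of the scan order, not something the struct itself guarantees. It uses a flat array instead of linked lists, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PRIORITY 4

struct prio_bo {
    int id;
    unsigned priority;       /* bucket index, 0 = evicted first (assumed) */
    unsigned long last_use;  /* logical timestamp of the last use */
    int resident;            /* non-zero if currently in this tier */
};

/*
 * Pick the eviction victim: lowest priority bucket first, then the least
 * recently used buffer within that bucket. Returns NULL if none is resident.
 */
const struct prio_bo *prio_pick_victim(const struct prio_bo *bos, size_t n)
{
    const struct prio_bo *victim = NULL;

    for (size_t i = 0; i < n; i++) {
        const struct prio_bo *bo = &bos[i];

        assert(bo->priority < TOY_MAX_PRIORITY);
        if (!bo->resident)
            continue;
        if (!victim ||
            bo->priority < victim->priority ||
            (bo->priority == victim->priority &&
             bo->last_use < victim->last_use))
            victim = bo;
    }
    return victim;
}
```

Given a priority-1 buffer last used at time 5 and two resident priority-0 buffers last used at times 9 and 4, the victim is the priority-0 buffer from time 4. The higher-priority buffer survives even though it is not the most recently used one.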

1.8 Why the GPU Has Three Tiers (Not Two)

The CPU model is simpler: DRAM or swap. The GPU has an extra level:

VRAM  →  System Memory (TTM_PL_TT/SYSTEM)  →  shmem (swap_storage)
 fast         medium (PCIe-accessible)           cold (disk-backed)

Why? Because GPU system memory isn’t quite swap:

System memory is still directly accessible by the GPU (through GART/GPU page-table mappings over PCIe, optionally behind an IOMMU)
A buffer in system memory can still be used — just slower
Only when system memory is also under pressure does TTM swap to shmem

This maps to the CPU’s NUMA tiering:

Fast DRAM (local node)  →  Slow DRAM (remote node)  →  Swap

In both cases, there’s a gradient of performance, not a binary fast/slow split.
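
A gradient of tiers naturally leads to an ordered list of acceptable placements. The following is a minimal sketch of that fallback, assuming a buffer simply lands in the first preferred tier with free space. A real manager would also consider evicting something from the preferred tier, which this toy omits. The `toy_` names are hypothetical; only the numeric values mirror the TTM_PL_* definitions quoted earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy placement IDs; values mirror TTM_PL_SYSTEM/TT/VRAM from the text. */
enum toy_place {
    TOY_PL_SYSTEM = 0,
    TOY_PL_TT     = 1,
    TOY_PL_VRAM   = 2,
    TOY_PL_COUNT  = 3
};

struct toy_tier {
    uint64_t size;  /* capacity in bytes */
    uint64_t used;  /* bytes currently allocated */
};

/*
 * Try each preferred placement in order and charge the first tier that
 * has room. Returns the chosen placement, or -1 if nothing fits.
 */
int toy_place_buffer(struct toy_tier tiers[TOY_PL_COUNT],
                     const enum toy_place *prefs, size_t nprefs,
                     uint64_t bytes)
{
    for (size_t i = 0; i < nprefs; i++) {
        assert((unsigned)prefs[i] < TOY_PL_COUNT);
        struct toy_tier *t = &tiers[prefs[i]];

        assert(t->used <= t->size);
        if (t->size - t->used >= bytes) {
            t->used += bytes;
            return (int)prefs[i];
        }
    }
    return -1;
}
```

With VRAM nearly full (90 of 100 units used) and a preference list of VRAM then TT, a 20-unit buffer lands in TT while a later 10-unit buffer still fits in VRAM. The buffer in TT remains usable by the GPU, just at PCIe speed.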

1.9 When the Hierarchies Collapse: The Integrated GPU Case

Integrated GPUs (Intel, AMD APU, ARM Mali) share system memory with the CPU:

┌─────────────────────────────────────────┐
│     Shared Physical DRAM                │
│  ┌─────────────┐   ┌─────────────────┐  │
│  │  CPU Pages  │   │  GPU Buffers    │  │
│  │  (mm_struct)│   │  (GEM/TTM)      │  │
│  └─────────────┘   └─────────────────┘  │
└─────────────────────────────────────────┘
         │                    │
         └────── Swap ────────┘

Here the parallel is even more obvious: CPU and GPU are literally competing for the same physical pages, managed by the same kernel, with the same swap as overflow. GPU drivers register shrinkers that call helpers such as drm_gem_lru_scan(), so GPU buffers are reclaimed through the same shrinker framework as CPU slab caches.

1.10 The “Same Problem, Same Solution” Principle

Computer science has a name for this: convergent design. When two systems solve the same problem under similar constraints, they converge on the same solution — not because one copied the other, but because the solution space is constrained.

The constraints are:

Constraint	CPU	GPU
Limited fast memory	DRAM is finite	VRAM is finite
Larger address spaces	48-bit VA > physical RAM	GPU VA > VRAM
Multiple competing consumers	Processes	Buffer objects / GPU contexts
Data locality matters	NUMA effects	VRAM vs PCIe latency
Unpredictable access patterns	General-purpose code	Varying workloads

Given these constraints, the solution must include:

✅ Indirection layer (page tables)
✅ Lazy allocation (demand paging)
✅ LRU-based eviction (reclaim)
✅ Overflow to slower storage (swap)
✅ Userspace hints (madvise)
✅ Pressure-responsive shrinking (shrinkers)

Both CPU MM and GPU TTM/GEM implement all six. Not because one was designed after the other (though chronologically TTM came after), but because the problem requires all six.

1.11 Historical Timeline

Year	CPU MM Milestone	GPU MM Milestone
1969	Multics: demand paging	—
1979	3BSD: demand-paged virtual memory, `vfork()`	—
1991	Linux 0.01: paging only (swap to disk follows in 0.12, early 1992)	—
1999	Linux 2.2: mature VM	DRI 1.0 (no memory management)
2004	Linux 2.6: rmap, O(1) scheduler	—
2008	—	GEM introduced (shmem-backed objects)
2009	—	TTM introduced (full VM analogy emerges)
2012	—	Radeon TTM eviction matures in production
2014	—	nouveau (community reverse-eng) gains full TTM eviction
2017	HMM merged (4.14)	— (but for GPU use!)
2022	Multi-gen LRU (MGLRU)	—
2023	—	GPUVM merged (`drm_gpuvm` = GPU `mm_struct`)
2024	—	`drm_gpusvm` (SVM layer using HMM)
2025+	—	CPU and GPU MM converging via HMM/SVM/CXL

The GPU memory subsystem recapitulated 40 years of CPU MM evolution in about 15 years — because it was solving the same problem and could learn from existing solutions.

1.12 What This Means For You — And What Comes Next

If you develop GPU drivers or compute runtimes, you already intuitively understand buffer placement, eviction cost, and working-set management. What this column will show you is that every one of those intuitions has a precise, well-studied counterpart in CPU MM — with decades of research, proven algorithms, and battle-tested kernel code you can learn from or call into directly.

Now that we see the parallel from 10,000 feet, we’ll zoom in level by level:

Next Chapters	What We’ll Explore
Part I (Ch. 2–5)	How both systems organize virtual address space
Part II (Ch. 6–9)	How both systems allocate physical storage
Part III (Ch. 10–12)	How both systems map virtual to physical
Part IV (Ch. 13–18)	How both systems evict under pressure (the deepest parallel)
Part V (Ch. 19–22)	How both systems fault-in on demand
Part VI (Ch. 23–25)	How both systems accept hints from userspace
Part VII (Ch. 26–30)	How the two systems are merging
Part VIII (Ch. 31–34)	How to apply this knowledge in practice

Each chapter will start with what you know (the GPU side) and reveal its CPU mirror — building a complete mental model of both systems as one unified architecture.

Previous: Chapter 0 — GPU Memory Is Virtual Memory: Why This Column Exists
Next: Chapter 2 — The Process Address Space: mm_struct as the Original GPU Context
