A server's operating system hung and stopped responding. The system log showed the following:

Nov 10 06:51:01 localhost kernel:      Tainted: G           OE    --------- -  - 4.18.0-477.10.1.el8_8.x86_64 #1
Nov 10 06:51:01 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 10 06:51:01 localhost kernel: task:filebeat        state:D stack:    0 pid:13205 ppid:     1 flags:0x00000080
Nov 10 06:51:01 localhost kernel: Call Trace:
Nov 10 06:51:01 localhost kernel: __schedule+0x2d1/0x870
Nov 10 06:51:01 localhost kernel: ? call_function_interrupt+0xa/0x20
Nov 10 06:51:01 localhost kernel: schedule+0x55/0xf0
Nov 10 06:51:01 localhost kernel: io_schedule+0x12/0x40
Nov 10 06:51:01 localhost kernel: migration_entry_wait_on_locked+0x1ea/0x290
Nov 10 06:51:01 localhost kernel: ? filemap_fdatawait_keep_errors+0x50/0x50
Nov 10 06:51:01 localhost kernel: do_swap_page+0x5b0/0x710
Nov 10 06:51:01 localhost kernel: ? pmd_devmap_trans_unstable+0x2e/0x40
Nov 10 06:51:01 localhost kernel: ? handle_pte_fault+0x5d/0x880
Nov 10 06:51:01 localhost kernel: __handle_mm_fault+0x453/0x6c0
Nov 10 06:51:01 localhost kernel: handle_mm_fault+0xca/0x2a0
Nov 10 06:51:01 localhost kernel: __do_page_fault+0x1f0/0x450
Nov 10 06:51:01 localhost kernel: do_page_fault+0x37/0x130
Nov 10 06:51:01 localhost kernel: ? page_fault+0x8/0x30
Nov 10 06:51:01 localhost kernel: page_fault+0x1e/0x30
Nov 10 06:51:01 localhost kernel: RIP: 0033:0x15c297e
Nov 10 06:51:01 localhost kernel: Code: Unable to access opcode bytes at RIP 0x15c2954. 

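Cause analysis:

The filebeat process took a page fault and then blocked in the swap / page migration path (do_swap_page → migration_entry_wait_on_locked). It stayed in uninterruptible sleep (state D) long enough to trip the hung task detector.

This is not a filebeat bug. A trace like this usually points to one of:

  • Excessive memory pressure

  • Stalled I/O on swap or backend storage (disk / SAN / virtual disk)

  • A NUMA / page migration whose page stays locked, or whose waiter is never woken, so the faulting task cannot be scheduled again.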

Tainted: G           OE    --------- -  - 

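Flag  Meaning
G     No proprietary module is loaded (the alternative at this position is P); it does not by itself mean the kernel is free of errors
O     An out-of-tree module has been loaded
E     An unsigned module has been loaded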
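
To read the Tainted field without looking the letters up each time, a small decoder helps. The following is a minimal sketch: the table covers only a handful of flags (it is deliberately not the kernel's full list), and the function name decode_taint is made up for this example. Spaces and the RHEL-specific "-" placeholders are skipped.

```python
# Partial table of taint letters; deliberately incomplete.
TAINT_FLAGS = {
    "G": "only GPL-compatible modules loaded (no proprietary module)",
    "P": "proprietary module loaded",
    "F": "module was force-loaded",
    "W": "kernel issued a warning earlier",
    "O": "out-of-tree module loaded",
    "E": "unsigned module loaded",
}


def decode_taint(field):
    """Return (letter, meaning) pairs for each flag set in a Tainted field."""
    decoded = []
    for ch in field:
        if ch in (" ", "-"):
            # Unset positions and placeholder columns carry no information.
            continue
        decoded.append((ch, TAINT_FLAGS.get(ch, "not covered by this sketch")))
    return decoded


if __name__ == "__main__":
    for letter, meaning in decode_taint("G           OE    --------- -  -"):
        print(letter, meaning)
```

For the log above this reports G, O and E: a third-party, unsigned module is loaded, which is common on servers with vendor agents or drivers and is not evidence of a hardware error.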

A Red Hat Knowledgebase article (Solution 7014646) describes this hung_task_timeout_secs + migration_entry_wait_on_locked symptom on RHEL 8.8 and RHEL 8.6 EUS.

Resolution

Red Hat Enterprise Linux 8.8

  • The issue has been resolved with kernel-4.18.0-477.13.1.el8_8 via errata: RHSA-2023:3349.


# rpm -qp kernel-4.18.0-477.13.1.el8_8.x86_64.rpm --changelog | grep 2188249
- migrate: grab the compound head in migration_entry_wait_on_locked (Nico Pache) [2189629 2188249]

Possible workaround:

  • Boot the system with an older kernel released before the RHEL8.8 (GA).
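
A quick way to tell whether a host in the RHEL 8.8 stream still runs a kernel older than the fixed build is to compare its release string (the output of uname -r) against 4.18.0-477.13.1.el8_8. The sketch below is an illustration only: the helper names are invented here, it handles the el8_8 stream alone, and it returns None for anything else (for example RHEL 8.6 EUS, whose fixed build is not listed in this note).

```python
import re

# Fixed build for the RHEL 8.8 stream, taken from the erratum quoted above.
FIXED_EL8_8 = (4, 18, 0, 477, 13, 1)

RELEASE_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)-(\d+(?:\.\d+)*)\.(el\d+(?:_\d+)?)")


def parse_release(release):
    """Split a release like '4.18.0-477.10.1.el8_8.x86_64' into numbers and dist tag."""
    m = RELEASE_RE.match(release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    numbers = tuple(int(x) for x in m.group(1, 2, 3))
    numbers += tuple(int(x) for x in m.group(4).split("."))
    return numbers, m.group(5)


def predates_fix(release):
    """True/False for el8_8 kernels, None for streams this sketch does not cover."""
    numbers, dist = parse_release(release)
    if dist != "el8_8":
        return None
    return numbers < FIXED_EL8_8


if __name__ == "__main__":
    import platform
    print(predates_fix(platform.release()))
```

For the kernel in the log, 4.18.0-477.10.1.el8_8.x86_64, predates_fix returns True, so updating to the fixed build (or booting an older pre-8.8 kernel as a workaround) applies.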

Root Cause

  • There is a regression bug since RHEL8 kernel commit a598e2338f01 ("mm/migrate.c: rework migration_entry_wait() to not take a pageref") was introduced in RHEL-8.8 (GA).

  • The RHEL-8.8 kernel patch note explains how to resolve the issue:


commit f20af36bf5b7c25b94f73263629c202753d470d7
Author: Nico Pache <npache@redhat.com>
Date:   Mon Apr 24 14:18:16 2023 -0600

    migrate: grab the compound head in migration_entry_wait_on_locked

    Y-Commit 22609d42496e64d42bbb79b6929d2b6c3b47fc2f

    RHEL commit a598e2338f01 ("mm/migrate.c: rework migration_entry_wait()
    to not take a pageref") differs from upstream due to the folio changes.
    In converting the function to work with the page struct I mistakenly
    forgot to make sure we are operating on the compound page head. Without
    this we are occasional splats of hung tasks due to the page never being
    woken up.

    Upstream-status: RHEL-only
    O-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2188249
    Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2189629
    Signed-off-by: Nico Pache <npache@redhat.com>

diff --git a/mm/filemap.c b/mm/filemap.c
index b8fa03e9a685..5b13e47f1fbe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1401,7 +1401,7 @@ void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
        unsigned long pflags;
        bool in_thrashing;
        wait_queue_head_t *q;
-       struct page *page = pfn_swap_entry_to_page(entry);
+       struct page *page = compound_head(pfn_swap_entry_to_page(entry));

        q = page_waitqueue(page);
        if (!PageUptodate(page) && PageWorkingset(page)) {

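When a task waits on a migration entry, the RHEL 8.8 kernel did not make sure it was operating on the head page of a compound page. The task ended up waiting under the wrong page (a tail page), the matching wakeup never arrived, and the hung task detector fired.

The combination

  migration_entry_wait_on_locked
  do_swap_page
  hung_task_timeout_secs

in the log is the direct signature of this bug.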
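
When many hosts need checking, the signature can be searched for mechanically. The sketch below assumes log lines shaped like the excerpt at the top of this note; the regular expression and function names are made up for illustration. Frames prefixed with "?" are ignored because the kernel marks them as unreliable, and the remaining frames must contain io_schedule, migration_entry_wait_on_locked and do_swap_page in that order.

```python
import re

# Matches "kernel: [? ]symbol+0xOFF/0xSIZE" as printed in the call trace.
FRAME_RE = re.compile(r"kernel:\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")

SIGNATURE = ["io_schedule", "migration_entry_wait_on_locked", "do_swap_page"]


def reliable_frames(lines):
    """Return symbol names from call-trace lines, skipping '?' (unreliable) frames."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m and not m.group(1):
            frames.append(m.group(2))
    return frames


def matches_signature(lines):
    """True if the reliable frames contain SIGNATURE as an ordered subsequence."""
    remaining = iter(reliable_frames(lines))
    # 'name in remaining' consumes the iterator, which enforces the ordering.
    return all(name in remaining for name in SIGNATURE)
```

A match only says the trace has the same shape as the one in the Red Hat article. Whether the host is actually affected still depends on the kernel build it runs.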


What does migration_entry_wait_on_locked do?

1️⃣ Typical call path

page_fault
 └─ handle_mm_fault
     └─ do_swap_page
         └─ migration_entry_wait_on_locked
             └─ io_schedule

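2️⃣ What migration_entry_wait_on_locked is responsible for

When:

  • the page is being migrated (NUMA balancing / compaction), so the faulting PTE holds a migration entry and the fault arrives through the swap-in path

  • or the page is not yet up to date because I/O is still filling it

  • and the page is locked

👉 the current task must sleep and be woken once the migration completes.

Key point:

Both the wait and the wakeup go through the waitqueue associated with the page.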


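The actual bug: the wrong page pointer is used for a compound page

1️⃣ What a compound page is (important)

In the RHEL 8.x kernel (4.18-based, where this code still works on struct page rather than on folios):

  • THP (Transparent Huge Page)

  • huge pages

  • large page cache pages

are all represented as compound pages:


compound page:
  head page   ← the only valid "control page"
  tail page
  tail page
  ...

📌 Rule:

  • the lock (PageLocked)

  • the waitqueue

  • the wakeup
    👉 all happen on the head page only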
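
The head/tail relationship can be pictured with a toy user-space model. This is only a sketch of the idea: the Page class, make_compound and this compound_head are simplified stand-ins written for this note, not the kernel's data structures. Each tail page keeps a reference to its head, and compound_head maps any page of the group back to that head.

```python
class Page:
    """Toy stand-in for struct page: a frame number plus an optional head link."""

    def __init__(self, pfn, head=None):
        self.pfn = pfn
        self.head = head  # None for a head page or an ordinary (non-compound) page


def compound_head(page):
    """Return the head page for a tail page, or the page itself otherwise."""
    return page.head if page.head is not None else page


def make_compound(first_pfn, order):
    """Build a toy compound page of 2**order pages; index 0 is the head."""
    head = Page(first_pfn)
    tails = [Page(first_pfn + i, head=head) for i in range(1, 1 << order)]
    return [head] + tails


if __name__ == "__main__":
    pages = make_compound(0x200, 9)  # 512 pages, the size of a 2 MiB THP on x86_64
    print(compound_head(pages[3]).pfn == 0x200)
```

Whichever of the 512 pages a lookup lands on, compound_head leads to the same head page, and that is the page the lock and wakeup rules above refer to.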


2️⃣ The buggy old code


struct page *page = pfn_swap_entry_to_page(entry);

The problem is right here 👆

  • pfn_swap_entry_to_page()
    👉 may return a tail page

  • rather than the compound head

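3️⃣ What are the consequences? (fatal)

What the code does next (simplified):


q = page_waitqueue(page);
wait_event(q, ...);

But if:

  • you sleep on the tail page's waitqueue

  • the real wakeup happens on the head page

👉 Result:

  • migration completes ✔

  • wake_up is issued on the head page ✔

  • but the task is asleep on the tail page's waitqueue ❌

  • it never receives the wakeup

This explains what you saw:


task state: D
hung_task_timeout_secs

⚠️ The I/O is not actually slow; the wakeup is lost.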
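
The lost wakeup can be reproduced in a few lines of deterministic user-space code. The model below is a deliberately simplified assumption, not kernel code: pages are plain frame numbers, every compound page is assumed to be aligned to 2**order frames, and PageWaitQueues and fault_and_unlock are names invented for this sketch. A task goes to sleep keyed on one frame number, and the unlock at the end of migration wakes only the waiters keyed on the head.

```python
from collections import defaultdict


def compound_head(pfn, order=9):
    """Toy rule: the head is the frame number rounded down to a 2**order boundary."""
    return pfn & ~((1 << order) - 1)


class PageWaitQueues:
    """Waiters are tracked per frame number; a wakeup only reaches its own key."""

    def __init__(self):
        self.sleeping = defaultdict(list)

    def sleep_on(self, pfn, task):
        self.sleeping[pfn].append(task)

    def wake(self, pfn):
        return self.sleeping.pop(pfn, [])

    def still_asleep(self):
        return sorted(t for tasks in self.sleeping.values() for t in tasks)


def fault_and_unlock(entry_pfn, use_head):
    """Simulate one fault on a migrating page followed by the end of migration."""
    queues = PageWaitQueues()
    # Buggy variant waits on whatever page the entry names; fixed variant uses the head.
    wait_pfn = compound_head(entry_pfn) if use_head else entry_pfn
    queues.sleep_on(wait_pfn, "filebeat")
    # Migration finishes: the unlock wakes waiters keyed on the head page.
    queues.wake(compound_head(entry_pfn))
    return queues.still_asleep()


if __name__ == "__main__":
    print(fault_and_unlock(0x203, use_head=False))  # entry names a tail page
    print(fault_and_unlock(0x203, use_head=True))   # same fault, waiting on the head
```

With an entry that names a tail page, the first call leaves filebeat asleep forever while the second leaves nobody waiting. If the entry happens to name the head page, both variants behave the same, which fits the "occasional" hangs mentioned in the commit message.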


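How does the commit fix it?

1️⃣ The core fix is a single line (but a critical one)


- struct page *page = pfn_swap_entry_to_page(entry);
+ struct page *page = compound_head(pfn_swap_entry_to_page(entry));

2️⃣ What the change means

It guarantees that:

  • whether the swap entry points to:

    • a head page

    • or a tail page

  • the page operated on is always the compound head

which in turn guarantees:

  • the waitqueue is the correct one

  • wake_up reaches every waiter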
