carwatchdog daemon health checking

Mr.QingBin

Published 2025-10-29 11:05:22 · 1093 views

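Overview

WatchdogProcessService is the core class in the carwatchdog daemon that runs periodic health checks (heartbeat / ping) against registered clients and services. Its responsibilities are:

Periodically send checkIfAlive requests to every registered client/service;
On a timeout or missing response, dump and kill the process and report it to VHAL and the monitor;
Manage registration, unregistration, binder death, session ids and related details;
Support different timeout levels (CRITICAL/MODERATE/NORMAL) and a global ping interval that can be overridden.

The sections below walk through the implementation details and edge-case behavior step by step.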

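Key data structures and constants

kTimeouts: the three timeout levels, matching the AIDL-defined TimeoutLength::TIMEOUT_CRITICAL / TIMEOUT_MODERATE / TIMEOUT_NORMAL.
mClientsByTimeout: std::unordered_map<TimeoutLength, ClientInfoMap>. Registered clients (ClientInfo), grouped by timeout.
mPingedClients: std::unordered_map<TimeoutLength, PingedClientMap>. Pings currently awaiting a reply (a copy of ClientInfo keyed by sessionId), used at the deadline to find which clients did not reply.
mLastSessionId / getNewSessionId(): assigns a fresh session id to each ping; the id is always positive and wraps around instead of overflowing.
mHandlerLooper + MessageHandlerImpl: schedule the periodic health check messages (and the VHAL-related messages) on the handler thread.
kCarWatchdogServiceTimeoutDelay (constant): the timeout level used for the client registered as the "service type", which defaults to CRITICAL (the daemon's special handling of CarWatchdogService).
mIsEnabled: switch that enables/disables health checking (controlled through controlProcessHealthCheck / setEnabled).
mStoppedUserIds: ids of stopped users; clients that belong to a stopped user are skipped.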

Duration for each timeout level (getTimeoutDurationNs):

TIMEOUT_CRITICAL -> 3s
TIMEOUT_MODERATE -> 6s
TIMEOUT_NORMAL -> 12s
(If the interval is overridden through the read-only property, mOverriddenClientHealthCheckWindowNs is used for all three levels.)
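
The lookup described above can be sketched as follows. This is a minimal stand-alone illustration, not the daemon's source: TimeoutConfig and overriddenWindowNs are hypothetical names standing in for the service and its mOverriddenClientHealthCheckWindowNs member.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Mirrors the three AIDL timeout levels named in the text.
enum class TimeoutLength { TIMEOUT_CRITICAL, TIMEOUT_MODERATE, TIMEOUT_NORMAL };

// Hypothetical stand-in for the service; only the timeout lookup is modeled.
struct TimeoutConfig {
    // Set when the read-only property overrides the health check window.
    std::optional<std::chrono::nanoseconds> overriddenWindowNs;

    std::chrono::nanoseconds getTimeoutDurationNs(TimeoutLength timeout) const {
        using namespace std::chrono_literals;
        if (overriddenWindowNs.has_value()) {
            return *overriddenWindowNs;  // one value for every timeout level
        }
        switch (timeout) {
            case TimeoutLength::TIMEOUT_CRITICAL: return 3s;
            case TimeoutLength::TIMEOUT_MODERATE: return 6s;
            case TimeoutLength::TIMEOUT_NORMAL: return 12s;
        }
        assert(false && "unknown TimeoutLength");
        return 0ns;
    }
};
```

With no override set, each level maps to its own duration; once overriddenWindowNs holds a value, every level returns that value.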

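Client registration / death notification (linkToDeath)

registerClient(client, timeout) (regular client)
registerCarWatchdogService(binder, helper) (service type)
registerMonitor(monitor)

On registration the service:

Reads the calling pid/uid, builds a ClientInfo (holding either kClient or kWatchdogServiceHelper plus the binder), and inserts it into mClientsByTimeout[timeout].
Calls linkToDeath on regular clients (through mDeathRegistrationWrapper) so that it receives a callback when the client process crashes or its binder dies, and can remove the client from the registry.
If the insert succeeds and this is the first client for that timeout, calls startHealthCheckingLocked(timeout) to start the periodic check (sendMessageDelayed).

Death notification: handleBinderDeath(void* cookie) is registered as the AIBinder death callback. It looks up and removes the matching client, and handles the monitor case separately.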
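
The registration bookkeeping can be sketched as below. This is a simplified model under stated assumptions: ClientRegistry is a hypothetical class, an int handle stands in for the client's binder, and a set of "active" timeouts stands in for the effect of startHealthCheckingLocked. Checking for duplicates across every timeout level is also an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <set>

enum class TimeoutLength { TIMEOUT_CRITICAL, TIMEOUT_MODERATE, TIMEOUT_NORMAL };

// Hypothetical registry; an int handle stands in for the client's binder.
class ClientRegistry {
public:
    // Returns false on duplicate registration (assumed to span all timeouts).
    bool registerClient(int binderHandle, TimeoutLength timeout) {
        if (isRegistered(binderHandle)) return false;
        std::set<int>& clients = mClientsByTimeout[timeout];
        clients.insert(binderHandle);
        if (clients.size() == 1) {
            // First client for this timeout: stand-in for
            // startHealthCheckingLocked(timeout).
            mActiveTimeouts.insert(timeout);
        }
        return true;
    }

    // Stand-in for the binder death callback: drop the client everywhere.
    void handleBinderDeath(int binderHandle) {
        for (auto& entry : mClientsByTimeout) entry.second.erase(binderHandle);
    }

    bool isRegistered(int binderHandle) const {
        for (const auto& entry : mClientsByTimeout) {
            if (entry.second.count(binderHandle) > 0) return true;
        }
        return false;
    }

    bool isHealthCheckActive(TimeoutLength timeout) const {
        return mActiveTimeouts.count(timeout) > 0;
    }

private:
    std::map<TimeoutLength, std::set<int>> mClientsByTimeout;
    std::set<TimeoutLength> mActiveTimeouts;
};
```

The real service would additionally hold mMutex around these operations and register the death recipient through mDeathRegistrationWrapper; both are left out here to keep the sketch short.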

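Periodic health check (doHealthCheck)

This is the core flow. In simplified form:

mHandlerLooper triggers doHealthCheck(what) on the message thread, where what is the enum value of the timeout. Entry point: MessageHandlerImpl::handleMessage -> kService->doHealthCheck(message.what).
doHealthCheck then does the following:
  Converts what to the timeout level (static_cast<TimeoutLength>(what)).
  Calls dumpAndKillClientsIfNotResponding(timeout) first, to handle clients that were pinged in the previous cycle and have not responded by now (described in the timeout handling section).
  Inside the critical section (holding mMutex), builds a one-shot list clientsToCheck of all currently registered clients that are not filtered out by a stopped user. For each client it:
    generates sessionId = getNewSessionId() and writes it to clientInfo.sessionId;
    inserts a copy of clientInfo into mPingedClients[timeout] with key = sessionId, so that the pending reply can be looked up by sessionId.
  Calls clientInfo.checkIfAlive(timeout) for each entry in clientsToCheck:
    regular client: client->checkIfAlive(sessionId, timeout), an AIDL/NDK RPC to the client;
    service type client: WatchdogServiceHelper->checkIfAlive(binder, sessionId, timeout). The helper is the proxy for the CarService side, and CarService may run more complex checks on other threads or flows.
    If the call returns a non-ok status (the ping could not be sent, or the connection is broken), the session is removed from mPingedClients, so a ping that was never delivered is not later mistaken for a timeout.
  If at least one client was pinged in this round, schedules the next doHealthCheck: durationNs = getTimeoutDurationNs(timeout); mHandlerLooper->sendMessageDelayed(durationNs.count(), mMessageHandler, Message(what)). Each timeout level therefore runs on its own period.
Role of sessionId:
  Prevents mismatched or stale replies: the ping carries a sessionId, the client sends the same sessionId back in its reply (tellClientAlive), and the server looks it up in mPingedClients[timeout] and removes the entry. Only a reply whose sessionId matches counts as a real response.
  getNewSessionId keeps session ids from repeating in the short term, which reduces misjudgments.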
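
A minimal sketch of one ping round, covering the session id allocation and the bookkeeping in mPingedClients. PingRound and its members are hypothetical; the checkIfAlive RPC is modeled as a std::function that returns false when the call cannot be delivered, and locking and rescheduling are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <unordered_map>
#include <vector>

struct ClientInfo {
    int pid = 0;
    std::int32_t sessionId = 0;
    // Stand-in for the checkIfAlive RPC; false means the ping was not delivered.
    std::function<bool(std::int32_t)> checkIfAlive;
};

// Hypothetical model of one timeout level's ping bookkeeping.
class PingRound {
public:
    explicit PingRound(std::int32_t lastSessionId = 0) : mLastSessionId(lastSessionId) {}

    // Session ids stay positive and wrap back to 1 instead of overflowing.
    std::int32_t getNewSessionId() {
        if (mLastSessionId == std::numeric_limits<std::int32_t>::max()) {
            mLastSessionId = 0;
        }
        return ++mLastSessionId;
    }

    // Returns how many pings are awaiting a reply after this round.
    std::size_t pingAll(std::vector<ClientInfo>& clientsToCheck) {
        // Step 1 (under the lock in the real service): assign session ids and
        // record a copy of each client as pending.
        for (ClientInfo& client : clientsToCheck) {
            client.sessionId = getNewSessionId();
            mPingedClients.emplace(client.sessionId, client);
        }
        // Step 2: send the pings; forget sessions whose ping could not be sent.
        for (const ClientInfo& client : clientsToCheck) {
            assert(client.checkIfAlive != nullptr);
            if (!client.checkIfAlive(client.sessionId)) {
                mPingedClients.erase(client.sessionId);
            }
        }
        return mPingedClients.size();
    }

    bool isPending(std::int32_t sessionId) const {
        return mPingedClients.count(sessionId) > 0;
    }

private:
    std::int32_t mLastSessionId;
    std::unordered_map<std::int32_t, ClientInfo> mPingedClients;
};
```

Splitting the round into "record everything as pending" and then "send" mirrors the order described above: an entry exists before the client can possibly reply, and it is dropped again only when the send itself fails.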

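Client response (tellClientAlive)

After receiving checkIfAlive, a client is expected to call back tellClientAlive(sessionId) within the timeout. The call chain is:
Binder RPC -> WatchdogBinderMediator::tellClientAlive -> mWatchdogProcessService->tellClientAlive(client, sessionId)
WatchdogProcessService::tellClientAlive -> tellClientAliveLocked(client->asBinder(), sessionId) (takes the lock)
tellClientAliveLocked walks mPingedClients[timeout] for every timeout and looks for an entry whose sessionId and getAIBinder() both match. If one is found, it is erased from mPingedClients and ok is returned; otherwise EX_ILLEGAL_ARGUMENT is returned (sessionId not found or client not registered).
Once the entry is removed, the client's ping for this cycle counts as answered and the client is not treated as unresponsive.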
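
The lookup in tellClientAliveLocked can be sketched like this. It is an illustrative model only: an int handle replaces the binder comparison done through getAIBinder(), and the Status enum is a hypothetical stand-in for the binder status the real method returns.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <unordered_map>

enum class TimeoutLength { TIMEOUT_CRITICAL, TIMEOUT_MODERATE, TIMEOUT_NORMAL };
enum class Status { OK, EX_ILLEGAL_ARGUMENT };

// Hypothetical pending-ping record; binderHandle stands in for the binder.
struct PingedClient {
    int binderHandle;
};
using PingedClientMap = std::unordered_map<std::int32_t, PingedClient>;

// The caller is assumed to hold the lock that protects pingedClients.
Status tellClientAliveLocked(std::map<TimeoutLength, PingedClientMap>& pingedClients,
                             int binderHandle, std::int32_t sessionId) {
    for (auto& entry : pingedClients) {
        PingedClientMap& clients = entry.second;
        auto it = clients.find(sessionId);
        // Both the session id and the caller's binder must match.
        if (it != clients.end() && it->second.binderHandle == binderHandle) {
            clients.erase(it);
            return Status::OK;
        }
    }
    return Status::EX_ILLEGAL_ARGUMENT;
}
```

Because the entry is erased on the first match, a second reply with the same sessionId is rejected just like a reply with an unknown sessionId.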

Timeout and unresponsive-client handling (dumpAndKillClientsIfNotResponding / dumpAndKillAllProcesses)

Every doHealthCheck starts by running dumpAndKillClientsIfNotResponding(timeout), which:
Walks the sessions still present in mPingedClients[timeout], i.e. clients that were pinged in the previous cycle and had not been removed from mPingedClients by the deadline;
For each unresponsive client, extracts the pid/startTime and related information, removes the client from mClientsByTimeout, and calls clientInfo.prepareProcessTermination() (for a regular client this is the prepareProcessTermination RPC; for the service type it goes through the helper's prepareProcessTermination);
Collects the ProcessIdentifier of every unresponsive process and calls dumpAndKillAllProcesses(processIdentifiers, reportToVhal=true).
dumpAndKillAllProcesses then:
Checks whether a monitor (mMonitor) is registered and returns an error if not;
If reportToVhal == true, first calls reportTerminatedProcessToVhal(...) to report to VHAL (sending the TERMINATED_PROCESS property);
Calls monitor->onClientsNotResponding(processesNotResponding) (the monitor implementation on the Car side usually performs the actual dump and kill).
In short: timeout -> prepare termination -> report to VHAL -> notify the monitor to dump and kill.
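
The ordering inside dumpAndKillAllProcesses can be illustrated with the sketch below. ProcessKiller, its step log and the Monitor callback type are hypothetical; the VHAL report is reduced to a log entry so that the order of operations is observable.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct ProcessIdentifier {
    int pid;
    long long startTimeMillis;
};

// Hypothetical model; it records each step instead of talking to VHAL.
class ProcessKiller {
public:
    using Monitor = std::function<void(const std::vector<ProcessIdentifier>&)>;

    void registerMonitor(Monitor monitor) { mMonitor = std::move(monitor); }

    // Returns false when no monitor is registered (the error case in the text).
    bool dumpAndKillAllProcesses(const std::vector<ProcessIdentifier>& processes,
                                 bool reportToVhal) {
        if (mMonitor == nullptr) {
            mSteps.push_back("error: no monitor");
            return false;
        }
        if (reportToVhal) {
            // Stand-in for reportTerminatedProcessToVhal(...).
            mSteps.push_back("reportTerminatedProcessToVhal");
        }
        // Stand-in for monitor->onClientsNotResponding(...).
        mSteps.push_back("onClientsNotResponding");
        mMonitor(processes);
        return true;
    }

    const std::vector<std::string>& steps() const { return mSteps; }

private:
    Monitor mMonitor;
    std::vector<std::string> mSteps;
};
```

The daemon itself only reports and notifies in this flow; the dump and kill happen in whatever the monitor callback does.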

tellCarWatchdogServiceAlive (CarService as the pinged "service type" client)

For the service type (CarWatchdogService) the daemon does not call CarService's checkIfAlive RPC directly; the call goes through the WatchdogServiceHelper proxy/wrapper. When CarService responds, it may include the list of processes it monitors that did not respond (clientsNotResponding). WatchdogProcessService::tellCarWatchdogServiceAlive:
After tellClientAliveLocked succeeds (confirming that CarService itself is alive), calls dumpAndKillAllProcesses(clientsNotResponding, reportToVhal=true) to handle those processes.
This lets CarWatchdogService report finer-grained diagnostics to the daemon.

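Binder permissions and system-only interfaces

WatchdogInternalHandler (the handler behind the internal ICarWatchdog interface) calls checkSystemUser(...) in every sensitive method to verify that the caller is the SYSTEM uid (IPCThreadState::self()->getCallingUid() == AID_SYSTEM). Only processes running as system (for example CarService) can therefore call the internal management/control interfaces. The regular public client interface (ICarWatchdog) has its own access control, whereas WatchdogInternalHandler is dedicated to system/internal operations (registering CarService, tellCarWatchdogServiceAlive, dump control, and so on).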
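
The guard pattern can be sketched as follows. This is an assumption-laden illustration: the calling uid is passed in as a parameter instead of being read from IPCThreadState, CallResult and the error message are hypothetical, and kAidSystem is hard-coded to 1000, the value of AID_SYSTEM on Android.

```cpp
#include <cassert>
#include <string>

// AID_SYSTEM is uid 1000 on Android; hard-coded to keep the sketch stand-alone.
constexpr unsigned int kAidSystem = 1000;

enum class Status { OK, EX_SECURITY };

// Hypothetical result type standing in for a binder status.
struct CallResult {
    Status status;
    std::string message;
};

// The real handler reads the uid from IPCThreadState; here it is a parameter.
CallResult checkSystemUser(unsigned int callingUid, const std::string& methodName) {
    if (callingUid != kAidSystem) {
        return {Status::EX_SECURITY,
                "Calling process lacks the privilege to call " + methodName};
    }
    return {Status::OK, ""};
}

// Every sensitive method starts with the same guard.
CallResult tellCarWatchdogServiceAlive(unsigned int callingUid) {
    CallResult check = checkSystemUser(callingUid, "tellCarWatchdogServiceAlive");
    if (check.status != Status::OK) {
        return check;  // rejected before any state is touched
    }
    // ... the privileged work would happen here ...
    return {Status::OK, "handled"};
}
```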

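Threads and locks

mHandlerLooper (the message loop) handles periodic scheduling and long-running tasks (such as the VHAL heartbeat check and caching the VHAL pid). MessageHandlerImpl::handleMessage calls doHealthCheck, reportWatchdogAliveToVhal and similar methods on the handler thread.
mMutex protects shared state such as mClientsByTimeout, mPingedClients, mMonitor and mVhalProcessIdentifier. Most reads and writes of that state happen while holding the lock.
The client checkIfAlive RPC (kClient->checkIfAlive) is sent to the client as an AIDL/NDK RPC (synchronous or asynchronous) and may kick off asynchronous processing inside the client; the reply comes back to the daemon as a separate binder RPC.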
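
VHAL (relationship to health checking)

The daemon also manages the VHAL heartbeat (VHAL_HEARTBEAT / WATCHDOG_ALIVE) and calls reportWatchdogAliveToVhal() and checkVhalHealth() when needed. These are health signals that run in parallel with the client health checks; an abnormal VHAL heartbeat can lead to the VHAL process being terminated (terminateVhal).
For AIDL VHAL, the daemon may need to request the VHAL process PID from CarWatchdogService (requestAidlVhalPid), which requires CarService to act as a bridge.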

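Error cases and edge cases

If a ping cannot be sent (the RPC fails), doHealthCheck removes that session from mPingedClients right after the failed send, so that "could not send" is not mistaken for "timed out without responding".
mStoppedUserIds: when a user has been stopped, clients belonging to that user are skipped (not pinged) in later health checks.
Duplicate registration: registerClient detects a repeated registration (it checks findClientAndProcessLocked) and produces ERR_DUPLICATE_REGISTRATION, which toScopedAStatus then converts to ok or an exception.
System shutdown: isSystemShuttingDown checks sys.powerctl; if the system is shutting down, some termination actions are skipped.
Permission check: the system APIs of WatchdogInternalHandler reject callers that are not the system uid (EX_SECURITY).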

Simplified interaction sequence (key steps)

Client / CarService calls registerClient/registerCarWatchdogService -> the client is inserted into mClientsByTimeout, and linkToDeath registers the death callback.
HandlerLooper triggers doHealthCheck(timeout=CRITICAL):
  dumpAndKillClientsIfNotResponding(CRITICAL) handles clients left unresponsive from the previous cycle;
  each client gets a sessionId and its clientInfo is placed in mPingedClients[CRITICAL];
  client->checkIfAlive(sessionId, timeout) or helper->checkIfAlive(...) is called.
The client receives the ping, runs its check in-process and calls tellClientAlive(sessionId) (or CarService uses tellCarWatchdogServiceAlive to return clientsNotResponding);
On the server side, tellClientAlive -> tellClientAliveLocked finds the matching sessionId and removes the mPingedClients entry, which marks success;
If mPingedClients still holds entries when the timeout expires, dumpAndKillClientsIfNotResponding turns them into ProcessIdentifier values and calls prepareProcessTermination, then dumpAndKillAllProcesses -> reportTerminatedProcessToVhal -> monitor->onClientsNotResponding.

Typical scenario (CRITICAL timeout)

TIMEOUT_CRITICAL (3s):
t=0: doHealthCheck inserts sessionId=100 into mPingedClients[CRITICAL]
t≈0: doHealthCheck pings the client (sessionId=100)
The client calls back tellClientAlive(100) within 2s -> the server removes the entry -> OK
If the client has still not called back after 3s -> dumpAndKillClientsIfNotResponding treats it as unresponsive at the next doHealthCheck and triggers the dump and kill.
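
The timeline above can be replayed with a small helper. unresponsiveAtNextCheck is a hypothetical function: it takes, per sessionId, the time at which tellClientAlive arrived (measured from the ping, or empty if it never arrived) and returns the sessions still pending when the next doHealthCheck runs. Treating a reply that lands exactly at the deadline as late is a simplification of this sketch; in practice that case is a race.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// replyAt maps sessionId -> time of tellClientAlive after the ping,
// or std::nullopt when the client never replied.
std::vector<std::int32_t> unresponsiveAtNextCheck(
        const std::map<std::int32_t, std::optional<std::chrono::seconds>>& replyAt,
        std::chrono::seconds window) {
    std::vector<std::int32_t> pending;
    for (const auto& entry : replyAt) {
        // Simplification: only replies strictly before the deadline count.
        const bool repliedInTime = entry.second.has_value() && *entry.second < window;
        if (!repliedInTime) {
            pending.push_back(entry.first);
        }
    }
    return pending;
}
```

With a 3s window, a session answered after 2s is cleared, while a session answered after 4s or never answered is still pending at the next check and would be handed to the dump and kill path.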
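
Extension and configuration points

The read-only property ro.carwatchdog.client_healthcheck.interval overrides the ping interval (every timeout level then uses the same overridden value).
The WatchdogServiceHelper abstraction lets the CarService side implement more complex service-level check logic (rather than a plain RPC return).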
