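Cluster failure caused by ceph mon
Author: 伍增田 Tommy WU zxpns18@126.com
The environment has 2 mons, controller and compute-10-80-7-70. The latter node is faulty and needs to be reinstalled, so the first step was to remove it from the cluster.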

After running ceph mon remove compute-10-80-7-70, the ceph-mon service became unavailable: every ceph command hangs and never returns.
The mon log shows the following errors:
2025-08-06 22:12:38.876 7f4d97336700 1 mon.controller@0(probing) e5 handle_auth_request failed to assign global_id
2025-08-06 22:12:38.947 7f4d97336700 1 mon.controller@0(probing) e5 handle_auth_request failed to assign global_id
2025-08-06 22:12:39.679 7f4d97336700 1 mon.controller@0(probing) e5 handle_auth_request failed to assign global_id
2025-08-06 22:12:40.545 7f4d97336700 1 mon.controller@0(probing) e5 handle_auth_request failed to assign global_id
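The state in parentheses (probing) is the useful part of these lines. As a minimal sketch, the helper below pulls the most recent state out of a mon log; the name `mon_state_from_log` and the log path you pass in are hypothetical, and it only pattern-matches lines shaped like the ones above.

```shell
# Hypothetical helper: print the most recent mon state found in a ceph-mon log.
# It matches tokens shaped like "mon.controller@0(probing)" and keeps the last one.
mon_state_from_log() {
    # $1: path to the mon log file (an assumption; adjust to your deployment)
    grep -o 'mon\.[^@ ]*@-\{0,1\}[0-9]*([a-z]*)' "$1" | tail -n 1 | sed 's/.*(\(.*\))/\1/'
}

# While the ceph CLI hangs, the daemon can still be queried on its own host
# through the admin socket, which does not need quorum:
# ceph daemon mon.controller mon_status
```

A state that stays at probing means the mon is still looking for the peers listed in its monmap.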

[root@controller ceph]# ceph-mon -i controller --extract-monmap /tmp/monmap
2025-08-06 22:14:03.111 7fa09ab5e1c0 -1 wrote monmap to /tmp/monmap

[root@controller ceph]# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 5
fsid 00d24b66-9bab-40fd-80fb-873bdf9c631a
last_changed 2025-08-06 21:27:27.873997
created 2025-07-07 01:08:35.230962
min_mon_release 14 (nautilus)
0: [v2:10.80.7.89:3300/0,v1:10.80.7.89:6789/0] mon.controller
1: [v2:10.80.7.70:3300/0,v1:10.80.7.70:6789/0] mon.compute-10-80-7-70
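This shows that ceph mon remove compute-10-80-7-70 did not clear the node from store.db; it appears to have removed compute-10-80-7-70 only in memory. With two monitors still listed in the monmap, mon.controller alone cannot form a quorum, which is consistent with the probing state in the log.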
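The same check can be scripted. The sketch below lists the monitor names from `monmaptool --print` output; `monmap_names` is a hypothetical helper, and it assumes the "<rank>: [addrs] mon.<name>" line shape shown in the output above.

```shell
# Hypothetical helper: read `monmaptool --print` output on stdin and
# print one monitor name per line, without the "mon." prefix.
monmap_names() {
    awk '/^[0-9]+: / { name = $NF; sub(/^mon\./, "", name); print name }'
}

# Example, run on the mon host against an extracted map:
# monmaptool --print /tmp/monmap | monmap_names | grep -qx 'compute-10-80-7-70' && echo "still in monmap"
```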

[root@controller ceph]# monmaptool --rm compute-10-80-7-70 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: removing compute-10-80-7-70
monmaptool: writing epoch 5 to /tmp/monmap (1 monitors)

[root@controller ceph]# ceph-mon -i controller --inject-monmap /tmp/monmap
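Once mon.controller is started again with the injected monmap, the mon cluster works normally. The mon.compute-10-80-7-70 node can then be reinstalled, reconfigured, and added back.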
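The whole sequence can be collected into one script. This is only a sketch under stated assumptions: the function names `run` and `fix_monmap` are made up for illustration, and it assumes the mon runs as the systemd unit ceph-mon@<id>, which may not match every deployment. The daemon is stopped first because the monmap should be extracted and injected while ceph-mon is not running.

```shell
# Hypothetical wrapper: with DRY_RUN=1 (the default here) commands are only
# printed, so the sequence can be reviewed before anything is executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: id of the surviving mon, $2: name of the mon to drop, $3: scratch file for the monmap
fix_monmap() {
    mon_id=$1; dead_mon=$2; map=$3
    # Assumption: the mon is managed as systemd unit ceph-mon@<id>.
    run systemctl stop "ceph-mon@${mon_id}" &&
    run ceph-mon -i "$mon_id" --extract-monmap "$map" &&
    run monmaptool --rm "$dead_mon" "$map" &&
    run ceph-mon -i "$mon_id" --inject-monmap "$map" &&
    run systemctl start "ceph-mon@${mon_id}"
}

# Review first:  DRY_RUN=1 fix_monmap controller compute-10-80-7-70 /tmp/monmap
# Then execute:  DRY_RUN=0 fix_monmap controller compute-10-80-7-70 /tmp/monmap
```

Chaining the steps with && stops the sequence at the first failure, so a failed extract never leads to an inject of a stale or empty file.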
