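Automatic bad-card scanning and handling for GPU nodes in Kubernetes: when the probe command fails, the node is marked as having a bad card, tainted, and a notification is sent to a Feishu group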

ning1875/gpu-badcard-monitor

what and why: automatic bad-card scanning and handling for GPU nodes in Kubernetes

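  • Bad cards are a common occurrence in large-scale GPU clusters
  • A component is needed that actively detects and handles nodes with bad cards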

how: this project uses the simplest approach, active probing by running a command (cmd)

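  • Feature 01: when the probe command fails, mark the node as having a bad card, taint it, and send a notification to a Feishu group
  • Feature 02: provide limit-kill enforcement for GPU virtualization tasks based on Alibaba Cloud vmem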

Deployment

  • Apply the YAML under deploy; the image must be built beforehand

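This is only a small part of Xiaoyi's (小乙老师) day-to-day k8s operations development work. For related hiring or course inquiries, message me privately at ning1875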

Usage

# False positive on a single node: remove the taint
kubectl taint node 10.50.0.69 badCardScan-

# Large-scale false positives: delete the DaemonSet
kubectl delete ds -n infra gpu-badcard-monitor

# View the times at which nodes were tainted
kubectl get node -L bad_card_taint_time | grep 2025
10.50.0.242   Ready,SchedulingDisabled      <none>   5h37m   v1.30.1   2025-05-26_16-14-15

Commands for querying ECC errors


 nvidia-smi --query-gpu=uuid,index,power.draw,power.limit,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.sram --format=csv
 nvidia-smi --query-gpu=uuid,index --format=csv


ECC example

root@10-188-52-137:~# nvidia-smi --query-gpu=uuid,index,power.draw,power.limit,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.sram --format=csv
uuid, index, power.draw [W], power.limit [W], ecc.errors.uncorrected.volatile.dram, ecc.errors.uncorrected.volatile.sram, ecc.errors.uncorrected.aggregate.dram, ecc.errors.uncorrected.aggregate.sram
GPU-a1633ccb-7a0c-5369-9407-f5e68626802d, 0, 33.36 W, 70.00 W, 0, 0, 0, 0
GPU-87ee8f8a-58aa-d830-f6ae-8c63b3ab2e79, 1, 32.02 W, 70.00 W, 0, 0, 106, 0
GPU-a720407c-939b-0c02-aa0c-0bc188e694a0, 2, 19.26 W, 70.00 W, 0, 0, 0, 0
GPU-81237da4-8b6d-86ff-ea21-b262fa4a8058, 3, 20.16 W, 70.00 W, 0, 0, 0, 0
GPU-c1c879ca-6e1e-a7c7-3249-02e16364a948, 4, 17.84 W, 70.00 W, 0, 0, 0, 0
GPU-bed8e9a7-1e33-ccf0-d3c7-108580378a04, 5, 17.73 W, 70.00 W, 0, 0, 0, 0

Metric example

			"target_node",
			"target_name",
			"target_namespace",
			"target_card_idx",  // card index
			"target_card_uuid", // card UUID
			"target_card_mem",  // requested memory

gpu_badcard_monitor_pod_info{target_node="192.168.0.101",target_name="tensor-flow-001",target_namespace="recommand-001",target_card_idx="0",target_card_mem="8",target_card_uuid="GPU-ce5d3093-5aba-1fab-74e9-2564625cc577"} 1
