Fixing "RuntimeError: Unexpected error from cudaGetDeviceCount(). Error 802: system not yet initialized"
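Reproducing the problem
I recently received a new server and installed the CUDA 12.1 driver and CUDA toolkit on it. Starting the vLLM service then failed with the following error: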
[root@localhost ~]#python3.9 /root/FastChat/fastchat/serve/vllm_worker.py --model-path /run/model/qwen-110b/ --num-gpus 8 --dtype bfloat16
2024-06-21 00:50:37 | ERROR | stderr | Traceback (most recent call last):
2024-06-21 00:50:37 | ERROR | stderr | File "/root/FastChat/fastchat/serve/vllm_worker.py", line 41, in <module>
2024-06-21 00:50:37 | ERROR | stderr | seed = torch.cuda.current_device()
2024-06-21 00:50:37 | ERROR | stderr | File "/usr/local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 778, in current_device
2024-06-21 00:50:37 | ERROR | stderr | _lazy_init()
2024-06-21 00:50:37 | ERROR | stderr | File "/usr/local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
2024-06-21 00:50:37 | ERROR | stderr | torch._C._cuda_init()
2024-06-21 00:50:37 | ERROR | stderr | RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
[root@localhost ~]#
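Root cause
The nvidia-fabricmanager service was not running, so multi-GPU workloads could not initialize. CUDA error 802 (cudaErrorSystemNotReady) means the system is not yet ready for CUDA work, usually because a required driver daemon is not up; on NVSwitch-based multi-GPU servers that daemon is typically NVIDIA Fabric Manager.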
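One way to confirm this diagnosis is to ask systemd for the state of the service. The sketch below is only an illustration: the helper name service_state is hypothetical, and it assumes the host uses systemd and that the unit is named nvidia-fabricmanager, as in this post.

```python
import shutil
import subprocess


def service_state(unit: str = "nvidia-fabricmanager") -> str:
    """Return the state systemd reports for `unit`, or "unknown" if it cannot be queried."""
    # Without systemctl on PATH there is nothing to query.
    if shutil.which("systemctl") is None:
        return "unknown"
    try:
        # `systemctl is-active` prints the state (for example "active" or
        # "inactive") and exits non-zero when the unit is not active,
        # so the exit code is deliberately not checked here.
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            check=False,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    return result.stdout.strip() or "unknown"


if __name__ == "__main__":
    print(service_state())
```

Any result other than "active" on a multi-GPU server of this kind is consistent with the error shown in the traceback.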
Solution
Enable the service so that it starts at boot, start it now, then check that it is running:
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
systemctl status nvidia-fabricmanager
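Once the service reports that it is active, CUDA initialization can be re-checked from Python before relaunching the worker. This is a minimal sketch, assuming PyTorch is installed in the same interpreter that runs the worker; check_cuda is a hypothetical helper name and is not part of FastChat or vLLM.

```python
def check_cuda() -> str:
    """Try to initialize CUDA through PyTorch and describe the outcome."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    try:
        # torch.cuda.init() runs the same lazy initialization that failed
        # in the traceback (torch.cuda.current_device() -> _lazy_init()).
        torch.cuda.init()
    except (RuntimeError, AssertionError) as exc:
        # RuntimeError covers driver-level failures such as Error 802;
        # AssertionError is raised by CPU-only PyTorch builds.
        return f"CUDA initialization failed: {exc}"
    return f"CUDA initialized, {torch.cuda.device_count()} GPU(s) visible"


if __name__ == "__main__":
    print(check_cuda())
```

If the message still mentions Error 802, the output of systemctl status nvidia-fabricmanager is the next place to look.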
