TensorRT-LLM & Triton Server Deployment Notes

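Deploying with Docker is not strictly required, but in practice, configuring the TRT and NTIS environments without the official images produced a string of compilation errors caused by version mismatches. For example, building the mpi4py API failed because the OpenMPI version (4.1.6) that ships with our server's OS (Ubuntu 24.04) is newer than what the build expects. Problems of this kind cost a great deal of time.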

✦昨夜星辰✦ · Published 2024-10-15 19:50:58

Contents

Prerequisite Environment Setup
Two Docker Containers
- TensorRT-LLM Container Deployment
- Triton Container Deployment
About In-flight Batching
About Decoupled Mode (Decoupled Models)

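Prerequisite Environment Setup

This section covers what has to be installed on a new machine (new GPUs). Skip it if these components are already installed.

NVIDIA Driver Installation

This has two parts: the CUDA driver and the GPU driver. The only document you need is the NVIDIA CUDA Installation Guide for Linux. It is long, but you only have to follow these sections in order (install through the package manager wherever possible, because there are too many dependencies to handle by hand):
- 18 - Remove old versions cleanly (optional)
- 3.9/3.2 - Install the CUDA Toolkit/Driver
- 4 - Install the GPU driver
Note: if errors such as "NVME error" or "NCCL error" appear at runtime, the driver installation is usually at fault; reinstall it.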

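Nvidia-Fabricmanager Installation

Overview: NVIDIA Fabric Manager is a software service that manages the interconnect between NVIDIA GPUs. When you use NVIDIA GPUs with NVLink and NVSwitch (such as the A100 or A800), you need to install the matching version of Fabric Manager so that the GPUs can communicate through NVSwitch.
[Important] Multi-GPU machines that use NVSwitch must have nvidia-fabricmanager installed; see Fabric Manager for NVIDIA NVSwitch Systems.
Note: if errors such as "error 802" or "System Not Initialized" appear at runtime, Fabric Manager is usually not installed.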

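# Install (the Fabric Manager version must match the installed driver version)
sudo apt-get install nvidia-fabricmanager

# Start the service
sudo systemctl start nvidia-fabricmanager

# Check that the service started correctly
sudo systemctl status nvidia-fabricmanager
# Then check that the GPU has registered with Fabric Manager
nvidia-smi -q -i 0 | grep -i -A 2 Fabric
# While registration is in progress, this returns:
#    Fabric
#        State                             : In Progress
#        Status                            : N/A
# Once registration has completed, this returns:
#    Fabric
#        State                             : Completed
#        Status                            : Success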
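
The registration check above can also be scripted. The following is a minimal sketch, assuming the `nvidia-smi -q` output has the `Fabric` / `State` / `Status` layout shown above; the function names `parse_fabric_state` and `fabric_ready` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re


def parse_fabric_state(smi_output: str) -> dict:
    """Extract the 'State' and 'Status' fields of the Fabric block from `nvidia-smi -q` text."""
    fields = {}
    in_fabric = False
    for line in smi_output.splitlines():
        stripped = line.strip()
        if stripped == "Fabric":
            in_fabric = True
            continue
        if in_fabric:
            match = re.match(r"(State|Status)\s*:\s*(.+)", stripped)
            if match:
                fields[match.group(1).lower()] = match.group(2).strip()
            else:
                in_fabric = False  # any other line ends the Fabric block
    return fields


def fabric_ready(smi_output: str) -> bool:
    """True only when registration has completed successfully."""
    fields = parse_fabric_state(smi_output)
    return fields.get("state") == "Completed" and fields.get("status") == "Success"
```

In practice the text would come from running `nvidia-smi -q -i 0` (for example through `subprocess.run`) and the check would be repeated until `fabric_ready` returns true.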

Nvidia Container Toolkit

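NVIDIA Container Toolkit is a toolkit for building and running GPU-accelerated containers. It includes a container runtime library and utilities that automatically configure containers to use NVIDIA GPUs.

Installation

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the toolkit from the repository added above
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit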

Configuration

# for docker container
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# for k8s
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
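
To confirm that the Docker configuration step took effect, the daemon configuration can be inspected. This is a minimal sketch, assuming that `nvidia-ctk runtime configure --runtime=docker` registers a runtime named `nvidia` in the default `/etc/docker/daemon.json`; the helper name `has_nvidia_runtime` is hypothetical.

```python
import json


def has_nvidia_runtime(daemon_json_text: str) -> bool:
    """Return True if a Docker daemon.json document registers a runtime named 'nvidia'."""
    config = json.loads(daemon_json_text or "{}")
    return "nvidia" in config.get("runtimes", {})
```

A typical call would pass the file contents, for example `has_nvidia_runtime(open("/etc/docker/daemon.json").read())`.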

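Two Docker Containers

Two Docker containers have to be deployed separately: one for TensorRT-LLM (TRT) and another for NVIDIA Triton Inference Server (NTIS).

Steps:

For the Docker mode, we set Kubernetes management aside for now and interact with the containers through docker commands. The first problem is that the official recommendation is to run Docker in rootless mode, but configuring rootless mode on the latest Docker release failed: starting it with dockerd-rootless-setuptool.sh install raised a permission error. The troubleshooting section of the official documentation mentions this error, but with a different cause, so the problem was never resolved and we fell back to superuser mode.
Before choosing a Docker image, the NVIDIA Container Toolkit (NCT) must be installed. NCT can be thought of as NVIDIA's official adapter for Docker containers: its command-line tool (nvidia-ctk) quickly adjusts the container configuration, for example so that a container can use the NVIDIA GPU C++ runtime API and is matched to the GPU driver.
A major pitfall in choosing the container image is that the recommendations differ, both among official sources (for example, the technical blog and the official GitHub repository suggest different images) and among community guides. The best approach is therefore to check NVIDIA's official registry first (Data Science, Machine Learning, AI, HPC Containers | NVIDIA NGC - select CUDA) to see which images are currently provided and which components, component versions, and driver versions each contains. Pay particular attention to the trt-llm version among the components. Then pull from NVIDIA's official registry nvcr.io/nvidia/cuda (rather than Docker's official docker/nvidia).
Choosing components and component versions brings up some concepts tied to NVIDIA GPU architectures. The mapping is as follows: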

SM version	Architecture	CUDA version	GPUs
SM100	Blackwell	CUDA 12.8+	B100
SM90/SM90a	Hopper	CUDA 12+	H100, H200
SM89	Ada	CUDA 11.8+	RTX 40 series
SM86	Ampere	CUDA 11+	RTX 30 series
SM80	Ampere	CUDA 11+	A100
SM75	Turing	CUDA 10+	RTX 20 series

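For example, SM90 means the compute capability is "compute_90". Compute capability is the metric NVIDIA uses to describe the feature set of a GPU architecture. Each sm version number corresponds to specific hardware features and supported CUDA functionality, and a newer sm number generally indicates a more recent architecture; sm_90, for instance, brought better support for deep learning and high-performance computing. These sm version numbers matter when compiling CUDA programs, because they determine which GPU architectures a program can run on and which features and optimizations it can use.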
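
The relation between a compute capability such as 9.0 and the names used at build time can be sketched in a few lines. This is an illustrative helper only: the function names are hypothetical, and the `NN-real` list format is taken from the `CUDA_ARCHS` example used later in these notes. On recent drivers the compute capability can usually be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`, which is assumed here as the source of the input strings.

```python
def sm_name(compute_capability: str) -> str:
    """Turn a compute capability such as '9.0' into the matching sm name, 'sm_90'."""
    major, minor = compute_capability.strip().split(".")
    return f"sm_{int(major)}{int(minor)}"


def cuda_archs(compute_capabilities) -> str:
    """Build a CUDA_ARCHS-style string such as '89-real;90-real' from compute capabilities."""
    unique = sorted({sm_name(c)[3:] for c in compute_capabilities}, key=int)
    return ";".join(f"{arch}-real" for arch in unique)
```

For a machine with both Ada and Hopper cards, `cuda_archs(["8.9", "9.0"])` yields the string that restricts compilation to those two architectures.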

TensorRT-LLM Container Deployment

Environment Setup

# Start the NVIDIA Ubuntu container for TRT (the image used in the official documentation)
docker run --ipc=host --runtime=nvidia --gpus all -v /data2:/data -p 8080:8080 --entrypoint /bin/bash -it nvcr.io/nvidia/cuda:12.6.0-devel-ubuntu22.04

## Alternative image with TensorRT prebuilt, which seems the better choice
docker run --ipc=host --runtime=nvidia --gpus all -v /data2:/data -p 8080:8080 --entrypoint /bin/bash -it nvcr.io/nvidia/tensorrt:24.08-py3

# Inside the container, install the dependencies first
# trtllm requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

# Then install TensorRT-LLM
# The Triton container used later ships trtllm 0.12.0, so the same version must be used here to build the model
# --extra-index-url https://pypi.nvidia.com is required because the official PyPI index does not host trtllm
# Installation is slow, roughly half an hour
pip3 install tensorrt_llm==0.12.0 -U --extra-index-url https://pypi.nvidia.com

# Check that the installation succeeded
python3 -c "import tensorrt_llm" # prints: [TensorRT-LLM] TensorRT-LLM version: 0.12.0

Model Compilation

# Install the Hugging Face tooling
pip install -U huggingface_hub


# Enter the data directory mapped to the host disk
cd /data


# Download the model
# I had already downloaded the model, so this step was skipped in practice
# Always pass --resume-download: connections drop often and the download takes a long time
huggingface-cli login --token **** # after logging in to HF, the token is under Settings > Tokens in the top-right menu
huggingface-cli download --resume-download meta-llama/Llama-3.1-70B-Instruct --local-dir /data/meta-llama/Llama-3.1-70B-Instruct


# Download the TensorRT-LLM source code
git clone -b v0.12.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git lfs install


# Before the model can be loaded, it must be converted to the TensorRT-LLM checkpoint format
cd examples/llama/
pip install -r requirements.txt

# Upgrade the transformers package in the image to match the Hugging Face config
# A warning that the transformers and tensorrt_llm versions are incompatible will appear; it can be ignored
pip install --upgrade transformers

# Takes about 5 minutes
python3 convert_checkpoint.py --model_dir /data/meta-llama/Llama-3.1-70B-Instruct \
                            --output_dir ./trt_ckpts/llama3.1_checkpoint_8gpu_tp8 \
                            --dtype float16 \
                            --tp_size 8 \
                            --workers 8


# Build the engine; this also takes about 5 minutes
trtllm-build --checkpoint_dir ./trt_ckpts/llama3.1_checkpoint_8gpu_tp8 \
             --output_dir ./trt_engines/llama3.1_70B_128K_fp16_8-gpu \
             --workers 8 \
             --gemm_plugin auto \
             --max_num_tokens 131072

### Alternative for 405B ###
python3 convert_checkpoint.py --model_dir /data/meta-llama/Llama-3.1-405B-Instruct \
                              --output_dir ./trt_ckpts/llama_3.1_405B_HF_model/tp8-pp1/ \
                              --dtype bfloat16 \
                              --use_fp8_rowwise \
                              --tp_size 8 \
                              --pp_size 1 \
                              --load_by_shard \
                              --workers 8 \
                              --remove_duplicated_kv_heads
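
The choice of dtype and tensor-parallel size in the commands above largely determines how much GPU memory the weights occupy. The following back-of-the-envelope sketch is an assumption-laden estimate, not a TensorRT-LLM API: it uses the nominal parameter counts (70e9 and 405e9), counts weights only, and ignores the KV cache, activations, and runtime overhead.

```python
def weight_gib_per_gpu(num_params: float, bytes_per_param: int,
                       tp_size: int, pp_size: int = 1) -> float:
    """Approximate weight memory per GPU in GiB when weights are split evenly across TP x PP ranks."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / (tp_size * pp_size) / 2**30


# Llama-3.1-70B in float16 (2 bytes per parameter) on 8 GPUs
llama_70b_fp16 = weight_gib_per_gpu(70e9, 2, tp_size=8)

# Llama-3.1-405B with FP8 weights (1 byte per parameter) on 8 GPUs
llama_405b_fp8 = weight_gib_per_gpu(405e9, 1, tp_size=8)
```

Under these assumptions the 70B model needs roughly 16 GiB of weights per GPU, which leaves most of an 80 GB card for the KV cache; this is one reason a very large `--max_num_tokens` is feasible on an 8-GPU node.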

Model Inference

# Run a simple inference test
# mpirun -n 8 is required to match our machine node (8 GPUs)
mpirun -n 8 --allow-run-as-root \
    python3 ../run.py \
        --engine_dir ./trt_engines/llama3.1_70B_128K_fp16_8-gpu  \
        --max_output_len 128 \
        --tokenizer_dir /data/meta-llama/Llama-3.1-70B-Instruct \
        --input_text "tell a story"



# Run a simple inference test on the official dataset (summarization)
# --engine_dir must point at the engine directory produced by trtllm-build
mpirun -n 8 --allow-run-as-root \
    python3 ../summarize.py \
        --test_trt_llm \
        --hf_model_dir /data/meta-llama/Llama-3.1-70B-Instruct \
        --data_type fp16 \
        --engine_dir ./trt_engines/llama3.1_70B_128K_fp16_8-gpu

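Other Notes on Handling the Docker Container

# In principle, the official view is that the Docker container only provides an environment for building the model engine
# Serving does not happen inside it, so the container can be discarded once the build finishes

# Official and community tutorials therefore usually create the container in one of two ways
# 1. From an official image, with --rm to create a temporary container, for example:
docker run --rm --ipc=host --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.4.1-devel-ubuntu22.04
# 2. With the trt-llm make targets, which build a temporary image and start a temporary container from it, for example:
make -C docker release_build
make -C docker release_build CUDA_ARCHS="89-real;90-real" # Restrict the compilation to Ada and Hopper architectures.
make -C docker release_run

# In practice, however, we created a long-lived container and did not discard it
# Reasons: 1) to keep the logs from the model build; 2) the build parameters need repeated tuning, so the build environment has to be reused
# The container was therefore created in the non-temporary way
# The drawback is that the build environment does not follow official component updates, which may leave new features unsupported or even cause build errors
docker run --runtime=nvidia --gpus all -v /data2:/data -p 8080:8080 --entrypoint /bin/bash -itd nvcr.io/nvidia/cuda:12.6.0-devel-ubuntu22.04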

Lightweight Deployment

cd /data/TensorRT-LLM/examples/apps
pip install -r requirements.txt

nohup python3 openai_server.py /data/TensorRT-LLM/examples/llama/trt_engines/llama3.1_70B_128K_fp16_8-gpu/ --tokenizer /data/meta-llama/Llama-3.1-70B-Instruct/ > ./openai_server.logs 2>&1 &

Triton Container Deployment

Environment Setup

# Start the NVIDIA Ubuntu container for NTIS
docker run -it --gpus all --network host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 -v /data2:/data nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 bash

# Confirm the trt_llm version in the container
python3 -c "import tensorrt_llm"

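Rebuilding the Model Engine with tensorrt_llm (Optional)

The engine built earlier in the trt-llm container can be used directly. However, if the trt-llm version in the trt-llm container differs from the trt-llm version used by trt-llm-backend in the NTIS container, an error is raised, so here we build the model once more.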
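
Whether an existing engine can be reused comes down to a version comparison, which can be sketched as follows. This assumes that the engine directory written by trtllm-build contains a `config.json` with a top-level `version` field; treat that layout as an assumption to verify against your own engine directory. The helper names are hypothetical.

```python
import json
from pathlib import Path


def engine_version(engine_dir: str):
    """Read the TensorRT-LLM version recorded in an engine directory's config.json (None if absent)."""
    config_path = Path(engine_dir) / "config.json"
    with config_path.open() as f:
        return json.load(f).get("version")


def versions_match(engine_dir: str, runtime_version: str) -> bool:
    """True if the engine was built with the same TensorRT-LLM version as the runtime."""
    return engine_version(engine_dir) == runtime_version
```

Inside the Triton container, `runtime_version` would typically be `tensorrt_llm.__version__`; if the two differ, rebuilding as described here is the safe route.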

# Download tensorrtllm_backend
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend

# Update the submodule TensorRT-LLM repository
git submodule update --init --recursive
git lfs install
git lfs pull

# TensorRT-LLM is required for generating engines. You can skip this step if
# you already have the package installed. If you are generating engines within
# the Triton container, you have to install the TRT-LLM package.
(cd tensorrt_llm &&
    bash docker/common/install_cmake.sh &&
    export PATH=/usr/local/cmake/bin:$PATH &&
    python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" &&
    pip3 install ./build/tensorrt_llm*.whl)

# Confirm again whether the trt_llm version in the container has changed
python3 -c "import tensorrt_llm"


# Before the model can be loaded, it must be converted to the TensorRT-LLM checkpoint format
cd tensorrt_llm/examples/llama/
pip install -r requirements.txt
pip install --upgrade transformers # upgrade the transformers package in the image to match the Hugging Face config

# Takes about 5 minutes
python3 convert_checkpoint.py --model_dir /data/meta-llama/Llama-3.1-70B-Instruct \
                            --output_dir /data/TensorRT-LLM/examples/llama/trt_ckpts/llama3.1_ckpts_ntis_8gpu_tp8 \
                            --dtype float16 \
                            --tp_size 8 \
                            --workers 8


# Build the engine; this also takes about 5 minutes
trtllm-build --checkpoint_dir /data/TensorRT-LLM/examples/llama/trt_ckpts/llama3.1_ckpts_ntis_8gpu_tp8 \
             --output_dir /data/TensorRT-LLM/examples/llama/trt_engines/llama3.1_70B_fp16_8-gpu_ntis \
             --workers 8 \
             --remove_input_padding \
             --gemm_plugin auto \
             --context_fmha enable \
             --paged_kv_cache enable \
             --use_paged_context_fmha enable \
             --max_num_tokens 131072

Creating the Model Repository for Triton

cd /opt/tritonserver/tensorrtllm_backend
# Create a separate model repo folder, which is easier to manage and is later handed to Triton
mkdir triton_model_repo

# Copy the example models into the repo folder; they serve as a reference when editing the configuration
cp -r all_models/inflight_batcher_llm/* triton_model_repo/

# triton_model_repo now contains 5 folders; the official descriptions are:
## preprocessing: This model is used for tokenizing, meaning the conversion from prompts(string) to input_ids(list of ints).
## tensorrt_llm: This model is a wrapper of your TensorRT-LLM model and is used for inferencing. Input specification can be found here
## postprocessing: This model is used for de-tokenizing, meaning the conversion from output_ids(list of ints) to outputs(string).
## ensemble: This model can be used to chain the preprocessing, tensorrt_llm and postprocessing models together.
## tensorrt_llm_bls: This model can also be used to chain the preprocessing, tensorrt_llm and postprocessing models together.
### "BLS" stands for Business Logic Scripting, a Triton Python backend feature that lets a model call other models from custom Python logic such as conditionals and loops, instead of wiring them together statically as an ensemble does.

# Copy the engine built earlier into the repo; the model version is defined as 1
# For multiple model versions, name the folders under the model directory 2, 3, 4, and so on
cp /data/TensorRT-LLM/examples/llama/trt_engines/llama3.1_70B_fp16_8-gpu_ntis/* triton_model_repo/tensorrt_llm/1/
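
Before configuring the service, it can help to sanity-check the repository layout described above: each model folder holds a `config.pbtxt` and at least one numbered version folder. The following is a minimal sketch of such a check; `check_model_repo` is a hypothetical helper, and Triton itself performs its own, more thorough validation at startup.

```python
from pathlib import Path


def check_model_repo(repo_dir: str) -> dict:
    """Map each model directory to a list of layout problems (an empty list means it looks fine)."""
    problems = {}
    for model_dir in sorted(p for p in Path(repo_dir).iterdir() if p.is_dir()):
        issues = []
        if not (model_dir / "config.pbtxt").is_file():
            issues.append("missing config.pbtxt")
        # Version folders are named with plain integers: 1, 2, 3, ...
        versions = [p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()]
        if not versions:
            issues.append("no numeric version directory")
        problems[model_dir.name] = issues
    return problems
```

Calling `check_model_repo("triton_model_repo")` after the copy step should report an empty list for all five models.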

Configuring the Service

# Fill in config.pbtxt: for example the model location, the tokenizer to use, and whether in-flight batching is enabled
## For all options, see: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md

## Set the tokenizer location and type for postprocessing
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt \
tokenizer_dir:/data/meta-llama/Llama-3.1-70B-Instruct,\
tokenizer_type:auto,\
triton_max_batch_size:64,\
postprocessing_instance_count:1

## Set the tokenizer location and type for preprocessing
python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt \
tokenizer_dir:/data/meta-llama/Llama-3.1-70B-Instruct,\
tokenizer_type:auto,\
triton_max_batch_size:64,\
preprocessing_instance_count:1

## Configure tensorrt_llm_bls; decoupled_mode must be true for clients to receive streamed output
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt \
triton_max_batch_size:64,\
decoupled_mode:true,\
bls_instance_count:1

## Configure the ensemble
python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt \
triton_max_batch_size:64

## Configure the trt model and enable the inflight_fused_batching inference strategy
# Set decoupled_mode to false instead if clients will use the HTTP endpoint, which does not support decoupled mode
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
triton_backend:tensorrtllm,\
triton_max_batch_size:64,\
decoupled_mode:true,\
engine_dir:triton_model_repo/tensorrt_llm/1,\
max_queue_delay_microseconds:10000,\
enable_kv_cache_reuse:true,\
batching_strategy:inflight_fused_batching
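
The commands above all follow one pattern: a comma-separated list of `key:value` pairs is substituted into placeholders in a `config.pbtxt` template. The sketch below illustrates that idea with the standard library. It is not the code of `tools/fill_template.py`; it assumes placeholders of the form `${key}`, and the function name is chosen only for illustration.

```python
from string import Template


def fill_template(template_text: str, substitutions: str) -> str:
    """Replace ${key} placeholders using a 'key:value,key:value' string; unknown keys are left as-is."""
    pairs = {}
    for item in substitutions.split(","):
        # Split on the first ':' only, so values such as paths may contain ':' themselves
        key, _, value = item.partition(":")
        pairs[key.strip()] = value.strip()
    return Template(template_text).safe_substitute(pairs)
```

Placeholders that receive no value stay visible in the output, which makes it easy to search the generated file for any remaining `${` and spot options that were missed.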

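Starting the Service

# Start serving; world_size is the number of GPUs used for serving
python3 scripts/launch_triton_server.py \
--world_size 8 \
--model_repo=/opt/tritonserver/tensorrtllm_backend/triton_model_repo/ \
--log \
--log-file=./triton_server.logs
# --max_input_length 131072 is not passed here: the line above has no trailing backslash, and the maximum input length is set when the engine is built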
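
Loading a 70B engine onto 8 GPUs takes a while, so a client should wait until the server reports ready before sending requests. The following is a minimal sketch, assuming Triton's HTTP service listens on its default port 8000 on the local host (the container above uses host networking) and exposes the standard `/v2/health/ready` endpoint; the function names and retry settings are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def triton_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # server not up yet, or not reachable


def wait_until_ready(probe=triton_ready, attempts: int = 30, delay_s: float = 2.0) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay_s` seconds after each failure."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay_s)
    return False
```

Passing the probe as a parameter keeps the retry loop testable without a running server. For a model of this size, the number of attempts will likely need to be raised.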

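About In-flight Batching

You may have noticed during deployment that when configuring the tensorrt_llm backend, we set batching_strategy:inflight_fused_batching. In-flight Batching is an optimization for model inference and is also commonly called Continuous Batching.

The basic idea of batching is to combine multiple independent inference requests into one batch and submit it to the model in a single pass, which raises the utilization of compute resources. Inference requests on a GPU can be batched in 4 ways:

No Batching: one request is processed at a time
Static Batching: requests are placed in a batch queue, and the batch is executed in one go when the queue is full
Dynamic Batching: incoming requests are placed in a batch queue, and the batch is executed in one go when the queue is full or a time threshold is reached
In-flight Batching/Continuous Batching: requests are processed token by token; when a request finishes and releases its GPU resources, a new request can start immediately without waiting for the whole batch to complete

Why is In-flight Batching needed? Consider Static Batching with several requests submitted to the model. Suppose a batch holds 2 requests: request_a needs to generate 3 tokens (say 3s) and request_b needs to generate 100 tokens (say 100s), so the batch takes 100s to finish. Meanwhile, request_c arrives and can only wait. This causes 2 problems:

request_a takes only 3s, but because it shares a batch with request_b, its result cannot be returned to the client as soon as it is done; it has to wait until request_b also finishes, and both results are returned together
The new request request_c also has to wait until the previous batch completes (100s in this example) before the model can process it

With this approach, GPU resources sit idle and are not fully used, and overall inference latency rises. In-flight Batching works differently:

It can dynamically change the set of requests that make up the current batch, even while the batch is running.

Back to the scenario above: suppose 3s have passed, request_a has finished, and request_b still needs 97s. Because request_a is done, its result can be returned to the client right away. And because request_a has released its resources, request_c can be processed immediately. This raises the GPU utilization of the whole system and lowers overall inference latency.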
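
The scenario above can be reproduced with a small simulation. This is a toy model rather than TensorRT-LLM's scheduler: it assumes every active request produces one token per 1s step, all requests are already queued at time 0, and request_c needs 5 tokens (a value chosen only for illustration).

```python
def static_batching(requests, batch_size):
    """requests: list of (name, num_tokens). Returns {name: time at which its result is returned}."""
    finish, clock = {}, 0
    for i in range(0, len(requests), batch_size):
        batch = requests[i:i + batch_size]
        clock += max(tokens for _, tokens in batch)  # a batch lasts as long as its longest request
        for name, _ in batch:
            finish[name] = clock  # all results are returned when the whole batch ends
    return finish


def inflight_batching(requests, batch_size):
    """Same input, but free slots are refilled before every decoding step."""
    finish, clock = {}, 0
    waiting = list(requests)
    active = {}  # name -> tokens still to generate
    while waiting or active:
        while waiting and len(active) < batch_size:
            name, tokens = waiting.pop(0)
            active[name] = tokens
        clock += 1  # one step: every active request generates one token
        for name in list(active):
            active[name] -= 1
            if active[name] <= 0:
                finish[name] = clock  # a finished request returns immediately and frees its slot
                del active[name]
    return finish


requests = [("request_a", 3), ("request_b", 100), ("request_c", 5)]
static_result = static_batching(requests, batch_size=2)
inflight_result = inflight_batching(requests, batch_size=2)
```

With static batching, request_a is returned at 100s and request_c at 105s; with in-flight batching, request_a is returned at 3s and request_c at 8s, while request_b finishes at 100s in both cases.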

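About Decoupled Mode (Decoupled Models)

For LLMs, this mode needs to be understood in order to support streaming.

Triton supports backends and models that send multiple responses, or zero responses, for a request. A decoupled model/backend may also send responses out of order relative to the order in which the request batches were executed. This lets the backend deliver a response whenever it sees fit, and a request that produces many responses does not block the delivery of responses from other requests.

For decoupled models, Triton's HTTP endpoint cannot be used to run inference, because it supports exactly one response per request. Even the standard ModelInfer RPC of the GRPC endpoint does not support decoupled responses. To run inference on a decoupled model, the client must use the bidirectional streaming RPC.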
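
The behavior described above, zero or many responses per request delivered in interleaved order, can be illustrated with a plain Python generator. This is a conceptual toy and does not use the Triton client or backend APIs; the function name and the round-robin delivery order are assumptions made for the illustration.

```python
from collections import deque


def decoupled_responses(requests):
    """requests: {request_id: [tokens]}. Yields (request_id, token, is_final) in round-robin order."""
    pending = deque((rid, deque(tokens)) for rid, tokens in requests.items())
    while pending:
        rid, tokens = pending.popleft()
        if not tokens:
            continue  # a request may produce zero responses
        token = tokens.popleft()
        yield rid, token, not tokens  # is_final is True for the request's last response
        if tokens:
            pending.append((rid, tokens))  # more to send later; other requests go first


stream = list(decoupled_responses({"r1": ["a", "b"], "r2": ["x"], "r3": []}))
```

Here `r2` receives its only response before `r1` has finished, and `r3` receives none. A one-response-per-request protocol cannot express either case, which is why a streaming RPC is required.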

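Series:

1. LLM Inference Framework Selection Survey
2. TensorRT-LLM & Triton Server Deployment Notes
3. vLLM LLM Inference Engine Survey
4. vLLM Inference Engine Performance Benchmarking
5. Notes on Problems Deploying LLMs with vLLM
6. Triton Inference Server Architecture and Principles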
