DeepSeek V3源码剖析：高效的FP8量化与矩阵乘法实现，基于Triton的优化

通过对输入数据进行FP8量化，并利用Triton提供的高效矩阵计算和量化/去量化操作，这段代码可广泛应用于各种深度学习任务，尤其是在推理加速、量化感知训练和大规模矩阵运算优化方面。对于需要在内存受限的设备上部署深度学习模型的场景，使用FP8量化提供了显著的性能提升和存储节省。

YT2022

501人浏览 · 2025-03-30 13:26:46

YT2022 · 2025-03-30 13:26:46 发布

from typing import Tuple

import torch
import triton
import triton.language as tl
from triton import Config


@triton.jit
def act_quant_kernel(x_ptr, y_ptr, s_ptr, BLOCK_SIZE: tl.constexpr):
    """
    Quantizes the input tensor `x_ptr` and stores the result in `y_ptr` and the scaling factor in `s_ptr`.

    Args:
        x_ptr (triton.Pointer): Pointer to the input tensor.
        y_ptr (triton.Pointer): Pointer to the output tensor where quantized values will be stored.
        s_ptr (triton.Pointer): Pointer to the output tensor where scaling factors will be stored.
        BLOCK_SIZE (tl.constexpr): The size of the block to be processed by each program instance.

    Returns:
        None
    """
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offs).to(tl.float32)
    s = tl.max(tl.abs(x)) / 448.
    y = x / s
    y = y.to(y_ptr.dtype.element_ty)
    tl.store(y_ptr + offs, y)
    tl.store(s_ptr + pid, s)


def act_quant(x: torch.Tensor, block_size: int = 128) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Quantizes the input tensor `x` using block-wise quantization.

    Args:
        x (torch.Tensor): The input tensor to be quantized. Must be contiguous and its last dimension size must be divisible by `block_size`.
        block_size (int, optional): The size of the blocks to be used for quantization. Default is 128.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: A tuple containing:
            - The quantized tensor with dtype `torch.float8_e4m3fn`.
            - A tensor of scaling factors with dtype `torch.float32`.
    """
    assert x.is_contiguous(), 'Input tensor must be contiguous'
    assert x.size(-1) % block_size == 0, f'Last dimension size must be divisible by block_size (block_size={block_size})'
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    s = x.new_empty(*x.size()[:-1], x.size(-1) // block_size, dtype=torch.float32)
    grid = lambda meta: (triton.cdiv(x.numel(), meta['BLOCK_SIZE']), )
    act_quant_kernel[grid](x, y, s, BLOCK_SIZE=block_size)
    return y, s


@triton.jit
def weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE: tl.constexpr):
    """
    Dequantizes weights using the provided scaling factors and stores the result.

    Args:
        x_ptr (tl.pointer): Pointer to the quantized weights.
        s_ptr (tl.pointer): Pointer to the scaling factors.
        y_ptr (tl.pointer): Pointer to the output buffer for dequantized weights.
        M (int): Number of rows in the weight matrix.
        N (int): Number of columns in the weight matrix.
        BLOCK_SIZE (tl.constexpr): Size of the block for tiling.

    Returns:
        None
    """
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    n = tl.cdiv(N, BLOCK_SIZE)
    offs_m = pid_m * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs_n = pid_n * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs = offs_m[:, None] * N + offs_n[None, :]
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    x = tl.load(x_ptr + offs, mask=mask).to(tl.float32)
    s = tl.load(s_ptr + pid_m * n + pid_n)
    y = x * s
    tl.store(y_ptr + offs, y, mask=mask)


def weight_dequant(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    """
    Dequantizes the given weight tensor using the provided scale tensor.

    Args:
        x (torch.Tensor): The quantized weight tensor of shape (M, N).
        s (torch.Tensor): The scale tensor of shape (M, N).
        block_size (int, optional): The block size to use for dequantization. Defaults to 128.

    Returns:
        torch.Tensor: The dequantized weight tensor of the same shape as `x`.

    Raises:
        AssertionError: If `x` or `s` are not contiguous or if their dimensions are not 2.
    """
    assert x.is_contiguous() and s.is_contiguous(), 'Input tensors must be contiguous'
    assert x.dim() == 2 and s.dim() == 2, 'Input tensors must have 2 dimensions'
    M, N = x.size()
    y = torch.empty_like(x, dtype=torch.get_default_dtype())
    grid = lambda meta: (triton.cdiv(M, meta['BLOCK_SIZE']), triton.cdiv(N, meta['BLOCK_SIZE']))
    weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
    return y


fp8_gemm_configs = [
    Config({'BLOCK_SIZE_M': block_m, 'BLOCK_SIZE_N': block_n, 'BLOCK_SIZE_K': 128}, num_stages=num_stages, num_warps=8)
    for block_m in [16, 32, 64] for block_n in [32, 64, 128] for num_stages in [3, 4, 5, 6]
]

@triton.autotune(configs=fp8_gemm_configs, key=['N', 'K'])
@triton.jit
def fp8_gemm_kernel(a_ptr, b_ptr, c_ptr,
                    a_s_ptr, b_s_ptr,
                    M, N: tl.constexpr, K: tl.constexpr,
                    BLOCK_SIZE_M: tl.constexpr,
                    BLOCK_SIZE_N: tl.constexpr,
                    BLOCK_SIZE_K: tl.constexpr):
    """
    Performs a matrix multiplication operation on FP8 matrices with scaling factors.

    Args:
        a_ptr (tl.tensor): Pointer to the first input matrix A.
        b_ptr (tl.tensor): Pointer to the second input matrix B.
        c_ptr (tl.tensor): Pointer to the output matrix C.
        a_s_ptr (tl.tensor): Pointer to the scaling factors for matrix A.
        b_s_ptr (tl.tensor): Pointer to the scaling factors for matrix B.
        M (int): Number of rows in matrix A and C.
        N (tl.constexpr): Number of columns in matrix B and C.
        K (tl.constexpr): Number of columns in matrix A and rows in matrix B.
        BLOCK_SIZE_M (tl.constexpr): Block size for the M dimension.
        BLOCK_SIZE_N (tl.constexpr): Block size for the N dimension.
        BLOCK_SIZE_K (tl.constexpr): Block size for the K dimension.

    Returns:
        None
    """
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    k = tl.cdiv(K, BLOCK_SIZE_K)
    offs_m = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
    offs_n = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
    offs_k = tl.arange(0, BLOCK_SIZE_K)
    a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :]
    b_ptrs = b_ptr + offs_n[None, :] * K + offs_k[:, None]
    a_s_ptrs = a_s_ptr + offs_m * k
    b_s_ptrs = b_s_ptr + (offs_n // BLOCK_SIZE_K) * k

    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
    for i in range(k):
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - i * BLOCK_SIZE_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - i * BLOCK_SIZE_K, other=0.0)
        a_s = tl.load(a_s_ptrs)
        b_s = tl.load(b_s_ptrs)
        accumulator += tl.dot(a, b) * a_s[:, None] * b_s[None, :]
        a_ptrs += BLOCK_SIZE_K
        b_ptrs += BLOCK_SIZE_K
        a_s_ptrs += 1
        b_s_ptrs += 1
    c = accumulator.to(c_ptr.dtype.element_ty)
    offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
    offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :]
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, c, mask=mask)


def fp8_gemm(a: torch.Tensor, a_s: torch.Tensor, b: torch.Tensor, b_s: torch.Tensor):
    """
    Perform a matrix multiplication using FP8 precision.

    Args:
        a (torch.Tensor): The first input matrix, must be contiguous.
        a_s (torch.Tensor): The scaling factor for the first input matrix, must be contiguous.
        b (torch.Tensor): The second input matrix, must be contiguous.
        b_s (torch.Tensor): The scaling factor for the second input matrix, must be contiguous.

    Returns:
        torch.Tensor: The result of the matrix multiplication.
    """
    assert a.is_contiguous() and b.is_contiguous(), 'Input tensors must be contiguous'
    assert a_s.is_contiguous() and b_s.is_contiguous(), 'Scaling factor tensors must be contiguous'
    K = a.size(-1)
    M = a.numel() // K
    N = b.size(0)
    c = a.new_empty(*a.size()[:-1], N, dtype=torch.get_default_dtype())
    grid = lambda META: (triton.cdiv(M, META['BLOCK_SIZE_M']), triton.cdiv(N, META['BLOCK_SIZE_N']))
    fp8_gemm_kernel[grid](a, b, c, a_s, b_s, M, N, K)
    return c

这段代码是利用 Triton 和 PyTorch 构建的高效量化和矩阵乘法操作，用于量化张量和进行 FP8 精度的矩阵乘法。下面是对代码的模块化解析：

1. `act_quant_kernel` - 量化核函数

@triton.jit
def act_quant_kernel(x_ptr, y_ptr, s_ptr, BLOCK_SIZE: tl.constexpr):
    # 量化操作核函数
    pid = tl.program_id(axis=0)  # 获取程序ID
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # 计算块偏移
    x = tl.load(x_ptr + offs).to(tl.float32)  # 载入输入张量
    s = tl.max(tl.abs(x)) / 448.  # 计算缩放因子
    y = x / s  # 使用缩放因子进行量化
    y = y.to(y_ptr.dtype.element_ty)  # 转换回目标类型
    tl.store(y_ptr + offs, y)  # 存储量化后的结果
    tl.store(s_ptr + pid, s)  # 存储缩放因子

作用: 该函数用于将输入张量 x 量化，并将结果存储到输出张量 y 中，同时计算并存储量化的缩放因子 s。
输入:
- x_ptr: 输入张量的指针。
- y_ptr: 输出量化后张量的指针。
- s_ptr: 缩放因子存储指针。
- BLOCK_SIZE: 每个程序实例处理的数据块大小。

2. `act_quant` - 量化操作

def act_quant(x: torch.Tensor, block_size: int = 128) -> Tuple[torch.Tensor, torch.Tensor]:
    # 输入张量量化
    assert x.is_contiguous(), 'Input tensor must be contiguous'
    assert x.size(-1) % block_size == 0, f'Last dimension size must be divisible by block_size (block_size={block_size})'
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)  # 创建量化后的输出张量
    s = x.new_empty(*x.size()[:-1], x.size(-1) // block_size, dtype=torch.float32)  # 创建缩放因子张量
    grid = lambda meta: (triton.cdiv(x.numel(), meta['BLOCK_SIZE']), )
    act_quant_kernel[grid](x, y, s, BLOCK_SIZE=block_size)  # 调用核函数
    return y, s

作用: 对输入张量 x 进行量化，并返回量化后的张量 y 和缩放因子 s。
输入:
- x: 需要量化的张量。
- block_size: 量化操作中使用的块大小。
输出: 量化后的张量和缩放因子。

3. `weight_dequant_kernel` - 权重反量化核函数

@triton.jit
def weight_dequant_kernel(x_ptr, s_ptr, y_ptr, M, N, BLOCK_SIZE: tl.constexpr):
    # 权重反量化操作
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    n = tl.cdiv(N, BLOCK_SIZE)
    offs_m = pid_m * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs_n = pid_n * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs = offs_m[:, None] * N + offs_n[None, :]
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    x = tl.load(x_ptr + offs, mask=mask).to(tl.float32)
    s = tl.load(s_ptr + pid_m * n + pid_n)
    y = x * s
    tl.store(y_ptr + offs, y, mask=mask)

作用: 该函数用于根据缩放因子反量化权重矩阵。
输入:
- x_ptr: 量化权重的指针。
- s_ptr: 缩放因子的指针。
- y_ptr: 存储反量化结果的指针。
- M, N: 权重矩阵的维度。
- BLOCK_SIZE: 每个程序实例处理的数据块大小。

4. `weight_dequant` - 权重反量化操作

def weight_dequant(x: torch.Tensor, s: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    # 对权重矩阵进行反量化
    assert x.is_contiguous() and s.is_contiguous(), 'Input tensors must be contiguous'
    assert x.dim() == 2 and s.dim() == 2, 'Input tensors must have 2 dimensions'
    M, N = x.size()
    y = torch.empty_like(x, dtype=torch.get_default_dtype())
    grid = lambda meta: (triton.cdiv(M, meta['BLOCK_SIZE_M']), triton.cdiv(N, meta['BLOCK_SIZE_N']))
    weight_dequant_kernel[grid](x, s, y, M, N, BLOCK_SIZE=block_size)
    return y

作用: 对量化的权重张量 x 和缩放因子 s 进行反量化，恢复出原始权重矩阵。
输入:
- x: 量化的权重矩阵。
- s: 缩放因子矩阵。
- block_size: 反量化时使用的块大小。
输出: 反量化后的权重矩阵。

5. `fp8_gemm_kernel` - FP8 精度的矩阵乘法核函数

@triton.autotune(configs=fp8_gemm_configs, key=['N', 'K'])
@triton.jit
def fp8_gemm_kernel(a_ptr, b_ptr, c_ptr,
                    a_s_ptr, b_s_ptr,
                    M, N: tl.constexpr, K: tl.constexpr,
                    BLOCK_SIZE_M: tl.constexpr,
                    BLOCK_SIZE_N: tl.constexpr,
                    BLOCK_SIZE_K: tl.constexpr):
    # 执行FP8精度的矩阵乘法
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    k = tl.cdiv(K, BLOCK_SIZE_K)
    offs_m = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
    offs_n = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
    offs_k = tl.arange(0, BLOCK_SIZE_K)
    a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :]
    b_ptrs = b_ptr + offs_n[None, :] * K + offs_k[:, None]
    a_s_ptrs = a_s_ptr + offs_m * k
    b_s_ptrs = b_s_ptr + (offs_n // BLOCK_SIZE_K) * k

    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
    for i in range(k):
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - i * BLOCK_SIZE_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - i * BLOCK_SIZE_K, other=0.0)
        a_s = tl.load(a_s_ptrs)
        b_s = tl.load(b_s_ptrs)
        accumulator += tl.dot(a, b) * a_s[:, None] * b_s[None, :]
        a_ptrs += BLOCK_SIZE_K
        b_ptrs += BLOCK_SIZE_K
        a_s_ptrs += 1
        b_s_ptrs += 1
    c = accumulator.to(c_ptr.dtype.element_ty)
    offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
    offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :]
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, c, mask=mask)

作用: 执行基于 FP8 精度的矩阵乘法，支持量化和缩放因子的操作。
输入:
- a_ptr, b_ptr, c_ptr: 输入矩阵和输出矩阵的指针。
- a_s_ptr, b_s_ptr: 用于矩阵 A 和 B 的缩放因子指针。
- M, N, K: 矩阵的维度。
- BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K: 各个维度的块大小。

应用场景

这段代码展示了如何使用Triton进行FP8（16位浮点数）量化和矩阵乘法（GEMM）的高效实现。FP8量化可以显著减小模型参数的存储需求，同时保持模型精度，在深度学习模型训练和推理中非常有用，尤其是当模型参数非常庞大时。以下是代码应用的几种典型场景：

1. 深度学习模型的推理加速

背景：现代深度学习模型（如GPT、BERT等）通常具有成千上万的参数，推理过程中的计算量非常大。为了提高推理速度和减少内存带宽的压力，常常需要进行量化。
应用：这段代码中的act_quant和weight_dequant函数通过量化和去量化操作，将模型参数压缩成FP8精度，从而减少内存占用并加速推理。特别适用于计算资源有限的环境，如边缘设备、嵌入式设备或GPU推理。
实例：例如，推理模型在GPU上执行时，通过使用FP8精度代替32位浮点数，可以提高计算速度并减少显存使用，这对于大型模型（如GPT-3）尤为重要。

2. 训练过程中的量化感知训练

背景：量化感知训练（QAT, Quantization Aware Training）是一种常见的训练优化方法，通过模拟量化误差来训练量化后的模型，以便模型能够适应量化带来的精度损失。
应用：在训练阶段，使用act_quant对激活值进行量化，weight_dequant则用于去量化权重矩阵。这种方式可以使模型在训练时就适应量化带来的变化，从而提升量化后模型的精度。
实例：例如，在使用Triton进行训练的过程中，量化会在模型的每个训练步骤中进行，从而确保最终生成的模型能够在FP8精度下仍然保持较高的性能。

3. 大规模矩阵运算优化

背景：许多深度学习模型的核心计算都涉及大规模矩阵乘法，尤其是在神经网络的前向传播和反向传播中，矩阵乘法通常是最消耗计算资源的部分。
应用：通过fp8_gemm函数，该代码可以高效地在FP8精度下执行矩阵乘法计算，利用Triton优化的GPU并行计算能力来加速矩阵乘法。这使得即便是计算密集型的任务（例如大规模的Transformer模型计算）也能在较低的计算成本下完成。
实例：当在大型NLP模型（如BERT或GPT）上执行矩阵乘法时，FP8精度可以加速矩阵运算，并减少对GPU内存的需求。

4. 量化模型部署

背景：在推理或部署过程中，量化是常用的模型优化方法。量化不仅能够节省存储空间，还能提升执行效率，尤其在部署到内存和计算资源有限的设备上时。
应用：该代码中的量化和去量化功能可以用于将训练好的模型转换为量化版本，然后将其部署到生产环境中。量化后的模型不仅具有较低的内存占用，还能够保持接近原始模型的精度。
实例：例如，在某些嵌入式系统中，由于存储和计算资源有限，通常需要将模型进行量化后才能进行推理。在这种场景下，使用act_quant进行激活量化，weight_dequant进行权重去量化，可以显著提升推理效率。

5. 大规模深度学习框架优化

背景：在大规模的分布式训练和推理框架中，数据并行计算需要在多个节点上同时处理大量的矩阵计算。Triton作为一个高性能的GPU计算库，能够在这种环境下充分发挥其并行计算能力。
应用：利用Triton的GPU并行能力，结合FP8量化优化技术，可以显著提高大规模训练和推理的效率，尤其是在分布式计算环境中。通过使用fp8_gemm_kernel等高效的矩阵运算核，能够在多个GPU上并行加速矩阵乘法。
实例：例如，训练一个大规模的多语言模型时，fp8_gemm可以帮助减少每个训练步骤的计算时间，同时保持高效的并行计算，提升整个训练过程的速度。

总结

Triton中文社区

欢迎来到由智源人工智能研究院发起的Triton中文社区，这里是一个汇聚了AI开发者、数据科学家、机器学习爱好者以及业界专家的活力平台。我们致力于成为业内领先的Triton技术交流与应用分享的殿堂，为推动人工智能技术的普及与深化应用贡献力量。

更多推荐

早鸟票倒计时｜TVM/Triton/TileLang同台炫技，Meet AI Compiler邀你一同解锁AI编译器的无限可能！

Triton中文社区

triton inference server的backend插件机制代码流程梳理、模型加载代码梳理

triton inference server代码流程梳理、模型加载代码梳理

Triton中文社区

obs-backgroundremoval项目CUDA依赖问题的分析与解决方案

obs-backgroundremoval项目CUDA依赖问题的分析与解决方案obs-backgroundremoval是OBS Studio的一个插件，主要用于在肖像图像和视频中替换背景以及增强低光场景。在将该项目打包为Fedora RPM时，发现了一个关于CUDA依赖的重要技术问题。问题根源分析项目默认构建会捆绑onnxruntime库，这些库在默认构建配置下强制包含了CUDA运行时路径...