Professional CUDA C Programming, Chapter 2: The CUDA Programming Model

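The CUDA programming model provides an abstraction of the computer architecture that acts as a bridge between an application and the hardware it runs on. Building on the compute capability of the GPU architecture, the GPU programming model offers two features of its own: (1) a way to organize threads on the GPU through a hierarchy, and (2) a way to access memory on the GPU through a hierarchy.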


九是否非随机的称呼 · Published 2024-02-04 20:12:23

1. Introduction — CUDA C Programming Guide (nvidia.com)

CUDA Runtime API :: CUDA Toolkit Documentation (nvidia.com)

Professional CUDA C Programming (《CUDA C编程权威指南》): product.dangdang.com/25089854.html

The content below is based on the book:

《CUDA C编程权威指南》 (Professional CUDA C Programming), by John Cheng (程润伟), Dangdang Books (dangdang.com)

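CUDA is a general-purpose parallel computing platform and programming model, built as an extension of the C language. With CUDA, parallel algorithms can be written much like ordinary C code. The vector addition example used in this chapter can be found in the following GitHub repository: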

vectorAdd CUDA sample.

cuda-samples/Samples/0_Introduction/vectorAdd/vectorAdd.cu at master · NVIDIA/cuda-samples: github.com/NVIDIA/cuda-samples/blob/master/Samples/0_Introduction/vectorAdd/vectorAdd.cu

2.1 Overview of the CUDA Programming Model

2.1.1 CUDA Programming Structure

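Unified Memory, an improvement to the programming model introduced in CUDA 6, lets host memory and device memory be treated as one unified memory space, so the program no longer has to copy data explicitly between host and device. A kernel is code that runs on the GPU. Once a kernel has been launched, control returns to the host, and the CPU is free to do other work while the kernel runs on the device. The CUDA programming model is therefore asynchronous: the GPU carries out the parallel computation while the CPU handles the serial parts.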
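As a rough illustration of the paragraph above, the following sketch allocates a buffer with cudaMallocManaged so that the same pointer can be used on the host and on the device without any cudaMemcpy. The kernel scaleByTwo and the helper sumAfterScaling are names made up for this example; they do not come from the book.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Doubles every element in place; one thread per element.
__global__ void scaleByTwo(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: fills a managed buffer with 1.0f on the host, doubles
// it on the device, and returns the sum computed on the host (-1 on error).
float sumAfterScaling(int n) {
  float *data = NULL;
  if (cudaMallocManaged((void **)&data, n * sizeof(float)) != cudaSuccess)
    return -1.0f;
  for (int i = 0; i < n; ++i) data[i] = 1.0f;      // written by the host
  scaleByTwo<<<(n + 255) / 256, 256>>>(data, n);   // updated by the device
  if (cudaDeviceSynchronize() != cudaSuccess) {    // wait before the host reads
    cudaFree(data);
    return -1.0f;
  }
  float sum = 0.0f;
  for (int i = 0; i < n; ++i) sum += data[i];      // read by the host, no copy
  cudaFree(data);
  return sum;
}

int main(void) {
  printf("sum = %.1f\n", sumAfterScaling(1024));
  return 0;
}
```

The cudaDeviceSynchronize call matters here: the launch is asynchronous, so the host should not touch the managed buffer until the kernel has finished.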

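2.1.2 Memory Management

The CUDA programming model assumes that the host and the device each have their own separate memory. The CUDA runtime API provides functions to allocate and free device memory and to transfer data between the two, mirroring the standard C functions: cudaMalloc for malloc, cudaMemcpy for memcpy, cudaMemset for memset, and cudaFree for free. cudaMalloc works much like malloc, except that it allocates memory on the device and returns the pointer through its first argument. cudaMemcpy transfers data between host and device: its first parameter is the destination, the second is the source, the third is the number of bytes to copy, and the fourth is the direction of the copy (for example cudaMemcpyHostToDevice). cudaMemcpy is synchronous: the host program blocks until the transfer has completed.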

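Apart from kernel launches, every CUDA runtime call returns an error code of the enumerated type cudaError_t. The runtime API function cudaGetErrorString converts such a code into a human-readable message.

cudaError_t err = cudaSuccess;
const char *message = cudaGetErrorString(err);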

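Memory Hierarchy

The CUDA programming model exposes a memory hierarchy: a GPU device has several kinds of memory that serve different purposes. The two most important are global memory and shared memory. Global memory is visible to all threads, whereas shared memory belongs to a thread block, and all threads in a block use the same piece of shared memory.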
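To make the difference between the two kinds of memory concrete, here is a minimal sketch in which one thread block stages data from global memory into a __shared__ array, synchronizes, and writes it back in reverse order. The block size of 64 and the names reverseInBlock and firstAfterReverse are assumptions chosen for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 64  // assumed block size for this sketch

// Reverses BLOCK_SIZE ints in global memory, staging them in shared memory.
__global__ void reverseInBlock(int *data) {
  __shared__ int tile[BLOCK_SIZE];     // one copy per thread block
  int t = threadIdx.x;
  tile[t] = data[t];                   // global memory -> shared memory
  __syncthreads();                     // wait until the whole tile is loaded
  data[t] = tile[BLOCK_SIZE - 1 - t];  // shared memory -> global memory
}

// Hypothetical helper: reverses 0..BLOCK_SIZE-1 on the device and returns
// the first element of the result (-1 on error).
int firstAfterReverse(void) {
  int h[BLOCK_SIZE];
  for (int i = 0; i < BLOCK_SIZE; ++i) h[i] = i;
  int *d = NULL;
  if (cudaMalloc((void **)&d, sizeof(h)) != cudaSuccess) return -1;
  cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
  reverseInBlock<<<1, BLOCK_SIZE>>>(d);
  cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
  cudaFree(d);
  return err == cudaSuccess ? h[0] : -1;
}

int main(void) {
  printf("first element after reverse: %d\n", firstAfterReverse());
  return 0;
}
```

The host can only reach the global memory buffer d; the tile array exists only inside the block while the kernel runs.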

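The following program comes from NVIDIA's samples:

https://github.com/ZouJiu1/cuda-samples/blob/master/Samples/0_Introduction/vectorAdd

Data is transferred between host memory and GPU global memory.

Compile it with: nvcc -Xcompiler -std=c99 vectorAdd.cu -o sum, and run it with: ./sum

The compiler options are documented on this page: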

1. Introduction — cuda-compiler-driver-nvcc 12.2 documentation (nvidia.com)

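After the host launches the kernel, control returns immediately to the CPU, which is free to run other functions; kernel launches are therefore asynchronous. The cudaMemcpy call that copies the result from the device back to the host is synchronous and blocks until the transfer has completed. Because it is issued after the kernel in the same default stream, it also waits for the kernel to finish.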

/* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

/**
 * Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>
/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */
__global__ void vectorAdd(const float *A, const float *B, float *C,
                          int numElements) {
  // blockDim.x: number of threads in a thread block
  // blockIdx.x: index of this thread block within the grid
  // threadIdx.x: index of this thread within its thread block
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i < numElements) {  // guard against out-of-range threads
    C[i] = A[i] + B[i] + 0.0f;   // element-wise sum
  }
}

/**
 * Host main routine
 */
int main(void) {
  // Error code to check return values for CUDA calls
  cudaError_t err = cudaSuccess;

  // Print the vector length to be used, and compute its size
  int numElements = 50000;
  size_t size = numElements * sizeof(float);
  printf("[Vector addition of %d elements]\n", numElements);

  // Allocate the host input vector A
  float *h_A = (float *)malloc(size);  // allocate host memory

  // Allocate the host input vector B
  float *h_B = (float *)malloc(size);

  // Allocate the host output vector C
  float *h_C = (float *)malloc(size);

  // Verify that allocations succeeded
  if (h_A == NULL || h_B == NULL || h_C == NULL) {   // check the allocation results
    fprintf(stderr, "Failed to allocate host vectors!\n");
    exit(EXIT_FAILURE);
  }

  // Initialize the host input vectors
  for (int i = 0; i < numElements; ++i) {    // fill with random values
    h_A[i] = rand() / (float)RAND_MAX;
    h_B[i] = rand() / (float)RAND_MAX;
  }

  // Allocate the device input vector A
  float *d_A = NULL;
  err = cudaMalloc((void **)&d_A, size);   // allocate device memory

  if (err != cudaSuccess) {    // check the result
    fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Allocate the device input vector B
  float *d_B = NULL;
  err = cudaMalloc((void **)&d_B, size);    // allocate device memory

  if (err != cudaSuccess) {     // check the result
    fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Allocate the device output vector C
  float *d_C = NULL;
  err = cudaMalloc((void **)&d_C, size);    // allocate device memory

  if (err != cudaSuccess) {     // check the result
    fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Copy the host input vectors A and B in host memory to the device input
  // vectors in
  // device memory
  printf("Copy input data from the host memory to the CUDA device\n");
  err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); // synchronous copy from host to device

  if (err != cudaSuccess) {     // check the result
    fprintf(stderr,
            "Failed to copy vector A from host to device (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

  if (err != cudaSuccess) {
    fprintf(stderr,
            "Failed to copy vector B from host to device (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Launch the Vector Add CUDA Kernel
  int threadsPerBlock = 256; // number of threads per block
  int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock; // number of blocks in the grid
  printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
         threadsPerBlock);
  // the triple angle brackets hold the execution configuration
  vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
  err = cudaGetLastError();

  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Copy the device result vector in device memory to the host result vector
  // in host memory.
  printf("Copy output data from the CUDA device to the host memory\n");
  err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

  if (err != cudaSuccess) {
    fprintf(stderr,
            "Failed to copy vector C from device to host (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Verify that the result vector is correct
  for (int i = 0; i < numElements; ++i) {
    if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
      fprintf(stderr, "Result verification failed at element %d!\n", i);
      exit(EXIT_FAILURE);
    }
  }

  printf("Test PASSED\n");

  // Free device global memory
  err = cudaFree(d_A);   // free the device memory

  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to free device vector A (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  err = cudaFree(d_B);   // free the device memory

  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to free device vector B (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  err = cudaFree(d_C);   // free the device memory

  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to free device vector C (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Free host memory
  free(h_A);            // free the host memory
  free(h_B);
  free(h_C);

  printf("Done\n");
  return 0;
}

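2.1.3 Thread Management

When the host launches a kernel, the device creates a large number of threads to run it. CUDA organizes these threads in a clear two-level hierarchy made up of thread blocks and grids. A grid contains many blocks, and each block contains many threads. All threads in a grid share the same global memory, and each thread block has its own shared memory, which is accessible to the threads of that block but not to other blocks. From compute capability 9.0 onward, however, thread block clusters are available: the shared memory of the blocks in a cluster forms distributed shared memory, so a thread block can access the shared memory of other blocks in the same cluster.

Threads in the same block can cooperate through block-level synchronization and through the block's shared memory. Threads in different blocks cannot cooperate.

Threads tell each other apart using two coordinate variables: blockIdx, the index of the thread block within the grid, and threadIdx, the index of the thread within its block.

blockIdx and threadIdx are built-in kernel variables. When a kernel runs, every thread is given these two coordinates, and they can be used to assign a portion of the data to each thread. Both are of type uint3, a built-in CUDA vector type: a struct holding three unsigned integers that are accessed through the fields x, y and z, that is blockIdx.x, blockIdx.y, blockIdx.z and threadIdx.x, threadIdx.y, threadIdx.z.

CUDA can create grids and blocks with up to three dimensions; the figure below shows a 2D grid of 2D blocks. The size of the grid and of a block can be read from two more built-in variables: blockDim, the number of threads in a block, and gridDim, the number of blocks in the grid, each given per direction. They are of type dim3, an integer vector type based on uint3, again with the fields x, y and z.

[Figure: a 2D grid of 2D thread blocks]

Grid and Block Dimensions

On the host, grid and block are variables of type dim3; any of x, y and z that is not specified defaults to 1. The host should define the grid and block sizes with dim3 before launching a kernel. On the device, the built-in variables blockDim and gridDim (of type dim3) and blockIdx and threadIdx (of type uint3) are initialized automatically. These built-in variables are visible only inside a kernel, not on the host, while the dim3 variables defined on the host are visible only on the host.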

__global__ void printThreadIndex(int *A, const int nx, const int ny)
{
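    // blockDim.x: number of threads of a block in the x direction
    // blockIdx.x: index of the block within the grid
    // threadIdx.x: index of the thread within its block
    int ix = threadIdx.x + blockIdx.x * blockDim.x; // thread index in the x direction
    int iy = threadIdx.y + blockIdx.y * blockDim.y; // thread index in the y direction
    unsigned int idx = iy * nx + ix;  // global linear index; nx is the number of elements in x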

    printf("thread_id (%d,%d) block_id (%d,%d) coordinate (%d,%d) global index"
           " %2d ival %2d\n", threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y,
           ix, iy, idx, A[idx]);
}

The following code checks the dimensions from the host and from the device:

#include "../common/common.h"
#include <cuda_runtime.h>
#include <stdio.h>

/*
 * Display the dimensionality of a thread block and grid from the host and
 * device.
 */

__global__ void checkIndex(void)
{
    printf("threadIdx:(%d, %d, %d)\n", threadIdx.x, threadIdx.y, threadIdx.z);
    printf("blockIdx:(%d, %d, %d)\n", blockIdx.x, blockIdx.y, blockIdx.z);

    printf("blockDim:(%d, %d, %d)\n", blockDim.x, blockDim.y, blockDim.z);
    printf("gridDim:(%d, %d, %d)\n", gridDim.x, gridDim.y, gridDim.z);

}

int main(int argc, char **argv)
{
    // define total data element
    int nElem = 6;

    // define grid and block structure
    dim3 block(3);
    dim3 grid((nElem + block.x - 1) / block.x);

    // check grid and block dimension from host side
    printf("grid.x %d grid.y %d grid.z %d\n", grid.x, grid.y, grid.z);
    printf("block.x %d block.y %d block.z %d\n", block.x, block.y, block.z);

    // check grid and block dimension from device side
    checkIndex<<<grid, block>>>();

    // reset device before you leave
    CHECK(cudaDeviceReset());

    return(0);
}

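Compile with: nvcc -arch=sm_20 checkDimension.cu -o check. By default, nvcc generates code for the lowest GPU architecture it supports. Note that sm_20 is no longer accepted by recent CUDA toolkits; in that case, use the architecture of the installed GPU or omit the flag.

Accessing Grid/Block Variables from the Host and Device

Host-side variables are created with dim3, while the device-side index variables are of type uint3; both have the fields x, y and z. For a given data size, first choose the block size and then compute the grid size from it. The block size must respect the resource limits of the GPU.

The following code computes the grid size for several block sizes: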

#include "../common/common.h"
#include <cuda_runtime.h>
#include <stdio.h>

/*
 * Demonstrate defining the dimensions of a block of threads and a grid of
 * blocks from the host.
 */

int main(int argc, char **argv)
{
    // define total data element
    int nElem = 1024;

    // define grid and block structure
    dim3 block (1024);
    dim3 grid  ((nElem + block.x - 1) / block.x);
    printf("grid.x %d block.x %d \n", grid.x, block.x);

    // reset block
    block.x = 512;
    grid.x  = (nElem + block.x - 1) / block.x;
    printf("grid.x %d block.x %d \n", grid.x, block.x);

    // reset block
    block.x = 256;
    grid.x  = (nElem + block.x - 1) / block.x;
    printf("grid.x %d block.x %d \n", grid.x, block.x);

    // reset block
    block.x = 128;
    grid.x  = (nElem + block.x - 1) / block.x;
    printf("grid.x %d block.x %d \n", grid.x, block.x);

    // reset device before you leave
    CHECK(cudaDeviceReset());

    return(0);
}

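Thread Hierarchy

blockDim and gridDim together define the thread hierarchy, and each can have a different size in the x, y and z directions. Organizing threads this way allows the same code to run on different devices.

2.1.4 Launching a CUDA Kernel

The <<<...>>> syntax specifies the execution configuration of a kernel. A kernel is launched much like an ordinary C function call, with the execution configuration added. The execution configuration takes up to four parameters, in order: the grid dimensions, the block dimensions, the number of bytes of dynamically allocated shared memory per block, and the stream; the last two are optional.

blockIdx.x and threadIdx.x together identify a thread, which ties each thread to a particular element of the data. Kernel launches are asynchronous: control returns to the host as soon as the kernel has been launched. cudaDeviceSynchronize makes the host wait until all previously issued device work has completed, that is, it is a synchronization call.

Some CUDA API functions synchronize implicitly. After calling cudaMemcpy, for example, the host must wait for the copy to finish before it can carry on with other work.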
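Since a launch returns before the kernel runs, errors surface in two places. The sketch below, built around a hypothetical helper launchAndWait and an empty kernel, checks cudaGetLastError right after the launch (for problems such as an invalid execution configuration) and then the return value of cudaDeviceSynchronize (for problems that occur while the kernel executes). The block size 2048 is chosen on the assumption that it exceeds the per-block thread limit of the device.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void emptyKernel(void) {}

// Hypothetical helper: launches one block of threadsPerBlock threads and
// returns the first error encountered, or cudaSuccess.
cudaError_t launchAndWait(int threadsPerBlock) {
  emptyKernel<<<1, threadsPerBlock>>>();   // returns immediately
  cudaError_t err = cudaGetLastError();    // launch-time errors
  if (err != cudaSuccess) return err;
  return cudaDeviceSynchronize();          // blocks until the kernel is done
}

int main(void) {
  printf("256 threads:  %s\n", cudaGetErrorString(launchAndWait(256)));
  printf("2048 threads: %s\n", cudaGetErrorString(launchAndWait(2048)));
  return 0;
}
```

The second call is expected to be rejected at launch time, because the requested block is larger than the device allows.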

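2.1.5 Writing Kernels

A kernel is defined with the __global__ qualifier and must have a void return type. __device__ marks a function that can be called only from the device, and __host__ marks one that can be called only from the host. The __device__ and __host__ qualifiers can be used together, in which case the function is compiled for both the host and the device.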
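The three qualifiers can be seen side by side in the following sketch. All function names (square, squarePlusOne, squarePlusOneKernel, squarePlusOneOnDevice) are invented for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Compiled for both sides: callable from host code and from device code.
__host__ __device__ float square(float x) { return x * x; }

// Device only: callable from kernels and other device functions.
__device__ float squarePlusOne(float x) { return square(x) + 1.0f; }

// Kernel: launched from the host, runs on the device, returns void.
__global__ void squarePlusOneKernel(float *out, float x) {
  *out = squarePlusOne(x);
}

// Hypothetical host wrapper: returns x*x + 1 computed on the device,
// or -1 if a CUDA call fails.
float squarePlusOneOnDevice(float x) {
  float *d_out = NULL;
  float h_out = -1.0f;
  if (cudaMalloc((void **)&d_out, sizeof(float)) != cudaSuccess) return -1.0f;
  squarePlusOneKernel<<<1, 1>>>(d_out, x);
  if (cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost) !=
      cudaSuccess)
    h_out = -1.0f;
  cudaFree(d_out);
  return h_out;
}

int main(void) {
  printf("host:   square(3) = %.1f\n", square(3.0f));
  printf("device: square(3) + 1 = %.1f\n", squarePlusOneOnDevice(3.0f));
  return 0;
}
```

Calling squarePlusOne directly from main would not compile, because it carries only the __device__ qualifier.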

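Restrictions on CUDA Kernels

A kernel can access device memory only; it must return void; it does not support a variable number of arguments; it does not support static variables; and it executes asynchronously with respect to the host.

The following snippets contrast vector addition written as a host loop with the same operation written as a kernel body: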

    for (int idx = 0; idx < N; idx++)
    {
        C[idx] = A[idx] + B[idx];
    }
......
    int i = threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];

2.1.7 Handling Errors

Error checking is usually done with a predefined macro:

#define CHECK(call)                                                            \
{                                                                              \
    const cudaError_t error = call;                                            \
    if (error != cudaSuccess)                                                  \
    {                                                                          \
        fprintf(stderr, "Error: %s:%d, ", __FILE__, __LINE__);                 \
        fprintf(stderr, "code: %d, reason: %s\n", error,                       \
                cudaGetErrorString(error));                                    \
        exit(1);                                                               \
    }                                                                          \
}

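CUDA functions usually return a result value, which tells whether the call succeeded or, if not, which error code it failed with.

2.2.1 Timing with a CPU Timer

The gettimeofday system call returns the system wall-clock time, measured from 00:00 on January 1, 1970, with microsecond resolution. It requires the header sys/time.h and uses the struct timeval. The function below returns the time in seconds; to get milliseconds or microseconds instead, change the return statement accordingly: return ((double)tp.tv_sec * 1e3 + (double)tp.tv_usec * 1.e-3); for ms, and return ((double)tp.tv_sec * 1e6 + (double)tp.tv_usec); for us.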

inline double seconds()
{
    struct timeval tp;
    struct timezone tzp;
    int i = gettimeofday(&tp, &tzp);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
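A CPU timer only measures a kernel correctly if the host waits for the kernel before reading the clock a second time. The sketch below shows that pattern with a seconds() function equivalent to the one above; the kernel fill and the helper timeFill are hypothetical names used only for this example.

```cuda
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

// Wall-clock time in seconds, based on gettimeofday.
static double seconds(void) {
  struct timeval tp;
  gettimeofday(&tp, NULL);
  return (double)tp.tv_sec + (double)tp.tv_usec * 1.e-6;
}

__global__ void fill(float *data, int n, float value) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = value;
}

// Hypothetical helper: returns the elapsed time of one fill kernel over n
// elements in seconds, or -1 if a CUDA call fails.
double timeFill(int n) {
  float *d = NULL;
  if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return -1.0;
  double start = seconds();
  fill<<<(n + 255) / 256, 256>>>(d, n, 1.0f);
  // Without this wait, only the launch overhead would be measured.
  cudaError_t err = cudaDeviceSynchronize();
  double elapsed = seconds() - start;
  cudaFree(d);
  return err == cudaSuccess ? elapsed : -1.0;
}

int main(void) {
  printf("fill kernel took %f s\n", timeFill(1 << 20));
  return 0;
}
```

The first kernel launch in a process also includes one-time initialization costs, so a single measurement like this one should be read as an upper bound.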

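When a thread's index is out of range, the check i < N lets that thread skip the computation.

Knowing Your Limits

The GPU limits the sizes of grids and blocks: there is a maximum number of threads per block and a maximum number of blocks per grid, and so the total number of threads is limited as well.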
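These limits can be read from the device at runtime. The following sketch prints them using the maxThreadsPerBlock, maxThreadsDim and maxGridSize fields of cudaDeviceProp; the helper name printLimits is an assumption made for this example.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: prints the block and grid limits of device dev.
// Returns 0 on success and -1 if the properties cannot be queried.
int printLimits(int dev) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;
  printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("max block dimensions:  (%d, %d, %d)\n", prop.maxThreadsDim[0],
         prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  printf("max grid dimensions:   (%d, %d, %d)\n", prop.maxGridSize[0],
         prop.maxGridSize[1], prop.maxGridSize[2]);
  return 0;
}

int main(void) {
  if (printLimits(0) != 0) {
    fprintf(stderr, "could not query device 0\n");
    return 1;
  }
  return 0;
}
```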

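2.2.2 Timing with nvprof

nvprof is a command-line profiling tool. nvprof --help prints the help text, and nvprof ./obj runs the program and reports the time spent in its CUDA portions. Sometimes data transfer takes longer than the computation itself, in which case transfers between host and device should be reduced; if the computation time is far greater than the transfer time, the transfer time can be neglected.

2.3 Organizing Parallel Threads

A well-chosen grid and block size affects kernel performance, so both need to be tuned to the problem at hand: change how the blocks and the grid are laid out, adjust the block size, and then derive the grid size from the amount of data.

Matrix addition typically uses a 2D grid of 2D blocks, so that the threads as a whole take the shape of the matrix and each thread handles one element.

The ways of organizing the grid and blocks considered here are:

a 2D grid of 2D blocks; a 1D grid of 1D blocks; a 2D grid of 1D blocks.

2.3.1 Indexing a Matrix with Blocks and Threads

In the matrix addition kernel each thread handles one element, so the first step is to use the block and thread indices to locate that element in global memory.

The thread and block indices are first mapped to matrix coordinates. In each direction, blockDim is the number of threads in a block, blockIdx is the index of the block within the grid, and threadIdx is the index of the thread within its block.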

    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;

The matrix coordinates are then mapped to a linear index, that is, a storage location, in global memory:

unsigned int idx = iy * nx + ix;

2.3.2 Summing Matrices with a 2D Grid and 2D Blocks

The key step is mapping the thread indices to a linear index into global memory; nx is the number of elements in each row.

// grid 2D block 2D
__global__ void sumMatrixOnGPU2D(float *MatA, float *MatB, float *MatC, int nx,
                                 int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int idx = iy * nx + ix; 

    if (ix < nx && iy < ny)
        MatC[idx] = MatA[idx] + MatB[idx];
}
......
{
    // invoke kernel at host side
    int dimx = 32;
    int dimy = 32;
    dim3 block(dimx, dimy);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

    iStart = seconds();
    sumMatrixOnGPU2D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);
}

In the book's measurements, the configuration <<<(512, 512), (32, 32)>>> ran slower than <<<(512, 1024), (32, 16)>>>.
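The two grid sizes quoted above follow from the rounding-up division used in the host code. The sketch below reproduces them under the assumption that the matrix is 16384 x 16384, the size for which a 32 x 32 block yields a (512, 512) grid; the helper gridFor is a hypothetical name.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: smallest 2D grid of the given blocks that covers an
// nx-by-ny matrix (integer division rounded up in each direction).
dim3 gridFor(int nx, int ny, dim3 block) {
  return dim3((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
}

int main(void) {
  int nx = 1 << 14;  // assumed matrix width
  int ny = 1 << 14;  // assumed matrix height

  dim3 blockA(32, 32);
  dim3 gridA = gridFor(nx, ny, blockA);
  printf("block (32, 32) -> grid (%u, %u)\n", gridA.x, gridA.y);

  dim3 blockB(32, 16);
  dim3 gridB = gridFor(nx, ny, blockB);
  printf("block (32, 16) -> grid (%u, %u)\n", gridB.x, gridB.y);
  return 0;
}
```

Halving the block height doubles the number of blocks in y, so the second configuration launches twice as many, smaller blocks for the same matrix.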

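2.3.3 Summing Matrices with a 1D Grid and 1D Blocks

Here each thread has to process several elements: every thread handles one column of the matrix, walking through it in the ny direction. Compared with the doubly nested loop of a host-side matrix addition, the loop over columns is replaced by the thread index, while the loop over rows remains inside the kernel.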

// grid 1D block 1D
__global__ void sumMatrixOnGPU1D(float *MatA, float *MatB, float *MatC, int nx,
                                 int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    if (ix < nx )
        for (int iy = 0; iy < ny; iy++)
        {
            int idx = iy * nx + ix;
            MatC[idx] = MatA[idx] + MatB[idx];
        }
}
......
    // invoke kernel at host side
    int dimx = 32;
    dim3 block(dimx, 1);
    dim3 grid((nx + block.x - 1) / block.x, 1);

    iStart = seconds();
    sumMatrixOnGPU1D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);

2.3.4 Summing Matrices with a 2D Grid and 1D Blocks

The blocks are arranged in a 2D grid, but the threads inside each block are arranged in 1D.

// grid 2D block 1D
__global__ void sumMatrixOnGPUMix(float *MatA, float *MatB, float *MatC, int nx,
                                  int ny)
{
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = blockIdx.y;
    unsigned int idx = iy * nx + ix;

    if (ix < nx && iy < ny)
        MatC[idx] = MatA[idx] + MatB[idx];
}
......
    // invoke kernel at host side
    int dimx = 32;
    dim3 block(dimx, 1);
    dim3 grid((nx + block.x - 1) / block.x, ny);

    iStart = seconds();
    sumMatrixOnGPUMix<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);

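As the results in the book show, different configurations take different amounts of time. In summary:

》》Changing the execution configuration of a kernel affects performance

》》A conventional kernel implementation generally does not achieve the best performance

》》For a given kernel, trying different grid and block sizes can yield better performance

2.4 Managing Devices

Querying and managing the device matters because the device's properties are needed to choose a kernel's execution configuration. There are two ways to query them:

》》the CUDA runtime API functions

》》the NVIDIA System Management Interface (nvidia-smi) command-line utility

2.4.1 Querying GPU Information with the Runtime API

The cudaDeviceProp struct holds all the device information returned by cudaGetDeviceProperties.

The file checkDeviceInfor.cu queries many of these properties.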

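2.4.2 Determining the Best GPU

In a system with several GPUs, pick the one with the best performance; a GPU with more cores (more multiprocessors) is usually the better choice.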
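One simple way to apply this rule is to compare the multiProcessorCount field of each device, as in the sketch below. The helper deviceWithMostSMs is a hypothetical name, and using the multiprocessor count as the only criterion is a simplifying assumption.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: returns the ID of the device with the most
// multiprocessors, or -1 if no device can be queried.
int deviceWithMostSMs(void) {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;
  int best = -1;
  int bestSMs = 0;
  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
    if (prop.multiProcessorCount > bestSMs) {
      bestSMs = prop.multiProcessorCount;
      best = dev;
    }
  }
  return best;
}

int main(void) {
  int best = deviceWithMostSMs();
  if (best < 0) {
    fprintf(stderr, "no CUDA device found\n");
    return 1;
  }
  cudaSetDevice(best);  // later CUDA calls in this thread use this device
  printf("using device %d\n", best);
  return 0;
}
```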

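2.4.3 nvidia-smi

nvidia-smi is a command-line tool for managing and monitoring GPU devices; it can query and modify device state.

To find out how many GPUs are installed and the ID of each one, use: nvidia-smi -L

To get detailed information about a particular GPU: nvidia-smi -q -i 0

To trim the output down to device memory information only: nvidia-smi -q -i 0 -d MEMORY | tail -n 6

To show only device utilization: nvidia-smi -q -i 0 -d UTILIZATION | tail -n 6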

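2.4.4 Setting Devices at Runtime

In a system with N GPUs, the device IDs normally run from 0 to N-1. The environment variable CUDA_VISIBLE_DEVICES selects which GPUs are available to a program at runtime. With CUDA_VISIBLE_DEVICES=1, device 1 appears to the application as device 0 and all other GPUs are hidden.

Several GPUs can be listed as well: CUDA_VISIBLE_DEVICES=1,2 maps devices 1 and 2 to IDs 0 and 1 and hides the rest.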
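The effect of the variable can be observed with a small program that lists the devices the runtime can see. This is a sketch; the helper name listVisibleDevices and the binary name in the usage note are assumptions.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: prints every device visible to this process and
// returns how many there are (-1 if the count cannot be queried).
int listVisibleDevices(void) {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;
  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess)
      printf("device %d: %s\n", dev, prop.name);
  }
  return count;
}

int main(void) {
  int count = listVisibleDevices();
  printf("%d visible device(s)\n", count);
  return count < 0 ? 1 : 0;
}
```

If the program is built as ./list, running it as CUDA_VISIBLE_DEVICES=1 ./list on a multi-GPU machine should report a single device with ID 0, which is the physical device 1.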

https://zhuanlan.zhihu.com/p/664543121
