CUDA Optimization

Keywords: optimization. Updated: 2023-10-16

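I implemented pincushion distortion correction in CUDA so that it can run in real time: 3680*2456 image sequences at more than 40 fps.

However, with CUDA on an nVIDIA GeForce GT 610 (2GB DDR3), one frame takes 130 ms.

With the CPU and OpenMP on a Core i7 3.4GHz (quad core), it takes only 60 ms.

Please tell me what I can do to speed it up. Thanks.

The full source code can be downloaded here:
https://drive.google.com/file/d/0B9SEJgsu0G6QX2FpMnRja0o5STA/view?usp=sharing
https://drive.google.com/file/d/0B9SEJgsu0G6QOGNPMmVQLWpSb2c/view?usp=sharing

The code is as follows.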

__global__
void undistort(int N, float k, int width, int height, int depth, int pitch, float R, float L, unsigned char* in_bits, unsigned char* out_bits)
{
    // Get the Index of the Array from GPU Grid/Block/Thread Index and Dimension.
    int i, j;
    i = blockIdx.y * blockDim.y + threadIdx.y;
    j = blockIdx.x * blockDim.x + threadIdx.x;
    // If Out of Array
    if (i >= height || j >= width)
    {
        return;
    }
    // Calculating Undistortion Equation.
    // In CPU, We used Fast Approximation equations of atan and sqrt - It makes 2 times faster.
    // But In GPU, No need to use Approximation Functions as it is faster.
    int cx = width  * 0.5;
    int cy = height * 0.5;
    int xt = j - cx;
    int yt = i - cy;
    float distance = sqrt((float)(xt*xt + yt*yt));
    float r = distance*k / R;
    float theta = 1;
    if (r == 0)
        theta = 1;
    else
        theta = atan(r)/r;
    theta = theta*L;
    float tx = theta*xt + cx;
    float ty = theta*yt + cy;
    // When we correct the frame, its size will be greater than Original.
    // So We should Crop it.
    if (tx < 0)
        tx = 0;
    if (tx >= width)
        tx = width - 1;
    if (ty < 0)
        ty = 0;
    if (ty >= height)
        ty = height - 1;
    // Output the Result.
    int ux = (int)(tx);
    int uy = (int)(ty);
    tx = tx - ux;
    ty = ty - uy;
    unsigned char *p = (unsigned char*)out_bits + i*pitch + j*depth;
    unsigned char *q00 = (unsigned char*)in_bits + uy*pitch + ux*depth;
    unsigned char *q01 = q00 + depth;
    unsigned char *q10 = q00 + pitch;
    unsigned char *q11 = q10 + depth;
    unsigned char newVal[4] = {0};
    for (int k = 0; k < depth; k++)
    {
        newVal[k] = (q00[k]*(1-tx)*(1-ty) + q01[k]*tx*(1-ty) + q10[k]*(1-tx)*ty + q11[k]*tx*ty);
        memcpy(p + k, &newVal[k], 1);
    }
}
void wideframe_correction(char* bits, int width, int height, int depth)
{
    // Find the device.
    // Initialize the nVIDIA Device.
    cudaSetDevice(0);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);
    // This works for Calculating GPU Time.
    cudaProfilerStart();
    // This works for Measuring Total Time
    long int dwTime = clock();
    // Setting Distortion Parameters
    // Note that Multiplying 0.5 works faster than divide into 2.
    int cx = (int)(width * 0.5);
    int cy = (int)(height * 0.5);
    float k = -0.73f;
    float R = sqrt((float)(cx*cx + cy*cy));
    // Set the Radius of the Result.
    float L = (float)(width<height ? width:height);
    L = L/2.0f;
    L = L/R;
    L = L*L*L*0.3333f;
    L = 1.0f/(1-L);
    // Create the GPU Memory Pointers.
    unsigned char* d_img_in = NULL;
    unsigned char* d_img_out = NULL;
    // Allocate the GPU Memory2D with pitch for fast performance.
    size_t pitch;
    cudaMallocPitch( (void**) &d_img_in, &pitch, width*depth, height );
    cudaMallocPitch( (void**) &d_img_out, &pitch, width*depth, height );
    _tprintf(_T("nPitch : %d\n"), pitch);
    // Copy RAM data to VRAM.
    cudaMemcpy2D( d_img_in, pitch, 
            bits, width*depth, width*depth, height, 
            cudaMemcpyHostToDevice );
    cudaMemcpy2D( d_img_out, pitch, 
            bits, width*depth, width*depth, height, 
            cudaMemcpyHostToDevice );
    // Create Variables for Timing
    cudaEvent_t startEvent, stopEvent;
    cudaError_t err = cudaEventCreate(&startEvent, 0);
    assert( err == cudaSuccess );
    err = cudaEventCreate(&stopEvent, 0);
    assert( err == cudaSuccess );
    // Execution of the version using global memory
    float elapsedTime;
    cudaEventRecord(startEvent);
    // Process image
    dim3 dGrid(width / BLOCK_WIDTH + 1, height / BLOCK_HEIGHT + 1);
    dim3 dBlock(BLOCK_WIDTH, BLOCK_HEIGHT);
    undistort<<< dGrid, dBlock >>> (width*height, k,  width, height, depth, pitch, R, L, d_img_in, d_img_out);
    cudaThreadSynchronize();
    cudaEventRecord(stopEvent);
    cudaEventSynchronize( stopEvent );
    // Estimate the GPU Time.
    cudaEventElapsedTime( &elapsedTime, startEvent, stopEvent);
    // Calculate the Total Time.
    dwTime = clock() - dwTime;
    // Save Image data from VRAM to RAM
    cudaMemcpy2D( bits, width*depth, 
        d_img_out, pitch, width*depth, height,
        cudaMemcpyDeviceToHost );
    _tprintf(_T("GPU Processing Time(ms) : %d\n"), (int)elapsedTime);
    _tprintf(_T("VRAM Memory Read/Write Time(ms) : %d\n"), dwTime - (int)elapsedTime);
    _tprintf(_T("Total Time(ms) : %d\n"), dwTime );
    // Free GPU Memory
    cudaFree(d_img_in);
    cudaFree(d_img_out);
    cudaProfilerStop();
    cudaDeviceReset();
}

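I did not read the source code, but there are some limits you cannot get around.

Your GPU has almost the same performance as your CPU:

Adapt the following information to your actual GPU/CPU models.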

Specification   | GPU        | CPU
----------------------------------------
Bandwidth       | 14.4 GB/s  | 25.6 GB/s
Flops (GFLOP/s) | 155 (FMA)  | 135

We can conclude that for memory-bound kernels your GPU will never be faster than your CPU.

GPU information found here: http://www.nvidia.fr/object/geforce-gt-610-fr.html#pdpContent=2

CPU information found here: http://ark.intel.com/products/75123/Intel-Core-i7-4770K-Processor-8M-Cache-up-to-3_90-GHz?q=Intel%20Core%20i7%204770K

and here: http://www.ocaholic.ch/modules/smartsection/item.php?page=6&itemid=1005
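To put the bandwidth row into numbers, here is a rough sketch of the lower bound that memory bandwidth alone places on the time per frame. It assumes every byte of the image is read once and written once, and the 3 bytes per pixel used in the note below is an assumption (the question does not state `depth`). `min_transfer_ms` is a hypothetical helper, not part of the asker's code.

```cpp
#include <cstddef>

// Lower bound (in milliseconds) on the time needed to stream one frame
// through memory, assuming each byte is read once and written once.
// bandwidth_gb_s is the peak memory bandwidth in GB/s (1 GB = 1e9 bytes).
double min_transfer_ms(std::size_t width, std::size_t height,
                       std::size_t depth, double bandwidth_gb_s)
{
    const double bytes = 2.0 * static_cast<double>(width * height * depth);
    return bytes / (bandwidth_gb_s * 1e9) * 1e3;
}
```

With the figures from the table, `min_transfer_ms(3680, 2456, 3, 14.4)` gives about 3.8 ms for the GPU and `min_transfer_ms(3680, 2456, 3, 25.6)` gives about 2.1 ms for the CPU, so the GPU's bandwidth bound is roughly 1.8 times higher than the CPU's.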

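One does not simply optimize code by looking at its source. First, you should use the Nvidia Profiler https://developer.nvidia.com/nvidia-visual-profiler and see which part of your GPU code takes too much time. Before that, however, you may want to write a unit test, so that only the part of the project under investigation is exercised.

In addition, you can use CallGrind http://valgrind.org/docs/manual/cl-manual.html to measure the performance of the CPU code.

In general, it is not surprising that your GPU "optimized" code is slower than the "unoptimized" one. CUDA cores are usually several times slower than a CPU core, and you have to introduce a lot of parallelism before you notice a significant speedup.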

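Edit, in response to your comment:

As a unit test framework I strongly recommend GoogleTest. Here you can learn how to use it. Besides its obvious purpose (testing code), it also lets you run only specific methods of a class interface, which is handy for performance analysis.

Generally speaking, the Nvidia profiler is just a tool that runs your code and tells you how much time each kernel consumes. Please take a look at its documentation.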

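By "a lot of parallelism" I mean the following. On your processor you can run 8 threads at 3.4GHz. Your GPU has one SM (streaming multiprocessor) clocked at 810MHz with, let's say, 1024 threads per SM (I don't have the exact figures, but you can run Nvidia's deviceQuery sample to find the exact parameters). So if your GPU code can run only about (3.4*8)/0.81 = 33 computations in parallel, you gain nothing: the CPU and GPU code will take the same time (ignoring caches and GPU memory copies, which are expensive). Conclusion: your GPU code must be able to compute at least ~40 operations in parallel to show any speedup at all.

On the other hand, suppose you could exploit the GPU's full potential and keep all 1024 threads on the SM busy all the time. In that case your code would run only (0.81*1024)/(8*3.4) = 30 times faster (approximately; remember that we are ignoring GPU cache operations), and in most cases even that is out of reach, because you usually cannot parallelize serial code that efficiently.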
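The back-of-the-envelope estimate above can be written as two small functions. This is only a sketch of the answer's simplified model, in which throughput is taken to be thread count times clock rate; the function names are hypothetical, and the model ignores memory traffic, caches, and per-thread efficiency.

```cpp
// Number of GPU threads that must be kept busy for the GPU to merely match
// the CPU, under the simplification throughput = threads * clock (GHz).
double break_even_gpu_threads(double cpu_threads, double cpu_ghz, double gpu_ghz)
{
    return cpu_threads * cpu_ghz / gpu_ghz;
}

// Best-case speedup of the GPU over the CPU if all gpu_threads stay busy,
// under the same simplification.
double ideal_speedup(double gpu_threads, double gpu_ghz,
                     double cpu_threads, double cpu_ghz)
{
    return gpu_threads * gpu_ghz / (cpu_threads * cpu_ghz);
}
```

With the numbers used in the answer, `break_even_gpu_threads(8, 3.4, 0.81)` is about 33.6 and `ideal_speedup(1024, 0.81, 8, 3.4)` is about 30.5. The 1024 threads per SM is the answer's own guess, so the real values should come from deviceQuery.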

Good luck with your research!

Yes, make full use of nvprof; it is a great tool.

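What I can see from your code:
1. Consider using linear thread blocks instead of flat blocks; this can save some integer arithmetic.
2. Manually correcting image borders and/or thread indices causes a lot of divergence and/or hurts coalescing. Consider using texture fetches and/or pre-padded data.
3. A memcpy of a single value inside a kernel is generally a bad idea.
4. Try to minimize type conversions.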
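Points 3 and 4 can be illustrated with a sketch of the interpolation step. It is an assumption about how the end of the kernel might be rewritten, not the asker's code: `clamp_int`, `clamp_float` and `bilinear_store` are hypothetical names, and the `__CUDACC__` guard is there only so the same source also builds with an ordinary C++ compiler for testing. The sketch stores each channel directly instead of calling memcpy, keeps the arithmetic in float, and rounds to nearest where the original truncates. It also clamps the neighbouring pixel coordinates, which the original does not do for the last row and column.

```cpp
#ifndef __CUDACC__   // plain C++ build: make the CUDA qualifiers no-ops
#define __host__
#define __device__
#endif

__host__ __device__ inline int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

__host__ __device__ inline float clamp_float(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

// Bilinearly sample `in` at (tx, ty) and write `depth` channels to `out`.
// depth = bytes per pixel, pitch = bytes per row of `in`.
__host__ __device__ inline void bilinear_store(
    const unsigned char* in, unsigned char* out,
    int width, int height, int depth, int pitch, float tx, float ty)
{
    // Keep the sample position inside the image.
    tx = clamp_float(tx, 0.0f, (float)(width - 1));
    ty = clamp_float(ty, 0.0f, (float)(height - 1));

    int x0 = (int)tx;
    int y0 = (int)ty;
    int x1 = clamp_int(x0 + 1, 0, width - 1);   // neighbours stay in bounds
    int y1 = clamp_int(y0 + 1, 0, height - 1);

    // Fractional parts and the four weights, computed once in float.
    float fx = tx - (float)x0;
    float fy = ty - (float)y0;
    float w00 = (1.0f - fx) * (1.0f - fy);
    float w01 = fx * (1.0f - fy);
    float w10 = (1.0f - fx) * fy;
    float w11 = fx * fy;

    const unsigned char* q00 = in + y0 * pitch + x0 * depth;
    const unsigned char* q01 = in + y0 * pitch + x1 * depth;
    const unsigned char* q10 = in + y1 * pitch + x0 * depth;
    const unsigned char* q11 = in + y1 * pitch + x1 * depth;

    for (int c = 0; c < depth; ++c) {
        float v = q00[c] * w00 + q01[c] * w01 + q10[c] * w10 + q11[c] * w11;
        out[c] = (unsigned char)(v + 0.5f);     // direct store, rounded to nearest
    }
}
```

In a kernel, the call would replace the pointer setup and the loop at the end of `undistort`, for example `bilinear_store(in_bits, out_bits + i*pitch + j*depth, width, height, depth, pitch, tx, ty);`. Whether this measurably helps on a GT 610 should be checked with the profiler.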