CUDA program cannot achieve the same computation accuracy as the CPU program for float type

I am trying to use a GPU to accelerate a program that computes the L2 distance between two float arrays. To check the computation accuracy, I wrote both a CUDA program and a CPU program. However, I found the total error to be over 200, which I don't understand. I am using the float type in both cases, and I believe I should get the same result. My code is as follows.

#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>
#include <math.h>
// #include <helper_functions.h>
#define VECTORDIM 3

double cpuSecond()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double) tp.tv_sec + (double)tp.tv_usec*1e-6);
}
void DistanceCPU(float* array1, float* array2, int narray1, int narray2, float* output)
{
    float temp;
    for (int i = 0; i < narray1; i++)
    {
        for (int j = 0; j < narray2; j++)
        {
            temp = 0;
            for (int l = 0; l < VECTORDIM; l++)
            {
                temp += powf(array1[i + l * narray1] - array2[j + l * narray2], 2); 
            }
            output[i * narray2 + j] = temp;
        }
    }
}
__global__ void DistGPU(float* array1, float* array2, int narray1, int narray2, float* output)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    float temp;
    if (i < narray1)
    {
        for (int j = 0; j < narray2; j++)
        {
            temp = 0;
            temp += powf(array1[i] - array2[j], 2);
            temp += powf(array1[i + narray1] - array2[j + narray2], 2);
            temp += powf(array1[i + 2 * narray1] - array2[j + 2 * narray2], 2);
            output[i * narray2 + j] = temp;
        }
    }
}
int main()
{
    int narray1 = 7000;
    int narray2 = 60000;
    float* array1 = new float[narray1 * VECTORDIM];
    float* array2 = new float[narray2 * VECTORDIM];
    float* outputGPU = new float[narray1 * narray2];
    float* outputCPU = new float[narray1 * narray2];
    float* outputCPUTest = new float[narray1 * narray2];
    float* d_array1;
    float* d_array2;
    float* d_output;
    for (int i = 0; i < narray1 * VECTORDIM; i++)
    {
        array1[i] = static_cast<float> (rand() / (static_cast<float> (RAND_MAX / 10)));
        // std::cout << "Element " << i << " " << array1[i] << std::endl;
    }
    for (int i = 0; i < narray2 * VECTORDIM; i++)
    {
        array2[i] = static_cast<float> (rand() / (static_cast<float> (RAND_MAX / 10)));
    }
    cudaError_t err;
    err = cudaMalloc((void**)&d_array1, narray1 * VECTORDIM * sizeof(float));
    err = cudaMalloc((void**)&d_array2, narray2 * VECTORDIM * sizeof(float));
    err = cudaMalloc((void**)&d_output, narray1 * narray2 * sizeof(float));
    err = cudaMemcpy(d_array1, array1, narray1 * VECTORDIM * sizeof(float), cudaMemcpyHostToDevice);
    err = cudaMemcpy(d_array2, array2, narray2 * VECTORDIM * sizeof(float), cudaMemcpyHostToDevice);
    int threadsPerBlock = 512;
    int blocksPerGrid = (narray1 + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threadsn", blocksPerGrid, threadsPerBlock);
    double iStart = cpuSecond();
    DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
    double iElaps = cpuSecond() - iStart;
    err = cudaMemcpy(outputGPU, d_output, narray1 * narray2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("Total computation time is %lf n" , iElaps);
    DistanceCPU(array1, array2, narray1, narray2, outputCPU);
    float error = 0;
    for (long i = 0; i < narray1 * narray2; i++)
    {
        error += abs(outputCPU[i] - outputGPU[i]);
    }
    error /= (narray2 * narray1);
    for (int i = 0; i < 20; i++)
    {
        printf("CPU result %f n", outputCPU[i]);
        printf("GPU result %f n", outputGPU[i]);
    }
    printf("Error is %f n", error);
    delete [] array1;
    delete [] array2;
    delete [] outputCPU;
    delete [] outputGPU;
    return 0;
}

I tried printing some of the computed results from the CPU and the GPU. I got the following output.

CPU result 84.315201 
GPU result 84.315193 
CPU result 48.804039 
GPU result 48.804039 
CPU result 26.388403 
GPU result 26.388403 
CPU result 150.009735 
GPU result 150.009750 

I believe float precision should be sufficient, and I don't know what the real problem is.

I would say the principal contributor here is the use of the powf function. A particular math function on the GPU is not guaranteed to have the same accuracy as the same math function in CPU code. I can't say whether that is a sufficient or even applicable description here, because we would probably also have to discuss which CPU compiler you are using as well as the compile switches/settings. The error possibilities for the GPU math functions are documented in the CUDA programming guide.
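
If you want to isolate the powf behaviour, a minimal check along the following lines (a sketch of mine; the array size, input range, and launch configuration are arbitrary choices) squares the same inputs with powf on both the host and the device and counts mismatches:

#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Square each element with powf, mirroring the use of powf in the kernel above.
__global__ void squareWithPowf(const float* in, float* out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = powf(in[i], 2);
}

int main()
{
    const int n = 1024;
    float h_in[1024], h_gpu[1024];
    for (int i = 0; i < n; i++)
        h_in[i] = static_cast<float>(rand()) / RAND_MAX * 10.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    squareWithPowf<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_gpu, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Count elements where the device powf result differs from the host powf result.
    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (h_gpu[i] != powf(h_in[i], 2))
            mismatches++;
    printf("powf(x, 2): %d of %d results differ between host and device\n", mismatches, n);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}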

However, if performance is what you are interested in, I don't think there is much point in using pow or powf to square things anyway. And since you are asking a question about GPUs, I assume you are interested in performance.

If we replace the use of the powf function with ordinary squaring, the GPU results become much closer to the CPU results, as far as I can see.

Results of running the code as-is on CUDA 10.0, Tesla P100, CentOS 7, gcc 4.8.5:

$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000038
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970345
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145151
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315193
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009750
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092339
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000001

Modified code, replacing powf with ordinary squaring:

$ cat t415.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>
#include <math.h>
// #include <helper_functions.h>
#define VECTORDIM 3
typedef float mt;
double cpuSecond()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double) tp.tv_sec + (double)tp.tv_usec*1e-6);
}
void DistanceCPU(mt* array1, mt* array2, int narray1, int narray2, mt* output)
{
    mt temp;
    for (int i = 0; i < narray1; i++)
    {
        for (int j = 0; j < narray2; j++)
        {
            temp = 0;
            for (int l = 0; l < VECTORDIM; l++)
            {
#ifndef USE_POW
                temp += (array1[i + l * narray1] - array2[j + l * narray2])*(array1[i + l * narray1] - array2[j + l * narray2]);
#else
                temp += powf(array1[i + l * narray1] - array2[j + l * narray2], 2);
#endif
            }
            output[i * narray2 + j] = temp;
        }
    }
}
__global__ void DistGPU(mt* array1, mt* array2, int narray1, int narray2, mt* output)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    mt temp;
    if (i < narray1)
    {
        for (int j = 0; j < narray2; j++)
        {
            temp = 0;
#ifndef USE_POW
            temp += (array1[i] - array2[j])*(array1[i] - array2[j]);
            temp += (array1[i + narray1] - array2[j + narray2])*(array1[i + narray1] - array2[j + narray2]);
            temp += (array1[i + 2 * narray1] - array2[j + 2 * narray2])*(array1[i + 2 * narray1] - array2[j + 2 * narray2]);
#else
            temp += powf(array1[i] - array2[j], 2);
            temp += powf(array1[i + narray1] - array2[j + narray2], 2);
            temp += powf(array1[i + 2 * narray1] - array2[j + 2 * narray2], 2);
#endif
            output[i * narray2 + j] = temp;
        }
    }
}
int main()
{
    int narray1 = 7000;
    int narray2 = 60000;
    mt* array1 = new mt[narray1 * VECTORDIM];
    mt* array2 = new mt[narray2 * VECTORDIM];
    mt* outputGPU = new mt[narray1 * narray2];
    mt* outputCPU = new mt[narray1 * narray2];
    mt* outputCPUTest = new mt[narray1 * narray2];
    mt* d_array1;
    mt* d_array2;
    mt* d_output;
    for (int i = 0; i < narray1 * VECTORDIM; i++)
    {
        array1[i] = static_cast<mt> (rand() / (static_cast<mt> (RAND_MAX / 10)));
        // std::cout << "Element " << i << " " << array1[i] << std::endl;
    }
    for (int i = 0; i < narray2 * VECTORDIM; i++)
    {
        array2[i] = static_cast<mt> (rand() / (static_cast<mt> (RAND_MAX / 10)));
    }
    cudaError_t err;
    err = cudaMalloc((void**)&d_array1, narray1 * VECTORDIM * sizeof(mt));
    err = cudaMalloc((void**)&d_array2, narray2 * VECTORDIM * sizeof(mt));
    err = cudaMalloc((void**)&d_output, narray1 * narray2 * sizeof(mt));
    err = cudaMemcpy(d_array1, array1, narray1 * VECTORDIM * sizeof(mt), cudaMemcpyHostToDevice);
    err = cudaMemcpy(d_array2, array2, narray2 * VECTORDIM * sizeof(mt), cudaMemcpyHostToDevice);
    int threadsPerBlock = 512;
    int blocksPerGrid = (narray1 + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threadsn", blocksPerGrid, threadsPerBlock);
    double iStart = cpuSecond();
    DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
    double iElaps = cpuSecond() - iStart;
    err = cudaMemcpy(outputGPU, d_output, narray1 * narray2 * sizeof(mt), cudaMemcpyDeviceToHost);
    printf("Total computation time is %lf n" , iElaps);
    DistanceCPU(array1, array2, narray1, narray2, outputCPU);
    mt error = 0;
    for (long i = 0; i < narray1 * narray2; i++)
    {
        error += abs(outputCPU[i] - outputGPU[i]);
    }
    error /= (narray2 * narray1);
    for (int i = 0; i < 20; i++)
    {
        printf("CPU result %f n", outputCPU[i]);
        printf("GPU result %f n", outputGPU[i]);
    }
    printf("Error is %f n", error);
    delete [] array1;
    delete [] array2;
    delete [] outputCPU;
    delete [] outputGPU;
    return 0;
}
$ nvcc -o t415 t415.cu
t415.cu(87): warning: variable "err" was set but never used
$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000042
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970348
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145149
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315201
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009735
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092331
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000000

A few notes:

  • There are still a few differences that I have not investigated. The GPU may perform FMA contraction differently from the CPU code. The next step in the analysis would be to compare float and double computation, to establish which numbers are closer to the correct result. In some situations the GPU produces a number that is closer to the correct result than the corresponding CPU code, so simply assuming that the CPU code is correct and then asking for an explanation of why the GPU code differs is not always the right approach. Here is an example of that kind of mistake.
  • If we consider the ordinary-squaring version, it is not obvious to me that this code does or needs to have any difference in floating-point operation ordering between the CPU and GPU versions, so I don't think floating-point (lack of) associativity is the primary consideration here. However, I don't have a definitive explanation for the remaining differences; more work would be needed (see the previous item).
  • At least on the GPU, ordinary squaring is likely to be faster than powf( ,2).
  • The timing measurement you are doing on the GPU code only captures the kernel launch overhead. Kernel launches are asynchronous. To capture the full kernel execution duration, add a cudaDeviceSynchronize(); call to the timed region, immediately after the kernel call; a minimal sketch follows this list.
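
For the last point, the adjustment to the timing region could look like this (a sketch that reuses the variable names from the code above; only the synchronization call is new):

    double iStart = cpuSecond();
    DistGPU<<<blocksPerGrid, threadsPerBlock>>>(d_array1, d_array2, narray1, narray2, d_output);
    // Kernel launches are asynchronous; without this call, iElaps only measures
    // the launch overhead, not the actual kernel execution time.
    cudaDeviceSynchronize();
    double iElaps = cpuSecond() - iStart;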

EDIT: Courtesy of @njuffa, who reminded me to check what happens without FMA contraction: if we recompile the previously modified code with -fmad=false, we observe identical results between the GPU and the CPU. This means that FMA contraction (on the GPU side) is likely the final contributor to the few remaining differences noted above. As mentioned in njuffa's comment, FMA contraction is likely to produce higher-accuracy results, so a possible explanation here is that the GPU results (with FMA contraction, as shown previously) are probably more accurate than the CPU results. Again, switching to double precision would help to confirm this, and the code is already set up to make that easy via a typedef change. In any event, here is the output of the previous code (float, with ordinary squaring) compiled with -fmad=false:

$ nvcc -o t415 t415.cu -fmad=false
t415.cu(87): warning: variable "err" was set but never used
$ ./t415
CUDA kernel launch with 14 blocks of 512 threads
Total computation time is 0.000039
CPU result 28.795628
GPU result 28.795628
CPU result 50.995567
GPU result 50.995567
CPU result 46.970348
GPU result 46.970348
CPU result 29.031254
GPU result 29.031254
CPU result 111.297745
GPU result 111.297745
CPU result 19.145151
GPU result 19.145151
CPU result 20.508183
GPU result 20.508183
CPU result 133.916077
GPU result 133.916077
CPU result 84.315201
GPU result 84.315201
CPU result 48.804039
GPU result 48.804039
CPU result 26.388403
GPU result 26.388403
CPU result 150.009735
GPU result 150.009735
CPU result 108.421936
GPU result 108.421936
CPU result 73.092339
GPU result 73.092339
CPU result 79.486023
GPU result 79.486023
CPU result 89.990150
GPU result 89.990150
CPU result 20.142567
GPU result 20.142567
CPU result 43.482445
GPU result 43.482445
CPU result 29.460800
GPU result 29.460800
CPU result 86.545860
GPU result 86.545860
Error is 0.000000
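
As noted above, the float-versus-double comparison only needs the existing typedef to be changed; independently of that, the error accumulation could be widened to double so that small per-element differences are not rounded away during summation. A minimal sketch of both changes against the code above (everything else unchanged; the double accumulation is my suggestion, not part of the original code):

typedef double mt;   // was: typedef float mt;

    // Accumulate the error in double regardless of what mt is.
    double error = 0;
    for (long i = 0; i < (long)narray1 * narray2; i++)
    {
        error += fabs((double)outputCPU[i] - (double)outputGPU[i]);
    }
    error /= ((double)narray2 * narray1);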