Mixing Thrust and cuBLAS unexpected results in output

Keywords: unexpected results, output, cuBLAS, mixing. Updated: 2023-10-16

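I like the Thrust library, especially how nicely it hides the complexity of cudaMalloc, cudaFree, and so on.

I want to sum all the columns of a matrix. So I used cuBLAS's "cublasSgemv" and multiplied the matrix by a vector of ones. Here is my code: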

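#include <iostream>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <thrust/device_malloc.h>
#include <thrust/device_vector.h>

void sEarColSum(std::vector<float>& inMatrix, int colSize)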
{
    cublasHandle_t handle; // CUBLAS context
    float al = 1.0f; // al =1
    float bet = 1.0f; // bet =1
    int rowSize = inMatrix.size() / colSize;
    float *devOutputPtr = thrust::raw_pointer_cast(thrust::device_malloc<float>(colSize));
    thrust::device_vector<float> deviceT2DMatrix(inMatrix.begin(), inMatrix.end());
    float* device2DMatrixPtr = thrust::raw_pointer_cast(deviceT2DMatrix.data());
    thrust::device_vector<float> deviceVector(rowSize, 1.0f);
    float* deviceVecPtr = thrust::raw_pointer_cast(deviceVector.data());
    cublasCreate(&handle);
    cublasSgemv(handle, CUBLAS_OP_N, colSize, rowSize, &al, device2DMatrixPtr, colSize, deviceVecPtr, 1, &bet, devOutputPtr, 1);
    std::vector<float> outputVec(colSize);
    cudaMemcpy(outputVec.data(), devOutputPtr, outputVec.size() * sizeof(float), cudaMemcpyDeviceToHost);
    for (auto elem : outputVec)
        std::cout << elem << std::endl;
}

int main(void)
{
    std::vector<float> temp(100, 1); // a vector of 100 elements, each 1
    sEarColSum(temp, 10); // the vector is treated as 10 columns and 100/10 = 10 rows,
    // so I expect an output vector of 10 elements, each with the value 10
}

Unfortunately the results are just garbage. I expected a vector of ten elements, each with the value 10. But what I get is:

30
30
-2.80392e+036
30
30
-4.95176e+029
30
6.64319e+016
-3.72391e+037
30

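What am I missing, and where does my code go wrong?
Secondly, is there any way to inspect, for example, "float* device2DMatrixPtr" with a debugger? Visual Studio shows its address, but since it points to GPU memory, it does not show the data stored at that address.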

The cuBLAS function gemv computes a matrix-vector product:

y = alpha*A*x + beta*y

The y in the above equation is represented by devOutputPtr, which you allocate like this:

float *devOutputPtr = thrust::raw_pointer_cast(thrust::device_malloc<float>(colSize));

An ordinary Thrust allocation like this:

thrust::device_vector<float> my_vec...

both allocates and initializes storage, but thrust::device_malloc only allocates storage; it does not initialize it.

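So your y "vector" initially contains garbage. If you set beta to zero, that would not matter. But since your beta is set to 1, the contents of this uninitialized region are added to your result vector.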
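To make the effect of beta concrete, here is a minimal host-side sketch of the same formula in plain C++. The name gemvReference is made up for illustration; this is not cuBLAS code, only a model of what the formula computes for a column-major matrix without transposition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of y = alpha*A*x + beta*y (no transpose).
// A is m x n, stored column-major with leading dimension m,
// which is the layout cuBLAS assumes. Illustration only.
void gemvReference(int m, int n, float alpha, const std::vector<float>& A,
                   const std::vector<float>& x, float beta, std::vector<float>& y)
{
    assert(A.size() == static_cast<std::size_t>(m) * n);
    assert(x.size() == static_cast<std::size_t>(n));
    assert(y.size() == static_cast<std::size_t>(m));

    for (int i = 0; i < m; ++i) {
        float dot = 0.0f;
        for (int j = 0; j < n; ++j)
            dot += A[i + j * m] * x[j];   // element (i, j) of a column-major matrix

        // The cuBLAS documentation states that y need not be a valid input
        // when beta is zero, so the old value is not read in that case.
        y[i] = (beta == 0.0f) ? alpha * dot : alpha * dot + beta * y[i];
    }
}
```

With a 10 x 10 matrix of ones, x all ones, and y pre-filled with 5.0f to stand in for leftover memory, beta = 1.0f yields 15 in every element, while beta = 0.0f yields 10 regardless of what y held before the call.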

If you set

float bet = 0.0f;

I think you will get the expected results with that change.

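Regarding this question:

Secondly, is there any way to inspect, for example, "float* device2DMatrixPtr" with a debugger?

You can print the values in deviceT2DMatrix directly using e.g. printf or std::cout. Thrust will copy the values device->host for you "under the hood" to make this convenient. If you want to access the device copy in a debugger, use the device debugging features of Nsight VSE on Windows, or of Nsight EE or cuda-gdb on Linux.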