Calculating matrix determinants with the cuBLAS device API

Keywords: determinant calculation, API, cuBLAS, device. Last updated: 2023-10-16

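I am trying to evaluate a scalar function f(x), where x is a k-dimensional vector (i.e. f: R^k -> R). During the evaluation I have to perform many matrix operations: inversion, multiplication, and finding the determinants and traces of moderately sized matrices (most of them smaller than 30x30). I now want to evaluate the function at many different x's at the same time, using different threads on the GPU. That is why I need the device API.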

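I have written the following code to test computing a matrix determinant with the cuBLAS device API function cublasSgetrfBatched: I first find the LU decomposition of the matrix and then compute the product of all the diagonal elements of the U matrix. I do this both in a GPU thread and on the CPU, using the result returned by cuBLAS in each case. However, while the CPU result is correct, the GPU result makes no sense. I have run cuda-memcheck and it found no errors. Could someone help shed some light on this issue? Many thanks.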

$ cat test2.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

__host__ __device__ unsigned int IDX(unsigned int i,unsigned  int j,unsigned int ld){return j*ld+i;}
#define PERR(call) \
  if (call) {\
   fprintf(stderr, "%s:%d Error [%s] on "#call"\n", __FILE__, __LINE__,\
      cudaGetErrorString(cudaGetLastError()));\
   exit(1);\
  }
#define ERRCHECK \
  if (cudaPeekAtLastError()) { \
    fprintf(stderr, "%s:%d Error [%s]\n", __FILE__, __LINE__,\
       cudaGetErrorString(cudaGetLastError()));\
    exit(1);\
  }
__device__ float
det_kernel(float *a_copy,unsigned int *n,cublasHandle_t *hdl){
  int *info = (int *)malloc(sizeof(int));info[0]=0;
  int batch=1;int *p = (int *)malloc(*n*sizeof(int));  
  float **a = (float **)malloc(sizeof(float *));
  *a = a_copy;  
  cublasStatus_t status=cublasSgetrfBatched(*hdl, *n, a, *n, p, info, batch);  
  unsigned int i1;
  float res=1;
  for(i1=0;i1<(*n);++i1)res*=a_copy[IDX(i1,i1,*n)];
  return res;
}
__global__ void runtest(float *a_i,unsigned int n){
  cublasHandle_t hdl;cublasCreate_v2(&hdl);
  printf("det on GPU:%f\n",det_kernel(a_i,&n,&hdl));
  cublasDestroy_v2(hdl);
}
int
main(int argc, char **argv)
{
  float a[] = {
    1,   2,   3,
    0,   4,   5,
    1,   0,   0};
  cudaSetDevice(1);//GTX780Ti on my machine,0 for GTX1080
  unsigned int n=3,nn=n*n;
  printf("a is \n");
  for (int i = 0; i < n; ++i){
    for (int j = 0; j < n; j++) printf("%f, ",a[IDX(i,j,n)]);
    printf("\n");}
  float *a_d;
  PERR(cudaMalloc((void **)&a_d, nn*sizeof(float)));
  PERR(cudaMemcpy(a_d, a, nn*sizeof(float), cudaMemcpyHostToDevice));
  runtest<<<1, 1>>>(a_d,n);
  cudaDeviceSynchronize();
  ERRCHECK;
  PERR(cudaMemcpy(a, a_d, nn*sizeof(float), cudaMemcpyDeviceToHost));
  float res=1;
  for (int i = 0; i < n; ++i)res*=a[IDX(i,i,n)];
  printf("det on CPU:%f\n",res);
}
$ nvcc -arch=sm_35 -rdc=true -o test test2.cu -lcublas_device -lcudadevrt
$ ./test
a is 
1.000000, 0.000000, 1.000000, 
2.000000, 4.000000, 0.000000, 
3.000000, 5.000000, 0.000000, 
det on GPU:0.000000
det on CPU:-2.000000

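cuBLAS device calls are asynchronous.

That means they return control to the calling thread before the cuBLAS call has finished.

If you want the calling thread to be able to work on the results directly (as you do here when computing res), you must force a synchronization to wait for the results before beginning the computation.

You don't see this in the host-side computation because there is an implicit synchronization of all device activity (including the dynamic parallelism launched by the cuBLAS device API) before the parent kernel terminates.

So if you add a synchronization after the device cuBLAS call, like this: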

cublasStatus_t status=cublasSgetrfBatched(*hdl, *n, a, *n, p, info, batch); 
cudaDeviceSynchronize(); // add this line

I think you will then see the device calculation match the host calculation.
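
One further point about the method itself, separate from the synchronization issue: getrf-style routines factor with partial pivoting, so the product of U's diagonal equals the determinant only up to the sign of the row permutation. The host-side sketch below illustrates this in plain C. The names lu_factor and det_from_lu are hypothetical helpers written for this illustration, not part of cuBLAS, and the pivot array is assumed to follow the 1-based LAPACK convention (ipiv[k] is the row that was swapped with row k+1).

```c
#include <assert.h>
#include <math.h>

/* Column-major index, same convention as IDX() in the question. */
int idx(int i, int j, int ld) { return j * ld + i; }

/* Hypothetical helper: in-place LU factorization with partial pivoting.
 * On return, a holds the multipliers of L below the diagonal and U on and
 * above it; ipiv[k] is the 1-based row that was swapped with row k.
 * Returns 0 on success, or k+1 if column k has no nonzero pivot
 * (the matrix is singular and its determinant is 0). */
int lu_factor(float *a, int *ipiv, int n)
{
    assert(n > 0);
    for (int k = 0; k < n; ++k) {
        /* pick the largest-magnitude entry at or below the diagonal */
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (fabsf(a[idx(i, k, n)]) > fabsf(a[idx(p, k, n)])) p = i;
        ipiv[k] = p + 1;
        if (a[idx(p, k, n)] == 0.0f) return k + 1;
        if (p != k) {
            /* swap full rows k and p */
            for (int j = 0; j < n; ++j) {
                float t = a[idx(k, j, n)];
                a[idx(k, j, n)] = a[idx(p, j, n)];
                a[idx(p, j, n)] = t;
            }
        }
        /* eliminate below the pivot, storing the multipliers in place */
        for (int i = k + 1; i < n; ++i) {
            a[idx(i, k, n)] /= a[idx(k, k, n)];
            for (int j = k + 1; j < n; ++j)
                a[idx(i, j, n)] -= a[idx(i, k, n)] * a[idx(k, j, n)];
        }
    }
    return 0;
}

/* Hypothetical helper: determinant from the factors produced above.
 * Each actual row swap (ipiv[k] != k+1) flips the sign once. */
float det_from_lu(const float *lu, const int *ipiv, int n)
{
    float det = 1.0f;
    for (int k = 0; k < n; ++k) {
        det *= lu[idx(k, k, n)];
        if (ipiv[k] != k + 1) det = -det;
    }
    return det;
}
```

For the 3x3 matrix in the question, this pivoting strategy performs two row swaps, so the signs cancel and the plain diagonal product already gives -2. For a matrix such as [[0, 1], [1, 0]], a single swap occurs: the diagonal product is 1 while the determinant is -1, so the sign correction matters in general.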