Mex Cuda Dynamic Allocation / Slow mex code

Keywords: Mex, slow code, dynamic allocation, Cuda. Updated: 2023-10-16

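I have CUDA/C++ code that returns C++ host-side arrays. I want to manipulate these arrays in MATLAB, so I rewrote my code in mex format and compiled it with mex.

I got it to work by passing preallocated arrays from MATLAB to the mex script, but this slowed things down dramatically (54 seconds vs. 14 seconds).

Here is a simplified, no-input, one-output version of my code that shows the slow solution: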

#include "mex.h"
#include "gpu/mxGPUArray.h"
#include "matrix.h"
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include "curand.h"
#include <cuda_runtime.h>
#include "math.h"
#include <curand_kernel.h>
#include <time.h>
#include <algorithm>
#include <iostream>
#define iterations 159744
#define transMatrixSize 2592 // Just for clarity. Do not change. No need to adjust this value for this simulation.
#define reps 1024 // Is equal to blocksize. Do not change without proper source code adjustments.
#define integralStep 13125  // Number of time steps to be averaged at the tail of the Force-Time curves to get Steady State Force
__global__ void kern(float *masterForces, ...)
{
int globalIdx = ((blockIdx.x + (blockIdx.y * gridDim.x)) * (blockDim.x * blockDim.y)) + (threadIdx.x + (threadIdx.y * blockDim.x));
...
  ...
   {
...
      {
          masterForces[i] = buffer[0]/24576.0;
      }
      }
   }
...
}

}

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[])
{
   ...
plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);

//Device input vectors
float *d_F0;
..
// Allocate memory for each vector on GPU
cudaMalloc((void**)&d_F0, iterations * sizeof(float));
...


//////////////////////////////////////////////LAUNCH ////////////////////////////////////////////////////////////////////////////////////
kern<<<1, 1024>>>( d_F0);

//////////////////////////////////////////////RETRIEVE DATA ////////////////////////////////////////////////////////////////////////////////////

cudaMemcpyAsync( h_F0 , d_F0 , iterations * sizeof(float), cudaMemcpyDeviceToHost);

///////////////////Free Memory///////////////////

cudaDeviceReset();
////////////////////////////////////////////////////
}

Why is it so slow?

Edit: mex was compiling for an older architecture (SM_13) instead of SM_35. Now the timings make sense.
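One way to spot this kind of mismatch from inside the code is to compare the PTX version a kernel was generated from with the compute capability of the device. The following is only a minimal sketch, assuming the CUDA runtime API (`cudaFuncGetAttributes`, `cudaGetDeviceProperties`). The names `probeKernel`, `builtForOlderArch` and `reportArchMismatch` are hypothetical and not part of the code above, and the CUDA part is only compiled when the file goes through nvcc.

```cpp
#include <cstdio>

// True when the kernel was generated for an older architecture than the
// device provides. Both sides use the major*10+minor encoding (SM_13 -> 13).
bool builtForOlderArch(int ptxVersion, int deviceMajor, int deviceMinor) {
    return ptxVersion < deviceMajor * 10 + deviceMinor;
}

#ifdef __CUDACC__  // this part is only seen by nvcc
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void probeKernel(float *out) {
    if (out) out[0] = 1.0f;
}

// Prints a warning and returns true if probeKernel was built from PTX that
// targets an older architecture than device 0. Returns false on any error.
bool reportArchMismatch() {
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, probeKernel) != cudaSuccess) return false;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    const bool older = builtForOlderArch(attr.ptxVersion, prop.major, prop.minor);
    if (older) {
        std::printf("kernel PTX targets SM_%d, device is SM_%d%d\n",
                    attr.ptxVersion, prop.major, prop.minor);
    }
    return older;
}
#endif
```

With the numbers from the edit, `builtForOlderArch(13, 3, 5)` is true, while a build that actually targets SM_35 gives false.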

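If the output of your CUDA code is a plain old data (POD) host-side (as opposed to device-side) array, like the Forces1 array of floats created with new, then you do not need to use mxGPUArray. The MathWorks example you referenced probably demonstrates the use of MATLAB's gpuArray and built-in CUDA functionality, rather than how to pass data to and from regular CUDA functions within a MEX function.

If you can initialize Forces1 (or h_F0 in your full code) outside of and before the CUDA function (e.g. in mexFunction), then the solution is just to change from new to one of the mxCreate* functions (i.e. mxCreateNumericArray, mxCreateDoubleMatrix, mxCreateNumericMatrix, etc.), and then pass the data pointer to your CUDA function: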

plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);
// myCudaWrapper(...,h_F0 ,...) /* i.e. cudaMemcpyAsync(h_F0,d_F0,...) */

The only changes to your code are as follows:

Replace:

float *h_F0 = new float[(iterations)];

With:

plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);

Remove:

delete h_F0;

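Note: If instead your CUDA code owns the output host-side array, then you will have to copy the data into an mxArray. This is because, unless you allocate your mexFunction outputs with the mx API, any data buffer you assign (e.g. with mxSetData) will not be handled by the MATLAB memory manager, and you will get a segfault or, at best, a memory leak.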
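The copy case from the note could look roughly like the sketch below. It is an illustration under assumptions, not your actual code: `ownedForces` is a hypothetical stand-in for CUDA code that allocates and owns its own host buffer, `copyForces` is a hypothetical helper, and the guard assumes the mex build defines `MATLAB_MEX_FILE`, so the file also compiles outside MATLAB.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for CUDA code that owns its host-side result buffer.
std::vector<float> ownedForces(std::size_t n) {
    std::vector<float> forces(n);
    for (std::size_t i = 0; i < n; ++i) forces[i] = static_cast<float>(i);
    return forces;
}

// Copies n floats from the owned buffer into storage provided by the caller.
void copyForces(const float *src, float *dst, std::size_t n) {
    std::memcpy(dst, src, n * sizeof(float));
}

#ifdef MATLAB_MEX_FILE  // assumed to be defined by the mex build
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[])
{
    // The result lives in memory that MATLAB does not manage.
    const std::vector<float> owned = ownedForces(16);

    // Allocate the output with the mx API, then copy into it instead of
    // handing MATLAB a pointer to the foreign buffer.
    plhs[0] = mxCreateNumericMatrix(owned.size(), 1, mxSINGLE_CLASS, mxREAL);
    copyForces(owned.data(), static_cast<float*>(mxGetData(plhs[0])), owned.size());
}
#endif
```

The extra copy costs one pass over the data, but the output is then fully owned by MATLAB, and the original buffer can be freed by whatever code allocated it.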