Optimizing CUDA Kernel
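How can I further optimize the following CUDA kernel? Or is it already optimized for its purpose?
I was thinking that maybe I could use __constant__ memory in the host code to set up the arrays of random numbers. Is that possible? I know it is read-only memory, so I am confused about whether I can use constant memory instead of global memory.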
/*
 * CUDA kernel that will execute 100 threads in parallel
 * and will populate these parallel arrays with 100 random numbers
 * array size = 100.
 */
__global__ void initializeArrays(float* posx, float* posy, float* rayon, float* veloc,
                                 float* opacity, float* angle, unsigned char* color, int height,
                                 int width, curandState* state, size_t pitch){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    curandState localState = state[idx];
    posx[idx] = (float)(curand_normal(&localState)*width);
    posy[idx] = (float)(curand_normal(&localState)*height);
    rayon[idx] = (float)(10 + curand_normal(&localState)*50);
    angle[idx] = (float)(curand_normal(&localState)*360);
    veloc[idx] = (float)(curand_uniform(&localState)*20 - 10);
    color[idx*pitch] = (unsigned char)(curand_normal(&localState)*255);
    color[(idx*pitch)+1] = (unsigned char)(curand_normal(&localState)*255);
    color[(idx*pitch)+2] = (unsigned char)(curand_normal(&localState)*255);
    opacity[idx] = (float)(0.3f + 1.5f *curand_normal(&localState));
    __syncthreads();
}
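On the __constant__ part of the question: constant memory is read-only only from device code. The host can write it with cudaMemcpyToSymbol, so it cannot receive values that a kernel generates, but it can hold small read-only inputs (the total is limited to 64 KB). A minimal sketch under that reading, where the SceneParams struct, the scalePositions kernel, and the setSceneParams helper are hypothetical names that do not appear in the code above:

```cuda
#include <cuda_runtime.h>

// Hypothetical struct, not part of the original code: small read-only
// parameters that every thread reads.
struct SceneParams {
    int width;
    int height;
};

// Constant memory is declared at file scope. Device code can only read it.
__constant__ SceneParams d_params;

// Illustrative kernel: scales existing values by the constant parameters.
__global__ void scalePositions(float* posx, float* posy, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        posx[idx] *= d_params.width;
        posy[idx] *= d_params.height;
    }
}

// Host helper: the host writes constant memory with cudaMemcpyToSymbol.
cudaError_t setSceneParams(int width, int height) {
    SceneParams h_params = {width, height};
    return cudaMemcpyToSymbol(d_params, &h_params, sizeof(SceneParams));
}
```

With this layout the host would call setSceneParams(width, height) once before launching, and width and height would no longer need to be passed as kernel arguments. The output arrays themselves still have to live in global memory, because the kernel writes to them.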
I would try using a 2D thread block so that each thread performs only one operation. Consider a kernel like this:
__global__ void initializeArrays(float* posx, float* posy, float* rayon, float* veloc,
                                 float* opacity, float* angle, unsigned char* color, int height,
                                 int width, curandState* state, size_t pitch){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = threadIdx.y;
    // state is a flat array, so it cannot be indexed as state[idy][idx].
    // Assumption: states are stored row by row, one row per idy value,
    // with a row length equal to the total number of threads in x.
    curandState localState = state[idy * gridDim.x * blockDim.x + idx];
    switch(idy)
    {
    case 0:
        posx[idx] = (float)(curand_normal(&localState)*width);
        break;
    case 1:
        posy[idx] = (float)(curand_normal(&localState)*height);
        break;
    case 2:
        rayon[idx] = (float)(10 + curand_normal(&localState)*50);
        break;
    case 3:
        angle[idx] = (float)(curand_normal(&localState)*360);
        break;
    case 4:
        veloc[idx] = (float)(curand_uniform(&localState)*20 - 10);
        break;
    case 5:
        color[idx*pitch] = (unsigned char)(curand_normal(&localState)*255);
        break;
    case 6:
        color[(idx*pitch)+1] = (unsigned char)(curand_normal(&localState)*255);
        break;
    case 7:
        color[(idx*pitch)+2] = (unsigned char)(curand_normal(&localState)*255);
        break;
    case 8:
        opacity[idx] = (float)(0.3f + 1.5f *curand_normal(&localState));
        break;
    default:
        break;
    }
    __syncthreads();
}
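A kernel shaped like this needs one generator state per (field, element) pair rather than one per element. The following is only a sketch of how the host side might prepare those states, assuming nine fields (one per threadIdx.y value), a single block row of n threads in x, and states stored row by row with a row length equal to the total number of threads in x. The names kFields, setupStates, and createStates are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Assumption: one field per threadIdx.y value, nine fields in total.
constexpr int kFields = 9;

// Hypothetical setup kernel: initializes one state per (field, element) pair.
// Thread (idx, idy) owns state[idy * stride + idx].
__global__ void setupStates(curandState* state, unsigned long long seed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = threadIdx.y;
    int stride = gridDim.x * blockDim.x;  // total number of threads in x
    int id = idy * stride + idx;
    // Same seed for all threads, distinct subsequence per thread.
    curand_init(seed, id, 0, &state[id]);
}

// Hypothetical host helper: allocates kFields * n states and initializes
// them with a single n x kFields block.
cudaError_t createStates(curandState** state, int n, unsigned long long seed) {
    // A block holds at most 1024 threads, so this layout only fits small n.
    if (n <= 0 || n * kFields > 1024) return cudaErrorInvalidValue;

    cudaError_t err = cudaMalloc(state, sizeof(curandState) * kFields * n);
    if (err != cudaSuccess) return err;

    dim3 block(n, kFields);  // for n = 100: 900 threads in one block
    dim3 grid(1);
    setupStates<<<grid, block>>>(*state, seed);

    err = cudaGetLastError();  // reports launch configuration errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();  // reports errors raised during execution
}
```

The data kernel would then be launched with the same grid and block dimensions, so that each thread reads the state it initialized. For arrays larger than about 113 elements the block would need to be narrower in x and the kernels would need a bounds check on idx, which this sketch leaves out.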