CUDA C: Kernel outputs bad results
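First of all, this is not homework; I am just getting started with CUDA.
I am trying to run the following code to add two vectors. The problem is that after every run the result vector (c_device) is unchanged and never holds the sum of the two input vectors.
I have already tried changing the vector length, using both int and unsigned int, and switching between x64 and Win32 in Visual Studio.
Here is the code.
This is the .h file: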
#ifndef ODINN_CUDA_MAIN_H
#define ODINN_CUDA_MAIN_H

#define ARR_SIZE 100
#define ITER_AMOUNT 1

typedef enum cudaError cudaError_t;

// Print the CUDA error string with its source location, then abort.
static void HandleError(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        exit(EXIT_FAILURE);
    }
}

#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))
#define GET_CURRENT_CLOCKS(var) (var = clock())
#define GET_CLOCK_INTERVAL_SEC(start, end, result) (result = ((double)((double)end - (double)start) / (double)CLOCKS_PER_SEC))

__host__ dim3 requestBlockSize(int x, int y=0, int z=0);
__host__ dim3 requestNumBlocks(int x, int y=0, int z=0);
__host__ void allocateVectors(unsigned int **a_host, unsigned int **b_host, unsigned int **c_host, unsigned int **a_device, unsigned int **b_device, unsigned int **c_device);
__global__ void addVectors(unsigned int* a, unsigned int* b, unsigned int* result, int n);
__host__ void cleanUp(unsigned int *a_host, unsigned int *b_host, unsigned int *c_host, unsigned int *a_device, unsigned int *b_device, unsigned int *c_device);

#endif
The .cu file:
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include "main.h"

static cudaDeviceProp prop;

int main(void) {
    // Start lazy init now so first cudaMalloc will run faster.
    cudaSetDevice(0);
    cudaFree(0);

    unsigned int *a_host, *b_host, *c_host;
    unsigned int *a_device, *b_device, *c_device;
    double delta_in_sec;
    size_t size = sizeof(unsigned int) * ARR_SIZE;
    clock_t start_clock, end_clock;

    HANDLE_ERROR(cudaGetDeviceProperties(&prop, 0));

    dim3 block_size = requestBlockSize(1024);
    // Round up so a partial final block is still launched.
    int blocks_requested = (ARR_SIZE + block_size.x - 1) / block_size.x;
    dim3 n_blocks = requestNumBlocks(blocks_requested > 0 ? blocks_requested : 1);

    fprintf(stdout, "Allocating vectors ...\n");
    allocateVectors(&a_host, &b_host, &c_host, &a_device, &b_device, &c_device);

    fprintf(stdout, "Copying to device ...\n");
    HANDLE_ERROR(cudaMemcpy(a_device, a_host, size, cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(b_device, b_host, size, cudaMemcpyHostToDevice));

    fprintf(stdout, "Running kernel ...\n");
    GET_CURRENT_CLOCKS(start_clock);
    for(int i=0; i<ITER_AMOUNT; i++) {
        addVectors<<<n_blocks, block_size>>>(a_device, b_device, c_device, ARR_SIZE);
    }
    GET_CURRENT_CLOCKS(end_clock);
    GET_CLOCK_INTERVAL_SEC(start_clock, end_clock, delta_in_sec);

    fprintf(stdout, "Runtime of kernel %d times on arrays in length %d took %f seconds\n"
        "Copying results back to host ...\n", ITER_AMOUNT, ARR_SIZE, delta_in_sec);
    HANDLE_ERROR(cudaMemcpy(c_host, c_device, size, cudaMemcpyDeviceToHost));

    fprintf(stdout, "%u + %u != %u\n", a_host[0], b_host[0], c_host[0]);

    fprintf(stdout, "Cleaning up ...\n");
    cleanUp(a_host, b_host, c_host, a_device, b_device, c_device);
    fprintf(stdout, "Done!\n");
}

// Clamp the requested block dimensions to the device limits.
__host__ dim3 requestBlockSize(int x, int y, int z) {
    dim3 blocksize(
        x <= prop.maxThreadsDim[0] ? x : prop.maxThreadsDim[0],
        y <= prop.maxThreadsDim[1] ? y : prop.maxThreadsDim[1],
        z <= prop.maxThreadsDim[2] ? z : prop.maxThreadsDim[2]
    );
    return blocksize;
}

__host__ dim3 requestNumBlocks(int x, int y, int z) {
    dim3 numblocks(x, y, z);
    return numblocks;
}

// Allocate host and device buffers and fill the two inputs with random values.
__host__ void allocateVectors(unsigned int **a_host, unsigned int **b_host, unsigned int **c_host, unsigned int **a_device, unsigned int **b_device, unsigned int **c_device) {
    size_t size = sizeof(unsigned int) * ARR_SIZE;

    *a_host = (unsigned int *)malloc(size);
    *b_host = (unsigned int *)malloc(size);
    *c_host = (unsigned int *)malloc(size);

    HANDLE_ERROR(cudaMalloc((void **)a_device, size));
    HANDLE_ERROR(cudaMalloc((void **)b_device, size));
    HANDLE_ERROR(cudaMalloc((void **)c_device, size));

    srand(time(NULL));
    for(int i=0; i<ARR_SIZE; i++) {
        (*a_host)[i] = rand() % ARR_SIZE;
        (*b_host)[i] = rand() % ARR_SIZE;
    }
}

// One thread per element; threads past the end of the arrays do nothing.
__global__ void addVectors(unsigned int* a, unsigned int* b, unsigned int* result, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= 0 && idx < n)
        result[idx] = a[idx] + b[idx];
}

__host__ void cleanUp(unsigned int *a_host, unsigned int *b_host, unsigned int *c_host, unsigned int *a_device, unsigned int *b_device, unsigned int *c_device) {
    free(a_host);
    free(b_host);
    free(c_host);

    HANDLE_ERROR(cudaFree(a_device));
    HANDLE_ERROR(cudaFree(b_device));
    HANDLE_ERROR(cudaFree(c_device));
}
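If you prefer to read the code there, here is the pastebin link: http://pastebin.com/04jy1CaB
I should mention that copying a_device to c_host works. I also tried copying c_host to c_device to see what would happen, and the result was the same.
Any suggestions?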
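OK, so thanks to talonmies' comment on my question, I realized I was not doing enough error checking. Once I added the extra checks, I found that I was passing bad arguments to the kernel launch.
I was passing invalid dim3 y and z values for the threads-per-block and number-of-blocks launch parameters. If you look, my default values are 0; they should be 1.
Hooray for debuggers :)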