Calculate the pitch for cudaMalloc as in cudaMallocPitch

Keywords: cudaMalloc, calculate, cudaMallocPitch. Updated: 2023-10-16

Simple question: is it possible to calculate or obtain the optimal pitch for an array without allocating memory, as is done in

cudaMallocPitch(void** p, size_t *pitch, size_t width, size_t height)

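I would like to obtain the pitch without allocating any memory, and then use cudaMalloc instead!

(This is crucial if one wants to implement some kind of caching allocator for pitched allocations on the CUDA platform.)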

My first guess was to round the width up to the texture alignment:

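// round width up to the next multiple of prop.textureAlignment
// (adding alignment - 1 first keeps an already aligned width unchanged)
const size_t alignment = (size_t)device.m_prob.textureAlignment;
size_t proper_pitch = ((width + alignment - 1) / alignment) * alignment;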
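The round-up step can be isolated in a small helper. The following is only a minimal sketch in plain C++: `round_up_to_multiple` is a hypothetical name, and nothing here confirms that cudaMallocPitch actually applies this rule.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round `value` up to the nearest multiple of `alignment`.
// A value that is already a multiple is returned unchanged.
inline std::size_t round_up_to_multiple(std::size_t value, std::size_t alignment)
{
    assert(alignment != 0);  // a zero alignment would divide by zero
    return ((value + alignment - 1) / alignment) * alignment;
}
```

With an illustrative alignment of 512 bytes (an assumed value, not one queried from a device), `round_up_to_multiple(1000, 512)` is 1024 and `round_up_to_multiple(1024, 512)` stays 1024, whereas a `(width / alignment + 1) * alignment` form would give 1536 for the already aligned width.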

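Update: I now compute proper_pitch as the smallest of the width rounded up to a multiple of 32, 64, or 128 bytes. I have not tested this, and I still do not know what else the runtime API might do. Perhaps it looks at memory that is already allocated and does some fitting? According to the CUDA Programming Guide, the above is a necessary requirement for fully coalesced access (but not a sufficient one, because at run time the warps also have to access memory contiguously)...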

// use the CUDA Programming Guide alignment (which should be the best, I think)
    // closest upper multiple of 32/64/128
    //size_t upperMultOf32 = ((widthInBytes + 32 - 1)/32)*32;   //  ((widthInBytes-1)/32 + 1)*32
    proper_pitch = std::min(
                        std::min( ((widthInBytes + 32 - 1)>>5)<<5 , ((widthInBytes + 64 - 1)>>6)<<6 ), 
                        ((widthInBytes + 128 - 1)>>7)<<7
                    );
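One property of this expression is worth spelling out: because every multiple of 64 or 128 is also a multiple of 32, the minimum of the three round-ups always equals the 32-byte round-up. The sketch below checks that in plain C++; the function names are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Same expression as the snippet above: the smallest of the 32-, 64- and
// 128-byte round-ups of the row width.
inline std::size_t min_of_round_ups(std::size_t widthInBytes)
{
    const std::size_t up32  = ((widthInBytes + 32  - 1) >> 5) << 5;
    const std::size_t up64  = ((widthInBytes + 64  - 1) >> 6) << 6;
    const std::size_t up128 = ((widthInBytes + 128 - 1) >> 7) << 7;
    // Invariant: the 32-byte round-up is never larger than the other two.
    assert(up32 <= up64 && up32 <= up128);
    return std::min(std::min(up32, up64), up128);
}

// Round-up to a multiple of 32 only, for comparison.
inline std::size_t round_up_32(std::size_t widthInBytes)
{
    return ((widthInBytes + 31) >> 5) << 5;
}

// Returns true if both functions agree for every width in [0, limit].
inline bool agree_up_to(std::size_t limit)
{
    for (std::size_t w = 0; w <= limit; ++w)
        if (min_of_round_ups(w) != round_up_32(w)) return false;
    return true;
}
```

So the nested `std::min` could be reduced to the 32-byte round-up without changing the result. Whether 32 bytes is the granularity the runtime uses is a separate question that this check does not answer.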

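There is currently no way to obtain the pitch calculation. The details are probably specific to the hardware version, and NVIDIA neither documents the calculation nor exposes it through the API (although, as pointed out, it would be trivial for them to do so).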
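One possible workaround, which is not part of the original answer, is to measure the pitch once per row width with a throwaway allocation and remember the result. The sketch below is plain C++ with a hypothetical `PitchCache` class and a pluggable probe function. In real use the probe would presumably call cudaMallocPitch with a small height, read back the pitch, and release the memory with cudaFree. It assumes the pitch for a given width stays the same on a given device, which is not documented, and it still allocates during the probe.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

// Hypothetical cache: remembers the pitch observed for each row width so that
// the (allocating) probe runs at most once per width.
class PitchCache
{
public:
    // `probe` maps a row width in bytes to the pitch the runtime chose.
    using Probe = std::function<std::size_t(std::size_t)>;

    explicit PitchCache(Probe probe) : probe_(std::move(probe))
    {
        assert(probe_ && "probe must be callable");
    }

    // Returns the cached pitch for this width, probing on the first request.
    std::size_t pitch_for(std::size_t widthInBytes)
    {
        auto it = cache_.find(widthInBytes);
        if (it == cache_.end())
            it = cache_.emplace(widthInBytes, probe_(widthInBytes)).first;
        return it->second;
    }

private:
    Probe probe_;
    std::map<std::size_t, std::size_t> cache_;  // width in bytes -> pitch in bytes
};
```

A caching allocator could then call `pitch_for(widthInBytes)` and pass `pitch * height` to cudaMalloc, at the cost of one short-lived allocation per distinct width.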

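If this is a serious limitation for a real-world use case, I suggest filing a bug report/feature request through NVIDIA's registered developer portal. In my experience, they do listen to serious feature requests.

[This answer was assembled mostly from comments and added as a community wiki entry to get this question off the unanswered list]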