Cuda有效地从字节数组复制到不同大小的共享内存元素

Cuda efficiently copy from byte array to shared memory elements of different sizes

本文关键字：共享元素内存有效地字节字节数复制数组 Cuda 更新时间：2023-10-16

我有一个字节数组(unsigned char *)，它表示内存中的树状数据结构。树的每个节点包含不同大小的元素:1bool开头，nunsigned ints和nunsigned shorts。我之所以这样做，是因为对我来说，内存使用最少是非常重要的。不幸的是，当我试图访问从全局内存复制到共享内存时，会导致内存对齐问题:

__global__ void sampleerror(unsigned char * global_mem, unsigned int updated_idx...) {
    __shared__ unsigned int offsets[MAX_NUM_CHILDREN/2 +1]; 
    __shared__ unsigned int entries[ENTRIES_PER_NODE];
    __shared__ bool booleans[4];
    bool * is_last = &booleans[0];
    //First warp divergence here. We are reading in from global memory
    if (i == 0) {
        *is_last = (bool)global_mem[updated_idx];
    }
    __syncthreads();
    if (*is_last) {
        //The number of entries in the bottom most nodes may be smaller than the size
        if (i < (size - 1)/entry_size) {
            entries[i] = *(unsigned int *)(&global_mem[updated_idx + 1 + i*sizeof(unsigned int)]);
        }
    } else {
        int num_entries = (size - 1 - sizeof(unsigned int) - sizeof(unsigned short))/(entry_size + sizeof(unsigned short));
        //Load the unsigned int start offset together with the accumulated offsets to avoid warp divergence
        if (i < ((num_entries + 1)/2) + 1) {
            offsets[i] = *(unsigned int *)(&global_mem[updated_idx + 1 * i*sizeof(unsigned int)]);
        }
        __syncthreads();
        //Now load the entries
        if (i < num_entries) {
            entries[i] = *(unsigned int *)(&global_mem[updated_idx + 1 + (num_entries + 1)*sizeof(unsigned int) + i*sizeof(unsigned int)]);
        }
    }
    __syncthreads();
}

我得到不一致的内存访问，因为我试图复制到共享内存在这里(和在else语句):

entries[i] = *(unsigned int *)(&global_mem[updated_idx + 1 + i*sizeof(unsigned int)]);

因为updated_idx + 1不一定对齐。问题:

1)如果我不想填充我的数据结构来很好地对齐整数，那么逐个字节复制是我唯一的选择吗?

如果我用字节从全局拷贝字节到共享内存，它会比用unsigned int拷贝unsigned int慢4倍吗?

3)是否有可能得到不对齐的内存访问，如果我做它的字节字节?我想我读到过字节访问总是对齐的。

编辑:

我有一个类似btree的数据结构，其中每个节点包含一个有效负载，形式为:

struct Entry {
    unsigned int key;
    unsigned int next_level_offset;
    float prob1;
    float prob2;
 }

对于搜索b树，我只需要每个条目的键信息，而不需要结构体中的其余信息。因此，每个节点按以下方式折叠成字节数组:

(bool is_last)(key1, key2, key3…)((offset, prob1 prob2 (key1))， (offset, prob1 prob2 (key2))， (offset, prob1 prob2 (key3)))(unsigned int first_child_start_offset)(short sizeofChild1, short sizeofChild2, short sizeofChild3…)

显然，如果is_last为false，那么将只存储没有childnoffsets。

我以这种方式布置数据的原因是每个节点的条目数量可以是可变的，所以如果我在单独的数组中存储单独的东西，我将不得不保持那些"元数据"数组的开始和结束索引的额外跟踪，这将导致存储更多的数据或在搜索期间使用状态机，这是我想避免的。我相信，对于每个节点的bool部分，只需相对较少的工作就可以完成，但对于其他部分(如偏移量)则不行。

如果我不想填充我的数据结构以很好地对齐整数，那么逐个字节复制是我唯一的选择吗?

看看你提供的代码，我想说或多或少，是的。您可能需要使用内存。这样做，编译器将发出相当优化的字节复制循环。您可能还想研究更改ptxas默认缓存行为，以便加载绕过L1缓存(因此-Xptxas="——def-load-cache=cg"选项)。它可以提供更好的性能。
如果我一个字节一个字节地从全局复制到共享内存，它会比我能够通过unsigned int复制unsigned int慢4倍吗?

您应该期望内存吞吐量降低。在没有基准测试的情况下，很难说有多少。那是你的工作，如果你愿意的话
是否有可能得到不对齐的内存访问，如果我做它的字节字节?我想我读到过字节访问总是对齐的。

对齐标准始终是字长。所以单字节字总是对齐的。但是请记住，如果您执行字节加载到共享内存缓冲区，然后尝试使用reinterpret_cast读取未与共享字节数组对齐的较大单词，则会遇到相同的问题。

你没有给出关于给定子树大小的太多细节。可能有一些模板技巧可以用于将先验已知大小的字节加载扩展为一系列32位char4加载，并附带1到3个字节加载，以获得给定字节缓冲区大小的内存。如果它适合您的数据结构设计，那么它应该性能更高。