Shared memory atomics compile for sm_20 but not sm_13

Keywords: sm, compile, memory, shared. Updated: 2023-10-16

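I am using a Tesla C1060, which has compute capability 1.3, with nvcc compiler driver 4.0. I am trying to do some computation that is local to each thread block. Each thread block has a shared array that is first initialized to zero. To synchronize concurrent updates (additions) to the shared data by the threads of a block, I use the CUDA atomicAdd primitive.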

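Once each thread block has its results ready in its shared data array, each entry of that array is merged iteratively (using atomicAdd) into the corresponding entry of a global data array.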

Below is code that closely resembles what I am trying to do.

#define DATA_SZ 16
typedef unsigned long long int ULLInt;
__global__ void kernel( ULLInt* data, ULLInt ThreadCount )
{
  ULLInt thid = threadIdx.x + blockIdx.x * blockDim.x;
  __shared__ ULLInt sharedData[DATA_SZ];
  // Initialize the shared data
  if( threadIdx.x == 0 )
  {
    for( int i = 0; i < DATA_SZ; i++ ) { sharedData[i] = 0; }
  }
  __syncthreads();
  //..some code here
  if( thid < ThreadCount )
  {
    //..some code here
    atomicAdd( &sharedData[getIndex(thid)], thid );
    //..some code here        
    for(..a loop...)
    { 
      //..some code here
      if(thid % 2 == 0)
      {           
        // getIndex() returns a value in [0, DATA_SZ )
        atomicAdd( &sharedData[getIndex(thid)], thid * thid );
      }
    }
  }
  __syncthreads();
  
  if( threadIdx.x == 0 )
  {
    // ...
    for( int i = 0; i < DATA_SZ; i++ ) { atomicAdd( &data[i], sharedData[i] ); }
    //...
  }
}

If I compile with -arch=sm_20, I get no errors. However, when I compile the kernel with the -arch=sm_13 option, I get the following errors:

ptxas /tmp/tmpxft_00004dcf_00000000-2_mycode.ptx, line error   : Global state space expected for instruction 'atom'
ptxas /tmp/tmpxft_00004dcf_00000000-2_mycode.ptx, line error   : Global state space expected for instruction 'atom'
ptxas fatal   : Ptx assembly aborted due to errors

If I comment out the following two lines, compiling with -arch=sm_13 produces no errors:

atomicAdd( &sharedData[getIndex(thid)], thid );
atomicAdd( &sharedData[getIndex(thid)], thid * thid );

Can anyone tell me what I might be doing wrong?

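I found the solution in the CUDA C Programming Guide: atomic functions operating on shared memory and atomic functions operating on 64-bit words are only available for devices of compute capability 1.2 and above. Atomic functions operating on 64-bit words in shared memory are only available for devices of compute capability 2.x and higher.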
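The rule quoted from the guide can be written down as a small host-side check. This is only a sketch: the helper names `hasSharedAtomics32` and `hasSharedAtomics64` are hypothetical, and it assumes the card of interest is device 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helpers that encode the rule quoted from the programming guide.
// 32-bit atomics on shared memory: compute capability 1.2 and above.
static bool hasSharedAtomics32(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 2);
}

// 64-bit atomics on shared memory: compute capability 2.x and higher.
static bool hasSharedAtomics64(int major)
{
    return major >= 2;
}

int main()
{
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU the kernel will run on.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query device 0\n");
        return 1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("32-bit shared atomics: %s\n",
           hasSharedAtomics32(prop.major, prop.minor) ? "yes" : "no");
    printf("64-bit shared atomics: %s\n",
           hasSharedAtomics64(prop.major) ? "yes" : "no");
    return 0;
}
```

By this rule a compute capability 1.3 card such as the C1060 supports 32-bit shared atomics but not 64-bit ones, which matches the ptxas error for the ULLInt shared array.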

So basically I cannot use ULLInt for the shared memory here, and somehow I need to use unsigned int instead.
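A minimal sketch of that workaround, under stated assumptions: the shared array becomes `unsigned int`, while the global array stays 64-bit, since by the rule quoted above 64-bit atomics on global memory are available from compute capability 1.2. The `getIndex` body here is a hypothetical stand-in (the question does not show it), and the launch configuration in `main` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DATA_SZ 16
typedef unsigned long long int ULLInt;

// Hypothetical stand-in for the question's getIndex(); returns a value in [0, DATA_SZ).
__device__ unsigned int getIndex(unsigned int thid) { return thid % DATA_SZ; }

__global__ void kernel(ULLInt* data, unsigned int threadCount)
{
    unsigned int thid = threadIdx.x + blockIdx.x * blockDim.x;

    // 32-bit per-block accumulators: shared-memory atomics on these
    // do not need compute capability 2.x.
    __shared__ unsigned int sharedData[DATA_SZ];
    if (threadIdx.x == 0) {
        for (int i = 0; i < DATA_SZ; i++) { sharedData[i] = 0; }
    }
    __syncthreads();

    if (thid < threadCount) {
        atomicAdd(&sharedData[getIndex(thid)], thid);
    }
    __syncthreads();

    // Merge into the 64-bit global array; this atomicAdd targets global memory.
    if (threadIdx.x == 0) {
        for (int i = 0; i < DATA_SZ; i++) { atomicAdd(&data[i], (ULLInt)sharedData[i]); }
    }
}

int main()
{
    const unsigned int threadCount = 128;  // illustrative size only
    ULLInt hData[DATA_SZ];
    ULLInt* dData = NULL;

    cudaMalloc((void**)&dData, sizeof(hData));
    cudaMemset(dData, 0, sizeof(hData));
    kernel<<<2, 64>>>(dData, threadCount);
    if (cudaMemcpy(hData, dData, sizeof(hData), cudaMemcpyDeviceToHost) != cudaSuccess) {
        fprintf(stderr, "kernel launch or copy failed\n");
        return 1;
    }
    cudaFree(dData);

    // Every thread adds its own thid once, so the bins should sum to 0 + 1 + ... + 127.
    ULLInt total = 0;
    for (int i = 0; i < DATA_SZ; i++) { total += hData[i]; }
    printf("total = %llu\n", total);
    return 0;
}
```

The trade-off is range: each block's partial sums must fit in 32 bits before they are merged. For terms like `thid * thid` with large thread counts that may not hold, in which case the block could flush its shared array to global memory more often, or skip shared memory and add to the 64-bit global array directly.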