并行算法，做一个小的插入/移位

Parallel algorithm that does a small insertion/shifting

本文关键字：插入移位一个并行算法更新时间：2023-10-16

假设我有一个包含8个数字的数组a，我还有一个包含数字的数组B来决定a中的数字应该向右移动多少位

A 3、6、7、8、1、2、3、5

B

0表示有效，1表示该数字应该在1位之后，输出数组应该在3位之后插入0，输出数组C应该是:

C: 3 0 6, 7, 8, 1, 2, 3

是否插入0或其他数字并不重要，关键是3之后的所有数字都移动了一位。出站号码将不再在数组中。

另一个例子:

A 3、6、7、8、1、2、3、5

B 0、1、0、2、0、0、0

C、0、6、7、8、0、1、2

.......................................

A 3、6、7、8、1、2、3、5

B 0、1、0、1、0、0

C, 0, 6, 7, 8, 1, 2, 3

我正在考虑使用扫描/前缀和或类似的东西来解决这个问题。此外，这个数组很小，我应该能够将数组放入一次warp (<32个数字)并使用shuffle指令。有人知道吗?

一个可能的方法。

由于移位的模糊性(例如，0, 1, 0, 1, 0, 1, 1, 1和0, 1, 0 ,0都产生相同的数据偏移模式)，不可能仅仅创建移位模式的前缀和来产生每个位置的相对偏移量。然而，我们可以观察到，如果移位模式中的每个零都被其左侧的第一个非零移位值替换，则将创建一个有效的偏移模式:

0, 1, 0, 0   (shift pattern)
0, 1, 1, 1   (offset pattern)

或

0, 2, 0, 2   (shift pattern)
0, 2, 2, 2   (offset pattern)

怎么做呢?让我们假设我们有第二个测试用例移位模式:

      0, 1, 0, 0, 2, 0, 0, 0

我们期望的偏移模式是:

      0, 1, 1, 1, 2, 2, 2, 2

对于给定的移位模式，创建一个二进制值，如果移位模式对应索引处的值为零，则每个位为1，否则为零。我们可以使用一个叫做__ballot()的扭曲投票指令。每个通道将从选票中获得相同的值:
```
  1  0  1  1  0  1  1  1  (this is a single binary 8-bit value in this case)
```
每个曲线现在将采用这个值，并添加一个值，它在曲线位置有1位。在示例的剩余部分使用第1行:
```
+ 0  0  0  0  0  0  1  0  (the only 1 bit in this value will be at the lane index)
= 1  0  1  1  1  0  0  1
```
我们现在取第2步的结果，用第1步的结果进行位异或运算:
```
= 0  0  0  0  1  1  1  0
```
我们现在计算这个值中1位的数量(有一个__popc()固有的)，并从结果中减去1。因此，对于上面的通道1示例，这一步的结果将是2，因为设置了3位。这给了我们到左边第一个值的距离，这个值在原始移位模式中是非零的。因此，对于1号车道的例子，1号车道左边的第一个非零值比2个车道高，即3号车道。
对于每个车道，我们使用第4步的结果来获取该车道的适当偏移值。我们可以使用__shfl_down() warp shuffle指令一次处理所有通道。
```
  0, 1, 1, 1, 2, 2, 2, 2
```

一旦我们有了所需的偏移模式，让每个warp lane使用其偏移值来适当地移动其数据项的过程就很简单了。

这是一个完整的工作示例，使用您的3个测试用例。上面的步骤1-4包含在__device__函数mydelta中。内核的其余部分执行第5步shuffle，适当地为数据建立索引，并复制数据。由于使用了warp shuffle指令，我们必须为cc3.0或更高版本的GPU编译此代码。(但是，用允许在cc2.0或更高版本的设备上操作的其他索引代码替换warp shuffle指令并不困难。)此外，由于使用了各种内在特性，此函数不能用于超过32个数据项，但这是您的问题中所述的先决条件。

$ cat t475.cu
#include <stdio.h>
#define DSIZE 8
#define cudaCheckErrors(msg) 
    do { 
        cudaError_t __err = cudaGetLastError(); 
        if (__err != cudaSuccess) { 
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)n", 
                msg, cudaGetErrorString(__err), 
                __FILE__, __LINE__); 
            fprintf(stderr, "*** FAILED - ABORTINGn"); 
            exit(1); 
        } 
    } while (0)

__device__ int mydelta(const int shift){
  unsigned nz = __ballot(shift == 0);
  unsigned mylane = (threadIdx.x & 31);
  unsigned lanebit = 1<<mylane;
  unsigned temp = nz + lanebit;
  temp = nz ^ temp;
  unsigned delta = __popc(temp);
  return delta-1;
}
__global__ void mykernel(const int *data, const unsigned *shift, int *result, const int limit){ // limit <= 32
  if (threadIdx.x < limit){
    unsigned lshift = shift[(limit - 1) - threadIdx.x];
    unsigned delta = mydelta(lshift);
    unsigned myshift = __shfl_down(lshift, delta);
    myshift = __shfl(myshift, ((limit -1) - threadIdx.x)); // reverse offset pattern
    result[threadIdx.x] = 0;
    if ((myshift + threadIdx.x) < limit)
    result[threadIdx.x + myshift] = data[threadIdx.x];
  }
}
int main(){
  int A[DSIZE]         = {3, 6, 7, 8, 1, 2, 3, 5};
  unsigned tc1B[DSIZE] = {0, 1, 0, 0, 0, 0, 0, 0};
  unsigned tc2B[DSIZE] = {0, 1, 0, 0, 2, 0, 0, 0};
  unsigned tc3B[DSIZE] = {0, 1, 0, 0, 1, 0, 0, 0};
  int *d_data, *d_result, *h_result;
  unsigned *d_shift;
  h_result = (int *)malloc(DSIZE*sizeof(int));
  if (h_result == NULL) { printf("malloc failn"); return 1;}
  cudaMalloc(&d_data, DSIZE*sizeof(int));
  cudaMalloc(&d_shift, DSIZE*sizeof(unsigned));
  cudaMalloc(&d_result, DSIZE*sizeof(int));
  cudaCheckErrors("cudaMalloc fail");
  cudaMemcpy(d_data, A, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_shift, tc1B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMempcyH2D fail");
  mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMempcyD2H fail");
  printf("index: ");
  for (int i = 0; i < DSIZE; i++)
    printf("%d, ", i);
  printf("nA:     ");
  for (int i = 0; i < DSIZE; i++)
    printf("%d, ", A[i]);
  printf("ntc1 B: ");
  for (int i = 0; i < DSIZE; i++)
    printf("%d, ", tc1B[i]);
  printf("ntc1 C: ");
  for (int i = 0; i < DSIZE; i++)
    printf("%d, ", h_result[i]);
  cudaMemcpy(d_shift, tc2B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMempcyH2D fail");
  mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMempcyD2H fail");
  printf("ntc2 B: ");
  for (int i = 0; i < DSIZE; i++)
    printf("%d, ", tc2B[i]);
  printf("ntc2 C: ");
  for (int i = 0; i < DSIZE; i++)
    printf("%d, ", h_result[i]);
  cudaMemcpy(d_shift, tc3B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMempcyH2D fail");
  mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMempcyD2H fail");
  printf("ntc3 B: ");
  for (int i = 0; i < DSIZE; i++)
    printf("%d, ", tc3B[i]);
  printf("ntc2 C: ");
  for (int i = 0; i < DSIZE; i++)
    printf("%d, ", h_result[i]);
  printf("n");
  return 0;
}
$ nvcc -arch=sm_35 -o t475 t475.cu
$ ./t475
index: 0, 1, 2, 3, 4, 5, 6, 7,
A:     3, 6, 7, 8, 1, 2, 3, 5,
tc1 B: 0, 1, 0, 0, 0, 0, 0, 0,
tc1 C: 3, 0, 6, 7, 8, 1, 2, 3,
tc2 B: 0, 1, 0, 0, 2, 0, 0, 0,
tc2 C: 3, 0, 6, 7, 8, 0, 1, 2,
tc3 B: 0, 1, 0, 0, 1, 0, 0, 0,
tc2 C: 3, 0, 6, 7, 8, 1, 2, 3,
$