ARM Neon C: Wrong answer

Keywords: Wrong answer, Neon, ARM. Updated: 2023-10-16

I am a beginner learning the ARM NEON C intrinsics, and I am trying to vectorise the for loop below:

for (p = Lp [i] + 1 ; p < c [i] ; p++)
 {
      x [Li [p]] += Lx [p] * lki ;
 }

Here x and Lx are arrays of double. I modified the code as follows:

float32x4_t Lx_vec;
float32x4_t lki_vec;
float32x4_t result_vec;
lki_vec = vdupq_n_f32(lki);/* duplicate lki in all lanes*/
for (p = Lp [i] + 1 ; p < c[i]/4 ; p+=4)
{
float lx_float[4]; 
for (int m = 0; m < 4; ++m) /* loop needed because double not supported in neon*/
{ 
   lx_float[m] = (float)Lx[p+m]; 
}
Lx_vec = vld1q_f32(lx_float);/*vectorise subset of Lx*/
//parallel multiplication of the vectors
result_vec[p] = vmulq_f32(Lx_vec,lki_vec);
//store value to x[Li[p]]
vst1q_lane_f32(&result,result_vec,0);
x [Li [p]] += (double)result;
result = 0;
vst1q_lane_f32(&result,result_vec,1);
x [Li [p+1]] += (double)result;
result = 0;
vst1q_lane_f32(&result,result_vec,2);
x [Li [p+2]] += (double)result;
result = 0;
vst1q_lane_f32(&result,result_vec,3);
x [Li [p+3]] += (double)result;
}

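I think what I am doing is completely wrong, because the code gives me a segmentation fault. I don't know what I did wrong. I also can't think of another way to vectorise the loop.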

Following the suggestions in the comments below, I have now added handling for the case where the array size is not a multiple of 4.

   int loopCount = (c[i]- (Lp [i] + 1))/4;
   p = Lp [i] + 1;
   int count = 0;
   while (count<loopCount)
   {
      float lx_float[4]; 
      for (int m = 0; m < 4; ++m) /* loop needed because double not supported in neon*/
      { 
        lx_float[m] = (float)Lx[p+m]; 
      }
      Lx_vec = vld1q_f32(lx_float);/*vectorise subset of Lx*/
      //parallel multiplication of the vectors
      result_vec = vmulq_f32(Lx_vec,lki_vec);
      //store value to x[Li[p]]
      vst1q_lane_f32(&result,result_vec,0);
      x [Li [p]] -= (double)result;
      result = 0;
      vst1q_lane_f32(&result,result_vec,1);
      x [Li [p+1]] -= (double)result;
      result = 0;
      vst1q_lane_f32(&result,result_vec,2);
      x [Li [p+2]] -= (double)result;
      result = 0;
      vst1q_lane_f32(&result,result_vec,3);
      x [Li [p+3]] -= (double)result;
      count++;
      p+=4;
    }
    //normal calculation for the remaining indices
    for ( ; p < c [i] ; p++)
    {
      x [Li [p]] -= Lx [p] * lki ;
    }

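Now there is no segmentation fault, but my code still gives the wrong answer: the results differ from those obtained before vectorisation. Where am I going wrong?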

You are converting double to float, doing the computation with NEON, and then converting back to double, right?

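32-bit NEON does not support double. So the conversion to float is done by the VFP, execution switches to NEON for the math, switches back to the VFP, converts to double, and stores the result.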
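One likely contributor to the mismatch, which the answer does not spell out, is that a product rounded through float carries only about 7 significant decimal digits, while the original loop keeps about 16. The sketch below isolates that effect in plain scalar C, with no NEON involved. The function names are hypothetical, and the subtraction mirrors the `-=` used in the second listing of the question.

```c
/* Reference update, entirely in double precision, as in the scalar loop. */
double update_double(double x, double lx, double lki)
{
    return x - lx * lki;
}

/* Same update, but both operands and the product are rounded to float
   first, which is what the float32x4_t version effectively does before
   converting the product back to double. */
double update_via_float(double x, double lx, double lki)
{
    float prod = (float)lx * (float)lki;
    return x - (double)prod;
}
```

For inputs that are exactly representable in float (0.5 and 2.0, say) the two functions agree; for values such as 0.1 and 0.3 they differ by a small amount, so a bit-for-bit comparison against the double-precision result will fail even when the vectorised loop is otherwise correct.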

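Note, however, that each conversion wastes roughly 12 cycles. You are wasting 24 cycles per iteration, which is far more than you hoped to save by vectorising.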
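To make the two conversions per element concrete, here is a tidied sketch of the loop from the question, with comments marking where each conversion happens. The function name `scatter_update_f32` and the `begin`/`end` parameters are hypothetical, introduced only for illustration. It uses `vgetq_lane_f32` to read each lane instead of storing to a temporary, and on targets without NEON it falls back to the scalar loop. This is not a recommended implementation, only a way to see the overhead being described.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Computes x[Li[p]] -= Lx[p] * lki for p in [begin, end). */
void scatter_update_f32(double *x, const int *Li, const double *Lx,
                        double lki, int begin, int end)
{
    int p = begin;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t lki_vec = vdupq_n_f32((float)lki);

    /* Process full groups of four elements only. */
    for (; p + 4 <= end; p += 4) {
        float lx_float[4];
        for (int m = 0; m < 4; ++m) {
            lx_float[m] = (float)Lx[p + m];   /* conversion 1: double -> float */
        }
        float32x4_t prod = vmulq_f32(vld1q_f32(lx_float), lki_vec);

        /* conversion 2: float -> double, once per lane; the scatter
           through Li keeps these stores scalar. */
        x[Li[p]]     -= (double)vgetq_lane_f32(prod, 0);
        x[Li[p + 1]] -= (double)vgetq_lane_f32(prod, 1);
        x[Li[p + 2]] -= (double)vgetq_lane_f32(prod, 2);
        x[Li[p + 3]] -= (double)vgetq_lane_f32(prod, 3);
    }
#endif

    /* Remaining elements (or all of them without NEON), in double. */
    for (; p < end; ++p) {
        x[Li[p]] -= Lx[p] * lki;
    }
}
```

Only the multiplication runs four lanes at a time; the conversions on the way in and the indexed updates on the way out are still done one element at a time, which is where the per-iteration cost comes from.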

Now you have a choice:

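1) Stay away from NEON when you have to deal with double.
2) Unroll very heavily so that the 24-cycle overhead is reduced.
3) Read the doubles as 64-bit integers with NEON and do the conversion with integer bit hacks.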

1) is the most realistic, 2) is nearly impossible with intrinsics, and 3) requires expert knowledge of NEON and the IEEE 754 data types.
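To give a feel for what option 3 involves, the following is a scalar sketch of a double-to-float conversion done purely with integer operations on the IEEE 754 bit patterns. It is not NEON code and not the answerer's implementation. The helper name `double_to_float_bits` is hypothetical, and the sketch assumes a normal, finite, non-zero input whose magnitude fits in float's normal range. It truncates the mantissa instead of rounding, so the result can be one unit in the last place below what a cast would give.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: double -> float using only integer bit operations.
   Assumption: the input is a normal, finite, non-zero value within float's
   normal range. Zero, denormals, infinities and NaN are NOT handled. */
float double_to_float_bits(double d)
{
    uint64_t in;
    memcpy(&in, &d, sizeof in);               /* reinterpret the 64 bits */

    /* Sign: bit 63 of the double becomes bit 31 of the float. */
    uint32_t sign = (uint32_t)(in >> 63) << 31;

    /* Exponent: 11 bits with bias 1023 become 8 bits with bias 127. */
    int32_t exponent = (int32_t)((in >> 52) & 0x7FF) - 1023 + 127;

    /* Mantissa: keep the top 23 of 52 bits (52 - 23 = 29), truncating. */
    uint32_t mantissa = (uint32_t)((in >> 29) & 0x7FFFFF);

    uint32_t out = sign | ((uint32_t)exponent << 23) | mantissa;

    float f;
    memcpy(&f, &out, sizeof f);               /* reinterpret as float */
    return f;
}
```

A vectorised version would apply the same shifts and masks to several 64-bit lanes at once, and would also need the reverse conversion and the special cases omitted here, which is why the answer describes this route as requiring expert knowledge.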

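There is no free lunch. Compilers and intrinsics will not automatically generate optimised machine code. NEON code generated from intrinsics is mostly garbage, sometimes even wrong, and unless you dig into the disassembly you will not find out what went wrong.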