std::inner_product 4x faster than a manual loop, but no SIMD being used?

Keywords: SIMD, 4x, inner product, std. Updated: 2023-10-16

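I was curious how std::inner_product() compares in performance with a hand-written dot product, so I ran a test.

std::inner_product() came out 4x faster than the manual implementation. I find this strange, because there aren't that many ways to compute a dot product, are there? I also can't see any SSE/AVX registers being used in the computation.

Setup: VS2013/MSVC (12?), Haswell i7 4770 CPU, 64-bit build, release mode.

Here is the C++ test code: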

#include <iostream>
#include <functional>
#include <numeric>
#include <cstdint>
#include <intrin.h>   // __rdtsc and __rdtscp (MSVC)

int main() {
   const int arraySize = 1000;
   const int numTests = 500;
   unsigned int f = 0, s = 0;   // receive the auxiliary value written by __rdtscp
   unsigned long long* array1 = new unsigned long long[arraySize];
   unsigned long long* array2 = new unsigned long long[arraySize];
   // Initialise arrays
   for (int i = 0; i < arraySize; i++){
      unsigned long long val = __rdtsc();
      array1[i] = val;
      array2[i] = val;
   }
   // std::inner_product test
   unsigned long long start1 = __rdtscp(&s);
   for (int i = 0; i < numTests; i++){
      volatile unsigned long long result = std::inner_product(array1, array1 + arraySize, array2, static_cast<uint64_t>(0));
   }
   unsigned long long finish1 = __rdtscp(&s);
   // Manual dot product test
   unsigned long long start2 = __rdtscp(&f);
   for (int i = 0; i < numTests; i++){
      volatile unsigned long long result = 0;
      for (int j = 0; j < arraySize; j++){
         result += (array1[j] * array2[j]);
      }
   }
   unsigned long long finish2 = __rdtscp(&f);

   std::cout << "STL:     :  " << static_cast<double>(finish1 - start1) / numTests << " CPU cycles per dot product" << std::endl;
   std::cout << "Manually :  " << static_cast<double>(finish2 - start2) / numTests << " CPU cycles per dot product" << std::endl;

   delete[] array1;
   delete[] array2;
   return 0;
}

Your test is flawed, and that probably accounts for much of the difference.

volatile uint64_t result = 0;
for (int i = 0; i < arraySize; i++){
   result += (array1[i] * array2[i]);
}

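Note how you use a volatile-qualified variable on every iteration here. This forces the compiler to write each intermediate result to memory, so the running sum cannot simply be kept in a register.

In contrast, your inner_product version: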

volatile uint64_t result = std::inner_product(array1, array1 + arraySize, array2, static_cast<uint64_t>(0));

computes the inner product first, with optimizations allowed, and only then assigns the result to the volatile-qualified variable.
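One way to put the two versions on equal footing, as a minimal sketch: accumulate the manual sum in an ordinary local and let the caller store it to a volatile variable once, just as the inner_product test does. The helper names dot_manual and dot_stl are hypothetical and not part of the original code. The element type matches the unsigned long long arrays from the question.

```cpp
#include <cstddef>
#include <numeric>

// Hypothetical helper: manual dot product that accumulates in an ordinary
// (non-volatile) local, so the compiler is free to keep the sum in a register.
unsigned long long dot_manual(const unsigned long long* a,
                              const unsigned long long* b,
                              std::size_t n) {
    unsigned long long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Hypothetical helper: the same computation through std::inner_product.
unsigned long long dot_stl(const unsigned long long* a,
                           const unsigned long long* b,
                           std::size_t n) {
    return std::inner_product(a, a + n, b, 0ULL);
}
```

Inside the timing loop, each call would then be written as `volatile unsigned long long result = dot_manual(array1, array2, arraySize);`, so the volatile store happens once per dot product in both tests. Whether the two timings actually converge depends on the compiler and its flags, so treat this as a fairer setup to measure rather than a predicted result.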