Testing parallel_for_ performance in OpenCV

Keywords: for, performance, parallel, test, OpenCV. Updated: 2023-10-16

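I am testing parallel_for_ in OpenCV by comparing it with the normal (serial) operation for simple array summation and multiplication.

I have an array of 100 integers, split into 10 segments of 10 elements each, and I process the segments with parallel_for_.

I also have the normal operation, a plain loop from 0 to 99, for summation and multiplication.

Then I measured the elapsed time, and the normal operation is faster than the parallel_for_ operation.

My CPU is an Intel(R) Core(TM) i7-2600 quad-core CPU. The parallel_for_ operation took 0.002 sec (2 clock ticks) for summation and 0.003 sec (3 clock ticks) for multiplication.

But the normal operation took 0.0000 sec (less than one clock tick) for both summation and multiplication. What am I missing? My code is as follows.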

Test class

#include <opencv2/core/internal.hpp>
#include <opencv2/core/core.hpp>
#include <tbb/tbb.h>
using namespace tbb;
using namespace cv;
template <class type>
class Parallel_clipBufferValues:public cv::ParallelLoopBody
{
   private:
       type *buffertoClip;
       type maxSegment;
       char typeOperation;//m = mul, s = summation
       static double total;
   public:
       // The base-class constructor runs automatically; it must not be called by hand.
       Parallel_clipBufferValues(): buffertoClip(0), maxSegment(0), typeOperation('s'){}
       Parallel_clipBufferValues(type *buffertoprocess, const type max, const char op): buffertoClip(buffertoprocess), maxSegment(max), typeOperation(op){ 
           if(typeOperation == 's')
                total = 0; 
           else if(typeOperation == 'm')
                total = 1; 
       }
       // The base-class destructor also runs automatically; calling it explicitly would destroy the base twice.
       ~Parallel_clipBufferValues(){}
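       virtual void operator()(const cv::Range &r) const{
           // parallel_for_ may pass a range that covers several segments, so visit each one.
           // NOTE: total is shared by all threads and updated without synchronization (a data race).
           for(int seg = r.start; seg < r.end; ++seg)
           {
               type *inputOutputBufferPTR = buffertoClip+(seg*maxSegment);
               for(int i = 0; i < maxSegment; ++i)
               {
                   if(typeOperation == 's')
                      total += *(inputOutputBufferPTR+i);
                   else if(typeOperation == 'm')
                      total *= *(inputOutputBufferPTR+i);
               }
           }
       }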
       static double getTotal(){return total;}
       void normalOperation(){
           //int iteration = sizeof(buffertoClip)/sizeof(type);
           if(typeOperation == 'm')
           {
               for(int i = 0; i < 100; ++i)
               {
                  total *= buffertoClip[i];
               }
           }
           else if(typeOperation == 's')
           {
               for(int i = 0; i < 100; ++i)
               {
                  total += buffertoClip[i];
               }
           }
       }
};

Main

    #include "stdafx.h"
    #include "TestClass.h"
    #include <ctime>
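    // Definition of the static member for the int specialization (the initializer makes it a definition).
    template<> double Parallel_clipBufferValues<int>::total = 0;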
    int _tmain(int argc, _TCHAR* argv[])
    {
        const int SIZE=100;
        int myTab[SIZE];
        double totalSum_by_parallel;
        double totalSun_by_normaloperation;
        double elapsed_secs_parallel;
        double elapsed_secs_normal;
        for(int i = 1; i <= SIZE; i++)
        {
            myTab[i-1] = i;
        }
        int maxSeg =10;
        clock_t begin_parallel = clock();
        cv::parallel_for_(cv::Range(0,maxSeg), Parallel_clipBufferValues<int>(myTab, maxSeg, 'm'));
        totalSum_by_parallel = Parallel_clipBufferValues<int>::getTotal();
        clock_t end_parallel = clock();
        elapsed_secs_parallel = double(end_parallel - begin_parallel) / CLOCKS_PER_SEC;
        clock_t begin_normal = clock();
        Parallel_clipBufferValues<int> norm_op(myTab, maxSeg, 'm');
        norm_op.normalOperation();
        totalSun_by_normaloperation = norm_op.getTotal();
        clock_t end_normal = clock();
        elapsed_secs_normal = double(end_normal - begin_normal) / CLOCKS_PER_SEC;
        return 0;
    }

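Let me offer a few considerations:

Accuracy

The clock() function is not accurate at all. Its tick is roughly 1 / CLOCKS_PER_SEC, but how often it is updated, and whether it advances uniformly, depends on the system and the implementation. See this post for more details.

Better alternatives for measuring time:

This post for Windows.
This post for *nix.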
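
As a portable option that is not taken from the posts above, C++11's std::chrono::steady_clock can stand in for clock(). The sketch below is only an illustration under that assumption; elapsedSeconds and multiplyOneToHundred are hypothetical names, and the real resolution of the clock still depends on the platform.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed wall time in seconds.
template <class Work>
double elapsedSeconds(Work work)
{
    typedef std::chrono::steady_clock Clock;   // monotonic clock: it never jumps backwards
    const Clock::time_point begin = Clock::now();
    work();
    const Clock::time_point end = Clock::now();
    return std::chrono::duration<double>(end - begin).count();
}

// Hypothetical workload: multiplies the values 1..100, like the loop in the question.
double multiplyOneToHundred()
{
    double total = 1;
    for (int i = 1; i <= 100; ++i)
        total *= i;
    return total;
}
```

A call such as elapsedSeconds([&result] { result = multiplyOneToHundred(); }) returns a fractional number of seconds instead of a count of coarse ticks, so a very short run no longer shows up as exactly zero just because it finished within one tick.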

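Trials and test environment

Measurements are always affected by errors. A performance measurement of your code is affected by other programs, caches, operating system jobs, scheduling, and user activity (this is a short list; there is more). To get a better measurement you have to repeat the test many times (say 1000 or more) and then calculate the average. You should also prepare your test environment so that it is as clean as possible.

More details about testing in these posts:

How do I write a correct micro-benchmark in Java
NAS Parallel Benchmarks
Visual C++ 11 Beta Benchmark of Parallel Loops (for code examples)
Great articles from our own Eric Lippert about benchmarking (they are about C#, but most of the content applies directly to any benchmark): C# Performance Benchmark Mistakes (part II)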
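
The "repeat many times and average" advice could look roughly like the sketch below. It is a minimal illustration, not a full benchmarking harness: meanSeconds, sumOneToHundred, and g_sink are hypothetical names, and the 1000 repetitions are simply the figure suggested above.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls `work` `repetitions` times and returns the mean time per call in seconds.
template <class Work>
double meanSeconds(Work work, std::size_t repetitions)
{
    typedef std::chrono::steady_clock Clock;
    const Clock::time_point begin = Clock::now();
    for (std::size_t i = 0; i < repetitions; ++i)
        work();
    const Clock::time_point end = Clock::now();
    const double total = std::chrono::duration<double>(end - begin).count();
    return total / static_cast<double>(repetitions);
}

// The volatile write keeps each call from being removed entirely by the optimizer.
volatile long long g_sink = 0;

// Hypothetical workload: sums the values 1..100.
void sumOneToHundred()
{
    long long total = 0;
    for (int i = 1; i <= 100; ++i)
        total += i;
    g_sink = total;
}
```

Calling meanSeconds(sumOneToHundred, 1000) times the whole batch and divides by the repetition count, which also keeps the timer's own cost from dominating a very short workload.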

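Overhead and scalability

In your case the overhead of parallel execution (and of the structure of your test code) is much higher than the loop body itself, so making the algorithm parallel is not efficient here. Parallel execution must always be evaluated, measured, and compared in a specific scenario. It is not a magic medicine that speeds everything up. Take a look at this article about how to quantify scalability.

Just as an example, if you have to sum/multiply 100 numbers, it is better to use SIMD instructions (even better in an unrolled loop).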
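
To make the unrolling idea concrete, here is a portable sketch that uses four independent accumulators instead of explicit SIMD intrinsics. That substitution is my assumption for the sake of portability: optimizing compilers can often turn a loop of this shape into SIMD instructions, but whether they do depends on the compiler and its flags. The name sumUnrolled is hypothetical.

```cpp
#include <cstddef>

// Sums `count` ints using four independent accumulators (a 4x unrolled loop).
long long sumUnrolled(const int* data, std::size_t count)
{
    long long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
    {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    long long total = a0 + a1 + a2 + a3;
    for (; i < count; ++i)   // leftover elements when count is not a multiple of 4
        total += data[i];
    return total;
}
```

For the 100-element array from the question (values 1 to 100) this runs on a single thread with no scheduling cost at all, which is the point being made: for so little work, a tighter serial loop is a better investment than parallel dispatch.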

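Measure it

Try making your loop body empty (or have it execute a single NOP operation or a volatile write so that it is not optimized away). You will roughly measure the overhead. Now compare it with your results.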
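
A rough sketch of that experiment follows. To keep it self-contained it uses std::thread as a stand-in for cv::parallel_for_, which is an assumption on my part: OpenCV normally dispatches to a thread pool, so starting fresh threads as done here will overstate the overhead. The names emptyBody, serialSeconds, and threadedSeconds are hypothetical.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Nearly empty loop body: one volatile write to a local, so it is not optimized away.
void emptyBody()
{
    volatile int touched = 0;
    touched = 1;
}

// Runs the body `segments` times on the calling thread; returns elapsed seconds.
double serialSeconds(int segments)
{
    const auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < segments; ++i)
        emptyBody();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}

// Runs the body once on each of `segments` newly started threads;
// the returned time includes starting and joining the threads.
double threadedSeconds(int segments)
{
    const auto begin = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    workers.reserve(segments);
    for (int i = 0; i < segments; ++i)
        workers.emplace_back(emptyBody);
    for (std::thread& worker : workers)
        worker.join();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}
```

Comparing serialSeconds(10) with threadedSeconds(10) gives an estimate of the pure dispatch cost for 10 segments, since the body does almost nothing in either case. Repeat the comparison many times before trusting the numbers.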

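Notes about this test

IMO this kind of test is pretty useless. You cannot compare serial and parallel execution in a generic way. It is something you should always check against a specific situation (in the real world many things come into play, synchronization for example).

Imagine: you make your loop body really "heavy" and you see a big speed-up with parallel execution. Now you make your real program parallel and you find that performance is worse. Why? Because parallel execution is slowed down by locks, by cache problems, or by serialized access to a shared resource.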
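
The effect of a shared resource can be illustrated with the sketch below, again using std::thread as an assumed stand-in for parallel_for_. Both functions compute the same sum, but the first takes one lock per element, while the second lets each thread work on its own partial result. All names here are hypothetical, and `threads` is assumed to be at least 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: half-open index range [first, last) handled by worker t.
static void chunkBounds(std::size_t size, int threads, int t,
                        std::size_t& first, std::size_t& last)
{
    const std::size_t chunk = (size + threads - 1) / threads;
    first = std::min(size, static_cast<std::size_t>(t) * chunk);
    last = std::min(size, first + chunk);
}

// Every addition takes the same lock, so the threads effectively run one at a time.
long long sumWithSharedLock(const std::vector<int>& data, int threads)
{
    long long total = 0;
    std::mutex guard;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&, t] {
            std::size_t first, last;
            chunkBounds(data.size(), threads, t, first, last);
            for (std::size_t i = first; i < last; ++i)
            {
                std::lock_guard<std::mutex> lock(guard);
                total += data[i];
            }
        });
    for (std::thread& worker : workers)
        worker.join();
    return total;
}

// Each thread sums into a local variable and writes its own slot once;
// the partial results are combined after all threads have joined.
long long sumWithPartials(const std::vector<int>& data, int threads)
{
    std::vector<long long> partial(threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&, t] {
            std::size_t first, last;
            chunkBounds(data.size(), threads, t, first, last);
            long long local = 0;
            for (std::size_t i = first; i < last; ++i)
                local += data[i];
            partial[t] = local;
        });
    for (std::thread& worker : workers)
        worker.join();
    long long total = 0;
    for (std::size_t t = 0; t < partial.size(); ++t)
        total += partial[t];
    return total;
}
```

Timing the two versions on a large input shows how much of the parallel run is spent waiting on the lock rather than doing work. The static total in the question's class is a shared resource of the same kind, except that it is updated with no lock at all.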

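The test itself is meaningless unless you are testing your specific code in your specific situation (because too many factors come into play and you simply cannot ignore them). What does this mean? Well, you can only compare what you tested... If your program performs total *= buffertoClip[i]; then your results are reliable. If your real program does something else, then you have to repeat the tests with that something else.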