Creation of std::thread slows down main program by 50%

Keywords: speed, thread, creation, std, main program. Updated: 2023-10-16

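Just creating a thread and joining it slows down execution of the main thread by 50%. As you can see in the example below, the thread does nothing and still has a significant impact on performance. I thought it might be a power/frequency scaling issue, so I tried sleeping after creating the thread, but that did not help. If the program below is compiled with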

g++ -std=c++11 -o out thread_test.cpp -pthread

it prints these results:

Before thread() trial 0 time: 312024526 ignore -1593025974
Before thread() trial 1 time: 243018707 ignore -494037597
Before thread() trial 2 time: 242929293 ignore 177714863
Before thread() trial 3 time: 242935290 ignore 129069571
Before thread() trial 4 time: 243113945 ignore 840242475
Before thread() trial 5 time: 242824224 ignore -1635749271
Before thread() trial 6 time: 242809490 ignore -1256215542
Before thread() trial 7 time: 242910180 ignore -555222712
Before thread() trial 8 time: 235645414 ignore 537501443
Before thread() trial 9 time: 235746347 ignore 118363977
After thread() trial 0 time: 567509646 ignore 223146324
After thread() trial 1 time: 476450035 ignore -393907838
After thread() trial 2 time: 476377789 ignore -1678874628
After thread() trial 3 time: 476377012 ignore -1015350122
After thread() trial 4 time: 476185152 ignore 2034280344
After thread() trial 5 time: 476420949 ignore -1647334529
After thread() trial 6 time: 476354679 ignore 441573900
After thread() trial 7 time: 476120322 ignore -1576726357
After thread() trial 8 time: 476464850 ignore -895798632
After thread() trial 9 time: 475996533 ignore -997590921

whereas all of the trials should run at the same speed.

Edit: now using rdtsc() for the time measurement, a longer duration per trial, and the computed result.

thread_test.cpp:

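#include <cstdint>
#include <cstdlib>
#include <thread>
#include <iostream>
// Sum ten million rand() values so that each trial does a fixed amount of work.
// Note: the int sum overflows (undefined behavior); it is printed only so the
// loop cannot be optimized away.
int dorands(){
  int a =0;
  for(int i=0; i<10000000; i++){
   a +=rand();
  }
  return a;
}
// Read the x86 time-stamp counter; cpuid serializes earlier instructions first.
inline uint64_t rdtsc(){
  uint32_t lo, hi;
  __asm__ __volatile__ (
    "xorl %%eax, %%eax\n"
    "cpuid\n"
    "rdtscp\n"
    : "=a" (lo), "=d" (hi)
    :
    : "%ebx", "%ecx" );
  return (uint64_t)hi << 32 | lo;
}

// Thread body that does nothing.
int foo(){return 0;}
int main(){
  uint64_t begin;
  uint64_t end;
  // Ten trials before any thread has been created.
  for(int i = 0; i< 10; i++){
    begin= rdtsc();
    volatile int e = dorands();
    end = rdtsc();
    std::cout << "Before thread() trial "<<i<<" time: " << end-begin << " ignore " << e << std::endl;
  }
  std::thread t1(foo);
  t1.join();
  // Ten identical trials after the thread has been created and joined.
  for(int i = 0; i< 10; i++){
    begin= rdtsc();
    volatile int e = dorands();
    end = rdtsc();
    std::cout << "After thread() trial "<<i<<" time: " << end-begin << " ignore " << e << std::endl;
  }
  return 0;
}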

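std::rand() is the C rand(), which under glibc calls __random(). __random() calls __libc_lock_lock() and __libc_lock_unlock(), and I think that if we dig into that code we will find that the locks are essentially no-ops until a thread has been created.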
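One way to test this explanation is to swap rand() for a generator that keeps its state in a caller-owned object and therefore takes no lock. The following is a minimal sketch, assuming std::minstd_rand from <random> is an acceptable stand-in for rand(); the name dorands_local is hypothetical and does not appear in the original program.

```cpp
#include <cstdint>
#include <random>

// Hypothetical variant of dorands(): sums n values from a caller-owned engine.
// The engine's state lives in the object passed in, so no global lock is taken.
std::uint64_t dorands_local(std::minstd_rand& gen, int n) {
    std::uint64_t a = 0;  // unsigned sum, so wrap-around is well defined
    for (int i = 0; i < n; i++) {
        a += gen();  // operator() advances the engine and returns the next value
    }
    return a;
}
```

If the trials are timed with dorands_local in place of dorands(), and the "before" and "after" times then agree, that would support the idea that the slowdown comes from the lock inside rand() rather than from the thread itself.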

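I think you're running into a basic problem: at least on a typical multitasking OS, there is a range of durations, from a few milliseconds to a second or so, in which it is difficult to get meaningful timing measurements.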

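For extremely short sequences, you can use the clock counter (e.g., RDTSC on x86) and run the sequence a few times. If a task switch happens during a run, it will stand out clearly, because that run takes many times longer than the others.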
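The "run it a few times and look for the run that stands out" idea can be sketched as follows. This sketch uses std::chrono::steady_clock instead of RDTSC so that it stays portable; the helper names time_runs and count_outliers, and the idea of flagging runs slower than a fixed multiple of the fastest one, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <vector>

// Time `runs` executions of f and return each duration in nanoseconds.
template <typename F>
std::vector<std::int64_t> time_runs(F f, int runs) {
    std::vector<std::int64_t> samples;
    samples.reserve(runs);
    for (int i = 0; i < runs; i++) {
        auto begin = std::chrono::steady_clock::now();
        f();
        auto end = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count());
    }
    return samples;
}

// Count samples that took more than `factor` times as long as the fastest run.
// The threshold is an assumed heuristic, not a rule from the answer.
int count_outliers(const std::vector<std::int64_t>& samples, std::int64_t factor) {
    if (samples.empty()) return 0;
    std::int64_t fastest = *std::min_element(samples.begin(), samples.end());
    fastest = std::max<std::int64_t>(fastest, 1);  // avoid a zero baseline
    int outliers = 0;
    for (std::int64_t s : samples) {
        if (s > fastest * factor) outliers++;
    }
    return outliers;
}
```

A run interrupted by a task switch would then show up as an outlier and could be discarded before the remaining samples are compared.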

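That points toward the real problem: once you get to a sequence (such as yours) that takes long enough that at least one task switch is almost certain to happen while it runs, the time lost to task switching can throw off the timing substantially. In particular, if a task switch happens during one run but not during another, it can make the second appear substantially faster than the first.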

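Eventually, you get to tasks that take long enough that every run includes a number of task switches, so the difference caused by the number of task switches is pretty much lost in the noise.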

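Note: in theory, clock is supposed to measure only CPU time, not wall-clock time. In reality, it is nearly impossible to factor out all of the task-switch time.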
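To see how far CPU time and wall-clock time diverge on a given system, both can be recorded around the same call. This is a minimal sketch; the Timing struct and the measure helper are hypothetical names introduced here, and the sketch assumes that std::clock() reports processor time, which is what the C standard describes but which some C runtimes do not follow.

```cpp
#include <chrono>
#include <ctime>

struct Timing {
    double cpu_seconds;   // derived from std::clock()
    double wall_seconds;  // derived from std::chrono::steady_clock
};

// Run f once and report both the CPU time and the wall-clock time it took.
template <typename F>
Timing measure(F f) {
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();
    f();
    auto w1 = std::chrono::steady_clock::now();
    std::clock_t c1 = std::clock();
    Timing t;
    t.cpu_seconds = static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(w1 - w0).count();
    return t;
}
```

When the two numbers differ noticeably for a purely CPU-bound function, the gap is a rough indication of how much time went to other tasks during the measurement.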

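Your test demonstrates (or may demonstrate) another fairly basic problem: dorands() computes something but never (for example) prints out the result. A sufficiently smart compiler could (easily) deduce that the function has essentially no effect and eliminate it more or less completely.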

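Even when you do print out the result of dorands(), you never seed the random number generator, so it is required to produce identical results on every run. Again, a sufficiently smart compiler could figure that out, compute the results at compile time, and simply print out the correct values. To prevent that, we can (as one possibility) seed the random number generator differently on each run; the usual way is to retrieve the current time and pass it to srand.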

To eliminate (or at least reduce) these problems, we could rewrite the code something like this:

#include <cstdlib>
#include <ctime>
#include <thread>
#include <iostream>
// Sum 100 million rand() values; long long is wide enough to hold the sum.
long long int dorands(){
  long long int a =0;
  for(int i=0; i<100000000; i++){
    a +=rand();
  }
  return a;
}
// Thread body that does nothing.
int foo(){return 0;}
int main(){
  // Seed from the current time so the results cannot be precomputed.
  srand(time(NULL));
  // Trials 1 and 2: before any thread has been created.
  clock_t begin = clock();
  long long int e = dorands();
  clock_t end = clock();
  std::cout << "ignore: " << e << ", trial 1 time: " << end-begin << std::endl;
  begin = clock();
  e = dorands();
  end = clock();
  std::cout << "ignore: " << e << ", trial 2 time: " << end - begin << std::endl;
  std::thread t1(foo);
  t1.join();
  // Trials 3 and 4: after the thread has been created and joined.
  begin = clock();
  e = dorands();
  end = clock();
  std::cout << "ignore: " << e << ", trial 3 time: " << end - begin << std::endl;
  begin = clock();
  e = dorands();
  end = clock();
  std::cout << "ignore: " << e << ", trial 4 time: " << end - begin << std::endl;

  return 0;
}

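Here I print out the value returned from dorands(), so the compiler cannot skip the calls to rand entirely. I have also increased the iteration count in dorands() so that each trial runs for at least a second (on my computer, anyway).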

Running it, I get results like this:

ignore: 1638407535924, trial 1 time: 1519
ignore: 1638386748597, trial 2 time: 1455
ignore: 1638433228933, trial 3 time: 1433
ignore: 1638288863328, trial 4 time: 1491

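In this particular run, the first two trials are (on average) slower than the last two, but there is enough variation and overlap that we can probably guess fairly safely that it is just noise. If there is any real difference in average speed, it is too small for us to measure.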
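The comparison can be made concrete with the four times printed above. The trials before the thread average (1519 + 1455) / 2 = 1487 ticks and the trials after it average (1433 + 1491) / 2 = 1462 ticks, a difference of 25 ticks (about 1.7%), while the two trials within each pair already differ by 64 and 58 ticks. A small helper for this arithmetic might look like the sketch below; the names mean and relative_change are hypothetical, and both functions assume non-empty input.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean of the samples; the vector must not be empty.
double mean(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Change of the `after` mean relative to the `before` mean, as a fraction.
// Negative values mean the `after` trials were faster on average.
double relative_change(const std::vector<double>& before,
                       const std::vector<double>& after) {
    return (mean(after) - mean(before)) / mean(before);
}
```

With only two samples per group this is a rough sanity check rather than a statistical test; more trials would be needed to say anything stronger.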