Parallel summation of the array is slower than the sequential one in C++

Keywords: sum, sequential, parallel, array, C++. Last updated: 2023-10-16

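I wrote code that sums an array in parallel using C++ std::thread. However, the parallel sum takes 0.6 seconds, while the sequential sum takes 0.3 seconds.

I don't think this code performs any synchronization on arr or ret.

Why does this happen?

My CPU is an i7-8700, which has 6 physical cores.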

#include <stdio.h>
#include <ctime>
#include <thread>
// Constants
#define THREADS 4
#define ARR_SIZE 200000000
int ret[THREADS];
// Function for thread.
void parallel_sum(int *arr, int thread_id) {
    int s = ARR_SIZE / THREADS * thread_id, e = ARR_SIZE / THREADS * (thread_id + 1);
    printf("%d, %d\n", s, e);
    for (int i = s; i < e; i++) ret[thread_id] += arr[i];
}
int main() {
    // Variable definitions
    int *arr = new int[ARR_SIZE]; // 200 million elements
    clock_t t1, t2; // clock() returns clock_t; used to measure elapsed CPU time
    std::thread *threads = new std::thread[THREADS];
    // Initialization
    for (int i = 0; i < ARR_SIZE; i++) arr[i] = 1;
    for (int i = 0; i < THREADS; i++) ret[i] = 0;
    long long int sum = 0;
    // Parallel sum start
    t1 = clock();
    for (int i = 0; i < THREADS; i++) threads[i] = std::thread(parallel_sum, arr, i);
    for (int i = 0; i < THREADS; i++) threads[i].join();
    t2 = clock();
    for (int i = 0; i < THREADS; i++) sum += ret[i];
    printf("[%lf] Parallel sum %lld \n", (float)(t2 - t1) / (float)CLOCKS_PER_SEC, sum);
    // Parallel sum end

    sum = 0; // Initialization

    // Sequential sum start
    t1 = clock();
    for (int i = 0; i < ARR_SIZE; i++) sum += arr[i];
    t2 = clock();
    printf("[%lf] Sequential sum %lld \n", (float)(t2 - t1) / (float)CLOCKS_PER_SEC, sum);
    // Sequential sum end

    return 0;
}

for (int i = s; i < e; i++) ret[thread_id] += arr[i];

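This causes heavy cache contention, because the elements of the ret array most likely share the same cache line. Every write by one thread invalidates that line in the other cores' caches, even though each thread touches a different element. This is commonly called false sharing.

A simple fix is to accumulate into an auxiliary (thread-)local variable inside the loop and update the shared counter only once at the end, for example: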

int temp = 0;
for (int i = s; i < e; i++) temp += arr[i];
ret[thread_id] += temp;
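
To show the local-accumulator fix in a complete form, here is a minimal sketch. The function name parallel_sum_local, the use of std::vector, and the long long accumulator are assumptions made for illustration and do not come from the original post. Each worker writes to the shared array exactly once, so the threads no longer compete for a cache line inside the hot loop.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): sums `data` with
// `num_threads` workers, each accumulating into its own local variable.
long long parallel_sum_local(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);  // one result slot per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        // The last thread also takes the remainder so no element is skipped.
        const std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long local = 0;  // private to this thread
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;   // single write to the shared array
        });
    }
    for (auto& w : workers) w.join();

    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

Calling parallel_sum_local(values, 4) on a vector of 200000000 ones would correspond to the setup in the question. Unlike the original code, this sketch also handles sizes that are not divisible by the thread count.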

Or, better, use a single global ret of type std::atomic<int> for the multi-threaded sum. Then you can simply write:

int temp = 0;
for (int i = s; i < e; i++) temp += arr[i];
ret += temp;

Or, more efficiently:

ret.fetch_add(temp, std::memory_order_relaxed);
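
The atomic variant can be put together as follows. This is only a sketch under stated assumptions: the function name parallel_sum_atomic is hypothetical, and it uses std::atomic<long long> as a local variable instead of the global std::atomic<int> described above, to leave headroom against overflow. Relaxed ordering is enough here because join() already synchronizes the workers with the thread that reads the result.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): each worker sums its
// chunk locally, then adds the partial result to one shared atomic counter.
long long parallel_sum_atomic(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        // The last thread also takes the remainder so no element is skipped.
        const std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&data, &total, begin, end] {
            long long local = 0;  // private accumulator, no sharing in the loop
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            // One atomic update per thread instead of one per element.
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();  // join() makes all updates visible here
    return total.load();
}
```

The important point is that the atomic operation happens once per thread. Calling fetch_add inside the inner loop would bring the contention back.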

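With compiler optimizations enabled (benchmarking any other way makes no sense), I get the following results:

[0.093481] Parallel sum 200000000
[0.073333] Sequential sum 200000000

Note that we have recorded the total CPU consumption in both cases. It is not surprising that the parallel sum uses more total CPU time: it has more work to do, because it must start the threads and aggregate their results.

You do not record wall-clock time, but since four cores share the work, the wall-clock time is probably shorter in the parallel case. Adding code to record elapsed wall-clock time shows that the parallel version takes about half the wall-clock time of the serial version. At least, that is the case on my machine with reasonable compiler optimization settings.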
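
To see the difference between CPU time and wall-clock time directly, both can be recorded around the same piece of work. The sketch below is one possible way to do that, not the code the answerer used. The Timing struct and the measure function are hypothetical names. On POSIX systems std::clock reports processor time summed over all threads of the process, while std::chrono::steady_clock reports elapsed real time.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical result type: both measurements in seconds.
struct Timing {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // processor time consumed by the process
};

// Hypothetical helper: runs `work` once and measures it with both clocks.
template <typename F>
Timing measure(F&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    work();

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

Wrapping the thread creation and join loops of the question in a lambda passed to measure would report both numbers. With four busy threads, the CPU time can be expected to exceed the wall-clock time.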