Parallel for with omp

Keywords: omp with for Parallel. Updated: 2023-10-16

I am trying to optimize the following loop with OpenMP:

    #pragma omp parallel for private(diff)
    for (int j = 0; j < x.d; ++j) {
        diff = x(example,j) - x(chosen_pts[ndx - 1],j);
        #pragma omp atomic
        d2 += diff * diff;
    }

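But it actually runs 4 times slower than without the #pragma.

Edit

As Piotr S., coincoin and erenon pointed out, in my case x.d is very small, which is why parallelizing the loop made my code slower. I am also posting the outer loop, in case there is some potential for multithreading there (x.n is over 100 million):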

    float sum_distribution = 0.0;
    // look for the point that is furthest from any center
    float max_dist = 0.0;
    for (int i = 0; i < x.n; ++i) {
        int example = dist2[i].second;
        float d2 = 0.0, diff;
        //#pragma omp parallel for private(diff) reduction(+:d2)
        for (int j = 0; j < x.d; ++j) {
            diff = x(example,j) - x(chosen_pts[ndx - 1],j);
            d2 += diff * diff;
        }
        if (d2 < dist2[i].first) {
            dist2[i].first = d2;
        }
        if (dist2[i].first > max_dist) {
            max_dist = dist2[i].first;
        }
        sum_distribution += dist2[i].first;
    }

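In case anyone is interested, here is the whole function: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L169, but by my measurements 85% of the elapsed time comes from this loop.

Yes, the outer loop as posted can be parallelized with OpenMP. Every variable modified in the loop is either local to an iteration or used for aggregation across the loop. I am also assuming that the calls to x() in the computation of diff have no side effects.

To do the aggregation in parallel both correctly and efficiently, you need an OpenMP loop with a reduction clause. For sum_distribution the reduction operation is +, and for max_dist it is max. So adding the following pragma in front of the outer loop should do the job: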

    #pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)

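Note that max as a reduction operation is only available since OpenMP 3.1. That is not particularly new, so most OpenMP-enabled compilers already support it, but not all of them do, and you might be using an older compiler version. So it makes sense to check your compiler's documentation.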
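
For reference, here is a minimal self-contained sketch of how the outer loop could look with both reductions applied. The names Dataset, DistanceStats, update_distances and center are hypothetical stand-ins, not taken from the linked code: Dataset only mimics the x(i, j) accessor from the question, and center plays the role of chosen_pts[ndx - 1]. It assumes, as above, that x() has no side effects.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the dataset type in the question: n points of
// dimension d stored row-major; x(i, j) returns coordinate j of point i.
struct Dataset {
    int n;
    int d;
    std::vector<float> data;
    float operator()(int i, int j) const {
        return data[static_cast<std::size_t>(i) * d + j];
    }
};

// Hypothetical result type bundling the two aggregated values.
struct DistanceStats {
    float sum_distribution;
    float max_dist;
};

// Lowers dist2[i].first to the squared distance from point dist2[i].second
// to the point `center` when that distance is smaller, and returns the sum
// and the maximum of the resulting dist2[i].first values.
DistanceStats update_distances(const Dataset& x, int center,
                               std::vector<std::pair<float, int>>& dist2) {
    float sum_distribution = 0.0f;
    float max_dist = 0.0f;

    // Each thread gets private copies of the two reduction variables;
    // OpenMP combines them with + and max when the loop finishes.
    #pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
    for (int i = 0; i < x.n; ++i) {
        const int example = dist2[i].second;
        float d2 = 0.0f;  // declared inside the loop body, so private per iteration
        for (int j = 0; j < x.d; ++j) {
            const float diff = x(example, j) - x(center, j);
            d2 += diff * diff;
        }
        // Iteration i only writes dist2[i], so there is no race between threads.
        if (d2 < dist2[i].first) {
            dist2[i].first = d2;
        }
        if (dist2[i].first > max_dist) {
            max_dist = dist2[i].first;
        }
        sum_distribution += dist2[i].first;
    }
    return {sum_distribution, max_dist};
}
```

With GCC or Clang this needs -fopenmp to actually run in parallel; without the flag the pragma is ignored and the loop runs serially. Because a parallel reduction adds the float values in a different order than the serial loop, sum_distribution may differ slightly in the last digits between the two versions.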