OpenMP iteration for loop in parallel region

Keywords: for, loop, iteration, OpenMP, region, parallel. Last updated: 2023-10-16

Apologies if the title isn't clear. I wasn't sure how to word it.

I'm wondering whether there is any way to do the following:

#pragma omp parallel
{
    for (int i = 0; i < iterations; i++) {
        #pragma omp for
        for (int j = 0; j < N; j++) {
            // Do something
        }
    }
}

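Ignoring things like the omitted private specifiers on the for loop, is there any way to fork the threads outside my outer loop so that I can parallelize just the inner loop? From my understanding (please correct me if I'm wrong), all threads will execute the outer loop. I'm not sure about the behavior of the inner loop, but I think the omp for will distribute chunks to each thread that encounters it.

What I want is to avoid forking/joining iterations times and instead do it only once, outside the outer loop. Is this the right strategy?

What if there were another outer loop that should not be parallelized? That is...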

#pragma omp parallel
{
    for (int i = 0; i < iterations; i++) {
        for(int k = 0; k < innerIterations; k++) {
            #pragma omp for
            for (int j = 0; j < N; j++) {
                // Do something
            }
            // Do something else
        }
    }
}

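If someone could point me to an example of a large application parallelized with OpenMP, so that I can better understand which strategies to use with OpenMP, that would be great. I can't seem to find any.

Clarification: I'm looking for solutions that don't change the loop order or involve blocking, caching, and general performance considerations. I want to understand how this can be done in OpenMP on the loop structure as specified. The // Do something may or may not have dependencies; assume that it does and that you can't move things around.

The way you handled the two for loops looks right to me, in the sense that it achieves the behavior you want: the outer loop is not parallelized, while the inner loop is.

To better clarify what happens, I'll try to add some notes to your code: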

#pragma omp parallel
{
  // Here you have a certain number of threads, let's say M
  for (int i = 0; i < iterations; i++) {
        // Each thread enters this region and executes all the iterations 
        // from i = 0 to i < iterations. Note that i is a private variable.
        #pragma omp for
        for (int j = 0; j < N; j++) {
            // What happens here is shared among threads so,
            // according to the scheduling you choose, each thread
            // will execute a particular portion of your N iterations
        } // IMPLICIT BARRIER             
  }
}

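The implicit barrier is a synchronization point where the threads wait for one another. As a general rule of thumb, it is therefore preferable to parallelize outer loops rather than inner loops, since that creates a single synchronization point for all iterations*N iterations (instead of the iterations synchronization points you create above).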
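To make the point about who executes what concrete, here is a minimal sketch that counts how often each loop body runs. The struct loop_counts and the function count_loop_runs are hypothetical names invented for this illustration, not something from the question. The idea, assuming a compiler with OpenMP support (for example GCC or Clang with -fopenmp), is that the outer body runs once per thread per iteration, while the inner body runs exactly iterations*N times in total because the omp for splits it across the team.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper type for this sketch only. */
typedef struct {
    long outer_runs; /* outer-loop bodies executed, summed over all threads */
    long inner_runs; /* inner-loop bodies executed, summed over all threads */
    int  threads;    /* team size seen inside the parallel region (1 without OpenMP) */
} loop_counts;

loop_counts count_loop_runs(int iterations, int N) {
    long outer_runs = 0;
    long inner_runs = 0;
    int threads = 1;

    #pragma omp parallel
    {
        /* One thread records the team size; single ends with a barrier. */
        #pragma omp single
        {
#ifdef _OPENMP
            threads = omp_get_num_threads();
#endif
        }

        /* Every thread executes this loop in full; i is private. */
        for (int i = 0; i < iterations; i++) {
            #pragma omp atomic
            outer_runs++;

            /* The N iterations are divided among the threads. */
            #pragma omp for
            for (int j = 0; j < N; j++) {
                #pragma omp atomic
                inner_runs++;
            }
            /* Implicit barrier here, once per outer iteration. */
        }
    }

    loop_counts result = { outer_runs, inner_runs, threads };
    return result;
}
```

Calling count_loop_runs(3, 10) should report inner_runs equal to 30 whatever the team size is, and outer_runs equal to 3 times the number of threads. The atomics are there only to keep the counters correct; they are not something you would want in a real inner loop.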

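I'm not sure I can answer your question. I've only been using OpenMP for a few months, but when I try to answer questions like this I run some hello-world printf tests like the one shown below. I think that might help answer your questions. Also try #pragma omp for nowait and see what happens.

Just make sure that when you "// Do something" and "// Do something else" you don't write to the same memory address and create a race condition. Also, if you are going to do a lot of reading and writing, you need to think about how to use the cache efficiently.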

#include <stdio.h>
#include <omp.h>

void loop(const int iterations, const int N) {
    #pragma omp parallel
    {
        int start_thread = omp_get_thread_num();
        printf("start thread %d\n", start_thread);
        // Every thread runs the whole outer loop
        for (int i = 0; i < iterations; i++) {
            printf("\titeration %d, thread num %d\n", i, omp_get_thread_num());
            // The N inner iterations are split among the threads
            #pragma omp for
            for (int j = 0; j < N; j++) {
                printf("\t\t inner loop %d, thread num %d\n", j, omp_get_thread_num());
            }
        }
    }
}

int main() {
    loop(2, 30);
}
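As for the nowait suggestion, here is a small sketch of a case where I believe dropping the barrier is safe. The function fill_independent and its arrays are made up for this illustration. The assumption is that a and b do not overlap and that the second loop never reads what the first one writes; if your "// Do something" has dependencies between outer iterations, as the question assumes, removing the barrier this way would most likely introduce a race.

```c
/* Hypothetical example: two worksharing loops that touch different arrays.
   Assumes a and b each hold at least n elements and do not alias. */
void fill_independent(double *a, double *b, int n) {
    #pragma omp parallel
    {
        /* nowait removes the implicit barrier at the end of this loop,
           so a thread that finishes early can start on the next loop. */
        #pragma omp for nowait
        for (int j = 0; j < n; j++) {
            a[j] = 2.0 * j;
        }

        /* Safe without the barrier only because this loop neither reads
           nor writes a[]. The barrier at the end of this loop is kept. */
        #pragma omp for
        for (int j = 0; j < n; j++) {
            b[j] = j + 1.0;
        }
    }
}
```

The result is the same with or without nowait here; what changes is only whether threads wait for each other between the two loops.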

In terms of performance, you might want to consider fusing your loops like this.

#pragma omp for
for(int n=0; n<iterations*N; n++) {
    int i = n/N;
    int j = n%N;    
    //do something as function of index i and j
}
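A minimal sketch of that index mapping, next to the collapse(2) clause that OpenMP 3.0 and later provide for perfectly nested loops. Both function names and the value 100 * i + j are placeholders chosen for this example. Either form is only valid when every (i, j) pair is independent, which is not what the question's clarification assumes, so treat it as an option for the dependency-free case.

```c
/* Hypothetical example: out must hold iterations * N ints, and
   iterations * N is assumed to fit in an int. */
void fill_fused(int *out, int iterations, int N) {
    /* One flat loop; i and j are recovered from the flat index n. */
    #pragma omp parallel for
    for (int n = 0; n < iterations * N; n++) {
        int i = n / N;
        int j = n % N;
        out[n] = 100 * i + j; /* stand-in for work depending on i and j */
    }
}

void fill_collapsed(int *out, int iterations, int N) {
    /* collapse(2) asks OpenMP to treat both loops as one iteration space. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < N; j++) {
            out[i * N + j] = 100 * i + j;
        }
    }
}
```

Both functions fill the array identically; with collapse(2) the compiler does the index arithmetic for you, at the cost of requiring a perfectly nested loop pair.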

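This question is hard to answer because it really depends on the dependencies inside your code. But a general way to approach it is to invert the nesting of the loops, like this: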

#pragma omp parallel
{
    #pragma omp for
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < iterations; i++) {
            // Do something
        }
    }
}

Of course, whether or not this is possible depends on the code inside your loops.
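As an illustration of a case where the inversion is possible, here is a sketch in which each element x[j] depends only on its own previous value, so the dependency runs along i but never across j. The update 0.5 * x[j] + 1.0 and both function names are invented for this example; if your loop body reads neighboring elements such as x[j - 1], the inverted version would not be equivalent.

```c
/* Original order: one implicit barrier per outer iteration. */
void relax_original_order(double *x, int iterations, int N) {
    #pragma omp parallel
    {
        for (int i = 0; i < iterations; i++) {
            #pragma omp for
            for (int j = 0; j < N; j++) {
                x[j] = 0.5 * x[j] + 1.0;
            }
        }
    }
}

/* Inverted order: each thread runs all iterations for its own j values,
   with a single implicit barrier at the end of the omp for. */
void relax_inverted_order(double *x, int iterations, int N) {
    #pragma omp parallel
    {
        #pragma omp for
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < iterations; i++) {
                x[j] = 0.5 * x[j] + 1.0;
            }
        }
    }
}
```

Each element goes through the same sequence of updates in both versions, so the results match; the inverted version just synchronizes once instead of iterations times.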