Nested OpenMP parallelisation in combination with std::thread

Keywords: nested, OpenMP, parallelisation, std::thread. Updated: 2023-10-16

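Hello everybody,

I am currently working on a larger project in the field of image processing. I am developing with Visual Studio 2013 (not negotiable). Without bothering you with further details, here is my problem:

I have two tasks that must run in parallel:

An iterative solution of a linear equation system (using 1-2 threads)
A rather complex process involving image-to-image registration (using all remaining threads)

To know which images need to be registered, an approximate solution of the linear equation system is required, so the two tasks have to run simultaneously. (Thanks to Z boson for pointing out that this information was missing.) The iterative solver runs continuously and is notified after each successful image registration.

The code will run on a 24-core system.

At the moment the image registration is implemented with OpenMP, using "#pragma omp parallel for". The iterative solver is started with a std::thread and also uses an OpenMP "#pragma omp parallel for" internally.

Now I know that, according to the OpenMP documentation, an OpenMP thread that encounters nested parallelism will use its thread team to execute the code. But I think this does not apply in my case, because it is a std::thread that starts the second parallel region.

For better understanding, here is some example code: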

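#include <thread>

int GetImageFromApproximateSolution(); // defined elsewhere
void IterativeSolution();
void RegisterImage(int a, int b);

int main()
{
    // The solver runs on its own std::thread, outside any OpenMP region
    std::thread solverThread(&IterativeSolution);
    #pragma omp parallel for
    for(int a = 0; a < 100; a++)
    {
        int b = GetImageFromApproximateSolution();
        RegisterImage(a,b);
        // Inform IterativeSolution about result of registration
    }
    solverThread.join(); // wait for the solver before leaving main
}
void IterativeSolution()
{
    #pragma omp parallel for
    for(int i = 0; i < 2; i++)
    {
        //SolveColumn(i);
    }
}
void RegisterImage(int a, int b)
{
    // Do Registration
}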

My question at this point is: will the code above create too many threads? If so, would the following code solve the problem?

#include <algorithm>
#include <thread>
#include <omp.h>

int GetImageFromApproximateSolution(); // defined elsewhere
void IterativeSolution();
void RegisterImage(int a, int b);

int main()
{
    // The max is to avoid having less than 1 thread
    int numThreads = std::max(omp_get_max_threads() - 2, 1);
    std::thread solverThread(&IterativeSolution);
    #pragma omp parallel for num_threads(numThreads)
    for(int a = 0; a < 100; a++)
    {
        int b = GetImageFromApproximateSolution();
        RegisterImage(a,b);
        // Inform IterativeSolution about result of registration
    }
    solverThread.join(); // wait for the solver before leaving main
}
void IterativeSolution()
{
    #pragma omp parallel for num_threads(2)
    for(int i = 0; i < 2; i++)
    {
        //SolveColumn(i);
    }
}
void RegisterImage(int a, int b)
{
    // Do Registration
}
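
The thread budget used in the snippet above can be pulled out into a small helper so that the edge cases are easy to check. This is only a sketch: SplitThreadBudget is a hypothetical name that does not appear in the original code, and in the real program totalThreads would come from omp_get_max_threads().

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical helper (not part of the original post): splits a thread
// budget between image registration and the iterative solver.
// Returns {registrationThreads, solverThreads}.
std::pair<int, int> SplitThreadBudget(int totalThreads, int solverThreads)
{
    assert(totalThreads >= 1 && solverThreads >= 1);
    // The solver never gets more threads than exist in total
    int solver = std::min(solverThreads, totalThreads);
    // Registration always keeps at least one thread, as in the post's max(..., 1)
    int registration = std::max(totalThreads - solver, 1);
    return std::make_pair(registration, solver);
}
```

On the 24-core machine from the question, SplitThreadBudget(24, 2) yields 22 registration threads and 2 solver threads. With a budget of 2 or fewer, the sum exceeds the budget by one thread, because registration is never allowed to drop to zero.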

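This produces what is effectively undefined behaviour as far as the OpenMP standard is concerned. Most implementations I have tested will simply create 24 threads for each of the two parallel regions in your first example, for a total of 48. The second example should not create too many threads, but since it relies on undefined behaviour it may do anything, from crashing to turning your computer into a jelly-like substance, without warning.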
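
One way to see what a particular runtime does is to measure the team size of a parallel region opened from a std::thread and of one opened from the main thread. The following is a minimal sketch; TeamSizes, TeamSizeHere and ObserveTeamSizes are hypothetical names, and the numbers it reports depend entirely on the compiler and OpenMP runtime. The #ifdef guards let it compile without OpenMP, in which case both sizes are 1.

```cpp
#include <cassert>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

struct TeamSizes { int mainTeam; int threadTeam; };

// Opens a parallel region on the calling thread and reports its team size.
static int TeamSizeHere()
{
    int size = 1;
    #pragma omp parallel
    {
        #pragma omp single
        {
#ifdef _OPENMP
            size = omp_get_num_threads();
#endif
        }
    }
    return size;
}

// Opens one region from a std::thread and one from the calling thread.
TeamSizes ObserveTeamSizes()
{
    TeamSizes sizes = {1, 1};
    std::thread worker([&sizes] { sizes.threadTeam = TeamSizeHere(); });
    sizes.mainTeam = TeamSizeHere();
    worker.join();
    assert(sizes.mainTeam >= 1 && sizes.threadTeam >= 1);
    return sizes;
}
```

If both values equal the core count, the runtime has given each region its own full team, which is the oversubscription described above. The two regions are not guaranteed to overlap in time in this sketch, so it shows team sizes rather than the peak thread count.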

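Since you are already using OpenMP, I would suggest making the code standard-compliant by simply removing the std::thread and using nested OpenMP parallel regions instead. You can do it like this: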

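#include <algorithm>
#include <omp.h>

int GetImageFromApproximateSolution(); // defined elsewhere
void IterativeSolution();
void RegisterImage(int a, int b);

int main()
{
    // The max is to avoid having less than 1 thread
    int numThreads = std::max(omp_get_max_threads() - 2, 1);
    // Outer region: thread 0 drives the registration, thread 1 runs the solver.
    // If the runtime grants only one thread here, IterativeSolution() is never called.
    #pragma omp parallel num_threads(2)
    {
        if(omp_get_thread_num() > 0){
            IterativeSolution();
        }else{
            #pragma omp parallel for num_threads(numThreads)
            for(int a = 0; a < 100; a++)
            {
                int b = GetImageFromApproximateSolution();
                RegisterImage(a,b);
                // Inform IterativeSolution about result of registration
            }
        }
    }
}
void IterativeSolution()
{
    #pragma omp parallel for num_threads(2)
    for(int i = 0; i < 2; i++)
    {
        //SolveColumn(i);
    }
}
void RegisterImage(int a, int b)
{
    // Do Registration
}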

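You may need to define the environment variables OMP_NESTED=true and OMP_MAX_ACTIVE_LEVELS=2 (or higher) to enable nested regions. This version has the advantage of being fully defined within OpenMP, and it should be portable to any environment that supports nested parallel regions. If your version does not support nested OpenMP parallel regions, then your proposed solution may be the best remaining option.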
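
Nesting can also be requested from code rather than through the environment, and it is worth checking whether the runtime actually honours the request. The sketch below assumes a hypothetical helper named CountNestedWorkers. It calls omp_set_nested, which is available from OpenMP 2.0 (the level implemented by the Visual Studio 2013 compiler) but deprecated in OpenMP 5.0, and calls omp_set_max_active_levels only when the compiler reports OpenMP 3.0 or later.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: requests two levels of nesting, then counts how many
// threads actually execute the innermost region.
int CountNestedWorkers(int outer, int inner)
{
    assert(outer >= 1 && inner >= 1);
    int workers = 0;
#ifdef _OPENMP
    omp_set_nested(1);               // OpenMP 2.0 and later; deprecated in 5.0
#if _OPENMP >= 200805
    omp_set_max_active_levels(2);    // only exists from OpenMP 3.0 onwards
#endif
#endif
    #pragma omp parallel num_threads(outer)
    {
        #pragma omp parallel num_threads(inner)
        {
            #pragma omp atomic
            workers++;
        }
    }
    return workers;
}
```

If CountNestedWorkers(2, 2) returns 4, both levels ran in parallel. A smaller result means the runtime serialised the inner region or granted fewer threads than requested, which the standard permits.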