MPI's Scatterv operation

Keywords: operation, MPI. Last updated: 2023-10-16

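I'm not sure I correctly understand what MPI_Scatterv is supposed to do. I have 79 items to scatter among a variable number of nodes. However, when I use the MPI_Scatterv command I get ridiculous numbers (as if the array elements of my receive buffer were uninitialized). Here is the relevant code snippet: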

MPI_Init(&argc, &argv);
int id, procs;
MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &procs);
//Assign each file a number and figure out how many files should be
//assigned to each node
int file_numbers[files.size()];
int send_counts[nodes] = {0}; 
int displacements[nodes] = {0};
for (int i = 0; i < files.size(); i++)
{
    file_numbers[i] = i;
    send_counts[i%nodes]++;
}   
//figure out the displacements
int sum = 0;
for (int i = 0; i < nodes; i++)
{
    displacements[i] = sum;
    sum += send_counts[i];
}   
//Create a receiving buffer
int *rec_buf = new int[79];
if (id == 0)
{
    MPI_Scatterv(&file_numbers, send_counts, displacements, MPI_INT, rec_buf, 79, MPI_INT, 0, MPI_COMM_WORLD);
}   
cout << "got here " << id << " checkpoint 1" << endl;
cout << id << ": " << rec_buf[0] << endl;
cout << "got here " << id << " checkpoint 2" << endl;
MPI_Barrier(MPI_COMM_WORLD); 
free(rec_buf);
MPI_Finalize();

When I run the code, I get the following output:

got here 1 checkpoint 1
1: -1168572184
got here 1 checkpoint 2
got here 2 checkpoint 1
2: 804847848
got here 2 checkpoint 2
got here 3 checkpoint 1
3: 1364787432
got here 3 checkpoint 2
got here 4 checkpoint 1
4: 903413992
got here 4 checkpoint 2
got here 0 checkpoint 1
0: 0
got here 0 checkpoint 2

I've read the OpenMPI documentation and looked through some code examples, and I'm not sure what I'm missing. Any help would be great!

One of the most common MPI mistakes strikes again:

if (id == 0)    // <---- PROBLEM
{
    MPI_Scatterv(&file_numbers, send_counts, displacements, MPI_INT,
                 rec_buf, 79, MPI_INT, 0, MPI_COMM_WORLD);
}

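MPI_SCATTERV is a collective MPI operation. Collective operations must be executed by all processes in the specified communicator in order to complete successfully. You are executing it only in rank 0, which is why only rank 0 gets the correct values.

Solution: remove the conditional if (...).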

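But there is another subtle error here. Since collective operations do not provide any status output, the MPI standard enforces strict matching between the number of elements sent to a rank and the number of elements that rank is willing to receive. In your case, the receiver always specifies 79 elements, which might not match the corresponding number in send_counts. You should use instead: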

MPI_Scatterv(file_numbers, send_counts, displacements, MPI_INT,
             rec_buf, send_counts[id], MPI_INT,
             0, MPI_COMM_WORLD);
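
For the receive count to match on every rank, each rank has to compute the same send_counts and displacements, not only the root. The following is a minimal sketch of that bookkeeping in plain C++ without any MPI calls, so it can be checked on its own. The names ScatterLayout and make_layout are hypothetical helpers introduced for illustration; they are not part of MPI or of the original code.

```cpp
#include <vector>

// Hypothetical helper type (not part of MPI): describes how n_items
// elements are split across procs ranks for use with MPI_Scatterv.
struct ScatterLayout {
    std::vector<int> counts;  // counts[r]: number of elements sent to rank r
    std::vector<int> displs;  // displs[r]: offset of rank r's first element
};

// Computes the same counts and displacements as the loops in the question,
// with the rank count passed in explicitly (procs instead of nodes).
ScatterLayout make_layout(int n_items, int procs) {
    ScatterLayout layout;
    layout.counts.assign(procs, 0);
    layout.displs.assign(procs, 0);

    // Round-robin counting: item i adds one element to rank i % procs.
    for (int i = 0; i < n_items; ++i) {
        layout.counts[i % procs]++;
    }

    // Each displacement is the running sum of the preceding counts, so
    // every rank receives one contiguous block of the send buffer.
    int sum = 0;
    for (int r = 0; r < procs; ++r) {
        layout.displs[r] = sum;
        sum += layout.counts[r];
    }
    return layout;
}
```

Assuming every rank calls make_layout(79, procs), it could pass layout.counts.data() and layout.displs.data() as the send counts and displacements, and layout.counts[id] as its own receive count. With 79 items on 5 ranks this gives counts of 16, 16, 16, 16, 15 and displacements of 0, 16, 32, 48, 64.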

Also note the following discrepancy in your code, which may just be a typo introduced while posting the question here:

MPI_Comm_size(MPI_COMM_WORLD, &procs);
                               ^^^^^
int send_counts[nodes] = {0};
                ^^^^^
int displacements[nodes] = {0};
                  ^^^^^

You obtain the number of ranks in the procs variable, but nodes is used in the rest of your code. I guess nodes should be replaced by procs.