MPI_Scatterv/ Gatherv using C++ with "large" 2D matrices throws MPI errors
I implemented some MPI_Scatterv and MPI_Gatherv routines for a parallel matrix-matrix multiplication. Everything works fine for small matrix sizes up to N = 180; if I exceed this size, e.g. N = 184, MPI throws errors while using MPI_Scatterv.
For the 2D scatter I use some constructs with MPI_Type_create_subarray and MPI_Type_create_resized. Explanations of these constructs can be found in other questions here.
The minimal example code I wrote fills a matrix A with some values, scatters it to the local processes, and writes each process's rank number into its local copy of the scattered A. Afterwards, the local copies are gathered back to the master process.
#include <cstdio>
#include "mpi.h"

#define N 184   // grid size
#define procN 2 // size of process grid

int main(int argc, char **argv) {
    double* gA = nullptr; // pointer to global array
    int rank, size;       // rank of current process and no. of processes

    // MPI initialization
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // force the correct number of processes
    if (size != procN * procN) {
        if (rank == 0) fprintf(stderr, "%s: Only works with np = %d.\n", argv[0], procN * procN);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // allocate and print global A on the master process
    if (rank == 0) {
        gA = new double[N * N];
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                gA[j * N + i] = j * N + i;
            }
        }
        printf("A is:\n");
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                printf("%f ", gA[j * N + i]);
            }
            printf("\n");
        }
    }

    // create local A on every process which we'll process
    double* lA = new double[N / procN * N / procN];

    // create a datatype to describe the subarrays of the gA array
    int sizes[2] = {N, N};                    // gA size
    int subsizes[2] = {N / procN, N / procN}; // lA size
    int starts[2] = {0, 0};                   // where this one starts
    MPI_Datatype type, subarrtype;
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_DOUBLE, &type);
    MPI_Type_create_resized(type, 0, N / procN * sizeof(double), &subarrtype);
    MPI_Type_commit(&subarrtype);

    // compute the number of send blocks and the distance between them
    int sendcounts[procN * procN];
    int displs[procN * procN];
    if (rank == 0) {
        for (int i = 0; i < procN * procN; i++) {
            sendcounts[i] = 1;
        }
        int disp = 0;
        for (int i = 0; i < procN; i++) {
            for (int j = 0; j < procN; j++) {
                displs[i * procN + j] = disp;
                disp += 1;
            }
            disp += ((N / procN) - 1) * procN;
        }
    }

    // scatter global A to all processes
    MPI_Scatterv(gA, sendcounts, displs, subarrtype, lA,
                 N * N / (procN * procN), MPI_DOUBLE,
                 0, MPI_COMM_WORLD);

    // print local A's on every process
    for (int p = 0; p < size; p++) {
        if (rank == p) {
            printf("la on rank %d:\n", rank);
            for (int i = 0; i < N / procN; i++) {
                for (int j = 0; j < N / procN; j++) {
                    printf("%f ", lA[j * N / procN + i]);
                }
                printf("\n");
            }
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    // write new values into the local A's
    for (int i = 0; i < N / procN; i++) {
        for (int j = 0; j < N / procN; j++) {
            lA[j * N / procN + i] = rank;
        }
    }

    // gather everything back to the master process
    MPI_Gatherv(lA, N * N / (procN * procN), MPI_DOUBLE,
                gA, sendcounts, displs, subarrtype,
                0, MPI_COMM_WORLD);

    // print processed global A on process 0
    if (rank == 0) {
        printf("Processed gA is:\n");
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                printf("%f ", gA[j * N + i]);
            }
            printf("\n");
        }
    }

    MPI_Type_free(&subarrtype);
    if (rank == 0) {
        delete[] gA;
    }
    delete[] lA;

    MPI_Finalize();
    return 0;
}
It compiles and runs with
mpicxx -std=c++11 -o test test.cpp
mpirun -np 4 ./test
For small N = 4, ..., 180 everything works fine:
A is:
0.000000 6.000000 12.000000 18.000000 24.000000 30.000000
1.000000 7.000000 13.000000 19.000000 25.000000 31.000000
2.000000 8.000000 14.000000 20.000000 26.000000 32.000000
3.000000 9.000000 15.000000 21.000000 27.000000 33.000000
4.000000 10.000000 16.000000 22.000000 28.000000 34.000000
5.000000 11.000000 17.000000 23.000000 29.000000 35.000000
la on rank 0:
0.000000 6.000000 12.000000
1.000000 7.000000 13.000000
2.000000 8.000000 14.000000
la on rank 1:
3.000000 9.000000 15.000000
4.000000 10.000000 16.000000
5.000000 11.000000 17.000000
la on rank 2:
18.000000 24.000000 30.000000
19.000000 25.000000 31.000000
20.000000 26.000000 32.000000
la on rank 3:
21.000000 27.000000 33.000000
22.000000 28.000000 34.000000
23.000000 29.000000 35.000000
Processed gA is:
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000
Here are the errors I get when using N = 184:
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(655)..............: MPI_Scatterv(sbuf=(nil), scnts=0x7ffee066bad0, displs=0x7ffee066bae0, dtype=USER<resized>, rbuf=0xe9e590, rcount=8464, MPI_DOUBLE, root=0, MPI_COMM_WORLD) failed
MPIR_Scatterv_impl(205).........: fail failed
I_MPIR_Scatterv_intra(265)......: Failure during collective
I_MPIR_Scatterv_intra(259)......: fail failed
MPIR_Scatterv(141)..............: fail failed
MPIC_Recv(418)..................: fail failed
MPIC_Wait(269)..................: fail failed
PMPIDI_CH3I_Progress(623).......: fail failed
pkt_RTS_handler(317)............: fail failed
do_cts(662).....................: fail failed
MPID_nem_lmt_dcp_start_recv(288): fail failed
dcp_recv(154)...................: Internal MPI error! cannot read from remote process
Fatal error in PMPI_Scatterv: Other MPI error, error stack:
PMPI_Scatterv(655)..............: MPI_Scatterv(sbuf=(nil), scnts=0x7ffef0de9b50, displs=0x7ffef0de9b60, dtype=USER<resized>, rbuf=0x21a7610, rcount=8464, MPI_DOUBLE, root=0, MPI_COMM_WORLD) failed
MPIR_Scatterv_impl(205).........: fail failed
I_MPIR_Scatterv_intra(265)......: Failure during collective
I_MPIR_Scatterv_intra(259)......: fail failed
MPIR_Scatterv(141)..............: fail failed
MPIC_Recv(418)..................: fail failed
MPIC_Wait(269)..................: fail failed
PMPIDI_CH3I_Progress(623).......: fail failed
pkt_RTS_handler(317)............: fail failed
do_cts(662).....................: fail failed
MPID_nem_lmt_dcp_start_recv(288): fail failed
dcp_recv(154)...................: Internal MPI error! cannot read from remote process
My guess is that something goes wrong with the subarrays, but then why does it work for N = 4, ..., 180? Another possibility is that my array data is not contiguous for large sizes, so the scatter no longer works. Could cache sizes become a problem? I can't believe that MPI should be unable to scatter a 2D array with N > 180...
I hope one of you can help me. Thanks a lot!
First of all, your code does not work for small N either. If I set N = 6 and initialize the matrix so that all entries are unique, i.e.
gA[j * N + i] = j*N+i;
then you can see that there is a bug:
mpiexec -n 4 ./gathervorig
A is:
0.000000 6.000000 12.000000 18.000000 24.000000 30.000000
1.000000 7.000000 13.000000 19.000000 25.000000 31.000000
2.000000 8.000000 14.000000 20.000000 26.000000 32.000000
3.000000 9.000000 15.000000 21.000000 27.000000 33.000000
4.000000 10.000000 16.000000 22.000000 28.000000 34.000000
5.000000 11.000000 17.000000 23.000000 29.000000 35.000000
la on rank 0:
0.000000 2.000000 7.000000
1.000000 6.000000 8.000000
2.000000 7.000000 12.000000
la on rank 1:
3.000000 5.000000 10.000000
4.000000 9.000000 11.000000
5.000000 10.000000 15.000000
la on rank 2:
18.000000 20.000000 25.000000
19.000000 24.000000 26.000000
20.000000 25.000000 30.000000
la on rank 3:
21.000000 23.000000 28.000000
22.000000 27.000000 29.000000
23.000000 28.000000 33.000000
The error here is not in the scatter/gather code but in the printing:
printf("%f ", lA[j * procN + i]);
should be
printf("%f ", lA[j * N/procN + i]);
Now this at least gives the correct answer for the scatter:
mpiexec -n 4 ./gathervorig
A is:
0.000000 6.000000 12.000000 18.000000 24.000000 30.000000
1.000000 7.000000 13.000000 19.000000 25.000000 31.000000
2.000000 8.000000 14.000000 20.000000 26.000000 32.000000
3.000000 9.000000 15.000000 21.000000 27.000000 33.000000
4.000000 10.000000 16.000000 22.000000 28.000000 34.000000
5.000000 11.000000 17.000000 23.000000 29.000000 35.000000
la on rank 0:
0.000000 6.000000 12.000000
1.000000 7.000000 13.000000
2.000000 8.000000 14.000000
la on rank 1:
3.000000 9.000000 15.000000
4.000000 10.000000 16.000000
5.000000 11.000000 17.000000
la on rank 2:
18.000000 24.000000 30.000000
19.000000 25.000000 31.000000
20.000000 26.000000 32.000000
la on rank 3:
21.000000 27.000000 33.000000
22.000000 28.000000 34.000000
23.000000 29.000000 35.000000
The gather fails for a similar reason: the local initialization
lA[j * procN + i] = rank;
should be
lA[j * N/procN + i] = rank;
After this change, the gather works as well:
Processed gA is:
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000
0.000000 0.000000 0.000000 2.000000 2.000000 2.000000
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000
1.000000 1.000000 1.000000 3.000000 3.000000 3.000000
I think the lesson here is to always use unique test data: an initialization like i*j makes the original bug hard to spot even on small systems.
Actually, the real problem is that you chose N = 4, so that procN = N/procN = 2. I always try to use sizes that give odd/unusual numbers, e.g. N = 6 gives N/procN = 3, which cannot be confused with procN = 2.