MPI_Scatter slows down the code?

Keywords: code speed, Scatter, MPI. Updated: 2023-10-16

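Hi everyone! I wrote some code that uses MPI to compute the scalar product of two huge vectors. First, the process with rank 0 creates two random vectors and sends them to the other processes via MPI_Scatter. Each process then computes its partial sum and sends it back to rank 0. The main problem is that MPI_Scatter takes a great deal of time to send the data to the other processes, so my program gets slower as I add processes. I measured it with MPI_Wtime(), and in some cases the MPI_Scatter() calls take up 80% of the computation time. My serial code is faster than any MPI setup I have tried.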

Here are my results on a dual-core machine with different numbers of processes:

Processes   Time (s)
serial       0.3275
1            0.3453
2            0.4522
4            3.4755
8            5.8645
10           8.9112
20          24.4612
40          63.2633

Do you have any idea how to avoid this bottleneck? Never mind the MPI_Allgather()... it is part of the homework :(

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* The asker's helpers; their definitions are not shown in the question.
   The prototypes are inferred from the calls below (return types assumed). */
void random_Vector(double *vec, int n);
void VecVec_Test(int n, double *vec1, double *vec2, double result);

int main(int argc, char* argv[])
{
    srand((unsigned)time(NULL));
    int size, whoAmI, i;
    int N = 10000000;
    double start, elapsed_time, end;
    double *Vec1 = NULL, *Vec2 = NULL;   // only allocated on rank 0
    MPI_Init(&argc, &argv);
    start = MPI_Wtime();
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &whoAmI);
    if(N%size != 0){
        // Every rank takes this branch, so all of them can finalize cleanly
        if(whoAmI == 0)
            printf("choose a number of processes that divides 10000000\n");
        MPI_Finalize();
        return 1;
    }
    int chunk = N/size;
    double *buf1 = malloc(chunk * sizeof(double));  // Recv_Buf for MPI_Scatter
    double *buf2 = malloc(chunk * sizeof(double));
    double *gatherResult = malloc(size * sizeof(double));   // Recv_Buf for MPI_Allgather
    double result = 0, FinalResult = 0;   // result must start at 0 before accumulating
    if(whoAmI == 0){
        Vec1 = malloc(N * sizeof(double));
        Vec2 = malloc(N * sizeof(double));
        random_Vector(Vec1, N);
        random_Vector(Vec2, N);
    }
    /* sends the divided array to the other processes */
    MPI_Scatter(Vec1, chunk, MPI_DOUBLE, buf1, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(Vec2, chunk, MPI_DOUBLE, buf2, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if(whoAmI == 0){
        // Time spent so far: setup, vector generation and both scatters
        end = MPI_Wtime();
        elapsed_time = end - start;
        printf("Time taken %.4f seconds\n", elapsed_time);
    }
    for(i = 0; i < chunk; i++){
        result += buf1[i] * buf2[i];
    }
    printf("The sub result: #%d, %.2f\n", whoAmI, result);
    /* Allgather: (sendBuf, number of Elements in SendBuf, Type of Send, RecvBuf, Number of Elements Recv, Recv Type, Comm) */
    MPI_Allgather(&result, 1, MPI_DOUBLE, gatherResult, 1, MPI_DOUBLE, MPI_COMM_WORLD);
    for(i = 0; i < size; i++){
        FinalResult += gatherResult[i];
    }
    MPI_Barrier(MPI_COMM_WORLD);
    end = MPI_Wtime();
    elapsed_time = end - start;
    if(whoAmI == 0){
        printf("FinalResult is: %.2f\n", FinalResult);
        printf("Time taken %.4f seconds\n", elapsed_time);
        VecVec_Test(N, Vec1, Vec2, FinalResult);  // Test if the Result is correct
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

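Distributed computation of a scalar product only makes sense if the vectors are already stored in a distributed fashion. Otherwise, pushing the contents of the large vectors from the root process to the other processes over the network (or any other IPC mechanism) every time takes longer than a single-threaded process needs to do all the work itself. The scalar product is a memory-bound problem: current CPU cores are so fast that, when the data comes from main memory rather than from the CPU cache, it will most likely arrive more slowly than the core can process it.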

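To demonstrate how MPI can help in this case, you could modify the algorithm so that the vectors are scattered first and the distributed scalar product is then computed many times: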

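MPI_Scatter(Vec1, chunk, MPI_DOUBLE, buf1, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Scatter(Vec2, chunk, MPI_DOUBLE, buf2, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
// Always a good idea to sync the processes before benchmarking
MPI_Barrier(MPI_COMM_WORLD);
start = MPI_Wtime();
for (i = 1; i <= 1000; i++) {
   // Local part of the scalar product
   double local_result = 0.0;
   for (int j = 0; j < chunk; j++)
      local_result += buf1[j] * buf2[j];
   // Sum the partial results of all ranks into result on rank 0
   MPI_Reduce(&local_result, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
}
end = MPI_Wtime();
if (whoAmI == 0)
   printf("Time per iteration: %f\n", (end - start) / 1000);

(A fragment rather than a complete program: it reuses the variables declared in the question's code.)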

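Now you should see the time per iteration decrease as the number of MPI processes grows, but only if adding more MPI processes means using more CPU sockets and therefore more aggregate memory bandwidth. Note the use of MPI_Reduce instead of MPI_Gather followed by a sum.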