IntelMPI error (Works with openMPI)

I am running into this error and would like to know how to fix it. Some background: the code works with Open MPI, but not with Intel MPI.

The error:

Fatal error in MPI_Recv: Invalid argument, error stack:
MPI_Recv(200): MPI_Recv(buf=0x7fffffff92f0, count=2, MPI_INT, src=0, 
tag=123, MPI_COMM_WORLD, status=(ni$
MPI_Recv(93).: Null pointer in parameter status

The code. The first send:

int rank;
int store[2];
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// if master node
if(rank == 0) {

    // Some maths to distribute the workload where positions_start and 
    // positions end are std::vector<int>
    int this_core[2];
    for(int i = 0; i < number_threads; i++) {
        store[0] = positions_start[i];
        store[1] = positions_end[i];
        if (i == 0) {
            //memcpy(this_core, store, 2 * sizeof(int));
            this_core[0] = positions_start[i];
            this_core[1] = positions_end[i];
        } else {
            MPI_Send(&store, 2, MPI_INT, i, 123, MPI_COMM_WORLD);
        }
    }
}

The first receive:

// not master node
} else {
   // Create array
    store[0] = -1;
    store[1] = -1;
    MPI_Recv(&store, 2, MPI_INT, 0, 123, MPI_COMM_WORLD, NULL);

The second send:

    float tempDataSet[(aVolumeSize[1] * aVolumeSize[0]) * ((store[1] - store[0]) + 1)];
    // Some math computations here and results stored in tempDataSet
    // Send results to the master
    MPI_Send(&tempDataSet, (aVolumeSize[1] * aVolumeSize[0]) * ((store[1] - store[0]) + 1), MPI_FLOAT, 0, 124, MPI_COMM_WORLD);
}

The second receive:

 // back in master node receive the array from other nodes and store them in a larger array
 // Set the size of the buffer is necessary
    unsigned int number_of_voxels(aVolumeSize[0] * aVolumeSize[1] * aVolumeSize[2]);
     std::vector<float>& apVoxelDataSet;
     apVoxelDataSet.resize(number_of_voxels, 0);
        for (int i = 1; i < number_threads; ++i) {
            // Calculate the size of the array
            arrSize = (aVolumeSize[1] * aVolumeSize[0]) * ((positions_end[i] - positions_start[i]) + 1);
            float values[arrSize + 1];
            MPI_Recv(&values, arrSize, MPI_FLOAT, i, 124, MPI_COMM_WORLD, NULL);
            std::copy(values, values + arrSize + 1, apVoxelDataSet.begin() + curLoc);
            curLoc += arrSize;
        }

With Open MPI everything works as expected; the error only appears when I use Intel MPI. How can I fix this error so that it also works with Intel MPI?

Edit: the Slurm file:

#!/bin/bash --login
#SBATCH --job-name=MPI_assignment
#SBATCH -o MPI_assignment.out
#SBATCH -e MPI_assignment.err
#SBATCH -t 0-12:00
#SBATCH --ntasks=60
#SBATCH --tasks-per-node=8
module purge
module load compiler/intel mpi/intel
printf "Making Project ...n"
make
printf "Done!"
mpirun -np $SLURM_NTASKS ./test >& MPI_assignment.log.$SLURM_JOBID

If I change the line

module load compiler/intel mpi/intel

to either of the following:

module load compiler/intel mpi/openmpi

module load compiler/gnu mpi/openmpi

the program works and no error occurs.

The Makefile:

CXX=mpiicc
BIN=test
LIB=libImplicitSurface.a
OBJECTS= ImplicitSurface.o test.o
CXXFLAGS+=-I../include -O3 -Wall 
LDFLAGS+=-L. -lImplicitSurface

all: $(BIN)

$(BIN): $(LIB) test.o
@echo Build $@
@$(CXX) -o $@ test.o $(LDFLAGS)

$(LIB): ImplicitSurface.o
@echo Build $@ from $<
@$(AR) rcs $@ $<

# Default rule for creating OBJECT files from CXX files
%.o: ../src/%.cxx
@echo Build $@ from $<
@$(CXX) $(CXXFLAGS) -c $< -o $@

ImplicitSurface.o: ../include/ImplicitSurface.h ../include/ImplicitSurface.inl ../src/ImplicitSurface.cxx
test.o: ../include/ImplicitSurface.h $(LIB) ../src/test.cxx

clean:
$(RM) $(OBJECTS)
$(RM) $(LIB)
$(RM) $(BIN)
$(RM) -r ../doc/html
$(RM) -r ../doc/tex

If you are not going to use the status returned by MPI_Recv(), you must use MPI_STATUS_IGNORE rather than NULL.
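
As a minimal sketch (reusing the variable names from the question, with everything else left unchanged), the two offending calls would become:

// Worker side: first receive, previously passed NULL as the status argument
MPI_Recv(&store, 2, MPI_INT, 0, 123, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// Master side: second receive, previously passed NULL as the status argument
MPI_Recv(&values, arrSize, MPI_FLOAT, i, 124, MPI_COMM_WORLD, MPI_STATUS_IGNORE);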

Your program is incorrect and therefore has undefined behaviour (read: you were lucky that it happened to work with Open MPI).
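
If you do want the receive metadata instead, a sketch (not part of the original code) is to pass a real MPI_Status object and query it, for example with MPI_Get_count:

MPI_Status status;
MPI_Recv(&store, 2, MPI_INT, 0, 123, MPI_COMM_WORLD, &status);

int received = 0;
MPI_Get_count(&status, MPI_INT, &received);  // number of MPI_INT elements actually received
// status.MPI_SOURCE and status.MPI_TAG identify the sender and the message tag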