MPI result is different under Slurm than when run directly from the command line


I ran into a problem when running an MPI project under Slurm.

a1 is my executable. When I simply run mpiexec -np 4 ./a1, it works fine.

But when I run it under Slurm, it does not work properly; it looks like it stops partway through:

This is the output from mpiexec -np 4 ./a1, which is correct:

Processor1 will send and receive with processor0
Processor3 will send and receive with processor0
Processor0 will send and receive with processor1
Processor0 finished send and receive with processor1
Processor1 finished send and receive with processor0
Processor2 will send and receive with processor0
Processor1 will send and receive with processor2
Processor2 finished send and receive with processor0
Processor0 will send and receive with processor2
Processor0 finished send and receive with processor2
Processor0 will send and receive with processor3
Processor0 finished send and receive with processor3
Processor3 finished send and receive with processor0
Processor1 finished send and receive with processor2
Processor2 will send and receive with processor1
Processor2 finished send and receive with processor1
Processor0: I am very good, I save the hash in range 0 to 65
p: 4
Tp: 8.61754
Processor1 will send and receive with processor3
Processor3 will send and receive with processor1
Processor3 finished send and receive with processor1
Processor1 finished send and receive with processor3
Processor2 will send and receive with processor3
Processor1: I am very good, I save the hash in range 65 to 130
Processor2 finished send and receive with processor3
Processor3 will send and receive with processor2
Processor3 finished send and receive with processor2
Processor3: I am very good, I save the hash in range 195 to 260
Processor2: I am very good, I save the hash in range 130 to 195

This is the output under Slurm; it does not return the full result the way the direct command does:

Processor0 will send and receive with processor1
Processor2 will send and receive with processor0
Processor3 will send and receive with processor0
Processor1 will send and receive with processor0
Processor0 finished send and receive with processor1
Processor1 finished send and receive with processor0
Processor0 will send and receive with processor2
Processor0 finished send and receive with processor2
Processor2 finished send and receive with processor0
Processor1 will send and receive with processor2
Processor0 will send and receive with processor3
Processor2 will send and receive with processor1
Processor2 finished send and receive with processor1
Processor2 will send and receive with processor3
Processor1 finished send and receive with processor2

This is my slurm.sh file. I suspect I made some mistake in it that causes the result to differ from the command, but I am not sure...

#!/bin/bash
####### select partition (check CCR documentation)
#SBATCH --partition=general-compute --qos=general-compute
####### set memory that nodes provide (check CCR documentation, e.g., 32GB)
#SBATCH --mem=64000
####### make sure no other jobs are assigned to your nodes
#SBATCH --exclusive
####### further customizations
#SBATCH --job-name="a1"
#SBATCH --output=%j.stdout
#SBATCH --error=%j.stderr
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=12:00:00
mpiexec -np 4 ./a1
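
One thing worth noting about this script: it asks Slurm for 16 tasks on one node but then launches only 4 ranks with mpiexec, so the allocation and the launcher disagree. A quick way to see what Slurm actually granted is to print its environment variables and run a trivial srun step before the MPI job. This is a minimal diagnostic sketch using standard Slurm variables, not part of the original script:

echo "Job $SLURM_JOB_ID running on $SLURM_JOB_NODELIST"
echo "Nodes: $SLURM_JOB_NUM_NODES  Total tasks: $SLURM_NTASKS"
srun hostname | sort | uniq -c    # one line per launched task, grouped by node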

Coming back to answer my own question: I made a silly mistake and used the wrong slurm.sh with my MPI code. The correct slurm.sh is:

#!/bin/bash
####### select partition (check CCR documentation)
#SBATCH --partition=general-compute --qos=general-compute
####### set memory that nodes provide (check CCR documentation, e.g., 32GB)
#SBATCH --mem=32000
####### make sure no other jobs are assigned to your nodes
#SBATCH --exclusive
####### further customizations
#SBATCH --job-name="a1"
#SBATCH --output=%j.stdout
#SBATCH --error=%j.stderr
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=12
#SBATCH --time=01:00:00
####### check modules to see which version of MPI is available
####### and use appropriate module if needed
module load intel-mpi/2018.3
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun ./a1

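For anyone hitting the same issue: the key changes are launching with srun instead of mpiexec and pointing Intel MPI at Slurm's PMI library via I_MPI_PMI_LIBRARY, so that srun can wire up the ranks it starts. A typical submit-and-check sequence for a script like this would be the following (standard Slurm commands; the file name slurm.sh is assumed):

sbatch slurm.sh        # submit the batch script; prints the job id
squeue -u $USER        # watch the job while it is pending/running
srun --mpi=list        # list the MPI/PMI plugin types this Slurm build supports
cat <jobid>.stdout     # job output, named via #SBATCH --output=%j.stdout (replace <jobid> with the id printed by sbatch)
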
I am so silly, which is why I use Conan as my nickname... I hope I can become smarter.