How can I understand my Valgrind error message?

Updated: 2023-10-16

I am getting the following error messages from valgrind:

==1808== 0 bytes in 1 blocks are still reachable in loss record 1 of 1,734
==1808==    at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==1808==    by 0x4CC2BA9: hwloc_build_level_from_list (topology.c:1603)
==1808==    by 0x4CC2BA9: hwloc_connect_levels (topology.c:1774)
==1808==    by 0x4CC2F25: hwloc_discover (topology.c:2091)
==1808==    by 0x4CC2F25: opal_hwloc132_hwloc_topology_load (topology.c:2596)
==1808==    by 0x4C60957: orte_odls_base_open (odls_base_open.c:205)
==1808==    by 0x632FDB3: ???
==1808==    by 0x4C3B6B9: orte_init (orte_init.c:127)
==1808==    by 0x403E0E: orterun (orterun.c:693)
==1808==    by 0x4035E3: main (main.c:13)
==1808==
==1808== 0 bytes in 1 blocks are still reachable in loss record 2 of 1,734
==1808==    at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==1808==    by 0x4CC2BD5: hwloc_build_level_from_list (topology.c:1603)
==1808==    by 0x4CC2BD5: hwloc_connect_levels (topology.c:1775)
==1808==    by 0x4CC2F25: hwloc_discover (topology.c:2091)
==1808==    by 0x4CC2F25: opal_hwloc132_hwloc_topology_load (topology.c:2596)
==1808==    by 0x4C60957: orte_odls_base_open (odls_base_open.c:205)
==1808==    by 0x632FDB3: ???
==1808==    by 0x4C3B6B9: orte_init (orte_init.c:127)
==1808==    by 0x403E0E: orterun (orterun.c:693)
==1808==    by 0x4035E3: main (main.c:13)

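I cannot work out what kind of problem Valgrind is reporting. Would anybody care to explain?

I have checked every object I allocate with new. Each one is correctly deleted.

Besides the valgrind error messages, I get another error from MPI when the program ends: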

---------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1811 on node laki.pi.ingv.it exited on signal 11 (Segmentation fault).
----------------------------------------------------------------------

Here is the error message related to MPI_Init:

==31198== 0 bytes in 1 blocks are still reachable in loss record 1 of 368
==31198==    at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==31198==    by 0xC66DE49: hwloc_build_level_from_list (topology.c:1603)
==31198==    by 0xC66DE49: hwloc_connect_levels (topology.c:1774)
==31198==    by 0xC66E1C5: hwloc_discover (topology.c:2091)
==31198==    by 0xC66E1C5: opal_hwloc132_hwloc_topology_load (topology.c:2596)
==31198==    by 0xC62B473: opal_hwloc_unpack (hwloc_base_dt.c:83)
==31198==    by 0xC6270AB: opal_dss_unpack_buffer (dss_unpack.c:120)
==31198==    by 0xC62815F: opal_dss_unpack (dss_unpack.c:84)
==31198==    by 0xC5F2349: orte_util_nidmap_init (nidmap.c:146)
==31198==    by 0xED98608: ???
==31198==    by 0xC5DC0B9: orte_init (orte_init.c:127)
==31198==    by 0xC59DBAE: ompi_mpi_init (ompi_mpi_init.c:357)
==31198==    by 0xC5B443F: PMPI_Init (pinit.c:84)
==31198==    by 0x55FA53: main (solver_2d.hpp:22)

Line solver_2d.hpp:22 consists of exactly:

MPI_Init(&argc, &argv);

In addition, the error message related to MPI_Finalize() is:

==31198== 1 errors in context 1 of 58:
==31198== Syscall param write(buf) points to uninitialised byte(s)
==31198==    at 0x38EF00E6FD: ??? (in /lib64/libpthread-2.12.so)
==31198==    by 0x11F1F548: ???
==31198==    by 0x11F1E03F: ???
==31198==    by 0x11CD7FBA: ???
==31198==    by 0x11CE519A: ???
==31198==    by 0x11CE3C37: ???
==31198==    by 0x11CD90C1: ???
==31198==    by 0x11AC2E36: ???
==31198==    by 0xC59ECC4: ompi_mpi_finalize (ompi_mpi_finalize.c:285)
==31198==    by 0x562185: main (solver_2d.hpp:171)
==31198==  Address 0x1ffeffda24 is on thread 1's stack
==31198==  Uninitialised value was created by a stack allocation
==31198==    at 0x11CCE050: ???

==31197== Syscall param write(buf) points to uninitialised byte(s)
==31197==    at 0x38EF00E6FD: ??? (in /lib64/libpthread-2.12.so)
==31197==    by 0x11F1F548: ipath_cmd_write (in /usr/lib64/libinfinipath.so.4.0)
==31197==    by 0x11F1E03F: ipath_poll_type (in /usr/lib64/libinfinipath.so.4.0)
==31197==    by 0x11CD7FBA: psmi_context_interrupt_set (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197==    by 0x11CE519A: ips_ptl_rcvthread_fini (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197==    by 0x11CE3C37: ??? (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197==    by 0x11CD90C1: psm_ep_close (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197==    by 0x11AC2E36: ompi_mtl_psm_finalize (mtl_psm.c:200)
==31197==    by 0xC59ECC4: ompi_mpi_finalize (ompi_mpi_finalize.c:285)
==31197==    by 0x562185: main (solver_2d.hpp:171)
==31197==  Address 0x1ffeffda24 is on thread 1's stack
==31197==  in frame #2, created by ipath_poll_type (???:)
==31197==  Uninitialised value was created by a stack allocation
==31197==    at 0x11CCE050: ??? (in /usr/lib64/libpsm_infinipath.so.1.15)

where line solver_2d.hpp:171 corresponds to:

MPI_Finalize();

Finally, the error message corresponding to the MPI write, or more precisely to MPI_File_open, reads:

==31198== 48 bytes in 1 blocks are still reachable in loss record 104 of 368
==31198==    at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==31198==    by 0xC58C750: opal_obj_new (opal_object.h:469)
==31198==    by 0xC58C750: ompi_attr_set_c (attribute.c:761)
==31198==    by 0xC5AA0BE: PMPI_Attr_put (pattr_put.c:58)
==31198==    by 0x118501AB: ???
==31198==    by 0x11843159: ???
==31198==    by 0x1185657D: ???
==31198==    by 0xC5CEFB5: module_init (io_base_file_select.c:442)
==31198==    by 0xC5CEFB5: mca_io_base_file_select (io_base_file_select.c:214)
==31198==    by 0xC5977A5: ompi_file_open (file.c:128)
==31198==    by 0xC5C6557: PMPI_File_open (pfile_open.c:96)
==31198==    by 0x5638A1: p_fstream (p_fstream.hpp:86)

Line p_fstream.hpp:86 is:

MPI_File_open(MPI_COMM_WORLD, const_cast<char*>(fname.c_str()), flags, MPI_INFO_NULL, &mpi_file);

The valgrind messages report memory leaks in mpirun itself, which you probably do not care about.

I assume you ran

valgrind mpirun a.out

but what you really want is to look for incorrect memory accesses and leaks in the MPI application itself. In that case, you should run

mpirun valgrind a.out

Note that all output will be interleaved. Since you are using Open MPI, you can run

mpirun --tag-output valgrind a.out

so that each task's output is prefixed with its rank.
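
If the interleaved output is still hard to read, one option is to have valgrind write one log file per process. The following is only a sketch under assumptions: the rank count (NP), the binary name (APP), the optional suppression file (SUPP), and the helper name build_cmd are all illustrative and not taken from the post. It prints the command line rather than running it, so you can check it first.

```shell
#!/bin/sh
# Sketch only: NP, APP and SUPP are assumed values, not taken from the post.
NP="${NP:-2}"            # number of MPI ranks (assumption)
APP="${APP:-./a.out}"    # application binary (assumption)
SUPP="${SUPP:-}"         # optional valgrind suppression file (empty = none)

# Print the command line. mpirun starts valgrind, and valgrind starts the
# application, so the reports describe the application rather than mpirun.
build_cmd() {
    # valgrind expands %p to the process ID, giving one log file per rank.
    vg="valgrind --leak-check=full --log-file=valgrind.%p.log"
    if [ -n "$SUPP" ]; then
        vg="$vg --suppressions=$SUPP"
    fi
    echo "mpirun --tag-output -np $NP $vg $APP"
}

build_cmd    # only prints the command; to execute it: eval "$(build_cmd)"
```

Reports that originate inside the MPI library (such as the hwloc allocations under MPI_Init shown in the question) may still appear in the application's logs. As far as I know, Open MPI installs a suppression file named openmpi-valgrind.supp under share/openmpi in its installation prefix; passing its path via SUPP should hide most of them, but check where it lives on your system.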