pbs_server, E5-2620v4 and general protection

Keywords: protection E5-2620v4 server pbs      Updated: 2023-10-16

I am trying to install Torque 6.0.2 on Debian 8.5 on an Intel Xeon E5-2620v4. However, when I try to start pbs_server, I get a segfault. gdb:

#1  0x0000000000440ab6 in container::item_container<pbsnode*>::unlock (this=0xb5d900 <allnodes>) at ../../src/include/container.hpp:537
#2  0x00000000004b787f in mom_hierarchy_handler::nextNode (this=0x4e610c0 <hierarchy_handler>, iter=0x7fffffff98b8) at mom_hierarchy_handler.cpp:122
#3  0x00000000004b7a7d in mom_hierarchy_handler::make_default_hierarchy (this=0x4e610c0 <hierarchy_handler>) at mom_hierarchy_handler.cpp:149
#4  0x00000000004b898d in mom_hierarchy_handler::loadHierarchy (this=0x4e610c0 <hierarchy_handler>) at mom_hierarchy_handler.cpp:433
#5  0x00000000004b8ae8 in mom_hierarchy_handler::initialLoadHierarchy (this=0x4e610c0 <hierarchy_handler>) at mom_hierarchy_handler.cpp:472
#6  0x000000000045262a in pbsd_init (type=1) at pbsd_init.c:2299
#7  0x00000000004591ff in main (argc=2, argv=0x7fffffffdec8) at pbsd_main.c:1883

dmesg:

traps: pbs_server[22249] general protection ip:7f9c08a7a2c8 sp:7ffe520b5238 error:0 in libpthread-2.19.so[7f9c08a69000+18000]

valgrind:

==22381== Memcheck, a memory error detector
==22381== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==22381== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==22381== Command: pbs_server
==22381==
==22381==
==22381== HEAP SUMMARY:
==22381==     in use at exit: 18,051 bytes in 53 blocks
==22381==   total heap usage: 169 allocs, 116 frees, 42,410 bytes allocated
==22381==
==22382==
==22382== HEAP SUMMARY:
==22382==     in use at exit: 19,755 bytes in 56 blocks
==22382==   total heap usage: 172 allocs, 116 frees, 44,114 bytes allocated
==22382==
==22381== LEAK SUMMARY:
==22381==    definitely lost: 0 bytes in 0 blocks
==22381==    indirectly lost: 0 bytes in 0 blocks
==22381==      possibly lost: 0 bytes in 0 blocks
==22381==    still reachable: 18,051 bytes in 53 blocks
==22381==         suppressed: 0 bytes in 0 blocks
==22381== Rerun with --leak-check=full to see details of leaked memory
==22381==
==22381== For counts of detected and suppressed errors, rerun with: -v
==22381== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==22383==
==22383== Process terminating with default action of signal 11 (SIGSEGV)
==22383==  General Protection Fault
==22383==    at 0x72192CB: __lll_unlock_elision (elision-unlock.c:33)
==22383==    by 0x4E7E1A: unlock_node(pbsnode*, char const*, char const*, int) (u_lock_ctl.c:268)
==22383==    by 0x4B7A66: mom_hierarchy_handler::make_default_hierarchy() (mom_hierarchy_handler.cpp:164)
==22383==    by 0x4B898C: mom_hierarchy_handler::loadHierarchy() (mom_hierarchy_handler.cpp:433)
==22383==    by 0x4B8AE7: mom_hierarchy_handler::initialLoadHierarchy() (mom_hierarchy_handler.cpp:472)
==22383==    by 0x452629: pbsd_init(int) (pbsd_init.c:2299)
==22383==    by 0x4591FE: main (pbsd_main.c:1883)
==22382== LEAK SUMMARY:
==22382==    definitely lost: 0 bytes in 0 blocks
==22382==    indirectly lost: 0 bytes in 0 blocks
==22382==      possibly lost: 0 bytes in 0 blocks
==22382==    still reachable: 19,755 bytes in 56 blocks
==22382==         suppressed: 0 bytes in 0 blocks
==22382== Rerun with --leak-check=full to see details of leaked memory
==22382==
==22382== For counts of detected and suppressed errors, rerun with: -v
==22382== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==22383==
==22383== HEAP SUMMARY:
==22383==     in use at exit: 325,348 bytes in 186 blocks
==22383==   total heap usage: 297 allocs, 111 frees, 442,971 bytes allocated
==22383==
==22383== LEAK SUMMARY:
==22383==    definitely lost: 134 bytes in 6 blocks
==22383==    indirectly lost: 28 bytes in 3 blocks
==22383==      possibly lost: 524 bytes in 17 blocks
==22383==    still reachable: 324,662 bytes in 160 blocks
==22383==         suppressed: 0 bytes in 0 blocks
==22383== Rerun with --leak-check=full to see details of leaked memory
==22383==
==22383== For counts of detected and suppressed errors, rerun with: -v
==22383== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

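No other software shows this behavior, and I ran the machine at full load for 2 days without any problems. I have already tried updating the processor microcode. Has anyone seen this behavior with Torque 6.0.2, or in other situations?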

Regards.

This is not a microcode bug. It is a straightforward lock-balance problem in whatever software you are running (NOT in glibc/libpthread).

Do not try to unlock a lock that is already unlocked. That is forbidden behavior, and it is the cause of the trap.

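For performance reasons, glibc does not bother to test for this and segfault, so a lot of broken code went unpunished for a long time. Hardware implementations of lock elision, OTOH, do raise a trap (Intel TSX, IBM Power 8, S390/X…), so this kind of breakage will soon become apparent everywhere.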
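If you want to see an unbalanced unlock reported instead of trapping, here is a minimal sketch using a POSIX error-checking mutex. This is only an illustration, not something Torque does: the helper name `unbalanced_unlock_result` is made up for this example. With `PTHREAD_MUTEX_ERRORCHECK`, POSIX requires `pthread_mutex_unlock` to return an error when the mutex is not locked by the caller, whereas with the default mutex type the same call is undefined behavior.

```c
#define _XOPEN_SOURCE 700   /* expose PTHREAD_MUTEX_ERRORCHECK under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper (not part of Torque or glibc): unlocks an
 * error-checking mutex twice and returns the result of the second,
 * unbalanced unlock. */
int unbalanced_unlock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&mutex, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&mutex);
    assert(rc == 0);
    rc = pthread_mutex_unlock(&mutex);  /* balanced unlock: succeeds */
    assert(rc == 0);

    /* Unbalanced unlock: the mutex is no longer held. An error-checking
     * mutex reports this through the return value. */
    rc = pthread_mutex_unlock(&mutex);

    pthread_mutex_destroy(&mutex);
    return rc;
}
```

Build with `-pthread` and call the helper from your own `main`; the second unlock should come back as `EPERM`. Checking the return value of every `pthread_mutex_unlock` in the same way is one possible route to finding which code path unlocks a node twice.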