Measuring NUMA (Non-Uniform Memory Access). No observable asymmetry. Why?
I've tried to measure the asymmetric memory-access effects of NUMA, and failed.

The experiment was performed on an Intel Xeon X5570 @ 2.93 GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times, reading and writing each byte in the array, and measure the elapsed time for the 50 iterations. Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time for 50 iterations of reading and writing every byte of array x.

Array x is large in order to minimize cache effects: we want to measure the speed when the CPU has to go all the way to RAM for loads and stores, not the speed when caches are helping.

There are two NUMA nodes in my server, so I would expect the cores with affinity to the node on which array x was allocated to have faster read/write speeds. I'm not seeing that.

Why?

Perhaps NUMA is only relevant on systems with more than 8-12 cores, as I've seen suggested elsewhere?
http://lse.sourceforge.net/numa/faq/numatest.cpp
#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for (size_t i = 0; i < bm.size; ++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

// Allocates N bytes on the local NUMA node, then times M read/write passes.
void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core
              << ": " << (t2 - t1) << std::endl;

    *x = y;
    return 0; // the return value is unused, but the declared void* return needs one
}

// Times M read/write passes over the memory allocated by thread1.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core
              << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i = 0; i <= numa_max_node(); ++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0); i < (size_t)numcpus; ++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);
    return 0;
}
Compiled with: g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

Output of ./numatest:
numa_available() 0 <-- NUMA is available on this system
numa node 0 10101010 12884901888 <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 Gb
numa node 1 01010101 12874584064 <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0
Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928
The 50 iterations of reading and writing over array x take about 1.7 seconds, no matter which core is doing the reading and writing.

Update:

The cache size on my CPUs is 8 MB, so maybe a 10 MB array x is not big enough to eliminate cache effects. I tried a 100 MB array x, and I've tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.

Update 2:

I've tried reading and writing array x with __sync_fetch_and_add(). Still nothing.
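For reference, the fence experiment looked roughly like this (a minimal sketch of the idea, not the exact code from the run above; the GCC builtin __sync_synchronize() is a full memory barrier):

```cpp
#include <cstddef>

// Sketch of the fenced read/write pass: read each byte, write it back,
// then issue a full memory barrier with the GCC builtin
// __sync_synchronize(). Returns the number of bytes touched so the loop
// has an observable result and can't be removed entirely.
inline std::size_t fenced_pass(char* x, std::size_t n)
{
    std::size_t touched = 0;
    for (std::size_t j = 0; j < n; ++j)
    {
        char c = x[j];
        x[j] = c;                // write the same value back, like the benchmark
        __sync_synchronize();    // full memory fence after every access
        ++touched;
    }
    return touched;
}
```

The fence serializes the loop's own stores, but as noted it still did not expose any node asymmetry.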
The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that. Also, you should have 16 threads due to HT. (unless you disabled it)

Another thing:

The socket 1366 Xeon machines are only mildly NUMA. So it's hard to see the difference. The NUMA effect is much more noticeable on the 4P Opterons.

On systems like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether or not the data is local. A better thing to measure is the latency. Try random-accessing a block of 1 GB instead of streaming it sequentially.
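A minimal sketch of that latency-oriented measurement: build a random pointer chain so that each load depends on the previous one, which defeats both the sequential prefetcher and memory-level parallelism. (Buffer size, seed, and step count here are arbitrary illustrative choices, not from the original test.)

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Build one random cycle over n slots and chase it: each index load depends
// on the previous one, so the time per step approximates memory latency
// rather than streaming bandwidth. Returns nanoseconds per dependent load.
inline double chase_ns_per_load(std::size_t n, std::size_t steps)
{
    std::vector<std::size_t> next(n), order(n);
    std::iota(order.begin(), order.end(), 0);
    std::mt19937_64 rng(42);                        // fixed seed for reproducibility
    std::shuffle(order.begin(), order.end(), rng);
    for (std::size_t i = 0; i + 1 < n; ++i)
        next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];                  // close the cycle

    auto t1 = std::chrono::steady_clock::now();
    std::size_t p = order[0];
    for (std::size_t s = 0; s < steps; ++s)
        p = next[p];                                // serialized, dependent loads
    auto t2 = std::chrono::steady_clock::now();

    volatile std::size_t sink = p;                  // keep the chase observable
    (void)sink;
    return std::chrono::duration<double, std::nano>(t2 - t1).count() / double(steps);
}
```

To apply this to the NUMA question, one would allocate the chain with numa_alloc_local / numa_alloc_onnode and run the chase from threads pinned to cores on each node, with n large enough that the chain does not fit in the last-level cache.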
One last thing:

Depending on how aggressively your compiler optimizes, your loop might get optimized out since it doesn't do anything:
c = ((char*)x)[j];
((char*)x)[j] = c;
Something like this will guarantee that it won't be eliminated by the compiler:
((char*)x)[j] += 1;
Ha! Mysticial is right! Somehow, hardware prefetching was optimizing my reads/writes.

If it were a cache optimization, then forcing a memory barrier would defeat the optimization:
c = __sync_fetch_and_add(((char*)x) + j, 1);
but that didn't make any difference. What did make a difference was multiplying my iterator index by a prime, 1009, to defeat the prefetching optimization:
*(((char*)x) + ((j * 1009) % N)) += 1;
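Why 1009 works: it is a prime that does not divide N, so j -> (j * 1009) % N is a bijection on [0, N). Every byte is still touched exactly once per pass, just in an order the stride prefetcher can't follow. A quick sanity check (my own helper, not part of the benchmark):

```cpp
#include <cstddef>
#include <vector>

// Returns true iff j -> (j * stride) % n visits every index in [0, n).
// This holds exactly when gcd(stride, n) == 1, which is why a prime
// stride like 1009 still covers the whole array.
inline bool stride_covers_all(std::size_t stride, std::size_t n)
{
    std::vector<bool> seen(n, false);
    for (std::size_t j = 0; j < n; ++j)
        seen[(j * stride) % n] = true;
    for (std::size_t j = 0; j < n; ++j)
        if (!seen[j]) return false;
    return true;
}
```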
With that change, the NUMA asymmetry is clearly revealed:
numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114
At least I think that's what is going on. Thanks, Mysticial!

EDIT: CONCLUSION ~133%

For those who are just skimming this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33x the latency of memory access to a local node.
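The ~1.33x figure falls straight out of the timings above: the even-numbered cores (node 0, where the array lives) average about 0.91 s, the odd-numbered cores (node 1) about 1.21 s. A trivial check of the arithmetic:

```cpp
// Remote-to-local timing ratio, with the seconds copied from the output
// above: cores 0,2,4,6 are on node 0 (local to the array), cores 1,3,5,7
// are on node 1 (remote).
inline double numa_penalty()
{
    const double local_s[]  = {0.942300, 0.909353, 0.898107, 0.898021};
    const double remote_s[] = {1.216286, 1.218935, 1.211413, 1.207114};
    double l = 0.0, r = 0.0;
    for (double v : local_s)  l += v;
    for (double v : remote_s) r += v;
    return r / l;   // ratio of the means equals the ratio of the sums
}
```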
Thanks for this benchmark code. I've taken your "fixed" version and changed it to pure C + OpenMP, and added a few tests for how the memory system behaves under contention. You can find the new code here.

Here are some sample results from a Quad Opteron:
num cpus: 32
numa available: 0
numa node 0 10001000100010000000000000000000 - 15.9904 GiB
numa node 1 00000000000000001000100010001000 - 16 GiB
numa node 2 00010001000100010000000000000000 - 16 GiB
numa node 3 00000000000000000001000100010001 - 16 GiB
numa node 4 00100010001000100000000000000000 - 16 GiB
numa node 5 00000000000000000010001000100010 - 16 GiB
numa node 6 01000100010001000000000000000000 - 16 GiB
numa node 7 00000000000000000100010001000100 - 16 GiB
sequential core 0 -> core 0 : BW 4189.87 MB/s
sequential core 1 -> core 0 : BW 2409.1 MB/s
sequential core 2 -> core 0 : BW 2495.61 MB/s
sequential core 3 -> core 0 : BW 2474.62 MB/s
sequential core 4 -> core 0 : BW 4244.45 MB/s
sequential core 5 -> core 0 : BW 2378.34 MB/s
sequential core 6 -> core 0 : BW 2442.93 MB/s
sequential core 7 -> core 0 : BW 2468.61 MB/s
sequential core 8 -> core 0 : BW 4220.48 MB/s
sequential core 9 -> core 0 : BW 2442.88 MB/s
sequential core 10 -> core 0 : BW 2388.11 MB/s
sequential core 11 -> core 0 : BW 2481.87 MB/s
sequential core 12 -> core 0 : BW 4273.42 MB/s
sequential core 13 -> core 0 : BW 2381.28 MB/s
sequential core 14 -> core 0 : BW 2449.87 MB/s
sequential core 15 -> core 0 : BW 2485.48 MB/s
sequential core 16 -> core 0 : BW 2938.08 MB/s
sequential core 17 -> core 0 : BW 2082.12 MB/s
sequential core 18 -> core 0 : BW 2041.84 MB/s
sequential core 19 -> core 0 : BW 2060.47 MB/s
sequential core 20 -> core 0 : BW 2944.13 MB/s
sequential core 21 -> core 0 : BW 2111.06 MB/s
sequential core 22 -> core 0 : BW 2063.37 MB/s
sequential core 23 -> core 0 : BW 2082.75 MB/s
sequential core 24 -> core 0 : BW 2958.05 MB/s
sequential core 25 -> core 0 : BW 2091.85 MB/s
sequential core 26 -> core 0 : BW 2098.73 MB/s
sequential core 27 -> core 0 : BW 2083.7 MB/s
sequential core 28 -> core 0 : BW 2934.43 MB/s
sequential core 29 -> core 0 : BW 2048.68 MB/s
sequential core 30 -> core 0 : BW 2087.6 MB/s
sequential core 31 -> core 0 : BW 2014.68 MB/s
all-contention core 0 -> core 0 : BW 1081.85 MB/s
all-contention core 1 -> core 0 : BW 299.177 MB/s
all-contention core 2 -> core 0 : BW 298.853 MB/s
all-contention core 3 -> core 0 : BW 263.735 MB/s
all-contention core 4 -> core 0 : BW 1081.93 MB/s
all-contention core 5 -> core 0 : BW 299.177 MB/s
all-contention core 6 -> core 0 : BW 299.63 MB/s
all-contention core 7 -> core 0 : BW 263.795 MB/s
all-contention core 8 -> core 0 : BW 1081.98 MB/s
all-contention core 9 -> core 0 : BW 299.177 MB/s
all-contention core 10 -> core 0 : BW 300.149 MB/s
all-contention core 11 -> core 0 : BW 262.905 MB/s
all-contention core 12 -> core 0 : BW 1081.89 MB/s
all-contention core 13 -> core 0 : BW 299.173 MB/s
all-contention core 14 -> core 0 : BW 299.025 MB/s
all-contention core 15 -> core 0 : BW 263.865 MB/s
all-contention core 16 -> core 0 : BW 432.156 MB/s
all-contention core 17 -> core 0 : BW 233.12 MB/s
all-contention core 18 -> core 0 : BW 232.889 MB/s
all-contention core 19 -> core 0 : BW 202.48 MB/s
all-contention core 20 -> core 0 : BW 434.299 MB/s
all-contention core 21 -> core 0 : BW 233.274 MB/s
all-contention core 22 -> core 0 : BW 233.144 MB/s
all-contention core 23 -> core 0 : BW 202.505 MB/s
all-contention core 24 -> core 0 : BW 434.295 MB/s
all-contention core 25 -> core 0 : BW 233.274 MB/s
all-contention core 26 -> core 0 : BW 233.169 MB/s
all-contention core 27 -> core 0 : BW 202.49 MB/s
all-contention core 28 -> core 0 : BW 434.295 MB/s
all-contention core 29 -> core 0 : BW 233.309 MB/s
all-contention core 30 -> core 0 : BW 233.169 MB/s
all-contention core 31 -> core 0 : BW 202.526 MB/s
two-contention core 0 -> core 0 : BW 3306.11 MB/s
two-contention core 1 -> core 0 : BW 2199.7 MB/s
two-contention core 0 -> core 0 : BW 3286.21 MB/s
two-contention core 2 -> core 0 : BW 2220.73 MB/s
two-contention core 0 -> core 0 : BW 3302.24 MB/s
two-contention core 3 -> core 0 : BW 2182.81 MB/s
two-contention core 0 -> core 0 : BW 3605.88 MB/s
two-contention core 4 -> core 0 : BW 3605.88 MB/s
two-contention core 0 -> core 0 : BW 3297.08 MB/s
two-contention core 5 -> core 0 : BW 2217.82 MB/s
two-contention core 0 -> core 0 : BW 3312.69 MB/s
two-contention core 6 -> core 0 : BW 2227.04 MB/s
two-contention core 0 -> core 0 : BW 3287.93 MB/s
two-contention core 7 -> core 0 : BW 2209.48 MB/s
two-contention core 0 -> core 0 : BW 3660.05 MB/s
two-contention core 8 -> core 0 : BW 3660.05 MB/s
two-contention core 0 -> core 0 : BW 3339.63 MB/s
two-contention core 9 -> core 0 : BW 2223.84 MB/s
two-contention core 0 -> core 0 : BW 3303.77 MB/s
two-contention core 10 -> core 0 : BW 2197.99 MB/s
two-contention core 0 -> core 0 : BW 3323.19 MB/s
two-contention core 11 -> core 0 : BW 2196.08 MB/s
two-contention core 0 -> core 0 : BW 3582.23 MB/s
two-contention core 12 -> core 0 : BW 3582.22 MB/s
two-contention core 0 -> core 0 : BW 3324.9 MB/s
two-contention core 13 -> core 0 : BW 2250.74 MB/s
two-contention core 0 -> core 0 : BW 3305.66 MB/s
two-contention core 14 -> core 0 : BW 2209.5 MB/s
two-contention core 0 -> core 0 : BW 3303.52 MB/s
two-contention core 15 -> core 0 : BW 2182.43 MB/s
two-contention core 0 -> core 0 : BW 3352.74 MB/s
two-contention core 16 -> core 0 : BW 2607.73 MB/s
two-contention core 0 -> core 0 : BW 3092.65 MB/s
two-contention core 17 -> core 0 : BW 1911.98 MB/s
two-contention core 0 -> core 0 : BW 3025.91 MB/s
two-contention core 18 -> core 0 : BW 1918.06 MB/s
two-contention core 0 -> core 0 : BW 3257.56 MB/s
two-contention core 19 -> core 0 : BW 1885.03 MB/s
two-contention core 0 -> core 0 : BW 3339.64 MB/s
two-contention core 20 -> core 0 : BW 2603.06 MB/s
two-contention core 0 -> core 0 : BW 3119.29 MB/s
two-contention core 21 -> core 0 : BW 1918.6 MB/s
two-contention core 0 -> core 0 : BW 3054.14 MB/s
two-contention core 22 -> core 0 : BW 1910.61 MB/s
two-contention core 0 -> core 0 : BW 3214.44 MB/s
two-contention core 23 -> core 0 : BW 1881.69 MB/s
two-contention core 0 -> core 0 : BW 3332.3 MB/s
two-contention core 24 -> core 0 : BW 2611.8 MB/s
two-contention core 0 -> core 0 : BW 3111.94 MB/s
two-contention core 25 -> core 0 : BW 1922.11 MB/s
two-contention core 0 -> core 0 : BW 3049.02 MB/s
two-contention core 26 -> core 0 : BW 1912.85 MB/s
two-contention core 0 -> core 0 : BW 3251.88 MB/s
two-contention core 27 -> core 0 : BW 1881.82 MB/s
two-contention core 0 -> core 0 : BW 3345.6 MB/s
two-contention core 28 -> core 0 : BW 2598.82 MB/s
two-contention core 0 -> core 0 : BW 3109.04 MB/s
two-contention core 29 -> core 0 : BW 1923.81 MB/s
two-contention core 0 -> core 0 : BW 3062.94 MB/s
two-contention core 30 -> core 0 : BW 1921.3 MB/s
two-contention core 0 -> core 0 : BW 3220.8 MB/s
two-contention core 31 -> core 0 : BW 1901.76 MB/s
If anyone has further improvements, I'd be happy to hear about them. For example, these are obviously not perfect bandwidth measurements in real-world units (likely off by a, hopefully constant, integer factor).

A couple of remarks:

- To view your system's NUMA structure (on Linux), you can get a graphical overview with the lstopo utility from the hwloc library. In particular, you will see which core numbers are members of which NUMA node (processor socket).
- char is probably not the ideal data type for measuring maximum RAM throughput. I suspect that with a 32-bit or 64-bit data type you can get more data through with the same number of CPU cycles.
- More generally, you should also check that your measurement is not limited by CPU speed but by RAM speed. The ramspeed utility, for example, unrolls the inner loop explicitly to some extent in the source code:

      for(i = 0; i < blk/sizeof(UTL); i += 32) {
          b[i] = a[i]; b[i+1] = a[i+1]; ... b[i+30] = a[i+30]; b[i+31] = a[i+31];
      }

  EDIT: On supported architectures ramsmp actually even uses "hand-written" assembly code for these loops.
- L1/L2/L3 cache effects: It is instructive to measure the bandwidth in GByte/s as a function of block size. You should see roughly four different speeds as you increase the block size, corresponding to where the data is being read from (cache levels or main memory). Your processor seems to have 8 MByte of Level 3 (?) cache, so your 10 million bytes might mostly stay in the L3 cache (which is shared among all cores of one processor).
- Memory channels: Your processor has 3 memory channels. If your memory banks are installed such that you can exploit them all (see e.g. the motherboard's manual), you may want to run more than one thread at the same time. I've seen the effect that when reading with one thread only, the asymptotic bandwidth is close to that of a single memory module (e.g. 12.8 GByte/s for DDR-1600), while when running multiple threads, the asymptotic bandwidth is close to the number of memory channels times the bandwidth of a single memory module.
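The simplest way to see that memory-channel effect is to run the same streaming copy on 1, 2, 3, ... threads and watch the aggregate bandwidth. A rough, portable sketch (plain std::thread and memcpy; thread count, buffer size, and repetition count are arbitrary, and core pinning / NUMA placement is omitted):

```cpp
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

// Aggregate copy bandwidth in MB/s with nthreads concurrent workers, each
// copying its own private buffer reps times. With enough threads (pinned
// appropriately), the total should approach the number of memory channels
// times the single-module bandwidth. Timing includes buffer allocation,
// so this is only a rough figure.
inline double copy_bw_mbps(std::size_t nthreads, std::size_t bytes_per_thread,
                           std::size_t reps)
{
    std::vector<std::thread> workers;
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t t = 0; t < nthreads; ++t)
        workers.emplace_back([bytes_per_thread, reps] {
            std::vector<char> src(bytes_per_thread, 1), dst(bytes_per_thread);
            for (std::size_t r = 0; r < reps; ++r)
                std::memcpy(dst.data(), src.data(), bytes_per_thread);
            if (dst[0] != 1) std::abort();   // keep the copies observable
        });
    for (std::thread& w : workers)
        w.join();
    auto t2 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t2 - t1).count();
    return double(nthreads) * double(bytes_per_thread) * double(reps) / (1e6 * secs);
}
```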
You can also use numactl to choose which node to run the process on and where to allocate memory:
numactl --cpubind=0 --membind=1 <process>
I use this combined with LMbench to get memory latency numbers:
numactl --cpubind=0 --membind=0 ./lat_mem_rd -t 512
numactl --cpubind=0 --membind=1 ./lat_mem_rd -t 512
In case anybody wants to try this test, here is the modified, working program. I would love to see results from other hardware. This works on my machine with Linux 2.6.34-12-desktop, GCC 4.5.0, Boost 1.47.

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

numatest.cpp:
#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for (size_t i = 0; i < bm.size; ++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

// Allocates N bytes on the local NUMA node, then times M strided
// read/write passes over them.
void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            *(((char*)y) + ((j * 1009) % N)) += 1;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core
              << ": " << (t2 - t1) << std::endl;

    *x = y;
    return 0; // the return value is unused, but the declared void* return needs one
}

// Times M strided read/write passes over the memory allocated by thread1.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            *(((char*)x) + ((j * 1009) % N)) += 1;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core
              << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i = 0; i <= numa_max_node(); ++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(5);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0); i < (size_t)numcpus; ++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);
    return 0;
}