Measuring NUMA (Non-Uniform Memory Access). No observable asymmetry. Why?


I tried to measure the asymmetric memory-access effects of NUMA, and failed.

The experiment

Performed on an Intel Xeon X5570 @ 2.93 GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. I then iterate over array x 50 times, reading and writing each byte in the array, and measure the elapsed time for those 50 iterations.

Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time for 50 iterations of reading and writing every byte in array x.

Array x is made large in order to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when caches are helping.

There are two NUMA nodes in my server, so I would expect the cores with affinity to the node on which array x is allocated to have faster read/write speeds. I'm not seeing that.

Why?

Maybe NUMA is only relevant on systems with > 8-12 cores, as I've seen suggested elsewhere?

http://lse.sourceforge.net/numa/faq/

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>
// Pin the calling thread to the given CPU core.
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}
std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}
// Allocates N bytes on the calling core's local NUMA node, then times M
// read/write passes over the buffer; the buffer is handed back through *x.
void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);
    void* y = numa_alloc_local(N);
    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();
    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }
    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();
    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;
    *x = y;
    return 0;
}
// Times M read/write passes, from the given core, over the buffer x that
// thread1 allocated on core 0's node.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);
    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();
    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }
    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();
    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}
int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();   // allocate on the calling thread's local node by default
    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);
    void* x;
    size_t N(10000000);
    size_t M(50);
    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();
    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }
    numa_free(x, N);
    return 0;
}

Output
g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp
./numatest
numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 Gb
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0
Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928

It takes about 1.7 seconds to do 50 iterations of reading and writing over array x, no matter which core is doing the reading and writing.

Update:

The cache size on my CPUs is 8 MB, so maybe a 10 MB array x is not big enough to eliminate cache effects. I tried a 100 MB array x, and I've also tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.
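For reference, the fenced variant is essentially the original inner loop with a full barrier after each byte access, roughly:

    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
            __sync_synchronize();   // full compiler + hardware memory fence (GCC builtin)
        }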

Update 2:

I've tried reading and writing to array x with __sync_fetch_and_add(). Still nothing.

The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that. Also, you should have 16 threads because of HT. (unless you disabled it)
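A quick way to double-check that mapping, independent of the bitmask printing in the benchmark, is to ask libnuma directly (or just run numactl --hardware). A small sketch, assuming a libnuma version that provides numa_node_of_cpu:

#include <numa.h>
#include <cstdio>

// Print which NUMA node each logical CPU belongs to.
int main()
{
    if (numa_available() < 0)
        return 1;                               // no NUMA support in this kernel
    int ncpus = numa_num_configured_cpus();     // all CPUs known to the kernel
    for (int cpu = 0; cpu < ncpus; ++cpu)
        std::printf("cpu %d -> numa node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}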

Another thing:

The socket 1366 Xeon machines are only mildly NUMA, so it's hard to see the difference. The NUMA effect is much more noticeable on the 4P Opterons.

On a system like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether the data is local or not. A better thing to measure is the latency: try randomly accessing a 1 GB block instead of streaming it sequentially.
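A minimal sketch (my own illustration, C++11, not code from the original answer) of such a latency-oriented measurement: a dependent pointer chase through roughly 1 GB in shuffled order, so that each load has to wait for the previous one and neither prefetching nor overlapping can hide the memory latency. Pin the thread to a core from each node in turn, as in the benchmark above, to compare local against remote access.

#include <numa.h>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Dependent pointer chase through ~1 GiB in pseudo-random order. Each load
// depends on the previous one, so elapsed time is dominated by memory
// latency rather than bandwidth. Pin the thread (as with pin_to_core above)
// before running this to pick which core/node does the chasing.
int main()
{
    const size_t N = (1ull << 30) / sizeof(size_t);   // number of size_t slots in ~1 GiB
    size_t* a = static_cast<size_t*>(numa_alloc_local(N * sizeof(size_t)));
    if (!a) return 1;

    std::vector<size_t> order(N);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937_64(42));
    for (size_t i = 0; i < N; ++i)                    // link all slots into one random cycle
        a[order[i]] = order[(i + 1) % N];

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();
    size_t idx = order[0];
    for (size_t i = 0; i < N; ++i)
        idx = a[idx];
    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "chased " << N << " dependent loads in " << (t2 - t1)
              << " (sink " << idx << ")" << std::endl;
    numa_free(a, N * sizeof(size_t));
    return 0;
}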

Last thing:

Depending on how aggressively your compiler optimizes, your loop might get optimized out, since it doesn't do anything:

c = ((char*)x)[j];
((char*)x)[j] = c;

Something like this will guarantee that it isn't eliminated by the compiler:

((char*)x)[j] += 1;
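If you want to keep the pure read/write pattern instead of switching to += 1, another common trick is to force the compiler to treat the value as used, for example with an empty inline-asm statement (a sketch of the idea; note this only stops the compiler, it does nothing about hardware prefetching):

    for (size_t j(0); j < N; ++j)
    {
        char c = ((char*)x)[j];
        asm volatile("" : : "r"(c) : "memory");   // c counts as used and memory as observed, so the loop can't be removed
        ((char*)x)[j] = c;
    }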

Ah ha! Mysticial is right! Somehow, hardware prefetching is optimizing my reads/writes.

If it were a cache optimization, then forcing a memory barrier would defeat the optimization:

c = __sync_fetch_and_add(((char*)x) + j, 1);

But that doesn't make any difference. What does make a difference is multiplying my iterator index by a prime, 1009, to defeat the prefetch optimization:

*(((char*)x) + ((j * 1009) % N)) += 1;

(Since 1009 is prime and does not divide N, the index (j * 1009) % N still touches every byte exactly once per pass, just in an order the hardware prefetcher cannot follow.)

With that change, the NUMA asymmetry is clearly revealed:

numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114

At least I think that's what's going on.

Thanks Mysticial!

EDIT: CONCLUSION ~133%

For anyone who just glances at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node.

Thanks for this benchmark code. I've taken your "fixed" version and changed it to pure C + OpenMP, and added a few tests for how the memory system behaves under contention. You can find the new code here.
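Since the linked code is not reproduced here, a rough sketch of how the "all-contention" case might be structured (my own guess at its shape, written in C++/OpenMP to match the rest of this post rather than the answer's pure C): every thread pins itself to its own core, all of them simultaneously sweep buffers that live on core 0's node, and each reports its own bandwidth.

#include <numa.h>
#include <omp.h>
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Compile with e.g.: g++ -O2 -fopenmp contention.cpp -lnuma -lpthread
// (the file name is just a placeholder for this sketch)

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main()
{
    const size_t N = 64ull << 20;                  // 64 MiB per thread
    const int ncpus = numa_num_task_cpus();

    #pragma omp parallel num_threads(ncpus)
    {
        const int core = omp_get_thread_num();
        pin_to_core(core);
        // Every thread's buffer lives on the node that owns core 0.
        char* buf = static_cast<char*>(numa_alloc_onnode(N, numa_node_of_cpu(0)));

        #pragma omp barrier                        // start all threads together
        double t1 = omp_get_wtime();
        for (size_t j = 0; j < N; ++j)
            buf[j] += 1;
        double t2 = omp_get_wtime();

        #pragma omp critical
        std::printf("all-contention core %d -> core 0 : BW %.1f MB/s\n",
                    core, N / (t2 - t1) / 1e6);    // counts N bytes per pass; read+write traffic is ~2x
        numa_free(buf, N);
    }
    return 0;
}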

Here are some example results from a quad Opteron:

num cpus: 32
numa available: 0
numa node 0 10001000100010000000000000000000 - 15.9904 GiB
numa node 1 00000000000000001000100010001000 - 16 GiB
numa node 2 00010001000100010000000000000000 - 16 GiB
numa node 3 00000000000000000001000100010001 - 16 GiB
numa node 4 00100010001000100000000000000000 - 16 GiB
numa node 5 00000000000000000010001000100010 - 16 GiB
numa node 6 01000100010001000000000000000000 - 16 GiB
numa node 7 00000000000000000100010001000100 - 16 GiB
sequential core 0 -> core 0 : BW 4189.87 MB/s
sequential core 1 -> core 0 : BW 2409.1 MB/s
sequential core 2 -> core 0 : BW 2495.61 MB/s
sequential core 3 -> core 0 : BW 2474.62 MB/s
sequential core 4 -> core 0 : BW 4244.45 MB/s
sequential core 5 -> core 0 : BW 2378.34 MB/s
sequential core 6 -> core 0 : BW 2442.93 MB/s
sequential core 7 -> core 0 : BW 2468.61 MB/s
sequential core 8 -> core 0 : BW 4220.48 MB/s
sequential core 9 -> core 0 : BW 2442.88 MB/s
sequential core 10 -> core 0 : BW 2388.11 MB/s
sequential core 11 -> core 0 : BW 2481.87 MB/s
sequential core 12 -> core 0 : BW 4273.42 MB/s
sequential core 13 -> core 0 : BW 2381.28 MB/s
sequential core 14 -> core 0 : BW 2449.87 MB/s
sequential core 15 -> core 0 : BW 2485.48 MB/s
sequential core 16 -> core 0 : BW 2938.08 MB/s
sequential core 17 -> core 0 : BW 2082.12 MB/s
sequential core 18 -> core 0 : BW 2041.84 MB/s
sequential core 19 -> core 0 : BW 2060.47 MB/s
sequential core 20 -> core 0 : BW 2944.13 MB/s
sequential core 21 -> core 0 : BW 2111.06 MB/s
sequential core 22 -> core 0 : BW 2063.37 MB/s
sequential core 23 -> core 0 : BW 2082.75 MB/s
sequential core 24 -> core 0 : BW 2958.05 MB/s
sequential core 25 -> core 0 : BW 2091.85 MB/s
sequential core 26 -> core 0 : BW 2098.73 MB/s
sequential core 27 -> core 0 : BW 2083.7 MB/s
sequential core 28 -> core 0 : BW 2934.43 MB/s
sequential core 29 -> core 0 : BW 2048.68 MB/s
sequential core 30 -> core 0 : BW 2087.6 MB/s
sequential core 31 -> core 0 : BW 2014.68 MB/s
all-contention core 0 -> core 0 : BW 1081.85 MB/s
all-contention core 1 -> core 0 : BW 299.177 MB/s
all-contention core 2 -> core 0 : BW 298.853 MB/s
all-contention core 3 -> core 0 : BW 263.735 MB/s
all-contention core 4 -> core 0 : BW 1081.93 MB/s
all-contention core 5 -> core 0 : BW 299.177 MB/s
all-contention core 6 -> core 0 : BW 299.63 MB/s
all-contention core 7 -> core 0 : BW 263.795 MB/s
all-contention core 8 -> core 0 : BW 1081.98 MB/s
all-contention core 9 -> core 0 : BW 299.177 MB/s
all-contention core 10 -> core 0 : BW 300.149 MB/s
all-contention core 11 -> core 0 : BW 262.905 MB/s
all-contention core 12 -> core 0 : BW 1081.89 MB/s
all-contention core 13 -> core 0 : BW 299.173 MB/s
all-contention core 14 -> core 0 : BW 299.025 MB/s
all-contention core 15 -> core 0 : BW 263.865 MB/s
all-contention core 16 -> core 0 : BW 432.156 MB/s
all-contention core 17 -> core 0 : BW 233.12 MB/s
all-contention core 18 -> core 0 : BW 232.889 MB/s
all-contention core 19 -> core 0 : BW 202.48 MB/s
all-contention core 20 -> core 0 : BW 434.299 MB/s
all-contention core 21 -> core 0 : BW 233.274 MB/s
all-contention core 22 -> core 0 : BW 233.144 MB/s
all-contention core 23 -> core 0 : BW 202.505 MB/s
all-contention core 24 -> core 0 : BW 434.295 MB/s
all-contention core 25 -> core 0 : BW 233.274 MB/s
all-contention core 26 -> core 0 : BW 233.169 MB/s
all-contention core 27 -> core 0 : BW 202.49 MB/s
all-contention core 28 -> core 0 : BW 434.295 MB/s
all-contention core 29 -> core 0 : BW 233.309 MB/s
all-contention core 30 -> core 0 : BW 233.169 MB/s
all-contention core 31 -> core 0 : BW 202.526 MB/s
two-contention core 0 -> core 0 : BW 3306.11 MB/s
two-contention core 1 -> core 0 : BW 2199.7 MB/s
two-contention core 0 -> core 0 : BW 3286.21 MB/s
two-contention core 2 -> core 0 : BW 2220.73 MB/s
two-contention core 0 -> core 0 : BW 3302.24 MB/s
two-contention core 3 -> core 0 : BW 2182.81 MB/s
two-contention core 0 -> core 0 : BW 3605.88 MB/s
two-contention core 4 -> core 0 : BW 3605.88 MB/s
two-contention core 0 -> core 0 : BW 3297.08 MB/s
two-contention core 5 -> core 0 : BW 2217.82 MB/s
two-contention core 0 -> core 0 : BW 3312.69 MB/s
two-contention core 6 -> core 0 : BW 2227.04 MB/s
two-contention core 0 -> core 0 : BW 3287.93 MB/s
two-contention core 7 -> core 0 : BW 2209.48 MB/s
two-contention core 0 -> core 0 : BW 3660.05 MB/s
two-contention core 8 -> core 0 : BW 3660.05 MB/s
two-contention core 0 -> core 0 : BW 3339.63 MB/s
two-contention core 9 -> core 0 : BW 2223.84 MB/s
two-contention core 0 -> core 0 : BW 3303.77 MB/s
two-contention core 10 -> core 0 : BW 2197.99 MB/s
two-contention core 0 -> core 0 : BW 3323.19 MB/s
two-contention core 11 -> core 0 : BW 2196.08 MB/s
two-contention core 0 -> core 0 : BW 3582.23 MB/s
two-contention core 12 -> core 0 : BW 3582.22 MB/s
two-contention core 0 -> core 0 : BW 3324.9 MB/s
two-contention core 13 -> core 0 : BW 2250.74 MB/s
two-contention core 0 -> core 0 : BW 3305.66 MB/s
two-contention core 14 -> core 0 : BW 2209.5 MB/s
two-contention core 0 -> core 0 : BW 3303.52 MB/s
two-contention core 15 -> core 0 : BW 2182.43 MB/s
two-contention core 0 -> core 0 : BW 3352.74 MB/s
two-contention core 16 -> core 0 : BW 2607.73 MB/s
two-contention core 0 -> core 0 : BW 3092.65 MB/s
two-contention core 17 -> core 0 : BW 1911.98 MB/s
two-contention core 0 -> core 0 : BW 3025.91 MB/s
two-contention core 18 -> core 0 : BW 1918.06 MB/s
two-contention core 0 -> core 0 : BW 3257.56 MB/s
two-contention core 19 -> core 0 : BW 1885.03 MB/s
two-contention core 0 -> core 0 : BW 3339.64 MB/s
two-contention core 20 -> core 0 : BW 2603.06 MB/s
two-contention core 0 -> core 0 : BW 3119.29 MB/s
two-contention core 21 -> core 0 : BW 1918.6 MB/s
two-contention core 0 -> core 0 : BW 3054.14 MB/s
two-contention core 22 -> core 0 : BW 1910.61 MB/s
two-contention core 0 -> core 0 : BW 3214.44 MB/s
two-contention core 23 -> core 0 : BW 1881.69 MB/s
two-contention core 0 -> core 0 : BW 3332.3 MB/s
two-contention core 24 -> core 0 : BW 2611.8 MB/s
two-contention core 0 -> core 0 : BW 3111.94 MB/s
two-contention core 25 -> core 0 : BW 1922.11 MB/s
two-contention core 0 -> core 0 : BW 3049.02 MB/s
two-contention core 26 -> core 0 : BW 1912.85 MB/s
two-contention core 0 -> core 0 : BW 3251.88 MB/s
two-contention core 27 -> core 0 : BW 1881.82 MB/s
two-contention core 0 -> core 0 : BW 3345.6 MB/s
two-contention core 28 -> core 0 : BW 2598.82 MB/s
two-contention core 0 -> core 0 : BW 3109.04 MB/s
two-contention core 29 -> core 0 : BW 1923.81 MB/s
two-contention core 0 -> core 0 : BW 3062.94 MB/s
two-contention core 30 -> core 0 : BW 1921.3 MB/s
two-contention core 0 -> core 0 : BW 3220.8 MB/s
two-contention core 31 -> core 0 : BW 1901.76 MB/s

If anyone has further improvements, I'd be glad to hear about them. For example, these are obviously not perfect bandwidth measurements in real-world units (they are probably off by an integer factor, hopefully a constant one).

A couple of notes:

  • To view your system's NUMA structure (on Linux), you can get a graphical overview with the lstopo utility from the hwloc library. In particular, you will see which core numbers belong to which NUMA node (processor socket).
  • char is probably not the ideal data type for measuring maximum RAM throughput. My suspicion is that with a 32-bit or 64-bit data type you can push more data through with the same number of CPU cycles (see the sketch after these notes).
  • More generally, you should also check that your measurement is not limited by CPU speed but by RAM speed. The ramspeed utility, for example, unrolls the inner loop somewhat explicitly in the source code:

    for(i = 0; i < blk/sizeof(UTL); i += 32) {
        b[i] = a[i];        b[i+1] = a[i+1];
        ...
        b[i+30] = a[i+30];  b[i+31] = a[i+31];
    }
    

    EDIT: on supported architectures, ramsmp actually even uses "hand-written" assembly code for these loops.

  • L1/L2/L3 cache effects: It is instructive to measure bandwidth in GByte/s as a function of block size. You should see roughly four different speeds as the block size grows, depending on where the data is being read from (caches or main memory). Your processor seems to have an 8 MByte Level 3 (?) cache, so your 10 million bytes may mostly stay in the L3 cache (which is shared among all cores of one processor).

  • Memory channels: Your processor has 3 memory channels. If your memory banks are installed such that you can exploit them all (see e.g. the motherboard's manual), you may want to run more than one thread at the same time. I saw effects where, when reading with one thread only, the asymptotic bandwidth is close to that of a single memory module (e.g. 12.8 GByte/s for DDR-1600), while when running multiple threads the asymptotic bandwidth is close to the number of memory channels times the bandwidth of a single memory module.
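On the second and third points, a minimal sketch of what the inner loop might look like with a 64-bit element type and a little manual unrolling, in the spirit of the ramspeed loop quoted above (my illustration, not code from ramspeed or from the benchmark):

#include <cstdint>
#include <cstddef>

// One read/write pass over a buffer of n 64-bit words, four words per
// iteration, instead of byte-by-byte accesses through a char*.
// (Ignores any tail elements if n is not a multiple of 4.)
void rw_pass_u64(std::uint64_t* p, std::size_t n)
{
    for (std::size_t i = 0; i + 4 <= n; i += 4)
    {
        p[i]     += 1;
        p[i + 1] += 1;
        p[i + 2] += 1;
        p[i + 3] += 1;
    }
}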

You can also use numactl to choose which node to run the process on and where to allocate memory from:

numactl --cpubind=0 --membind=1 <process>

I'm using this combined with LMbench to get memory latency numbers:

numactl --cpubind=0 --membind=0  ./lat_mem_rd -t 512
numactl --cpubind=0 --membind=1  ./lat_mem_rd -t 512

In case anyone else wants to try this test, here is the modified, working program. I'd love to see results from other hardware. This works on my machine with Linux 2.6.34-12-desktop, GCC 4.5.0, Boost 1.47.

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}
std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}
void* thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);
    void* y = numa_alloc_local(N);
    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();
    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            *(((char*)y) + ((j * 1009) % N)) += 1;
        }
    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();
    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;
    *x = y;
    return 0;
}
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);
    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();
    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            *(((char*)x) + ((j * 1009) % N)) += 1;
        }
    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();
    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}
int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();
    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);
    void* x;
    size_t N(10000000);
    size_t M(5);
    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();
    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }
    numa_free(x, N);
    return 0;
}