Can I change this macro to an inline function without a performance hit?

Last updated: 2023-10-16

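I found a very fast integer square root function here, by Mark Crowne. At least with GCC on my machine, it is clearly the fastest integer square root function I have tested (including the functions in Hacker's Delight, the ones on this page, and floor(sqrt()) from the standard library).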

After cleaning up the formatting a bit, renaming a variable, and using fixed-width types, it looks like this:

static uint32_t mcrowne_isqrt(uint32_t val)
{
    uint32_t temp, root = 0;
    if (val >= 0x40000000)
    {
        root = 0x8000;
        val -= 0x40000000;
    }
    #define INNER_ISQRT(s)                              \
    do                                                  \
    {                                                   \
        temp = (root << (s)) + (1 << ((s) * 2 - 2));    \
        if (val >= temp)                                \
        {                                               \
            root += 1 << ((s)-1);                       \
            val -= temp;                                \
        }                                               \
    } while(0)
    INNER_ISQRT(15);
    INNER_ISQRT(14);
    INNER_ISQRT(13);
    INNER_ISQRT(12);
    INNER_ISQRT(11);
    INNER_ISQRT(10);
    INNER_ISQRT( 9);
    INNER_ISQRT( 8);
    INNER_ISQRT( 7);
    INNER_ISQRT( 6);
    INNER_ISQRT( 5);
    INNER_ISQRT( 4);
    INNER_ISQRT( 3);
    INNER_ISQRT( 2);
    #undef INNER_ISQRT
    temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

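The INNER_ISQRT macro isn't too evil, since it is local and is undefined immediately after it is no longer needed. Nevertheless, I'd still like to convert it to an inline function on principle. I've read assertions in several places (including the GCC documentation) that inline functions are "as fast as" macros, but I've had trouble converting it without a speed hit.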

My current iteration looks like this (note the always_inline attribute, which I threw in for good measure):

static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root) __attribute__((always_inline));
static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root)
{
    const uint32_t temp = (root << s) + (1 << ((s << 1) - 2));
    if(val >= temp)
    {
        root += 1 << (s - 1);
        val -= temp;
    }
}
//  Note that I just now changed the name to mcrowne_inline_isqrt, so people can compile my full test.
static uint32_t mcrowne_inline_isqrt(uint32_t val)
{
    uint32_t root = 0;
    if(val >= 0x40000000)
    {
        root = 0x8000; 
        val -= 0x40000000;
    }
    inner_isqrt(15, val, root);
    inner_isqrt(14, val, root);
    inner_isqrt(13, val, root);
    inner_isqrt(12, val, root);
    inner_isqrt(11, val, root);
    inner_isqrt(10, val, root);
    inner_isqrt(9, val, root);
    inner_isqrt(8, val, root);
    inner_isqrt(7, val, root);
    inner_isqrt(6, val, root);
    inner_isqrt(5, val, root);
    inner_isqrt(4, val, root);
    inner_isqrt(3, val, root);
    inner_isqrt(2, val, root);
    const uint32_t temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

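No matter what I do, the inline function is always slower than the macro. For (2^28 - 1) iterations with an -O2 build, the macro version usually times at around 2.92s, while the inline version usually times at around 3.25s. EDIT: I said 2^32 - 1 iterations before, but I forgot that I had changed it. Both take quite a bit longer over the full range.

It's possible the compiler is just being stupid and refusing to inline it (note again the always_inline attribute!), but if so, that would make the macro version preferable anyway. (I tried checking the assembly to see, but it was too complicated as part of a whole program. When I compiled just the functions, the optimizer of course omitted everything, and I had trouble compiling them as a library because of my inexperience with GCC.)

In short, is there a way to write this as an inline function without a speed hit? (I haven't profiled, but sqrt is one of those fundamental operations that should always be fast, since I may use it in many other programs besides the one I'm currently interested in. Besides, I'm just curious.)

I've even tried using templates to "bake in" the constant value, but I get the feeling the other two parameters are more likely to be causing the hit (the macro avoids that, since it uses the local variables directly)... well, either that or the compiler is stubbornly refusing to inline.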
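For illustration, a template variant of the helper might look like the sketch below. It is not the code that was actually timed: the names inner_isqrt_t, mcrowne_template_isqrt and check_template_isqrt are hypothetical, and it assumes a C++11 compiler (for static_assert). The shift amount becomes a template parameter, while val and root are still passed by reference, so it only removes the first of the three parameters.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical template variant: the shift S is a compile-time constant,
// so both shift amounts below are known when the template is instantiated.
template <unsigned S>
static inline void inner_isqrt_t(uint32_t& val, uint32_t& root)
{
    static_assert(S >= 2 && S <= 15, "shift must be in [2, 15]");
    const uint32_t temp = (root << S) + (UINT32_C(1) << (2 * S - 2));
    if (val >= temp)
    {
        root += UINT32_C(1) << (S - 1);
        val -= temp;
    }
}

// Hypothetical name; same structure as mcrowne_inline_isqrt.
static uint32_t mcrowne_template_isqrt(uint32_t val)
{
    uint32_t root = 0;
    if (val >= 0x40000000)
    {
        root = 0x8000;
        val -= 0x40000000;
    }
    inner_isqrt_t<15>(val, root);
    inner_isqrt_t<14>(val, root);
    inner_isqrt_t<13>(val, root);
    inner_isqrt_t<12>(val, root);
    inner_isqrt_t<11>(val, root);
    inner_isqrt_t<10>(val, root);
    inner_isqrt_t<9>(val, root);
    inner_isqrt_t<8>(val, root);
    inner_isqrt_t<7>(val, root);
    inner_isqrt_t<6>(val, root);
    inner_isqrt_t<5>(val, root);
    inner_isqrt_t<4>(val, root);
    inner_isqrt_t<3>(val, root);
    inner_isqrt_t<2>(val, root);
    if (val >= root + root + 1)
        root++;
    return root;
}

// Spot checks against square roots worked out by hand.
static void check_template_isqrt()
{
    assert(mcrowne_template_isqrt(0) == 0);
    assert(mcrowne_template_isqrt(1) == 1);
    assert(mcrowne_template_isqrt(15) == 3);
    assert(mcrowne_template_isqrt(16) == 4);
}
```

Whether this compiles to the same machine code as the macro depends on the compiler and flags; the sketch only shows the shape of the conversion.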

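UPDATE: user1034749 below gets the same assembly output from both functions when he puts them in separate files and compiles them. I tried his exact command line and got the same result. For all intents and purposes, this question is solved.

However, I would still like to know why my measurements come out differently. Obviously, my measurement code or my original build process made things different. I'll post the code below. Does anyone know what the deal is? Maybe my compiler is actually inlining the whole mcrowne_isqrt() function into the loop in main(), but not inlining the entirety of the other version?

UPDATE 2 (squeezed in before the test code): Note that if I swap the order of the tests so that the inline version runs first, the inline version comes out faster than the macro version by the same amount. Is this a caching issue, or is the compiler inlining one call but not the other, or what?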

#include <iostream>
#include <time.h>      //  Linux high-resolution timer
#include <stdint.h>
/*  Functions go here */
timespec timespecdiff(const timespec& start, const timespec& end)
{
    timespec elapsed;
    timespec endmod = end;
    if(endmod.tv_nsec < start.tv_nsec)
    {
        endmod.tv_sec -= 1;
        endmod.tv_nsec += 1000000000;
    }
    elapsed.tv_sec = endmod.tv_sec - start.tv_sec;
    elapsed.tv_nsec = endmod.tv_nsec - start.tv_nsec;
    return elapsed;
}

int main()
{
    uint64_t inputlimit = 4294967295;
    //  Test a wide range of values
    uint64_t widestep = 16;
    timespec start, end;
    //  Time macro version:
    uint32_t sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowntime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;

    //  Time inline version:
    sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_inline_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowninlinetime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's inline sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;
    //  Results:
    std::cout << "Mark Crowne sqrt variant time:\t" << markcrowntime.tv_sec << "s, " << markcrowntime.tv_nsec << "ns" << std::endl;
    std::cout << "Mark Crowne inline sqrt variant time:\t" << markcrowninlinetime.tv_sec << "s, " << markcrowninlinetime.tv_nsec << "ns" << std::endl;
    std::cout << std::endl;
}

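UPDATE 3: I still don't know how to reliably compare the timings of different functions without the result depending on the order of the tests. Any tips would be greatly appreciated!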
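One common way to reduce order dependence is sketched below: run an untimed warm-up pass of each candidate, then time several rounds, alternating which candidate goes first, and keep the best time for each. This is a sketch under assumptions, not a diagnosis of the effect above. The names time_one_pass, compare_interleaved and BestTimes are hypothetical, it needs C++11, and it swaps in std::chrono::steady_clock (wall-clock time) for the CLOCK_PROCESS_CPUTIME_ID timer used in the test program.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: runs fn over step-1, 2*step-1, ... up to limit and
// returns the elapsed time in nanoseconds. The sum is added to sink so the
// compiler cannot discard the loop.
template <typename Fn>
static int64_t time_one_pass(Fn fn, uint64_t limit, uint64_t step, uint32_t& sink)
{
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    uint32_t sum = 0;
    for (uint64_t num = step - 1; num <= limit; num += step)
        sum += fn(uint32_t(num));
    const std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();
    sink += sum;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Best (minimum) time observed for each of the two candidates.
struct BestTimes
{
    int64_t first_ns;
    int64_t second_ns;
};

template <typename FnA, typename FnB>
static BestTimes compare_interleaved(FnA a, FnB b, uint64_t limit, uint64_t step,
                                     int rounds, uint32_t& sink)
{
    assert(step > 0 && rounds > 0);

    // Untimed warm-up so neither candidate pays one-time start-up costs.
    time_one_pass(a, limit, step, sink);
    time_one_pass(b, limit, step, sink);

    BestTimes best = { std::numeric_limits<int64_t>::max(),
                       std::numeric_limits<int64_t>::max() };
    for (int r = 0; r < rounds; ++r)
    {
        if (r % 2 == 0)  // even rounds: a runs first
        {
            best.first_ns = std::min(best.first_ns, time_one_pass(a, limit, step, sink));
            best.second_ns = std::min(best.second_ns, time_one_pass(b, limit, step, sink));
        }
        else             // odd rounds: b runs first
        {
            best.second_ns = std::min(best.second_ns, time_one_pass(b, limit, step, sink));
            best.first_ns = std::min(best.first_ns, time_one_pass(a, limit, step, sink));
        }
    }
    return best;
}
```

With the functions from this post, a call might look like compare_interleaved(mcrowne_isqrt, mcrowne_inline_isqrt, 4294967295, 16, 5, sink), printing sink afterwards. Passing plain function pointers may change what the compiler inlines, so wrapping each call in a lambda is worth trying as well.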

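However, if anyone else reading this is interested in fast sqrt implementations, I should mention: Mark Crowne's code tests faster than any other pure C/C++ version I've tried (despite the reliability issues with my testing), but the following SSE code seems like it may be a little faster still for a scalar 32-bit integer sqrt. It can't be generalized to full 64-bit unsigned integer inputs without losing precision (and the first signed conversion would also have to be replaced by a load intrinsic to handle values >= 2^63):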

uint32_t sse_sqrt(uint64_t num)
{
    //  Uses 64-bit input, because SSE conversion functions treat all
    //  integers as signed (so conversion from a 32-bit value >= 2^31
    //  will be interpreted as negative).  As it stands, this function
    //  will similarly fail for values >= 2^63.
    //  It can also probably be made faster, since it generates a strange/
    //  useless movsd %xmm0,%xmm0 instruction before the sqrtsd.  It clears
    //  xmm0 first too with xorpd (seems unnecessary, but I could be wrong).
    __m128d result;
    __m128d num_as_sse_double = _mm_cvtsi64_sd(result, num);
    result = _mm_sqrt_sd(num_as_sse_double, num_as_sse_double);
    return _mm_cvttsd_si32(result);
}

I tried your code with gcc 4.5.3. I modified your second version of the code to match the first one, for example:

(1 << ((s) * 2 - 2)

vs.

(1 << ((s << 1) - 1)

Yes, s * 2 == s << 1, but "-2" and "-1"?
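To make the difference concrete, here is a small sketch comparing the two quoted terms for the shift values the function uses (2 through 15). The helper names term_minus_two, term_minus_one and check_terms are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Term as written in the macro version: 1 << (s * 2 - 2).
static uint32_t term_minus_two(uint32_t s)
{
    return UINT32_C(1) << (s * 2 - 2);
}

// Term as quoted above for the inline version: 1 << ((s << 1) - 1).
static uint32_t term_minus_one(uint32_t s)
{
    return UINT32_C(1) << ((s << 1) - 1);
}

// Checks every shift value used by the isqrt function.
static void check_terms()
{
    for (uint32_t s = 2; s <= 15; ++s)
    {
        // The multiplication and the shift agree...
        assert(s * 2 == (s << 1));
        // ...but subtracting 1 instead of 2 doubles the term.
        assert(term_minus_one(s) == 2 * term_minus_two(s));
    }
}
```

So the two versions only compute the same thing once both use "- 2".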

I also modified your types, replacing uint32_t with "unsigned long", because on my 64-bit machine "long" is not a 32-bit number.

Then I ran:

g++ -ggdb -O2 -march=native -c -pipe inline.cpp
g++ -ggdb -O2 -march=native -c -pipe macros.cpp
objdump -d inline.o > inline.s
objdump -d macros.o > macros.s

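I could have used "-S" instead of "-c" to get the assembler, but I wanted to see the assembler without any additional information.

And you know what?
The assembler output is exactly the same for the first and the second versions. So I think your time measurements are wrong.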