Are inline functions passed as arguments really executed inline in C/C++?

Keywords: C++, inline execution, argument passing, functions. Last updated: 2023-10-16

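I have a very long loop (in terms of iteration count), and I would like to make it possible to customize some parts of it. The code looks like this: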

void expensive_loop(void (*do_true)(int), void (*do_false)(int)) {
    for (int i = 0; i < VeryLargeN; i++) {
        int element = elements[i];
        // long computation that produces a boolean condition
        if (condition) {
            do_true(element);
        } else {
            do_false(element);
        }
    }
}

Now, the problem is that every call to do_true and do_false carries overhead from pushing and popping the stack, which ruins the performance of the code.

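To solve this, I could simply create several copies of the expensive_loop function, each with its own do_true and do_false implementation. But that would make the code unmaintainable.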

So, how can I make the inner part of the iteration customizable while still keeping high performance?

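Note that the function accepts pointers to functions, so those functions are called through pointers. The optimizer may still inline those calls through the function pointers, provided the definitions of expensive_loop and of those functions are available and the compiler's inlining limits have not been exceeded.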
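As a minimal sketch of that situation, assume the loop and both callbacks live in the same translation unit. The names add_true, add_false, kN and run_pointer_demo, the sample data, and the even/odd test standing in for the long computation are all hypothetical, made up for illustration. With optimization enabled a compiler may (but is not required to) propagate the two constant function pointers into the loop and inline the calls; looking at the generated assembly is the only way to be sure.

```cpp
#include <cassert>

// Hypothetical data and callbacks, for illustration only.
static const int kN = 8;
static int elements[kN] = {1, 2, 3, 4, 5, 6, 7, 8};
static int true_sum = 0;
static int false_sum = 0;

static void add_true(int element) { true_sum += element; }
static void add_false(int element) { false_sum += element; }

// Same shape as the loop in the question: the callbacks arrive as function pointers.
static void expensive_loop(void (*do_true)(int), void (*do_false)(int)) {
    for (int i = 0; i < kN; i++) {
        int element = elements[i];
        bool condition = (element % 2 == 0);  // stand-in for the long computation
        if (condition) {
            do_true(element);
        } else {
            do_false(element);
        }
    }
}

// Both callback bodies are visible at this call site, so the optimizer
// has everything it needs to consider inlining them.
int run_pointer_demo() {
    true_sum = 0;
    false_sum = 0;
    expensive_loop(add_true, add_false);
    assert(true_sum + false_sum == 36);  // every element went to exactly one callback
    return true_sum;
}
```

Whether the calls end up inlined depends on the compiler, the optimization level, and the size of the functions involved, so this shows the setup rather than a guarantee.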

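Another option is to make this algorithm a function template that accepts callable objects (function pointers, objects with a call operator, lambdas), just like the standard algorithms do. This way the compiler may have more optimization opportunities. For example:

template<class DoTrue, class DoFalse>
void expensive_loop(DoTrue do_true, DoFalse do_false) {
    // Original function body here.
}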
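To illustrate how such a template might be called, here is a small sketch that passes two lambdas. It assumes the loop takes its data as a std::vector<int> parameter and uses `element > 0` as a stand-in for the long computation; count_positive is a hypothetical caller, none of which comes from the original code. Because each lambda has its own unique type, the compiler knows at the point of instantiation exactly which body is being called.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Template version of the loop; the callables' types are known at compile time.
template <class DoTrue, class DoFalse>
void expensive_loop(const std::vector<int>& elements, DoTrue do_true, DoFalse do_false) {
    for (std::size_t i = 0; i < elements.size(); i++) {
        int element = elements[i];
        bool condition = (element > 0);  // stand-in for the long computation
        if (condition) {
            do_true(element);
        } else {
            do_false(element);
        }
    }
}

// Hypothetical caller: counts the positive elements using two lambdas.
int count_positive(const std::vector<int>& elements) {
    int positive = 0;
    int other = 0;
    expensive_loop(elements,
                   [&positive](int) { ++positive; },
                   [&other](int) { ++other; });
    assert(positive + other == static_cast<int>(elements.size()));
    return positive;
}
```

The same template also accepts plain function pointers, in which case the compiler sees only the pointer type and the situation is the same as with the non-template version.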

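g++ has the -Winline compiler switch:

-Winline
Warn if a function that is declared as inline cannot be inlined. Even with this option, the compiler does not warn about failures to inline functions declared in system headers.
The compiler uses a variety of heuristics to determine whether or not to inline a function. For example, the compiler takes into account the size of the function being inlined and the amount of inlining that has already been done in the current function. Therefore, seemingly insignificant changes in the source program can cause the warnings produced by -Winline to appear or disappear.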

It probably does not warn about a function not being inlined when that function is called through a pointer.

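The problem is that the function addresses (what actually gets stored in do_true and do_false) are not resolved until link time, when there are not many optimization opportunities left.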

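If you are explicitly setting both functions in the code (i.e., the functions themselves do not come from an external library, etc.), you can declare your function with C++ templates, so that the compiler knows exactly which functions you want to call at that point.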

struct function_one {
    void operator()( int element ) {
    }
};
extern int elements[];
extern bool condition();
template < typename DoTrue, typename DoFalse >
void expensive_loop(){
    DoTrue do_true;
    DoFalse do_false;
    for(int i=0; i<50; i++){
        int element=elements[i];
        // long computation that produces a boolean condition
        if (condition()){
            do_true(element); // call DoTrue's operator()
        }else{
            do_false(element); // call DoFalse's operator()
        }
    }
}
int main( int argc, char* argv[] ) {
    expensive_loop<function_one,function_one>();
    return 0;
}

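The compiler will instantiate one expensive_loop function for each combination of DoTrue and DoFalse types you specify. This increases the size of the executable if you use more than one combination, but each of them should do what you expect.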
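A small sketch of what "one instantiation per combination" means in practice follows. The functors count_calls and sum_elements, the sample data, and the odd/even test are hypothetical and only for illustration; they record their effects in static members so the result can be checked after the loop returns.

```cpp
#include <cassert>

// Hypothetical functors, for illustration only.
struct count_calls {
    static int calls;
    void operator()(int) { ++calls; }
};
int count_calls::calls = 0;

struct sum_elements {
    static int total;
    void operator()(int element) { total += element; }
};
int sum_elements::total = 0;

static const int kN = 5;
static int elements[kN] = {1, 2, 3, 4, 5};

template <typename DoTrue, typename DoFalse>
void expensive_loop() {
    DoTrue do_true;
    DoFalse do_false;
    for (int i = 0; i < kN; i++) {
        int element = elements[i];
        bool condition = (element % 2 == 1);  // stand-in for the long computation
        if (condition) {
            do_true(element);
        } else {
            do_false(element);
        }
    }
}

// Two different type combinations produce two separate instantiations.
void run_instantiation_demo() {
    count_calls::calls = 0;
    sum_elements::total = 0;
    expensive_loop<count_calls, sum_elements>();  // odd -> count, even -> sum
    expensive_loop<sum_elements, count_calls>();  // odd -> sum, even -> count
    assert(count_calls::calls == kN);             // each element counted exactly once
}
```

Each instantiation is compiled as its own function with the callee types fixed, which is what gives the optimizer the chance to inline the operator() bodies, at the cost of one copy of the loop per combination.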

For the example I showed, note that the function body is empty. The compiler simply strips out the function calls and leaves the loop:

main:
push    rbx
mov     ebx, 50
.L2:
call    condition()
sub     ebx, 1
jne     .L2
xor     eax, eax
pop     rbx
ret

See the example at https://godbolt.org/g/hV52Nn

As the following example shows, using function pointers will probably not get the function calls inlined. Here main and expensive_loop are defined in one file:

// File A.cpp
void foo( int arg );
void bar( int arg );
extern bool condition();
extern int elements[];
void expensive_loop( void (*do_true)(int),  void (*do_false)(int)){
    for(int i=0; i<50; i++){
        int element=elements[i];
        // long computation that produces a boolean condition
        if (condition()){
            do_true(element);
        }else{
            do_false(element);
        }
    }
}
int main( int argc, char* argv[] ) {
    expensive_loop( foo, bar );
    return 0;
}

while the functions passed as arguments are defined in another:

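// File B.cpp
#include <math.h>
int elements[50];
bool condition() {
    return elements[0] == 1;
}
// Note: as posted, these are declared `inline int`, while A.cpp declares
// `void foo(int)` and `void bar(int)`. For a real build the declarations must
// agree, and an inline function must be defined in every translation unit
// that uses it.
inline int foo( int arg ) {
    return arg%3;
}
inline int bar( int arg ) {
    return 1234%arg;
}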

Because the two files are different translation units, this is the assembly produced for expensive_loop:

0000000000400620 <expensive_loop(void (*)(int), void (*)(int))>:
400620:       41 55                   push   %r13
400622:       49 89 fd                mov    %rdi,%r13
400625:       41 54                   push   %r12
400627:       49 89 f4                mov    %rsi,%r12
40062a:       55                      push   %rbp
40062b:       53                      push   %rbx
40062c:       bb 60 10 60 00          mov    $0x601060,%ebx
400631:       48 83 ec 08             sub    $0x8,%rsp
400635:       eb 19                   jmp    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
400637:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
40063e:       00 00
400640:       48 83 c3 04             add    $0x4,%rbx
400644:       41 ff d5                callq  *%r13
400647:       48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
40064e:       74 1d                   je     40066d <expensive_loop(void (*)(int), void (*)(int))+0x4d>
400650:       8b 2b                   mov    (%rbx),%ebp
400652:       e8 79 ff ff ff          callq  4005d0 <condition()>
400657:       84 c0                   test   %al,%al
400659:       89 ef                   mov    %ebp,%edi
40065b:       75 e3                   jne    400640 <expensive_loop(void (*)(int), void (*)(int))+0x20>
40065d:       48 83 c3 04             add    $0x4,%rbx
400661:       41 ff d4                callq  *%r12
400664:       48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
40066b:       75 e3                   jne    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
40066d:       48 83 c4 08             add    $0x8,%rsp
400671:       5b                      pop    %rbx
400672:       5d                      pop    %rbp
400673:       41 5c                   pop    %r12
400675:       41 5d                   pop    %r13
400677:       c3                      retq
400678:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
40067f:       00

You can see that the indirect calls are still performed even at optimization level -O3:

400644:       41 ff d5                callq  *%r13