"static_cast<易失性空隙>"对优化器意味着什么？

What does `static_cast<volatile void>` mean for the optimizer?

Last updated: 2023-10-16

When people try to do rigorous benchmarking across various libraries, I sometimes see code like this:

    auto std_start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000; ++i)
        for (int j = 0; j < 10000; ++j)
            volatile const auto __attribute__((unused)) c = std_set.count(i + j);
    auto std_stop = std::chrono::steady_clock::now();

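The `volatile` is used here to keep the optimizer from noticing that the result of the code under test is discarded, and then discarding the entire computation.

When the code under test doesn't return a value, say it is `void do_something(int)`, I sometimes see code like this instead: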

    auto std_start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000; ++i)
        for (int j = 0; j < 10000; ++j)
            static_cast<volatile void> (do_something(i + j));
    auto std_stop = std::chrono::steady_clock::now();

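Is this a correct usage of `volatile`? What is `volatile void`? What does it mean from the point of view of the compiler and the standard?

In the standard (N4296), [dcl.type.cv] says:

7 [ Note: volatile is a hint to the implementation to avoid aggressive optimization involving the object because the value of the object might be changed by means undetectable by an implementation. Furthermore, for some implementations, volatile might indicate that special hardware instructions are required to access the object. See 1.9 for detailed semantics. In general, the semantics of volatile are intended to be the same in C++ as they are in C. -- end note ]

Section 1.9 gives a lot of guidance about the execution model, but as far as `volatile` is concerned, it is about "accessing a volatile object". It is not clear to me what executing a statement that has been cast to `volatile void` means, assuming I am reading the code correctly, or what optimization barrier, if any, it produces.

`static_cast<volatile void> (foo())` does not work as a way to require the compiler to actually compute `foo()`: with optimization enabled, none of gcc, clang, MSVC, or ICC treats it that way.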

    #include <bitset>
    void foo() {
        for (int i = 0; i < 10000; ++i)
            for (int j = 0; j < 10000; ++j) {
                std::bitset<64> std_set(i + j);
                //volatile const auto c = std_set.count();     // real work happens
                static_cast<volatile void> (std_set.count());  // optimizes away
            }
    }

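This compiles to just `ret` with all 4 major x86 compilers. (MSVC emits asm for stand-alone definitions of `std::bitset::count()` or something similar, but scroll down to see its trivial definition of `foo()`.)

(Source + asm output for this example and the next one are on Matt Godbolt's Compiler Explorer.)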
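One way to see why the cast cannot act as an optimization barrier: `volatile void` is still just a void type, so the cast yields a discarded-value expression and there is no volatile object for the compiler to access. The sketch below illustrates this; the names `calls`, `side_effect`, and `cast_demo` are made up for illustration. The calls survive only because they have an observable side effect, not because of the cast.

```cpp
#include <type_traits>

// cv-qualified void is still void: there is no object, so nothing volatile to access.
static_assert(std::is_void<volatile void>::value, "volatile void is a void type");

static int calls = 0;                      // hypothetical counter, illustration only

int side_effect() { return ++calls; }      // modifies a global, so the call must happen

int cast_demo() {
    calls = 0;
    static_cast<volatile void>(side_effect());  // result discarded
    static_cast<void>(side_effect());           // plain void cast, result discarded too
    return calls;                               // both calls ran because of their side effects
}
```

A pure function such as `std::bitset::count()` has no such side effect, which is why the compilers above are free to drop it.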

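Maybe there are some compilers where `static_cast<volatile void>()` does do something. In that case it could be a lighter-weight way to write a repeat loop: it would not spend instructions storing the result to memory, only calculating it.

Instead of storing into a `volatile` variable, it can also be useful to accumulate the result with `tmp += foo()` (or `tmp |=`) and then return it from `main()` or print it with `printf`. There are also various compiler-specific options, such as an empty inline asm statement that breaks the compiler's ability to optimize without actually adding any instructions.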
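The accumulate-and-return idea can be sketched as follows. The function name `sum_popcounts` and its parameter `n` are hypothetical; the point is only that every `count()` feeds a value the caller actually uses, so the work is not dead code. The optimizer is still free to transform the loop (vectorize it, or fold it entirely if `n` is a compile-time constant), so this times the loop as optimized rather than each call in isolation.

```cpp
#include <bitset>
#include <cstddef>

// Sum the popcounts of 0..n-1 so that each result contributes to the return value.
std::size_t sum_popcounts(int n) {
    std::size_t total = 0;
    for (int i = 0; i < n; ++i) {
        std::bitset<64> bits(static_cast<unsigned long long>(i));
        total += bits.count();
    }
    return total;
}
```

A caller would then print the total (for example with `printf("%zu\n", sum_popcounts(10000))`) or return it from `main()`.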

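See Chandler Carruth's CppCon2015 talk on using perf to investigate compiler optimizations, where he shows an optimizer-escape function for GNU C. But his `escape()` function is written to require the value to be in memory (the asm is passed a `void*` to it, with a "memory" clobber). We don't need that: we only need the compiler to have the value in a register or in memory, or even as an immediate constant. (It is unlikely to fully unroll our loop, because it doesn't know that the asm statement is zero instructions.)

The following code compiles to just the `popcnt`, without any extra stores, on gcc.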
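For reference, the helpers from that talk look roughly like this. They are reproduced from memory, so treat the exact spelling as an assumption. They rely on GNU C inline asm (gcc, clang, ICC) and will not build with MSVC. The function `escape_demo` is a made-up usage example.

```cpp
// Empty asm statement; the optimizer must assume it uses p and may touch any memory.
static void escape(void *p) {
    asm volatile("" : : "g"(p) : "memory");
}

// Empty asm statement; the optimizer must assume it may read or write any memory.
static void clobber() {
    asm volatile("" : : : "memory");
}

int escape_demo() {
    int value = 42;
    escape(&value);   // the address escapes, so value must really live in memory
    clobber();        // the compiler must assume value may have been modified
    return value;     // reloaded from memory; still 42 because the asm is empty
}
```

The "memory" clobber is what forces the store and reload, which is more than a benchmark needs when the value could simply stay in a register.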

    #include <bitset>

    // just force the value to be in memory, register, or even immediate
    // instead of empty inline asm, use the operand in a comment so we can see what the compiler chose.  Absolutely no effect on optimization.
    static void escape_integer(int a) {
        asm volatile("# value = %0" : : "g"(a));
    }

    // simplified with just one inner loop
    void test1() {
        for (int i = 0; i < 10000; ++i) {
            std::bitset<64> std_set(i);
            int count = std_set.count();
            escape_integer(count);
        }
    }

    # gcc8.0 20171110 nightly -O3 -march=nehalem  (for popcnt instruction):
    test1():
            # value = 0              # it peels the first iteration with an immediate 0 for the inline asm.
            mov     eax, 1
    .L4:
            popcnt  rdx, rax
            # value = edx            # the inline-asm comment has the %0 filled in to show where gcc put the value
            add     rax, 1
            cmp     rax, 10000
            jne     .L4
            ret

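Clang chooses to put the value in memory to satisfy the "g" constraint, which is pretty dumb. But clang does tend to do that when you give it an inline-asm constraint that includes memory as an option. So for this purpose it is no better than Chandler's escape function.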

    # clang5.0 -O3 -march=nehalem
    test1():
            xor     eax, eax
            #DEBUG_VALUE: i <- 0
    .LBB1_1:                                # =>This Inner Loop Header: Depth=1
            popcnt  rcx, rax
            mov     dword ptr [rsp - 4], ecx
            # value = -4(%rsp)                # inline asm gets a value in memory
            inc     rax
            cmp     rax, 10000
            jne     .LBB1_1
            ret

ICC18 with -march=haswell compiles `test1()` to this:

    test1():
            xor       eax, eax                                      #30.16
    ..B2.2:                         # Preds ..B2.2 ..B2.1
                    # optimization report
                    # %s was not vectorized: ASM code cannot be vectorized
            xor       rdx, rdx              # breaks popcnt's false dep on the destination
            popcnt    rdx, rax                                      #475.16
            inc       rax                                           #30.34
            # value = edx
            cmp       rax, 10000                                    #30.25
            jl        ..B2.2        # Prob 99%                      #30.25
            ret                                                     #35.1

Weird: ICC used `xor rdx,rdx` instead of the 32-bit form `xor edx,edx`. That wastes a REX prefix, and the 64-bit form is not recognized as dependency-breaking on Silvermont/KNL.
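If the extra store in the clang output matters for a measurement, one possible variation is to offer only a register in the constraint. This is a sketch rather than something taken from the outputs above: "r" allows a register operand only, so the compiler has no memory option to pick. The names `escape_integer_reg` and `test1_reg` are hypothetical, and the resulting asm has not been checked on the compiler versions shown here.

```cpp
#include <bitset>

// Same idea as escape_integer, but "r" permits only a register operand.
static void escape_integer_reg(int a) {
    asm volatile("# value = %0" : : "r"(a));
}

// Returns the last popcount so a caller can sanity-check the loop.
int test1_reg(int n) {
    int last = 0;
    for (int i = 0; i < n; ++i) {
        std::bitset<64> std_set(i);
        int count = static_cast<int>(std_set.count());
        escape_integer_reg(count);   // the value must be materialized in a register here
        last = count;
    }
    return last;
}
```

The trade-off is that "r" also rules out immediates, so a value known at compile time costs a `mov` into a register instead of nothing.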