Arm GNU Compiler: Assembly generated from ternary optimized by superfluous cast

Updated: 2023-10-16

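(Updated to remove decltype and replace it with static_cast; the results are the same.)

In the code sample, adding a cast in the macro MAX makes the code execute faster. I don't understand why, since the two versions look like they should be identical. This happens with two different ARM compilers: GCC, and armclang in a larger code base. Any thoughts on this would be helpful.

In the code below, when WITH_CAST is defined, the compiled result is significantly improved (I get the same result in my larger code base). The cast being performed appears to be superfluous. I am running this in Keil 5.25pre2 (as a simulator only). I use the Keil simulator to check performance by looking at what the t1 timer shows in microseconds.

Code snippet: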

#if defined (WITH_CAST)
#define MAX(a,b) (((a) > (b)) ? (static_cast<mytype>(a)) : (static_cast<mytype>(b)))
#else
#define MAX(a,b) (((a) > (b)) ? ((a)) : ((b)))
#endif

GNU Tools Arm Embedded, version 7 2017-q4-major.

Compiler options: -c -mcpu=cortex-m4 -mthumb -gdwarf-2 -MD -Wall -O -mapcs-frame -mthumb-interwork -std=c++14 -Ofast -I/RTE/_Target_1 -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/CMSIS/Include -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/Device/ARM/ARMCM4/Include -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1/arm-none-eabi" -D_UVISION_VERSION="525" -D_GCC -D_GCC_VERSION="721" -D_RTE_ -DARMCM4 -Wa,-alhms="*.lst" -o *.o

Assembler options: -mcpu=cortex-m4 -mthumb --gdwarf-2 -mthumb-interwork --MD *.d -I/RTE/_Target_1 -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/CMSIS/Include -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/Device/ARM/ARMCM4/Include -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1/arm-none-eabi" -alhms="*.lst" -o *.o

Linker options: -T/RTE/Device/ARMCM4/gcc_arm.ld -mcpu=cortex-m4 -mthumb -mthumb-interwork -Wl,-Map="./Optimization.Map" -o Optimization.elf *.o -lm

#include <cstdlib>
#include <cstring>
#include <cstdint>
#define WITH_CAST
struct mytype {
    uint32_t value;
    __attribute__((const, always_inline)) constexpr friend bool operator>(const mytype & t, const mytype & a) {
        return t.value > a.value;
    }
};
static mytype output_buf [32];
static mytype * output_memory_ptr = output_buf;
static mytype * volatile * output_memory_tmpp = &output_memory_ptr;
static mytype input_buf [32];
static mytype * input_memory_ptr = input_buf;
static mytype * volatile * input_memory_tmpp = &input_memory_ptr;
#if defined (WITH_CAST)
#define MAX(a,b) (((a) > (b)) ? (static_cast<mytype>(a)) : (static_cast<mytype>(b)))
#else
#define MAX(a,b) (((a) > (b)) ? ((a)) : ((b)))
#endif
int main (void) {
    // Read the buffer pointers through volatile so the compiler cannot constant-fold them.
    const mytype * input = *input_memory_tmpp;
    mytype * output = *output_memory_tmpp;
    mytype p = input[0];
    mytype c = input[1];
    mytype pc = MAX(p, c);
    output[0] = pc;
    // Sliding maximum: output[i] = max(input[i - 1], input[i], input[i + 1]).
    for (int i = 1; i < 31; i++) {
        mytype n = input[i + 1];
        mytype cn = MAX(c, n);
        output[i] = MAX(pc, cn);
        p = c;
        c = n;
        pc = cn;
    }
    output[31] = pc;
}
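One difference between the two MAX variants that the language itself guarantees is the value category of the conditional expression. The sketch below makes that visible with static_assert. It uses a simplified stand-in type and hypothetical names (sample_t, MAX_PLAIN, MAX_CAST, max_plain_value, max_cast_value) that are not part of the original code, and it does not by itself explain the difference in generated assembly.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Simplified stand-in for the question's mytype (hypothetical name).
struct sample_t {
    std::uint32_t value;
    friend constexpr bool operator>(const sample_t& l, const sample_t& r) {
        return l.value > r.value;
    }
};

// The two macro variants from the question, renamed so both can coexist.
#define MAX_PLAIN(a, b) (((a) > (b)) ? (a) : (b))
#define MAX_CAST(a, b) (((a) > (b)) ? (static_cast<sample_t>(a)) : (static_cast<sample_t>(b)))

// Two named objects, used only inside unevaluated decltype operands.
sample_t lhs_sample{1};
sample_t rhs_sample{2};

// Without the cast, both branches are lvalues of the same type, so the whole
// conditional is an lvalue and decltype reports a reference type.
static_assert(std::is_same<decltype(MAX_PLAIN(lhs_sample, rhs_sample)), sample_t&>::value,
              "plain ternary on lvalues yields an lvalue");

// With the cast, both branches are prvalues, so the conditional is a prvalue.
static_assert(std::is_same<decltype(MAX_CAST(lhs_sample, rhs_sample)), sample_t>::value,
              "cast ternary yields a prvalue");

// Both variants still select the same value at run time.
inline std::uint32_t max_plain_value(sample_t a, sample_t b) { return MAX_PLAIN(a, b).value; }
inline std::uint32_t max_cast_value(sample_t a, sample_t b) { return MAX_CAST(a, b).value; }
```

Both helper functions return the larger value, so any speed difference comes from how the compiler lowers the two expression forms, not from a difference in the result.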

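Quote from the C++0x specification:

The type denoted by decltype(e) is defined as follows:

-- if e is an unparenthesized id-expression or a class member access (5.2.5), decltype(e) is the type of the entity named by e. If there is no such entity, or if e names a set of overloaded functions, the program is ill-formed;

-- otherwise, if e is a function call (5.2.2) or an invocation of an overloaded operator (parentheses around e are ignored), decltype(e) is the return type of the statically chosen function;

-- otherwise, if e is an lvalue, decltype(e) is T&, where T is the type of e;

-- otherwise, decltype(e) is the type of e.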
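The four rules quoted above can be checked at compile time. The following is a minimal sketch with hypothetical names (point_t, make_int, make_ref, counter, origin) that are not in the original post; each static_assert corresponds to one rule.

```cpp
#include <cassert>
#include <type_traits>

struct point_t { int x; };  // hypothetical type for the member-access case

int make_int();   // declared only: decltype operands are never evaluated
int& make_ref();  // declared only

int counter = 0;
point_t origin{0};

// Rule 1: unparenthesized id-expression or class member access -> declared type.
static_assert(std::is_same<decltype(counter), int>::value, "id-expression");
static_assert(std::is_same<decltype(origin.x), int>::value, "class member access");

// Rule 2: function call -> return type of the selected function.
static_assert(std::is_same<decltype(make_int()), int>::value, "call returning int");
static_assert(std::is_same<decltype(make_ref()), int&>::value, "call returning int&");

// Rule 3: any other lvalue expression -> T&.
static_assert(std::is_same<decltype((counter)), int&>::value, "parenthesized lvalue");
static_assert(std::is_same<decltype(true ? counter : counter), int&>::value,
              "ternary with two lvalue operands of the same type");

// Rule 4: everything else -> the plain type of the expression.
static_assert(std::is_same<decltype(counter + 1), int>::value, "prvalue");
```

The ternary case is the one relevant to the MAX macro: with two lvalue operands of the same type, decltype of the conditional expression is a reference type.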

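I suppose the use of a reference (T&) makes it more efficient.

From the conclusion of the discussion "Want speed? Don't (always) pass by value.":

With only lvalues involved, and without move semantics, the "pass-by-value" version creates one extra object through copy construction.

Therefore, using "decltype" (i.e., "pass by reference") makes the code more efficient.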
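The extra copy mentioned above can be counted directly. This is a minimal sketch, assuming a hypothetical type counted_t with an instrumented copy constructor; none of these names come from the original post. It illustrates the language-level claim only. For a small trivially copyable struct such as mytype, an optimizer is usually free to remove such copies, so this does not establish the cause of the measured speedup.

```cpp
#include <cassert>

// Hypothetical type that counts how often it is copy-constructed.
struct counted_t {
    int value;
    static int copies;
    explicit counted_t(int v) : value(v) {}
    counted_t(const counted_t& other) : value(other.value) { ++copies; }
};
int counted_t::copies = 0;

// Pass by value: an lvalue argument must be copied into the parameter.
int read_by_value(counted_t c) { return c.value; }

// Pass by reference: the parameter binds to the caller's object, no copy.
int read_by_reference(const counted_t& c) { return c.value; }

// Returns the number of copy constructions caused by one by-value call.
int copies_made_by_value() {
    counted_t obj(5);
    counted_t::copies = 0;
    read_by_value(obj);
    return counted_t::copies;
}

// Returns the number of copy constructions caused by one by-reference call.
int copies_made_by_reference() {
    counted_t obj(5);
    counted_t::copies = 0;
    read_by_reference(obj);
    return counted_t::copies;
}
```

Calling the two helpers shows one copy construction for the by-value call and none for the by-reference call.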