Why is this version of strcmp slower?

Keywords: strcmp, slower version, why. Updated: 2023-10-16

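I have been trying to improve the performance of strcmp under certain conditions. Unfortunately, I can't even get a plain implementation of strcmp to perform as well as the library implementation.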

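I saw a similar question, but the answers say the difference came from the compiler optimizing away the comparison of string literals. My test doesn't use string literals.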

Here's the implementation (comparisons.cpp):

int strcmp_custom(const char* a, const char* b) {
    while (*b == *a) {
        if (*a == '\0') return 0;
        a++;
        b++;
    }
    return *b - *a;
}

And here's the test driver (driver.cpp):

#include "comparisons.h"
#include <array>
#include <chrono>
#include <cstdlib>   // rand, srand
#include <cstring>   // strcmp
#include <iostream>

void init_string(char* str, int nChars) {
    // 10% of strings will be equal, and 90% of strings will have one char different.
    // This way, many strings will share long prefixes so strcmp has to exercise a bit.
    // Using random strings still shows the custom implementation as slower (just less so).
    str[nChars - 1] = '\0';
    for (int i = 0; i < nChars - 1; i++)
        str[i] = (i % 94) + 32;
    if (rand() % 10 != 0)
        str[rand() % (nChars - 1)] = 'x';
}

int main(int argc, char** argv) {
    srand(1234);
    // Pre-generate some strings to compare.
    const int kSampleSize = 100;
    std::array<char[1024], kSampleSize> strings;
    for (int i = 0; i < kSampleSize; i++)
        init_string(strings[i], kSampleSize);

    // Time the library strcmp over every pair of strings.
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < kSampleSize; i++)
        for (int j = 0; j < kSampleSize; j++)
            strcmp(strings[i], strings[j]);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "strcmp        - " << (end - start).count() << std::endl;

    // Time the custom implementation over the same pairs.
    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < kSampleSize; i++)
        for (int j = 0; j < kSampleSize; j++)
            strcmp_custom(strings[i], strings[j]);
    end = std::chrono::high_resolution_clock::now();
    std::cout << "strcmp_custom - " << (end - start).count() << std::endl;
}

And my makefile:

CC=clang++

test: driver.o comparisons.o
	$(CC) -o test driver.o comparisons.o

# Compile the test driver with optimizations off.
driver.o: driver.cpp comparisons.h
	$(CC) -c -o driver.o -std=c++11 -O0 driver.cpp

# Compile the code being tested separately with optimizations on.
comparisons.o: comparisons.cpp comparisons.h
	$(CC) -c -o comparisons.o -std=c++11 -O3 comparisons.cpp

clean:
	rm comparisons.o driver.o test

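Following the advice in this answer, I compiled my comparison function in a separate compilation unit with optimizations turned on, and compiled the driver with optimizations turned off, but I still get a slowdown of about 5x.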

strcmp        - 154519
strcmp_custom - 506282

I also tried copying the FreeBSD implementation, but got similar results.

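I'm wondering whether my performance measurement is overlooking something. Or is the standard library implementation doing something fancier?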
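One thing worth ruling out in a measurement like this is that the comparison results are never used. The following is only a minimal sketch of a harness that consumes every result and reports an explicit unit. The names run_benchmark, BenchResult, and CompareFn are hypothetical, and std::chrono::steady_clock is swapped in for high_resolution_clock as an assumption about what suits interval timing. It is not expected to explain the slowdown by itself.

```cpp
#include <chrono>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical alias: any function with strcmp's signature.
using CompareFn = int (*)(const char*, const char*);

// Hypothetical result type: elapsed time plus a value derived from every call,
// so the comparisons cannot be dropped as dead code.
struct BenchResult {
    long long nanoseconds;
    long long mismatches;  // number of pairs that compared unequal
};

BenchResult run_benchmark(CompareFn cmp, const std::vector<std::string>& strings) {
    long long mismatches = 0;
    auto start = std::chrono::steady_clock::now();
    for (const auto& a : strings)
        for (const auto& b : strings)
            if (cmp(a.c_str(), b.c_str()) != 0) ++mismatches;
    auto end = std::chrono::steady_clock::now();
    long long ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    return {ns, mismatches};
}
```

Both implementations can be passed through the same function pointer, so each is called the same way; a captureless lambda wrapping std::strcmp converts to CompareFn.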

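I don't know which standard library you have, but just to give you an idea of how seriously C library maintainers take optimizing the string primitives: the default strcmp used by GNU libc on x86-64 is two thousand lines of hand-optimized assembly language, as of version 2.24. There are separate, also hand-optimized, versions for when the SSSE3 and SSE4.2 instruction set extensions are available. (A fair bit of the complexity in that file appears to be because the same source code is used to generate several other functions; the machine code winds up being "only" 1120 instructions.) 2.24 was released roughly a year ago, and even more work has gone into it since.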

They go to this much trouble because it's common for one of the string primitives to be the single hottest function in a profile.

Excerpts from my disassembly of glibc v2.2.5 on x86_64 Linux:

0000000000089cd0 <strcmp@@GLIBC_2.2.5>:
89cd0:   48 8b 15 99 a1 33 00    mov    0x33a199(%rip),%rdx        # 3c3e70 <_IO_file_jumps@@GLIBC_2.2.5+0x790>
89cd7:   48 8d 05 92 58 01 00    lea    0x15892(%rip),%rax        # 9f570 <strerror_l@@GLIBC_2.6+0x200>
89cde:   f7 82 b0 00 00 00 10    testl  $0x10,0xb0(%rdx)
89ce5:   00 00 00 
89ce8:   75 1a                   jne    89d04 <strcmp@@GLIBC_2.2.5+0x34>
89cea:   48 8d 05 9f 48 0c 00    lea    0xc489f(%rip),%rax        # 14e590 <__nss_passwd_lookup@@GLIBC_2.2.5+0x9c30>
89cf1:   f7 82 80 00 00 00 00    testl  $0x200,0x80(%rdx)
89cf8:   02 00 00 
89cfb:   75 07                   jne    89d04 <strcmp@@GLIBC_2.2.5+0x34>
89cfd:   48 8d 05 0c 00 00 00    lea    0xc(%rip),%rax        # 89d10 <strcmp@@GLIBC_2.2.5+0x40>
89d04:   c3                      retq
89d05:   90                      nop
89d06:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
89d0d:   00 00 00 
89d10:   89 f1                   mov    %esi,%ecx
89d12:   89 f8                   mov    %edi,%eax
89d14:   48 83 e1 3f             and    $0x3f,%rcx
89d18:   48 83 e0 3f             and    $0x3f,%rax
89d1c:   83 f9 30                cmp    $0x30,%ecx
89d1f:   77 3f                   ja     89d60 <strcmp@@GLIBC_2.2.5+0x90>
89d21:   83 f8 30                cmp    $0x30,%eax
89d24:   77 3a                   ja     89d60 <strcmp@@GLIBC_2.2.5+0x90>
89d26:   66 0f 12 0f             movlpd (%rdi),%xmm1
89d2a:   66 0f 12 16             movlpd (%rsi),%xmm2
89d2e:   66 0f 16 4f 08          movhpd 0x8(%rdi),%xmm1
89d33:   66 0f 16 56 08          movhpd 0x8(%rsi),%xmm2
89d38:   66 0f ef c0             pxor   %xmm0,%xmm0
89d3c:   66 0f 74 c1             pcmpeqb %xmm1,%xmm0
89d40:   66 0f 74 ca             pcmpeqb %xmm2,%xmm1
89d44:   66 0f f8 c8             psubb  %xmm0,%xmm1
89d48:   66 0f d7 d1             pmovmskb %xmm1,%edx
89d4c:   81 ea ff ff 00 00       sub    $0xffff,%edx
...

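The real thing is 1183 lines of assembly, with lots of potential cleverness about detecting system features and using vectorized instructions. The libc maintainers know that they can get an edge just by optimizing some of the functions that applications call thousands of times.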
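To make the pcmpeqb / pmovmskb sequence in the excerpt more concrete, here is a rough sketch of the same idea with SSE2 intrinsics: examine 16 bytes of both strings at once and find the first position where they differ or where the first string ends. The helper name first_stop_in_block16 is hypothetical, this is not glibc's code, and it assumes both pointers have at least 16 readable bytes (the real routine checks pointer offsets before loading so that it never reads where it should not). A scalar fallback is included for builds without SSE2.

```cpp
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper. Returns the index (0..15) of the first byte where a and b
// differ or where a has a NUL, or 16 if the whole block is equal and NUL-free.
// Assumption: at least 16 bytes are readable at both a and b.
inline int first_stop_in_block16(const char* a, const char* b) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i is_nul = _mm_cmpeq_epi8(va, _mm_setzero_si128());  // 0xFF where a[i] == 0
    __m128i is_eq  = _mm_cmpeq_epi8(va, vb);                   // 0xFF where a[i] == b[i]
    // Bit i is set when byte i is equal in both strings and not NUL.
    unsigned keep_going = static_cast<unsigned>(
        _mm_movemask_epi8(_mm_andnot_si128(is_nul, is_eq)));
    unsigned stop = ~keep_going & 0xFFFFu;
    if (stop == 0) return 16;
    return __builtin_ctz(stop);  // GCC/Clang builtin: index of lowest set bit
#else
    for (int i = 0; i < 16; ++i)
        if (a[i] != b[i] || a[i] == '\0') return i;
    return 16;
#endif
}
```

A full strcmp built on this would loop over blocks while the helper returns 16, then compare the two bytes at the returned index.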

For comparison, your version at -O3:

comparisons.o:     file format elf64-x86-64

Disassembly of section .text:
0000000000000000 <_Z13strcmp_customPKcS0_>:
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
0:   8a 0e                   mov    (%rsi),%cl
2:   8a 07                   mov    (%rdi),%al
4:   38 c1                   cmp    %al,%cl
6:   75 1e                   jne    26 <_Z13strcmp_customPKcS0_+0x26>
if (*a == '\0') return 0;
8:   48 ff c6                inc    %rsi
b:   48 ff c7                inc    %rdi
e:   66 90                   xchg   %ax,%ax
10:   31 c0                   xor    %eax,%eax
12:   84 c9                   test   %cl,%cl
14:   74 18                   je     2e <_Z13strcmp_customPKcS0_+0x2e>
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
16:   0f b6 0e                movzbl (%rsi),%ecx
19:   0f b6 07                movzbl (%rdi),%eax
1c:   48 ff c6                inc    %rsi
1f:   48 ff c7                inc    %rdi
22:   38 c1                   cmp    %al,%cl
24:   74 ea                   je     10 <_Z13strcmp_customPKcS0_+0x10>
26:   0f be d0                movsbl %al,%edx
29:   0f be c1                movsbl %cl,%eax
if (*a == '\0') return 0;
a++;
b++;
}
return *b - *a;
2c:   29 d0                   sub    %edx,%eax
}
2e:   c3                      retq
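The loop in this listing loads and compares one byte from each string per iteration. As a portable illustration of how much work can be folded into one iteration, here is a hedged sketch that compares eight bytes at a time using a well-known bit trick for detecting a zero byte in a 64-bit word. The names has_zero_byte and strcmp_padded are hypothetical, and the sketch assumes both strings live in buffers of a known, equal capacity that is a multiple of 8 (as with the char[1024] arrays in the driver), which sidesteps the problem of reading past the end of a string. It uses the standard sign convention (first argument minus second).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Nonzero if and only if some byte of v is zero.
static inline std::uint64_t has_zero_byte(std::uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

// Assumptions: a and b each point to `capacity` readable bytes, capacity is a
// multiple of 8, and each string is NUL-terminated within its buffer.
int strcmp_padded(const char* a, const char* b, std::size_t capacity) {
    assert(capacity % 8 == 0);
    std::size_t i = 0;
    // Skip ahead while whole 8-byte words are identical and contain no NUL.
    for (; i + 8 <= capacity; i += 8) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a + i, 8);  // memcpy avoids alignment and aliasing issues
        std::memcpy(&wb, b + i, 8);
        if (wa != wb || has_zero_byte(wa)) break;
    }
    // Resolve the interesting word byte by byte.
    for (; i < capacity; ++i) {
        unsigned char ca = static_cast<unsigned char>(a[i]);
        unsigned char cb = static_cast<unsigned char>(b[i]);
        if (ca != cb) return ca - cb;
        if (ca == 0) return 0;
    }
    return 0;
}
```

This is still far simpler than what a tuned library routine does, but it shows why a byte-at-a-time loop has trouble keeping up on strings that share long prefixes.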