Performance of iteration (cache misses)

Keywords: cache, performance, iteration. Updated: 2023-10-16

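I found that iterating over a std::vector is much slower when a variable (i) is used to count up than when std::vector<Data>::iterator is used.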

Thanks to some comments, here is some additional information: (1) I use the Visual Studio C++ compiler; (2) I compiled in release mode with optimization -O2 :)

[Image: console output]

If the variable i is incremented, the iteration takes 5875 ms:

std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

or 5723 ms:

std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec2[i]->x = 0;
    vec2[i]->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

If std::vector<Data>::iterator is used to iterate, the iteration takes 29 ms:

std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

or 110 ms:

std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());
stopWatch.start();
for (auto& it : vec2) {
    it->x = 0;
    it->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

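Why is the other iteration so much faster?

I am surprised that iterating with the variable i over data scattered across memory is as fast as iterating with the variable i over data stored contiguously in memory. The fact that the data are adjacent in memory should reduce cache misses, and that does hold for the iteration with std::vector<Data>::iterator, so why not for the other one? Or am I mistaken, and the difference between 29 and 110 ms is not due to cache misses?

The whole program looks like this: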

#include <iostream>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <string>

// Minimal stopwatch built on std::chrono; a single global instance is used.
class StopWatch
{
public:
    void start() {
        this->t1 = std::chrono::high_resolution_clock::now();
    }
    void stop() {
        this->t2 = std::chrono::high_resolution_clock::now();
        this->diff = t2 - t1;
    }
    void printSpanAsMs(std::string startText = "time span") {
        long diffAsMs = std::chrono::duration_cast<std::chrono::milliseconds>
            (diff).count();
        std::cout << startText << ": " << diffAsMs << "ms" << std::endl;
    }
private:
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::chrono::high_resolution_clock::duration   diff;
} stopWatch;

struct Data {
    int x, y;
};

const unsigned long MAX_DATA = 20000000;

// Test 1: iterate with an index variable i.
void test1()
{
    std::cout << "1. Test \n Use i to iterate through the vector" <<
        std::endl;
    // Elements stored contiguously inside the vector.
    std::vector<Data> vec(MAX_DATA);
    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");
    //////////////////////////////////////////////////
    // Elements allocated individually on the heap.
    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());
    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec2[i]->x = 0;
        vec2[i]->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        delete vec2[i];
        vec2[i] = nullptr;
    }
}

// Test 2: iterate with a range-based for loop (iterators).
void test2()
{
    std::cout << "2. Test \n Use std::vector<T>::iterator to iterate through the vector" << std::endl;
    // Elements stored contiguously inside the vector.
    std::vector<Data> vec(MAX_DATA);
    stopWatch.start();
    for (auto& it : vec) {
        it.x = 0;
        it.y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");
    //////////////////////////////////////////////////
    // Elements allocated individually on the heap.
    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());
    stopWatch.start();
    for (auto& it : vec2) {
        it->x = 0;
        it->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");
    for (auto& it : vec2) {
        delete it;
        it = nullptr;
    }
}

int main()
{
    test1();
    test2();
    std::system("PAUSE");
    return 0;
}

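Why is the other iteration so much faster?

The reason is that MSVC 2017 cannot optimize it properly.

In the first case, it completely fails to optimize the loop: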

for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}

Generated code (live demo):

    xor      r9d, r9d
    mov      eax, r9d
$LL4@test1:
    mov      rdx, QWORD PTR [rcx]
    lea      rax, QWORD PTR [rax+16]
    mov      DWORD PTR [rax+rdx-16], r9d
    mov      rdx, QWORD PTR [rcx]
    mov      DWORD PTR [rax+rdx-12], r9d
    mov      rdx, QWORD PTR [rcx]
    mov      DWORD PTR [rax+rdx-8], r9d
    mov      rdx, QWORD PTR [rcx]
    mov      DWORD PTR [rax+rdx-4], r9d
    sub      r8, 1
    jne      SHORT $LL4@test1

Replacing unsigned i with size_t i, or hoisting the indexed access into a reference, does not help either (demo).
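
For reference, the two variants mentioned above might look roughly like this. It is a minimal sketch and not the code from the linked demo: the function names are hypothetical, the vector is passed by reference, and the loop bound uses vec.size() instead of MAX_DATA. The sketch only shows the shape of the source code; it makes no claim about what any particular compiler generates for it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Data {
    int x, y;
};

// Variant 1 (hypothetical name): index with std::size_t instead of unsigned.
void zeroWithSizeT(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
}

// Variant 2 (hypothetical name): bind the element to a reference once per
// iteration, so operator[] appears only once in the loop body.
void zeroViaReference(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        Data& d = vec[i];
        d.x = 0;
        d.y = 0;
    }
}

// Sanity check: both variants must leave every element zeroed.
void checkVariants() {
    std::vector<Data> a(1000, Data{1, 2});
    std::vector<Data> b(1000, Data{3, 4});
    zeroWithSizeT(a);
    zeroViaReference(b);
    for (std::size_t i = 0; i < a.size(); ++i) {
        assert(a[i].x == 0 && a[i].y == 0);
        assert(b[i].x == 0 && b[i].y == 0);
    }
}
```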

The only thing that helps is using iterators, as you have already found out:

for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}

Generated code (live demo):

    xor      ecx, ecx
    npad     2
$LL4@test2:
    mov      QWORD PTR [rax], rcx
    add      rax, 8
    cmp      rax, rdx
    jne      SHORT $LL4@test2

clang simply calls memset in both cases.
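
What clang does here can also be written out by hand. The sketch below assumes the same Data struct as in the question; the function names are hypothetical. Because Data is trivially copyable and an all-zero byte pattern represents 0 for int, zeroing the contiguous storage with std::memset has the same effect as the loop, and std::fill expresses the same intent at a higher level. Whether MSVC generates better code for either form is not something this sketch demonstrates; that would have to be checked in the generated code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

struct Data {
    int x, y;
};

// std::memset is only valid on trivially copyable types.
static_assert(std::is_trivially_copyable<Data>::value,
              "Data must be trivially copyable to be zeroed with memset");

// Hypothetical helper: zero the whole contiguous buffer in one call.
void zeroWithMemset(std::vector<Data>& vec) {
    if (!vec.empty())  // avoid passing a possibly null pointer to memset
        std::memset(vec.data(), 0, vec.size() * sizeof(Data));
}

// Hypothetical helper: the same effect expressed with a standard algorithm.
void zeroWithFill(std::vector<Data>& vec) {
    std::fill(vec.begin(), vec.end(), Data{0, 0});
}

// Sanity check: both helpers must leave every element zeroed.
void checkZeroing() {
    std::vector<Data> a(1000, Data{1, 2});
    std::vector<Data> b(1000, Data{3, 4});
    zeroWithMemset(a);
    zeroWithFill(b);
    for (std::size_t i = 0; i < a.size(); ++i) {
        assert(a[i].x == 0 && a[i].y == 0);
        assert(b[i].x == 0 && b[i].y == 0);
    }
}
```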

The moral of the story: if you care about performance, look at the generated code. Report problems to the vendor.
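
One way to follow this advice locally is to put each loop into its own function in a small source file and ask the compiler for an assembly listing. The sketch below assumes a hypothetical file name (loops.cpp) and hypothetical function names; the command lines in the comments use the /FA switch of MSVC and the -S switch of clang.

```cpp
// loops.cpp (hypothetical file name)
//
// Produce an assembly listing without linking:
//   MSVC:  cl /O2 /EHsc /c /FA loops.cpp       (writes loops.asm)
//   clang: clang++ -O2 -S loops.cpp -o loops.s
#include <cassert>
#include <cstddef>
#include <vector>

struct Data {
    int x, y;
};

// Index-based loop, as in test1 of the question (bound taken from the vector).
void zeroByIndex(std::vector<Data>& vec) {
    for (unsigned i = 0U; i < vec.size(); ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
}

// Range-based loop, as in test2 of the question.
void zeroByRange(std::vector<Data>& vec) {
    for (auto& d : vec) {
        d.x = 0;
        d.y = 0;
    }
}

// Sanity check: both loops must do the same work before their
// generated code is compared.
void checkLoops() {
    std::vector<Data> a(1000, Data{1, 2});
    std::vector<Data> b(1000, Data{3, 4});
    zeroByIndex(a);
    zeroByRange(b);
    for (std::size_t i = 0; i < a.size(); ++i) {
        assert(a[i].x == 0 && a[i].y == 0);
        assert(b[i].x == 0 && b[i].y == 0);
    }
}
```

Keeping each loop in a separate function makes it easy to find the corresponding label in the listing and to compare the two loop bodies side by side.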