Why does omitting the push_back make the loop run slower?

Keywords: loop, push_back, omitting. Updated: 2023-10-16

Consider this program, which I compile on Cygwin with gcc 5.4.0 using the command line g++ -std=c++14 -Wall -pedantic -O2 timing.cpp -o timing.

#include <chrono>
#include <iostream>
#include <string>
#include <vector>
std::string generateitem()
{
    return "a";
}
int main()
{
    std::vector<std::string> items;
    std::chrono::steady_clock clk;
    auto start(clk.now());
    std::string item;
    for (int i = 0; i < 3000000; ++i)
    {
        item = generateitem();
        items.push_back(item); // *********
    }
    auto stop(clk.now());
    std::cout
        << std::chrono::duration_cast<std::chrono::milliseconds>
            (stop-start).count()
        << " ms\n";
}

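I consistently get a reported time of about 500 ms. However, if I comment out the starred line, so that nothing is pushed onto the vector, the reported time is about 700 ms.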

Why does not pushing onto the vector make the loop run slower?

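I've now run the test, and the problem is that in the push_back version the strings are not freed until after the clock has stopped. Changing the code to: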

#include <chrono>
#include <iostream>
#include <string>
#include <vector>
std::string generateitem()
{
    return "a";
}
int main()
{
    std::chrono::steady_clock clk;
    auto start(clk.now());
    {   // inner scope: items and item are destroyed before the clock stops
        std::vector<std::string> items;
        std::string item;
        for (int i = 0; i < 3000000; ++i)
        {
            item = generateitem();
            items.push_back(item); // *********
        }
    }
    auto stop(clk.now());
    std::cout
        << std::chrono::duration_cast<std::chrono::milliseconds>
            (stop-start).count()
        << " ms\n";
}

gives the expected behaviour on my Cygwin machine: roughly the same time for both options, because this time we measure all of the deallocations.

To explain further, the original code is essentially:

allocate items
start clock
repeat 3000000 times
    allocate std::string("a")
    copy std::string("a") to end of items array
stop clock
deallocate 3000000 strings

So the performance is dominated by the 3000000 allocations. Now, if we comment out push_back(), we get:

allocate items
start clock
repeat 3000000 times
    allocate std::string("a")
    deallocate std::string("a")
stop clock

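Now we are measuring 3000000 allocations and 3000000 deallocations, so obviously it will be slower. I suggest moving the destruction of the items vector inside the timed span, which gives, with push_back():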

start clock
allocate items
repeat 3000000 times
    allocate std::string("a")
    copy std::string("a") to end of items array
deallocate 3000000 strings
stop clock

or, without push_back():

start clock
allocate items
repeat 3000000 times
    allocate std::string("a")
    deallocate std::string("a")
deallocate empty array
stop clock

So either way we measure 3000000 allocations and 3000000 deallocations, and the code takes essentially the same time.
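The suggestion above can be wrapped in a small helper so that both variants are timed in exactly the same way. This is a sketch: LoopResult and time_loop are hypothetical names, and the loop body is the question's own. The vector and the reused string live in a block that closes before the clock stops, so every deallocation, whether per iteration or at vector destruction, falls inside the measurement.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

std::string generateitem() { return "a"; }

// Hypothetical result type for this sketch.
struct LoopResult {
    long long elapsed_ms;      // time for the loop *and* the cleanup
    std::size_t items_stored;  // how many strings ended up in the vector
};

// Times the question's loop, optionally without the push_back.
LoopResult time_loop(bool do_push_back, int iterations) {
    using clock = std::chrono::steady_clock;
    LoopResult result{0, 0};
    const auto start = clock::now();
    {
        std::vector<std::string> items;
        std::string item;
        for (int i = 0; i < iterations; ++i) {
            item = generateitem();
            if (do_push_back) {
                items.push_back(item);  // the starred line from the question
            }
        }
        result.items_stored = items.size();
    }  // items and item are destroyed here, before the clock stops
    const auto stop = clock::now();
    result.elapsed_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    return result;
}
```

Calling time_loop(true, 3000000) and time_loop(false, 3000000) from main and printing elapsed_ms reproduces the comparison. No particular timings are claimed here; they depend on the machine and on which std::string implementation the toolchain selects.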

Thanks to Ken Y-N's answer, I can now give a complete answer to my own question.

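The code was being compiled against a version of the standard library that implements copy-on-write for std::string. That is, when a string is copied, the buffer holding its contents is not duplicated, and both string objects use the same buffer. The buffer is only duplicated when one of the strings is written to. So the lifetime of an allocated string buffer is as follows: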

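1. It is created in the generateitem function.
2. It is returned from generateitem via RVO.
3. It is assigned to item. (This is a move operation, because the std::string is a temporary.)
4. The call to push_back copies the std::string, but not the buffer. Now two std::string objects share one buffer.
5. On the next iteration of the loop, the next string is moved into item. Now the only std::string object using the buffer is the one in the vector.
6. When the vector is destroyed at the end of main, the reference counts of all the buffers drop to 0, so they are freed.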

Therefore, no buffer is freed during the measured time.

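If we remove the call to push_back, step 4 never happens. In step 5 the buffer's reference count then drops to 0, so it is freed within the measured time. That explains why the measured time goes up.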
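The buffer lifetime described above can be checked with a toy reference-counted string. This is a sketch, not libstdc++'s implementation: CowString, BufferStats, g_stats and frees_inside_loop are hypothetical names, the model only implements buffer sharing (writes are not modelled, so no actual copy on write ever happens), and move assignment is done by swapping, so the previous buffer is released when the temporary dies. It counts how many buffers are freed inside the loop, i.e. inside the region the question timed.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical counters for buffer allocations and frees in this toy model.
struct BufferStats {
    std::size_t allocations = 0;
    std::size_t deallocations = 0;
};
BufferStats g_stats;

// Toy reference-counted string: copies share one heap buffer, which is freed
// when its last owner releases it.
class CowString {
public:
    CowString() = default;
    explicit CowString(const char* s) : rep_(new Rep{1, s}) { ++g_stats.allocations; }
    // Copying shares the buffer and bumps its reference count (step 4).
    CowString(const CowString& other) : rep_(other.rep_) {
        if (rep_) ++rep_->refcount;
    }
    CowString(CowString&& other) noexcept : rep_(std::exchange(other.rep_, nullptr)) {}
    // Copy-and-swap assignment: for item = generateitem() the old buffer ends
    // up in the by-value parameter and is released when it is destroyed (step 5).
    CowString& operator=(CowString other) noexcept {
        std::swap(rep_, other.rep_);
        return *this;
    }
    ~CowString() { release(); }

private:
    struct Rep {
        std::size_t refcount;
        std::string text;
    };
    void release() {
        if (rep_ && --rep_->refcount == 0) {
            delete rep_;
            ++g_stats.deallocations;
        }
    }
    Rep* rep_ = nullptr;
};

// Hypothetical stand-in for the question's generateitem().
CowString generateitem() { return CowString("a"); }

// Runs the question's loop on the toy string and returns how many buffers
// were freed inside the loop.
std::size_t frees_inside_loop(bool do_push_back, int iterations) {
    std::vector<CowString> items;
    CowString item;
    const std::size_t before = g_stats.deallocations;
    for (int i = 0; i < iterations; ++i) {
        item = generateitem();      // steps 1-3
        if (do_push_back) {
            items.push_back(item);  // step 4: the vector shares item's buffer
        }
    }
    // items and item are still alive here, so their buffers are not counted.
    return g_stats.deallocations - before;
}
```

With push_back, every old buffer is still owned by the vector when item moves on, so the count is 0. Without it, each buffer except the last one is freed during the loop (iterations - 1 frees), which is the extra work the question's clock picked up.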

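Now, according to the documentation, GCC 5 was supposed to replace the copy-on-write string class with a new class that does not use copy-on-write. Apparently, however, Cygwin still uses the old version by default. If we add -D_GLIBCXX_USE_CXX11_ABI=1 to the command line, we get the new string class, and with it the results we expect.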
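To see which string class a given toolchain actually selects, a small probe can help. It is a sketch built on two assumptions: that libstdc++ from GCC 5 onwards defines _GLIBCXX_USE_CXX11_ABI to 0 when the old copy-on-write string is in use, and that comparing the c_str() pointers of a string and a fresh copy reveals whether they share a buffer. The function names are hypothetical. Pre-GCC 5 libstdc++ does not define the macro, so the first function is only meaningful on GCC 5 or later.

```cpp
#include <string>

// True when libstdc++ (GCC 5 or later) is using its old copy-on-write
// std::string, e.g. because of -D_GLIBCXX_USE_CXX11_ABI=0 or a toolchain
// whose default is the old ABI. Libraries that do not define the macro
// (e.g. libc++) make this return false.
bool old_libstdcxx_string_abi() {
#if defined(_GLIBCXX_USE_CXX11_ABI) && _GLIBCXX_USE_CXX11_ABI == 0
    return true;
#else
    return false;
#endif
}

// Empirical check: with a copy-on-write string a fresh copy points at the
// same character buffer as the original; with the C++11-conforming string
// each object owns its own storage, so the pointers differ.
bool string_copies_share_buffer() {
    const std::string original = "a";
    const std::string copy = original;
    // c_str() is a const member, so calling it does not force the copy-on-write
    // string to unshare its buffer.
    return original.c_str() == copy.c_str();
}
```

On the asker's Cygwin setup both functions would presumably return true by default and false once -D_GLIBCXX_USE_CXX11_ABI=1 is added.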