Are STL algorithms optimized for speed?

Keywords: optimization, speed, algorithms, STL. Updated: 2023-10-16

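I am testing the speed of different ways of looping over a std::vector. In the code below, I consider 5 ways to compute the sum of all elements of a vector of N = 10000000 elements: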

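1. using iterators
2. using integer indices
3. using integer indices, unrolling by a factor of 2
4. using integer indices, unrolling by a factor of 4
5. using std::accumulate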

The code is compiled with g++, using the following command line:

g++ -std=c++11 -O3 loop.cpp -o loop.exe

I ran the code 4 times, timing each method, and got the following results (times in microseconds; the minimum and maximum are given):

Iterators: 8002-8007
Integer index: 8004-9003
Unroll 2: 6004-7005
Unroll 4: 4001-5004
Accumulate: 8005-9007

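These experiments seem to indicate that:

1. Looping with iterators versus integer indices makes little difference, at least with full optimization.
2. Unrolling the loops pays off.
3. Surprisingly, std::accumulate performs worse.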

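While conclusions 1 and 2 were more or less expected, number 3 is quite surprising. Don't all the books say to use STL algorithms instead of writing loops by hand?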

Am I making a mistake in the way I measure time, or in the way I interpret the results? Do you get different results if you try the code given below?

#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
using namespace std;
using namespace std::chrono;

int main()
{
    const int N = 10000000;
    vector<int> v(N);
    for (int i = 0; i<N; ++i)
        v[i] = i;
    //looping with iterators
    {
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        long long int sum = 0;
        for (auto it = v.begin(); it != v.end(); ++it)
            sum+=*it;
        high_resolution_clock::time_point t2 = high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
        cout << duration << " microseconds  output = " << sum << " (Iterators)\n";
    }
    //looping with integers
    {
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        long long int sum = 0;
        for (int i = 0; i<N; ++i)
            sum+=v[i];
        high_resolution_clock::time_point t2 = high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
        cout << duration << " microseconds  output = " << sum << " (integer index)\n";
    }
    //looping with integers (UNROLL 2)
    {
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        long long int sum = 0;
        for (int i = 0; i<N; i+=2)
            sum+=v[i]+v[i+1];
        high_resolution_clock::time_point t2 = high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
        cout << duration << " microseconds  output = " << sum << " (integer index, UNROLL 2)\n";
    }
    //looping with integers (UNROLL 4)
    {
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        long long int sum = 0;
        for (int i = 0; i<N; i+=4)
            sum+=v[i]+v[i+1]+v[i+2]+v[i+3];
        high_resolution_clock::time_point t2 = high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
        cout << duration << " microseconds  output = " << sum << " (integer index, UNROLL 4)\n";
    }
    //using std::accumulate
    {
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        long long int sum = accumulate(v.begin(), v.end(), static_cast<long long int>(0));
        high_resolution_clock::time_point t2 = high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
        cout << duration << " microseconds  output = " << sum << " (std::accumulate)\n";
    }
    return 0;
}

The reason to use standard library algorithms is not to get better efficiency, but to let you think at a higher level of abstraction.
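To make the abstraction point concrete, here is a minimal sketch of the same sum written both ways. The helper names (sum_with_loop, sum_with_accumulate, check_sums_agree) are hypothetical and chosen only for illustration. One detail worth noting: std::accumulate accumulates in the type of its initial value, which is why the question's code casts 0 to long long int; the sketch uses 0LL for the same purpose.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: the sum written as an explicit loop.
long long sum_with_loop(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v)
        sum += x;
    return sum;
}

// Hypothetical helper: the same sum expressed through std::accumulate.
// The accumulator has the type of the initial value, so 0LL keeps it 64-bit.
long long sum_with_accumulate(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

// Sanity check: both formulations must produce the same total.
void check_sums_agree(const std::vector<int>& v) {
    assert(sum_with_loop(v) == sum_with_accumulate(v));
}
```

The call to std::accumulate states the intent ("sum this range") in one line; whether it runs faster or slower than the loop depends on the compiler and the flags, as the measurements in the question suggest.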

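While in some cases an algorithm may be faster than your own hand-written code, that is not what algorithms are for. One of the great advantages of C++ is that it lets you bypass the built-in libraries when you have a specific need. If your benchmarking has shown that the standard library is causing a critical slowdown, you are free to explore classic alternatives such as loop unrolling. For most purposes that will never be necessary.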
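As an illustration of the loop-unrolling alternative mentioned above, the following is a rough sketch, not a tuned implementation. The function name sum_unrolled4 is hypothetical. Unlike the unrolled loops in the question, which rely on N being a multiple of the unroll factor, this version handles the leftover elements separately, and it adds into 64-bit partial sums so that no intermediate int addition can overflow. Whether it actually beats a plain loop depends on the compiler, the flags, and the hardware, so it should only be adopted after measuring.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sum with the loop unrolled by a factor of 4.
long long sum_unrolled4(const std::vector<int>& v) {
    const std::size_t n = v.size();
    const std::size_t n4 = n - n % 4;  // largest multiple of 4 that is <= n

    // Four independent 64-bit partial sums.
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i < n4; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    assert(n - i < 4);  // at most three elements are left over

    // Combine the partial sums, then add the remaining tail elements.
    long long sum = s0 + s1 + s2 + s3;
    for (; i < n; ++i)
        sum += v[i];
    return sum;
}
```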

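That said, a well-written standard library algorithm will never be terribly slower than your own straightforward implementation, unless you exploit knowledge of the particulars of your data.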

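In addition to Mar's answer: I think that in most cases the STL is not faster than your own implementation, because it is a generic solution to a family of related problems rather than a solution to one specific problem. The STL may therefore take more factors into account than you actually need, and so be less efficient. There is one exception, though: std::sort uses subtle optimizations (possibly a mix of different sorting algorithms), so it is faster than the usual implementation.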
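For comparison, here is a small sketch that sets a plain insertion sort, standing in for a "usual implementation", next to std::sort. The names insertion_sort and library_sort are hypothetical. The standard only requires std::sort to perform O(N log N) comparisons (since C++11); the claim that it mixes several algorithms is the answerer's, and the exact strategy is up to each library implementation. The sketch only shows that both produce the same ordering and makes no timing claim.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "usual implementation": a plain insertion sort,
// which needs O(n^2) comparisons in the worst case.
void insertion_sort(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        const int key = v[i];
        std::size_t j = i;
        // Shift larger elements one slot to the right.
        while (j > 0 && v[j - 1] > key) {
            v[j] = v[j - 1];
            --j;
        }
        v[j] = key;
    }
    assert(std::is_sorted(v.begin(), v.end()));
}

// The library call, wrapped only so the two can be compared side by side.
void library_sort(std::vector<int>& v) {
    std::sort(v.begin(), v.end());
}
```

To compare speed, both functions would have to be timed on the same input with the same compiler flags, ideally over several runs.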