Why is vector always slower than a C array, at least in this case?

Updated: 2023-10-16

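I am trying to find all primes no greater than n using the Sieve of Eratosthenes. I have the following code, with the sieve implemented once with a vector and once with a C array, and I find that almost every time the C array is faster.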

Using vector:

int countPrimes_vector(int n) {                  
    int res = 0; 
    vector<char>bitmap(n);
    memset(&bitmap[0], '1', bitmap.size() * sizeof( bitmap[0]));
    //vector<bool>bitmap(n, true); Using this one is even slower!!
    for (int i = 2; i<n; ++i){
        if(bitmap[i]=='1')++res;
        if(sqrt(n)>i)
        {
             for(int j = i*i; j < n; j += i) bitmap[j] = '0';
        }
    }
    return res;
}

Using a C array:

int countPrimes_array(int n) {  
    int res = 0; 
    bool * bitmap = new bool[n];
    memset(bitmap, true, sizeof(bool) * n);
    for (int i = 2; i<n; ++i){
        if(bitmap[i])++res;
        if(sqrt(n)>i)
        {
             for(int j = i*i; j < n; j += i) bitmap[j] = false;
        }
    }
    delete []bitmap;
    return res;
}

Test code:

clock_t t;
t = clock();
int a;
for(int i=0; i<10; ++i)a = countPrimes_vector(8000000); 
t = clock() - t;
cout<<"time for vector = "<<t<<endl;
t = clock();
int b;
for(int i=0; i<10; ++i)b = countPrimes_array(8000000); 
t = clock() - t;
cout<<"time for array = "<<t<<endl;

Output:

 time for vector = 32460000
 time for array = 29840000

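I have tested this many times, and the C array is always faster. What is the reason behind this?

I have often heard that vector and C arrays perform the same, and that vector should always be used as the standard container. Is this statement true, or at least true in general? In which cases should a C array be preferred?

EDIT:

As the comments below point out, after turning on optimization with -O2 or -O3 (it was originally compiled with g++ test.cpp), the time difference between vector and the C array no longer holds; in some cases vector is faster than the C array.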

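Your comparison contains inconsistencies that could explain the difference; another factor may be compiling without sufficient optimization. Some implementations have a lot of extra code in debug builds of the STL. For example, MSVC performs bounds checking on vector element access, which slows debug builds down significantly.

The code below shows much closer performance between the two; the remaining difference is probably just a lack of samples (ideone has a timeout limit of 5s).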

#include <vector>
#include <cmath>
#include <cstring>
int countPrimes_vector(int n) {  
    int res = 0; 
    std::vector<bool> bitmap(n, true);
    for (int i = 2; i<n; ++i){
        if(bitmap[i])
          ++res;
        if(sqrt(n)>i)
        {
             for(int j = i*i; j < n; j += i) bitmap[j] = false;
        }
    }
    return res;
}
int countPrimes_carray(int n) {  
    int res = 0; 
    bool* bitmap = new bool[n];
    memset(bitmap, true, sizeof(bool) * n);
    for (int i = 2; i<n; ++i){
        if(bitmap[i])++res;
        if(sqrt(n)>i)
        {
             for(int j = i*i; j < n; j += i) bitmap[j] = false;
        }
    }
    delete []bitmap;
    return res;
}
#include <chrono>
#include <iostream>
using namespace std;
void test(const char* description, int (*fn)(int))
{
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::milliseconds;
    auto start = clock::now();
    int a;
    for(int i=0; i<9; ++i)
        a = fn(8000000); // call the function passed in, so each label times its own implementation
    auto end = clock::now();
    auto diff = std::chrono::duration_cast<ms>(end - start);
    std::cout << "time for " << description << " = " << diff.count() << "ms\n";
}
int main()
{
    test("carray", countPrimes_carray);
    test("vector", countPrimes_vector);
}

Live demo: http://ideone.com/0Y9gQx

time for carray = 2251ms
time for vector = 2254ms

In some runs, though, the array was 1-2 ms slower. Again, on a shared resource this is not a sufficient sample.
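One common way to get a steadier number from a noisy machine is to time several runs and keep the fastest. The following is only a sketch under that assumption; `count_primes_vec` and `best_of_ms` are hypothetical names introduced here, not part of the code above.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Sieve over std::vector<char>; returns the number of primes below n.
int count_primes_vec(int n) {
    assert(n >= 0);
    if (n < 3) return 0;                       // no primes below 2
    std::vector<char> is_prime(n, 1);
    int res = 0;
    for (int i = 2; i < n; ++i) {
        if (is_prime[i]) ++res;
        // Same condition as sqrt(n) > i, but in integer arithmetic.
        if (static_cast<long long>(i) * i < n)
            for (int j = i * i; j < n; j += i) is_prime[j] = 0;
    }
    return res;
}

// Runs fn(n) `runs` times and returns the fastest run in milliseconds.
template <typename Fn>
long long best_of_ms(Fn fn, int n, int runs) {
    using clock = std::chrono::steady_clock;
    long long best = -1;
    for (int r = 0; r < runs; ++r) {
        auto start = clock::now();
        volatile int sink = fn(n);             // keep the call from being optimized away
        (void)sink;
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
                           clock::now() - start).count();
        if (best < 0 || elapsed < best) best = elapsed;
    }
    return best;
}
```

A call such as `best_of_ms(count_primes_vec, 8000000, 9)` would then report the fastest of nine runs rather than their sum, which is less sensitive to interference from other processes.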

—EDIT—

In your main comment, you asked "why can optimization make a difference".

std::vector<int> v = { 1, 2, 3 };
int b[] = { 1, 2, 3 };

We have two "arrays" of 3 elements each, so consider the following:

v[10]; // illegal!
b[10]; // illegal!

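A debug build of the STL can usually catch this error at run time (and, in some cases, at compile time). The array access may simply produce bad data or a crash.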
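To make the difference concrete, here is a small sketch. It assumes only standard behavior: `std::vector::at` always checks the index and throws `std::out_of_range`, while `assert` is active only when `NDEBUG` is not defined. The helper names `get_unchecked` and `at_rejects` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Mimics a debug-only check: the assert disappears when NDEBUG is defined,
// leaving a plain unchecked access, as in an optimized release build.
int get_unchecked(const std::vector<int>& v, std::size_t idx) {
    assert(idx < v.size());
    return v[idx];
}

// Returns true if v.at(idx) rejects the index as out of range.
bool at_rejects(const std::vector<int>& v, std::size_t idx) {
    try {
        (void)v.at(idx);                 // at() is bounds-checked in every build mode
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

With a three-element vector, `at_rejects(v, 10)` returns true, whereas `v[10]` without any check is undefined behavior.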

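In addition, the STL is implemented with many small member functions such as size(), and because vector is a class, [] is actually implemented through a function call (operator[]).

The compiler can eliminate many of these, but that is optimization. If you don't optimize, then something like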

std::vector<int> v;
v[10];

does roughly the following:

int* data() { return M_.data_; }
v.operator[](size_t idx = 10) {
    if (idx >= this->size()) {
        raise exception("invalid [] access");
    }
    return *(data() + idx);
}

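Even though data() is an "inline" function, unoptimized code keeps it as a real call to make debugging easier. When you build with optimization, the compiler recognizes that these functions are so trivial that it can substitute their bodies at the call site, and it quickly reduces all of the above to something much closer to an array access.

For example, in the case above, it might first reduce operator[] to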

v.operator[](size_t idx = 10) {
    if (idx >= this->size()) {
        raise exception("invalid [] access");
    }
    return *(M_.data_ + idx);
}

Since a non-debug build will probably remove the bounds check, it becomes

v.operator[](size_t idx = 10) {
    return *(M_.data_ + idx);
}

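Now inlining can reduce

x = v[1];

to

x = *(v.M_.data_ + 1); // comparable to v.M_.data_[1];

This leaves a small penalty. A C array involves a block of data in memory plus one local variable, which fits in a register, pointing at that block; your references are made directly relative to it.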
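The reduction steps above can be sketched with a compilable toy class. This is an illustration only: `TinyVec` is a hypothetical stand-in and does not reflect the layout or checks of any real standard library, and the debug check is modeled here with `assert`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for a vector of int.
class TinyVec {
public:
    explicit TinyVec(std::size_t n) : data_(new int[n]()), size_(n) {}
    ~TinyVec() { delete[] data_; }
    TinyVec(const TinyVec&) = delete;
    TinyVec& operator=(const TinyVec&) = delete;

    std::size_t size() const { return size_; }
    int* data() { return data_; }

    // Unoptimized: calls size() and data(), and performs the check.
    // Optimized with NDEBUG: the check is compiled out and the calls are
    // inlined, leaving essentially *(data_ + idx).
    int& operator[](std::size_t idx) {
        assert(idx < size());
        return *(data() + idx);
    }

private:
    int* data_;
    std::size_t size_;
};
```

Comparing the generated assembly for `v[1]` at -O0 and at -O2 with -DNDEBUG is one way to see the calls and the check disappear.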

With a vector, you have a vector object, which holds a pointer to the data plus size and capacity variables:

vector<T>  // pseudo code
{
    T* ptr;
    size_t size;
    size_t capacity;
}
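The three quantities in this pseudo code are all observable through the public interface of std::vector. As a small sketch (the `VecState` struct and `inspect` function are hypothetical names used only here), they can be read back like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Snapshot of the three values a vector has to track.
struct VecState {
    const int* ptr;
    std::size_t size;
    std::size_t capacity;
};

VecState inspect(const std::vector<int>& v) {
    assert(v.size() <= v.capacity());    // invariant guaranteed by the standard
    return { v.data(), v.size(), v.capacity() };
}
```

How an implementation actually stores these values (for example as three pointers) is up to the library, so the struct above describes the interface, not the real memory layout.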

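If you count machine instructions, the vector has 3 variables to initialize and manage.

x = v[1];

Given the approximation of vector above, what you are saying is: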

T* ptr = v.data();
x = ptr[1];

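However, in an optimized build the compiler is usually smart enough to recognize that it can execute the first line before the loop, although this tends to cost a register.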

T* ptr = v.data(); // in debug, function call, otherwise inlined.
for ... {
    x = ptr[1];
}

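So you may see a few more machine instructions per iteration of your test function or, on a modern processor, perhaps a nanosecond or two of extra wall time.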
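If you want to rule this effect out by hand instead of relying on the optimizer, one option is to read the data pointer once and index through it. The version below is a sketch of that idea applied to the sieve from the question; `count_primes_hoisted` is a name invented here, and whether it is measurably faster depends on the compiler and flags.

```cpp
#include <cassert>
#include <vector>

// Sieve that hoists the vector's data pointer out of the loops.
// Returns the number of primes below n.
int count_primes_hoisted(int n) {
    assert(n >= 0);
    if (n < 3) return 0;                  // no primes below 2
    std::vector<char> bitmap(n, 1);
    char* p = bitmap.data();              // read once; the vector is not resized afterwards
    int res = 0;
    for (int i = 2; i < n; ++i) {
        if (p[i]) ++res;
        // Equivalent to sqrt(n) > i, without floating point or overflow.
        if (static_cast<long long>(i) * i < n)
            for (int j = i * i; j < n; j += i) p[j] = 0;
    }
    return res;
}
```

The vector still owns and frees the memory, so this keeps the safety of the container while making the inner loop look like the C array version.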