std::stack performance issues

Updated: 2023-10-16

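Recently I was trying to do some performance benchmarks, comparing std::stack<int, std::vector<int>> against my own simple stack implementation (which uses pre-allocated memory). Now I am seeing some strange behavior.

The first thing I want to ask about is this line in the stack benchmark code: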

//  std::vector<int> magicVector(10);

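When I uncomment this line, performance improves by about 17% (benchmark time drops from 6.5 to 5.4 seconds). But the line should have no effect on the rest of the program, because it does not modify any other members. It also makes no difference whether it is a vector of int or a vector of double...

The second thing I want to ask about is the big performance difference between my stack implementation and std::stack. I was told that std::stack should be as fast as my stack, but the results show that my "FastStack" is twice as fast.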

Results (with the performance-boosting line uncommented):

stack 5.38979
stack 5.34406
stack 5.32404
stack 5.30519

FastStack 2.59635
FastStack 2.59204
FastStack 2.59713
FastStack 2.64814

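These results come from a release build in VS2010 with /O2, /Ot, /Ob2 and the other default optimizations. My CPU is an Intel i5 3570k at default clock (3.6 GHz for one thread).

I put all the code into one file so that anyone can easily test it.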

#define _SECURE_SCL 0
#include <iostream>
#include <vector>
#include <stack>
#include <Windows.h>
using namespace std;
//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//  Purpose:    High Resolution Timer
//---------------------------------------------------------------------------------
class HRTimer
{
public:
    HRTimer();
    double GetFrequency(void);
    void Start(void);
    double Stop(void);
    double GetTime();
private:
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    double frequency;
};

HRTimer::HRTimer()
{
    frequency = this->GetFrequency();
}

double HRTimer::GetFrequency(void)
{
    LARGE_INTEGER proc_freq;
    if (!::QueryPerformanceFrequency(&proc_freq))
        return -1;
    return proc_freq.QuadPart;
}

void HRTimer::Start(void)
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&start);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
}

double HRTimer::Stop(void)
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&stop);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
    return ((stop.QuadPart - start.QuadPart) / frequency);
}

double HRTimer::GetTime()
{
    LARGE_INTEGER time;
    ::QueryPerformanceCounter(&time);
    return time.QuadPart / frequency;
}
//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//  Purpose:    Should be faster than std::stack
//---------------------------------------------------------------------------------
template <class T>
class FastStack
{
public:
    T* st;
    int allocationSize;
    int lastIndex;
public:
    FastStack(int stackSize);
    ~FastStack();
    inline void resize(int newSize);
    inline void push(T x);
    inline void pop();
    inline T getAndRemove();
    inline T getLast();
    inline void clear();
};

template <class T>
FastStack<T>::FastStack( int stackSize )
{
    st = NULL;
    this->allocationSize = stackSize;
    st = new T[stackSize];
    lastIndex = -1;
}

template <class T>
FastStack<T>::~FastStack()
{
    delete [] st;
}

template <class T>
void FastStack<T>::clear()
{
    lastIndex = -1;
}

template <class T>
T FastStack<T>::getLast()
{
    return st[lastIndex];
}

template <class T>
T FastStack<T>::getAndRemove()
{
    return st[lastIndex--];
}

template <class T>
void FastStack<T>::pop()
{
    --lastIndex;
}

template <class T>
void FastStack<T>::push( T x )
{
    st[++lastIndex] = x;
}

template <class T>
void FastStack<T>::resize( int newSize )
{
    if (st != NULL)
        delete [] st;
    st = new T[newSize];
}
//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//---------------------------------------------------------------------------------
//  Purpose:    Benchmark of std::stack and FastStack
//---------------------------------------------------------------------------------

int main(int argc, char *argv[])
{
#if 1
    for (int it = 0; it < 4; it++)
    {
        std::stack<int, std::vector<int>> bStack;
        int x;
        for (int i = 0; i < 100; i++)   // after these two loops, bStack's capacity will be 141 so there will be no more reallocating
            bStack.push(i);
        for (int i = 0; i < 100; i++)
            bStack.pop();
    //  std::vector<int> magicVector(10);           // when you uncomment this line, performance will magically rise about 18%
        HRTimer timer;
        timer.Start();
        for (int i = 0; i < 2000000000; i++)
        {
            bStack.push(i);
            x = bStack.top();
            if (i % 100 == 0 && i != 0)
                for (int j = 0; j < 100; j++)
                    bStack.pop();
        }
        double totalTime = timer.Stop();
        cout << "stack " << totalTime << endl;
    }
#endif
    //------------------------------------------------------------------------------------
#if 1
    for (int it = 0; it < 4; it++)
    {
        FastStack<int> fstack(200);
        int x;
        HRTimer timer;
        timer.Start();
        for (int i = 0; i < 2000000000; i++)
        {
            fstack.push(i);
            x = fstack.getLast();
            if (i % 100 == 0 && i != 0)
                for (int j = 0; j < 100; j++)
                    fstack.pop();
        }
        double totalTime = timer.Stop();
        cout << "FastStack " << totalTime << endl;
    }
#endif
    cout << "Done";
    cin.get();
    return 0;
}

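EDIT: Since everybody is talking about my really bad stack implementation, I want to set things straight. I created that stack in a few minutes and implemented only the few features I currently needed. It was never meant to be a replacement for std::stack :) or to be safe to use in all cases. The only goal was to achieve maximum speed and correct results. I am sorry about the misunderstanding... I just want to know a few answers...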

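Your method implementations are all broken. Ignoring the copy constructor and other missing operations: your push invokes UB if you push too much; your resize is plainly broken, as it doesn't copy the previous data and isn't exception safe; your push isn't exception safe; you invoke too many copies; your getAndRemove isn't exception safe; you don't destruct popped-off elements; you don't construct new elements correctly, only assign them; you unnecessarily default-construct upon creation; and there's probably more I haven't found.

Basically, your class is extremely and hideously unsafe in every conceivable respect, destroys the user's data at the drop of a hat, calls all the wrong functions on T, and will go crying in a corner the moment an exception is thrown anywhere.

It's a giant pile of bad, and the fact that it's "faster" than std::stack is wholly irrelevant, since all you've proven is that if you don't have to meet the requirements, you can go as fast as you like, which we all knew.

Fundamentally, as sbi said, you clearly don't understand the semantics of std::stack, nor important C++ aspects like exception safety, and the ways in which your code fails to work correctly are what make it execute faster. You've got a long way to go, my friend.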

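Many comments (and even answers) focus on the risks in your implementation. Yet the question stands.

As directly demonstrated below, rectifying the perceived code shortcomings would not change anything significant about the performance.

Here is the OP's code modified to be (A) safe, (B) supporting the same operations as std::stack, and (C) reserving buffer space also for the std::stack, in order to clarify things for those who mistakenly believed that this stuff mattered for the performance: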

#define _SECURE_SCL 0
#define _SCL_SECURE_NO_WARNINGS
#include <algorithm>        // std::swap
#include <iostream>
#include <vector>
#include <stack>
#include <stddef.h>         // ptrdiff_t
#include <type_traits>      // std::is_pod
using namespace std;
#undef UNICODE
#define UNICODE
#include <Windows.h>
typedef ptrdiff_t   Size;
typedef Size        Index;
template< class Type, class Container >
void reserve( Size const newBufSize, std::stack< Type, Container >& st )
{
    struct Access: std::stack< Type, Container >
    {
        static Container& container( std::stack< Type, Container >& st )
        {
            return st.*&Access::c;
        }
    };

    Access::container( st ).reserve( newBufSize );
}

class HighResolutionTimer
{
public:
    HighResolutionTimer();
    double GetFrequency() const;
    void Start();
    double Stop();
    double GetTime() const;
private:
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    double frequency;
};

HighResolutionTimer::HighResolutionTimer()
{
    frequency = GetFrequency();
}

double HighResolutionTimer::GetFrequency() const
{
    LARGE_INTEGER proc_freq;
    if (!::QueryPerformanceFrequency(&proc_freq))
        return -1;
    return static_cast< double >( proc_freq.QuadPart );
}

void HighResolutionTimer::Start()
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&start);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
}

double HighResolutionTimer::Stop()
{
    DWORD_PTR oldmask = ::SetThreadAffinityMask(::GetCurrentThread(), 0);
    ::QueryPerformanceCounter(&stop);
    ::SetThreadAffinityMask(::GetCurrentThread(), oldmask);
    return ((stop.QuadPart - start.QuadPart) / frequency);
}

double HighResolutionTimer::GetTime() const
{
    LARGE_INTEGER time;
    ::QueryPerformanceCounter(&time);
    return time.QuadPart / frequency;
}

template< class Type, bool elemTypeIsPOD = !!std::is_pod< Type >::value >
class FastStack;

template< class Type >
class FastStack< Type, true >
{
private:
    Type*   st_;
    Index   lastIndex_;
    Size    capacity_;

public:
    Size const size() const { return lastIndex_ + 1; }
    Size const capacity() const { return capacity_; }

    void reserve( Size const newCapacity )
    {
        if( newCapacity > capacity_ )
        {
            FastStack< Type >( *this, newCapacity ).swapWith( *this );
        }
    }

    void push( Type const& x )
    {
        if( size() == capacity() )
        {
            reserve( 2*capacity() );
        }
        st_[++lastIndex_] = x;
    }

    void pop()
    {
        --lastIndex_;
    }

    Type top() const
    {
        return st_[lastIndex_];
    }

    void swapWith( FastStack& other ) throw()
    {
        using std::swap;
        swap( st_, other.st_ );
        swap( lastIndex_, other.lastIndex_ );
        swap( capacity_, other.capacity_ );
    }

    void operator=( FastStack other )
    {
        other.swapWith( *this );
    }

    ~FastStack()
    {
        delete[] st_;
    }

    FastStack( Size const aCapacity = 0 )
        : st_( new Type[aCapacity] )
        , capacity_( aCapacity )
    {
        lastIndex_ = -1;
    }

    FastStack( FastStack const& other, int const newBufSize = -1 )
    {
        capacity_ = (newBufSize < other.size()? other.size(): newBufSize);
        st_ = new Type[capacity_];
        lastIndex_ = other.lastIndex_;
        copy( other.st_, other.st_ + other.size(), st_ );   // Can't throw for POD.
    }
};

template< class Type >
void reserve( Size const newCapacity, FastStack< Type >& st )
{
    st.reserve( newCapacity );
}

template< class StackType >
void test( char const* const description )
{
    for( int it = 0; it < 4; ++it )
    {
        StackType st;
        reserve( 200, st );

        // after these two loops, st's capacity will be 141 so there will be no more reallocating
        for( int i = 0; i < 100; ++i ) { st.push( i ); }
        for( int i = 0; i < 100; ++i ) { st.pop(); }

        // when you uncomment this line, std::stack performance will magically rise about 18%
        // std::vector<int> magicVector(10);

        HighResolutionTimer timer;
        timer.Start();

        for( Index i = 0; i < 1000000000; ++i )
        {
            st.push( i );
            (void) st.top();
            if( i % 100 == 0 && i != 0 )
            {
                for( int j = 0; j < 100; ++j ) { st.pop(); }
            }
        }

        double const totalTime = timer.Stop();
        wcout << description << ": "  << totalTime << endl;
    }
}

int main()
{
    typedef stack< Index, vector< Index > > SStack;
    typedef FastStack< Index >              FStack;

    test< SStack >( "std::stack" );
    test< FStack >( "FastStack" );

    cout << "Done";
}

Results on this slow-as-molasses Samsung RC530 laptop:

[D:\dev\test\so\12704314]>
std::stack: 3.21319
std::stack: 3.16456
std::stack: 3.23298
std::stack: 3.20854
FastStack: 1.97636
FastStack: 1.97958
FastStack: 2.12977
FastStack: 2.13507
Done
[D:\dev\test\so\12704314]> _

And similarly for Visual C++.

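Now let's look at a typical implementation of std::vector::push_back, which is called by std::stack<T, std::vector<T>>::push (in passing, I know of only 3 programmers who have ever used this indentation style, namely PJP, Petzold and myself; since 1998 or thereabouts I have thought it's horrible!):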

void push_back(const value_type& _Val)
    {   // insert element at end
    if (_Inside(_STD addressof(_Val)))
        {   // push back an element
        size_type _Idx = _STD addressof(_Val) - this->_Myfirst;
        if (this->_Mylast == this->_Myend)
            _Reserve(1);
        _Orphan_range(this->_Mylast, this->_Mylast);
        this->_Getal().construct(this->_Mylast,
            this->_Myfirst[_Idx]);
        ++this->_Mylast;
        }
    else
        {   // push back a non-element
        if (this->_Mylast == this->_Myend)
            _Reserve(1);
        _Orphan_range(this->_Mylast, this->_Mylast);
        this->_Getal().construct(this->_Mylast,
            _Val);
        ++this->_Mylast;
        }
    }

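I suspect that the measured inefficiency lies at least partly in all the stuff going on there, and perhaps it's also a matter of automatically generated safety checks.

For a debug build the std::stack performance is so extremely bad that I gave up waiting for any result.

EDIT: Following Xeo's comment below, I updated push to check for "self-push" in the case of buffer reallocation, factoring that out as a separate function: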

void push( Type const& x )
{
    if( size() == capacity() )
    {
        reserveAndPush( x );
    }
    st_[++lastIndex_] = x;
}

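Mysteriously, although reserveAndPush is never called in this test, it affects the performance. Is that due to the code size not fitting the cache?

[D:\dev\test\so\12704314]>
std::stack: 3.21623
std::stack: 3.30501
std::stack: 3.24337
std::stack: 3.27711
FastStack: 2.52791
FastStack: 2.44621
FastStack: 2.44759
FastStack: 2.47287
Done
[D:\dev\test\so\12704314]> _

EDIT 2: DeadMG showed that the code must be buggy. I believe the problem was a missing return, plus the expression computing the new size (twice zero is still zero). He also pointed out that I forgot to show reserveAndPush. It should be: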

void reserveAndPush( Type const& x )
{
    Type const xVal = x;
    reserve( capacity_ == 0? 1 : 2*capacity_ );
    push( xVal );
}

void push( Type const& x )
{
    if( size() == capacity() )
    {
        return reserveAndPush( x );    // <-- The crucial "return".
    }
    st_[++lastIndex_] = x;
}

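Contrary to a std::stack using std::vector, your stack doesn't reallocate when it runs out of space, but just blows up the planet. Allocation, however, is a huge drain on performance, so skipping it will certainly gain performance.

However, in your place I'd grab one of the mature static_vector implementations floating around the web and stuff that into std::stack in place of std::vector. That way you skip all the performance-eating dynamic memory handling, but you still have a valid stack implementation, with a container underneath for the memory handling that is very likely a lot better than what you came up with.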