std::lower_bound slower for std::vector than std::map::find

Updated: 2023-10-16

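I wrote a class to act as a wrapper around a sequential container (std::vector/std::queue/std::list) so that it has the interface of a std::map, for better performance when using small numbers of small objects. The coding was remarkably simple given the algorithms that already exist. This code is obviously highly trimmed from my full code, but it shows the problem.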

template <class key_, 
          class mapped_, 
          class traits_ = std::less<key_>,
          class undertype_ = std::vector<std::pair<key_,mapped_> >
         >
class associative
{
public:
    typedef traits_ key_compare;
    typedef key_ key_type;
    typedef mapped_ mapped_type;
    typedef std::pair<const key_type, mapped_type> value_type;
    typedef typename undertype_::allocator_type allocator_type;
    typedef typename allocator_type::template rebind<value_type>::other value_allocator_type;
    typedef typename undertype_::const_iterator const_iterator;
    class value_compare {
        key_compare pred_;
    public:
        inline value_compare(key_compare pred=key_compare()) : pred_(pred) {}
        inline bool operator()(const value_type& left, const value_type& right) const {return pred_(left.first,right.first);}
        inline bool operator()(const value_type& left, const key_type& right) const {return pred_(left.first,right);}
        inline bool operator()(const key_type& left, const value_type& right) const {return pred_(left,right.first);}
        inline bool operator()(const key_type& left, const key_type& right) const {return pred_(left,right);}
        inline key_compare key_comp( ) const {return pred_;}
    };
    class iterator  {
    public:       
        typedef typename value_allocator_type::difference_type difference_type;
        typedef typename value_allocator_type::value_type value_type;
        typedef typename value_allocator_type::reference reference;
        typedef typename value_allocator_type::pointer pointer;
        typedef std::bidirectional_iterator_tag iterator_category;
        inline iterator(const typename undertype_::iterator& rhs) : data(rhs) {}
        inline reference operator*() const { return reinterpret_cast<reference>(*data);}
        inline pointer operator->() const {return reinterpret_cast<pointer>(structure_dereference_operator(data));}
        operator const_iterator&() const {return data;}
    protected:
        typename undertype_::iterator data;
    };
    template<class input_iterator>
    inline associative(input_iterator first, input_iterator last) : internal_(first, last), comp_() 
    {if (std::is_sorted(internal_.begin(), internal_.end())==false) std::sort(internal_.begin(), internal_.end(), comp_);}
    inline iterator find(const key_type& key) {
        iterator i = std::lower_bound(internal_.begin(), internal_.end(), key, comp_);
        return (comp_(key,*i) ? internal_.end() : i);
    }
protected:
    undertype_ internal_;
    value_compare comp_;
};

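An SSCCE is at http://ideone.com/Ufn7r, and the full code is at http://ideone.com/MQr0Z (note: timings on IdeOne are very erratic, probably due to server load, and do not clearly show the results in question).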

I tested with std::string and with PODs from 4 to 128 bytes, in ranges from 8 to 2000 elements, using MSVC10.

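I expected higher performance for (1) creating from a range for small objects, (2) random insertion/erasure of small numbers of small objects, and (3) lookup for all objects. Surprisingly, the vector was significantly faster at creating from a range in all tests, and faster at random erasure up to a total size of about 2048 bytes (512 4-byte objects, or 128 16-byte objects, and so on). Most shocking of all, however, std::vector with std::lower_bound was slower than std::map::find for all PODs. The difference was small for 4-byte and 8-byte PODs, but for 128-byte PODs std::vector was 36% slower! For std::string, however, std::vector was 6% faster on average.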

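I feel that std::lower_bound on a sorted std::vector should outperform std::map because of better cache locality and a smaller memory footprint, and because the map may be imperfectly balanced; in the worst case it should match std::map. I cannot think of any reason why std::map should be faster. My only idea is that the predicate is somehow slowing it down, but I cannot figure out how. So the question is: how can std::lower_bound on a sorted std::vector be outperformed by std::map (in MSVC10)?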

[EDIT] I have confirmed that std::lower_bound on std::vector<std::pair<4BYTEPOD,4BYTEPOD>> uses fewer comparisons on average than std::map<4BYTEPOD,4BYTEPOD>::find (by 0-0.25), but my implementation is still 26% slower.

I made an SSCCE at http://ideone.com/41iKt that removes all unneeded fluff and clearly shows that find on the sorted vector is about 15% slower than on the map.

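This is a rather more interesting problem! Before discussing my findings so far, let me point out that the associative::find() function behaves differently from std::map::find(): if the key is not found, the former returns the lower bound while the latter returns end(). To fix this, associative::find() needs to be changed to something like this: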

auto rc = std::lower_bound(this->internal_.begin(), this->internal_.end(), key, this->comp_);
return rc != this->internal_.end() && !this->comp_(key, rc->first)? rc: this->internal_.end();
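
The same lookup pattern can be tried outside the class. The following is only a minimal sketch, assuming int keys and int mapped values; the names entry, key_less, find_key, lookup_or and sample are hypothetical and do not appear in the original code. It checks for end() before touching the element that lower_bound returned, and only then tests for equivalence.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

typedef std::pair<int, int> entry;       // (key, mapped) as stored; assumed types
typedef std::vector<entry> sorted_vec;   // must already be sorted by key

// Hypothetical heterogeneous comparator: compares a stored entry with a bare key.
struct key_less {
    bool operator()(const entry& l, int r) const { return l.first < r; }
    bool operator()(int l, const entry& r) const { return l < r.first; }
};

// Returns an iterator to the entry with the given key, or v.end() if absent.
sorted_vec::const_iterator find_key(const sorted_vec& v, int key) {
    key_less comp;
    sorted_vec::const_iterator it = std::lower_bound(v.begin(), v.end(), key, comp);
    // Check for end() first, then reject the case where key < it->first.
    return (it != v.end() && !comp(key, *it)) ? it : v.end();
}

// Convenience wrapper: mapped value for key, or fallback when the key is absent.
int lookup_or(const sorted_vec& v, int key, int fallback) {
    sorted_vec::const_iterator it = find_key(v, key);
    return it != v.end() ? it->second : fallback;
}

// Small sorted sample: keys 1, 3, 5.
sorted_vec sample() {
    sorted_vec v;
    v.push_back(entry(1, 10));
    v.push_back(entry(3, 30));
    v.push_back(entry(5, 50));
    return v;
}
```

With this sample, a key that lies between two stored keys and a key larger than every stored key both fall through to the fallback value, which mirrors what std::map::find() does by returning end().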

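Now that we are more likely to be comparing apples with apples (I have not yet verified whether the logic is really correct), let's go on to investigate performance. I am not quite convinced that the approach used to test performance really holds water, but I am sticking with it for now, and I could definitely improve the performance of the associative container. I don't think I have found all the performance problems in the code, but I have at least made some progress. The biggest one was noticing that the comparison function used in associative is pretty bad because it keeps making copies. This puts the container at something of a disadvantage. If you look at the comparator now you probably won't see it, because it looks as if the comparator takes its arguments by reference! The problem is actually rather subtle: the underlying container has a value_type of std::pair<key_type, mapped_type>, but the comparator takes std::pair<key_type const, mapped_type> as its argument! Because these are different types, each call has to convert the stored element into a temporary pair, which copies both the key and the mapped value. Fixing this seems to give the associative container quite a performance boost.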
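
The hidden copy can be made visible with a key type that counts its own copies. This is a small sketch under stated assumptions, not code from the question: Counted, stored_type, less_mismatched, less_exact and the two copies_with_* functions are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <utility>

// Hypothetical key type that counts how often it is copied.
struct Counted {
    int v;
    static int copies;
    explicit Counted(int x) : v(x) {}
    Counted(const Counted& o) : v(o.v) { ++copies; }
};
int Counted::copies = 0;

typedef std::pair<Counted, int> stored_type;        // what the vector holds
typedef std::pair<const Counted, int> value_type;   // what the comparator asks for

// Takes references, but to a different pair type than the stored one.
bool less_mismatched(const value_type& l, const value_type& r) {
    return l.first.v < r.first.v;
}

// Takes references to exactly the stored type.
bool less_exact(const stored_type& l, const stored_type& r) {
    return l.first.v < r.first.v;
}

// Number of key copies caused by one comparison through the mismatched signature.
int copies_with_mismatched_pair() {
    stored_type a(Counted(1), 0), b(Counted(2), 0);
    Counted::copies = 0;    // ignore copies made while building a and b
    less_mismatched(a, b);  // each argument is converted to a temporary pair
    return Counted::copies;
}

// Number of key copies caused by one comparison through the exact signature.
int copies_with_exact_pair() {
    stored_type a(Counted(1), 0), b(Counted(2), 0);
    Counted::copies = 0;
    less_exact(a, b);       // binds directly, no temporaries
    return Counted::copies;
}
```

The mismatched signature compiles without any warning because std::pair provides an implicit converting constructor, which is why the cost is easy to miss when reading the comparator.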

To implement a comparator class that cannot fail to match its arguments exactly, I use a simple helper to detect whether a type is a std::pair<L, R>:

template <typename>               struct is_pair                  { enum { value = false }; };
template <typename F, typename S> struct is_pair<std::pair<F, S>> { enum { value = true }; };

Then I replaced the comparator with this slightly more complicated one:

class value_compare {
    key_compare pred_;
public:
    inline value_compare(key_compare pred=key_compare()) : pred_(pred) {}
    template <typename L, typename R>
    inline typename std::enable_if<is_pair<L>::value && is_pair<R>::value, bool>::type
    operator()(L const& left, R const& right) const {
        return pred_(left.first,right.first);
    }
    template <typename L, typename R>
    inline typename std::enable_if<is_pair<L>::value && !is_pair<R>::value, bool>::type
    operator()(L const& left, R const& right) const {
        return pred_(left.first,right);
    }
    template <typename L, typename R>
    inline typename std::enable_if<!is_pair<L>::value && is_pair<R>::value, bool>::type
    operator()(L const& left, R const& right) const {
        return pred_(left,right.first);
    }
    template <typename L, typename R>
    inline typename std::enable_if<!is_pair<L>::value && !is_pair<R>::value, bool>::type
    operator()(L const& left, R const& right) const {
        return pred_(left,right);
    }
    inline key_compare key_comp( ) const {return pred_;}
};

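This generally brings the two approaches quite a bit closer together. Given that I would expect the std::vector<T> with lower_bound() approach to be a lot better than using std::map<K, T>, I feel the investigation is not over yet.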

Addendum:

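Rethinking the exercise a bit more, I spotted why I felt uncomfortable with the implementation of the predicate class: it is way too complex! It can be made a lot simpler by not using std::enable_if at all, which nicely reduces the code to something much easier to read. The key is to get the key: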

template <typename Key>
Key const& get_key(Key const& value)                  { return value; }
template <typename Key,  typename Value>
Key const& get_key(std::pair<Key, Value> const& pair) { return pair.first; }

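With this way of getting hold of the "key" from either a value or a pair of values, the predicate object only needs to define one very simple function call operator: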

template <typename L, typename R>
bool operator()(L const& l, R const& r) const
{
    return this->pred_(get_key<key_type>(l), get_key<key_type>(r));
}

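There is one small trick in this as well, though: the expected key_type needs to be passed to the get_key() function. Without it, the predicate would not work in cases where the key_type is itself a std::pair<F, S> of objects.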
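
Putting the pieces from the addendum together gives something like the sketch below. It is an illustration only: the class name key_compare_any and the helpers sample, lookup and pair_key_demo are hypothetical, and int keys and values are assumed for the lookup. The last function exercises the case where the key type is itself a pair, which is why get_key() is called with an explicit key type.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// get_key() as in the addendum: the key of a bare key is the key itself,
// the key of a (key, value) pair is its first member.
template <typename Key>
Key const& get_key(Key const& value) { return value; }
template <typename Key, typename Value>
Key const& get_key(std::pair<Key, Value> const& pair) { return pair.first; }

// Hypothetical wrapper class around the single templated call operator.
template <typename KeyType, typename Pred = std::less<KeyType> >
class key_compare_any {
    Pred pred_;
public:
    explicit key_compare_any(Pred pred = Pred()) : pred_(pred) {}
    template <typename L, typename R>
    bool operator()(L const& l, R const& r) const {
        // The explicit KeyType selects the right overload even for pair keys.
        return pred_(get_key<KeyType>(l), get_key<KeyType>(r));
    }
};

// Small sorted sample with int keys 1, 3, 5.
std::vector<std::pair<int, int> > sample() {
    return std::vector<std::pair<int, int> >{{1, 10}, {3, 30}, {5, 50}};
}

// Looks up key in a sorted vector; returns fallback when the key is absent.
int lookup(std::vector<std::pair<int, int> > const& v, int key, int fallback) {
    key_compare_any<int> comp;
    auto it = std::lower_bound(v.begin(), v.end(), key, comp);
    return (it != v.end() && !comp(key, *it)) ? it->second : fallback;
}

// Key type is itself a pair: compares a stored ((1,2), 0) entry with key (1,3).
bool pair_key_demo() {
    typedef std::pair<int, int> pair_key;
    key_compare_any<pair_key> comp;
    std::pair<pair_key, int> stored(pair_key(1, 2), 0);
    return comp(stored, pair_key(1, 3));
}
```

Note that, as written, get_key<key_type>() matches stored elements of type std::pair<key_type, mapped_type>; an argument of type std::pair<key_type const, mapped_type> would not match either overload, which at least avoids silently reintroducing the converting copy.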

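I have a guess. First, lower_bound must perform log2(n) comparisons no matter what. This means it never stops early (as find can). Second, for data types larger than a certain size, any pointer arithmetic on the vector must involve a multiplication. For the map, by contrast, it is just a matter of loading a 4-byte (or 8-byte on 64-bit) pointer value from memory.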

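x86 has some nice instructions for quickly multiplying by a power of 2 during index calculations. But they only cover small powers of two (scale factors of 1, 2, 4 and 8), because they are designed for indexing arrays of integer-like entities. For larger element sizes, the compiler actually has to use an integer multiply instruction, which is noticeably slower.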

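When you do lower_bound, you have to do exactly log2(n) of these multiplications. For find, however, the search can be cut off at a smaller number for half of the values. This means the multiplications will have a much greater effect on lower_bound than on any other method.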
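
The claim that lower_bound does not stop early can be checked by counting comparator calls. The following is a rough sketch, assuming a vector holding 0..n-1 and a typical non-debug standard library; counting_less and comparisons_for are hypothetical names. The exact count is implementation-dependent, so only a range is asserted.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical comparator that counts how often it is called.
struct counting_less {
    int* count;
    explicit counting_less(int* c) : count(c) {}
    bool operator()(int l, int r) const { ++*count; return l < r; }
};

// Number of comparisons lower_bound makes when searching 0..n-1 for key.
int comparisons_for(int n, int key) {
    std::vector<int> v(n);
    for (int i = 0; i < n; ++i) v[i] = i;
    int count = 0;
    std::vector<int>::iterator it =
        std::lower_bound(v.begin(), v.end(), key, counting_less(&count));
    (void)it;  // only the comparison count matters here
    return count;
}
```

With 1024 elements, a key that sits exactly at the first probed position (512) still costs roughly log2(1024) = 10 comparisons, because lower_bound never tests for equality and so cannot exit as soon as it happens to land on the key.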

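As an aside... in my opinion, ::std::map should be implemented as a B-tree in which each node is one page in size. Virtual memory is set up in such a way that basically every program with a significantly large data structure will end up having parts of that structure paged out under memory pressure. Storing only one value per node can produce close to the worst case, in which you have to page in an entire page for every comparison along a depth of log2(n), whereas with a B-tree the worst-case paging would be log_x(n) pages, where x is the number of values per node.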

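This also has the nice side effect of reducing the bad effects of cache line boundaries. There is an LCM of the (key, value) tuple size and the cache line size. Having multiple (key, value) pairs in a node makes it more likely that this LCM is reached, so that X pairs take up exactly Y cache lines. If each node contains only one pair, this basically never happens unless the node size is an exact multiple of the cache line size.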
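
To put rough numbers on the paging argument, the depth of the two layouts can be compared with a few lines of integer arithmetic. This is a deliberately simplified model, assuming a perfectly balanced tree in which every level visited costs one page; levels_needed is a hypothetical helper, and the fanout of 256 values per node is an arbitrary illustrative figure, not a measurement.

```cpp
#include <cassert>

// Smallest number of levels such that fanout^levels >= n.
// Assumes fanout >= 2; models the worst-case number of nodes (pages) touched.
int levels_needed(unsigned long long n, unsigned long long fanout) {
    int levels = 0;
    unsigned long long reach = 1;  // how many values 'levels' levels can distinguish
    while (reach < n) {
        reach *= fanout;
        ++levels;
    }
    return levels;
}
```

For one million values this model gives 20 levels with one value per node and 3 levels with 256 values per node, which is the difference between log2(n) and log_x(n) described above.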