Sort when only equality is available

Keywords: sorting. Last updated: 2023-10-16

Suppose we have a vector of pairs:

std::vector<std::pair<A,B>> v;

Only equality is defined for type A:

bool operator==(A const & lhs, A const & rhs) { ... }

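How would you sort it so that all pairs with the same first element end up next to each other? To be clear, the output I hope to achieve should be the same as what the following produces: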

std::unordered_multimap<A,B> m(v.begin(),v.end());
std::copy(m.begin(),m.end(),v.begin());

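However, if possible, I would like to:

Do the sort in place.
Avoid having to define a hash function for equality.

Edit: additional concrete information.

In my case the number of elements is not particularly large (I expect N = 10~1000), though I have to repeat this sort many times (~400) as part of a larger algorithm. The data type called A is fairly large: it contains an unordered_map with ~20 std::pair<uint32_t,uint32_t> entries in it, and that structure is what prevents me from inventing an ordering and makes it hard to build a hash function.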

First option: `cluster()` and `sort_within()`

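The handwritten double loop by @MadScienceDreams can be written as a cluster() algorithm of O(N * K) complexity, for N elements and K clusters. It repeatedly calls std::partition (shown here in C++14 style with generic lambdas; it is easy to adapt to C++11, or even C++98 style, by writing your own function objects):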

template<class FwdIt, class Equal = std::equal_to<>>
void cluster(FwdIt first, FwdIt last, Equal eq = Equal{}) 
{
    for (auto it = first; it != last; /* increment inside loop */)
        it = std::partition(it, last, [=](auto const& elem){
            return eq(elem, *it);    
        });    
}

which you call on your input vector<std::pair> as

cluster(begin(v), end(v), [](auto const& L, auto const& R){
    return L.first == R.first;
});
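To see the clustering in action, here is a minimal self-contained sketch. It repeats cluster() from above and adds two things that are not part of the original answer and are purely illustrative: a hypothetical is_clustered() checker that uses equality only, and made-up sample data with int keys standing in for A.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// cluster() as in the answer: repeatedly partition the remaining range
// around its first element, using equality only.
template<class FwdIt, class Equal = std::equal_to<>>
void cluster(FwdIt first, FwdIt last, Equal eq = Equal{})
{
    for (auto it = first; it != last; /* advanced by partition */)
        it = std::partition(it, last, [=](auto const& elem) {
            return eq(elem, *it);
        });
}

// Hypothetical helper (an assumption, not in the original): returns true
// if all elements equal to each other form one contiguous run.
template<class FwdIt, class Equal>
bool is_clustered(FwdIt first, FwdIt last, Equal eq)
{
    for (auto it = first; it != last; ) {
        auto run_end = std::find_if_not(it, last,
            [&](auto const& e) { return eq(e, *it); });
        // No element after the run may match the run's key.
        if (std::any_of(run_end, last,
                [&](auto const& e) { return eq(e, *it); }))
            return false;
        it = run_end;
    }
    return true;
}

// Made-up sample data: int keys stand in for the real type A.
inline std::vector<std::pair<int, char>> clustered_sample()
{
    std::vector<std::pair<int, char>> v{
        {3, 'a'}, {1, 'b'}, {3, 'c'}, {2, 'd'}, {1, 'e'}, {3, 'f'}};
    cluster(v.begin(), v.end(),
        [](auto const& L, auto const& R) { return L.first == R.first; });
    return v;
}
```

The relative order of elements inside a cluster, and of the clusters after the first one, depends on how std::partition shuffles elements, so the checker only verifies contiguity.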

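The next algorithm to write is sort_within, which takes two predicates, an equality and a comparison function object. It repeatedly calls std::find_if_not to find the end of the current range, and then calls std::sort to sort within that range: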

template<class RndIt, class Equal = std::equal_to<>, class Compare = std::less<>>
void sort_within(RndIt first, RndIt last, Equal eq = Equal{}, Compare cmp = Compare{})
{
    for (auto it = first; it != last; /* increment inside loop */) {
        auto next = std::find_if_not(it, last, [=](auto const& elem){
            return eq(elem, *it);
        });
        std::sort(it, next, cmp);
        it = next;
    }
}

On input that is already clustered, you can call it as:

sort_within(begin(v), end(v), 
    [](auto const& L, auto const& R){ return L.first == R.first; },
    [](auto const& L, auto const& R){ return L.second < R.second; }
);

A live example shows this on some actual data using std::pair<int, int>.
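As a rough illustration of what sort_within() does to already clustered data, the sketch below repeats the function and runs it on made-up std::pair<int, int> values. The sample data and the sorted_within_sample() wrapper are assumptions added for this example only.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// sort_within() as in the answer: find the end of each run of equal
// elements, then sort that run with the supplied comparison.
template<class RndIt, class Equal = std::equal_to<>, class Compare = std::less<>>
void sort_within(RndIt first, RndIt last, Equal eq = Equal{}, Compare cmp = Compare{})
{
    for (auto it = first; it != last; /* advanced to the next run */) {
        auto next = std::find_if_not(it, last, [=](auto const& elem) {
            return eq(elem, *it);
        });
        std::sort(it, next, cmp);
        it = next;
    }
}

// Made-up, already clustered input: the keys appear as 3,3,3 then 1,1 then 2.
inline std::vector<std::pair<int, int>> sorted_within_sample()
{
    std::vector<std::pair<int, int>> v{
        {3, 9}, {3, 1}, {3, 5}, {1, 7}, {1, 2}, {2, 4}};
    sort_within(v.begin(), v.end(),
        [](auto const& L, auto const& R) { return L.first == R.first; },
        [](auto const& L, auto const& R) { return L.second < R.second; });
    return v;  // {3,1} {3,5} {3,9} {1,2} {1,7} {2,4}
}
```

The clusters keep their positions; only the second members are ordered inside each cluster.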

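Second option: user-defined comparison

Even if there is no operator< defined on A, you can define one yourself. There are two broad options here. First, if A is hashable, you can define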

bool operator<(A const& L, A const& R)
{
    return std::hash<A>()(L) < std::hash<A>()(R);
}

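and write std::sort(begin(v), end(v)) directly. You will incur O(N log N) calls to std::hash unless you cache all the hash values in separate storage. Note that two distinct A values whose hashes collide compare as equivalent under this ordering, so they can end up interleaved.

Second, if A is not hashable but has data member getters x(), y() and z() that uniquely determine equality on A, you can do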

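bool operator<(A const& L, A const& R)
{
    // std::tie binds lvalue references, so the getters must return references;
    // if they return by value, use std::make_tuple instead.
    return std::tie(L.x(), L.y(), L.z()) < std::tie(R.x(), R.y(), R.z());
}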

and again you can write std::sort(begin(v), end(v)) directly.
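A minimal sketch of the member-wise comparison follows. The Key class, its three members, and the sort_by_key() helper are hypothetical stand-ins for A and are not taken from the original; the getters return const references so that std::tie can bind to them. The helper compares only the first member of each pair, so it does not require an ordering on B.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for A: three members that together determine equality.
class Key {
public:
    Key(int x, std::string y, double z) : x_(x), y_(std::move(y)), z_(z) {}
    // Return const references so std::tie can bind to the members.
    int const& x() const { return x_; }
    std::string const& y() const { return y_; }
    double const& z() const { return z_; }
private:
    int x_;
    std::string y_;
    double z_;
};

bool operator==(Key const& L, Key const& R)
{
    return std::tie(L.x(), L.y(), L.z()) == std::tie(R.x(), R.y(), R.z());
}

// Lexicographic order over the same members that define equality.
bool operator<(Key const& L, Key const& R)
{
    return std::tie(L.x(), L.y(), L.z()) < std::tie(R.x(), R.y(), R.z());
}

// Sort pairs by key only, leaving B free of any ordering requirement.
template<class B>
void sort_by_key(std::vector<std::pair<Key, B>>& v)
{
    std::sort(v.begin(), v.end(),
        [](auto const& L, auto const& R) { return L.first < R.first; });
}
```

Because the order is derived from exactly the members that define equality, equal keys always end up adjacent after sorting.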

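If you can come up with a function that assigns a unique number to each unique element, then you can build a secondary array holding these numbers and sort the secondary array together with the primary one, for example with merge sort.

In that case, though, you need a function that assigns each unique element a unique number, i.e. a hash function without collisions. I think this should not be a problem.

As for the asymptotics of this solution: if the hash function is O(1), building the secondary array is O(N) and sorting it together with the primary array is O(N log N). In total, O(N + N log N) = O(N log N). The drawback of this solution is that it requires double the memory.

In short, the main point of this solution is to quickly translate your elements into elements that can be compared quickly.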
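The secondary-array idea could be sketched roughly as below. The function name sort_by_cached_key and the key_of parameter are hypothetical, and the sketch assumes key_of returns a distinct, cheaply comparable value for each distinct element, as the answer requires. Each key is computed once, the (key, index) pairs are sorted, and the elements are then moved into their new positions, which is where the extra memory goes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>
#include <vector>

// Sketch: sort v by a precomputed key. key_of is assumed (not verified here)
// to map distinct elements to distinct keys.
template<class T, class KeyFn>
void sort_by_cached_key(std::vector<T>& v, KeyFn key_of)
{
    using K = std::decay_t<decltype(key_of(v.front()))>;

    // Secondary array: one (key, original index) entry per element, O(N).
    std::vector<std::pair<K, std::size_t>> keys;
    keys.reserve(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        keys.emplace_back(key_of(v[i]), i);

    // O(N log N) comparisons on the cheap keys only; ties fall back to
    // the original index, so equal keys keep their input order.
    std::sort(keys.begin(), keys.end());

    // Move the elements into sorted order (this is the extra memory).
    std::vector<T> sorted;
    sorted.reserve(v.size());
    for (auto const& entry : keys)
        sorted.push_back(std::move(v[entry.second]));
    v = std::move(sorted);
}
```

A call might look like sort_by_cached_key(v, [](auto const& p) { return some_id(p.first); }), where some_id is whatever collision-free numbering you can provide for A.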

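A suitable algorithm is

for (int i = 0; i < n-2; i++)
{
   // j starts at i+1 so that a neighbor already equal to v[i] joins the
   // cluster instead of being swapped away and left behind.
   for (int j = i+1; j < n; j++)
   {
      if (v[j].first == v[i].first)
      {
         std::swap(v[j],v[i+1]);  // grow the cluster by one element
         i++;
      }
   }
}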

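There is probably a more elegant way to write the loop, but this is O(n*m), where n is the number of elements and m is the number of keys. So if m is much smaller than n (the best case being that all keys are the same), this can be approximated by O(n). In the worst case the number of keys is about n, so this is O(n^2). I have no idea what you expect the number of keys to be, so I cannot really work out the average case, but it is most likely O(n^2) for the average case as well.

For a small number of keys this may be faster than the unordered multimap, but you will have to measure to find out.

Note: the clusters themselves end up in no particular order.

Edit: (more efficient in the partially clustered case, does not change the complexity)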

for (int i = 0; i < n-2; i++)
{
   // Skip neighbors that already belong to the current cluster.
   for(;i<n-2 && v[i+1].first==v[i].first; i++){}
   for (int j = i+2; j < n; j++)
   {
      if (v[j].first == v[i].first)
      {
         std::swap(v[j],v[i+1]);
         i++;
      }
   }
}

Edit 2: following /u/MrPisarik's comment, removed the redundant i check in the inner loop.
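For experimenting with the loop, one possible way to package the edited version is as a function template. The name cluster_by_swapping and the choice to take the vector by reference are assumptions made for this sketch; the loop body follows the edited version above.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sketch: the edited double loop wrapped in a function. Only operator==
// on A is used; n is kept as int to match the loops in the answer.
template<class A, class B>
void cluster_by_swapping(std::vector<std::pair<A, B>>& v)
{
    int const n = static_cast<int>(v.size());
    for (int i = 0; i < n - 2; i++) {
        // Skip neighbors that already belong to the current cluster.
        for (; i < n - 2 && v[i + 1].first == v[i].first; i++) {}
        // Pull every later element with the same key up next to the cluster.
        for (int j = i + 2; j < n; j++) {
            if (v[j].first == v[i].first) {
                std::swap(v[j], v[i + 1]);
                i++;  // the cluster now ends one position further right
            }
        }
    }
}
```

Timing this against the std::unordered_multimap round trip on data of your own size and key count is the only reliable way to decide between them.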

I am surprised that no one has suggested using std::partition yet. It makes the solution nice, elegant, and generic:

template<typename BidirIt, typename BinaryPredicate>
void equivalence_partition(BidirIt first, BidirIt last, BinaryPredicate p) {
  using element_type = typename std::decay<decltype(*first)>::type;
  if(first == last) {
    return;
  }
  auto new_first = std::partition
    (first, last, [=](element_type const &rhs) { return p(*first, rhs); });
  equivalence_partition(new_first, last, p);
}
template<typename BidirIt>
void equivalence_partition(BidirIt first, BidirIt last) {
  using element_type = typename std::decay<decltype(*first)>::type;
  equivalence_partition(first, last, std::equal_to<element_type>());
}

Example.
