std::vector differences

Keywords: vector, std. Last updated: 2023-10-16

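How do I determine what the differences between two vectors are?

I have vector<int> v1 and vector<int> v2;

What I am looking for is a vector<int> vDifferences that contains only the elements that are in v1 or in v2, but not in both.

Is there a standard way to do this?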

Here is a complete and correct answer. Before the set_symmetric_difference algorithm can be used, the source ranges must be sorted:

  using namespace std; // For brevity, don't do this in your own code...
  vector<int> v1;
  vector<int> v2;
  // ... Populate v1 and v2
  // For the set_symmetric_difference algorithm to work, 
  // the source ranges must be ordered!    
  vector<int> sortedV1(v1);
  vector<int> sortedV2(v2);
  sort(sortedV1.begin(),sortedV1.end());
  sort(sortedV2.begin(),sortedV2.end());
  // Now that we have sorted ranges (i.e., containers), find the differences    
  vector<int> vDifferences;
  set_symmetric_difference(
    sortedV1.begin(),
    sortedV1.end(),
    sortedV2.begin(),
    sortedV2.end(),
    back_inserter(vDifferences));
  // ... do something with the differences
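For readers who want something they can compile as is, the fragment above can be wrapped in a small function. This is only a sketch: the name symmetric_difference_sorted is made up for illustration and is not part of the standard library. It takes both vectors by value so that the callers' originals stay unsorted.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical helper (not a standard function): returns the elements that
// occur in exactly one of the two inputs. The parameters are copies, so the
// callers' vectors are left in their original order.
std::vector<int> symmetric_difference_sorted(std::vector<int> a, std::vector<int> b)
{
  // set_symmetric_difference requires both input ranges to be sorted.
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());

  std::vector<int> result;
  std::set_symmetric_difference(a.begin(), a.end(),
                                b.begin(), b.end(),
                                std::back_inserter(result));

  // The algorithm writes its output in sorted order.
  assert(std::is_sorted(result.begin(), result.end()));
  return result;
}
```

Called with {5, 1, 3, 2} and {3, 4, 5, 6}, this returns {1, 2, 4, 6}. Duplicates are counted: if a value occurs m times in one input and n times in the other, |m - n| copies of it appear in the result.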

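It should be noted that sorting is an expensive operation (O(n log n) for common STL implementations). Especially when one or both of the containers are very large (millions of integers or more), a different algorithm that uses hash tables may be preferable on grounds of algorithmic complexity. Here is a high-level description of that algorithm:

1. Load each container into a hash table.
2. If the two containers differ in size, the hash table corresponding to the smaller one will be used for the traversal in step 3. Otherwise, the first hash table will be used.
3. Traverse the hash table chosen in step 2, checking whether each item is present in both hash tables. If it is, remove it from both of them. The smaller hash table is preferred for the traversal because hash table lookups are O(1) on average regardless of container size. The traversal time is therefore a linear function of n (i.e., O(n)), where n is the size of the hash table being traversed.
4. Take the union of the items remaining in the hash tables and store the result in the differences container.

C++11 gives us the means for such a solution by standardizing the unordered_multiset container. I have also used the new meaning of the auto keyword for the explicit initializations, to make the following hash table-based solution more concise: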

#include <unordered_set> // unordered_multiset
#include <vector>
using namespace std; // For brevity, don't do this in your own code...
// The remove_common_items function template removes some and / or all of the
// items that appear in both of the multisets that are passed to it. It uses the
// items in the first multiset as the criteria for the multi-presence test.
template <typename tVal>
void remove_common_items(unordered_multiset<tVal> &ms1, 
                         unordered_multiset<tVal> &ms2)
{
  // Go through the first hash table
  for (auto cims1=ms1.cbegin();cims1!=ms1.cend();)
  {
    // Find the current item in the second hash table
    auto cims2=ms2.find(*cims1);
    // Is it present?
    if (cims2!=ms2.end())
    {
      // If so, remove it from both hash tables
      cims1=ms1.erase(cims1);
      ms2.erase(cims2);
    }
    else // If not
      ++cims1; // Move on to the next item
  }
}
int main()
{
  vector<int> v1;
  vector<int> v2;
  // ... Populate v1 and v2
  // Create two hash tables that contain the values
  // from their respective initial containers    
  unordered_multiset<int> ms1(v1.begin(),v1.end());
  unordered_multiset<int> ms2(v2.begin(),v2.end());
  // Remove common items from both containers based on the smallest
  if (v1.size()<=v2.size())
    remove_common_items(ms1,ms2);
  else
    remove_common_items(ms2,ms1);
  // Create a vector of the union of the remaining items
  vector<int> vDifferences(ms1.begin(),ms1.end());
  vDifferences.insert(vDifferences.end(),ms2.begin(),ms2.end());
  // ... do something with the differences
}

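To determine which solution is better for a particular situation, profiling both algorithms would be a wise course of action. Although the hash table-based solution is O(n) on average, it requires more code and does more work for each duplicate found (i.e., the hash table removals). It also (sadly) uses a custom differencing function rather than a standard STL algorithm.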
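As a rough starting point for that kind of measurement, here is a minimal timing sketch. Every name in it (diff_by_sorting, diff_by_hashing, elapsed_ms, make_input, Timings, compare_approaches) is invented for illustration, and for brevity the hash-based variant always traverses the first table instead of picking the smaller one. A single wall-clock run like this is noisy, so treat the numbers as an indication only and repeat the measurement with realistic data and an optimized build.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iterator>
#include <random>
#include <unordered_set>
#include <vector>

// Sort-based approach: sort copies, then use the standard algorithm.
std::vector<int> diff_by_sorting(std::vector<int> a, std::vector<int> b)
{
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  std::vector<int> out;
  std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(),
                                std::back_inserter(out));
  return out;
}

// Hash-based approach: cancel matching items pairwise, keep what is left.
std::vector<int> diff_by_hashing(const std::vector<int> &a, const std::vector<int> &b)
{
  std::unordered_multiset<int> ms1(a.begin(), a.end());
  std::unordered_multiset<int> ms2(b.begin(), b.end());
  for (auto it = ms1.begin(); it != ms1.end();)
  {
    auto match = ms2.find(*it);
    if (match != ms2.end())
    {
      it = ms1.erase(it); // erase returns the next valid iterator
      ms2.erase(match);
    }
    else
      ++it;
  }
  std::vector<int> out(ms1.begin(), ms1.end());
  out.insert(out.end(), ms2.begin(), ms2.end());
  return out;
}

// Runs fn once and returns the elapsed wall-clock time in milliseconds.
template <typename Fn>
double elapsed_ms(Fn &&fn)
{
  const auto start = std::chrono::steady_clock::now();
  fn();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Builds n pseudo-random values from a fixed seed so runs are repeatable.
std::vector<int> make_input(std::size_t n, unsigned seed)
{
  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> dist(0, 1000000);
  std::vector<int> v(n);
  for (int &x : v)
    x = dist(gen);
  return v;
}

struct Timings
{
  double sort_ms;
  double hash_ms;
};

// Times both approaches on the same inputs and checks that they agree.
Timings compare_approaches(const std::vector<int> &a, const std::vector<int> &b)
{
  std::vector<int> by_sort, by_hash;
  Timings t{};
  t.sort_ms = elapsed_ms([&] { by_sort = diff_by_sorting(a, b); });
  t.hash_ms = elapsed_ms([&] { by_hash = diff_by_hashing(a, b); });
  // The hash-based result has no defined order, so normalize before comparing.
  std::sort(by_hash.begin(), by_hash.end());
  assert(by_sort == by_hash);
  return t;
}
```

A call such as compare_approaches(make_input(1000000, 1), make_input(1000000, 2)) returns the two timings for one run. Which approach wins will depend on the input sizes, the value distribution, and the standard library implementation.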

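It is important to note that both solutions will very likely present the differences in an order quite different from the order in which the elements appeared in the original containers. There is a way around this, by using a variation of the hash table solution. A high-level description follows (it differs from the preceding solution only in step 4):

1. Load each container into a hash table.
2. If the two containers differ in size, the smaller hash table will be used for the traversal in step 3. Otherwise, the first one will be used.
3. Traverse the hash table chosen in step 2, checking whether each item is present in both hash tables. If it is, remove it from both of them.
4. To form the differences container, traverse the original containers in order (i.e., the first container before the second). Look up each item in its container's respective hash table. If it is found, add the item to the differences container and remove it from its hash table. Items that are not present in their respective hash table are skipped. Thus only the items that remain in the hash tables end up in the differences container, and they appear in the same order as in the original containers, because those containers dictate the order of the final traversal.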

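To preserve the original order, step 4 becomes more expensive than in the preceding solution, especially when the number of removed items is high. This is because:

1. Every item is tested a second time, through a presence test in its respective hash table, to see whether it is eligible to appear in the differences container.
2. The hash tables have their remaining items removed one at a time as the differences container is formed, as part of the presence test described in item 1.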
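The order-preserving variant described above could be sketched as follows. This is an illustration under assumptions, not code from the answer: append_survivors and ordered_differences are hypothetical names, and remove_common_items is rewritten here with explicit std:: qualification so that the block compiles on its own.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Steps 1 to 3: cancel items that are present in both multisets, one
// occurrence at a time, using the first multiset to drive the traversal.
template <typename T>
void remove_common_items(std::unordered_multiset<T> &ms1,
                         std::unordered_multiset<T> &ms2)
{
  for (auto it = ms1.begin(); it != ms1.end();)
  {
    auto match = ms2.find(*it);
    if (match != ms2.end())
    {
      it = ms1.erase(it); // erase returns the next valid iterator
      ms2.erase(match);
    }
    else
      ++it;
  }
}

// Step 4 for one container: walk the source in its original order and keep
// each item that still has an occurrence left in its hash table.
template <typename T>
void append_survivors(const std::vector<T> &source,
                      std::unordered_multiset<T> &remaining,
                      std::vector<T> &out)
{
  for (const T &item : source)
  {
    auto it = remaining.find(item);
    if (it != remaining.end())
    {
      out.push_back(item);
      remaining.erase(it); // consume exactly one occurrence
    }
  }
}

// Hypothetical entry point: differences of v1 and v2 in original order,
// with the items from v1 placed before the items from v2.
template <typename T>
std::vector<T> ordered_differences(const std::vector<T> &v1,
                                   const std::vector<T> &v2)
{
  std::unordered_multiset<T> ms1(v1.begin(), v1.end());
  std::unordered_multiset<T> ms2(v2.begin(), v2.end());

  // Traverse the smaller table, as in step 2.
  if (ms1.size() <= ms2.size())
    remove_common_items(ms1, ms2);
  else
    remove_common_items(ms2, ms1);

  std::vector<T> out;
  out.reserve(ms1.size() + ms2.size());
  append_survivors(v1, ms1, out);
  append_survivors(v2, ms2, out);

  // Every surviving item came from its own source, so both tables are empty.
  assert(ms1.empty() && ms2.empty());
  return out;
}
```

With v1 = {9, 1, 5, 3} and v2 = {3, 8, 9, 2}, this yields {1, 5, 8, 2}. When a value has duplicates and only some of them survive, this sketch keeps the earliest occurrences in each source container; the description above does not say which occurrences should be kept, so that is a choice made here.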

Do you want the elements of v1 and v2 that are unique and not in the other sequence? That sounds like std::set_symmetric_difference to me.

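Copies the elements from the range [first1, last1) that are not found in the range [first2, last2), together with the elements from the range [first2, last2) that are not found in the range [first1, last1), to the range beginning at result. The elements in the constructed range are sorted.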