Merging two arrays of a class

Keywords: two, arrays, one, merge. Updated: 2023-10-16

I have two arrays of a class called Record. The class Record is defined as follows:

class Record {
public:
    char* string; // the word string
    int count;    // frequency with which the word appears
};

These are the two arrays, already declared and initialized:

Record* recordarray1 = new Record[9000000]; // contains 9000000 unsorted Records
Record* recordarray2 = new Record[8000000]; // contains 8000000 unsorted Records

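The goal is to find the strings that match between the two arrays and add them to a new array with their counts added together; a string that appears in only one of the arrays is simply added to the new array as is. To do this, I tried sorting both arrays first (alphabetically by string) and then comparing recordarray1 against recordarray2: if the strings match, advance the index of recordarray2, otherwise advance the index of recordarray1 until a match is found. If no match is found, the record is added to the new array.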
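The sort-then-merge approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Record` here is a stand-in that uses `const char*` so string literals can be used, `byString` and `mergeSorted` are hypothetical names, the arrays are held in `std::vector`, and each string is assumed to occur at most once within a single array.

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Assumed stand-in for the question's Record.
struct Record {
    const char* string; // the word
    int count;          // frequency with which the word appears
};

// Orders records alphabetically by their string.
static bool byString(const Record& a, const Record& b) {
    return std::strcmp(a.string, b.string) < 0;
}

// Hypothetical helper: sorts both inputs (taken by value, so the caller's
// data is untouched), then merges them in one linear pass.
std::vector<Record> mergeSorted(std::vector<Record> a, std::vector<Record> b) {
    std::sort(a.begin(), a.end(), byString);
    std::sort(b.begin(), b.end(), byString);

    std::vector<Record> out;
    out.reserve(a.size() + b.size());
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        int cmp = std::strcmp(a[i].string, b[j].string);
        if (cmp == 0) {        // same word in both arrays: add the counts
            out.push_back({a[i].string, a[i].count + b[j].count});
            ++i;
            ++j;
        } else if (cmp < 0) {  // word present only in a
            out.push_back(a[i++]);
        } else {               // word present only in b
            out.push_back(b[j++]);
        }
    }
    // Copy whatever is left over in either array.
    out.insert(out.end(), a.begin() + i, a.end());
    out.insert(out.end(), b.begin() + j, b.end());
    return out;
}
```

The merge pass itself is linear; with arrays of this size the two sorts are where nearly all of the time goes.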

Unfortunately this approach is too slow; the STL sort alone takes more than 20 seconds. Is there a faster standard sorting method that I have missed?

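If I understand correctly, your algorithm should take O( n log n + m log m [sorting both arrays] + n + m [walking through the arrays and comparing] ).
This may not be much of an optimization, but you could try sorting only one of the arrays and using binary search to check whether each element of the other array is present in it. That should then take O( n [copying one array into the new array] + n log n [sorting it] + m log n [binary search for each element of the second array] ).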
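A minimal sketch of the sort-one-array-and-binary-search idea, under some assumptions: `Record` is a stand-in using `const char*`, `byString` and `mergeWithBinarySearch` are hypothetical names, and strings are assumed to be unique within each array. Records from the second array that are not found are appended after the sorted part, which is never searched past.

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Assumed stand-in for the question's Record.
struct Record {
    const char* string; // the word
    int count;          // frequency with which the word appears
};

// Orders records alphabetically by their string.
static bool byString(const Record& a, const Record& b) {
    return std::strcmp(a.string, b.string) < 0;
}

// Hypothetical helper: 'a' is taken by value, so it is the copy that
// becomes the new array; 'b' is only read.
std::vector<Record> mergeWithBinarySearch(std::vector<Record> a,
                                          const std::vector<Record>& b) {
    std::sort(a.begin(), a.end(), byString);
    const std::size_t sortedEnd = a.size(); // only [0, sortedEnd) is sorted

    for (const Record& r : b) {
        // Recompute the iterators each time: push_back may reallocate.
        auto last = a.begin() + sortedEnd;
        auto it = std::lower_bound(a.begin(), last, r, byString);
        if (it != last && std::strcmp(it->string, r.string) == 0) {
            it->count += r.count; // match: combine the counts
        } else {
            a.push_back(r);       // no match: append to the unsorted tail
        }
    }
    return a;
}
```

Note that the result is sorted only up to the original size of the first array; the appended records keep the order they had in the second array.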

HTH

Sorting objects can be expensive, so I would try to avoid it.

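A faster approach might be to use std::hash_map to build an index for each array, with the string as the key and the array index as the value. You end up with two containers that can be iterated in a single pass. The iterator pointing at the smaller value keeps advancing until it finds a match or until the other iterator points at the smaller value. This gives you a predictable number of iterations.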

A possible solution is to use unordered_map. The algorithm is as follows:

Put the first array into the map, using strings as keys and count as values. 
For each member in the second array, check it against containment in the map.
    If it exists there
        Put the record into the new array, combining counts
        Remove the record from the map
    Else
        Put the record into the new array
Iterate through the remaining records in the map and put them into the new array.

The complexity of this algorithm is approximately O(n+m).
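The steps above might look something like this in C++11. It is a sketch under assumptions, not a definitive implementation: `WordCount` and `mergeWithMap` are hypothetical names, the records hold `std::string` rather than `char*` so the map can own its keys, and strings in the second array are assumed to be unique.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record type that owns its string.
struct WordCount {
    std::string word;
    int count;
};

std::vector<WordCount> mergeWithMap(const std::vector<WordCount>& first,
                                    const std::vector<WordCount>& second) {
    // Put the first array into the map: string as key, count as value.
    std::unordered_map<std::string, int> counts;
    counts.reserve(first.size());
    for (const WordCount& r : first) {
        counts[r.word] += r.count;
    }

    std::vector<WordCount> out;
    out.reserve(first.size() + second.size());
    for (const WordCount& r : second) {
        auto it = counts.find(r.word);
        if (it != counts.end()) {
            // Present in both arrays: combine counts, then drop it from the map.
            out.push_back({r.word, r.count + it->second});
            counts.erase(it);
        } else {
            out.push_back(r); // only in the second array
        }
    }
    // Whatever is still in the map appeared only in the first array.
    for (const auto& kv : counts) {
        out.push_back({kv.first, kv.second});
    }
    return out;
}
```

The O(n+m) figure is the average case for a hash table, and the order in which the leftover map entries are appended is unspecified.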

I don't think sorting is needed. You can use the following algorithm:

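1. Iterate through recordarray1.
2. Search for each element in recordarray2. If the element is found, increase the count in the new array. Also set recordarray2[N]::count to a negative value, so that it is not checked again in step 3.
3. Put all elements of recordarray2 whose count has not been set to a negative value into the new array. If a negative count is encountered, change it back to positive.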

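Note: this algorithm does not consider whether the same string appears more than once within a single array. Also, don't use string as a variable name, because it is also a type name, std::string.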
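For illustration, the three steps could be sketched as below. The assumptions: the member is named `word` rather than `string`, following the note above; `mergeByMarking` is a hypothetical name; all counts are strictly positive, so a negative value can act as the "already merged" marker; and the linear search in step 2 makes this O(n*m), which is probably only practical for small inputs.

```cpp
#include <cstring>
#include <vector>

// Assumed stand-in for the question's Record, with the member renamed.
struct Record {
    const char* word;
    int count; // assumed to be strictly positive
};

// Hypothetical helper: 'b' is modified temporarily (counts are negated
// as markers) and restored before returning.
std::vector<Record> mergeByMarking(const std::vector<Record>& a,
                                   std::vector<Record>& b) {
    std::vector<Record> out;
    out.reserve(a.size() + b.size());

    // Steps 1 and 2: for each record of a, look for the same word in b.
    for (const Record& ra : a) {
        Record merged = ra;
        for (Record& rb : b) {
            if (rb.count > 0 && std::strcmp(ra.word, rb.word) == 0) {
                merged.count += rb.count;
                rb.count = -rb.count; // mark as merged so step 3 skips it
                break;
            }
        }
        out.push_back(merged);
    }

    // Step 3: add the unmarked records of b; restore the marked ones.
    for (Record& rb : b) {
        if (rb.count < 0) {
            rb.count = -rb.count;
        } else {
            out.push_back(rb);
        }
    }
    return out;
}
```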