Efficient set union and intersection in C++

Keywords: efficient, c++      Updated: 2023-10-16

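Given two sets, set1 and set2, I need to compute the ratio of the size of their intersection to the size of their union. So far I have the following code: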

double ratio(const set<string>& set1, const set<string>& set2)
{
    if( set1.size() == 0 || set2.size() == 0 )
        return 0;
    set<string>::const_iterator iter;
    set<string>::const_iterator iter2;
    set<string> unionset;
    // compute intersection and union
    int len = 0;
    for (iter = set1.begin(); iter != set1.end(); iter++) 
    {
        unionset.insert(*iter);
        if( set2.count(*iter) )
            len++;
    }
    for (iter = set2.begin(); iter != set2.end(); iter++) 
        unionset.insert(*iter);
    return (double)len / (double)unionset.size();   
}

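It seems to be very slow (I call the function about 3M times, always with different sets). The Python counterpart, on the other hand, is much faster: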

def ratio(set1, set2):
    if not set1 or not set2:
        return 0
    return len(set1.intersection(set2)) / len(set1.union(set2))

Any ideas on how to improve the C++ version (preferably without using Boost)?

It can be done in linear time, without allocating any additional memory:

double ratio(const std::set<string>& set1, const std::set<string>& set2)
{
    if (set1.empty() || set2.empty()) {
        return 0.;
    }
    std::set<string>::const_iterator iter1 = set1.begin();
    std::set<string>::const_iterator iter2 = set2.begin();
    int union_len = 0;
    int intersection_len = 0;
    while (iter1 != set1.end() && iter2 != set2.end()) 
    {
        ++union_len;
        if (*iter1 < *iter2) {
            ++iter1;
        } else if (*iter2 < *iter1) {
            ++iter2;
        } else { // *iter1 == *iter2
            ++intersection_len;
            ++iter1;
            ++iter2;
        }
    }
    union_len += std::distance(iter1, set1.end());
    union_len += std::distance(iter2, set2.end());
    return static_cast<double>(intersection_len) / union_len;
}

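You don't actually need to construct the union set. In Python terms, len(s1.union(s2)) == len(s1) + len(s2) - len(s1.intersection(s2)); the size of the union is the sum of the sizes of s1 and s2, minus the number of elements counted twice, which is the number of elements in the intersection. So you can do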

int len = 0;  // number of elements present in both sets
for (const string &s : set1) {
    len += set2.count(s);
}
return ((double) len) / (set1.size() + set2.size() - len);