Fast set union of integers

Keywords: merge, set, integer. Updated: 2023-10-16

I need to compute a lot of unions of sorted sets of integers (I would like to avoid duplicates, but it is acceptable if some remain).

This is the best-performing code so far:

// some code added for better understanding
// (comp and comp2 are the comparators used for the lookup; their definitions are not shown here)
std::vector< std::pair<std::string, std::vector<unsigned int> > > vec_map;
vec_map.push_back(std::make_pair("hi", std::vector<unsigned int>({1, 12, 1450})));
vec_map.push_back(std::make_pair("stackoverflow", std::vector<unsigned int>({42, 1200, 14500})));
std::vector<unsigned int> match(const std::string & token){
    auto lower = std::lower_bound(vec_map.begin(), vec_map.end(), token, comp2());
    auto upper = std::upper_bound(vec_map.begin(), vec_map.end(), token, comp());
    std::vector<unsigned int> result;
    for(; lower != upper; ++lower){
        std::vector<unsigned int> other = lower->second;
        result.insert(result.end(), other.begin(), other.end());
    }
    std::sort(result.begin(), result.end()); // This function eats 99% of my running time
    return result;
}

valgrind (using the callgrind tool) tells me that I spend 99% of the time sorting.

This is what I have tried so far:

Using std::set (very bad performance)
Using std::set_union (poor performance)
Maintaining a heap with std::push_heap (50% slower)

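Is there any hope of gaining some performance somehow? I can change my containers and use boost, and maybe some other library (depending on its license).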

EDIT: the integers can be as large as 10 000 000.

EDIT 2: added some examples of how I use it, because there was some confusion.

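This looks like an instance of multi-way merge. Depending on the input (profile and time it!), the best algorithm may be the one you already have, or one that builds the result incrementally by repeatedly picking the smallest integer from all containers, or something more sophisticated.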

A custom merge sort may help a little.

#include <string>
#include <vector>
#include <algorithm>
#include <map>
#include <iostream>
#include <climits>
#include <iterator>
typedef std::multimap<std::string, std::vector<unsigned int> > vec_map_type;
vec_map_type vec_map;
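// Compare against the multimap's own value_type (pair<const std::string, ...>).
// Taking pair<std::string, ...> instead would build a temporary copy of the
// pair, including its vector, on every comparison.
struct comp {
    bool operator()(const std::string& lhs, const vec_map_type::value_type& rhs) const
    { return lhs < rhs.first; }
    bool operator()(const vec_map_type::value_type& lhs, const std::string& rhs) const
    { return lhs.first < rhs; }
};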
typedef comp comp2;
    std::vector<unsigned int> match(const std::string & token){
        auto lower = std::lower_bound(vec_map.begin(), vec_map.end(), token, comp2());
        auto upper = std::upper_bound(vec_map.begin(), vec_map.end(), token, comp());
        unsigned int num_vecs = std::distance(lower, upper);
        typedef std::vector<unsigned int>::const_iterator iter_type;
        std::vector<iter_type> curs;
        curs.reserve(num_vecs);
        std::vector<iter_type> ends;
        ends.reserve(num_vecs);
        std::vector<unsigned int> result;
        unsigned int result_count = 0;
        //keep track of current position and ends
        for(; lower != upper; ++lower){
            const std::vector<unsigned int> &other = lower->second;
            curs.push_back(other.cbegin());
            ends.push_back(other.cend());
            result_count += other.size();
        }
        result.reserve(result_count);
        //merge sort
        unsigned int last = UINT_MAX;
        if (result_count) {
            while(true) {
                //find which current position points to lowest number
                unsigned int least=0;
                for(unsigned int i=0; i< num_vecs; ++i ){
                    if (curs[i] != ends[i] && (curs[least]==ends[least] || *curs[i]<*curs[least]))
                        least = i;
                } 
                if (curs[least] == ends[least])
                    break;
                //push back lowest number and increase that vectors current position
                if( *curs[least] != last || result.size()==0) {
                    last = *curs[least];
                    result.push_back(last);
                }
                ++curs[least];
            }
        }
        return result;
    }
    int main() {
        vec_map.insert(vec_map_type::value_type("apple", std::vector<unsigned int>(10, 10)));
        std::vector<unsigned int> t;
        t.push_back(1); t.push_back(2); t.push_back(11); t.push_back(12);
        vec_map.insert(vec_map_type::value_type("apple", t));
        vec_map.insert(vec_map_type::value_type("apple", std::vector<unsigned int>()));
        std::vector<unsigned int> res = match("apple");
        for(unsigned int i=0; i<res.size(); ++i)
            std::cout << res[i] << ' ';
        return 0;
    }

http://ideone.com/1rYTi
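
The code above finds the smallest current element with a linear scan over all cursors, which costs O(k) per output element for k input vectors. When k is large, a min-heap can bring that down to O(log k). The following is only a minimal sketch of that variant, not a tuned implementation: the function name kway_union and the vector-of-vectors input are assumptions made for illustration, and it assumes every input vector is already sorted. Note that the question reports a std::push_heap attempt that was about 50% slower than sorting, so this is worth measuring before adopting.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper (not from the thread): merges already-sorted vectors
// into one sorted vector without duplicates, using a min-heap.
std::vector<unsigned int> kway_union(
        const std::vector<std::vector<unsigned int>>& inputs) {
    // Each heap entry is (current value, index of the input it came from).
    typedef std::pair<unsigned int, std::size_t> entry;
    std::priority_queue<entry, std::vector<entry>, std::greater<entry>> heap;

    std::vector<std::size_t> pos(inputs.size(), 0); // read position per input
    std::size_t total = 0;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        total += inputs[i].size();
        if (!inputs[i].empty())
            heap.push(entry(inputs[i][0], i));
    }

    std::vector<unsigned int> result;
    result.reserve(total); // upper bound; duplicates make the result smaller
    while (!heap.empty()) {
        const entry top = heap.top();
        heap.pop();
        // Values leave the heap in ascending order, so a duplicate can only
        // be equal to the last value written.
        if (result.empty() || result.back() != top.first)
            result.push_back(top.first);
        // Refill the heap from the input that supplied this value.
        const std::size_t src = top.second;
        if (++pos[src] < inputs[src].size())
            heap.push(entry(inputs[src][pos[src]], src));
    }
    return result;
}
```

With only a handful of vectors per token, the linear scan is likely to be just as fast or faster, since it avoids the heap bookkeeping.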

Alternative solution:

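std::sort (if it is based on quicksort) is very good at sorting unsorted vectors, O(N log N) on average, and does even better on vectors that are already sorted, but a plain quicksort can degrade to O(N^2) on unfavorable input such as a reverse-sorted vector. (Since C++11 the standard requires std::sort to be O(N log N) in the worst case, so the quadratic case only applies to older implementations.) When you perform the union, it may happen that you have many operands and that the first ones contain larger values than the ones that follow.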

I would try the following (I assume the elements of the input vectors are already sorted):

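As others have advised, reserve the required size in the result vector before you start filling it.
When std::distance(lower, upper) == 1 there is no need for a union; just copy the content of the single operand.
Sort the operands of the union, perhaps by size (larger first), or, if the ranges do not overlap or only partially overlap, simply by first value, so that you maximize the number of elements that are already in order for the next step. The best strategy probably takes both the SIZE and the RANGE of every operand into account.
If there are few operands, each with many elements, keep appending the elements to the back of the result vector, but after appending each vector (from the second one on) merge the old content with the appended content using std::inplace_merge. Note that std::inplace_merge keeps duplicates, so call std::unique afterwards if you need them removed.
If the number of operands is large (compared to the total number of elements), stay with your former strategy of sorting at the end, but call std::unique after sorting to remove duplicates. In this case you should order the operands by the range of elements they contain.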
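
The append-and-merge variant from the list above could look roughly like this. It is a sketch under stated assumptions: the name union_by_inplace_merge and the vector-of-vectors parameter are made up for illustration, every operand is assumed to be sorted already, and the step that reorders operands by size or range is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): union of sorted vectors by
// appending each operand and merging it with what has been accumulated.
std::vector<unsigned int> union_by_inplace_merge(
        const std::vector<std::vector<unsigned int>>& operands) {
    if (operands.size() == 1)
        return operands.front(); // single operand: a plain copy is enough

    std::size_t total = 0;
    for (const auto& v : operands)
        total += v.size();

    std::vector<unsigned int> result;
    result.reserve(total); // one allocation for the whole result
    for (const auto& v : operands) {
        const std::ptrdiff_t old_size =
            static_cast<std::ptrdiff_t>(result.size());
        result.insert(result.end(), v.begin(), v.end());
        // Both halves, [begin, begin + old_size) and [begin + old_size, end),
        // are sorted, which is exactly what std::inplace_merge requires.
        std::inplace_merge(result.begin(), result.begin() + old_size,
                           result.end());
    }
    // std::inplace_merge keeps duplicates, so remove them in one final pass.
    result.erase(std::unique(result.begin(), result.end()), result.end());
    return result;
}
```

Each merge pass touches everything accumulated so far, so this is most attractive when there are only a few large operands, which is the case the list describes.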

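If the number of elements is a relatively large percentage of the range of possible ints, you may get decent performance from what is essentially a simplified "hash join" (to use DB terminology).

(If the number of integers is small relative to the range of possible values, this is probably not the best approach.)

Essentially, we create one huge bitmap, set flags only at the indexes that correspond to the input ints, and finally rebuild the result from those flags: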

#include <vector>
#include <algorithm>
#include <iostream>
#include <time.h>
#include <limits>
template <typename ForwardIterator>
std::vector<int> IntSetUnion(
    ForwardIterator begin1,
    ForwardIterator end1,
    ForwardIterator begin2,
    ForwardIterator end2
) {
    int min = std::numeric_limits<int>::max();
    int max = std::numeric_limits<int>::min();
    for (auto i = begin1; i != end1; ++i) {
        min = std::min(*i, min);
        max = std::max(*i, max);
    }
    for (auto i = begin2; i != end2; ++i) {
        min = std::min(*i, min);
        max = std::max(*i, max);
    }
    if (min < std::numeric_limits<int>::max() && max > std::numeric_limits<int>::min()) {
        std::vector<int>::size_type result_size = 0;
        std::vector<bool> bitmap(max - min + 1, false);
        for (auto i = begin1; i != end1; ++i) {
            const std::vector<bool>::size_type index = *i - min;
            if (!bitmap[index]) {
                ++result_size;
                bitmap[index] = true;
            }
        }
        for (auto i = begin2; i != end2; ++i) {
            const std::vector<bool>::size_type index = *i - min;
            if (!bitmap[index]) {
                ++result_size;
                bitmap[index] = true;
            }
        }
        std::vector<int> result;
        result.reserve(result_size);
        for (std::vector<bool>::size_type index = 0; index != bitmap.size(); ++index)
            if (bitmap[index])
                result.push_back(index + min);
        return result;
    }
    return std::vector<int>();
}
int main() {
    // Basic sanity test...
    {
        std::vector<int> v1;
        v1.push_back(2);
        v1.push_back(2000);
        v1.push_back(229013);
        v1.push_back(-2243);
        v1.push_back(-530);
        std::vector<int> v2;
        v2.push_back(2);
        v2.push_back(243);
        v2.push_back(90120);
        v2.push_back(329013);
        v2.push_back(-530);
        auto result = IntSetUnion(v1.begin(), v1.end(), v2.begin(), v2.end());
        for (auto i = result.begin(); i != result.end(); ++i)
            std::cout << *i << std::endl;
    }
    // Benchmark...
    {
        const auto count = 10000000;
        std::vector<int> v1(count);
        std::vector<int> v2(count);
        for (int i = 0; i != count; ++i) {
            v1[i] = i;
            v2[i] = i - count / 2;
        }
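        // Note: std::random_shuffle was removed in C++17; with a newer standard,
        // use std::shuffle with an engine from <random> instead.
        std::random_shuffle(v1.begin(), v1.end());
        std::random_shuffle(v2.begin(), v2.end());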
        auto start_time = clock();
        auto result = IntSetUnion(v1.begin(), v1.end(), v2.begin(), v2.end());
        auto end_time = clock();
        std::cout << "Time: " << (((double)end_time - start_time) / CLOCKS_PER_SEC) << std::endl;
        std::cout << "Union element count: " << result.size() << std::endl;
    }
}

This prints…

Time: 0.402

…on my machine.

If you want to take the input ints from something other than std::vector<int>, you can implement your own iterator and pass it to IntSetUnion.

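You are sorting integers with a limited range, which is one of the few cases where radix sort can be used. Unfortunately, the only way to know whether it beats a general-purpose sort is to try it.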
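
For experimenting with that idea, here is a minimal sketch of a least-significant-digit radix sort that could stand in for the std::sort call in the question. The function name radix_sort is an assumption for illustration, and the code assumes unsigned int is 32 bits wide. It has not been benchmarked against std::sort here.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): least-significant-digit radix
// sort, one byte per pass. Assumes unsigned int is 32 bits wide.
void radix_sort(std::vector<unsigned int>& values) {
    std::vector<unsigned int> buffer(values.size());
    for (unsigned int shift = 0; shift < 32; shift += 8) {
        // Count how many values fall into each of the 256 buckets.
        // The count for byte b is stored at index b + 1.
        std::array<std::size_t, 257> offsets = {};
        for (unsigned int v : values)
            ++offsets[((v >> shift) & 0xFFu) + 1];
        // Prefix sums turn the counts into each bucket's start offset.
        for (std::size_t b = 1; b < offsets.size(); ++b)
            offsets[b] += offsets[b - 1];
        // Stable scatter into the buffer, then swap so that `values`
        // always holds the output of the latest pass.
        for (unsigned int v : values)
            buffer[offsets[(v >> shift) & 0xFFu]++] = v;
        values.swap(buffer);
    }
}
```

Since the question states that the values do not exceed 10 000 000, which fits in 24 bits, the top byte is always zero and the fourth pass only copies the data, so three passes would be enough for that input. Duplicates can still be removed afterwards with std::unique, as with any other sort.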