Intersection Between Two String Sets with Substring Compare

Keywords: string, two, between. Last updated: 2023-10-16

I know this is bikeshedding, but is there anything with better complexity than A.size * B.size * comp_substr, which is what the naive solution I came up with has?

    std::copy_if(devices.cbegin(), devices.cend(),
                          std::back_inserter(ports),
                          [&comport_keys] (const auto& v) {
        return std::any_of(comport_keys.begin(),comport_keys.end(), [&v](auto& k) {
           return v.find(k) != std::string::npos;
        });
    });

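If B contained whole strings from A, using std::set_intersection would be straightforward, with a complexity of (A.size + B.size) * comp_substr. Even if the input had to be sorted first (n * log(n)), that would still be better. However, I don't know how to write the comparison function for it, or for the sort, or for both.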

    #define BOOST_TEST_MODULE My Test
    #include <boost/test/included/unit_test.hpp>
    #include <vector>
    #include <string>
    #include <algorithm>
    #include <iterator>
    #include <set>
    BOOST_AUTO_TEST_CASE(TEST) {
        std::vector<std::string> devices{
                "tty1",
                "ttyOfk",
                "ttyS05",
                "bsd",
        }, ports{};
        const std::set<std::string> comport_keys{
                "ttyS",
                "ttyO",
                "ttyUSB",
                "ttyACM",
                "ttyGS",
                "ttyMI",
                "ttymxc",
                "ttyAMA",
                "ttyTHS",
                "ircomm",
                "rfcomm",
                "tnt",
                "cu",
                "ser",
        };
        std::sort(devices.begin(), devices.end());
        std::set_intersection(devices.cbegin(), devices.cend(),
                              comport_keys.cbegin(), comport_keys.cend(),
                              std::back_inserter(ports),
                              [&comport_keys] (auto a, auto b) {
            return a.find(b) != std::string::npos; //This is wrong
        });
        const std::vector<std::string>test_set {
                "ttyOfk",
                "ttyS05",
        };
        BOOST_TEST(ports == test_set);
    }

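Say we have two sets of strings, A and B. B contains a set of potential prefixes for the items in A. We want to test each element a of A against all potential prefixes in B. If we find a matching prefix, we store a in the result C. The trivial solution works in O(|A| |B|). You ask: can we optimise this?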

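You say that B is already sorted. In that case we can build a generalised prefix tree on B in linear time and query it with each string in A, solving the problem in O(|A| + |B|). The problem is that sorting B takes O(|B| log |B|), and the tree is non-trivial.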
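To make the prefix-tree idea concrete, here is a minimal sketch of a plain (not generalised) trie over the keys. The names `PrefixTrie`, `has_prefix_of` and `filter_by_trie` are hypothetical and not part of the original answer, and `std::map` children are an assumption chosen for brevity, so each step costs a logarithmic factor in the alphabet size rather than constant time.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical helper type: a minimal trie holding the prefix keys (set B).
struct PrefixTrie {
    struct Node {
        bool is_key = false;                         // a key ends at this node
        std::map<char, std::unique_ptr<Node>> next;  // one child per character
    };
    Node root;

    // Adds one key, creating nodes along its path as needed.
    void insert(const std::string& key) {
        Node* n = &root;
        for (char c : key) {
            auto& child = n->next[c];
            if (!child) child = std::make_unique<Node>();
            n = child.get();
        }
        n->is_key = true;
    }

    // True if any stored key is a prefix of s.
    bool has_prefix_of(const std::string& s) const {
        const Node* n = &root;
        for (char c : s) {
            if (n->is_key) return true;             // a key ended on the path so far
            auto it = n->next.find(c);
            if (it == n->next.end()) return false;  // no key continues with c
            n = it->second.get();
        }
        return n->is_key;                           // s itself may be a key
    }
};

// Keeps the elements of data (set A) that start with at least one key,
// preserving the order of data.
std::vector<std::string> filter_by_trie(const std::vector<std::string>& data,
                                        const std::vector<std::string>& keys) {
    PrefixTrie trie;
    for (const auto& k : keys) trie.insert(k);
    std::vector<std::string> out;
    for (const auto& d : data)
        if (trie.has_prefix_of(d)) out.push_back(d);
    return out;
}
```

Building the trie touches every character of every key once, and each query walks at most the length of the queried string, so the work is proportional to the total input length. Note that this sketch does not need B to be sorted at all.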

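So I offer a simple solution that runs in O(|A| log |B|), which is more efficient than O(|A| |B|) when |A| is small, as in your example. B is still assumed to be sorted (sorting is actually the upper bound here...).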

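    #include <set>
    #include <string>
    #include <vector>

    // True if prefix is a prefix of s. A plain find() != npos would also accept
    // keys that occur in the middle of s, which is not a prefix match.
    static bool is_prefix_of(const std::string& prefix, const std::string& s) {
        return s.compare(0, prefix.size(), prefix) == 0;
    }

    // Precondition check: there is at least one key and no key is a prefix of
    // another. In sorted order it is enough to compare each key with its successor.
    bool
    validate_prefixes(const std::multiset<std::string>& keys) {
        auto itb = keys.begin(), it = itb;
        if(it == keys.end()) return false; //no keys
        for(++it; it != keys.end(); ++it, ++itb) {
            if( is_prefix_of(*itb, *it) ) return false; //redundant keys
        }
        return true;
    }

    // Copies every element of data that starts with one of prefix_keys into dest.
    // prefix_keys is modified temporarily, but each inserted element is erased again.
    // Returns true only if check was requested and the keys passed validation.
    bool
    copy_from_intersecting_prefixes(const std::vector<std::string>& data,
                                    std::multiset<std::string>& prefix_keys,
                                    std::vector<std::string>& dest, bool check = false) {
        if(check && !validate_prefixes(prefix_keys)) return false;
        for(auto it_data = data.begin(); it_data != data.end(); ++it_data) {
            auto ptr = prefix_keys.insert(*it_data), ptrb = ptr;
            if(ptrb != prefix_keys.begin()) {  //if data is at the start, there is no prefix
                if( is_prefix_of(*(--ptrb), *ptr) ) dest.push_back(*it_data);
            }
            prefix_keys.erase(ptr);
        } //Complexity: O(|data|) * O( log(|prefix_keys|) ) * O(substr) = loop*insert*compare
        return check;
    }

    //.... in main()
    std::multiset<std::string> tmp(comport_keys.begin(), comport_keys.end()); //copy const
    copy_from_intersecting_prefixes(devices, tmp, ports);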

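validate_prefixes enforces the precondition. It checks that we have at least one valid prefix and that the keys do not match each other. For example, we could have the keys cu and cu2, but cu is a prefix of cu2, so they cannot both be valid prefixes: either cu is too general or cu2 is too specific. If we try to match cu3 against cu and cu2, the result is inconsistent. Here validate_prefixes(comport_keys) returns true, but it might be nice to check this automatically.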

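copy_from_intersecting_prefixes does the work that was actually asked for. It iterates over A and inserts each a into the sorted B. A prefix compares less than the prefix followed by an ending, so if a corresponding prefix exists, it occurs before a in B. Since no key matches another key, we know that the prefix immediately precedes a in B. So we decrement the iterator obtained for a and compare. Note that the prefix may be equal to a, which is why we need a multiset.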
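The predecessor lookup described above can also be sketched without modifying B, by asking the set for the first key greater than a and stepping back once. This is a hedged variant rather than the code from the answer: `starts_with` and `filter_by_predecessor` are hypothetical names, and the sketch assumes, like the original, that no key is a prefix of another.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: true if prefix is a prefix of s.
bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical read-only variant: keeps every element of data whose
// predecessor in keys (the greatest key <= element) is a prefix of it.
// Assumption: no key is a prefix of another key.
std::vector<std::string> filter_by_predecessor(const std::vector<std::string>& data,
                                               const std::set<std::string>& keys) {
    std::vector<std::string> out;
    for (const auto& d : data) {
        auto it = keys.upper_bound(d);     // first key strictly greater than d
        if (it == keys.begin()) continue;  // every key is greater than d
        --it;                              // greatest key that is <= d
        if (starts_with(d, *it)) out.push_back(d);
    }
    return out;
}
```

With the devices and comport_keys from the question this keeps ttyOfk and ttyS05. It also shows why the precondition matters: with the keys cu and cu2, the predecessor of cu3 is cu2, which is not a prefix of cu3, so cu3 is missed even though cu would match.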