Sorting referenced substrings using C++'s sort?

Keywords: sorting, strings, references, C++, using. Updated: 2023-10-16

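I have two long strings (about a million characters each) from which I want to generate suffixes and sort them in order to find the longest shared substring, since this is much faster than brute-forcing all possible substrings. I'm most familiar with Python, but my quick calculation estimated 40 Tb of suffixes, so I'm hoping I can use C++ (as suggested) to sort references to the substrings within each main, unchanging string.

I need to keep the index of each substring so that I can find the value and the source string later, so any advice on the kind of data structure I could use that would 1) allow sorting of the referenced strings and 2) keep track of the original index would be very helpful!

Current pseudocode: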

//Function to make a vector(?) of structures that contain the reference to the string and the original index
int main() {
//Declare strings
string str1="This is a very long string with some repeats of strings.";
string str2="This is another string with some repeats that is very long.";
//Call function to make array
//Pass vector to sort(v.begin(), v.end()), somehow telling it to dereference?
//Process the output in multilayer loop to find the longest exact match
// "string with some repeats"
return 0;}

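First off, you should use a suffix tree for this. But I'll answer your original question.

C++17:

Note: uses experimental features (std::string_view was available as std::experimental::string_view before it was standardized in C++17).

You can use std::string_view to reference a string without copying it. Here is some example code: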

#include <algorithm>
#include <string_view>
#include <vector>
using namespace std;
//Declare string (string literals are const, so use const char*)
const char* str1 = "This is a very long string with some repeats of strings.";
int main() {
    //Call function to make array
    vector<string_view> substrings;
    //example of adding substring [5,19) into vector
    substrings.push_back(string_view(str1 + 5, 19 - 5));
    //Pass vector to sort(v.begin(), v.end())
    sort(substrings.begin(), substrings.end());
    return 0;
}
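
To cover the "keep track of the original index" part of the question, here is a minimal sketch that builds on the string_view idea. The names SuffixRef, commonPrefix and longestShared are hypothetical and not from the answer above. It assumes both strings fit in memory, stores one view per suffix together with its source and offset, sorts the views, and then compares only neighbouring suffixes that come from different strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical record: a suffix view plus where it came from.
struct SuffixRef {
    std::string_view text;  // view into the source string, no copy
    int source;             // 0 = first string, 1 = second string
    std::size_t offset;     // start index within its source string
};

// Length of the common prefix of two views.
std::size_t commonPrefix(std::string_view a, std::string_view b) {
    std::size_t n = std::min(a.size(), b.size());
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) ++i;
    return i;
}

// Returns the longest substring shared by s1 and s2 as a view into one of them.
std::string_view longestShared(std::string_view s1, std::string_view s2) {
    std::vector<SuffixRef> refs;
    refs.reserve(s1.size() + s2.size());
    for (std::size_t i = 0; i < s1.size(); ++i) refs.push_back({s1.substr(i), 0, i});
    for (std::size_t i = 0; i < s2.size(); ++i) refs.push_back({s2.substr(i), 1, i});

    // Sort by the referenced text; source and offset travel with each view.
    std::sort(refs.begin(), refs.end(),
              [](const SuffixRef& a, const SuffixRef& b) { return a.text < b.text; });

    // The best match is always found between sorted neighbours
    // that originate from different strings.
    std::string_view best;
    for (std::size_t i = 1; i < refs.size(); ++i) {
        if (refs[i - 1].source == refs[i].source) continue;
        std::size_t len = commonPrefix(refs[i - 1].text, refs[i].text);
        if (len > best.size()) best = refs[i].text.substr(0, len);
    }
    return best;
}
```

Calling longestShared(str1, str2) with the two example strings from the question returns a view covering "string with some repeats" plus the space on either side, since both strings have a space there. Each comparison during the sort can still walk a long common prefix, so this saves memory rather than comparison time.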

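Anything before C++17:

You can use a custom predicate with the sort function. Instead of having your vector store the actual strings, have it store pairs containing the indices.

Here is an example of the code needed to make this work: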

#include <algorithm>
#include <string>
#include <utility>
#include <vector>
using namespace std;
//Declare string
string str1="This is a very long string with some repeats of strings.";
bool pred(pair<int,int> a, pair<int,int> b){
    int substring1start=a.first,
        substring1end=a.second;
    int substring2start=b.first,
        substring2end=b.second;
    //use a for loop to manually compare substring1 and substring 2
    ...
    //return true if substring1 should go before substring2 in vector
    //otherwise return false
}
int main() {
    //Call function to make array
    vector<pair<int,int>> substrings;
    //example of adding substring [1,19) into vector
    substrings.push_back({1,19});
    //Pass vector to sort(v.begin(), v.end), passing custom predicate
    sort(substrings.begin(), substrings.end(), pred);
    return 0;
}
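
One possible way to fill in the comparison step is std::lexicographical_compare, which avoids writing the loop by hand. The sketch below is an assumption about how that could look, not the answer's own code. The names Range, RangeLess and sortedSuffixRanges are hypothetical, and the comparator holds a pointer to the string instead of relying on a global.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias: a half-open index range [start, end) into a string.
typedef std::pair<std::size_t, std::size_t> Range;

// Comparator that orders index ranges by the text they refer to.
struct RangeLess {
    const std::string* text;  // the string the ranges index into
    explicit RangeLess(const std::string& s) : text(&s) {}
    bool operator()(const Range& a, const Range& b) const {
        return std::lexicographical_compare(
            text->begin() + a.first, text->begin() + a.second,
            text->begin() + b.first, text->begin() + b.second);
    }
};

// Builds one range per suffix of s and sorts the ranges by suffix text.
// The first element of each pair is the original start index.
std::vector<Range> sortedSuffixRanges(const std::string& s) {
    std::vector<Range> ranges;
    for (std::size_t i = 0; i < s.size(); ++i)
        ranges.push_back(Range(i, s.size()));
    std::sort(ranges.begin(), ranges.end(), RangeLess(s));
    return ranges;
}
```

For the string "banana" the sorted start indices come out as 5, 3, 1, 0, 4, 2. Only the index pairs are moved during the sort, so the string itself is never copied.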

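Even with the reduced memory usage, your program will still take on the order of 40T iterations to run (since you need to compare the strings), unless you use some kind of hash-based string comparison algorithm.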

You can use a combination of std::string_view, std::hash and std::set.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <set>
#include <string>
#include <string_view>
#include <vector>
std::string str1="This is a very long string with some repeats of strings.";
std::string str2="This is another string with some repeats that is very long.";
std::set<std::size_t> substringhashes;
std::vector<std::string_view> matches;
// Hashes every substring of the given length; a hash that was already seen
// (in either string, or earlier in the same string) is recorded as a match.
bool makeSubHashes(const std::string& str, std::size_t length) {
    for (std::size_t pos=0; pos+length <= str.size(); ++pos) {
        std::string_view sv(str.data()+pos, length);
        auto hash = std::hash<std::string_view>()(sv);
        if (!substringhashes.insert(hash).second) {
            matches.push_back(sv);
            if (matches.size() > 99) // Optional break after finding the 100 longest matches
                return true;
        }
    }
    return false;
}
int main() {
    // Try the longest possible length first, then shorter ones
    for (std::size_t length=std::min(str1.size(), str2.size()); length>0; --length) {
        if (makeSubHashes(str1, length) || makeSubHashes(str2, length))
            break;
    }
    for (auto& sv : matches) {
        std::cout << sv << std::endl;
    }
    return 0;
}

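If the number of suffixes is extremely high, there is a chance that the std::set will give false positives. It can hold at most as many distinct hashes as std::size_t has values, and std::size_t is usually a 64-bit unsigned integer.

It also starts searching for matches at the maximum length of the strings; a more reasonable approach might be to set some kind of maximum length for the suffixes.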
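
Both caveats can be sidestepped in a variation, sketched here under stated assumptions and not taken from the answer's code. Storing the views themselves in a std::unordered_set<std::string_view> means a hash collision falls back to comparing the actual characters, so it cannot produce a false positive. And because a shared substring of length L implies one of every shorter length, the length can be found by binary search instead of counting down from the maximum. The function names sharedOfLength and longestSharedByLength are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <unordered_set>

// Returns a substring of exactly `len` characters that occurs in both
// a and b, or an empty view if there is none.
std::string_view sharedOfLength(std::string_view a, std::string_view b, std::size_t len) {
    if (len == 0 || len > a.size() || len > b.size()) return {};
    // The set stores views, so equal hashes are resolved by comparing contents.
    std::unordered_set<std::string_view> seen;
    for (std::size_t i = 0; i + len <= a.size(); ++i) seen.insert(a.substr(i, len));
    for (std::size_t i = 0; i + len <= b.size(); ++i) {
        std::string_view candidate = b.substr(i, len);
        if (seen.count(candidate)) return candidate;
    }
    return {};
}

// Binary search over the length: lo is the longest length known to work,
// hi is the longest length that might still work.
std::string_view longestSharedByLength(std::string_view a, std::string_view b) {
    std::size_t lo = 0, hi = std::min(a.size(), b.size());
    std::string_view best;
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo + 1) / 2;
        std::string_view found = sharedOfLength(a, b, mid);
        if (!found.empty()) { best = found; lo = mid; }
        else hi = mid - 1;
    }
    return best;
}
```

Hashing each window still costs time proportional to its length, so for million-character inputs a rolling hash would probably be needed to make this practical. The sketch is meant to show the structure only.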

std::sort sorts data that is in main memory.

If you can fit the data into main memory, then you can use std::sort to sort it.

Otherwise you cannot.