How do I efficiently search before AND after an element for a key phrase?

Keywords: search, after, phrase, efficiently, element. Last updated: 2023-10-16

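I have a very large data set (ranging from 100,000 to 250,000 elements), and I am currently storing the data in a vector with the goal of searching through a set of words. Given a phrase (e.g. "on, para"), the function should find all words beginning with the given phrase and push all matches onto a queue.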

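To find the initial word I am using a binary search, which seems to work well, but after finding the initial word I am stuck. How should I efficiently iterate before and after that element to find all similar words? The input is in alphabetical order, so I know every other possible match will occur immediately before or after the returned element. I feel there must be a function in <algorithm> that I can take advantage of. Here is the relevant portion of the code: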

Binary search function:

int search(const std::vector<std::string>& dict, const std::string& in)
{
    //binary search for any one word in dict that starts with the prefix "in"
    //returns the index of that word, or -1 if no word matches
    int first=0, last= static_cast<int>(dict.size()) -1;
    while(first <= last)
    {
        int middle = (first+last)/2;
        std::string sub = (dict.at(middle)).substr(0,in.length());
        int comp = in.compare(sub);
        //if comp returns 0(found word matching case)
        if(comp == 0) {
            return middle;
        }
        //if not, take top half
        else if (comp > 0)
            first = middle + 1;
        //else go with the lower half
        else
            last = middle - 1;
    }
    //word not found... return failure
    return -1;
}

In main():

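//for each element in our "find word" vector
for (std::size_t i = 0; i < input.size(); i++)
{
    // currently just finds initial word and displays
    int key = search(dictionary, input.at(i));
    // search returns -1 when nothing matches, so check before indexing
    if (key == -1) {
        std::cout << "no match for " << input.at(i) << std::endl;
        continue;
    }
    std::cout << "search found " << dictionary.at(key) <<
                 " at key location " << key << std::endl;
}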

Use std::lower_bound and iterate forward (you could also use std::upper_bound):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
int main() {
    typedef std::vector<std::string> Dictionary;
    Dictionary dictionary = {
        "A", "AA", "B", "BB", "C", "CC"
    };
    std::string prefix("B");
    Dictionary::const_iterator pos = std::lower_bound(
        dictionary.begin(),
        dictionary.end(),
        prefix);
    for( ; pos != dictionary.end(); ++pos) {
        if(pos->compare(0, prefix.size(), prefix) == 0) {
            std::cout << "Match: " << *pos << std::endl;
        }
        else break;
    }
    return 0;
}
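If the goal is to push every match onto a queue, as the question describes, the same idea can be sketched with two binary searches instead of a linear walk to find the end of the range. This is only a sketch under assumptions: the helper names starts_with and push_matches are made up for illustration, and the dictionary is assumed to be sorted with the default std::string ordering.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical helper: true when `word` begins with `prefix`.
bool starts_with(const std::string& word, const std::string& prefix) {
    return word.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: pushes every word of the sorted vector `dict` that
// begins with `prefix` onto `out` and returns how many words were pushed.
std::size_t push_matches(const std::vector<std::string>& dict,
                         const std::string& prefix,
                         std::queue<std::string>& out) {
    // First element that is not less than the prefix: matches start here.
    auto first = std::lower_bound(dict.begin(), dict.end(), prefix);
    // From `first` onward the matching words come before the non-matching
    // ones, so a second binary search finds the end of the matching range.
    auto last = std::partition_point(first, dict.end(),
        [&prefix](const std::string& w) { return starts_with(w, prefix); });
    for (auto it = first; it != last; ++it) {
        out.push(*it);
    }
    return static_cast<std::size_t>(last - first);
}
```

Because the matches form one contiguous range, there is no need to search "before" the element at all: std::lower_bound already lands on the first match, and everything up to `last` is a match.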

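You need to build the index not just for each phrase, but for every sub-phrase that starts at a word. For example, for the dictionary string "New York" you have to keep index entries for two strings: "New York" and "York". See my autocomplete demo, which illustrates the idea: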

http://olegh.cc.st/autocomplete.html

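As you can see, this subsystem works quickly with a dictionary that is larger than 250K elements. Of course, I do not use binary search there, because it is slow. I use hashing instead.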
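To make the sub-phrase idea concrete, here is a rough sketch of how such an index could be built. It is not the code behind the demo: the names IndexEntry, build_subphrase_index and find_by_prefix are hypothetical, words are assumed to be separated by spaces, and for simplicity the sketch looks entries up with a sorted vector and std::lower_bound rather than the hashing mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One index entry: a sub-phrase plus the position of the dictionary
// string it was taken from.
using IndexEntry = std::pair<std::string, std::size_t>;

// Builds a sorted index holding, for every dictionary string, each
// sub-phrase that starts at a word ("New York" -> "New York", "York").
std::vector<IndexEntry> build_subphrase_index(const std::vector<std::string>& dict) {
    std::vector<IndexEntry> index;
    for (std::size_t i = 0; i < dict.size(); ++i) {
        const std::string& s = dict[i];
        std::size_t start = s.find_first_not_of(' ');
        while (start != std::string::npos) {
            index.emplace_back(s.substr(start), i);
            // Jump to the beginning of the next word, if there is one.
            std::size_t space = s.find(' ', start);
            start = (space == std::string::npos)
                        ? std::string::npos
                        : s.find_first_not_of(' ', space);
        }
    }
    std::sort(index.begin(), index.end());
    return index;
}

// Returns the positions of all dictionary strings that have a sub-phrase
// beginning with `prefix`. A position can appear more than once if several
// sub-phrases of the same string match.
std::vector<std::size_t> find_by_prefix(const std::vector<IndexEntry>& index,
                                        const std::string& prefix) {
    std::vector<std::size_t> hits;
    auto it = std::lower_bound(index.begin(), index.end(), prefix,
        [](const IndexEntry& e, const std::string& p) { return e.first < p; });
    for (; it != index.end() &&
           it->first.compare(0, prefix.size(), prefix) == 0; ++it) {
        hits.push_back(it->second);
    }
    return hits;
}
```

With this index, typing "Yo" finds the entry "New York" even though that string does not begin with "Yo", which is the behavior the autocomplete example describes.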

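An ordered vector (list) is certainly one way to store your data, but keeping the items in order has an efficiency cost. You also did not mention whether your array is static or dynamic. There are other data structures that store sorted data and have very good lookup times.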

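Hash/map - you can store the items in a hash/map and get very fast lookups, but finding the next and previous items is problematic.
Binary tree/N-ary tree/B-tree - very good dynamic insert/delete performance and good lookup time, and because the tree is ordered, finding the next/previous item works reliably.
Bloom filter - sometimes all you want to do is check whether an item is in your collection. A Bloom filter can be tuned to a very low false-positive rate (and it never gives false negatives), so it is a good choice for that.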
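As a small illustration of the ordered-tree option, std::set keeps its elements sorted (it is typically implemented as a balanced binary search tree), supports cheap insertion, and offers its own lower_bound member. The function names below are made up for this sketch.

```cpp
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: collects every word in the ordered set that
// begins with `prefix`.
std::vector<std::string> words_with_prefix(const std::set<std::string>& words,
                                           const std::string& prefix) {
    std::vector<std::string> result;
    // The member lower_bound descends the tree in logarithmic time.
    for (auto it = words.lower_bound(prefix);
         it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
         ++it) {
        result.push_back(*it);
    }
    return result;
}

// Hypothetical helper: returns the word stored immediately before `word`,
// or an empty string when `word` is absent or is the first element.
std::string previous_word(const std::set<std::string>& words,
                          const std::string& word) {
    auto it = words.find(word);
    if (it == words.end() || it == words.begin()) {
        return std::string();
    }
    return *std::prev(it);
}
```

Unlike a sorted vector, the set stays ordered after each insert without shifting elements, which matters if the dictionary is dynamic. For a static dictionary, the sorted vector is usually the more compact choice.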

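Suppose you break your data into short subsequences (syllables). Then you could have a tree of syllables with very fast lookups, and depending on whether each level of the tree is implemented as an ordered list or as a hash/map, you may also be able to find the next/previous item.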
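A minimal sketch of that tree idea follows. As a simplifying assumption it splits words into single characters rather than syllables, which makes it an ordinary trie. The names TrieNode, trie_insert, trie_collect and trie_prefix_search are hypothetical. Children are kept in a std::map, i.e. the "ordered" variant, so matches come back in alphabetical order.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// A trie node: children live in a std::map so they stay ordered.
struct TrieNode {
    bool is_word = false;
    std::map<char, std::unique_ptr<TrieNode>> children;
};

// Adds `word` to the trie rooted at `root`.
void trie_insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) {
            child.reset(new TrieNode());
        }
        node = child.get();
    }
    node->is_word = true;
}

// Appends every word stored at or below `node` to `out`.
// `current` holds the characters on the path from the root to `node`.
void trie_collect(const TrieNode& node, std::string& current,
                  std::vector<std::string>& out) {
    if (node.is_word) {
        out.push_back(current);
    }
    for (const auto& kv : node.children) {
        current.push_back(kv.first);
        trie_collect(*kv.second, current, out);
        current.pop_back();
    }
}

// Returns all stored words that begin with `prefix`.
std::vector<std::string> trie_prefix_search(const TrieNode& root,
                                            const std::string& prefix) {
    const TrieNode* node = &root;
    for (char c : prefix) {
        auto it = node->children.find(c);
        if (it == node->children.end()) {
            return {};  // no stored word starts with this prefix
        }
        node = it->second.get();
    }
    std::vector<std::string> out;
    std::string current = prefix;
    trie_collect(*node, current, out);
    return out;
}
```

Reaching the subtree for a prefix costs one step per prefix character regardless of dictionary size. Swapping std::map for std::unordered_map would give the hash variant, at the price of losing the ordered traversal.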