How to speed up regex searching across a large number of potentially large files in C++?

Keywords: file, regex, search, C++, speed up. Updated: 2023-10-16

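I'm trying to write a program that uses an Excel document as a configuration file, from which it reads user-supplied wildcard file paths and wildcard search strings. For example, the user could enter C:\Read*.txt, and every text file on the C drive whose name starts with Read, followed by any characters, would be included in the search.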

They could then search for Message:*, and every string that starts with "Message:" and ends with any sequence of characters would match.

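So far the program works, but it is very slow, and I need it to be able to search very large files. I'm using file streams and the regex classes to do this, but I'm not sure what is taking so much time.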

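Most of the time in my code is spent in the loop below (I've included the lines just above the while loop only so you can better understand what I'm trying to do):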

    // regex_patterns, filePath, line, lineNumb and lineTextToSave are declared elsewhere.
    smatch matches;
    vector<regex> expressions;
    for (size_t i = 0; i < regex_patterns.size(); i++)
    {
        expressions.emplace_back(regex_patterns.at(i));
    }
    auto startTimer = high_resolution_clock::now();
    // Open file and begin reading
    ifstream stream1(filePath);
    if (stream1.is_open())
    {
        int count = 0;
        while (getline(stream1, line))
        {
            // Skip the search if the line is empty, no point in searching it.
            if (!line.empty())
            {
                // Loop through each search string, if match, save line number and line text.
                for (size_t i = 0; i < expressions.size(); i++)
                {
                    if (regex_search(line, matches, expressions.at(i)))
                    {
                        lineNumb.push_back(count);
                        lineTextToSave.push_back(line);
                    }
                }
            }
            // Count empty lines too, so the saved line numbers stay correct.
            count = count + 1;
        }
    }
    auto stopTimer = high_resolution_clock::now();
    auto duration2 = duration_cast<milliseconds>(stopTimer - startTimer);
    cout << "Time to search file: " << duration2.count() << "\n";

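Is there a better way to search files than this? I've tried looking up a lot of things, but so far I haven't found a programming example that I understand.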

Some ideas, in order of priority:

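Instead of matching r regexes against every line, you can join all the regex patterns into a single regex. This can speed up your program by up to a factor of r. Example: (R1)|(R2)|(...)|(Rr)
Make sure the regexes are compiled before use, not inside the loop.
Don't append a trailing .* to the regex patterns. With regex_search it does not change which lines match and only adds work.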
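The three ideas above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual program: `combine_patterns` and `matching_lines` are hypothetical names, and the patterns are joined with non-capturing groups `(?:...)` instead of the capturing `(R1)|(R2)` form, on the assumption that only a yes/no answer per line is needed.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Join the patterns as (?:R1)|(?:R2)|... and compile the result once.
// Note: an empty pattern list yields an empty regex, which matches every line.
std::regex combine_patterns(const std::vector<std::string>& patterns) {
    std::string joined;
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        if (i != 0) {
            joined += '|';
        }
        joined += "(?:" + patterns[i] + ")";
    }
    // std::regex::optimize asks the library to favour matching speed
    // over construction speed; the actual gain is implementation-defined.
    return std::regex(joined, std::regex::ECMAScript | std::regex::optimize);
}

// Return the zero-based indices of the lines that match the combined regex.
std::vector<std::size_t> matching_lines(const std::vector<std::string>& lines,
                                        const std::regex& combined) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        // One regex_search per line, without a match_results object,
        // since only the boolean result is used.
        if (!lines[i].empty() && std::regex_search(lines[i], combined)) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

One behavioural difference from the loop in the question: a line that matches several patterns is recorded once here, not once per matching pattern.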

Some further ideas, though they are not portable:

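Memory-map the files instead of reading them through iostreams.
Consider whether it is worth reimplementing grep yourself rather than simply calling grep via popen().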
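A minimal sketch of the memory-mapping idea, assuming a POSIX system (`open`, `fstat`, `mmap`); on Windows the equivalent calls differ. The function name `search_mapped_file` is hypothetical. It walks the mapped bytes line by line with `memchr` and runs `regex_search` on each line in place, so no per-line std::string is created.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <cstring>
#include <regex>
#include <string>
#include <vector>

// POSIX-only sketch: map the file read-only and collect the zero-based
// numbers of the lines that match `re`. Returns false if the file cannot
// be opened or mapped.
bool search_mapped_file(const std::string& path, const std::regex& re,
                        std::vector<std::size_t>& hits) {
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    if (st.st_size == 0) { close(fd); return true; }  // empty file, no lines

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;

    const char* p = static_cast<const char*>(addr);
    const char* end = p + size;
    std::size_t line_no = 0;
    while (p < end) {
        const char* nl = static_cast<const char*>(
            std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
        const char* line_end = nl ? nl : end;
        // Search the line in place; empty lines are skipped but still counted.
        if (line_end != p && std::regex_search(p, line_end, re)) {
            hits.push_back(line_no);
        }
        ++line_no;
        p = nl ? nl + 1 : end;
    }

    munmap(addr, size);
    return true;
}
```

Whether this beats `getline` depends on the platform and on how much of the time is actually spent in the regex engine rather than in I/O, so it is worth measuring both before committing to the non-portable version.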