How to parse for lines with specific format in a file

Keywords: specific format, file. Last updated: 2023-10-16

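I recently tried to parse subtitle files so that I could adjust the timing myself. The format is very simple; valid lines look like this: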

<arbitrary lines might include comments, blanks, random stuff>
<consecutively numbered ID here>
01:23:45,678 --> 01:23:47,910
<arbitrary lines might include comments, blanks, random stuff>

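How can I do this elegantly in C++? So far I have only come up with very ugly solutions. For example: read the file line by line, search each line for "-->", and then work through that line with a sequence of find(":"), find(",") and substr() calls.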

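However, I feel there must be a better way, for example by somehow splitting the line into tokens. Ideally, a line such as the following would still be parsed correctly:

01 : 23    :45,678   -->  01:23:   45, 910

The end result should be each part (hh, mm, ss, ms) in its own variable. I am not necessarily asking for a complete implementation. A general idea and pointers to suitable utility functions are entirely sufficient.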

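You can do this with std::regex. You define which tokens to extract, and the regular expression does the extraction for you. You can of course vary the input string (extra blanks and so on) and it will still work. The extracted tokens end up in a vector, and you can continue working with the data from there. Fairly simple.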

See this skeleton example:

#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>   // std::ostream_iterator
#include <vector>
#include <regex>
// Test data as a raw string literal, so it may contain characters such as " without escaping
std::string testData(R"#(01 : 23    :45,678   -->  01:23:   45, 910  ?")#");
// One capture group: a run of digits delimited by word boundaries
std::regex re(R"#((\b\d+\b))#");
int main()
{
    // Define the variable id as vector of string and use the range constructor to read the test data and tokenize it
    std::vector<std::string> id{ std::sregex_token_iterator(testData.begin(), testData.end(), re, 1), std::sregex_token_iterator() };
    // For debug output. Print complete vector to std::cout
    std::copy(id.begin(), id.end(), std::ostream_iterator<std::string>(std::cout, " "));
    return 0;
}