Parsing a file with two line formats and extracting fields from each line

Keywords: two, formats, file      Updated: 2023-10-16

I have a huge file whose lines can be in either of the following two formats:

Format 1:

*1 <int_1/string_1>:<int/string> <int_2/string_2>:<int/string> <float>

Format 2:

*1 <int/string>:<int/string> <float>

So the possible cases for the formats above are:

*1 1:2 3:4 2.3
*1 1:foo 3:bar 2.3
*1 foo:1 bar:4 2.3
*1 foo:foo bar:bar 2.3
*1 foo:foo 2.3

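Of these two formats, my code only needs to handle Format 1; lines matching Format 2 should be skipped while reading the large file. Among the possible cases, I would therefore accept the first four and reject the last one, because it matches Format 2. So the regular expression should be something like this: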

(\d+)(\s+)(\*\S+:\S+)(\s+)(\*\S+:\S+)(\s+)(\d+)

where

\d is any digit; \d+ is one or more digits.
\s is a whitespace character; \s+ is one or more whitespace characters.
\S is any non-whitespace character; \S+ is one or more non-whitespace characters.

Once a line has been accepted as Format 1, I need to extract two values from it:

int_1/string_1
int_2/string_2

What is the best way to handle this?

You can start by counting the number of whitespace-separated fields:

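// Needs <cctype> for isspace; s is assumed to be the current line
// as a NUL-terminated C string.
struct Field {
    int start, stop;   // half-open range [start, stop) within s
};

Field fields[4];
int i = 0, nf = 0;
while (s[i]) {
    // Skip leading whitespace (cast avoids undefined behavior for negative chars)
    while (s[i] && isspace((unsigned char)s[i])) i++;
    if (!s[i]) break;
    int start = i;
    // Advance to the end of the current field
    while (s[i] && !isspace((unsigned char)s[i])) i++;
    nf++;
    if (nf == 5) break; // Too many fields
    fields[nf-1].start = start;
    fields[nf-1].stop = i;
}
if (nf == 4) {
    // We got 4 fields, line could be acceptable
    ...
}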

Adding a pre-check that the line starts with '*', '1', and a space may speed up skipping invalid lines (if there are many of them).
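The field-counting idea and the pre-check can be combined into one function. The following is only a minimal sketch: the name `parseFormat1` and its output parameters are hypothetical, and it assumes each line is already available as a `std::string`. It does not verify that the last field is a valid float.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the original answer): returns true and
// fills key1/key2 only if `line` looks like Format 1: "*1 a:b c:d <float>".
bool parseFormat1(const std::string& line, std::string& key1, std::string& key2) {
    // Cheap pre-check: the line must start with "*1" followed by whitespace.
    if (line.size() < 3 || line[0] != '*' || line[1] != '1' ||
        !std::isspace(static_cast<unsigned char>(line[2]))) {
        return false;
    }

    // Record the [start, stop) offsets of up to 4 whitespace-separated fields.
    struct Field { std::size_t start, stop; };
    Field fields[4];
    const std::size_t n = line.size();
    std::size_t i = 0, nf = 0;
    while (i < n) {
        while (i < n && std::isspace(static_cast<unsigned char>(line[i]))) ++i;
        if (i == n) break;
        const std::size_t start = i;
        while (i < n && !std::isspace(static_cast<unsigned char>(line[i]))) ++i;
        if (++nf == 5) return false;   // too many fields
        fields[nf - 1] = {start, i};
    }
    if (nf != 4) return false;         // Format 2 lines have only 3 fields

    // Fields 1 and 2 must each contain ':'; the key is the text before it.
    std::string keys[2];
    for (int f = 0; f < 2; ++f) {
        const Field& fld = fields[f + 1];
        const std::size_t colon = line.find(':', fld.start);
        if (colon == std::string::npos || colon >= fld.stop) return false;
        keys[f] = line.substr(fld.start, colon - fld.start);
    }
    key1 = keys[0];
    key2 = keys[1];
    return true;
}
```

Calling `parseFormat1(line, a, b)` inside a `std::getline` loop would skip Format 2 lines without building a regex or allocating a token vector per line, which may matter for a very large file.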

Using Boost

#include <array>
#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/split.hpp>

int main() {
    std::array<std::string, 5> x = { "*1 1:2 3:4 2.3",
                                     "*1 1:foo 3:bar 2.3",
                                     "*1 foo:1 bar:4 2.3",
                                     "*1 foo:foo bar:bar 2.3",
                                     "*1 foo:foo 2.3"
                                   };
    for (const auto& item : x) {
        std::vector<std::string> Words;
        // Split on <space> and ':' (narrow literal, to match std::string)
        boost::split(Words, item, boost::is_any_of(" :"));
        std::cout << item << std::endl;
        // Only consider Format 1: it yields 6 tokens, Format 2 yields 4
        if (Words.size() > 4) {
            std::cout << Words[1] << ":" << Words[2] << std::endl;
            std::cout << Words[3] << ":" << Words[4] << std::endl;
        }
        std::cout << std::endl;
    }
    return 0;
}

Using std::regex

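#include <array>
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::array<std::string, 5> x = { "*1 1:2 3:4 2.3",
                                     "*1 1:foo 3:bar 2.3",
                                     "*1 foo:1 bar:4 2.3",
                                     "*1 foo:foo bar:bar 2.3",
                                     "*1 foo:foo 2.3"
                                   };
    // Literal "*1", then two "word:word" pairs; the rest of the line is ignored
    std::regex re(R"(\*1\s+(\w+):(\w+)\s+(\w+):(\w+).*)");
    for (const auto& item : x) {
        std::smatch sm;
        // Only Format 1 lines match, so Format 2 lines are skipped
        if (std::regex_match(item, sm, re)) {
            std::cout << sm[1] << ":" << sm[2] << std::endl;
            std::cout << sm[3] << ":" << sm[4] << std::endl;
        }
    }
    return 0;
}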
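Since the question is about reading a large file and keeping only int_1/string_1 and int_2/string_2, the regex approach could be applied line by line to a stream. This is a rough sketch under stated assumptions: the function name `collectFormat1Keys` is hypothetical, the keys are assumed to consist of word characters (`\w`), and the pattern is a slightly stricter variant of the one above that also requires a final field.

```cpp
#include <istream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the (int_1/string_1, int_2/string_2) pair
// from every Format 1 line read from `in`; all other lines are skipped.
std::vector<std::pair<std::string, std::string>> collectFormat1Keys(std::istream& in) {
    // Compiled once. Captures only the text before each ':'.
    // The trailing \s* tolerates a stray '\r' from Windows line endings.
    static const std::regex re(R"(\*1\s+(\w+):\S+\s+(\w+):\S+\s+\S+\s*)");

    std::vector<std::pair<std::string, std::string>> keys;
    std::string line;
    std::smatch m;
    while (std::getline(in, line)) {
        if (std::regex_match(line, m, re)) {
            keys.emplace_back(m[1].str(), m[2].str());
        }
    }
    return keys;
}
```

The function takes any `std::istream`, so it can be given a `std::ifstream` opened on the real file or a `std::istringstream` for testing. For very large inputs `std::regex` can be slow, so hand-written field splitting may be worth measuring against it.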