Parsing heterogeneous data using Boost::Spirit

Keywords: heterogeneous, data, Spirit, Boost, using      Updated: 2023-10-16

I'm trying to figure out how to approach the following problem.

I have a structure of the following form:

struct Data
{
     time_t timestamp;
     string id;
     boost::optional<int> data1;
     boost::optional<string> data2;
     // etc...
};

It should be parsed from a single-line string in the following format:

human_readable_timestamp;id;key1=value1 key2=value2.....

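Of course, the order of the keys need not match the order of the members in the structure.

Is Boost::Spirit suitable for this kind of data? How should I approach it? I have gone through the examples, but I could not get from them to code that fits my requirements.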

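You can use the permutation parser. I made a very similar example here:

  • Reading JSON file with C++ and BOOST

If keys can be repeated, it makes more sense to use a Kleene star (*) instead, perhaps

  1. using semantic actions to assign the attributes, or
  2. using attribute customization points to assign the result

PS. Also have a look at the keyword parsers in the Spirit Repository (see "Boost Qi Composing rules using Functions").

If you don't want to use semantic actions (Boost Spirit: "Semantic actions are evil"?), you can tweak the struct slightly so that it matches the automatically synthesized attribute type when a permutation of the data elements is used: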

struct Data
{
    boost::posix_time::ptime timestamp;
    std::string id;
    struct Fields {
        boost::optional<int> data1;
        boost::optional<std::string> data2;
    } fields;
};

Now the parser can simply be:

    timestamp = stream;
    text  = lexeme [ '"' >> *~char_('"') >> '"' ];
    data1 = "key1" >> lit('=') >> int_;
    data2 = "key2" >> lit('=') >> text;
    id    = lexeme [ *~char_(';') ];
    start = timestamp >> ';' >> id >> ';' >> (data1 ^ data2);

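Update

In response to the comments, to make it "resilient": I ended up dropping the permutation parser in favor of the first numbered approach (the Kleene star with semantic actions).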

    id     = lexeme [ *~char_(';') ];
    auto data1 = bind(&Data::Fields::data1, _val);
    auto data2 = bind(&Data::Fields::data2, _val);
    other  = lexeme [ +(graph-'=') ] >> '=' >> (real_|int_|text);
    fields = *(
                ("key1" >> lit('=') >> int_) [ data1 = _1 ]
              | ("key2" >> lit('=') >> text) [ data2 = _1 ]
              | other
              );
    start  = timestamp >> ';' >> id >> -(';' >> fields);

This changed the following:

  • To be able to skip "other" fields, I needed to come up with a reasonable grammar for them:

    other  = lexeme [ +(graph-'=') ] >> '=' >> (real_|int_|text);
    

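    (This allows keys consisting of any non-space characters other than =, followed by =, followed by either a number (matched eagerly) or a text.)

  • I extended the notion of text to support popular quoting/escaping schemes: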

    text   = lexeme [ 
                '"' >> *('\\' >> char_ | ~char_('"')) >> '"'
              | "'" >> *('\\' >> char_ | ~char_("'")) >> "'"
              | *graph 
           ];
    
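  • The same key may be repeated (in which case the last valid value seen is retained).

  • If you want to disallow invalid values, replace ">> int_" and ">> text" with "> int_" and "> text" (expectation parsers).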

I extended the test cases with a few challenging ones:

    2015-Jan-26 00:00:00;id
    2015-Jan-26 14:59:24;id;key2="value"
    2015-Jan-26 14:59:24;id;key2="value" key1=42
    2015-Jan-26 14:59:24;id;key2="value" key1=42 something=awful __=4.74e-10 blarg;{blo;bloop='whatever \'ignor\'ed' key2="new} \"value\""
    2015-Jan-26 14:59:24.123;id;key1=42 key2="value" 

It now prints

----------------------------------------
Parsing '2015-Jan-26 00:00:00;id'
Parsing success
2015-Jan-26 00:00:00    id
data1: --
data2: --
----------------------------------------
Parsing '2015-Jan-26 14:59:24;id;key2="value"'
Parsing success
2015-Jan-26 14:59:24    id
data1: --
data2:  value
----------------------------------------
Parsing '2015-Jan-26 14:59:24;id;key2="value" key1=42'
Parsing success
2015-Jan-26 14:59:24    id
data1:  42
data2:  value
----------------------------------------
Parsing '2015-Jan-26 14:59:24;id;key2="value" key1=42 something=awful __=4.74e-10 blarg;{blo;bloop='whatever \'ignor\'ed' key2="new} \"value\""'
Parsing success
2015-Jan-26 14:59:24    id
data1:  42
data2:  new} "value"
----------------------------------------
Parsing '2015-Jan-26 14:59:24.123;id;key1=42 key2="value" '
Parsing success
2015-Jan-26 14:59:24.123000 id
data1:  42
data2:  value

Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#include <boost/optional/optional_io.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/date_time/posix_time/posix_time_io.hpp>
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <iostream>
#include <string>
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
struct Data
{
    boost::posix_time::ptime timestamp;
    std::string id;
    struct Fields {
        boost::optional<int> data1;
        boost::optional<std::string> data2;
    } fields;
};
BOOST_FUSION_ADAPT_STRUCT(Data::Fields,
        (boost::optional<int>, data1)
        (boost::optional<std::string>, data2)
    )
BOOST_FUSION_ADAPT_STRUCT(Data,
        (boost::posix_time::ptime, timestamp)
        (std::string, id)
        (Data::Fields, fields)
    )
template <typename It, typename Skipper = qi::space_type>
struct grammar : qi::grammar<It, Data(), Skipper> {
    grammar() : grammar::base_type(start) {
        using namespace qi;
        timestamp = stream;
        real_parser<double, strict_real_policies<double> > real_;
        // text: double-quoted or single-quoted with backslash escapes, or a bare word
        text   = lexeme [ 
                    '"' >> *('\\' >> char_ | ~char_('"')) >> '"'
                  | "'" >> *('\\' >> char_ | ~char_("'")) >> "'"
                  | *graph 
               ];
        id     = lexeme [ *~char_(';') ];
        auto data1 = bind(&Data::Fields::data1, _val);
        auto data2 = bind(&Data::Fields::data2, _val);
        other  = lexeme [ +(graph-'=') ] >> '=' >> (real_|int_|text);
        fields = *(
                    ("key1" >> lit('=') >> int_) [ data1 = _1 ]
                  | ("key2" >> lit('=') >> text) [ data2 = _1 ]
                  | other
                  );
        start  = timestamp >> ';' >> id >> -(';' >> fields);
        BOOST_SPIRIT_DEBUG_NODES((timestamp)(id)(start)(text)(other)(fields))
    }
  private:
    qi::rule<It,                                 Skipper> other;
    qi::rule<It, std::string(),                  Skipper> text, id;
    qi::rule<It, boost::posix_time::ptime(),     Skipper> timestamp;
    qi::rule<It, Data::Fields(),                 Skipper> fields;
    qi::rule<It, Data(),                         Skipper> start;
};
int main() {
    using It = std::string::const_iterator;
    for (std::string const input : {
            "2015-Jan-26 00:00:00;id",
            "2015-Jan-26 14:59:24;id;key2=\"value\"",
            "2015-Jan-26 14:59:24;id;key2=\"value\" key1=42",
            "2015-Jan-26 14:59:24;id;key2=\"value\" key1=42 something=awful __=4.74e-10 blarg;{blo;bloop='whatever \\'ignor\\'ed' key2=\"new} \\\"value\\\"\"",
            "2015-Jan-26 14:59:24.123;id;key1=42 key2=\"value\" ",
            })
    {
        std::cout << "----------------------------------------\nParsing '" << input << "'\n";
        It f(input.begin()), l(input.end());
        Data parsed;
        bool ok = qi::phrase_parse(f,l,grammar<It>(),qi::space,parsed);
        if (ok) {
            std::cout << "Parsing success\n";
            std::cout << parsed.timestamp << "\t" << parsed.id << "\n";
            std::cout << "data1: " << parsed.fields.data1 << "\n";
            std::cout << "data2: " << parsed.fields.data2 << "\n";
        } else {
            std::cout << "Parsing failed\n";
        }
        if (f!=l)
            std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
    }
}