How to make Boost.Spirit.Lex token value be a substring of matched sequence (preferably by regex matching group)

Last updated: 2023-10-16

I am writing a simple expression parser. It is built on a Boost.Spirit.Qi grammar that operates on Boost.Spirit.Lex tokens (Boost version 1.56).

The tokens are defined as follows:

using namespace boost::spirit;
template<
    typename lexer_t
>
struct tokens
    : lex::lexer<lexer_t>
{
    tokens()
        : /* ... */,
          variable("%(\\w+)")  // backslash escaped for the C++ string literal
    {
        this->self =
            /* ... */ |
            variable;
    }
    /* ... */
    lex::token_def<std::string> variable;
};

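Now I would like the value of the variable token to be just the name (the capture group (\w+)), without the leading % sign. How can I do that?

Using a capture group by itself does not help. The value is still the full string, including the leading %.

Is there a way to force the use of the capture group?

Or at least to refer to it somehow from within the token's action?

I also tried using an action like this: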

variable[lex::_val = std::string(lex::_start + 1, lex::_end)]

But it failed to compile. The error claimed that none of the std::string constructor overloads could match the arguments:

(const boost::phoenix::actor<Expr>, const boost::spirit::lex::_end_type)

Even the simpler

variable[lex::_val = std::string(lex::_start, lex::_end)]

failed to compile as well, for a similar reason, only now the first argument type was boost::spirit::lex::_start_type.

Finally I tried this (even though it looks quite wasteful):

lex::_val = std::string(lex::_val).erase(0, 1)

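But that did not compile either. This time the compiler could not convert const boost::spirit::lex::_val_type to std::string.

Is there any way to deal with this problem?

Simple Solution

The correct way to construct the std::string attribute value is the following: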

variable[lex::_val = boost::phoenix::construct<std::string>(lex::_start + 1, lex::_end)]

exactly as jv_ suggested in his (or her) comment.

boost::phoenix::construct is provided by the <boost/phoenix/object/construct.hpp> header. Alternatively, include <boost/phoenix.hpp>.
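The earlier attempts fail because lex::_start, lex::_end and lex::_val are lazy placeholders rather than iterators or strings, so the std::string constructor cannot accept them directly. phoenix::construct instead produces a function object that runs the constructor later, when a token is actually matched. The sketch below imitates that deferral with a plain lambda and the standard library only. The names text_iterator, value_action and make_skip_prefix_action are hypothetical and exist only for this illustration; they are not part of Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

typedef std::string::const_iterator text_iterator;

// Hypothetical stand-in for a lexer action: it is invoked later, once per
// match, with the iterators that delimit the matched text.
typedef std::function<std::string(text_iterator, text_iterator)> value_action;

// Returns an action that builds the value from [first + skip, last).
// Nothing is constructed here; the string is built only when the action runs.
inline value_action make_skip_prefix_action(std::ptrdiff_t skip)
{
    return [skip](text_iterator first, text_iterator last) -> std::string {
        assert(skip >= 0 && last - first >= skip);
        return std::string(first + skip, last);
    };
}
```

Calling make_skip_prefix_action(1) and then invoking the result on the range of "%name" would yield "name", which is the effect the construct-based action has on the token value.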

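Regex Solution

However, the above solution works only in simple cases. It also rules out supplying the pattern from outside (from configuration data in particular), since changing the pattern to, for example, %(\w+)% would require changing the value construction code as well.

This is why it would be much better to be able to refer to capture groups from the regular expression that defines the token.

Note that this is still not perfect, because odd cases such as %(\w+)%(\w+)% would still require a code change to be handled correctly. That could be worked around by making configurable not only the token's regular expression but also the way the value is formed from the matched range. That, however, is beyond the scope of the question. Using capture groups directly seems flexible enough for many cases.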
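One way to picture a configurable way of forming the value is a format string that refers to capture groups. The sketch below is only an illustration of that idea, using std::regex from the standard library as a stand-in for the lexer and for Boost.Regex. The function name format_token_value and its parameters are hypothetical, and the regex dialect accepted by std::regex is not necessarily the one accepted by Spirit.Lex.

```cpp
#include <regex>
#include <string>

// Forms a value from `matched` according to `format`, where "$1", "$2", ...
// refer to capture groups of `pattern`. Returns an empty string if `pattern`
// does not match the whole text.
inline std::string format_token_value(const std::string& matched,
                                      const std::string& pattern,
                                      const std::string& format)
{
    const std::regex re(pattern);
    std::smatch results;
    if (!std::regex_match(matched, results, re))
        return std::string();
    return results.format(format);
}
```

With this shape, both the pattern and the format (for example "$1" for %(\w+)% or "$1.$2" for %(\w+)%(\w+)%) could come from configuration data without touching the code.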

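sehe stated in a comment elsewhere that there is no way to use capture groups from a token's regular expression. Not to mention that tokens actually support only a subset of regular expression syntax. (Among the notable differences is, for example, the lack of support for naming capture groups or ignoring them!)

My own experiments in this area support that as well. Sadly, there is no way to use capture groups. There is a workaround, though: you simply re-apply the regular expression in the action.

Action to Get the Capture Range

To make it a bit modular, let's start with the simplest task: an action that returns a boost::iterator_range for the part of the token's match that corresponds to the specified capture.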

template<typename Attribute, typename Char, typename Idtype>
class basic_get_capture
{
public:
    typedef lex::token_def<Attribute, Char, Idtype> token_type;
    typedef boost::basic_regex<Char> regex_type;
    explicit basic_get_capture(token_type const& token, int capture_index = 1)
        : token(token),
          regex(),
          capture_index(capture_index)
    {
    }
    template<typename Iterator, typename IdType, typename Context>
    boost::iterator_range<Iterator> operator ()(Iterator& first, Iterator& last, lex::pass_flags& /*flag*/, IdType& /*id*/, Context& /*context*/)
    {
        typedef boost::match_results<Iterator> match_results_type;
        match_results_type results;
        regex_match(first, last, results, get_regex());
        typename match_results_type::const_reference capture = results[capture_index];
        return boost::iterator_range<Iterator>(capture.first, capture.second);
    }
private:
    regex_type& get_regex()
    {
        if(regex.empty())
        {
            // 'typename' is required because string_type is a dependent type
            typename token_type::string_type const& regex_text = token.definition();
            regex.assign(regex_text);
        }
        return regex;
    }
    token_type const& token;
    regex_type regex;
    int capture_index;
};
template<typename Attribute, typename Char, typename Idtype>
basic_get_capture<Attribute, Char, Idtype> get_capture(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_get_capture<Attribute, Char, Idtype>(token, capture_index);
}

The action uses Boost.Regex (include <boost/regex.hpp>).
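The action above does not check the result of regex_match. That is probably fine when the lexer has already matched the token with the same pattern, but since the lexer's regex dialect differs from that of a full regex library, a defensive variant may be worth considering. The following is a minimal sketch of the same technique outside of Spirit, assuming std::regex in place of Boost.Regex and std::pair in place of boost::iterator_range. The names text_range and capture_range are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

typedef std::string::const_iterator text_iterator;
typedef std::pair<text_iterator, text_iterator> text_range;

// Re-applies `re` to [first, last) and returns the sub-range of capture
// `index` without copying any characters. Returns the empty range
// [last, last) if the regex does not match the whole input or has no such
// capture group.
inline text_range capture_range(text_iterator first, text_iterator last,
                                const std::regex& re, std::size_t index = 1)
{
    std::match_results<text_iterator> results;
    if (!std::regex_match(first, last, results, re) || index >= results.size())
        return text_range(last, last);
    return text_range(results[index].first, results[index].second);
}
```

The returned iterators point into the original text, so the range stays valid only as long as that text does.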

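Action to Get the Capture as a String

A capture range is a nice thing, since it does not allocate any new memory for a string, but it is a string that we ultimately want. So here is another action, built on top of the previous one.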

template<typename Attribute, typename Char, typename Idtype>
class basic_get_capture_as_string
{
public:
    typedef basic_get_capture<Attribute, Char, Idtype> basic_get_capture_type;
    typedef typename basic_get_capture_type::token_type token_type;
    explicit basic_get_capture_as_string(token_type const& token, int capture_index = 1)
        : get_capture_functor(token, capture_index)
    {
    }
    template<typename Iterator, typename IdType, typename Context>
    std::basic_string<Char> operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
    {
        boost::iterator_range<Iterator> const& capture = get_capture_functor(first, last, flag, id, context);
        return std::basic_string<Char>(capture.begin(), capture.end());
    }
private:
    basic_get_capture_type get_capture_functor;
};
template<typename Attribute, typename Char, typename Idtype>
basic_get_capture_as_string<Attribute, Char, Idtype> get_capture_as_string(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_get_capture_as_string<Attribute, Char, Idtype>(token, capture_index);
}

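No magic here. We just construct a std::basic_string from the range returned by the simpler action.

Action to Assign the Value from the Capture

Actions that return a value are of little use to us. The ultimate goal is to set the token value from the capture. That is done by the last action.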

template<typename Attribute, typename Char, typename Idtype>
class basic_set_val_from_capture
{
public:
    typedef basic_get_capture_as_string<Attribute, Char, Idtype> basic_get_capture_as_string_type;
    typedef typename basic_get_capture_as_string_type::token_type token_type;
    explicit basic_set_val_from_capture(token_type const& token, int capture_index = 1)
        : get_capture_as_string_functor(token, capture_index)
    {
    }
    template<typename Iterator, typename IdType, typename Context>
    void operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
    {
        std::basic_string<Char> const& capture = get_capture_as_string_functor(first, last, flag, id, context);
        context.set_value(capture);
    }
private:
    basic_get_capture_as_string_type get_capture_as_string_functor;
};
template<typename Attribute, typename Char, typename Idtype>
basic_set_val_from_capture<Attribute, Char, Idtype> set_val_from_capture(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_set_val_from_capture<Attribute, Char, Idtype>(token, capture_index);
}

Discussion

The actions are used like this:

variable[set_val_from_capture(variable)]

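Optionally, you can provide a second argument: the index of the capture to use. It defaults to 1, which seems suitable in most cases.

Creation Functions

set_val_from_capture (or get_capture_as_string or get_capture, respectively) is a helper function used to deduce the template arguments automatically from the token_def. In particular, what we need is the Char type, in order to build the corresponding regular expression.

I am not sure whether this could be reasonably avoided, and even if it could, it would significantly complicate the call operator (especially if we strive to cache the regex object instead of rebuilding it every time). My doubts come mostly from not being sure whether the Char type of the token_def is required to be the same as the character type of the tokenized sequence. I assumed that they need not be the same.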
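The role of the Char type and of the caching can be shown in isolation. The sketch below is a standard-library analogue of get_regex(), assuming std::basic_regex instead of boost::basic_regex. Since std::basic_regex has no empty() member, a flag is used instead. The class lazy_regex and its compile_count() member are hypothetical and exist only to make the caching observable.

```cpp
#include <regex>
#include <string>

// Holds the pattern text and compiles it into a regex on first use only.
// Char decides which regex type is built (char, wchar_t, ...).
template <typename Char>
class lazy_regex
{
public:
    typedef std::basic_string<Char> string_type;
    typedef std::basic_regex<Char> regex_type;

    explicit lazy_regex(const string_type& pattern)
        : pattern_(pattern), regex_(), compiled_(false), compile_count_(0)
    {
    }

    // Compiles the pattern the first time and reuses the result afterwards.
    const regex_type& get()
    {
        if (!compiled_)
        {
            regex_.assign(pattern_);
            compiled_ = true;
            ++compile_count_;
        }
        return regex_;
    }

    // Number of times the pattern was actually compiled (for illustration).
    int compile_count() const { return compile_count_; }

private:
    string_type pattern_;
    regex_type regex_;
    bool compiled_;
    int compile_count_;
};
```

Without knowing Char up front, the regex type could not be a plain data member, which is the complication the paragraph above refers to.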

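Repeating the Token

The definitely unpleasant part of the action is the need to supply the token itself as an argument, thereby repeating it.

However, the token is needed for the Char type, as mentioned above, and... to get the regular expression!

It seems to me that, at least in theory, we could obtain the token somehow "at run time" based on the id argument of the action (which we currently just ignore). However, I failed to find any way to get a token_def from the token's identifier, whether from the context argument or from the lexer itself (which could be passed to the action as this through the creation function).

Reusability

Since these are actions, they are not really reusable (out of the box) in more complex scenarios. For example, if you wanted not only to get the capture but also to convert it to some numeric value, you would have to write another action in the same manner, rather than performing the more complex operation at the token.

At first I tried to achieve something like this: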

variable[lex::_val = get_capture_as_string(variable)]

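This seems much more flexible, as you could easily add more code around it, for example by wrapping it in some conversion function.

But I failed to achieve it, although I feel I did not try hard enough. Learning more about Boost.Phoenix would surely help a lot here.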
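The kind of composition meant here can be sketched outside of Spirit, where a value-returning function is easy to wrap. This does not solve the Phoenix integration problem described above; it only illustrates why a value-returning building block is more reusable than one that sets the value itself. std::regex stands in for Boost.Regex, and the names capture_as_string and convert_capture are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns capture `index` of `re` applied to the whole of `matched`,
// or an empty string if there is no match or no such capture group.
inline std::string capture_as_string(const std::string& matched,
                                     const std::regex& re,
                                     std::size_t index = 1)
{
    std::smatch results;
    if (!std::regex_match(matched, results, re) || index >= results.size())
        return std::string();
    return results[index].str();
}

// Wraps the capture in an arbitrary conversion supplied by the caller.
// The result type is whatever `convert` returns for a std::string.
template <typename Convert>
auto convert_capture(const std::string& matched, const std::regex& re,
                     Convert convert, std::size_t index = 1)
    -> decltype(convert(std::string()))
{
    return convert(capture_as_string(matched, re, index));
}
```

A caller could pass, for example, a lambda that calls std::stoi to obtain a numeric value from a pattern such as #(\d+); note that std::stoi throws if the capture is empty.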

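Double Work

None of this workaround saves us from doing double work, both in parsing the regular expression and in matching it. But as mentioned at the beginning, there seems to be no better way (without changing Boost.Spirit itself).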