Memory-efficient lexical analysis with iterators

Keywords: memory-efficient, lexical analysis, iterators. Updated: 2023-10-16

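I've been trying to design a lexer (for a programming language) that doesn't accumulate tokens in an intermediate list. This should be straightforward, and in C++ I figured it would be a good use of iterators (I'm not a C++ expert, by the way). That said, I can't seem to find a satisfying solution. This is the most logical thing I could come up with in terms of iterators: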

enum class symbol {
  IDENTIFIER,
  ...
};
struct token {
  symbol kind;
  std::string::const_iterator lexeme_begin;
  std::string::const_iterator lexeme_end;
};
class lexer {
private:
  std::string::const_iterator begin_, end_;
public:
  lexer(
      std::string::const_iterator begin,
      std::string::const_iterator end) :
      begin_ {begin}, end_ {end} {};
  class iterator;
  iterator begin() {
    return {begin_, end_};
  }
  iterator end() {
    // Can't figure out what to do here.
  }
};
class lexer::iterator {
private:
  std::string::const_iterator begin_, end_, next_;
public:
  iterator(
      std::string::const_iterator begin,
      std::string::const_iterator end) :
      begin_ {begin}, end_ {end}, next_ {begin} {}
  iterator& operator++() {
    if (next_ == end_) {
      // Same problem as in lexer::end.
    }
    begin_ = next_;
    return *this;
  }
  token operator*() {
    // Perform actual lexical analysis here.
  }
};

I'd like to be able to do something like this:

for (auto token : lexer {"abc 123"}) {
  std::cout << token;
}

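My question is: is this an appropriate use of iterators? If so, how should I handle the lexer::end() iterator? The only way I can think of to implement lexer::end() is to return a special instance of lexer::iterator, but that doesn't seem like a good solution to me. Another thing that bothers me a little is that every iterator has to contain the same iterator to the end of the string, although that doesn't seem to be a real problem.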

Yes, this looks like a reasonable place to use iterators. I haven't seen a lexer written this way, but it seems sensible.

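I see no objection to using a distinguished instance as the value of end(). For a lexer it is often convenient to return an EOF token that isn't actually in the source, and that is effectively what the value of end() is. It could well be static.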
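As a rough illustration of the "distinguished end value" idea, here is a minimal sketch in which end() is simply an iterator whose current position is the end of the input. The token kinds (IDENTIFIER, NUMBER, UNKNOWN), the scan() helper, token::text() and the lexemes() function are all assumptions made up for this example, not part of the question's code, and the scanning rules are deliberately simplistic.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token kinds, chosen only for this sketch.
enum class symbol { IDENTIFIER, NUMBER, UNKNOWN };

struct token {
  symbol kind;
  std::string::const_iterator lexeme_begin, lexeme_end;
  // Copy the lexeme out of the source (convenience helper for the example).
  std::string text() const { return std::string(lexeme_begin, lexeme_end); }
};

class lexer {
  std::string::const_iterator begin_, end_;
public:
  lexer(std::string::const_iterator begin, std::string::const_iterator end)
      : begin_{begin}, end_{end} {}

  class iterator {
    std::string::const_iterator begin_, next_, end_;
    symbol kind_ = symbol::UNKNOWN;

    // Skip whitespace, then scan one lexeme into [begin_, next_).
    // When no input is left, begin_ == end_, which is the end state.
    void scan() {
      begin_ = next_;
      while (begin_ != end_ &&
             std::isspace(static_cast<unsigned char>(*begin_))) ++begin_;
      next_ = begin_;
      if (begin_ == end_) return;
      unsigned char c = static_cast<unsigned char>(*begin_);
      if (std::isalpha(c)) {
        kind_ = symbol::IDENTIFIER;
        while (next_ != end_ &&
               std::isalnum(static_cast<unsigned char>(*next_))) ++next_;
      } else if (std::isdigit(c)) {
        kind_ = symbol::NUMBER;
        while (next_ != end_ &&
               std::isdigit(static_cast<unsigned char>(*next_))) ++next_;
      } else {
        kind_ = symbol::UNKNOWN;
        ++next_;  // consume a single unrecognized character
      }
    }

  public:
    iterator(std::string::const_iterator begin,
             std::string::const_iterator end)
        : begin_{begin}, next_{begin}, end_{end} { scan(); }

    token operator*() const {
      assert(begin_ != end_);  // the end iterator must not be dereferenced
      return {kind_, begin_, next_};
    }
    iterator& operator++() { scan(); return *this; }
    bool operator==(const iterator& other) const { return begin_ == other.begin_; }
    bool operator!=(const iterator& other) const { return begin_ != other.begin_; }
  };

  iterator begin() const { return {begin_, end_}; }
  // The distinguished end value: an iterator already positioned at the end.
  iterator end() const { return {end_, end_}; }
};

// Usage example: walk the tokens with a range-based for loop. The vector is
// only here so the result can be inspected; the lexer itself stores no tokens.
std::vector<std::string> lexemes(const std::string& source) {
  std::vector<std::string> out;
  for (token t : lexer{source.cbegin(), source.cend()}) out.push_back(t.text());
  return out;
}
```

Each token is produced on demand by operator++, so no intermediate token list is built. Note that the tokens hold iterators into the source string, so the string has to outlive them.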

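I don't really see why you need a class and a nested class to express basic tokenization logic. What you've shown us leaves out all the interesting parts that actually deliver tokens, and it seems like overkill for a simple tokenizer with iterator support.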

I assume your begin/end/next iterators serve as pointers into the character stream, so I don't understand the remark that "each iterator has to contain the same iterator to the end of the string".

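I ended up implementing the lexer with a custom interface rather than iterators. My confusion was that I was using a C++ iterator as if it were a generator (http://en.wikipedia.org/wiki/Generator_(computer_programming)). For simple input parsing the standard library provides input streams (http://en.cppreference.com/w/cpp/io), but for more robust parsing I'd suggest using a parsing library such as Boost's Spirit, or simply writing your own custom interface.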
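The answer does not show its custom interface, so the following is only a guess at what a generator-style design could look like: a single next() member that returns the next token, or std::nullopt once the input is exhausted. Everything here is an assumption made for illustration, including the use of C++17 std::string_view and std::optional, the token kinds, and the lex_all() helper.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical token kinds, chosen only for this sketch.
enum class symbol { IDENTIFIER, NUMBER, UNKNOWN };

struct token {
  symbol kind;
  std::string_view lexeme;  // view into the source text, no copy
};

// Character classification helpers; the cast avoids undefined behaviour
// when char is signed and holds a negative value.
inline bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
inline bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
inline bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
inline bool is_alnum(char c) { return std::isalnum(static_cast<unsigned char>(c)) != 0; }

class lexer {
  std::string_view rest_;  // the part of the input not yet consumed
public:
  explicit lexer(std::string_view source) : rest_{source} {}

  // Generator-style interface: each call yields one token on demand,
  // and std::nullopt signals the end of the input.
  std::optional<token> next() {
    while (!rest_.empty() && is_space(rest_.front())) rest_.remove_prefix(1);
    if (rest_.empty()) return std::nullopt;

    std::size_t n = 1;  // length of the lexeme being scanned
    symbol kind = symbol::UNKNOWN;
    if (is_alpha(rest_[0])) {
      kind = symbol::IDENTIFIER;
      while (n < rest_.size() && is_alnum(rest_[n])) ++n;
    } else if (is_digit(rest_[0])) {
      kind = symbol::NUMBER;
      while (n < rest_.size() && is_digit(rest_[n])) ++n;
    }
    assert(n <= rest_.size());  // precondition of remove_prefix
    token t{kind, rest_.substr(0, n)};
    rest_.remove_prefix(n);
    return t;
  }
};

// Usage example: pull tokens until the lexer reports exhaustion. The vector
// exists only so the result can be inspected; a parser would normally consume
// each token as it arrives.
std::vector<std::string> lex_all(std::string_view source) {
  std::vector<std::string> out;
  lexer lx{source};
  while (auto t = lx.next()) out.emplace_back(t->lexeme);
  return out;
}
```

Compared with the iterator version, there is no end() value to design and no comparison operators to write; the trade-off is that the lexer can no longer be used directly in a range-based for loop or with standard algorithms.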