Parsing tokens issue during interpreter development

Keywords: tokens, issue, interpreter, development. Last updated: 2023-10-16

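I am building a code interpreter in C++, and just as I got the whole token logic working, I ran into an unexpected problem.

The user types a string into the console, and the program parses that string into objects of type Token. The problem is the way I am doing it, which is as follows: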

void splitLine(string aLine) {
    stringstream ss(aLine);
    string stringToken, outp;
    char delim = ' ';
    // Break the input string aLine into tokens and store them in R_Tokens
    while (getline(ss, stringToken, delim)) {
        // Assign the parsed value of stringToken to t; this labels invalid tokens
        Token t(readToken(stringToken));
        R_Tokens.push_back(t);
    }
}

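The problem here is that if the parser receives a string such as Hello World!, it splits it into two tokens, Hello and World!

The main goal is for the code to recognize a double quote as the start of a string token and to store the whole thing (from " to ") as a single token. So if I type x = "hello world", it should store x as a token, then =, and then hello world as one token instead of splitting it.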

You can use the C++14 std::quoted manipulator.

#include <string>
#include <sstream>
#include <iomanip>
#include <iostream>

void splitLine(std::string aLine) {
    std::istringstream iss(aLine);
    std::string stringToken;
    // Break the input string aLine into tokens; std::quoted keeps a
    // double-quoted run together as a single token
    while (iss >> std::quoted(stringToken)) {
        std::cout << stringToken << "\n";
    }
}

int main() {
    splitLine("Hello world \"single token\" new tokens");
}

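You really don't want to tokenize a programming language by splitting at a delimiter.

A proper tokenizer switches on the first character to decide what kind of token to read, keeps reading as long as it finds characters that fit that token type, and emits the token when it finds the first character that does not match (which then becomes the starting point of the next token).

That could look something like this (assuming it is an istreambuf_iterator or some other iterator that walks the input character by character):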

Token Tokenizer::next_token() {
    if (isalpha(*it)) {
        return read_identifier();
    } else if (isdigit(*it)) {
        return read_number();
    } else if (*it == '"') {
        return read_string();
    } /* ... */
}

Token Tokenizer::read_string() {
    // This should only be called when the current character is a "
    assert(*it == '"');
    it++; // skip the opening quote
    string contents;
    while (*it != '"') {
        contents.push_back(*it);
        it++;
    }
    it++; // skip the closing quote so the next token starts after it
    return Token(TokenKind::StringToken, contents);
}

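This does not handle escape sequences or the case where we reach the end of the file without seeing the second ", but it should give you the basic idea.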
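As a rough sketch of how those two gaps might be closed, the function below works on a std::string and an index rather than on the iterator used above. The name readStringLiteral, its signature, and the small set of escapes it recognizes are assumptions made for illustration, not part of the original answer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original answer).
// Expects src[pos] to be the opening '"'. On success, stores the literal's
// contents in out, leaves pos just past the closing quote, and returns true.
// Returns false if the input ends before the closing quote is found.
bool readStringLiteral(const std::string& src, std::size_t& pos, std::string& out) {
    if (pos >= src.size() || src[pos] != '"') {
        return false;
    }
    ++pos; // skip the opening quote
    std::string contents;
    while (pos < src.size()) {
        char c = src[pos++];
        if (c == '"') {
            out = contents; // closing quote has been consumed
            return true;
        }
        if (c == '\\') {
            if (pos >= src.size()) {
                break; // dangling backslash at end of input
            }
            char escaped = src[pos++];
            switch (escaped) {
                case 'n': contents.push_back('\n'); break;
                case 't': contents.push_back('\t'); break;
                // Any other escaped character (including a quote or a
                // backslash) is taken literally in this sketch.
                default: contents.push_back(escaped); break;
            }
        } else {
            contents.push_back(c);
        }
    }
    return false; // unterminated string literal
}
```

A real tokenizer would probably report the unterminated case as an error token with a source position instead of a bare false.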

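Something like std::quoted might solve your immediate problem with string literals, but it won't help if you want x="hello world" to be tokenized the same way as x = "hello world" (which you almost certainly do).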
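To make that limitation concrete, here is a small sketch that collects the tokens produced by std::quoted into a vector. The helper name splitQuoted is made up for this example.

```cpp
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a line using operator>> with std::quoted.
// std::quoted only treats a '"' specially when it is the first
// non-whitespace character of the next extraction.
std::vector<std::string> splitQuoted(const std::string& line) {
    std::istringstream iss(line);
    std::vector<std::string> tokens;
    std::string token;
    while (iss >> std::quoted(token)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

With the input x = "hello world" this yields three tokens: x, = and hello world. With x="hello world" the quote is not at the start of a whitespace-separated word, so the result is the two tokens x="hello and world", which is not what an interpreter wants.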

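PS: You can also read the whole source into memory first and have your tokens contain indices or pointers into the source instead of strings (so instead of the contents variable, you would just save the start index before the loop and then return Token(TokenKind::StringToken, start_index, current_index)). Which one is better depends partly on what you do in the parser. If your parser produces results directly and you don't need to keep the tokens around after processing them, the first approach is more memory-efficient because you never need to hold the whole source in memory. If you build an AST, memory consumption will be about the same either way, but the second version lets you have one big string instead of many small ones.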
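A minimal sketch of the index-based variant might look like the following. The Token layout, the text() accessor and the function name readStringSpan are assumptions for illustration; the original answer only describes the idea.

```cpp
#include <cstddef>
#include <string>

enum class TokenKind { StringToken };

// Hypothetical token that refers to the half-open range [start, end)
// of the source text instead of owning a copy of it.
struct Token {
    TokenKind kind;
    std::size_t start;
    std::size_t end;

    // Materialize the text only when it is actually needed.
    std::string text(const std::string& source) const {
        return source.substr(start, end - start);
    }
};

// Assumes source[pos] is the opening '"'. Leaves pos just past the closing
// quote, or at the end of the source if the literal is unterminated.
// Escape sequences are not handled in this sketch.
Token readStringSpan(const std::string& source, std::size_t& pos) {
    ++pos; // skip the opening quote
    std::size_t start = pos;
    while (pos < source.size() && source[pos] != '"') {
        ++pos;
    }
    Token token{TokenKind::StringToken, start, pos};
    if (pos < source.size()) {
        ++pos; // skip the closing quote
    }
    return token;
}
```

The source string has to outlive every token that points into it, which is the trade-off the paragraph above describes.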

So I finally figured it out: I can use getline() to achieve my goal.

This new code runs and parses the way I need it to:

void splitLine(string aLine) {
    stringstream ss(aLine);
    string stringToken, outp;
    char delim = ' ';
    // Break the line into tokens and store them in R_Tokens
    while (getline(ss, stringToken, delim)) {
        // New code starts here.
        // If the current substring opens a quoted string that it does not
        // also close (a one-word literal such as "hi" is already complete)
        if (stringToken[0] == '"' && (stringToken.size() == 1 || stringToken.back() != '"')) {
            string torzen;
            // Read the rest of ss up to the next double quote
            getline(ss, torzen, '"');
            // Restore the space removed by the first getline(), append the
            // substring read by the second getline(), and add back the
            // closing double quote that the second getline() consumed
            stringToken += ' ' + torzen + '"';
        }
        // And we can all continue with our lives
        // Assign the parsed value of stringToken to t; this labels invalid tokens
        Token t(readToken(stringToken));
        R_Tokens.push_back(t);
    }
}

Thanks to everyone who answered and commented, you have all been a great help!