Using stringstream to tokenize a string with different delimiters

Keywords: string, delimiter. Last updated: 2023-10-16

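How can I use a stringstream to tokenize lines like this?

[label] opcode [arg1][,arg2]

The label may not always be present, but when it is missing there will be whitespace in its place. The opcode is always present, and there is a space or tab between the opcode and arg1. There is no whitespace between arg1 and arg2; they are separated by a comma.

Also, some blank lines contain whitespace and need to be discarded. "#" starts a comment.

For example: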

#Sample Input
TOP  NoP
L   2,1
VAL  INT  0

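This is just a sample of the text file I'll be reading. So in the first line the label would be TOP and the opcode would be NOP, with no arguments passed.

I've been working on this, but I need a simpler way to tokenize, and from what I've seen stringstream looks like what I want to use. If anyone can show me how to do this, I'd really appreciate it.

I've been racking my brain over how to do this. Just to show that I'm not asking without having done any work, here is my code so far: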

int counter = 0;
int i = 0;
int j = 0;
int p = 0;
while (getline(myFile, line, '\n'))
{

if (line[0] == '#')
{
continue;
}
if (line.length() == 0)
{
continue;
}
if (line.empty())
{
continue;
}
// If the first letter isn't a tab or space then it's a label
if (line[0] != '\t' && line[0] != ' ')
{
string delimeters = "\t ";
int current;
int next = -1;

current = next + 1;
next = line.find_first_of( delimeters, current);
label = line.substr( current, next - current );
Symtablelab[i] = label;
Symtablepos[i] = counter;
if(next>0)
{
current = next + 1;
next = line.find_first_of(delimeters, current);
opcode = line.substr(current, next - current);

if (opcode != "WORDS" && opcode != "INT")
{
counter += 3;
}
if (opcode == "INT")
{
counter++;
}
if (next > 0)
{
delimeters = ", \n\t";
current = next + 1;
next = line.find_first_of(delimeters, current);
arg1 = line.substr(current, next-current);
if (opcode == "WORDS")
{
counter += atoi(arg1.c_str());
}
}
if (next > 0)
{
delimeters ="\n";
current = next +1;
next = line.find_first_of(delimeters,current);
arg2 = line.substr(current, next-current);
}
}
i++;
}
// If the first character is a tab or space then there is no label and we just need to get a counter
if (line[0] == '\t' || line[0] == ' ')
{
string delimeters = "\t \n";
int current;
int next = -1;
current = next + 1;
next = line.find_first_of( delimeters, current);
label = line.substr( current, next - current );
if(next>=0)
{
current = next + 1;
next = line.find_first_of(delimeters, current);
opcode = line.substr(current, next - current);
if (opcode == "\t" || opcode =="\n"|| opcode ==" ")
{
continue;
}
if (opcode != "WORDS" && opcode != "INT")
{
counter += 3;
}
if (opcode == "INT")
{
counter++;
}

if (next > 0)
{
delimeters = ", \n\t";
current = next + 1;
next = line.find_first_of(delimeters, current);
arg1 = line.substr(current, next-current);
if (opcode == "WORDS")
{
counter += atoi(arg1.c_str());
}
}

if (next > 0)
{
delimeters ="\n\t ";
current = next +1;
next = line.find_first_of(delimeters,current);
arg2 = line.substr(current, next-current);
}
}
}
}
myFile.clear();
myFile.seekg(0, ios::beg);
while(getline(myFile, line))
{
if (line.empty())
{
continue;
}
if (line[0] == '#')
{
continue;
}
if (line.length() == 0)
{
continue;
}

// If the first letter isn't a tab or space then it's a label
if (line[0] != '\t' && line[0] != ' ')
{
string delimeters = "\t ";
int current;
int next = -1;

current = next + 1;
next = line.find_first_of( delimeters, current);
label = line.substr( current, next - current );

if(next>0)
{
current = next + 1;
next = line.find_first_of(delimeters, current);
opcode = line.substr(current, next - current);

if (next > 0)
{
delimeters = ", \n\t";
current = next + 1;
next = line.find_first_of(delimeters, current);
arg1 = line.substr(current, next-current);
}
if (next > 0)
{
delimeters ="\n\t ";
current = next +1;
next = line.find_first_of(delimeters,current);
arg2 = line.substr(current, next-current);
}
}
if (opcode == "INT")
{
memory[p] = arg1;
p++;
continue;
}
if (opcode == "HALT" || opcode == "NOP" || opcode == "P_REGS")
{
memory[p] = opcode;
p+=3;
continue;
}
if(opcode == "J" || opcode =="JEQR" || opcode == "JNE" || opcode == "JNER" || opcode == "JLT" || opcode == "JLTR" || opcode == "JGT" || opcode == "JGTR" || opcode == "JLE" || opcode == "JLER" || opcode == "JGE" || opcode == "JGER" || opcode == "JR")
{
memory[p] = opcode;
memory[p+1] = arg1;
p+=3;
continue;
}
if (opcode == "WORDS")
{
int l = atoi(arg1.c_str());
for (int k = 0; k <= l; k++)
{
memory[p+k] = "0";
}
p+=l;
continue;
}
else
{
memory[p] = opcode;
memory[p+1] = arg1;
memory[p+2] = arg2;
p+=3;
}
}
// If the first character is a tab or space then there is no label and we just need to get a counter        

if (line[0] == '\t' || line[0] == ' ')
{
string delimeters = "\t ";
int current;
int next = -1;
current = next + 1;
next = line.find_first_of( delimeters, current);
label = line.substr( current, next - current );
if(next>=0)
{
current = next + 1;
next = line.find_first_of(delimeters, current);
opcode = line.substr(current, next - current);
if (opcode == "\t" || opcode =="\n"|| opcode ==" "|| opcode == "")
{
continue;
}

if (next > 0)
{
delimeters = ", \n\t";
current = next + 1;
next = line.find_first_of(delimeters, current);
arg1 = line.substr(current, next-current);
}

if (next > 0)
{
delimeters ="\n\t ";
current = next +1;
next = line.find_first_of(delimeters,current);
arg2 = line.substr(current, next-current);
}
}
if (opcode == "INT")
{
memory[p] = arg1;
p++;
continue;
}
if (opcode == "HALT" || opcode == "NOP" || opcode == "P_REGS")
{
memory[p] = opcode;
p+=3;
continue;
}
if(opcode == "J" || opcode =="JEQR" || opcode == "JNE" || opcode == "JNER" || opcode == "JLT" || opcode == "JLTR" || opcode == "JGT" || opcode == "JGTR" || opcode == "JLE" || opcode == "JLER" || opcode == "JGE" || opcode == "JGER" || opcode == "JR")
{
memory[p] = opcode;
memory[p+1] = arg1;
p+=3;
continue;
}
if (opcode == "WORDS")
{
int l = atoi(arg1.c_str());
for (int k = 0; k <= l; k++)
{
memory[p+k] = "0";
}
p+=l;
continue;
}
else
{
memory[p] = opcode;
memory[p+1] = arg1;
memory[p+2] = arg2;
p+=3;
}
}
}

I obviously want to make this better, so any help would be greatly appreciated.

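Before you go crazy maintaining those huge if statements or trying to learn Boost Spirit, let's try to write a very simple parser. This is a rather long post and it doesn't get straight to the point, so bear with me.

First, we need a grammar, which seems to be pretty simple: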

line
    label(optional)   opcode   argument-list(optional)
argument-list
    argument
    argument, argument-list

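In English: a line of code consists of an optional label, an opcode, and an optional argument list. An argument list is either a single argument (an integer) or an argument followed by a separator (a comma) and another argument list.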
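To make the argument-list rule concrete, here is a minimal sketch of a standalone function that splits a token such as "2,1" on commas using std::getline with a custom delimiter. The name parse_argument_list and the use of std::invalid_argument are assumptions made for illustration; they are not part of the code in this answer.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): parses "2,1" into {2, 1}.
// Follows the rule: argument-list := argument | argument ',' argument-list
std::vector<int> parse_argument_list(const std::string& text)
{
    std::vector<int> args;
    if (text.empty()) return args;          // the list itself is optional

    std::istringstream in(text);
    std::string item;
    while (std::getline(in, item, ',')) {   // ',' is used instead of '\n'
        std::size_t used = 0;
        int value = 0;
        try {
            value = std::stoi(item, &used);
        } catch (const std::exception&) {
            throw std::invalid_argument("argument is not an integer: " + item);
        }
        if (used != item.size())
            throw std::invalid_argument("trailing characters in argument: " + item);
        args.push_back(value);
    }
    // getline drops an empty final field, so "2," has to be rejected by hand.
    if (text.back() == ',')
        throw std::invalid_argument("missing argument after ','");
    return args;
}
```

The loop is the iterative form of the recursive rule: each pass consumes one argument and the comma after it, if there is one.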

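Let's first define two data structures. Labels should be unique (right?), so we'll keep a set of strings; that way we can easily look them up at any time and report an error if we find a duplicate label. The second one is a map from string to size_t, which acts as a symbol table of valid opcodes together with the expected number of arguments for each opcode.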

#include <cstddef>
#include <map>
#include <set>
#include <string>

std::set<std::string> labels;
std::map<std::string, std::size_t> symbol_table = {
    { "INT", 1 },
    { "NOP", 0 },
    { "L",   2 }
};
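One thing to keep in mind when consulting a table like this: std::map::operator[] inserts a default entry (here, an argument count of 0) when the key is missing, so an unknown opcode would quietly become a "valid" one. A small sketch of a lookup that avoids this with find() follows; the helper name lookup_opcode is an assumption for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: returns true and stores the expected argument count
// if `opcode` is in `table`; returns false and leaves the table unchanged otherwise.
bool lookup_opcode(const std::map<std::string, std::size_t>& table,
                   const std::string& opcode, std::size_t& arg_count)
{
    std::map<std::string, std::size_t>::const_iterator it = table.find(opcode);
    if (it == table.end()) return false;
    arg_count = it->second;
    return true;
}
```

Taking the table by const reference also makes it impossible to call operator[] on it by accident, since that operator is not available on a const map.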

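I don't know exactly what memory is in your code, but the way you compute offsets to place the arguments seems more complicated than it needs to be. Let's define a data structure that can hold a line of code gracefully. I'd do it like this: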

#include <string>
#include <vector>

typedef std::vector<int> arg_list;

struct code_line {
    code_line() : label(), opcode(), args() {}
    std::string  label;      // labels are optional, so an empty string
                             // means there is no label
    std::string  opcode;     // opcode, doh
    arg_list     args;       // variable number of arguments; it can be empty, too.
                             // It needs to match the opcode; we'll deal with
                             // that later
};

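A syntax error is an exceptional situation that isn't easy to recover from, so let's handle such errors by throwing exceptions. Our simple exception class can look like this: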

struct syntax_error {
    syntax_error(std::string m) : msg(m) { }
    std::string msg;
};

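Tokenizing, lexing, and parsing are usually separate tasks. But I think for this simple example we can combine the tokenizer and the lexer in a single class. We already know which elements the grammar is made of, so let's write a class that takes the input as text and extracts the grammar elements from it: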

#include <sstream>
#include <string>

class token_stream {
    std::istringstream stream; // stringstream for input
    std::string buffer;        // a buffer for a token, more on this later
public:
    token_stream(std::string str) : stream(str), buffer() { }
    // these methods are self-explanatory
    std::string get_label();
    std::string get_opcode();
    arg_list get_arglist();
    // we're taking a kind of top-down approach with this,
    // so let's forget about implementations for now
};

And the workhorse, a function that tries to make sense of the tokens and, if everything goes well, returns a code_line struct:

code_line parse(std::string line)
{
    code_line temp;
    token_stream stream(line);
    // Again, self-explanatory: get a label, opcode and argument list from
    // the token stream.
    temp.label = stream.get_label();
    temp.opcode = stream.get_opcode();
    temp.args = stream.get_arglist();
    // Everything went fine so far; remember we said we'd be throwing exceptions
    // in case of syntax errors.
    // Now we can check if we got the correct number of arguments for the given opcode.
    // find() is used rather than operator[], which would silently insert an
    // unknown opcode into the table with an argument count of 0.
    auto entry = symbol_table.find(temp.opcode);
    if (entry == symbol_table.end()) {
        throw syntax_error("Unknown opcode.");
    }
    if (entry->second != temp.args.size()) {
        throw syntax_error("Wrong number of parameters.");
    }
    // The last thing: if there's a label in the line, we insert it in the table.
    // We couldn't do that inside the get_label method, because at that time
    // we didn't yet know if the rest of the line is syntactically valid, and an
    // exception thrown would have left us with a "dangling" label in the table.
    if (!temp.label.empty()) labels.insert(temp.label);
    return temp;
}

Here's how we use all of this:

#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string line;
    std::vector<code_line> code;
    while (std::getline(std::cin, line)) {
        // empty line or a comment, ignore it
        if (line.empty() || line[0] == '#') continue;
        try {
            code.push_back(parse(line));
        } catch (syntax_error& e) {
            std::cout << e.msg << '\n';
            // Give up, try again, log... up to you.
        }
    }
}

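If the input was parsed successfully, we now have a vector of valid lines with all the information (label, number of arguments), and we can do whatever we like with it. IMO, this code will be much easier to maintain and extend than yours. For example, if you need to introduce a new opcode, just add another entry to the map (symbol_table). How does that compare with your if statements? :)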
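The question mentions that some "blank" lines still contain whitespace, which a plain line.empty() test does not catch. A minimal sketch of a stricter filter is below. The helper name is_blank_or_comment is hypothetical, and treating an indented '#' as a comment is an assumption; the question only shows '#' in the first column.

```cpp
#include <string>

// Hypothetical helper: true for empty lines, whitespace-only lines,
// and lines whose first non-blank character is '#'.
bool is_blank_or_comment(const std::string& line)
{
    // '\r' is included so that files with Windows line endings are handled too.
    const std::string::size_type first = line.find_first_not_of(" \t\r");
    return first == std::string::npos || line[first] == '#';
}
```

A read loop could call this in place of the line.empty() and line[0] checks before handing the line to the parser.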

What's left is the actual implementation of the token_stream methods. Here's what I did for get_label:

std::string token_stream::get_label()
{
    std::string temp;
    // Unless the stream is empty (and it shouldn't be, we checked that in main),
    // operator>> for std::string is unlikely to fail. It doesn't hurt to be robust
    // with error checking, though. A syntax_error is thrown (rather than a bare
    // string literal) so that the catch block in main can handle it.
    if (!(stream >> temp)) throw syntax_error("Fatal error, empty line, bad stream?");
    // Ok, we got something. First we should check if the string consists of valid
    // characters - you probably don't want punctuation characters and such in a label.
    // I leave this part out for simplicity.
    // Since labels are optional, we need to check if the token is an opcode.
    // If that's the case, we return an empty (no) label.
    if (symbol_table.find(temp) != symbol_table.end()) {
        buffer = temp;
        return "";
    }
    // Note that above is where that `buffer` member of token_stream class got used.
    // If the token was an opcode, we needed to save it so get_opcode method can make
    // use of it. The other option would be to put the string back in the underlying
    // stringstream, but that's more work and more code. This way, get_opcode needs
    // to check if there's anything in buffer and use it, or otherwise extract from
    // the stringstream normally.
    // Check if the label was used before:
    if (labels.count(temp))
        throw syntax_error("Label already used.");
    return temp;
}

That's it. I leave the rest of the implementation as an exercise for you. Hope it helps. :)
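As a possible starting point for that exercise, here is a minimal sketch of how get_opcode might use the buffer member described in the comments above. It is self-contained, so it repeats a condensed get_label (without the duplicate-label check). The error messages and the unknown-opcode check are assumptions, not something the answer specifies.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

struct syntax_error {
    syntax_error(std::string m) : msg(m) { }
    std::string msg;
};

std::map<std::string, std::size_t> symbol_table = {
    { "INT", 1 }, { "NOP", 0 }, { "L", 2 }
};

class token_stream {
    std::istringstream stream;
    std::string buffer;        // holds an opcode that get_label read by accident
public:
    token_stream(std::string str) : stream(str), buffer() { }
    std::string get_label();   // condensed from the answer
    std::string get_opcode();  // the sketch
};

std::string token_stream::get_label()
{
    std::string temp;
    if (!(stream >> temp)) throw syntax_error("Empty line.");
    if (symbol_table.find(temp) != symbol_table.end()) {
        buffer = temp;         // it was an opcode: stash it for get_opcode
        return "";
    }
    return temp;
}

std::string token_stream::get_opcode()
{
    std::string temp;
    if (!buffer.empty()) {
        temp.swap(buffer);     // take the stashed token and leave buffer empty
    } else if (!(stream >> temp)) {
        throw syntax_error("Missing opcode.");
    }
    // Assumed policy: anything not in the table is a syntax error.
    if (symbol_table.find(temp) == symbol_table.end())
        throw syntax_error("Unknown opcode: " + temp);
    return temp;
}
```

Because operator>> skips leading whitespace, a line that starts with a tab and a line that starts with a label go through the same code path; the table lookup in get_label is what tells them apart.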

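You definitely need a regular-expression library such as Boost.Regex, or a lexing and parsing tool such as lex/yacc, flex/bison, or Boost.Spirit.

Maintaining this kind of complexity with strings and streams is not worth it.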
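For comparison, here is a rough sketch of the regular-expression route. It uses std::regex from the C++11 standard library rather than Boost.Regex, and the pattern, the parsed_line struct, and the function name parse_with_regex are all assumptions for illustration. It accepts at most two integer arguments, matching the [arg1][,arg2] shape in the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct parsed_line {
    std::string label;        // empty when the line has no label
    std::string opcode;
    std::vector<int> args;    // zero, one or two integers
};

// Returns false for lines that do not match "[label] opcode [arg1][,arg2]",
// which includes blank lines and '#' comments.
bool parse_with_regex(const std::string& line, parsed_line& out)
{
    // optional label, whitespace, opcode, then optionally: whitespace, arg1, ",arg2"
    static const std::regex pattern(
        R"((\w+)?\s+(\w+)(?:\s+(-?\d+)(?:,(-?\d+))?)?\s*)");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) return false;

    out.label = m[1].str();   // an unmatched group yields an empty string
    out.opcode = m[2].str();
    out.args.clear();
    if (m[3].matched) out.args.push_back(std::stoi(m[3].str()));
    if (m[4].matched) out.args.push_back(std::stoi(m[4].str()));
    return true;
}
```

Note that this only checks the shape of the line. Whether the opcode exists and takes that many arguments still needs a table lookup, and std::stoi can throw std::out_of_range for very long digit strings.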