Parse files the fast way?

Keywords: file. Updated: 2023-10-16

I am writing a graph library that should read the most common graph formats. One format contains information like this:

e 4 3
e 2 2
e 6 2
e 3 2
e 1 2
....

I want to parse these lines. I looked around on Stack Overflow and found a neat solution. I currently use this approach (file is an fstream):

string line;
while (getline(file, line)) {
    if (line.empty()) continue; // skip empty lines
    stringstream parseline(line);
    char identifier;
    parseline >> identifier; // read the first character
    if (identifier == 'e') {
        int n, m;
        parseline >> n;
        parseline >> m;
        foo(n, m); // here I handle the input
    }
}

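It works fine and as expected, but today, when I tested it with huge graph files (50 MB+), I was shocked to find that this function is the worst bottleneck in the whole program:

The stringstream I use to parse each line takes almost 70% of the total runtime, and the getline call takes 25%. The rest of the program uses only 5%.

Is there a fast way to read these big files, possibly avoiding the slow stringstream and the getline function?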

You can skip double-buffering the string, skip parsing the single character through a stream, and use strtoll to parse the integers, like this:

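string line;
while (getline(file, line)) {
    if (line.empty()) continue; // skip empty lines
    if (line[0] == 'e') {
        char *ptr;
        // strtoll skips leading whitespace itself, so start right after the 'e';
        // this also stays inside the string when the line is just "e"
        int n = static_cast<int>(strtoll(line.c_str() + 1, &ptr, 10));
        int m = static_cast<int>(strtoll(ptr, &ptr, 10));
        foo(n, m); // here I handle the input
    }
}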

In C++, strtoll is declared in the <cstdlib> header.
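The loop above trusts every 'e' line to be well formed. As a minimal sketch of how the end pointer returned by the strto* family can be used to detect malformed lines, the helper below (parse_edge_line is a hypothetical name, not part of the question or of any library) returns false when either number is missing. It uses strtol and long rather than strtoll and int, which is an assumption about the value range.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: parses a line of the form "e <n> <m>".
// Returns true and fills n and m on success, false otherwise.
bool parse_edge_line(const std::string& line, long& n, long& m) {
    if (line.empty() || line[0] != 'e') return false;

    const char* p = line.c_str() + 1;  // strtol skips leading whitespace itself
    char* end = nullptr;

    n = std::strtol(p, &end, 10);
    if (end == p) return false;        // no digits consumed for the first number

    p = end;
    m = std::strtol(p, &end, 10);
    if (end == p) return false;        // no digits consumed for the second number

    return true;
}
```

Inside the getline loop this would be called as parse_edge_line(line, n, m), with foo(n, m) invoked only when it returns true.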

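mmap the file and process it as a single large buffer.

If your system lacks mmap, you can try to read the file into a malloc'ed buffer.

Rationale: most of the time is spent in the transitions from user to system and back during the calls into the C library. Reading in the entire file eliminates almost all of these calls.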
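A rough sketch of the single-buffer idea, assuming a POSIX system. The names read_uint, parse_edges and parse_edge_file are made up for illustration. Because a mapped file is not null-terminated, this sketch swaps strtoll for a small hand-written digit loop that never reads past the end of the buffer; it also assumes the numbers are non-negative and fit in a long.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Reads a non-negative decimal integer from [p, end), skipping leading blanks.
// Advances p past the digits; returns false if no digit was found.
static bool read_uint(const char*& p, const char* end, long& out) {
    while (p < end && (*p == ' ' || *p == '\t')) ++p;
    if (p == end || *p < '0' || *p > '9') return false;
    long v = 0;
    while (p < end && *p >= '0' && *p <= '9') v = v * 10 + (*p++ - '0');
    out = v;
    return true;
}

// Walks the buffer line by line and calls on_edge(n, m) for each "e n m" line.
// Returns the number of edges found.
template <typename F>
std::size_t parse_edges(const char* p, const char* end, F on_edge) {
    std::size_t count = 0;
    while (p < end) {
        if (*p == 'e') {
            ++p;
            long n = 0, m = 0;
            if (read_uint(p, end, n) && read_uint(p, end, m)) {
                on_edge(n, m);
                ++count;
            }
        }
        while (p < end && *p != '\n') ++p;  // skip the rest of the line
        if (p < end) ++p;                    // step over the newline
    }
    return count;
}

// Maps the file read-only and parses it in place (POSIX only).
// Returns the number of edges, or -1 on I/O errors.
template <typename F>
long parse_edge_file(const char* path, F on_edge) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects a zero length
    std::size_t size = static_cast<std::size_t>(st.st_size);
    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                      // the mapping stays valid
    if (map == MAP_FAILED) return -1;
    const char* begin = static_cast<const char*>(map);
    std::size_t count = parse_edges(begin, begin + size, on_edge);
    munmap(map, size);
    return static_cast<long>(count);
}
```

A call could look like parse_edge_file("graph.txt", [](long n, long m) { foo(n, m); }), where "graph.txt" is a placeholder path. On a system without mmap, the same parse_edges function can be handed a buffer that was filled with a single read into malloc'ed memory.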