Performance bottleneck with CSV parser

Keywords: performance bottleneck, CSV. Last updated: 2023-10-16

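My current parser is shown below. Reading a ~10 MB CSV into an STL vector takes ~30 s, which is too slow for me, because I have more than 100 MB that needs to be read every time the program runs. Can anyone give some advice on how to improve performance? Indeed, would it be faster in plain C?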

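#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Declared before main() so that `infile >> data` can find it.
std::istream& operator >> (std::istream& ins, std::vector<double>& data);

int main() {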
    std::vector<double> data;
    std::ifstream infile( "data.csv" );
    infile >> data;
    std::cin.get();
    return 0;
}
std::istream& operator >> (std::istream& ins, std::vector<double>& data)
{
    data.clear();
    // Reserve data vector
    std::string line, field;
    std::getline(ins, line);
    std::stringstream ssl(line), ssf;
    std::size_t rows = 1, cols = 0;
    while (std::getline(ssl, field, ',')) cols++;
    while (std::getline(ins, line)) rows++;
    std::cout << rows << " x " << cols << "\n";
    ins.clear(); // clear bad state after eof
    ins.seekg(0);
    data.reserve(rows*cols);
    // Populate data
    double f = 0.0;
    while (std::getline(ins, line)) {
        ssl.str(line);
        ssl.clear();
        while (std::getline(ssl, field, ',')) {
            ssf.str(field);
            ssf.clear();
            ssf >> f;
            data.push_back(f);
        }
    }
    return ins;
}

Note: I also have OpenMP at my disposal, and the contents will eventually be used for GPGPU computation with CUDA.

You could save about half the time by reading the file only once.

While presizing the vector is beneficial, it will never dominate the runtime, because the I/O will always be slower.
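As a rough illustration of the single-pass idea, the sketch below drops the counting pass and lets the vector grow on its own. The function name `read_csv_single_pass` and the `cols` output parameter are made up for this example. Parsing with `std::strtod` is my own substitution for the string streams in the question, not something this answer prescribes. It assumes purely numeric fields; a line stops being parsed at the first field that is not a number.

```cpp
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying the parser on in-memory text
#include <string>
#include <vector>

// Hypothetical helper: read comma-separated doubles from `ins` in ONE pass.
// `cols` receives the number of fields found on the first numeric line.
std::vector<double> read_csv_single_pass(std::istream& ins, std::size_t& cols)
{
    std::vector<double> data;   // no reserve: push_back is amortized O(1)
    std::string line;
    cols = 0;
    while (std::getline(ins, line)) {
        const char* p = line.c_str();
        std::size_t fields = 0;
        while (*p != '\0') {
            char* end = nullptr;
            const double f = std::strtod(p, &end);
            if (end == p) break;      // nothing numeric here: skip rest of line
            data.push_back(f);
            ++fields;
            p = end;
            if (*p == ',') ++p;       // step over the separator
        }
        if (cols == 0) cols = fields; // remember the width of the first row
    }
    return data;
}
```

Called as `read_csv_single_pass(infile, cols)` on an open `std::ifstream`, the row count can be recovered afterwards as `data.size() / cols` if every row has the same width.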

Another possible optimization is to read without a string stream. Something like this (untested):

int c = 0;
while (ins >> f) {
    data.push_back(f);
    if (++c < cols) {
        char comma;
        ins >> comma; // skip comma
    } else {
        c = 0; // end of line, start next line
    }
}

If you can drop the commas and separate the values by whitespace only, it could be as simple as

while (ins >> f)
    data.push_back(f);

or

std::copy(std::istream_iterator<double>(ins), std::istream_iterator<double>(),
          std::back_inserter(data));

On my machine, your reserve code takes about 1.1 seconds and your populate code takes 8.5 seconds.
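For anyone who wants to reproduce this kind of per-phase measurement, here is a minimal timing sketch. The helper name `time_ms` is hypothetical, and this is not necessarily how the figures above were obtained.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: run `work` once and return the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <class F>
double time_ms(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the counting loop and the populate loop in separate lambdas passed to `time_ms` gives one number per phase. Results depend heavily on the compiler, the optimization flags, and whether the file is already in the OS cache.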

Adding std::ios::sync_with_stdio(false); made no difference with my compiler.

The C code below takes 2.3 seconds.

int i = 0;
int j = 0;
while( true ) {
    float x;
    j = fscanf( file, "%f", & x );
    if( j != 1 ) break; // stop at EOF or at a field that is not a number
    data[i++] = x;
    // skip ',' or '\n'
    int ch = getc(file);
}
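The fragment above leaves out how `file` is opened and how `data` is sized. A self-contained sketch of the same `fscanf` loop, written as C++ with `<cstdio>`, could look as follows. The function name `read_csv_stdio` is invented for illustration, and storing into a growing `std::vector<double>` (rather than a preallocated array of `float`) is an assumption of this sketch.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical helper: read comma- or newline-separated numbers with C stdio.
// Returns an empty vector if the file cannot be opened.
std::vector<double> read_csv_stdio(const char* path)
{
    std::vector<double> data;
    std::FILE* file = std::fopen(path, "r");
    if (!file) return data;

    double x = 0.0;
    // fscanf returns the number of converted items: 1 on success,
    // 0 on a non-numeric field, EOF at end of file.
    while (std::fscanf(file, "%lf", &x) == 1) {
        data.push_back(x);
        const int ch = std::getc(file);   // consume the ',' or '\n' separator
        if (ch == EOF) break;
    }
    std::fclose(file);
    return data;
}
```

Checking for a return value of exactly 1 makes the loop stop at a malformed field instead of storing an uninitialized value.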

Try calling

std::ios::sync_with_stdio(false);

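at the start of your program. This disables the synchronization between cin/cout and scanf/printf, which is said to be quite slow (I have never tried it myself, but I have often seen it recommended). Note that if you do this, you can no longer mix C++-style and C-style I/O in your program.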

(Also, Olaf Dietsche's point about reading the file only once is spot-on.)

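Apparently, file I/O is a bad idea here. Just map the whole file into memory and access the CSV file as one contiguous block of virtual memory; this incurs only a few system calls.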
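A minimal sketch of the memory-mapping approach, assuming a POSIX system (`open`, `fstat`, `mmap`); on Windows the equivalent calls differ. The name `read_csv_mmap` and the 64-byte field buffer are assumptions of this example. Because a mapped file is not guaranteed to be null-terminated, each field is copied into a small buffer before `std::strtod` is called on it.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: map the file read-only and parse it in place.
// Returns an empty vector if the file cannot be opened or mapped.
std::vector<double> read_csv_mmap(const char* path)
{
    std::vector<double> data;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return data;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return data; }
    const std::size_t size = static_cast<std::size_t>(st.st_size);

    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       // the mapping stays valid after close
    if (map == MAP_FAILED) return data;

    const char* p = static_cast<const char*>(map);
    const char* const end = p + size;
    while (p < end) {
        const char* q = p;           // find the end of the current field
        while (q < end && *q != ',' && *q != '\n') ++q;

        char buf[64];                // assumed upper bound on field length
        const std::size_t len = static_cast<std::size_t>(q - p);
        if (len > 0 && len < sizeof buf) {
            std::memcpy(buf, p, len);
            buf[len] = '\0';
            char* stop = nullptr;
            const double f = std::strtod(buf, &stop);
            if (stop != buf) data.push_back(f);   // skip non-numeric fields
        }
        p = (q < end) ? q + 1 : end; // step over the separator
    }
    munmap(map, size);
    return data;
}
```

Whether this beats buffered stream reads depends on the platform and on how much of the time is spent parsing rather than reading, so it is worth measuring on the real data.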