Starting advice for data-trawling routine

Keywords: program, data, start. Updated: 2023-10-16

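I'm looking for some general advice on the most efficient way to create a data-trawling routine. I have basic knowledge of C++.

I need to create a routine that searches text files in the following format (example):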

4515397   404.4    62.5  1607.0     2.4     0.9 ...
4515398   404.4    62.3  1607.0     3.4     1.2 ...
4515399   404.4    62.2  1608.0     4.6     0.8 ...
4515400   405.1    62.2  1612.0     5.8     0.2 ...
4515401   405.9    62.2  1615.0     6.9    -0.8 ...
4515402   406.8    62.2  1617.0     8.0    -2.7 ...
4515403   406.7    62.1  1616.0     9.0    -5.3 ...

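In the example above, I want to export the average of columns 2 and 3 for the rows where columns 5 and 6 are both less than 4. I'm effectively not interested in the values in columns 1, 4 or 7 (the ellipses appear in the file exactly as shown).

To complicate matters, random text strings occasionally appear in the file, like this (these can be discarded):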

4522787   429.6    34.4  2024.0    .       .    ...
4522788   429.9    34.2  2022.0    .       .    ...
4522789   429.9    34.1  2022.0    .       .    ...
EFIX R   4522633    4522789 157   427.9    36.8    2009
4522790   429.3    34.2  2021.0    .       .    ...
END 4522791     SAMPLES EVENTS  RES   23.91   23.82
MSG 4522799 TRIAL_RESULT 0
MSG 4522799 TRIAL OK

Finally, each text file contains five sets of data whose values I intend to average. Each of these five data sets is delimited by lines like this:

MSG 4502281 START_GRAB

and

MSG 4512283 END_GRAB

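Everything outside these ranges can be discarded.

So, as a relatively inexperienced programmer, I'm looking for the most efficient way to achieve this. What is my best approach? In particular, is C++ needlessly complex for this kind of task? Perhaps a utility already exists that can do this sort of data trawling?

It has just occurred to me that I could use Microsoft Excel scripting to do this for me. I'd like to hear your thoughts on this.

I'd start with the naive approach and see how far I can get: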

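#include <fstream>
#include <iostream>
#include <numeric>   // std::accumulate
#include <sstream>
#include <string>
#include <vector>

int main()
{
  std::ifstream infile("thefile.txt");
  if (!infile) { return 1; }

  std::vector<double> v2, v3;
  std::string line;

  while (std::getline(infile, line))
  {
    long id;
    double col2, col3, col4, col5, col6;
    std::istringstream iss(line);

    // Column 1 is the integer id and columns 2 to 6 are numeric.
    // Column 7 is the literal "..." and is deliberately not read.
    if (iss >> id >> col2 >> col3 >> col4 >> col5 >> col6)
    {
      // we only get here if the first token is an integer and the next
      // five tokens are numbers, so rows containing "." are skipped
      if (col5 < 4.0 && col6 < 4.0)
      {
        v2.push_back(col2);
        v3.push_back(col3);
      }
    }
    else
    {
      std::istringstream msg(line); // re-read the line from the start
      std::string tag;
      if (msg >> tag && tag == "MSG")
      {
        // process the special block (START_GRAB / END_GRAB markers)
      }
    }
  }

  if (v2.empty())
  {
    std::cerr << "no rows matched\n"; // avoid dividing by zero below
    return 1;
  }

  // now compute the averages of v2 and v3; the initial value must be
  // 0.0 (not 0), otherwise std::accumulate sums in int and truncates
  double av2 = std::accumulate(v2.begin(), v2.end(), 0.0) / double(v2.size());
  double av3 = std::accumulate(v3.begin(), v3.end(), 0.0) / double(v3.size());
  std::cout << av2 << ' ' << av3 << '\n';
}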

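If you want to solve this in C++, I strongly recommend Boost regex.

Basically, you need three regular expressions: one for the START_GRAB line, one for the payload lines, and one for the END_GRAB line. Writing the regular expressions isn't too hard. There are plenty of tutorials online, and you can try out your regular expressions here: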

http://gskinner.com/RegExr/