C++ - Efficient way to group double vectors following a certain criteria

Keywords: method, efficient, vector, specific criteria, C++. Updated: 2023-10-16

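I save a list of objects in a CSV-like file using the following scheme:

[value11],...,[value1n],[label1]
[value21],...,[value2n],[label2]
...
[valuen1],...,[valuenn],[labeln]

(each line is one object, i.e. a vector of doubles and its corresponding label). I want to collect the objects into groups according to a custom criterion (i.e. all objects of a group have the same values in the n-th and (n+1)-th positions). I need to do this as efficiently as possible, because the text file contains hundreds of objects. I am using C++.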

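To do this, I first load all the CSV lines into a simple custom container (with getObject, getLabel and import methods). Then I read them back and form the groups with the following code. "verifyGroupRequirements" is a function that returns true if the group condition is satisfied and false otherwise.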

for (size_t i = 0; i < ObjectsList.getSize(); ++i) {
  MyObject currentObj;
  currentObj.attributes = ObjectsList.getObject(i);
  currentObj.label = ObjectsList.getLabel(i);
  if (i == 0) {
    // Sequence initialization with the first object
    ObjectsGroup currentGroup = ObjectsGroup();
    currentGroup.objectsList.push_back(currentObj);
    tmpGroupList.push_back(currentGroup);
  } else {
    // if it is not the first pattern, then we check sequence conditions
    list<ObjectsGroup>::iterator it5;
    for (it5 = tmpGroupList.begin(); it5 != tmpGroupList.end(); ++it5) {
      bool AddObjectToGroupRequirements =
        verifyGroupRequirements(it5->objectsList.back(), currentObj) &&
        ( (it5->objectsList.size() < maxNumberOfObjectsPerGroup) ||
        (maxNumberOfObjectsPerGroup == 0) );
      if (AddObjectToGroupRequirements) {
        // Object added to the group
        it5->objectsList.push_back(currentObj);
        break;
      } else {
        // If we can't find a group which satisfy those conditions and we
        // arrived at the end of the list of groups, then we create a new
        // group with that object.
        size_t gg = std::distance(it5, tmpGroupList.end());
        if (gg == 1) {
          ObjectsGroup tmp1 = ObjectsGroup();
          tmp1.objectsList.push_back(currentObj);
          tmpGroupList.push_back(tmp1);
          break;
        }
      }
    }
  }
  if (maxNumberOfObjectsPerGroup > 0) {
    // With a for loop we can take all the elements of 
    // tmpGroupList which have reached the maximum size
    list<ObjectsGroup>::iterator it2;
    for (it2 = tmpGroupList.begin(); it2 != tmpGroupList.end(); ++it2) {
      if (it2->objectsList.size() == maxNumberOfObjectsPerGroup)
        finalGroupList.push_back(*it2);
    }
    // Since tmpGroupList is a list we can use remove_if to remove them
    tmpGroupList.remove_if(rmCondition);
  }
}
if (maxNumberOfObjectsPerGroup == 0) 
  finalGroupList = vector<ObjectsGroup> (tmpGroupList.begin(), tmpGroupList.end());
else {
  list<ObjectsGroup>::iterator it6;
  for (it6 = tmpGroupList.begin(); it6 != tmpGroupList.end(); ++it6)
    finalGroupList.push_back(*it6);
}

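Here tmpGroupList is a list<ObjectsGroup>, finalGroupList is a vector<ObjectsGroup>, and rmCondition is a boolean function that returns true if the size of an ObjectsGroup is greater than a fixed value. MyObject and ObjectsGroup are two simple data structures, written as follows: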

// Data structure of a single object
class MyObject {
  public:
    // Default constructor, needed by "MyObject currentObj;" in the loop above
    MyObject() = default;
    MyObject(
          unsigned short int spaceToReserve,
          double defaultContent,
          const string &lab)
      : attributes(spaceToReserve, defaultContent), label(lab) {}
    vector<double> attributes;
    string label;
};
// Data structure of a group of objects
class ObjectsGroup {
  public:
    list<MyObject> objectsList;
    double health = 0.0;
};

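This code seems to work, but it is really slow. As I said before, I have to apply it to a large number of objects, so is there a way to improve it and make it faster? Thanks.

[EDIT] What I am trying to achieve is to form groups of objects, where each object is a vector<double> (read from the CSV file). So what I am asking is: is there a more efficient way to collect this kind of object into groups than the one shown in the code example above?

[EDIT2] I need to form groups using all of these vectors.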

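So, I'm reading your question...

... I want to collect them into groups according to a custom criterion (i.e. the same values in the n-th and (n+1)-th positions of all the objects of that group) ...

Ok, I read this part, and kept reading...

... I need to do this as efficiently as possible, because the text file contains hundreds of objects ...

I'm still with you, makes perfect sense.

... To do this, I first load all the CSV lines ...

{bang} {crash} {loud explosion}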

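Ok, I stopped reading right there, and didn't pay much attention to the rest of the question, including the large code sample. That's because we have a basic problem right from the start:

1) You say that your intent is, generally, to read only a small portion of this huge CSV file, and ...

2) ... to do that, you load the entire CSV file into a fairly sophisticated data structure.

These two statements are at odds with each other. You're reading a huge number of values from a file. You're creating an object for each value. Based on the premise of your question, you're going to have a large number of these objects. But then, when all is said and done, you're only going to look at a small number of them, and throw the rest away?

You're doing a lot of work, presumably using up a lot of memory and CPU cycles, loading a huge data set, only to ignore most of it. And you're wondering why you're having performance issues? Seems pretty cut and dried to me.

What would be an alternative way of doing this? Well, let's turn the whole problem inside out, and approach it piecemeal. Let's read the CSV file one line at a time, parse the values in the CSV-formatted line, and pass the resulting strings to a lambda.

Something like this: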

#include <fstream>
#include <string>
#include <vector>

template<typename Callback> void parse_csv_lines(std::ifstream &i,
                                                 Callback &&callback)
{
    std::string line;
    // std::getline() fails once nothing is left to read, so this loop
    // copes with a missing newline on the last line, and it stops on a
    // read error instead of looping forever.
    while (std::getline(i, line))
    {
        std::vector<std::string> words;
        // At this point, you'll take this "line", and split it apart, at
        // the commas, into the individual words. Parsing a CSV-
        // formatted file. Not very exciting, you're doing this
        // already, the algorithm is boring to implement, you know
        // how to do it, so let's say you replace this entire comment
        // with your boiler-plate CSV parsing logic from your existing
        // code
        callback(words);
    }
}
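For illustration only, here is one way the splitting step left as a comment above might be filled in. This is a minimal sketch, not the parsing logic from the question: the names split_csv_line and for_each_csv_line are made up for this example, the reader takes a std::istream so that it can also be fed from a string, and it assumes plain comma-separated fields with no quoting or escaped commas.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split one line at commas.
// Assumption: no quoted fields and no escaped commas.
std::vector<std::string> split_csv_line(const std::string &line)
{
    std::vector<std::string> words;
    std::string word;
    std::istringstream fields(line);
    while (std::getline(fields, word, ','))
        words.push_back(word);
    // std::getline drops a trailing empty field ("a,b," gives "a", "b"),
    // so restore it to keep column positions stable.
    if (!line.empty() && line.back() == ',')
        words.emplace_back();
    return words;
}

// Read the stream one line at a time and hand each parsed line
// to the callback. Nothing is kept in memory between lines.
template<typename Callback>
void for_each_csv_line(std::istream &in, Callback &&callback)
{
    std::string line;
    while (std::getline(in, line))
    {
        if (!line.empty() && line.back() == '\r')   // tolerate CRLF files
            line.pop_back();
        if (line.empty())
            continue;                               // skip blank lines
        callback(split_csv_line(line));
    }
}
```

Real CSV files with quoted fields need a proper parser; this sketch only covers the simple format described in the question.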

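Ok, so that takes care of parsing the CSV file. Now, let's say we want to do the task you set out at the beginning of your question: grab the n-th and (n+1)-th position of every line. So...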

void do_something_with_n_and_nplus1_words(size_t n)
{
    std::ifstream input_file("input_file.csv");
    // Insert code to check if input_file.is_open(), and if not, do
    // whatever
    parse_csv_lines(input_file,
                    [n]
                    (const auto &words)
                    {
                       // So now, grab words[n] and words[n+1]
                       // (after checking, of course, for a malformed
                       // CSV file with fewer than "n+2" values)
                       // and do whatever you want with them.
                    });
}

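That's it. Now you end up simply reading the CSV file, and doing the absolute minimum amount of work needed to extract the n-th and (n+1)-th values from each line. It would be fairly difficult to come up with an approach that does less work (except, of course, for micro-optimizations related to CSV parsing and the word buffers; or perhaps forgoing the overhead of std::ifstream, and instead mmap-ing the entire file and parsing it by scanning its mmap-ed contents, something like that).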
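If "do whatever you want with them" means forming the groups described in the question, one possibility is to key an ordered map on the pair of values at positions n and n+1, so each line lands in its group with one lookup instead of a scan over all existing groups. The following is a sketch under stated assumptions, not the asker's verifyGroupRequirements logic: Row, GroupKey, Groups and group_by_columns are hypothetical names, the last field is taken to be the label, "the same value" is read as exact equality of the parsed doubles, and the maxNumberOfObjectsPerGroup limit from the question is not handled.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type: the numeric columns plus the trailing label.
struct Row {
    std::vector<double> values;
    std::string label;
};

using GroupKey = std::pair<double, double>;
using Groups = std::map<GroupKey, std::vector<Row>>;

// Group rows by (values[n], values[n+1]) in a single pass over the stream.
// Assumptions: plain comma-separated fields, Unix line endings, the last
// field is the label, and no NaN values (NaN would break the map ordering).
Groups group_by_columns(std::istream &in, std::size_t n)
{
    Groups groups;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> words;
        std::string word;
        std::istringstream fields(line);
        while (std::getline(fields, word, ','))
            words.push_back(word);
        // Columns n and n+1 plus a label require at least n+3 fields.
        if (words.size() < n + 3)
            continue;                   // skip malformed or short lines
        Row row;
        row.label = words.back();
        words.pop_back();
        try {
            for (const std::string &w : words)
                row.values.push_back(std::stod(w));
        } catch (const std::exception &) {
            continue;                   // skip lines with non-numeric values
        }
        GroupKey key(row.values[n], row.values[n + 1]);
        groups[key].push_back(std::move(row));
    }
    return groups;
}
```

With G groups, each insertion costs O(log G) key comparisons. If the values come from floating-point computation rather than from identical text, exact equality may be too strict and the key would need rounding or a tolerance.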

For other, similar one-off tasks that need only a small number of values from the CSV file, just write an appropriate lambda to fetch them.

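Perhaps you need to retrieve two or more subsets of values from a large CSV file, and you'd like to read the CSV file only once? Well, it's hard to give the best general approach. Each of those situations needs to be analyzed individually, in order to pick the best approach.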
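As one possible shape for the read-once case, and only as a sketch, the same callback idea can be extended so that every parsed line is handed to several handlers in turn. The names Words, Handler and parse_csv_once are hypothetical, and the splitting again assumes plain comma-separated fields with no quoting.

```cpp
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using Words = std::vector<std::string>;
using Handler = std::function<void(const Words &)>;

// Read the stream once and pass every parsed line to each handler in turn.
// Assumption: plain comma-separated fields with no quoting.
void parse_csv_once(std::istream &in, const std::vector<Handler> &handlers)
{
    std::string line;
    while (std::getline(in, line)) {
        Words words;
        std::string word;
        std::istringstream fields(line);
        while (std::getline(fields, word, ','))
            words.push_back(word);
        for (const Handler &handler : handlers)
            handler(words);
    }
}
```

Each handler extracts its own subset into its own container, so the file is read and split only once. Whether the std::function indirection is acceptable depends on the workload, which is part of why each case needs its own analysis.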