How to improve the performance of a string-related iterator?

Updated: 2023-10-16

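I wrote an iterator that reads lines from a std::istream and converts each line into a vector<pair<string, string> >. Can I optimize it to perform better, for example by reducing string copies (or anything else relevant)?

Here is a summary of the code. I also have a full version with some simple tests and a makefile.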

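#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, std::string> > sentence;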
class my_iterator : public std::iterator<std::input_iterator_tag, sentence> {
  std::istream* is;
  std::string line;
  sentence s;
  void advance() {
    std::getline(*is, line);
    convert();
  }
  void convert() {
    std::istringstream iss(line);
    s.clear();
    for (std::istream_iterator<std::string> it(iss), end; it != end; ++it) {
      std::string token = *it;
      size_t idx = token.find_last_of('/');
      std::string word = token.substr(0, idx);
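      // Skip the '/' itself; a token without '/' gets an empty tag instead of throwing.
      std::string part_of_speech = idx == std::string::npos ? std::string() : token.substr(idx + 1);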
      s.push_back(std::pair<std::string, std::string>(word, part_of_speech));
    }
  }
  public:
    my_iterator& operator++() {
      assert(is && !is->eof());
      if (is && !is->eof())
        advance();
      if (is->eof())
        is = NULL;
      return *this;
    }
    sentence operator*() const {
      return s;
    }
    const sentence* operator->() const {
      return &s;
    }
    bool operator==(const my_iterator& rhs) const {
      return is == rhs.is;
    }
    /* some more boilerplate constructors, etc */
};

The input looks like this:

Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, the/DT Dutch/NNP publishing/VBG group/NN ./.
There/EX is/VBZ no/DT asbestos/NN in/IN our/PRP$ products/NNS now/RB ./. ''/''

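Some observations:

  1. It does not matter whether you read the file line by line or word by word, because you only ever process it word by word. So read it word by word with for(std::string word; std::cin >> word; ) ...
  2. Every token has the form word/part_of_speech. Splitting word/part_of_speech into two std::string objects takes more space, because each string stores its own size and a zero terminator. You can instead split word/part_of_speech on demand, only when needed, or store the position of the last / alongside the string, as in std::pair<std::string, size_t>.

For example: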

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>
#include <string>
#include <boost/iterator/transform_iterator.hpp>
int main() {
    // Store word/part_of_speach tokens.
    std::vector<std::string> tokens(
          std::istream_iterator<std::string>(std::cin)
        , std::istream_iterator<std::string>()
        );
    // Now process them as word and part_of_speach.
    auto break_token = [](std::string const& token) { return std::make_pair(&token, token.find_last_of('/')); };
    std::for_each(
          boost::make_transform_iterator(tokens.begin(), break_token)
        , boost::make_transform_iterator(tokens.end(), break_token)
        , [](std::pair<std::string const*, size_t> const& broken_token) {
              (std::cout << "word: ").write(broken_token.first->data(), broken_token.second);
              std::cout << " part_of_speach: " << broken_token.first->data() + broken_token.second;
              std::cout << '\n';
          }
        );
}

Output:

$ echo -e "Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, the/DT Dutch/NNP publishing/VBG group/NN ./. There/EX is/VBZ no/DT asbestos/NN in/IN our/PRP$ products/NNS now/RB ./. ''/''" | ./test
word: Mr. part_of_speach: /NNP
word: Vinken part_of_speach: /NNP
word: is part_of_speach: /VBZ
word: chairman part_of_speach: /NN
word: of part_of_speach: /IN
word: Elsevier part_of_speach: /NNP
word: N.V. part_of_speach: /NNP
word: , part_of_speach: /,
word: the part_of_speach: /DT
word: Dutch part_of_speach: /NNP
word: publishing part_of_speach: /VBG
word: group part_of_speach: /NN
word: . part_of_speach: /.
word: There part_of_speach: /EX
word: is part_of_speach: /VBZ
word: no part_of_speach: /DT
word: asbestos part_of_speach: /NN
word: in part_of_speach: /IN
word: our part_of_speach: /PRP$
word: products part_of_speach: /NNS
word: now part_of_speach: /RB
word: . part_of_speach: /.
word: '' part_of_speach: /''

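You are pushing strings into a vector. That is not particularly bad in itself, but once you reach the end of the vector's currently allocated capacity, it will resize for you, and that can be a performance hit. It is easily solved by pre-sizing the vector (if you know how many words there will be) or by using a different collection (such as a queue).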
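
A minimal sketch of the pre-sizing idea, assuming tokens are separated by spaces or tabs. The helper names count_tokens and prepare are hypothetical, and whether the extra counting pass pays off would need measuring.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, std::string> > sentence;

// Count space/tab-separated tokens in one cheap pass over the line.
std::size_t count_tokens(const std::string& line) {
    std::size_t count = 0;
    bool in_token = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        bool space = line[i] == ' ' || line[i] == '\t';
        if (!space && !in_token) ++count;  // first character of a new token
        in_token = !space;
    }
    return count;
}

// Hypothetical helper to call at the top of convert().
void prepare(sentence& s, const std::string& line) {
    s.clear();                      // removes elements, keeps the capacity
    s.reserve(count_tokens(line));  // at most one reallocation for this line
}
```

Because clear() leaves the capacity in place, an iterator that reuses the same sentence object for every line only pays for growth while it meets longer lines than it has seen so far.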

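Another thing you could try is reducing temporaries. You copy the data from *it into a string and then tokenize it; you could reduce that copying by operating directly on the stream data, although you would need a different routine to read the tokens.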
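
One possible shape for such a routine, as a sketch rather than a drop-in fix: it scans the already-read line in place instead of going through std::istringstream and a temporary token. The name convert_in_place is hypothetical, and the sketch assumes the tag should exclude the '/' and be empty when a token has no '/'.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, std::string> > sentence;

// Hypothetical replacement for convert(): walks `line` by index.
void convert_in_place(const std::string& line, sentence& s) {
    s.clear();
    const char* const ws = " \t\r\n";
    std::size_t begin = line.find_first_not_of(ws);
    while (begin != std::string::npos) {
        // The token occupies [begin, end).
        std::size_t end = line.find_first_of(ws, begin);
        if (end == std::string::npos) end = line.size();
        // Last '/' at or before end - 1; ignore it if it belongs to an earlier token.
        std::size_t slash = line.rfind('/', end - 1);
        if (slash == std::string::npos || slash < begin) slash = end;
        std::size_t tag = slash < end ? slash + 1 : end;
        // Only the two final strings are created, straight from the line.
        s.push_back(std::make_pair(line.substr(begin, slash - begin),
                                   line.substr(tag, end - tag)));
        begin = line.find_first_not_of(ws, end);
    }
}
```

Compared with the original convert(), this drops the stream object and the intermediate token string, so each token costs two string constructions instead of three.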

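Pushing onto the vector also copies the strings. You could use a unique_ptr<> to hold these strings, or use a single string, or construct the string in the vector first and work on it there.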
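
The last option might look roughly like this in C++11. It is a sketch under assumptions: append_token is a hypothetical helper, the tag is stored without its '/', and a token with no '/' gets an empty tag.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, std::string> > sentence;

// Hypothetical helper: appends one word/tag pair, taking ownership of `token`.
void append_token(sentence& s, std::string token) {
    std::size_t idx = token.find_last_of('/');
    // Create an empty pair inside the vector first, then fill it in.
    s.emplace_back();
    std::pair<std::string, std::string>& entry = s.back();
    if (idx != std::string::npos) {
        entry.second.assign(token, idx + 1, std::string::npos);  // copy only the tag
        token.erase(idx);                                        // truncate to the word
    }
    entry.first = std::move(token);  // hand over the buffer instead of copying it
}
```

A caller that no longer needs its token can pass it as append_token(s, std::move(token)), so the word is moved into the vector rather than copied.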