Most efficient way to iterate over words in a string

Keywords: efficient method, words, string, iteration. Last updated: 2023-10-16

If I want to iterate over the individual words in a string (separated by whitespace), the obvious solution would be:

std::istringstream s(myString);
std::string word;
while (s >> word) {
    // do things with word
}

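But this is quite inefficient. The entire string is copied when the string stream is initialized, and then each extracted word is copied, one at a time, into the word variable (which amounts to copying the whole string a second time). Is there a way to improve on this without manually iterating over every character?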

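In most cases, copying accounts for a very small share of the overall cost, so clean, highly readable code matters more. In the rare cases where a profiler tells you that the copying is a bottleneck, you can iterate over the characters in the string with some help from the standard library.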

One approach you could take is to iterate with the std::string::find_first_of and std::string::find_first_not_of member functions, like this:

const std::string s = "quick \t\t brown \t fox jumps over the\nlazy dog";
const std::string ws = " \t\r\n";
std::size_t pos = 0;
while (pos != s.size()) {
    std::size_t from = s.find_first_not_of(ws, pos);
    if (from == std::string::npos) {
        break;
    }
    std::size_t to = s.find_first_of(ws, from+1);
    if (to == std::string::npos) {
        to = s.size();
    }
    // If you want an individual word, copy it with substr.
    // The code below simply prints it character-by-character:
    std::cout << "'";
    for (std::size_t i = from ; i != to ; i++) {
        std::cout << s[i];
    }
    std::cout << "'" << std::endl;
    pos = to;
}


Unfortunately, the code becomes a lot harder to read, so you should avoid this change, or at least postpone it until it becomes necessary.
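One way to contain the readability cost is to hide the find_first_not_of / find_first_of loop behind a small helper, so that call sites stay short. The sketch below is not part of the original answer: for_each_word is a hypothetical name, and it assumes a C++17 compiler so that std::string_view is available. Each word is handed to the callback as a view into the original string, so no characters are copied and nothing is allocated.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (an illustration, not from the original answer).
// Calls on_word once per word, passing a std::string_view that points into
// the caller's string. The view is only valid while that string is alive
// and unmodified.
template <typename Callback>
void for_each_word(std::string_view s, Callback&& on_word,
                   std::string_view ws = " \t\r\n")
{
    std::size_t pos = 0;
    while (pos < s.size()) {
        // Skip any run of delimiters in front of the next word.
        const std::size_t from = s.find_first_not_of(ws, pos);
        if (from == std::string_view::npos) {
            break;  // only delimiters remain
        }
        // Find the end of the word; the last word runs to the end of s.
        std::size_t to = s.find_first_of(ws, from);
        if (to == std::string_view::npos) {
            to = s.size();
        }
        on_word(s.substr(from, to - from));  // substr on a view does not copy
        pos = to;
    }
}
```

A call such as for_each_word(myString, [](std::string_view w) { std::cout << w << '\n'; }) then reads almost as simply as the stream-based loop in the question. If a word has to outlive the source string, it still needs to be copied into a std::string at that point.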

Using the Boost string algorithms, we can write it as follows. The loop does not involve any copying of the string.

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>
int main()
{
    std::string s = "stack over   flow";
    auto it = boost::make_split_iterator( s, boost::token_finder( 
                          boost::is_any_of( " " ), boost::algorithm::token_compress_on ) );
    decltype( it ) end;
    for( ; it != end; ++it ) 
    {
        std::cout << "word: '" << *it << "'\n";
    }
    return 0;
}

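Making it C++11-ish

Since pairs of iterators are so old-school nowadays, we can use Boost.Range to define some generic helper functions. These finally allow us to loop over the words with a range-based for: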

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>
#include <boost/range/iterator_range_core.hpp>
template< typename Range >
using SplitRange = boost::iterator_range< boost::split_iterator< typename Range::const_iterator > >;
template< typename Range, typename Finder >
SplitRange< Range > make_split_range( const Range& rng, const Finder& finder )
{
    auto first = boost::make_split_iterator( rng, finder );
    decltype( first ) last;
    return {  first, last };
}
template< typename Range, typename Predicate >
SplitRange< Range > make_token_range( const Range& rng, const Predicate& pred )
{
    return make_split_range( rng, boost::token_finder( pred, boost::algorithm::token_compress_on ) );
}
int main()
{
    std::string str = "stack \tover\r\n  flow";
    for( const auto& substr : make_token_range( str, boost::is_any_of( " \t\r\n" ) ) )
    {
        std::cout << "word: '" << substr << "'\n";
    }
    return 0;
}

Demo:

http://coliru.stacked-crooked.com/a/a/2f4b3d34086cc6ec

If you want it to be as fast as possible, you need to fall back to the good old C function strtok() (or its thread-safe companion strtok_r()):

const char* kWhiteSpace = " \t\v\n\r";    // whatever you call white space
// data() returns a mutable char* only since C++17; use &myString[0] with older standards
char* token = std::strtok(myString.data(), kWhiteSpace);
while (token) {
    // do things with token
    token = std::strtok(nullptr, kWhiteSpace);
}

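Beware that this will clobber the contents of myString: it works by replacing the first delimiter character after each token with a terminating null byte and returning a pointer to the start of the token. It is an old C function, after all.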

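However, this weakness is also its strength: it does not perform any copy, nor does it allocate any dynamic memory (which is likely the most time-consuming thing in your example code). As such, you won't find a native C++ method that beats the speed of strtok().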
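To make the clobbering concrete, here is a minimal sketch that tokenizes a std::string in place and leaves the damage visible afterwards. The function name count_tokens_in_place is made up for this illustration. It uses &text[0] rather than data() so that it also compiles before C++17. Note that std::strtok keeps hidden static state between calls, which is why it is not safe to interleave two tokenizations; the POSIX strtok_r() variant takes an extra save pointer for that purpose.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (an illustration only). Tokenizes `text` in place and
// returns the number of tokens found. After the call, the delimiter that
// followed each token inside `text` has been overwritten with '\0'.
std::size_t count_tokens_in_place(std::string& text,
                                  const char* delims = " \t\v\n\r")
{
    std::size_t count = 0;
    for (char* token = std::strtok(&text[0], delims); token != nullptr;
         token = std::strtok(nullptr, delims)) {
        // token is a NUL-terminated C string living inside text's own buffer;
        // nothing has been copied or allocated to produce it.
        ++count;
    }
    return count;
}
```

After calling this on a string holding "stack over flow", the string still has its original length, but both spaces have become null bytes, so c_str() now yields only "stack". If the original text is needed later, tokenize a copy instead, at which point the copy-free advantage is gone.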

How about splitting the string? You can check this post for more information.

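That post has a detailed answer on how to split a string into tokens. In that answer, you might look at the second approach, which uses iterators and the copy algorithm.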
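The linked post is not reproduced here, so the following is only a guess at what an iterator-and-copy approach looks like: a sketch that feeds std::istream_iterator into std::copy. The function name split_with_copy is an assumption made for this example. Keep in mind that this still copies the string into the stream and each word into the vector, just like the loop in the question, so it buys brevity rather than speed.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (an illustration only). Splits `text` on whitespace by
// letting std::copy pull words out of a string stream one at a time.
std::vector<std::string> split_with_copy(const std::string& text)
{
    std::istringstream iss(text);  // copies the whole string into the stream
    std::vector<std::string> tokens;
    // A default-constructed istream_iterator acts as the end-of-stream marker.
    std::copy(std::istream_iterator<std::string>(iss),
              std::istream_iterator<std::string>(),
              std::back_inserter(tokens));
    return tokens;
}
```

The stream extraction skips any run of whitespace, so consecutive delimiters never produce empty tokens.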