Count unique words in a string in C++

Keywords: unique words, string, C++, count. Updated: 2023-10-16

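I want to count how many unique words there are in a string 's', where punctuation and the newline character (\n) separate the words. So far I have used the logical OR operator to check how many word separators the string contains, and added 1 to the result to get the number of words in s.

My current code returns 12 as the word count. Since "ab", "AB", "aB" and "Ab" (and likewise the variants of "zzzz") are all the same word and not unique, how can I ignore variations in case? I followed this link: http://www.cplusplus.com/reference/algorithm/unique/, but that reference counts unique items in a vector, and I am using a string, not a vector.

Here is my code: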

#include <iostream>
#include <string>
using namespace std;

// Returns true if c is one of the characters that separate words.
bool isWordSeparator(char & c) {
    return c == ' ' || c == '-' || c == '\n' || c == '?' || c == '.' || c == ','
        || c == '!' || c == ':' || c == ';';
}

// Counts separators and adds 1, so it counts all words, not unique ones.
int countWords(string s) {
    int wordCount = 0;
    if (s.empty()) {
        return 0;
    }
    for (int x = 0; x < s.length(); x++) {
        if (isWordSeparator(s.at(x))) {
            wordCount++;
        }
    }
    return wordCount+1;
}

int main() {
    string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";
    int number_of_words = countWords(s);
    cout << "Number of Words: " << number_of_words  << endl;
    return 0;
}

What you need to make the code case-insensitive is tolower().
You can apply it to the original string with std::transform:

std::transform(s.begin(), s.end(), s.begin(), ::tolower);
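One caveat with passing ::tolower directly: the C function takes an int, and calling it with a negative char value (possible for non-ASCII bytes on platforms where char is signed) is undefined behavior. A minimal sketch that routes each character through unsigned char first; the helper name toLowerCopy is hypothetical and not part of the original answer:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns a lowercased copy of the input.
// Each char is converted to unsigned char before reaching std::tolower.
std::string toLowerCopy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}
```

Punctuation and whitespace pass through unchanged, so the separators in the question's string are unaffected.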

However, I should add that your current code is closer to C than to C++; you may want to look at what the standard library offers.

I suggest istringstream + istream_iterator for tokenizing, and unique_copy or a set for getting rid of duplicates, as shown here: https://ideone.com/nb4BEH
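The linked snippet is not reproduced here, so the following is only a sketch of the idea under stated assumptions: the input has already been lowercased and had its separators replaced by spaces, and the name countUniqueSorted is made up for illustration. It sorts before deduplicating because std::unique (like std::unique_copy) only removes adjacent duplicates.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: counts distinct whitespace-separated tokens.
// Assumes `text` is already lowercased and uses only whitespace as separator.
std::size_t countUniqueSorted(const std::string& text) {
    std::istringstream in(text);
    std::istream_iterator<std::string> first(in), last;
    std::vector<std::string> words(first, last);  // tokenize on whitespace

    std::sort(words.begin(), words.end());        // bring duplicates together
    auto newEnd = std::unique(words.begin(), words.end());
    return static_cast<std::size_t>(std::distance(words.begin(), newEnd));
}
```

Using a std::set instead would avoid the explicit sort, at the cost of one node allocation per distinct word.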

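You can create a set of strings, keep track of the position of the last separator (starting at 0), and use substr to extract each word, then insert it into the set. When you are done, just return the size of the set.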
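A minimal sketch of that manual scan, with some assumptions the answer does not spell out: separators are taken to be anything std::isspace or std::ispunct accepts, each word is lowercased before insertion so that case variants collapse, and the function name countUniqueBySubstr is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper illustrating the "remember the last separator" scan.
std::size_t countUniqueBySubstr(const std::string& s) {
    std::set<std::string> words;
    std::size_t start = 0;  // index just past the last separator seen
    for (std::size_t i = 0; i <= s.size(); ++i) {
        // The end of the string is treated as one final separator.
        const bool isSep = (i == s.size())
            || std::isspace(static_cast<unsigned char>(s[i]))
            || std::ispunct(static_cast<unsigned char>(s[i]));
        if (!isSep) {
            continue;
        }
        if (i > start) {  // skip empty tokens between adjacent separators
            std::string word = s.substr(start, i - start);
            for (char& ch : word) {
                ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
            }
            words.insert(word);
        }
        start = i + 1;
    }
    return words.size();
}
```

The `i > start` check is what keeps runs of separators from producing empty words.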

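You can use a split function to make the whole operation easier, since it tokenizes the string for you. Note that std::string itself has no split member, so this means a library function such as boost::split or a small helper of your own. All you have to do is insert every element of the returned container into the set and, again, return its size.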

Edit: as pointed out in the comments, you need a custom comparator so that the comparison ignores case.
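One possible shape for such a comparator using only the standard library is sketched below. The names CaseInsensitiveLess and CaseInsensitiveSet are hypothetical, and the comparison only handles single-byte characters as std::tolower sees them in the current C locale.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Hypothetical comparator: strict weak ordering that ignores case.
struct CaseInsensitiveLess {
    bool operator()(const std::string& a, const std::string& b) const {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](unsigned char x, unsigned char y) {
                return std::tolower(x) < std::tolower(y);
            });
    }
};

// A set using it treats "ab", "Ab" and "AB" as the same key.
using CaseInsensitiveSet = std::set<std::string, CaseInsensitiveLess>;
```

With this comparator the set keeps whichever spelling was inserted first, so the stored words retain their original case.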

First, I suggest rewriting isWordSeparator like this:

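// Requires <cctype>. The cast avoids undefined behavior for negative char values.
bool isWordSeparator(char c) {
    const unsigned char uc = static_cast<unsigned char>(c);
    return std::isspace(uc) || std::ispunct(uc);
}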

Your current implementation does not handle all punctuation and whitespace characters, for example \t or +.

Also, incrementing wordCount every time isWordSeparator returns true is incorrect, for example when the input contains a run of separators such as ?!.
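To make the miscount concrete, here is a small sketch comparing the separators-plus-one strategy with a count that only reacts to the start of each word. Both function names are hypothetical, and both use std::isspace / std::ispunct as the separator test.

```cpp
#include <cctype>
#include <string>

// Hypothetical re-creation of the question's strategy:
// one word per separator, plus one.
int naiveCount(const std::string& s) {
    if (s.empty()) return 0;
    int count = 0;
    for (unsigned char c : s) {
        if (std::isspace(c) || std::ispunct(c)) ++count;
    }
    return count + 1;
}

// Counts a word only at a transition from separator to non-separator,
// so runs of separators and leading/trailing separators are harmless.
int runAwareCount(const std::string& s) {
    int count = 0;
    bool inWord = false;
    for (unsigned char c : s) {
        const bool sep = std::isspace(c) || std::ispunct(c);
        if (!sep && !inWord) ++count;
        inWord = !sep;
    }
    return count;
}
```

For the input "Hi?!. there", naiveCount returns 5 while runAwareCount returns 2, which is the actual number of words.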

So a less error-prone approach is to replace all separators with spaces, then iterate over the words and insert them into an (unordered) set:

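#include <algorithm>
#include <cctype>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_set>

// Uses isWordSeparator() as defined above.
int countWords(std::string s) {
    // Replace every separator with a space and lowercase everything else.
    // The explicit return type keeps both return statements consistent.
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) -> char {
        if (isWordSeparator(c)) {
            return ' ';
        }
        return static_cast<char>(std::tolower(c));
    });
    std::unordered_set<std::string> uniqWords;
    std::stringstream ss(s);
    // Stream extraction skips runs of whitespace, so no empty words are produced.
    std::copy(std::istream_iterator<std::string>(ss),
              std::istream_iterator<std::string>(),
              std::inserter(uniqWords, uniqWords.end()));
    return static_cast<int>(uniqWords.size());
}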

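While splitting the string into words, insert every word into a std::set. This gets rid of the duplicates. Then just call set::size() to get the number of unique words.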

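I used the boost::split() function from the Boost String Algorithms library in my solution, because Boost is almost standard nowadays. The explanation is in the code comments...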

#include <cctype>
#include <iostream>
#include <string>
#include <set>
#include <boost/algorithm/string.hpp>
using namespace std;

// Function suggested by user 'mshrbkv':
bool isWordSeparator(char c) {
    return std::isspace(c) || std::ispunct(c);
}

// This is used to make the set case-insensitive.
// Alternatively you could call boost::to_lower() to make the
// string all lowercase before calling boost::split().
struct IgnoreCaseCompare {
    bool operator()( const std::string& a, const std::string& b ) const {
        return boost::ilexicographical_compare( a, b );
    }
};

int main()
{
    string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";
    // Define a set that will contain only unique strings, ignoring case.
    set< string, IgnoreCaseCompare > words;
    // Split the string by using your isWordSeparator function
    // to define the delimiters. token_compress_on collapses multiple
    // consecutive delimiters into only one.
    boost::split( words, s, isWordSeparator, boost::token_compress_on );
    // Now the set contains only the unique words.
    cout << "Number of Words: " << words.size() << endl;
    for( auto& w : words )
        cout << w << endl;
    return 0;
}

Demo: http://coliru.stacked-crooked.com/a/a3b51a6c6a3b4ee8
