Retrieving code points using Boost.Locale library

Keywords: code retrieval, Boost, Locale, usage. Last updated: 2023-10-16

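From a given Unicode string, I want to retrieve the list of code points that make up the string. To do so, I copied the following example from Boost's character iteration example: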

#include <boost/locale.hpp>
#include <iostream>
#include <string>
using namespace boost::locale::boundary;
int main()
{
    boost::locale::generator gen;
    std::string text = "To be or not to be";
    // Create a character-boundary mapping of the text using a generated en_US.UTF-8 locale.
    ssegment_index map(character, text.begin(), text.end(), gen("en_US.UTF-8"));
    // Print every character segment, quoted and comma-separated.
    for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it) {
        std::cout << "\"" << *it << "\", ";
    }
    std::cout << std::endl;
    return 0;
}

It returns characters (which, according to Boost's documentation, are not the same as code points), as follows:

"T", "o", " ", "b", "e", " ", "o", "r", " ", "n", "o", "t", " ", "t", "o", " ", "b", "e",
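
To make the character versus code point distinction concrete, here is a minimal standard-library sketch (not Boost code). The helper name `count_code_points` is hypothetical, and it assumes the input is well-formed UTF-8. It counts code points by skipping UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Boost.
// Assumes well-formed UTF-8: every code point starts with exactly one
// byte that is NOT a continuation byte (10xxxxxx), so counting those
// bytes counts the code points.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the ASCII text above, both counts are 18. For `"e\xCC\x81"` (the letter e followed by U+0301 COMBINING ACUTE ACCENT) the helper returns 2, even though a reader sees one accented character.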

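I read that the to_unicode function in the boost::locale::util::base_converter class can be used to retrieve the code points of a given string, but I am not sure how. I tried the following code, but it did not help: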

for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it) {
    std::cout << "\"" << *it << "\", ";
    boost::locale::util::base_converter encoder_decoder;
    virtual uint32_t test1 = encoder_decoder.to_unicode(it->begin(), it->end() );
}

It returns a "type mismatch" error. I think the to_unicode() function must take different parameters.

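I would like to retrieve the code points using Boost only, rather than existing solutions like here or here, because Boost provides many useful functions for identifying line breaks, word breaks, and so on in various languages.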

To get the code points, you can use boost::u8_to_u32_iterator. This works because a UTF-32 character is equal to its code point.

#include <boost/regex/pending/unicode_iterator.hpp>
#include <string>
#include <iostream>
// Print each code point of a UTF-8 string as a decimal number.
void printCodepoints(std::string input) {
    for(boost::u8_to_u32_iterator<std::string::iterator> it(input.begin()), end(input.end()); it!=end; ++it)
        std::cout << "\"" << *it << "\", ";
}
int main() {
    printCodepoints("Hello World!");
    return 0;
}
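
To show why each UTF-32 value is itself a code point, here is a minimal hand-written UTF-8 decoder using only the standard library. It is a sketch of the general idea, not Boost's implementation. The helper name `decode_utf8` is hypothetical, and it rejects truncated or malformed sequences but does not check for overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost: decode UTF-8 into code points.
// Each decoded 32-bit value is what a UTF-32 string would store directly.
std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > s.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char byte = static_cast<unsigned char>(s[i + k]);
            if ((byte & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (byte & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Calling `decode_utf8("h\xC3\xA9\xE2\x82\xAC")` yields the three values 0x68, 0xE9 and 0x20AC, that is U+0068, U+00E9 and U+20AC. In real code the Boost iterator above is preferable, since it also validates the input more thoroughly.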