How to convert a Unicode code point to characters in C++ using ICU?

Keywords: C++, character conversion, ICU, Unicode, code point. Updated: 2023-10-16

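Somehow I couldn't find the answer on Google. I was probably using the wrong search terms. I'm trying to do a simple task: convert a number that represents a character into the character itself, as in this table: http://unicode-table.com/en/#0460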

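For example, if my number is 47 (which is "/"), I can just put 47 in a char and print it with cout, and I will see a slash in the console (there is no problem for numbers lower than 256).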

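But if my number is 1120, the character should be "Ѡ" (Cyrillic capital letter omega, U+0460). I assume it is represented by several chars (which cout would know to turn into "Ѡ" when it prints them to the screen).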

How do I get these "several chars" that represent "Ѡ"?

I have the ICU library available, and I'm using UTF-8.

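What you call the Unicode number is usually called a code point. If you want to work with C++ and Unicode strings, ICU provides the icu::UnicodeString class. Its documentation is part of the ICU API reference.

To create a UnicodeString holding a single character, you can use the constructor that takes a code point as a UChar32: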

icu::UnicodeString::UnicodeString(UChar32 ch)

You can then call the toUTF8String method to convert the string to UTF-8.

Example program:

#include <iostream>
#include <string>
#include <unicode/unistr.h>
int main() {
    icu::UnicodeString uni_str((UChar32)1120);
    std::string str;
    uni_str.toUTF8String(str);
    std::cout << str << std::endl;
    return 0;
}

On a Linux system such as Debian, you can compile this program with:

g++ so.cc -o so -licuuc

If your terminal supports UTF-8, this prints an omega character.

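Another option is to use only standard components. The following example holds the Unicode code point in a std::u32string and returns it as a std::string.

Creating a std::u32string from a Unicode code point is simple:

Method 1: using brace initialization (calls the initializer_list constructor)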

std::u32string u1{codePointNumber};
// For example:
std::u32string u1{305}; // 305 is 'ı'

Method 2: using operator+=

std::u32string u2{}; // Empty string
// For example:
u2 += 305;

To convert a std::u32string to a std::string, you can use std::wstring_convert from the <locale> header:

#include <iostream>
#include <codecvt>
#include <string>
#include <locale>
std::string U32ToStr(const std::u32string& str)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(str);
}
int main()
{
    std::u32string u1{305};
    std::cout << U32ToStr(u1) << "\n";
    return 0;
}

Example on godbolt

Note that std::wstring_convert has been deprecated since C++17 and is removed in C++26, so if you are using a newer version of C++ you may need a different approach.