How can I convert a string like "\u94b1" to one real character in C++?

Updated: 2023-10-16

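We know that in a string literal, "\u94b1" is converted to one character, in this case the Chinese character "钱". But if the string really contains 6 separate characters, namely '\', 'u', '9', '4', 'b', '1', how can I convert them to that character manually?

For example: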

string s1;
string s2 = "\u94b1";
cin >> s1;            // here I input \u94b1
cout << s1 << endl;   // here output \u94b1
cout << s2 << endl;   // and here output 钱

I want to convert s1 so that cout << s1 << endl; also outputs 钱.

Any suggestions?

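The conversion is actually a bit more involved than it looks.

string s2 = "\u94b1";

is in fact equivalent to:

char cs2[] = { '\xe9', '\x92', '\xb1', 0 }; string s2 = cs2;

This means you are initializing s2 with the 3 chars that make up the UTF-8 representation of 钱 (assuming the compiler's execution character set is UTF-8). You can examine s2.c_str() to confirm this.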
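A minimal sketch of that check follows. The helper name hex_dump is hypothetical (it is not part of the original answer); it simply renders each byte of a std::string as two hex digits so the three UTF-8 bytes become visible.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the bytes of s as space-separated hex pairs.
std::string hex_dump(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Cast through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned char>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hex_dump on a string built from the bytes 0xe9, 0x92, 0xb1 should give "e9 92 b1", while the 6-character input read from cin would show the ASCII codes of '\', 'u', '9', '4', 'b', '1' instead.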

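So, to process the 6 raw characters '\', 'u', '9', '4', 'b', '1', you must first extract the wchar_t value from a string equivalent to string s1 = "\\u94b1"; (which is what you get when you read the input). That part is easy: just skip the first two characters and read the rest as hexadecimal: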

unsigned int ui;
std::istringstream is(s1.c_str() + 2);
is >> std::hex >> ui;   // needs <sstream>

ui is now 0x94b1.
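The snippet above trusts that the input really starts with a backslash and a u. As a hedged alternative, the sketch below validates the prefix and requires exactly four hex digits before producing a value. The function name parse_u_escape and its signature are assumptions made for illustration, not something from the original answer.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: parses exactly "\uXXXX" starting at s[pos].
// On success stores the parsed value in code and returns true.
bool parse_u_escape(const std::string& s, std::size_t pos, unsigned int& code) {
    // Need 6 characters: backslash, 'u', and four hex digits.
    if (pos + 6 > s.size() || s[pos] != '\\' || s[pos + 1] != 'u') return false;
    unsigned int value = 0;
    for (std::size_t i = pos + 2; i < pos + 6; ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (!std::isxdigit(c)) return false;
        unsigned int digit = std::isdigit(c) ? c - '0'
                                             : std::tolower(c) - 'a' + 10;
        value = value * 16 + digit;
    }
    code = value;
    return true;
}
```

With s1 holding the six characters \u94b1, parse_u_escape(s1, 0, ui) would leave 0x94b1 in ui, and it returns false for malformed input rather than silently yielding a partial number.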

Now, if you have a C++11 compliant system, you can convert it with std::codecvt_utf8 (declared in <codecvt>, which is deprecated since C++17):

wchar_t wc = static_cast<wchar_t>(ui);
std::codecvt_utf8<wchar_t> conv;
const wchar_t *wnext;
char *next;
char cbuf[5] = {0}; // zero-initialized so the result is always null-terminated
std::mbstate_t state = std::mbstate_t(); // start from the initial conversion state
conv.out(state, &wc, &wc + 1, wnext, cbuf, cbuf + 4, next); // writes at most 4 bytes

cbuf now contains the 3 chars that represent 钱 in UTF-8 plus a terminating null, so you can finally do:

string s3 = cbuf;
cout << s3 << endl;

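To do this, you would write code that checks whether the string contains a backslash, a letter u, and four hexadecimal digits, and converts these to a Unicode code point. Your std::string implementation probably assumes UTF-8, so you then translate that code point into 1, 2, or 3 UTF-8 bytes.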
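The encoding step described above can be sketched by hand without any library. The helper name append_utf8 is hypothetical. It follows the standard UTF-8 bit layout and, going slightly beyond the 1 to 3 bytes mentioned above, also covers the 4-byte form for code points above U+FFFF.

```cpp
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of code point cp to out.
// Returns false for UTF-16 surrogates and for values above U+10FFFF.
bool append_utf8(unsigned long cp, std::string& out) {
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // lone surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes: 11110xxx then 3 x 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}
```

For the code point 0x94b1 this appends the bytes 0xe9, 0x92, 0xb1, which a UTF-8 terminal displays as 钱.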

For extra points, figure out how to enter code points outside the Basic Multilingual Plane.
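One possible approach, offered as an assumption since the question does not say which input syntax is wanted: follow the convention used by JSON and Java, where a code point above U+FFFF is written as two consecutive \uXXXX escapes forming a UTF-16 surrogate pair. The sketch below, with the hypothetical name combine_surrogates, turns such a pair into a single code point, which would then be encoded as 4 UTF-8 bytes.

```cpp
// Hypothetical helper: combines a UTF-16 surrogate pair into one code point.
// Returns true and stores the result in cp only if hi is a high surrogate
// (0xD800..0xDBFF) and lo is a low surrogate (0xDC00..0xDFFF).
bool combine_surrogates(unsigned int hi, unsigned int lo, unsigned long& cp) {
    if (hi < 0xD800 || hi > 0xDBFF) return false;
    if (lo < 0xDC00 || lo > 0xDFFF) return false;
    // Each surrogate carries 10 payload bits; the pair is offset by 0x10000.
    cp = 0x10000UL + ((static_cast<unsigned long>(hi) - 0xD800) << 10)
                   + (static_cast<unsigned long>(lo) - 0xDC00);
    return true;
}
```

For instance, the escapes \ud83d\ude00 would combine to U+1F600. Another option would be to accept a longer escape such as the \UXXXXXXXX form that C++ itself uses in string literals.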

Using utfcpp (header only), you may do:

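#include <utf8.h>     // utfcpp; adjust the include path to your installation
#include <cstdint>
#include <cstdlib>    // std::strtoul
#include <iostream>
#include <iterator>   // std::back_inserter
#include <string>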
std::string replace_utf8_escape_sequences(const std::string& str) {
    std::string result;
    std::string::size_type first = 0;
    std::string::size_type last = 0;
    while(true) {
        // Find an escape position
        last = str.find("\\u", last);
        if(last == std::string::npos) {
            result.append(str.begin() + first, str.end());
            break;
        }
        // Extract a 4 digit hexadecimal
        const char* hex = str.data() + last + 2;
        char* hex_end;
        std::uint_fast32_t code = std::strtoul(hex, &hex_end, 16);
        std::string::size_type hex_size = hex_end - hex;
        // Append the leading and converted string
        if(hex_size != 4) last = last + 2 + hex_size;
        else {
            result.append(str.begin() + first, str.begin() + last);
            try {
                utf8::utf16to8(&code, &code + 1, std::back_inserter(result));
            }
            catch(const utf8::exception&) {
                // Error Handling
                result.clear();
                break;
            }
            first = last = last + 2 + 4;
        }
    }
    return result;
}
int main()
{
    std::string source = "What is the meaning of '\\u94b1'  '\\u94b1' '\\u94b1' '\\u94b1' ?";
    std::string target = replace_utf8_escape_sequences(source);
    std::cout << "Conversion from \"" << source << "\" to \"" << target << "\"\n";
}