UTF-16BE to UTF-8 using Boost.Locale yields garbage


I am working with an API that returns UTF-16BE strings. I need to convert them to UTF-8 for display in the UI (which in turn takes char* buffers). To do that, I decided to use boost::locale::conv::utf_to_utf() and wrote a conversion routine:

// defined by the API
typedef uint16_t t_wchar_t;
typedef std::basic_string<t_wchar_t> t_wstring;
char* ToUtf8(const t_wstring &utf16)
{
    // print out the input buffer, using printfs instead of cout because I have to
    printf("t_wchar_t = %zu, wchar_t = %zu, char = %zun", 
            sizeof(t_wchar_t), sizeof(wchar_t), sizeof(char));
    const t_wchar_t *inBuf = utf16.c_str();
    const size_t inSize = utf16.size();
    // buf2str is my debugging function for printing buffers as raw bytes
    printf("UTF16 size: %zu, buf: %sn", inSize, 
            buf2str(inBuf, inSize).c_str());
    // make a copy of the input buffer, prepend a BE BOM 
    // (didn't work without it, does not work with it either)
    t_wchar_t *workBuf = new t_wchar_t[inSize + 1];
    workBuf[0] = 0xfeff;
    std::memcpy(workBuf + 1, inBuf, inSize * sizeof(t_wchar_t));
    printf("Workbuf: %sn", buf2str(workBuf, inSize + 1).c_str());
    // perform conversion, print out the result buffer
    const std::string utf8Str = boost::locale::conv::utf_to_utf<char>(workBuf, 
            workBuf + inSize + 1);
    const size_t utf8Size = utf8Str.size();
    printf("UTF8 size: %zu, buf: %sn", utf8Size, 
            buf2str(utf8Str.c_str(), utf8Size).c_str());
    // allocate a char buffer, copy the result there and return the pointer
    char *ret = new char[utf8Size + 1];
    std::memcpy(ret, utf8Str.c_str(), (utf8Size + 1)*sizeof(char));
    printf("Return buf[%zu]: <%s>n", 
            buf2str(ret, utf8Size + 1).c_str());
    delete [] workBuf;
    return ret;
}

However, it returns garbage, both when run on the strings from the API and on some test data:

int main()
{
    // simulate the input, make an example UTF-16BE stream from raw bytes
    const unsigned char test[] = { '\0', 'H', '\0', 'e', '\0', 'l', '\0', 'l', '\0', 'o', 
            '\0', ',', '\0', ' ', '\0', 'w', '\0', 'o', '\0', 'r', '\0', 'l', 
            '\0', 'd', '\0', '!' };
    // create a t_wstring from the 16bit code sequences directly
    const t_wstring testStr(reinterpret_cast<const t_wchar_t*>(test), 13);
    printf("test data: %sn", buf2str(testStr.c_str(), testStr.size()).c_str());
    char* utf8 = ToUtf8(testStr);
    delete [] utf8;
    return 0;
}

Here is some of the program's output for the "Hello, world!" string. As you can see, the converted UTF-8 buffer contains garbage.

test data: [13/26] ''(0) 'H'(72) ''(0) 'e'(101) ... ''(0) ' '(32) ... ''(0) '!'(33)
t_wchar_t = 2, wchar_t = 4, char = 1
UTF16 size: 13, buf: [13/26] ''(0) 'H'(72) ''(0) 'e'(101) ... ''(0) ','(44) ... ''(0) '!'(33)
Workbuf: [14/28] ...
UTF8 size: 42, buf: [42/42] ''(228) ''(160) ''(128) ''(230) ''(148) ''(128) ''(230) ''(176) ''(128) ... ''(226) ''(132) ''(128) ...

What am I doing wrong? Thanks.

EDIT: Thanks to @TheUndeadFish's comment, I added an endianness conversion of my work buffer before the conversion, and now it works as expected:

// be16toh() (from <endian.h>) converts each 16-bit code unit
// from big-endian to host byte order
for (size_t i = 0; i < inSize; ++i)
{
    workBuf[i] = be16toh(workBuf[i]);
}
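
A portability note: be16toh() comes from the non-standard <endian.h> header (glibc; the BSDs spell the equivalents slightly differently). Where it is unavailable, a hand-rolled equivalent might look like the sketch below; the helper name be16_to_host is mine, not part of any API:

#include <cstdint>

// Convert one 16-bit code unit from big-endian to host byte order.
// On a big-endian host this is the identity; on a little-endian host
// it swaps the two bytes. (A sketch; C++20 could use std::endian instead.)
static uint16_t be16_to_host(uint16_t v)
{
    const uint16_t probe = 1;
    const bool hostIsLittleEndian =
            *reinterpret_cast<const unsigned char*>(&probe) == 1;
    return hostIsLittleEndian
            ? static_cast<uint16_t>((v >> 8) | (v << 8))
            : v;
}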

It looks like utf_to_utf is treating the input as if it were little-endian UTF-16 in your case.

Take the first 4 bytes:

You mean the bytes 00 48 00 65 to encode U+0048 U+0065 ('H' and 'e').

But interpreted with the opposite byte order, they encode U+4800 U+6500.

When those are converted to UTF-8, the result is the bytes e4 a0 80 e6 94 80 (U+4800 is 0100100000000000 in binary, which UTF-8 packs into the three bytes 11100100 10100000 10000000, i.e. e4 a0 80; U+6500 works out the same way to e6 94 80).

Writing those out in decimal gives 228 160 128 230 148 128, which are the first values of the "garbage".
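
The misinterpretation is easy to reproduce in isolation. The snippet below is a minimal sketch (not from the original post): it feeds utf_to_utf the byte-swapped code units directly and prints the resulting byte values in decimal:

#include <cstdint>
#include <cstdio>
#include <string>
#include <boost/locale/encoding_utf.hpp>

int main()
{
    // the bytes 00 48 00 65 ("He" in UTF-16BE) read as
    // little-endian 16-bit units give these code points
    const uint16_t misread[] = { 0x4800, 0x6500 };
    const std::string utf8 = boost::locale::conv::utf_to_utf<char>(
            misread, misread + 2);
    for (const char c : utf8)
        printf("%u ", static_cast<unsigned>(static_cast<unsigned char>(c)));
    printf("\n");  // prints: 228 160 128 230 148 128
    return 0;
}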