UTF-16BE to UTF-8 using Boost.Locale yields garbage


I am working with an API that returns UTF-16BE strings. I need to convert them to UTF-8 for display in the UI (which in turn takes char* buffers). To do that, I decided to use boost::locale::conv::utf_to_utf() and wrote a conversion routine:

// defined by the API
typedef uint16_t t_wchar_t;
typedef std::basic_string<t_wchar_t> t_wstring;
char* ToUtf8(const t_wstring &utf16)
{
    // print out the input buffer, using printfs instead of cout because I have to
    printf("t_wchar_t = %zu, wchar_t = %zu, char = %zun", 
            sizeof(t_wchar_t), sizeof(wchar_t), sizeof(char));
    const t_wchar_t *inBuf = utf16.c_str();
    const size_t inSize = utf16.size();
    // buf2str is my debugging function for printing buffers as raw bytes
    printf("UTF16 size: %zu, buf: %sn", inSize, 
            buf2str(inBuf, inSize).c_str());
    // make a copy of the input buffer, prepend a BE BOM 
    // (didn't work without it, does not work with it either)
    t_wchar_t *workBuf = new t_wchar_t[inSize + 1];
    workBuf[0] = 0xfeff;
    std::memcpy(workBuf + 1, inBuf, inSize * sizeof(t_wchar_t));
    printf("Workbuf: %sn", buf2str(workBuf, inSize + 1).c_str());
    // perform conversion, print out the result buffer
    const std::string utf8Str = boost::locale::conv::utf_to_utf<char>(workBuf, 
            workBuf + inSize + 1);
    const size_t utf8Size = utf8Str.size();
    printf("UTF8 size: %zu, buf: %sn", utf8Size, 
            buf2str(utf8Str.c_str(), utf8Size).c_str());
    // allocate a char buffer, copy the result there and return the pointer
    char *ret = new char[utf8Size + 1];
    std::memcpy(ret, utf8Str.c_str(), (utf8Size + 1)*sizeof(char));
    printf("Return buf[%zu]: <%s>n", 
            buf2str(ret, utf8Size + 1).c_str());
    delete [] workBuf;
    return ret;
}

However, it returns garbage, both when run on the strings from the API and on some test data:

int main()
{
    // simulate the input, make an example UTF-16BE stream from raw bytes
    const unsigned char test[] = { '\0', 'H', '\0', 'e', '\0', 'l', '\0', 'l', '\0', 'o', 
            '\0', ',', '\0', ' ', '\0', 'w', '\0', 'o', '\0', 'r', '\0', 'l', 
            '\0', 'd', '\0', '!' };
    // create a t_wstring from the 16bit code sequences directly
    const t_wstring testStr(reinterpret_cast<const t_wchar_t*>(test), 13);
    printf("test data: %sn", buf2str(testStr.c_str(), testStr.size()).c_str());
    char* utf8 = ToUtf8(testStr);
    delete [] utf8;
    return 0;
}

Here is some of the program's output for the "Hello, world!" string. As you can see, the converted UTF-8 buffer contains garbage.

test data: [13/26] ''(0) 'H'(72) ''(0) 'e'(101) ... ''(0) ' '(32) ... ''(0) '!'(33)
t_wchar_t = 2, wchar_t = 4, char = 1
UTF16 size: 13, buf: [13/26] ''(0) 'H'(72) ''(0) 'e'(101) ... ''(0) ','(44) ... ''(0) '!'(33)
Workbuf: [14/28] ...
UTF8 size: 42, buf: [42/42] ''(228) ''(160) ''(128) ''(230) ''(148) ''(128) ''(230) ''(176) ''(128) ... ''(226) ''(132) ''(128) ...

What am I doing wrong? Thanks.

EDIT: Thanks to @TheUndeadFish's comment, I added an endianness conversion of my work buffer before the conversion, and now it works as expected:

// be16toh() (from <endian.h>) converts each 16-bit code unit
// from big-endian to host byte order
for (size_t i = 0; i < inSize; ++i)
{
    workBuf[i] = be16toh(workBuf[i]);
}
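
A portability note: be16toh() comes from the non-standard <endian.h> header (glibc; the BSDs spell the equivalents slightly differently). Where it is unavailable, a hand-rolled equivalent might look like the sketch below; the helper name be16_to_host is mine, not part of any API:

#include <cstdint>

// Convert one 16-bit code unit from big-endian to host byte order.
// On a big-endian host this is the identity; on a little-endian host
// it swaps the two bytes. (A sketch; C++20 could use std::endian instead.)
static uint16_t be16_to_host(uint16_t v)
{
    const uint16_t probe = 1;
    const bool hostIsLittleEndian =
            *reinterpret_cast<const unsigned char*>(&probe) == 1;
    return hostIsLittleEndian
            ? static_cast<uint16_t>((v >> 8) | (v << 8))
            : v;
}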

It looks like utf_to_utf is treating the input as if it were little-endian UTF-16 in your case.

Take the first 4 bytes:

You mean the bytes 00 48 00 65 to encode U+0048 U+0065 ('H' and 'e').

But interpreted with the opposite byte order, they encode U+4800 U+6500.

When those are converted to UTF-8, the result is the bytes e4 a0 80 e6 94 80 (U+4800 is 0100100000000000 in binary, which UTF-8 packs into the three bytes 11100100 10100000 10000000, i.e. e4 a0 80; U+6500 works out the same way to e6 94 80).

Writing those out in decimal gives 228 160 128 230 148 128, which are the first values of the "garbage".
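
The misinterpretation is easy to reproduce in isolation. The snippet below is a minimal sketch (not from the original post): it feeds utf_to_utf the byte-swapped code units directly and prints the resulting byte values in decimal:

#include <cstdint>
#include <cstdio>
#include <string>
#include <boost/locale/encoding_utf.hpp>

int main()
{
    // the bytes 00 48 00 65 ("He" in UTF-16BE) read as
    // little-endian 16-bit units give these code points
    const uint16_t misread[] = { 0x4800, 0x6500 };
    const std::string utf8 = boost::locale::conv::utf_to_utf<char>(
            misread, misread + 2);
    for (const char c : utf8)
        printf("%u ", static_cast<unsigned>(static_cast<unsigned char>(c)));
    printf("\n");  // prints: 228 160 128 230 148 128
    return 0;
}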