How to get C++ std::string from Little-Endian UTF-16 encoded bytes

Updated: 2023-10-16

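I have a third-party device that talks to my Linux box over a poorly documented proprietary communications protocol. After reading this article on Joel on Software, I concluded that the "strings" passed in some of the packets are encoded as UTF-16 Little-Endian. In other words, after receiving such a packet, I end up with something like this on my Linux box: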

// The string "Out"
unsigned char data1[] = {0x4f, 0x00, 0x75, 0x00, 0x74, 0x00, 0x00, 0x00};
// The string "°F"
unsigned char data2[] = {0xb0, 0x00, 0x46, 0x00, 0x00, 0x00};

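As far as I can tell, I cannot treat these as std::wstring, because wchar_t is 4 bytes on Linux. I do have one thing going for me, though: my Linux box is also little-endian. So I think I need to use something like std::codecvt_utf8_utf16<char16_t>. However, even after reading the documentation, I cannot figure out how to actually get from an unsigned char[] to a std::string. Can anyone help?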
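The byte layout described above can be made concrete with a small sketch. It assembles each 16-bit code unit from two bytes (low byte first), so it does not depend on the host's endianness or on the width of wchar_t. The helper name `utf16le_bytes_to_u16string` is hypothetical, and treating a 0x0000 code unit as the terminator is an assumption based on the sample packets, not something the protocol is known to guarantee.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the original post): builds UTF-16 code
// units from little-endian bytes. Stops at the first 0x0000 code unit
// (assumed terminator) or when fewer than two bytes remain.
std::u16string utf16le_bytes_to_u16string(const unsigned char* bytes, std::size_t size)
{
    std::u16string out;
    for (std::size_t i = 0; i + 1 < size; i += 2)
    {
        // Little-endian: the low byte comes first, the high byte second.
        const char16_t unit = static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8));
        if (unit == u'\0')
            break;  // assumed end-of-string marker
        out.push_back(unit);
    }
    return out;
}
```

Called as `utf16le_bytes_to_u16string(data1, sizeof data1)`, this should yield the three code units of "Out". The result is still UTF-16; converting it to a UTF-8 std::string is a separate step.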

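If you are willing to use std::codecvt (the <codecvt> facets and std::wstring_convert have been deprecated since C++17), you can wrap the UTF-16 text in a std::u16string and then convert it to UTF-8 as needed.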

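#include <codecvt>  // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>   // std::wstring_convert (deprecated since C++17)
#include <string>

// Simply cast the raw data for the constructor, since we know the byte array
// from the network API really holds 0x0000-terminated 16-bit code units.
// This relies on a little-endian host and on the buffer being suitably aligned.
std::u16string u16_str( reinterpret_cast<const char16_t*>(data2) );
// UTF-16/char16_t to UTF-8
std::string u8_conv = std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t>{}.to_bytes(u16_str);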
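Because the facility used above is deprecated, it may help to see what the UTF-16 to UTF-8 step involves when written by hand. The following is a minimal sketch, not a replacement endorsed by the original answer: the function name `utf16_to_utf8` is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumed policy (other code may prefer to report an error).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes UTF-16 code units as UTF-8 without <codecvt>.
// Assumption: an unpaired surrogate is replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            // High surrogate: combine it with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
            else
            {
                cp = 0xFFFD;
            }
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
        {
            cp = 0xFFFD;  // low surrogate without a preceding high surrogate
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);  // 1 byte: ASCII
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));  // 2 bytes
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));  // 3 bytes
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));  // 4 bytes
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For the "°F" sample, the code unit 0x00B0 falls in the two-byte range and 0x0046 in the one-byte range, so the result should be the three bytes C2 B0 46.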

For completeness, here is the simplest iconv-based conversion I came up with:

#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

auto iconv_eng = ::iconv_open("UTF-8", "UTF-16LE");
if (reinterpret_cast<::iconv_t>(-1) == iconv_eng)
{
    std::cerr << "Unable to create ICONV engine: " << strerror(errno) << std::endl;
}
else
{
    // src            a char * to utf16 bytes (advanced by iconv)
    // src_size       the maximum number of bytes to convert
    // dest           a char * to utf8 bytes to generate (advanced by iconv)
    // dest_size      the maximum number of bytes to write
    char *dest_begin = dest;  // remember where the output starts, since iconv moves dest
    if (static_cast<std::size_t>(-1) == ::iconv(iconv_eng, &src, &src_size, &dest, &dest_size))
    {
        std::cerr << "Unable to convert from UTF16: " << strerror(errno) << std::endl;
    }
    else
    {
        // dest now points one past the last UTF-8 byte written
        std::string utf8_str(dest_begin, dest);
    }
    ::iconv_close(iconv_eng);  // release the descriptor on both paths
}
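The fragment above leaves the buffers undefined, so here is a hedged sketch of how it might be wrapped into a self-contained function. The name `utf16le_to_utf8_iconv`, the use of exceptions, and the buffer sizing are all assumptions for illustration. It also assumes the glibc signature of iconv (taking `char **` for the input buffer), and that `size` excludes any trailing 0x0000 terminator.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical wrapper: converts `size` bytes of UTF-16LE into a UTF-8 string.
// `size` should not include a trailing 0x0000 terminator, otherwise the
// result will contain an embedded '\0'.
std::string utf16le_to_utf8_iconv(const unsigned char* bytes, std::size_t size)
{
    iconv_t cd = ::iconv_open("UTF-8", "UTF-16LE");
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error(std::string("iconv_open failed: ") + std::strerror(errno));

    // Two input bytes become at most three UTF-8 bytes, and a four-byte
    // surrogate pair becomes four, so twice the input size is enough room.
    std::vector<char> buffer(size * 2 + 1);

    // iconv takes non-const pointers but does not modify the input bytes.
    char* src = reinterpret_cast<char*>(const_cast<unsigned char*>(bytes));
    std::size_t src_left = size;
    char* dest = buffer.data();
    std::size_t dest_left = buffer.size();

    const std::size_t rc = ::iconv(cd, &src, &src_left, &dest, &dest_left);
    const int saved_errno = errno;  // capture before iconv_close can change it
    ::iconv_close(cd);              // always release the descriptor

    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error(std::string("iconv failed: ") + std::strerror(saved_errno));

    // iconv advanced `dest` past the last byte it wrote.
    return std::string(buffer.data(), dest);
}
```

With the sample packet, a call such as `utf16le_to_utf8_iconv(data2, sizeof data2 - 2)` skips the two terminator bytes. On platforms where iconv is not part of the C library, linking with `-liconv` may be required.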