C++ UTF-8 Swedish Characters are Read as ASCII

Keywords: ASCII, reading characters, UTF-8, Swedish, C++. Updated: 2023-10-16

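I have a C++ program to which I need to add the ability to read a file. I found that it does not work with special European characters. The example I am using is Swedish characters.

I changed the code to use wide characters, but that did not seem to help.

The sample text file I am reading contains the following: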

"NEW-DATA"="Nysted Vi prøver lige igen"

This is on Windows, and Notepad says the file uses UTF-8 encoding.

In Visual Studio, when debugging, the string that was read is displayed as if it were ASCII:

ï»¿"NEW-DATA"="Nysted Vi prÃ¸ver lige igen"

I changed the code to use the "wide" methods:

std::wifstream infile;
infile.open(argv[3], std::wifstream::in);
if (infile.is_open())
{
    std::wstring line;
    while (std::getline(infile, line))
    {

....

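Is there anything else I need to do to make it recognize UTF-8 correctly?

You can read the UTF-8 content as narrow (char) text, but you have to convert it to wide characters so that Visual Studio interprets it as Unicode.

Here is a function we commonly use for that: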

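#include <windows.h>
#include <oleauto.h>
#include <string>

BSTR UTF8ToBSTR(char const* astr)
{
    // See the MultiByteToWideChar documentation on MSDN.
    // CP_UTF8 indicates that the input is a UTF-8 string.
    // First call: get the number of wide characters needed, including the terminating null.
    int size = MultiByteToWideChar(CP_UTF8, 0, astr, -1, NULL, 0);
    if (size <= 0)
        return NULL; // conversion failed
    // Size the buffer to fit instead of using a fixed static array, which long input could overflow.
    std::wstring wstr(size, L'\0');
    // Second call: do the conversion and get the output.
    MultiByteToWideChar(CP_UTF8, 0, astr, -1, &wstr[0], size);
    // Allocate memory for the BSTR and return the BSTR.
    return SysAllocString(wstr.c_str());
}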

You must add code to free the memory allocated by the call to SysAllocString.

For example:

BSTR bstr = UTF8ToBSTR(...);
// Use bstr
// ...

// Deallocate memory
SysFreeString(bstr);

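What is happening is that you have a UTF-8 encoded file, but you are trying to read it as if it consisted of wide characters. That will not work. As you can see, the BOM (byte order mark) was read verbatim into the string, so the mechanism you are using clearly contains no logic to parse the characters or decode UTF-8 byte sequences.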
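The ï»¿ at the start of the debugger output is the three-byte UTF-8 BOM (EF BB BF) shown one byte per character. As a minimal sketch, assuming the line was read into a std::string with a narrow std::ifstream, the BOM could be removed before any further conversion. The helper name strip_utf8_bom is hypothetical and not part of the original code.

```cpp
#include <string>

// Hypothetical helper: removes a leading UTF-8 byte order mark (EF BB BF), if present.
// Returns true when a BOM was found and removed.
bool strip_utf8_bom(std::string& text)
{
    static const char bom[] = "\xEF\xBB\xBF";
    // compare() limits the range to the string's length, so short strings are safe.
    if (text.compare(0, 3, bom) == 0) {
        text.erase(0, 3);
        return true;
    }
    return false;
}
```

This only needs to be applied to the first line of the file; the remaining bytes still have to be decoded as UTF-8.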

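Wide characters and UTF-8 are two fundamentally different things. You cannot read UTF-8 simply by plopping it into wchar_t (or std::wstring) and reading away. You need to use some kind of Unicode library. There is std::wstring_convert in C++11 (but that requires tool support), and there is the manual mbstowcs()/wcstombs() route. It is best to use a library wherever possible.

Source: https://www.reddit.com/r/cpp/comments/108o7g/reading_utf8_encoded_text_files_to_stdwstring/

I think mbstowcs()/wcstombs() are the portable alternatives to Microsoft's MultiByteToWideChar() and WideCharToMultiByte().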
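To make concrete what such a library or conversion function has to do, here is a rough, hand-written decoder. It is an illustration only, not a replacement for a real library: the name decode_utf8 is hypothetical, it does not reject overlong encodings or surrogate values, and it produces char32_t code points rather than wchar_t (which is 16 bits wide on Windows). Note also that std::wstring_convert has been deprecated since C++17.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Throws std::runtime_error on truncated or malformed sequences.
// Simplified sketch: overlong encodings and surrogates are not rejected.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that follow the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and contributes 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With this sketch, the two bytes C3 B8 that the debugger showed as Ã¸ decode to the single code point U+00F8 (ø), and the three BOM bytes EF BB BF decode to U+FEFF.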