How to make a file stream read UTF-8 in C++

Keywords: UTF-8, C++, read, file stream. Updated: 2023-10-16

I am able to read a text file of UTF-8 characters by redirecting input and output on the terminal and then using wcin and wcout:

_setmode(_fileno(stdout), _O_U8TEXT);
_setmode(_fileno(stdin), _O_U8TEXT);

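Now I would like to read UTF-8 text through a file stream, but I don't know how to set the mode of a file stream so that it reads these characters the way stdin and stdout do. I tried wifstream and wofstream, but they still read and write garbage.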

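The C++ iostreams library has no built-in support for converting text from one encoding to another. If you need the input converted from UTF-8 into another form (for example, the underlying code points of the encoding), you have to write that conversion yourself.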

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

int main() {
    //Open in binary mode so newline translation does not alter the bytes
    std::ifstream in("utf8.txt", std::ios::binary);
    if(!in) {
        return 1; //Could not open the file
    }
    in.seekg(0, std::ios::end);
    auto size = static_cast<std::streamsize>(in.tellg());
    in.seekg(0, std::ios::beg);
    std::string data(static_cast<std::size_t>(size), '\0');
    in.read(data.data(), size); //Non-const data() requires C++17
    //data now contains the entire contents of the file
    uint32_t partial_codepoint = 0;
    unsigned num_of_bytes = 0; //Continuation bytes still expected
    std::vector<uint32_t> codepoints;
    for(char c : data) {
        uint8_t byte = uint8_t(c);
        if(byte < 128) {
            if(num_of_bytes > 0) {
                //Data was malformed: error handling?
                //Codepoint abruptly ended
                num_of_bytes = 0;
                partial_codepoint = 0;
            }
            //Character is just a basic ascii character, so we'll just set that as the codepoint value
            codepoints.push_back(byte);
        } else {
            //Character is part of multi-byte encoding
            //Test the counter, not partial_codepoint: a lead byte such as 0xF0 carries zero payload bits
            if(num_of_bytes > 0) {
                //We've already begun storing the codepoint
                if((byte >> 6) != 0b10) {
                    //Data was malformed: error handling?
                    //Codepoint abruptly ended
                }
                partial_codepoint = (partial_codepoint << 6) | (0b0011'1111 & byte);
                num_of_bytes--;
                if(num_of_bytes == 0) {
                    codepoints.emplace_back(partial_codepoint);
                    partial_codepoint = 0;
                }
            } else {
                //Beginning of new codepoint
                if((byte >> 6) == 0b10) {
                    //Data was malformed: error handling?
                    //Codepoint did not have proper beginning
                }
                //Count the leading 1 bits; this is the total length of the sequence
                while(byte & 0b1000'0000) {
                    num_of_bytes++;
                    byte = byte << 1;
                }
                partial_codepoint = byte >> num_of_bytes;
                num_of_bytes--; //The lead byte itself has already been consumed
            }
        }
    }
}

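This code will reliably convert [correctly encoded] UTF-8 to UTF-32, which is usually the easiest form to convert directly into glyphs and characters, though keep in mind that code points are not characters.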
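To make the "code points are not characters" point concrete, here is a minimal sketch of the same decoding idea as a reusable function. The name decode_utf8 and the choice to return std::nullopt on malformed input are my own assumptions for illustration, not part of any library. It only catches truncated sequences and misplaced bytes; it does not reject overlong encodings or surrogate values. It needs C++17.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into code points.
// Returns std::nullopt for truncated sequences, stray continuation
// bytes, or invalid lead bytes. Overlong forms are NOT rejected.
std::optional<std::vector<std::uint32_t>> decode_utf8(std::string_view text) {
    std::vector<std::uint32_t> codepoints;
    std::uint32_t partial = 0;
    unsigned remaining = 0; // continuation bytes still expected
    for (char c : text) {
        auto byte = static_cast<std::uint8_t>(c);
        if (remaining > 0) {
            if ((byte >> 6) != 0b10) return std::nullopt; // sequence cut short
            partial = (partial << 6) | (byte & 0b0011'1111u);
            if (--remaining == 0) codepoints.push_back(partial);
        } else if (byte < 0x80) {
            codepoints.push_back(byte); // plain ASCII
        } else if ((byte >> 5) == 0b110) {
            partial = byte & 0b0001'1111u; // lead byte of a 2-byte sequence
            remaining = 1;
        } else if ((byte >> 4) == 0b1110) {
            partial = byte & 0b0000'1111u; // lead byte of a 3-byte sequence
            remaining = 2;
        } else if ((byte >> 3) == 0b11110) {
            partial = byte & 0b0000'0111u; // lead byte of a 4-byte sequence
            remaining = 3;
        } else {
            return std::nullopt; // stray continuation or invalid lead byte
        }
    }
    if (remaining > 0) return std::nullopt; // input ended mid-sequence
    return codepoints;
}
```

Decoding "\xC3\xA9" (a precomposed é) yields the single code point U+00E9, while decoding "e\xCC\x81" (an e followed by a combining acute accent) yields two code points, U+0065 and U+0301, even though both usually render as the same visible character.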

To keep things consistent in your code, my recommendation is to store UTF-8 encoded text in your program as std::string and UTF-32 encoded text as std::vector<uint32_t>.
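If you follow that convention, you will eventually need to go the other way, for example before writing through a std::ofstream. The following is a rough sketch of that direction; encode_utf8 is a hypothetical name, and substituting U+FFFD for invalid values is an assumption of mine rather than the only reasonable policy.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encodes code points (UTF-32) as a UTF-8 std::string.
// Values above U+10FFFF and surrogate halves are replaced with U+FFFD.
std::string encode_utf8(const std::vector<std::uint32_t>& codepoints) {
    std::string out;
    for (std::uint32_t cp : codepoints) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD; // replacement character
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp); // 1 byte: 0xxxxxxx
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6)); // 110xxxxx
            out += static_cast<char>(0x80 | (cp & 0x3F)); // 10xxxxxx
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12)); // 1110xxxx
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18)); // 11110xxx
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting std::string can be written with an ordinary std::ofstream opened in binary mode, since the bytes are already UTF-8 and need no further conversion by the stream.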