Writing utf16 to file in binary mode
I'm trying to write a wstring to a file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried:
ofstream outFile("test.txt", std::ios::out | std::ios::binary);
wstring hello = L"hello";
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
outFile.close();
Opening test.txt in, for example, Firefox with the encoding set to UTF16, it shows as:
h�e�l�l�o�
Can anyone tell me why this happens?
EDIT:
Opening the file in a hex editor, I get:
FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00
Looks like I get two extra bytes between every character for some reason?
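Here we run into the little-used locale facets. If you output your string as a string (rather than as raw data), you can get the locale to do the appropriate conversion automatically.
Note: this code does not take into account the endianness of the wchar_t characters.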
#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"
int main(int argc,char* argv[])
{
// Construct a custom Unicode facet and add it to a locale.
UTF16Facet *unicodeFacet = new UTF16Facet();
const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);
// Create a stream and imbue it with the facet
std::wofstream saveFile;
saveFile.imbue(unicodeLocale);
// Now the stream is imbued we can open it.
// NB: If you open the file stream first, any attempt to imbue it with a locale will silently fail.
saveFile.open("output.uni");
saveFile << L"This is my Data\n";
return(0);
}
File: UTF16Facet.h
#include <locale>
class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
typedef MyType::state_type state_type;
typedef MyType::result result;
/* This function deals with converting data from the input stream into the internal stream.*/
/*
* from, from_end: Points to the beginning and end of the input that we are converting 'from'.
* to, to_limit: Points to where we are writing the conversion 'to'
* from_next: When the function exits this should have been updated to point at the next location
* to read from. (ie the first unconverted input character)
* to_next: When the function exits this should have been updated to point at the next location
* to write to.
*
* status: This indicates the status of the conversion.
* possible values are:
* error: An error occurred the bad file bit will be set.
* ok: Everything went to plan
* partial: Not enough input data was supplied to complete any conversion.
* nonconv: no conversion was done.
*/
virtual result do_in(state_type &s,
const char *from,const char *from_end,const char* &from_next,
wchar_t *to, wchar_t *to_limit,wchar_t* &to_next) const
{
// Loop over both the input and output arrays.
for(;(from < from_end) && (to < to_limit);from += 2,++to)
{
/*Input the Data*/
/* As the input 16 bits may not fill the wchar_t object,
* initialise it so that all its bits are zero. This
* is important on systems with 32-bit wchar_t objects.
*/
(*to) = L'\0';
/* Next read the data from the input stream into
* wchar_t object. Remember that we need to copy
* into the bottom 16 bits no matter what size
* the wchar_t object is.
*/
reinterpret_cast<char*>(to)[0] = from[0];
reinterpret_cast<char*>(to)[1] = from[1];
}
from_next = from;
to_next = to;
return((from > from_end)?partial:ok);
}
/* This function deals with converting data from the internal stream to a C/C++ file stream.*/
/*
* from, from_end: Points to the beginning and end of the input that we are converting 'from'.
* to, to_limit: Points to where we are writing the conversion 'to'
* from_next: When the function exits this should have been updated to point at the next location
* to read from. (ie the first unconverted input character)
* to_next: When the function exits this should have been updated to point at the next location
* to write to.
*
* status: This indicates the status of the conversion.
* possible values are:
* error: An error occurred the bad file bit will be set.
* ok: Everything went to plan
* partial: Not enough input data was supplied to complete any conversion.
* nonconv: no conversion was done.
*/
virtual result do_out(state_type &state,
const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
char *to, char *to_limit, char* &to_next) const
{
for(;(from < from_end) && (to < to_limit);++from,to += 2)
{
/* Output the Data */
/* NB I am assuming the characters are encoded as UTF-16.
* This means they are 16 bits inside a wchar_t object.
* As the size of wchar_t varies between platforms I need
* to take this into consideration and only take the bottom
* 16 bits of each wchar_t object.
*/
to[0] = reinterpret_cast<const char*>(from)[0];
to[1] = reinterpret_cast<const char*>(from)[1];
}
from_next = from;
to_next = to;
return((to > to_limit)?partial:ok);
}
};
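I suspect that sizeof(wchar_t) is 4 in your environment, i.e. it is writing out UTF-32/UCS-4 instead of UTF-16. That is certainly what the hex dump looks like.
That is easy enough to test (just print out sizeof(wchar_t)), but I'm pretty sure that is what is going on.
To go from a UTF-32 wstring to UTF-16 you need to apply a proper encoding, because surrogate pairs come into play.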
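The conversion described above can be sketched by hand. This is a minimal sketch, not code from the original posts: `to_utf16` is a hypothetical helper name, and it assumes that a 4-byte wchar_t holds UTF-32 code points (typical on Linux and macOS) and that the input contains only valid code points. Printing sizeof(wchar_t) first, as suggested, shows whether the surrogate branch can ever be reached on your platform.

```cpp
#include <cstdint>
#include <string>

// Convert a wide string to UTF-16 code units.
// Assumption: every wchar_t is a valid code point (or, with a 2-byte
// wchar_t, already a UTF-16 code unit). No validation is performed.
std::u16string to_utf16(const std::wstring& in)
{
    std::u16string out;
    out.reserve(in.size());
    for (wchar_t wc : in) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp < 0x10000) {
            // Fits in a single UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Each resulting char16_t can then be written to the file as two bytes in whichever byte order the file should use.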
If you are using the C++11 standard, this is easy (because there are a lot of additional includes, like "utf8", that solve this problem for good).
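As one possible C++11 route, the <codecvt> header provides std::codecvt_utf16, which can be combined with std::wstring_convert. This is a sketch under assumptions: `utf16le_bytes` is a hypothetical helper name, and both facilities are deprecated since C++17, so check that your standard library still ships them before relying on this.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Encode a wide string as UTF-16LE bytes, preceded by the FF FE byte order mark.
// utf16le_bytes is a hypothetical helper name used for illustration.
std::string utf16le_bytes(const std::wstring& text)
{
    // std::little_endian selects the byte order of the encoded output.
    std::wstring_convert<
        std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>> conv;
    // to_bytes throws std::range_error if the input cannot be converted.
    return std::string("\xFF\xFE") + conv.to_bytes(text);
}
```

The returned bytes can be written unchanged with a std::ofstream opened with std::ios::binary.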
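However, if you want multi-platform code that works with older standards, you can use this approach to write with streams:
- Read the article about the UTF converter for streams
- Add stxutif.h to your project from the source above
- Open the file in ANSI mode and add the UTF-8 BOM (EF BB BF) to the start of the file, like this:
std::ofstream fs;
fs.open(filepath, std::ios::out|std::ios::binary);
const unsigned char smarker[3] = {0xEF, 0xBB, 0xBF};
// write() emits exactly three bytes; operator<< would expect a null-terminated string
fs.write(reinterpret_cast<const char*>(smarker), sizeof(smarker));
fs.close();
- Then open the file as UTF and write your content there:
std::wofstream fs;
fs.open(filepath, std::ios::out|std::ios::app);
std::locale utf8_locale(std::locale(), new utf8cvt<false>);
fs.imbue(utf8_locale);
fs << .. // Write anything you want...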
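On Windows, using wofstream with the UTF-16 facet defined above fails, because the wofstream converts every byte with the value 0A into the two bytes 0D 0A. This happens regardless of how you pass the 0A byte in: '\x0A', L'\x0A', L'\x000A', '\n', L'\n' and std::endl all give the same result. On Windows you have to open the file with an ofstream (not a wofstream) in binary mode and write the output just as it is done in the original post.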
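The provided Utf16Facet didn't work in gcc for big strings; here is the version that worked for me. This way the file will be saved in UTF-16LE. For UTF-16BE, simply swap the assignments in do_in and do_out, e.g. to[0] = from[1] and to[1] = from[0].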
#include <locale>
// <locale> above already declares std::codecvt; the libstdc++-internal <bits/codecvt.h> is not needed.
class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
typedef MyType::state_type state_type;
typedef MyType::result result;
/* This function deals with converting data from the input stream into the internal stream.*/
/*
* from, from_end: Points to the beginning and end of the input that we are converting 'from'.
* to, to_limit: Points to where we are writing the conversion 'to'
* from_next: When the function exits this should have been updated to point at the next location
* to read from. (ie the first unconverted input character)
* to_next: When the function exits this should have been updated to point at the next location
* to write to.
*
* status: This indicates the status of the conversion.
* possible values are:
* error: An error occurred the bad file bit will be set.
* ok: Everything went to plan
* partial: Not enough input data was supplied to complete any conversion.
* nonconv: no conversion was done.
*/
virtual result do_in(state_type &s,
const char *from,const char *from_end,const char* &from_next,
wchar_t *to, wchar_t *to_limit,wchar_t* &to_next) const
{
for(;from < from_end;from += 2,++to)
{
if(to<=to_limit){
(*to) = L'\0';
reinterpret_cast<char*>(to)[0] = from[0];
reinterpret_cast<char*>(to)[1] = from[1];
from_next = from;
to_next = to;
}
}
return((to != to_limit)?partial:ok);
}
/* This function deals with converting data from the internal stream to a C/C++ file stream.*/
/*
* from, from_end: Points to the beginning and end of the input that we are converting 'from'.
* to, to_limit: Points to where we are writing the conversion 'to'
* from_next: When the function exits this should have been updated to point at the next location
* to read from. (ie the first unconverted input character)
* to_next: When the function exits this should have been updated to point at the next location
* to write to.
*
* status: This indicates the status of the conversion.
* possible values are:
* error: An error occurred the bad file bit will be set.
* ok: Everything went to plan
* partial: Not enough input data was supplied to complete any conversion.
* nonconv: no conversion was done.
*/
virtual result do_out(state_type &state,
const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
char *to, char *to_limit, char* &to_next) const
{
for(;(from < from_end);++from, to += 2)
{
if(to <= to_limit){
to[0] = reinterpret_cast<const char*>(from)[0];
to[1] = reinterpret_cast<const char*>(from)[1];
from_next = from;
to_next = to;
}
}
return((to != to_limit)?partial:ok);
}
};
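You should look at the output file in a hex editor such as WinHex so you can see the actual bits and bytes, to verify that the output is actually UTF-16. Post it here and let us know the result. That will tell us whether to blame Firefox or your C++ program.
But it looks to me like your C++ program works and Firefox is not interpreting your UTF-16 correctly. UTF-16 calls for two bytes for every character. But Firefox is printing twice as many characters as it should, so it is probably trying to interpret your string as UTF-8 or ASCII, which generally have just 1 byte per character.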
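If no hex editor is at hand, the bytes can also be inspected from C++. The following is a small sketch; `hex_dump` is a hypothetical helper name, and it only formats bytes that have already been read into a std::string.

```cpp
#include <cstddef>
#include <string>

// Format raw bytes as space-separated uppercase hex, e.g. "FF FE 68 00".
std::string hex_dump(const std::string& bytes)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out.push_back(' ');
        out.push_back(digits[b >> 4]);    // high nibble
        out.push_back(digits[b & 0x0F]);  // low nibble
    }
    return out;
}
```

To use it, read the file through a std::ifstream opened with std::ios::binary into a std::string and print the result. A UTF-16LE file would show 00 after every ASCII byte, while 00 00 00 after each one points to 4-byte characters.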
When you say "Firefox with encoding set to UTF16", what do you mean? I'm skeptical that that works.