Convert a char array of escaped UTF-8 octets to a string in C++
I have a char array that contains some UTF-8 encoded Turkish characters, in the form of escaped octets. So if I run this code in C++11:
void foo(char* utf8_encoded) {
    cout << utf8_encoded << endl;
}
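it prints \xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e. I want to convert this char[] to a std::string so that it contains the UTF-8 decoded value İ-Ç-Ü-Ğ. I have converted the char[] to a wstring, but it still prints as \xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e. How can I do this?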
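Edit: I am not the one who builds this char[]. It is one of the fixed-length members of the argument passed to a callback function, which is called by a proprietary library. The callback function looks like this: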
void some_callback_function (INFO *info) {
    cout << info->some_char_array << endl;
    cout << "*****" << endl;
    for(int i=0; i<64; i++) {
        cout << "-" << info->some_char_array[i];
    }
    cout << "*****" << endl;
    char bar[65] = "\xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e";
    cout << bar << endl;
}
where the INFO struct is:
typedef struct {
    char some_char_array[65];
} INFO;
So when my callback function is called, the output is as follows:
\xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e
*****
-\-x-c-4-\-x-b-0---\-x-c-3-\-x-8-7---\-x-c-3-\-x-9-c---\-x-c-4-\-x-9-e-----------------------------
*****
İ-Ç-Ü-Ğ
So my current problem is that I don't understand the difference between the info->some_char_array and bar char arrays. What I want is to edit info->some_char_array so that it prints as İ-Ç-Ü-Ğ.
Well, the code below is ripped out of a larger parser I am using. But "a bit terse" is the nature of Boost.Spirit. ;-)
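The parser handles not only hexadecimal escapes but also octal ones (\123) and the "standard" escapes (\n). It is provided under CC0, so you can do with it whatever you like. ;-)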
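Boost.Spirit is a "header only" part of Boost, so you don't need to link any library code. However, the rather involved template "magic" that the Spirit headers perform, so that a grammar can be expressed directly in C++ source like this, is somewhat hard on compile time.
But it works, and it works well.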
#define BOOST_SPIRIT_USE_PHOENIX_V3
#include "boost/spirit/include/qi.hpp"
#include "boost/spirit/include/phoenix.hpp"
#include <cstring>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
namespace
{
    // Helper function: Turn on_error positional parameters into error message.
    template< typename Iterator >
    std::string make_error_message( boost::spirit::info const & info, Iterator first, Iterator last )
    {
        std::ostringstream oss;
        oss << "Invalid sequence. Expecting " << info << " here: \"" << std::string( first, last ) << "\"";
        return oss.str();
    }
}
// Wrap helper function with Boost.Phoenix boilerplate, so the function
// can be called from within a parser's [].
BOOST_PHOENIX_ADAPT_FUNCTION( std::string, make_error_message_, make_error_message, 3 )
// Supports various escape sequences:
// - Character escapes ( \a \b \f \n \r \t \v \" \\ )
// - Octal escapes ( \n \nn \nnn )
// - Hexadecimal escapes ( \xnn ) (*)
//
// (*): In C/C++, a hexadecimal escape runs until the first non-hexdigit
// is encountered, which is not very helpful. This one takes exactly
// two hexdigits.
// Declaring a grammar that works given any kind of iterator,
// and results in a std::string object.
template < typename Iterator >
class EscapedString : public boost::spirit::qi::grammar< Iterator, std::string() >
{
    public:
        // Constructor
        EscapedString() : EscapedString::base_type( escaped_string )
        {
            // An escaped string is a sequence of
            // characters that are not '\\', or
            // an escape sequence
            escaped_string = *( +( boost::spirit::ascii::char_ - '\\' ) | escapes );
            // An escape sequence begins with '\\', followed by
            // an escaped character (e.g. "\n"), or
            // an 'x' and 2..2 hexadecimal digits, or
            // 1..3 octal digits.
            escapes = '\\' > ( escaped_character
                             | ( "x" > boost::spirit::qi::uint_parser< char, 16, 2, 2 >() )
                             | boost::spirit::qi::uint_parser< char, 8, 1, 3 >() );
            // The list of special "escape" characters
            escaped_character.add
                ( "a", 0x07 ) // alert
                ( "b", 0x08 ) // backspace
                ( "f", 0x0c ) // form feed
                ( "n", 0x0a ) // new line
                ( "r", 0x0d ) // carriage return
                ( "t", 0x09 ) // horizontal tab
                ( "v", 0x0b ) // vertical tab
                ( "\"", 0x22 ) // literal quotation mark
                ( "\\", 0x5c ) // literal backslash
            ;
            // Error handling
            boost::spirit::qi::on_error< boost::spirit::qi::fail >
            (
                escapes,
                // backslash not followed by a valid sequence
                boost::phoenix::throw_(
                    boost::phoenix::construct< std::runtime_error >( make_error_message_( boost::spirit::_4, boost::spirit::_3, boost::spirit::_2 ) )
                )
            );
        }
    private:
        // Qi Rule member
        boost::spirit::qi::rule< Iterator, std::string() > escaped_string;
        // Helpers
        boost::spirit::qi::rule< Iterator, std::string() > escapes;
        boost::spirit::qi::symbols< char const, char > escaped_character;
};
int main()
{
    // Need to escape the backslashes, or "\xc4" would give *one*
    // byte of output (0xc4, decimal 196). I understood the input
    // to be the FOUR character hex char literal,
    // backslash, x, c, 4 in this case,
    // which is what this string literal does.
    // (A string literal is const, hence "char const *".)
    char const * some_char_array = "\\xc4\\xb0-\\xc3\\x87-\\xc3\\x9c-\\xc4\\x9e";
    std::cout << "Input: '" << some_char_array << "'\n";
    // result object
    std::string s;
    // Create an instance of the grammar with "char const *"
    // as the iterator type.
    EscapedString< char const * > es;
    // start, end, parsing grammar, result object
    boost::spirit::qi::parse( some_char_array,
                              some_char_array + std::strlen( some_char_array ),
                              es,
                              s );
    std::cout << "Output: '" << s << "'\n";
    return 0;
}
This gives:
Input: '\xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e'
Output: 'İ-Ç-Ü-Ğ'