Xerces-c and cross-platform string literals

Keywords: string literals, cross-platform, Xerces-c. Updated: 2023-10-16

I am porting a code base that uses Xerces-c for XML processing from Windows/VC++ to Linux/G++.

On Windows, Xerces-c uses wchar_t as its character type, XMLCh. This makes it possible to use std::wstring and string literals written with the L"" syntax.

On Linux/G++, wchar_t is 32 bits wide, and Xerces-c uses unsigned short int (16 bits) as its character type, XMLCh.

I have started down this track:

#include <string>

#ifdef _MSC_VER
using u16char_t = wchar_t;
using u16string_t = std::wstring;
#elif defined __linux
using u16char_t = char16_t;
using u16string_t = std::u16string;
#endif

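Unfortunately, char16_t and unsigned short int are not equivalent types, and pointers to them are not implicitly convertible. As a result, passing u"Hello, world." to a Xerces function still produces an invalid conversion error.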
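
The mismatch can be reproduced without Xerces at all. The following is a minimal sketch, assuming a 16-bit unsigned short; FakeXMLCh and fake_xml_length are hypothetical stand-ins for the Xerces character type and for a Xerces function taking a null-terminated string, not part of the real library. Strictly speaking, reading char16_t data through an unsigned short pointer is not sanctioned by the C++ aliasing rules, so the cast should be treated as a pragmatic workaround rather than a guaranteed-portable one.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for the 16-bit Xerces character type on Linux/G++.
using FakeXMLCh = unsigned short;

// The two types are distinct even though they have the same width here.
static_assert(!std::is_same<char16_t, FakeXMLCh>::value,
              "char16_t and unsigned short are different types");
static_assert(sizeof(char16_t) == sizeof(FakeXMLCh),
              "the cast below assumes both types are the same width");

// Hypothetical stand-in for a Xerces function taking a null-terminated string.
inline std::size_t fake_xml_length(const FakeXMLCh* s) {
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

// fake_xml_length(u"Hello, world.");  // error: no implicit conversion from
//                                     // const char16_t* to const unsigned short*

// The explicit cast compiles, at the cost of writing it at every call site.
inline std::size_t length_via_cast(const char16_t* s) {
    return fake_xml_length(reinterpret_cast<const FakeXMLCh*>(s));
}
```

Calling length_via_cast(u"Hello, world.") compiles and counts the 16-bit code units of the literal, whereas the commented-out direct call is rejected by the compiler.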

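It is starting to look like I will have to explicitly cast every string I pass to a Xerces function. Before I do that, I wanted to ask whether anyone knows a more sensible way to write cross-platform Xerces-c code.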

The answer is no, nobody has a good idea of how to do this. For anyone else who finds this question, this is what I came up with:

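#include <fstream>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

#ifdef _MSC_VER
// wchar_t is already the Xerces character type, so no casts are needed.
#define U16S(x) L##x
#define U16XS(x) L##x
#define XS(x) x
#define US(x) x
#elif defined __linux
// U16S: 16-bit string literal. U16XS: 16-bit literal already cast for Xerces.
#define U16S(x) u##x
#define U16XS(x) reinterpret_cast<const unsigned short *>(u##x)
// XS: cast our char16_t strings to the Xerces character type.
inline unsigned short *XS(char16_t* x) {
    return reinterpret_cast<unsigned short *>(x);
}
inline const unsigned short *XS(const char16_t* x) {
    return reinterpret_cast<const unsigned short *>(x);
}
// US: cast strings returned by Xerces back to char16_t.
inline char16_t* US(unsigned short *x) {
    return reinterpret_cast<char16_t *>(x);
}
inline const char16_t* US(const unsigned short *x) {
    return reinterpret_cast<const char16_t*>(x);
}
#include "char16_t_facets.hpp"
#endif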
namespace SafeStrings {
#if defined _MSC_VER
    using u16char_t = wchar_t;
    using u16string_t = std::wstring;
    using u16sstream_t = std::wstringstream;
    using u16ostream_t = std::wostream;
    using u16istream_t = std::wistream;
    using u16ofstream_t = std::wofstream;
    using u16ifstream_t = std::wifstream;
    using filename_t = std::wstring;
#elif defined __linux
    using u16char_t = char16_t;
    using u16string_t = std::basic_string<char16_t>;
    using u16sstream_t = std::basic_stringstream<char16_t>;
    using u16ostream_t = std::basic_ostream<char16_t>;
    using u16istream_t = std::basic_istream<char16_t>;
    using u16ofstream_t = std::basic_ofstream<char16_t>;
    using u16ifstream_t = std::basic_ifstream<char16_t>;
    using filename_t = std::string;
#endif
} // namespace SafeStrings

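char16_t_facets.hpp contains the definitions of the template specializations std::ctype<char16_t>, std::numpunct<char16_t>, and std::codecvt<char16_t, char, std::mbstate_t>. These have to be added to the global locale, along with std::num_get<char16_t> and std::num_put<char16_t> (for which no specializations need to be provided). The code for codecvt is the only difficult part, and a reasonable template can be found in the GCC 5.0 library (if you are using GCC 5, you do not need to provide the codecvt specialization, since it is already in the library).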
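
To illustrate what the library-provided codecvt does, here is a minimal sketch that uses std::codecvt<char16_t, char, std::mbstate_t> directly to convert UTF-16 to UTF-8, assuming a standard library that ships this specialization (GCC 5 or later, for example). The names Utf16Codecvt and to_utf8 are hypothetical helpers introduced for this example and are not taken from char16_t_facets.hpp. Note that this specialization is deprecated as of C++20, so newer compilers may warn about it.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// std::codecvt has a protected destructor; this thin wrapper (a hypothetical
// helper) makes it usable as a local object.
struct Utf16Codecvt : std::codecvt<char16_t, char, std::mbstate_t> {
    ~Utf16Codecvt() {}
};

// Convert a UTF-16 string to UTF-8 using the library's codecvt facet.
inline std::string to_utf8(const std::u16string& in) {
    Utf16Codecvt cvt;
    std::mbstate_t state = std::mbstate_t();

    // Each UTF-16 code unit needs at most 3 bytes of UTF-8
    // (a surrogate pair is 2 units and encodes to 4 bytes).
    std::string out(in.size() * 3, '\0');

    const char16_t* from_next = nullptr;
    char* to_next = nullptr;
    std::codecvt_base::result r =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok) {
        throw std::runtime_error("UTF-16 to UTF-8 conversion failed");
    }

    // Trim the buffer to the number of bytes actually written.
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

A char16_t file stream performs the same kind of conversion through the codecvt facet of its locale, which is why that facet has to be present before such streams can work.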

Once all of that is done, the char16_t streams will work correctly.

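Then, every time you define a wide string, write U16S("string") instead of L"string". Every time you pass a string to Xerces, write XS(string.c_str()), or U16XS("string") for literals. Every time you get a string back from Xerces, convert it back with u16string_t(US(call_xerces_function())).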
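
The three conventions above can be sketched as follows. This is only an illustration that reproduces the Linux branch of the definitions so it stands alone; fake_xerces_echo is a hypothetical stand-in for a Xerces function that accepts and returns 16-bit strings, and round_trip and literal_round_trip are names invented for the example.

```cpp
#include <string>

// Linux-branch definitions, repeated here so the sketch is self-contained.
#define U16S(x) u##x
#define U16XS(x) reinterpret_cast<const unsigned short *>(u##x)

inline const unsigned short* XS(const char16_t* x) {
    return reinterpret_cast<const unsigned short*>(x);
}
inline const char16_t* US(const unsigned short* x) {
    return reinterpret_cast<const char16_t*>(x);
}
using u16string_t = std::u16string;

// Hypothetical stand-in for a Xerces call: takes a 16-bit string and
// returns one (here it simply returns its argument).
inline const unsigned short* fake_xerces_echo(const unsigned short* s) {
    return s;
}

// Passing one of our strings to "Xerces" and converting the result back.
inline u16string_t round_trip(const u16string_t& in) {
    return u16string_t(US(fake_xerces_echo(XS(in.c_str()))));
}

// Passing a literal directly, using the pre-cast literal macro.
inline u16string_t literal_round_trip() {
    return u16string_t(US(fake_xerces_echo(U16XS("tag"))));
}
```

For example, round_trip(U16S("element")) passes the string through the stand-in function and returns it as a u16string_t. On Windows the same call sites compile unchanged, because there the macros expand to plain L"" literals and XS and US do nothing.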

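Note that it is also possible to recompile Xerces-C with the character type set to char16_t. This removes much of the effort required above. However, you will then be unable to use any other library on the system that itself depends on Xerces-C. Linking against any such library will cause link errors (because changing the character type changes many of the Xerces function signatures).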