
How to use unicode strings in a Python wrapper for a C++ class with Cython?

Updated: 2023-10-16

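I'm currently working on a pet project. My goal right now is to write a Python wrapper for a C++ class using Cython. The problem is that I have to work with Russian text (unicode), but the Cython wrapper only accepts bytes, even though the C++ class methods handle Unicode strings correctly. I've read the Cython documentation and searched Google, but found nothing.

How can I change my code so that my Python wrapper accepts unicode strings?

Here is a link to my GitHub repository with the current code files: https://github.com/rproskuryakov/lemmatizer/tree/trie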

"Trie.pxd":

from libcpp.string cimport string
from libcpp cimport bool

cdef extern from "Trie.cpp":
    pass

# Declare the class with cdef
cdef extern from "Trie.h":
    cdef cppclass Trie:
        Trie() except +
        void add_word(string word)  # function that should take unicode
        bool find(string word)  # function that should take unicode

"pytrie.pyx":

from trie cimport Trie  # link to the corresponding .pxd file

# Create a Cython extension type (a Python extension type) which holds
# a C++ instance as an attribute, plus a bunch of methods that forward
# to it.
cdef class PyTrie:
    cdef Trie c_tree  # Hold the C++ instance which we're wrapping

    def __cinit__(self):
        self.c_tree = Trie()

    def add_word(self, word):
        return self.c_tree.add_word(word)

    def find(self, word):
        return self.c_tree.find(word)

This is what I get in Python:

>>> tree.add_word(b'hello')  # works for ASCII-only English text passed as bytes
>>> tree.add_word('привет')  # doesn't work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "wrapper/pytrie.pyx", line 13, in pytrie.PyTrie.add_word
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
TypeError: expected bytes, str found

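A C++ string is internally an array of char, so it really operates at the "bytes" level rather than the Unicode level. For that reason Cython does not automatically support unicode/str <-> std::string conversion. However, you have two fairly simple options: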

  1. Use the unicode/str.encode function to get a bytes representation of the unicode object:

    def add_word(self, word):
        if isinstance(word, str):  # Python 3 version - use unicode for Python 2
            word = word.encode()
        return self.c_tree.add_word(word)
    

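    The main thing you have to be careful about is that the encoding C++ uses to interpret the bytes is the same one Python used to produce them (Python 3's str.encode defaults to UTF-8).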

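  2. Convert to the C++ type std::wstring, which is internally an array of wchar_t. Unfortunately Cython does not wrap wstring by default or provide automatic conversions, so you will need to write your own wrapper. Use Cython's wrapping of std::string as a reference; you probably only need to wrap the constructors anyway. I have used the Python C API to do the conversion to wchar_t*: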

    from libc.stddef cimport wchar_t

    cdef extern from "<string>" namespace "std":
        cdef cppclass wstring:
            wstring() except +
            wstring(size_t, wchar_t) except +
            const wchar_t* data()

    cdef extern from "Python.h":
        # again, not wrapped by Cython by default
        Py_ssize_t PyUnicode_AsWideChar(object o, wchar_t *w, Py_ssize_t size) except -1

    # conversion function
    cdef wstring to_wstring(s):
        # create 0-filled output
        cdef wstring out = wstring(len(s), 0)
        PyUnicode_AsWideChar(s, <wchar_t*>out.data(), len(s))  # note cast to remove const
        # I'm not convinced this is 100% acceptable according to the standard, but practically it should work
        return out
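
Whether option 2 suits your C++ code also depends on how wide `wchar_t` is on your platform: it is typically 4 bytes on Linux and macOS and 2 bytes on Windows. A minimal pure-Python sketch for checking this with `ctypes` follows; the helper name `wide_units` is made up for illustration and is not part of the wrapper above.

```python
import ctypes

# Size of one wchar_t on this platform, in bytes (usually 2 or 4).
WCHAR_SIZE = ctypes.sizeof(ctypes.c_wchar)


def wide_units(text):
    """Count the wchar_t units needed to hold text, excluding the terminator.

    Hypothetical helper, for illustration only.
    """
    buf = ctypes.create_unicode_buffer(text)  # wchar_t array with a trailing NUL
    return len(buf) - 1


word = "привет"
print(WCHAR_SIZE, wide_units(word), len(word))
```

For text inside the Basic Multilingual Plane, such as Russian, the number of `wchar_t` units equals `len(s)`, which is what `to_wstring` above relies on. With a 2-byte `wchar_t`, characters outside that plane need two units each, so `len(s)` could be too small there.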
    

Which of these options you prefer depends largely on what your C++ code accepts as unicode strings.
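
Option 1 can be tried out in plain Python before rebuilding the extension. The sketch below assumes the C++ side expects UTF-8; `to_bytes` is a hypothetical helper name, not something provided by Cython or taken from the repository.

```python
def to_bytes(word, encoding="utf-8"):
    """Return word as bytes, encoding str input (hypothetical helper)."""
    if isinstance(word, str):
        return word.encode(encoding)
    if isinstance(word, (bytes, bytearray)):
        return bytes(word)  # already bytes: pass through unchanged
    raise TypeError(f"expected str or bytes, got {type(word).__name__}")


word = "привет"
raw = to_bytes(word)
print(len(word), len(raw))             # 6 characters, 12 UTF-8 bytes
assert raw.decode("utf-8") == word     # round-trips when both sides agree on UTF-8
assert to_bytes(b"hello") == b"hello"  # bytes input is left alone
```

Inside `PyTrie.add_word` and `PyTrie.find`, the same conversion would be applied to `word` before calling into `c_tree`. Note that the `std::string` then holds 12 bytes for this 6-letter word, so any per-character logic in the C++ trie operates on bytes rather than letters.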