Unicode const char* to JString using JNI and C++

Keywords: using JNI and C++ JString to const char Unicode. Last updated: 2023-10-16

A simple question: how do I get a jstring from a Unicode const char* using JNI and C++?

Here is my problem, and what I have already tried:

const char* value = (some value from server);
env->NewStringUTF(value);  // C++ form; in C this would be (*env)->NewStringUTF(env, value)

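The problem here is that NewStringUTF builds the string from (Modified) UTF-8 bytes, and it does not like bytes that are not valid UTF-8 (somewhat obvious, but it was worth a quick try).

Attempt 2, using NewString: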

const char* value = (some value from server);
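env->NewString(value, strlen(value));

While NewString accepts and returns a Unicode string, passing strlen(value) does not work, because it requires a jsize parameter and not just a good ol' size_t or length.

How do we get a jsize? According to the (very, very small amount of) documentation and examples online, you can get a jsize from a jintArray. I could not find any information on how to convert a const char* into some kind of jarray, and that is probably a bad idea anyway.

Another option would be to get a jsize as an int from a size_t, which I have not managed to do either.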
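On the jsize point specifically: in jni.h, jsize is a typedef for jint, a 32-bit signed integer, so no array is needed to obtain one and a plain cast from size_t works. A minimal sketch of a range-checked cast follows; the helper name toJsize is hypothetical, and the `using jsize` line is only a stand-in so the snippet compiles without the JNI headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Stand-in for the typedef in <jni.h> (assumption for this sketch only).
// Remove this line when compiling against the real JNI headers.
using jsize = std::int32_t;

// Hypothetical helper: narrows a size_t to jsize, refusing lengths that
// would not fit in a signed 32-bit value.
jsize toJsize(std::size_t n) {
    if (n > static_cast<std::size_t>(std::numeric_limits<jsize>::max())) {
        throw std::length_error("length does not fit in jsize");
    }
    return static_cast<jsize>(n);
}
```

With this, a call such as toJsize(strlen(value)) yields a value that can be passed wherever JNI asks for a jsize.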

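Has anyone run into this problem, or does anyone have suggestions on how to solve it? It seems that jsize is the key piece I am missing for the Unicode conversion. Also, I am using JNI and the Android NDK, in case that helps anyone.

Thanks.

EDIT: I just realized that NewString also expects a jchar*, so its signature is (const jchar*, jsize). That means a const char* will not compile even with a jsize.

EDIT 2: Below is the exception raised at runtime when using the NewStringUTF approach. It relates to what @fadden was talking about: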
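Because NewString wants UTF-16 code units, a const char* has to be decoded first. The following is a minimal sketch under the assumption that the server sends standard UTF-8; the function name utf8ToUtf16 is hypothetical, malformed bytes become U+FFFD, and overlong encodings are not rejected.

```cpp
#include <string>

// Hypothetical helper: decodes NUL-terminated UTF-8 into UTF-16 code units.
// Malformed or truncated sequences are replaced with U+FFFD.
std::u16string utf8ToUtf16(const char* s) {
    std::u16string out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    while (*p != 0) {
        unsigned char b = *p++;
        char32_t cp;
        int extra;  // number of continuation bytes still expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }  // illegal start byte
        for (; extra > 0; --extra) {
            if ((*p & 0xC0) != 0x80) { cp = 0xFFFD; break; }  // not a continuation byte
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        // Reject values outside Unicode and encoded surrogates.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp >= 0x10000) {
            // Supplementary character: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

// Possible use with JNI (not compiled here, requires <jni.h>):
//   std::u16string u16 = utf8ToUtf16(value);
//   jstring js = env->NewString(reinterpret_cast<const jchar*>(u16.data()),
//                               static_cast<jsize>(u16.size()));
```

The decoding is done by hand here to keep the sketch dependency-free; if the bytes from the server are in some other encoding, this decoder does not apply.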

JNI WARNING: NewStringUTF input is not valid Modified UTF-8: illegal start byte 0xb7 03string: ' : Method(d6, us-dev1-api, 0), , 訩�x�m�P)

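As the error message shows, your char* is not valid Modified UTF-8, so the JVM aborted.

You have two ways to avoid this.

First, check the contents of the char* to avoid the crash.

The checking logic in Android ART's check_jni.cc is here: https://android.googlesource.com/platform/art/+/35e827a/runtime/check_jni.cc#1273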

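// Requires <sstream> and <system_error>. CheckUtfBytes is the checker taken
// from check_jni.cc linked above; as used here, it sets *error on failure
// and returns the offending byte.
jstring toJString(JNIEnv* env, const char* bytes) {
    const char* error = nullptr;
    auto utf8 = CheckUtfBytes(bytes, &error);
    if (error) {
        // Report the kind of error and the bad byte instead of letting the VM abort.
        std::ostringstream msg;
        msg << error << " 0x" << std::hex << static_cast<int>(utf8);
        throw std::system_error(-1, std::generic_category(), msg.str());
    } else {
        return env->NewStringUTF(bytes);
    }
}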

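This way, any jstring you get back is valid; invalid input produces a C++ exception instead of a crash.

Second, build the string from a jbyteArray using the String constructor.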

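// java_lang_String_class and java_lang_String_init are assumed to be cached
// elsewhere: the java/lang/String class and its String(byte[], String)
// constructor, whose JNI signature is "([BLjava/lang/String;)V".
jstring toJString(JNIEnv *env, const char *pat) {
    jsize len = static_cast<jsize>(strlen(pat));
    // First copy: native bytes into a Java byte[].
    jbyteArray bytes = env->NewByteArray(len);
    env->SetByteArrayRegion(bytes, 0, len, reinterpret_cast<const jbyte *>(pat));
    jstring encoding = env->NewStringUTF("utf-8");
    // Second copy: new String(bytes, "utf-8") decodes the array on the Java side.
    jstring jstr = (jstring) env->NewObject(java_lang_String_class,
            java_lang_String_init, bytes, encoding);
    env->DeleteLocalRef(encoding);
    env->DeleteLocalRef(bytes);
    return jstr;
}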

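This way you avoid the crash, but the string may still be invalid (malformed bytes are replaced during decoding), and the memory is copied twice, which performs poorly.

Here is the checking code (it needs <cstdint> for uint8_t):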

inline bool checkUtfBytes(const char* bytes) {
  while (*bytes != '\0') {
    const uint8_t* utf8 = reinterpret_cast<const uint8_t*>(bytes++);
    // Switch on the high four bits.
    switch (*utf8 >> 4) {
      case 0x00:
      case 0x01:
      case 0x02:
      case 0x03:
      case 0x04:
      case 0x05:
      case 0x06:
      case 0x07:
        // Bit pattern 0xxx. No need for any extra bytes.
        break;
      case 0x08:
      case 0x09:
      case 0x0a:
      case 0x0b:
        // Bit patterns 10xx, which are illegal start bytes.
        return false;
      case 0x0f:
        // Bit pattern 1111, which might be the start of a 4 byte sequence.
        if ((*utf8 & 0x08) == 0) {
          // Bit pattern 1111 0xxx, which is the start of a 4 byte sequence.
          // We consume one continuation byte here, and fall through to consume two more.
          utf8 = reinterpret_cast<const uint8_t*>(bytes++);
          if ((*utf8 & 0xc0) != 0x80) {
            return false;
          }
        } else {
          return false;
        }
        // Fall through to the cases below to consume two more continuation bytes.
      case 0x0e:
        // Bit pattern 1110, so there are two additional bytes.
        utf8 = reinterpret_cast<const uint8_t*>(bytes++);
        if ((*utf8 & 0xc0) != 0x80) {
          return false;
        }
        // Fall through to consume one more continuation byte.
      case 0x0c:
      case 0x0d:
        // Bit pattern 110x, so there is one additional byte.
        utf8 = reinterpret_cast<const uint8_t*>(bytes++);
        if ((*utf8 & 0xc0) != 0x80) {
          return false;
        }
        break;
    }
  }
  return true;
}
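If throwing on bad input is not acceptable, a third possibility (not part of the answer above, offered only as a sketch) is to replace every byte that fails the same structural check with '?' before calling NewStringUTF. The names sequenceLength and sanitizeUtf8 are hypothetical. Like the checker above, this looks only at start and continuation bit patterns, not at overlong encodings.

```cpp
#include <string>

// Returns the length (1 to 4) of a structurally well-formed sequence
// starting at p, or 0 if the byte at p does not start one.
// Precondition: p[0] is not the NUL terminator.
static int sequenceLength(const unsigned char* p) {
    int n;
    if (p[0] < 0x80)                n = 1;  // 0xxxxxxx
    else if ((p[0] & 0xE0) == 0xC0) n = 2;  // 110xxxxx
    else if ((p[0] & 0xF0) == 0xE0) n = 3;  // 1110xxxx
    else if ((p[0] & 0xF8) == 0xF0) n = 4;  // 11110xxx
    else return 0;                          // 10xxxxxx or 11111xxx
    for (int i = 1; i < n; ++i) {
        // A NUL terminator fails this test too, so we never read past the end.
        if ((p[i] & 0xC0) != 0x80) return 0;
    }
    return n;
}

// Copies the input, replacing each byte that cannot start a well-formed
// sequence with '?'.
std::string sanitizeUtf8(const char* bytes) {
    std::string out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(bytes);
    while (*p != 0) {
        int n = sequenceLength(p);
        if (n == 0) {
            out.push_back('?');
            ++p;
        } else {
            out.append(reinterpret_cast<const char*>(p), static_cast<std::size_t>(n));
            p += n;
        }
    }
    return out;
}
```

The result of sanitizeUtf8(value).c_str() can then be handed to NewStringUTF. The trade-off is that the bad bytes are lost silently, so this suits logging or display more than data that must round-trip.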