boost::regex with case-insensitive matching on UTF-8 (e.g. uppercase versus lowercase umlauts)

Keywords: uppercase, lowercase, UTF-8, case-insensitive, regex, boost, matching      Updated: 2023-10-16

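After building boost::regex 1.52 with International Components for Unicode (ICU) support, a regular expression with case-insensitive matching does not appear to handle uppercase and lowercase German umlauts as expected.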

static const std::string pattern("^.*" "\303\226" ".*$");  // "\303\226" is U+00D6 (Ö) in UTF-8
static const std::string   test1("SCH" "\303\226" "NE");
static const std::string   test2("sch" "\303\266" "ne");   // "\303\266" is U+00F6 (ö) in UTF-8
static const boost::regex exp(pattern, boost::regex::icase);
const char *result = (boost::regex_match(test1, exp)) ? "Match" : "NoMatch";
std::cout << "Testing \"" << test1 << "\" against pattern \"" << pattern 
    << "\" : " << result << std::endl;
result = (boost::regex_match(test2, exp)) ? "Match" : "NoMatch";
std::cout << "Testing \"" << test2 << "\" against pattern \"" << pattern 
    << "\" : " << result << std::endl;

This yields:

Testing "SCHÖNE" against pattern "^.*Ö.*$" : Match
Testing "schöne" against pattern "^.*Ö.*$" : NoMatch
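
A likely explanation for the NoMatch result is that boost::regex over std::string works on individual char values, so a two-byte UTF-8 sequence such as Ö is never seen as one letter. The sketch below uses only the standard library (no Boost) to show that byte-wise lowercasing leaves the umlaut bytes untouched. The helper names fold_bytes and check_byte_folding are hypothetical, and the behaviour shown assumes the default "C" locale.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: lowercase every byte on its own, which is roughly
// what a char-based, case-insensitive comparison can do at best.
// In the default "C" locale std::tolower only changes 'A'..'Z'.
std::string fold_bytes(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    for (unsigned char c : s) {
        out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Hypothetical self-check using the strings from the question.
void check_byte_folding() {
    const std::string upper("SCH" "\303\226" "NE");  // SCHÖNE
    const std::string lower("sch" "\303\266" "ne");  // schöne
    // The ASCII letters fold, but the bytes 0xC3 0x96 stay as they are,
    // so the two strings still differ after byte-wise folding.
    assert(fold_bytes(upper) == std::string("sch" "\303\226" "ne"));
    assert(fold_bytes(upper) != fold_bytes(lower));
}
```

Calling check_byte_folding() from main should pass both assertions under the stated locale assumption, which is consistent with the pattern matching test1 but not test2.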

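Use the Unicode-aware regex types that Boost.Regex provides for ICU (boost::u32regex with boost::make_u32regex and boost::u32regex_match) instead of boost::regex.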

LWS example:

#include <iostream>
#include <string>
#include <boost/regex.hpp>
#include <boost/regex/icu.hpp>
int main()
{
   // "\303\226" is U+00D6 (Ö) and "\303\266" is U+00F6 (ö), both encoded as UTF-8.
   static const std::string pattern("^.*" "\303\226" ".*$");
   static const std::string   test1("SCH" "\303\226" "NE");
   static const std::string   test2("sch" "\303\266" "ne");
   // u32regex interprets the UTF-8 input as Unicode code points rather than single bytes.
   static const boost::u32regex exp=boost::make_u32regex(pattern, boost::regex::icase);
   const char *result = (boost::u32regex_match(test1, exp)) ? "Match" : "NoMatch";
   std::cout << "Testing \"" << test1 << "\" against pattern \"" << pattern 
      << "\" : " << result << std::endl;
   result = (boost::u32regex_match(test2, exp)) ? "Match" : "NoMatch";
   std::cout << "Testing \"" << test2 << "\" against pattern \"" << pattern 
      << "\" : " << result << std::endl;
}
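
To illustrate why matching on code points helps, here is a minimal standard-library sketch that decodes UTF-8 into code points and then applies a toy case fold. It is not Boost or ICU code: decode_utf8, toy_fold and equal_folded are hypothetical names, the decoder assumes well-formed input, and the fold only covers ASCII and Latin-1 letters, whereas ICU implements full Unicode case folding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical decoder: turns well-formed UTF-8 into code points.
// No validation is performed; this is a sketch, not production code.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp;
        int extra;  // number of continuation bytes that follow the lead byte
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else                         { cp = b & 0x07; extra = 3; }
        ++i;
        for (int k = 0; k < extra && i < s.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Toy fold: ASCII A-Z and Latin-1 U+00C0..U+00DE (except U+00D7, the
// multiplication sign) map to the letter 0x20 above them.
char32_t toy_fold(char32_t cp) {
    if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
    return cp;
}

// Compare two UTF-8 strings code point by code point after folding.
bool equal_folded(const std::string& a, const std::string& b) {
    const std::u32string x = decode_utf8(a), y = decode_utf8(b);
    if (x.size() != y.size()) return false;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (toy_fold(x[i]) != toy_fold(y[i])) return false;
    }
    return true;
}
```

With this sketch, equal_folded applied to the question's two test strings returns true, because Ö (U+00D6) and ö (U+00F6) are each a single code point once decoded.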