How to extract the string pattern in C++ efficiently?

Keywords: extract string pattern efficiently, C++. Updated: 2023-10-16

I have a pattern in the following format:

AUTHOR, "TITLE" (PAGES pp.) [CODE STATUS]

For example, given the string

P.G. Wodehouse, "Heavy Weather" (336 pp.) [PH.409 AVAILABLE FOR LENDING]

I want to extract

AUTHOR = P.G. Wodehouse
TITLE = Heavy Weather
PAGES = 336
CODE = PH.409
STATUS = AVAILABLE FOR LENDING

I only know how to do this in Python. Is there an efficient way to do the same thing in C++?

Exactly the same way as in Python. C++11 has regular expressions (and for earlier C++, there is Boost regex). As for the read loop:

std::string line;
while ( std::getline( file, line ) ) {
    //  ...
}

is almost exactly the same as

for line in file:
    #    ...

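The only differences are:

The C++ version does not put the trailing '\n' into the buffer. (In general, the C++ version may be less flexible with regard to end-of-line handling.)
In case of a read error, the C++ version breaks out of the loop; the Python version raises an exception.

Neither should be an issue in your case.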
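
As a small illustration of the first difference, the sketch below collects the lines of a stream into a vector. The helper name `readLines` is hypothetical, and any `std::istream` (a file or an in-memory `std::istringstream`) can stand in for `file`.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

//  Hypothetical helper: read every line of a stream into a vector.
//  std::getline consumes the '\n' delimiter but does not store it,
//  so none of the collected strings ends with a newline.
std::vector<std::string> readLines( std::istream& source )
{
    std::vector<std::string> lines;
    std::string line;
    while ( std::getline( source, line ) ) {
        lines.push_back( line );
    }
    return lines;
}
```

Called on a stream containing two newline-terminated lines, this yields two strings without the terminators.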

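EDIT:

It just occurred to me that while the regular expressions in C++ and in Python are very similar, the syntax for using them is not quite the same. So:

In C++, you would normally declare an instance of the regular expression before using it; something like Python's re.match( r'...', line ) is theoretically possible, but not very idiomatic (and it would still involve explicitly constructing a regular expression object in the expression). Also, the match function simply returns a boolean; if you want the captures, you need to define a separate object for them. Typical use would probably be something like: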

static std::regex const matcher( "the regular expression" );
std::smatch forCaptures;
if ( std::regex_match( line, forCaptures, matcher ) ) {
    std::string firstCapture = forCaptures[1];
    //  ...
}

This corresponds to the Python:

m = re.match( 'the regular expression', line )
if m:
    firstCapture = m.group(1)
    #   ...
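
One point worth keeping in mind when porting: Python's `re.match` only anchors the pattern at the start of the string, whereas `std::regex_match` succeeds only if the entire string matches; `std::regex_search` looks for a match anywhere in the string. A minimal sketch of the difference, using a made-up pattern for the page count (both function names are hypothetical):

```cpp
#include <regex>
#include <string>

//  Returns the page count if the *entire* text has the form "(N pp.)", otherwise -1.
int pagesIfWholeMatch( std::string const& text )
{
    static std::regex const matcher( R"re(\((\d+) pp\.\))re" );
    std::smatch capture;
    if ( std::regex_match( text, capture, matcher ) ) {
        return std::stoi( capture[1].str() );
    }
    return -1;
}

//  Returns the page count of the first "(N pp.)" found anywhere in the text, otherwise -1.
int pagesIfFound( std::string const& text )
{
    static std::regex const matcher( R"re(\((\d+) pp\.\))re" );
    std::smatch capture;
    if ( std::regex_search( text, capture, matcher ) ) {
        return std::stoi( capture[1].str() );
    }
    return -1;
}
```

Since the lines in the question are matched in full, `std::regex_match` is the natural choice here.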

EDIT:

Another answer suggests overloading operator>>; I totally agree. Just out of curiosity, I gave it a go; something like the following works well:

struct Book
{
    std::string author;
    std::string title;
    int         pages;
    std::string code;
    std::string status;
};
std::istream&
operator>>( std::istream& source, Book& dest )
{
    std::string line;
    std::getline( source, line );
    if ( source )
    {
        static std::regex const matcher(
            R"^(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s*([^\]]*)\])^"
            );
        std::smatch capture;
        if ( ! std::regex_match( line, capture, matcher ) ) {
            source.setstate( std::ios_base::failbit );
        } else {
            dest.author = capture[1];
            dest.title  = capture[2];
            dest.pages  = std::stoi( capture[3] );
            dest.code   = capture[4];
            dest.status = capture[5];
        }
    }
    return source;
}

Once you have done this, you can write things like:

std::vector<Book> v( (std::istream_iterator<Book>( inputFile )),
                     (std::istream_iterator<Book>()) );

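and load an entire file in the initialization of a vector.

Note the error handling in the operator>>. If a line is malformed, we set failbit; this is the standard convention in C++.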
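
To make the failbit convention concrete, here is a self-contained sketch that repeats the struct and operator from above (with the headers they need) and adds a hypothetical helper, `loadBooks`, wrapping the vector initialization. Because a malformed line sets failbit, the `std::istream_iterator` compares equal to the end iterator at that point, so loading stops at the first bad line.

```cpp
#include <istream>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

struct Book
{
    std::string author;
    std::string title;
    int         pages;
    std::string code;
    std::string status;
};

std::istream&
operator>>( std::istream& source, Book& dest )
{
    std::string line;
    std::getline( source, line );
    if ( source )
    {
        static std::regex const matcher(
            R"^(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s*([^\]]*)\])^"
            );
        std::smatch capture;
        if ( ! std::regex_match( line, capture, matcher ) ) {
            source.setstate( std::ios_base::failbit );  //  Malformed line.
        } else {
            dest.author = capture[1].str();
            dest.title  = capture[2].str();
            dest.pages  = std::stoi( capture[3].str() );
            dest.code   = capture[4].str();
            dest.status = capture[5].str();
        }
    }
    return source;
}

//  Hypothetical helper: reads books until end of input or the first malformed line.
std::vector<Book> loadBooks( std::istream& source )
{
    return std::vector<Book>( (std::istream_iterator<Book>( source )),
                              (std::istream_iterator<Book>()) );
}
```

After the call, the caller can tell the two outcomes apart by checking whether the stream reached end of file: `source.eof()` is false when loading stopped at a malformed line.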

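EDIT:

Since there has been so much discussion: the above is fine for small, one-off programs, such as school projects, or throwaway programs that read the current file, output it in a new format, and are then discarded. In production code, I would insist on support for comments and empty lines; on continuing after an error, in order to report multiple errors (with line numbers); and probably on continuation lines (since titles can get long enough to become unwieldy). It is not practical to do this with operator>>, if for no other reason than the need to output the line numbers, so I would use a parser along the following lines: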

int
getContinuationLines( std::istream& source, std::string& line )
{
    int results = 0;
    while ( source.peek() == '&' ) {
        std::string more;
        std::getline( source, more );   //  Cannot fail, because of peek
        more[0] = ' ';
        line += more;
        ++ results;
    }
    return results;
}
void
trimComment( std::string& line )
{
    char quoted = '\0';
    std::string::iterator position = line.begin();
    //  Scan until the first '#' that is not inside a quoted string.
    while ( position != line.end() && (quoted != '\0' || *position != '#') ) {
        if ( *position == '\\' && std::next( position ) != line.end() ) {
            ++ position;
        } else if ( *position == quoted ) {
            quoted = '\0';
        } else if ( *position == '"' || *position == '\'' ) {
            quoted = *position;
        }
        ++ position;
    }
    line.erase( position, line.end() );
}
bool
isEmpty( std::string const& line )
{
    return std::all_of(
        line.begin(),
        line.end(),
        []( unsigned char ch ) { return isspace( ch ); } );
}
std::vector<Book>
parseFile( std::istream& source )
{
    std::vector<Book> results;
    int lineNumber = 0;
    std::string line;
    bool errorSeen = false;
    while ( std::getline( source, line ) ) {
        ++ lineNumber;
        int extraLines = getContinuationLines( source, line );
        trimComment( line );
        if ( ! isEmpty( line ) ) {
            static std::regex const matcher(
                R"^(([^,]*),\s*"([^"]*)"\s*\((\d+) pp\.\)\s*\[(\S+)\s*([^\]]*)\])^"
                );
            std::smatch capture;
            if ( ! std::regex_match( line, capture, matcher ) ) {
                std::cerr << "Format error, line " << lineNumber << std::endl;
                errorSeen = true;
            } else {
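                //  Book is an aggregate, so brace-initialize it; emplace_back
                //  with these arguments only compiles from C++20 on.
                results.push_back( Book{
                    capture[1],
                    capture[2],
                    std::stoi( capture[3] ),
                    capture[4],
                    capture[5] } );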
            }
        }
        lineNumber += extraLines;
    }
    if ( errorSeen ) {
        results.clear();    //  Or more likely, throw some sort of exception.
    }
    return results;
}

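The real problem here is how to report the error to the caller; I suspect that in most cases an exception would be appropriate, but depending on the use case, other alternatives may be valid as well. In this example, I just return an empty vector. (The interaction between comments and continuation lines probably needs to be better defined as well, with the code modified according to how it is defined.)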
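
For the exception route, one possible shape (an assumption, not part of the code above) is an exception type that carries every offending line number, so the caller still gets the full list of errors. The names `FormatError` and `throwIfErrors` are hypothetical.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

//  Hypothetical exception type: carries the numbers of all malformed lines.
class FormatError : public std::runtime_error
{
public:
    explicit FormatError( std::vector<int> const& lines )
        : std::runtime_error( buildMessage( lines ) )
        , myLines( lines )
    {
    }
    std::vector<int> const& lines() const { return myLines; }

private:
    static std::string buildMessage( std::vector<int> const& lines )
    {
        std::string message = "Format error on line(s):";
        for ( int line : lines ) {
            message += ' ' + std::to_string( line );
        }
        return message;
    }
    std::vector<int> myLines;
};

//  Call once the whole file has been read: throws only if some line failed.
void throwIfErrors( std::vector<int> const& badLines )
{
    if ( ! badLines.empty() ) {
        throw FormatError( badLines );
    }
}
```

In `parseFile`, the `errorSeen` flag would then become a `std::vector<int>` of line numbers, and the final `results.clear()` would be replaced by a call to `throwIfErrors`.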

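Your input string is well delimited, so I would recommend using an extraction operator rather than a regex, for speed and for ease of use.

You would first need to create a struct for your books: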

struct book{
    string author;
    string title;
    int pages;
    string code;
    string status;
};

Then you would need to write the actual extraction operator:

istream& operator>>(istream& lhs, book& rhs){
    lhs >> ws;
    getline(lhs, rhs.author, ',');
    lhs.ignore(numeric_limits<streamsize>::max(), '"');
    getline(lhs, rhs.title, '"');
    lhs.ignore(numeric_limits<streamsize>::max(), '(');
    lhs >> rhs.pages;
    lhs.ignore(numeric_limits<streamsize>::max(), '[');
    lhs >> rhs.code >> ws;
    getline(lhs, rhs.status, ']');
    return lhs;
}

This gives you a tremendous amount of power. For example, you can extract all the books in an istream into a vector like this:

istringstream foo("P.G. Wodehouse, \"Heavy Weather\" (336 pp.) [PH.409 AVAILABLE FOR LENDING]\nJohn Bunyan, \"The Pilgrim's Progress\" (336 pp.) [E.1173 CHECKED OUT]");
vector<book> bar{ istream_iterator<book>(foo), istream_iterator<book>() };
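
One thing to be aware of with this approach: the operator only notices malformed input when one of the individual extractions fails, for example when the page count is not a number. A self-contained sketch of checking the stream state after extraction follows; the struct and operator are repeated from above with explicit `std::` qualification, and the helper `tryParseBook` is hypothetical.

```cpp
#include <istream>
#include <limits>
#include <sstream>
#include <string>

struct book{
    std::string author;
    std::string title;
    int pages;
    std::string code;
    std::string status;
};

std::istream& operator>>(std::istream& lhs, book& rhs){
    const std::streamsize all = std::numeric_limits<std::streamsize>::max();
    lhs >> std::ws;
    std::getline(lhs, rhs.author, ',');
    lhs.ignore(all, '"');
    std::getline(lhs, rhs.title, '"');
    lhs.ignore(all, '(');
    lhs >> rhs.pages;               // sets failbit if no number follows '('
    lhs.ignore(all, '[');
    lhs >> rhs.code >> std::ws;
    std::getline(lhs, rhs.status, ']');
    return lhs;
}

// Hypothetical helper: parse one line, returning false if any extraction failed.
bool tryParseBook(const std::string& line, book& result){
    std::istringstream source(line);
    return static_cast<bool>(source >> result);
}
```

A line such as `P.G. Wodehouse, "Heavy Weather" (many pp.) [...]` is rejected because `lhs >> rhs.pages` fails, while the example line from the question parses into the five expected fields.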

Use flex (it generates C or C++ code, to be used as part of a program or as the full program)

%%
^[^,]+/,          {printf("Author: %s\n",yytext  );}
\"[^"]+\"         {printf("Title: %s\n",yytext  );}
\([^ ]+/[ ]pp\.   {printf("Pages: %s\n",yytext+1);}
..................
.|\n              {}
%%

(untested)

Here is the code:

#include <iostream>
#include <string>
using namespace std;
string extract (string a)
{
    string str = "AUTHOR = "; //the result string
    int i = 0;
    while (a[i] != ',')
        str += a[i++];
    while (a[i++] != '"');
    str += "\nTITLE = ";
    while (a[i] != '"')
        str += a[i++];
    while (a[i++] != '(');
    str += "\nPAGES = ";
    while (a[i] != ' ')
        str += a[i++];
    while (a[i++] != '[');
    str += "\nCODE = ";
    while (a[i] != ' ')
        str += a[i++];
    while (a[i++] == ' ');
    str += "\nSTATUS = ";
    while (a[i] != ']')
        str += a[i++];
    return str;
}
int main ()
{
    string a;
    getline (cin, a);
    cout << extract (a) << endl;
    return 0;
}
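
Note that the loops above assume every delimiter is present; on a malformed line they would run past the end of the string. A bounds-checked variant of the same idea can be sketched with `std::string::find` (the helper name `between` is hypothetical):

```cpp
#include <string>

// Hypothetical helper: copies into `out` the text between the first `open`
// found at or after `pos` and the next `close` after it, then moves `pos`
// just past that `close`. Returns false, leaving `out` and `pos` untouched,
// if either delimiter is missing.
bool between (const std::string& a, char open, char close,
              std::string::size_type& pos, std::string& out)
{
    std::string::size_type first = a.find (open, pos);
    if (first == std::string::npos)
        return false;
    std::string::size_type last = a.find (close, first + 1);
    if (last == std::string::npos)
        return false;
    out = a.substr (first + 1, last - first - 1);
    pos = last + 1;
    return true;
}
```

Calling it in sequence with the delimiter pairs `'"'`/`'"'`, `'('`/`' '` and `'['`/`']'` walks through the title, the page count and the bracketed part of a line, and a false return signals a malformed line instead of undefined behaviour.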

Happy coding :)