ANTLR4 Lexing C++11 Raw String

Keywords: raw string, C++11, lexing, ANTLR4. Updated: 2023-10-16

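All,

I have been trying to create a C++ grammar from the standard document N4567, which is the latest I could find. I believe the grammar is complete, but I need to test it. One issue I have been trying to resolve is getting the lexer to recognize raw strings as defined in the standard. I have implemented a possible solution using Actions & Semantic Predicates, and I need help determining whether it actually works. I have read the ANTLR4 reference on the interaction between actions and predicates, but I can't decide whether my solution is valid. A stripped-down grammar is included below. Any thoughts would be appreciated. I have tried to include my reasoning in the sample as comments.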

grammar SampleRaw;
@lexer::members {
    string d_char_seq = "";
}
string_literal
        : ENCODING_PREFIX? '"' S_CHAR* '"'
        | ENCODING_PREFIX? 'R' Raw_String
        ;
ENCODING_PREFIX             //  one of
        : 'u8'
        | [uUL]
        ;
S_CHAR          /* any member of the source character set except the
                   double_quote ", backslash \, or NEW_LINE character
                 */
        : ~["\\\n\r]
        | ESCAPE_SEQUENCE
        | UNIV_CHAR_NAME
        ;
fragment ESCAPE_SEQUENCE
        : SIMPLE_ESCAPE_SEQ
        | OCT_ESCAPE_SEQ
        | HEX_ESCAPE_SEQ
        ;
fragment SIMPLE_ESCAPE_SEQ  // one of
        : '\\' '\''
        | '\\' '"'
        | '\\' '?'
        | '\\' '\\'
        | '\\' 'a'
        | '\\' 'b'
        | '\\' 'f'
        | '\\' 'n'
        | '\\' 'r'
        | '\\' 't'
        | '\\' 'v'
        ;
fragment OCT_ESCAPE_SEQ
        : '\\' [0-3] ( OCT_DIGIT OCT_DIGIT? )?
        | '\\' [4-7] ( OCT_DIGIT )?
        ;
fragment HEX_ESCAPE_SEQ
        : '\\' 'x' HEX_DIGIT+
        ;
fragment UNIV_CHAR_NAME
        : '\\' 'u' HEX_QUAD
        | '\\' 'U' HEX_QUAD HEX_QUAD
        ;
fragment HEX_QUAD
        : HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
        ;
fragment HEX_DIGIT
        : [a-fA-F0-9]
        ;
fragment OCT_DIGIT
        : [0-7]
        ;
/*
Raw_String
        : '"' D_CHAR* '(' R_CHAR* ')' D_CHAR* '"'
        ;
 */
Raw_String
        : ( /* CASE when D_CHAR is empty
               ACTION in D_CHAR_SEQ attempts to reset variable d_char_seq
               if it is empty, so handle it statically
             */
            '"' 
                '('
                    ( ~[)]       // Anything but )
                    | [)] ~["]  // ) Actually OK, can't be followed by "
                                 //  - )" - these are the terminating chars
                    )* 
                ')' 
            '"'
          | '"'
                D_CHAR_SEQ  /* Will the ACTION in D_CHAR_SEQ be an issue for
                               the Semantic Predicates Below????
                             */
                    '('
                        ( ~[)]  // Anything but )
                        | [)] D_CHAR_SEQ { ( getText() !=  d_char_seq ) }?
                                /* ) Actually OK, can't be followed by a D_CHAR_SEQ match
                                   IF D_CHAR_SEQs match, turn OFF the Alternative
                                 */
                        | [)] D_CHAR_SEQ { ( getText() ==  d_char_seq ) }? ~["]
                                /* ) Actually OK, must be followed by a D_CHAR_SEQ match
                                     IF D_CHAR_SEQs match, turn ON the Alternative
                                     Can't match the final " , but
                                     WE HAVE MATCHED OUR TERMINATING CHARS
                                 */
                        )*
                    ')'
                D_CHAR_SEQ /* No need to check here,
                              Matching Terminating CHARS is only way to get out 
                              of loop above
                            */
            '"'
          )
          { d_char_seq = ""; } // Reset Variable
        ;
/*
fragment R_CHAR
                // any member of the source character set, except a right
                // parenthesis ) followed by the initial D_CHAR*
                // (which may be empty) followed by a double quote ".
                // 
        : ~[)]
        ;
 */
fragment D_CHAR
                /* any member of the basic source character set except
                   space, the left parenthesis (, the right parenthesis ),
                   the backslash \, and the control characters representing
                    horizontal tab, vertical tab, form feed, and newline.
                 */
        : ~[ )(\\\t\u000B\f\n\r]
        ;
fragment D_CHAR_SEQ
        : D_CHAR+ { d_char_seq = ( d_char_seq == "" ) ? getText() : d_char_seq ; }
        ;

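I managed to crack this one myself. Any comments or possible improvements would be greatly appreciated. If this can be done without actions, that would also be good to know.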

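One drawback is that the " characters and the D_CHAR_SEQ remain part of the Raw_String text that is passed to the parser. The parser can strip them out, but it would be nicer if the lexer did it.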
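
If the stripping ends up on the parser side, it could look roughly like the plain Java sketch below. This is only an illustration: the class RawStringParts and its split method are hypothetical names, not part of ANTLR4, and the code assumes the token text already has the shape R"delim(body)delim", optionally with an encoding prefix in front.

```java
final class RawStringParts {
    /**
     * Hypothetical helper: splits the text of a raw string token into
     * { delimiter, body }. Assumes the lexer has already validated the token.
     */
    static String[] split(String tokenText) {
        int quote = tokenText.indexOf('"');        // opening quote, after any prefix and the R
        int open = tokenText.indexOf('(', quote);  // the first '(' ends the delimiter
        int close = tokenText.lastIndexOf(')');    // the last ')' starts the closing delimiter
        if (quote < 0 || open < 0 || close < open) {
            throw new IllegalArgumentException("not a raw string: " + tokenText);
        }
        String delimiter = tokenText.substring(quote + 1, open);
        String body = tokenText.substring(open + 1, close);
        return new String[] { delimiter, body };
    }
}
```

Because the delimiter can contain neither ( nor ), the first ( after the opening quote and the last ) in the token are safe anchors, even when the body itself contains )" sequences.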

grammar SampleRaw;
Reg_String
    : '"' S_CHAR* '"'
    ;
fragment S_CHAR
        /* any member of the source character set except the
           double_quote ", backslash \, or NEW_LINE character
         */
    : ~[\n\r"\\]
    | ESCAPE_SEQUENCE
    | UNIV_CHAR_NAME
    ;
fragment ESCAPE_SEQUENCE
    : SIMPLE_ESCAPE_SEQ
    | OCT_ESCAPE_SEQ
    | HEX_ESCAPE_SEQ
    ;
fragment SIMPLE_ESCAPE_SEQ  // one of
    : '\\' '\''
    | '\\' '"'
    | '\\' '?'
    | '\\' '\\'
    | '\\' 'a'
    | '\\' 'b'
    | '\\' 'f'
    | '\\' 'n'
    | '\\' 'r'
    | '\\' 't'
    | '\\' 'v'
    ;
fragment OCT_ESCAPE_SEQ
    : '\\' [0-3] ( OCT_DIGIT OCT_DIGIT? )?
    | '\\' [4-7] ( OCT_DIGIT )?
    ;
fragment OCT_DIGIT
    : [0-7]
    ;
fragment HEX_ESCAPE_SEQ
    : '\\' 'x' HEX_DIGIT+
    ;
fragment HEX_DIGIT
    : [a-fA-F0-9]
    ;
fragment UNIV_CHAR_NAME
    : '\\' 'u' HEX_QUAD
    | '\\' 'U' HEX_QUAD HEX_QUAD
    ;
fragment HEX_QUAD
    : HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;
Raw_String
    : 'R'
      '"'              // Match Opening Double Quote
      ( /* Handle Empty D_CHAR_SEQ without Predicates
           This should also work
           '(' .*? ')'
         */
        '(' ( ~')' | ')'+ ~'"' )* (')'+)
      | D_CHAR_SEQ
            /*  // Limit D_CHAR_SEQ to 16 characters
               { ( ( getText().length() - ( getText().indexOf("\"") + 1 ) ) <= 16 ) }?
            */
        '('
        /* From Spec :
           Any member of the source character set, except
           a right parenthesis ) followed by the initial D_CHAR_SEQUENCE
           ( which may be empty ) followed by a double quote ".
         - The following loop consumes characters until it matches the
           terminating sequence of characters for the RAW STRING
         - The options are mutually exclusive, so Only one will
           ever execute in each loop pass
         - Each Option will execute at least once.  The first option needs to
           match the ')' character even if the D_CHAR_SEQ is empty. The second
           option needs to match the closing " to fall out of the loop. Each
           option will only consume at most 1 character
         */
        (   //  Consume everything but the Double Quote
          ~'"'
        |   //  If text Does Not End with closing Delimiter, consume the Double Quote
          '"'
          {
               !getText().endsWith(
                    ")"
                  + getText().substring( getText().indexOf( "\"" ) + 1
                                       , getText().indexOf( "(" )
                                       )
                  + '"'
                )
          }?
        )*
      )
      '"'              // Match Closing Double Quote
      /*
      // Strip Away R"D_CHAR_SEQ(...)D_CHAR_SEQ"
      //  Send D_CHAR_SEQ <TAB> ... to Parser
      {
        setText( getText().substring( getText().indexOf("\"") + 1
                                    , getText().indexOf("(")
                                    )
               + "\t"
               + getText().substring( getText().indexOf("(") + 1
                                    , getText().lastIndexOf(")")
                                    )
               );
      }
       */
    ;
fragment D_CHAR_SEQ     // Should be limited to 16 characters
    : D_CHAR+
    ;
fragment D_CHAR
        /*  Any member of the basic source character set except
            space, the left parenthesis (, the right parenthesis ),
            the backslash \, and the control characters representing
            horizontal tab, vertical tab, form feed, and newline.
         */
    : '\u0021'..'\u0023'
    | '\u0025'..'\u0027'
    | '\u002a'..'\u003f'
    | '\u0041'..'\u005b'
    | '\u005d'..'\u005f'
    | '\u0061'..'\u007e'
    ;
ENCODING_PREFIX         //  one of
    : 'u8'
    | [uUL]
    ;
WhiteSpace
    : [ \u0000-\u0020\u007f]+ -> skip
    ;
start
    : string_literal* EOF
    ;
string_literal
    : ENCODING_PREFIX? Reg_String
    | ENCODING_PREFIX? Raw_String
    ;