ANTLR4 Lexing C++11 Raw String
All,
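I have been trying to create a C++ grammar from the standard document N4567, which is the latest I could find. I believe the grammar is complete, but I need to test it. One issue I have been trying to resolve is getting the lexer to recognize raw strings as defined in the standard. I have implemented a possible solution using Actions & Semantic Predicates, and I need help determining whether it actually works. I have read the ANTLR4 reference on the interaction between actions and predicates, but I can't decide whether my solution is valid. A stripped-down grammar is included below. Any thoughts would be greatly appreciated. I have tried to include my reasoning in the sample.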
grammar SampleRaw;
@lexer::members {
string d_char_seq = "";
}
string_literal
: ENCODING_PREFIX? '"' S_CHAR* '"'
| ENCODING_PREFIX? 'R' Raw_String
;
ENCODING_PREFIX // one of
: 'u8'
| [uUL]
;
S_CHAR /* any member of the source character set except the
double_quote ", backslash \, or NEW_LINE character
*/
: ~["\\\n\r]
| ESCAPE_SEQUENCE
| UNIV_CHAR_NAME
;
fragment ESCAPE_SEQUENCE
: SIMPLE_ESCAPE_SEQ
| OCT_ESCAPE_SEQ
| HEX_ESCAPE_SEQ
;
fragment SIMPLE_ESCAPE_SEQ // one of
: '\\' '\''
| '\\' '"'
| '\\' '?'
| '\\' '\\'
| '\\' 'a'
| '\\' 'b'
| '\\' 'f'
| '\\' 'n'
| '\\' 'r'
| '\\' 't'
| '\\' 'v'
;
fragment OCT_ESCAPE_SEQ
: '\\' [0-3] ( OCT_DIGIT OCT_DIGIT? )? // octal escapes start with a backslash
| '\\' [4-7] ( OCT_DIGIT )?
;
fragment HEX_ESCAPE_SEQ
: '\\' 'x' HEX_DIGIT+
;
fragment UNIV_CHAR_NAME
: '\\' 'u' HEX_QUAD
| '\\' 'U' HEX_QUAD HEX_QUAD
;
fragment HEX_QUAD
: HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
fragment HEX_DIGIT
: [a-fA-F0-9] // hexadecimal digits only go up to f/F
;
fragment OCT_DIGIT
: [0-7]
;
/*
Raw_String
: '"' D_CHAR* '(' R_CHAR* ')' D_CHAR* '"'
;
*/
Raw_String
: ( /* CASE when D_CHAR is empty
ACTION in D_CHAR_SEQ attempts to reset variable d_char_seq
if it is empty, so handle it statically
*/
'"'
'('
( ~[)] // Anything but )
| [)] ~["] // ) Actually OK, can't be followed by "
// - )" - these are the terminating chars
)*
')'
'"'
| '"'
D_CHAR_SEQ /* Will the ACTION in D_CHAR_SEQ be an issue for
the Semantic Predicates Below????
*/
'('
( ~[)] // Anything but )
| [)] D_CHAR_SEQ { ( getText() != d_char_seq ) }?
/* ) Actually OK, can't be followed by a D_CHAR_SEQ match
IF D_CHAR_SEQs match, turn OFF the Alternative
*/
| [)] D_CHAR_SEQ { ( getText() == d_char_seq ) }? ~["]
/* ) Actually OK, must be followed by a D_CHAR_SEQ match
IF D_CHAR_SEQs match, turn ON the Alternative
Can't match the final " , but
WE HAVE MATCHED OUR TERMINATING CHARS
*/
)*
')'
D_CHAR_SEQ /* No need to check here,
Matching Terminating CHARS is only way to get out
of loop above
*/
'"'
)
{ d_char_seq = ""; } // Reset Variable
;
/*
fragment R_CHAR
// any member of the source character set, except a right
// parenthesis ) followed by the initial D_CHAR*
// (which may be empty) followed by a double quote ".
//
: ~[)]
;
*/
fragment D_CHAR
/* any member of the basic source character set except
space, the left parenthesis (, the right parenthesis ),
the backslash \, and the control characters representing
horizontal tab, vertical tab, form feed, and newline.
*/
: ~[ )(\\\t\u000B\f\n\r] // \u000B is vertical tab; ANTLR4 sets have no \v escape
;
fragment D_CHAR_SEQ
: D_CHAR+ { d_char_seq = ( d_char_seq == "" ) ? getText() : d_char_seq ; }
;
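As a reference point for testing any version of the lexer rule, the termination rule from the standard can be written directly as string handling. The following is only a sketch in plain Java, not part of the grammar above: the class and method names are made up for illustration, and the 16-character delimiter limit is taken from the standard's wording rather than from the grammar. It expects the input position of the opening double quote (just after the R), takes everything up to the first ( as the delimiter, and then looks for the first occurrence of ) + delimiter + ".

```java
final class RawStringMatch {
    // Characters the standard excludes from the delimiter (d-char):
    // space, parentheses, backslash, tab, form feed, newline, CR, vertical tab.
    private static final String NOT_D_CHAR = " ()\\\t\f\n\r" + (char) 0x0B;

    /**
     * Returns the index just past the closing quote of the raw string whose
     * opening quote is at input.charAt(start), or -1 if nothing matches.
     */
    public static int matchRawString(String input, int start) {
        if (start >= input.length() || input.charAt(start) != '"') {
            return -1;
        }
        int open = input.indexOf('(', start + 1);
        if (open < 0) {
            return -1;
        }
        String delimiter = input.substring(start + 1, open);
        if (delimiter.length() > 16) {
            return -1; // delimiter too long
        }
        for (int i = 0; i < delimiter.length(); i++) {
            if (NOT_D_CHAR.indexOf(delimiter.charAt(i)) >= 0) {
                return -1; // character not allowed in a delimiter
            }
        }
        // The body ends at the first ")" + delimiter + "\"" after the "(".
        String terminator = ")" + delimiter + "\"";
        int close = input.indexOf(terminator, open + 1);
        return close < 0 ? -1 : close + terminator.length();
    }
}
```

Comparing the token boundaries produced by a generated lexer against a helper like this on a handful of tricky inputs (empty delimiter, a )" inside the body, an unterminated string) is one way to decide whether a predicate-based rule behaves as intended.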
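I managed to crack this myself; any comments or possible improvements would be greatly appreciated. If this can be done without actions, that would also be good to know.
One drawback is that the \" and the D_CHAR_SEQ are part of the Raw_String text passed to the parser. The parser can strip them out, but it would be nice if the lexer did this.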
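Wherever the stripping ends up happening, the string handling itself amounts to a few index lookups. A minimal sketch in plain Java, assuming the token text has the shape R"delim(body)delim" with any encoding prefix already removed; the class and method names are hypothetical and not part of ANTLR:

```java
final class RawStringText {
    /**
     * Splits raw-string token text into { delimiter, body }.
     * Assumes the text looks like R"delim(body)delim".
     */
    public static String[] split(String tokenText) {
        int quote = tokenText.indexOf('"');            // opening quote
        int open = tokenText.indexOf('(', quote + 1);  // first ( ends the delimiter
        int close = tokenText.lastIndexOf(')');        // last ) starts the terminator
        if (quote < 0 || open < 0 || close < open) {
            throw new IllegalArgumentException("not a raw string: " + tokenText);
        }
        String delimiter = tokenText.substring(quote + 1, open);
        String body = tokenText.substring(open + 1, close);
        return new String[] { delimiter, body };
    }
}
```

Using lastIndexOf for the closing parenthesis relies on the delimiter never containing a ), which the definition of D_CHAR guarantees.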
grammar SampleRaw;
Reg_String
: '"' S_CHAR* '"'
;
fragment S_CHAR
/* any member of the source character set except the
double_quote ", backslash \, or NEW_LINE character
*/
: ~[\n\r"\\]
| ESCAPE_SEQUENCE
| UNIV_CHAR_NAME
;
fragment ESCAPE_SEQUENCE
: SIMPLE_ESCAPE_SEQ
| OCT_ESCAPE_SEQ
| HEX_ESCAPE_SEQ
;
fragment SIMPLE_ESCAPE_SEQ // one of
: '\\' '\''
| '\\' '"'
| '\\' '?'
| '\\' '\\'
| '\\' 'a'
| '\\' 'b'
| '\\' 'f'
| '\\' 'n'
| '\\' 'r'
| '\\' 't'
| '\\' 'v'
;
fragment OCT_ESCAPE_SEQ
: '\\' [0-3] ( OCT_DIGIT OCT_DIGIT? )? // octal escapes start with a backslash
| '\\' [4-7] ( OCT_DIGIT )?
;
fragment OCT_DIGIT
: [0-7]
;
fragment HEX_ESCAPE_SEQ
: '\\' 'x' HEX_DIGIT+
;
fragment HEX_DIGIT
: [a-fA-F0-9] // hexadecimal digits only go up to f/F
;
fragment UNIV_CHAR_NAME
: '\\' 'u' HEX_QUAD
| '\\' 'U' HEX_QUAD HEX_QUAD
;
fragment HEX_QUAD
: HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
Raw_String
: 'R'
'"' // Match Opening Double Quote
( /* Handle Empty D_CHAR_SEQ without Predicates
This should also work
'(' .*? ')'
*/
'(' ( ~')' | ')'+ ~'"' )* (')'+)
| D_CHAR_SEQ
/* // Limit D_CHAR_SEQ to 16 characters
{ ( ( getText().length() - ( getText().indexOf("\"") + 1 ) ) <= 16 ) }?
*/
'('
/* From Spec :
Any member of the source character set, except
a right parenthesis ) followed by the initial D_CHAR_SEQUENCE
( which may be empty ) followed by a double quote ".
- The following loop consumes characters until it matches the
terminating sequence of characters for the RAW STRING
- The options are mutually exclusive, so Only one will
ever execute in each loop pass
- Each Option will execute at least once. The first option needs to
match the ')' character even if the D_CHAR_SEQ is empty. The second
option needs to match the closing " to fall out of the loop. Each
option will only consume at most 1 character
*/
( // Consume everything but the Double Quote
~'"'
| // If text Does Not End with closing Delimiter, consume the Double Quote
'"'
{
!getText().endsWith(
")"
+ getText().substring( getText().indexOf( "\"" ) + 1
, getText().indexOf( "(" )
)
+ '"'
)
}?
)*
)
'"' // Match Closing Double Quote
/*
// Strip Away R"D_CHAR_SEQ(...)D_CHAR_SEQ"
// Send D_CHAR_SEQ <TAB> ... to Parser
{
setText( getText().substring( getText().indexOf("\"") + 1
, getText().indexOf("(")
)
+ "\t"
+ getText().substring( getText().indexOf("(") + 1
, getText().lastIndexOf(")")
)
);
}
*/
;
fragment D_CHAR_SEQ // Should be limited to 16 characters
: D_CHAR+
;
fragment D_CHAR
/* Any member of the basic source character set except
space, the left parenthesis (, the right parenthesis ),
the backslash \, and the control characters representing
horizontal tab, vertical tab, form feed, and newline.
*/
: '\u0021'..'\u0023'
| '\u0025'..'\u0027'
| '\u002a'..'\u003f'
| '\u0041'..'\u005b'
| '\u005d'..'\u005f'
| '\u0061'..'\u007e'
;
ENCODING_PREFIX // one of
: 'u8'
| [uUL]
;
WhiteSpace
: [\u0000-\u0020\u007f]+ -> skip
;
start
: string_literal* EOF
;
string_literal
: ENCODING_PREFIX? Reg_String
| ENCODING_PREFIX? Raw_String
;
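To check the string logic of the predicate in Raw_String without generating a lexer, the same expression can be exercised in plain Java. This is only a sketch: it assumes getText() returns everything consumed since the start of the token, including the double quote that was just matched, and the names RawStringPredicate, keepGoing and scan are hypothetical.

```java
final class RawStringPredicate {
    /**
     * Same expression as the predicate: true while the text consumed so far
     * does NOT yet end with ")" + delimiter + "\"".
     */
    public static boolean keepGoing(String text) {
        String delimiter =
            text.substring(text.indexOf("\"") + 1, text.indexOf("("));
        return !text.endsWith(")" + delimiter + '"');
    }

    /**
     * Simulates the loop for input that starts with R": returns the length
     * of the raw-string token, or -1 if it is never terminated.
     */
    public static int scan(String input) {
        int open = input.indexOf('(');
        if (!input.startsWith("R\"") || open < 0) {
            return -1;
        }
        for (int i = open + 1; i < input.length(); i++) {
            // Only a double quote can end the token; everything else is consumed.
            if (input.charAt(i) == '"' && !keepGoing(input.substring(0, i + 1))) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

For example, scan applied to R"xy(a)"b)xy" followed by other text stops after the second )xy", because the first double quote inside the body is preceded by ) alone rather than by )xy.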