Version: 6.9.8 (2022/04/11)
Syntax: ONIG_SYNTAX_ONIGURUMA (default)
This document is a reference for Oniguruma regular expression syntax, version 6.9.8. It is intended as a quick guide for AI language models and developers working with Oniguruma regexes.
Oniguruma regular expressions use the following fundamental syntax elements:
- \ - Escape character: disables the special meaning of meta characters so that they are treated as literal characters.
- | - Alternation: matches either the expression before or the expression after the | operator.
- (...) - Grouping: creates a group to apply quantifiers or capture substrings.
- [...] - Character class: defines a set of characters to match.
Special character escapes for common characters and code point representations.
- \t - Horizontal tab (0x09)
- \v - Vertical tab (0x0B)
- \n - Newline / line feed (0x0A)
- \r - Carriage return (0x0D)
- \b - Backspace (0x08). Note: effective as backspace only within character classes [...]. Outside, it represents a word boundary.
- \f - Form feed (0x0C)
- \a - Bell (0x07)
- \e - Escape (0x1B)
- \nnn - Octal character: nnn represents 1 to 3 octal digits. Interpreted as an encoded byte value.
- \xHH - Hexadecimal character: HH represents 2 hexadecimal digits. Interpreted as an encoded byte value.
- \x{7HHHHHHH} - Hexadecimal character (Unicode code point): 7HHHHHHH represents 1 to 8 hexadecimal digits. Interpreted as a Unicode code point value.
- \o{17777777777} - Octal character (Unicode code point): 17777777777 represents 1 to 11 octal digits. Interpreted as a Unicode code point value.
- \uHHHH - Hexadecimal character (Unicode code point): HHHH represents 4 hexadecimal digits. Interpreted as a Unicode code point value.
- \cx / \C-x - Control character: x is a character. Interpreted as a Unicode control character code point value.
- \M-x - Meta character: represents (x | 0x80). Interpreted as a Unicode code point value.
- \M-\C-x - Meta control character: represents a combination of meta and control characters. Interpreted as a Unicode code point value.
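A few of the escapes above can be tried from Ruby. This is only a sketch: it assumes Ruby's built-in Regexp (backed by Onigmo, a fork of Oniguruma) treats these particular escapes the same way, and the constant and method names are made up for illustration.

```ruby
# Assumption: Ruby's Regexp (Onigmo) behaves like Oniguruma for these escapes.
HEX_BYTE   = /\x41/     # encoded byte value 0x41, i.e. "A"
CODE_POINT = /\u3042/   # Unicode code point U+3042 (HIRAGANA LETTER A)
BACKSPACE  = /[\b]/     # \b inside a character class: backspace (0x08)
WORD_EDGE  = /\bcat\b/  # \b outside a character class: word boundary

# Hypothetical helper that collects the results in one hash.
def escape_demo
  {
    hex_byte:   HEX_BYTE.match?("A"),
    code_point: CODE_POINT.match?("\u3042"),
    backspace:  BACKSPACE.match?("a\bb"),
    word_edge:  WORD_EDGE.match?("a cat sat"),
    no_edge:    WORD_EDGE.match?("concatenate")
  }
end
```

The last entry is false because "cat" inside "concatenate" is surrounded by word characters on both sides.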
Representing sequences of Unicode code points.
- Hexadecimal code point sequences:
\x{7HHHHHHH 7HHHHHHH ... 7HHHHHHH}- Space-separated hexadecimal code points (1-8 digits each).
- Octal code point sequences:
\o{17777777777 17777777777 ... 17777777777}- Space-separated octal code points (1-11 digits each).
Predefined character classes for common character sets.
- . - Any character (except newline by default).
- \w - Word character:
  - Not Unicode (default): Alphanumeric characters, underscore (_), and multibyte characters.
  - Unicode: Characters belonging to the General Category: Letter, Mark, Number, or Connector_Punctuation.
- \W - Non-word character: the negation of \w.
- \s - Whitespace character:
  - Not Unicode (default): \t, \n, \v, \f, \r, \x20 (space).
  - Unicode: U+0009 (Tab), U+000A (Line Feed), U+000B (Vertical Tab), U+000C (Form Feed), U+000D (Carriage Return), U+0085 (NEL - Next Line), plus characters from General Category: Line_Separator, Paragraph_Separator, Space_Separator.
- \S - Non-whitespace character: the negation of \s.
- \d - Decimal digit character:
  - Unicode: Characters from General Category: Decimal_Number.
- \D - Non-decimal-digit character: the negation of \d.
- \h - Hexadecimal digit character: equivalent to [0-9a-fA-F].
- \H - Non-hexadecimal digit character: the negation of \h.
- \R - General newline:
  - Cannot be used in character classes [...].
  - Not Unicode: matches \r\n or \n, \v, \f, \r.
  - Unicode: \r\n or \n, \v, \f, \r, or U+0085, U+2028, U+2029.
  - Note: Does not backtrack from \r\n to \r.
- \N - Negative newline: equivalent to (?-m:.), i.e. any character except newline, regardless of the m option.
- \O - True any character: equivalent to (?m:.), i.e. any character including newline, regardless of the m option. This is an Oniguruma original extension.
- \X - Text Segment:
  - Equivalent to (?>\O(?:\Y\O)*).
  - Meaning depends on the Text Segment mode option (?y{..}).
  - Does not check for a boundary at the start of the match. Use \y\X to ensure boundary matching.
  - Extended Grapheme Cluster mode (default):
    - Unicode: Follows Unicode Standard Annex #29.
    - Not Unicode: \X === (?>\r\n|\O).
  - Word mode:
    - Currently supported in Unicode only.
    - Follows Unicode Standard Annex #29.
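The character types \h, \R and \X can be explored with a short Ruby sketch. It assumes Ruby's Regexp (Onigmo) follows the descriptions above for UTF-8 strings; the method name is hypothetical.

```ruby
# Assumption: Ruby's Regexp (Onigmo) implements \h, \R and \X as described above.
def char_type_demo
  {
    # \h is [0-9a-fA-F], so the "x" splits the input into two runs.
    hex_runs: "0x1F".scan(/\h+/),
    # \R takes "\r\n" as a single newline and does not split it.
    newlines: "a\r\nb\nc".scan(/\R/),
    # "e" followed by U+0301 (combining acute accent) forms one text segment.
    clusters: "e\u0301x".scan(/\X/).length
  }
end
```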
Using Unicode character properties for more specific character class matching.
- \p{property-name} - Matches characters with the specified property.
- \p{^property-name} / \P{property-name} - Matches characters without the specified property (negative).
Available Property Names:
- Works on all encodings: Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower, Print, Punct, Space, Upper, XDigit, Word, ASCII
- Works on EUC_JP, Shift_JIS: Hiragana, Katakana
- Works on UTF8, UTF16, UTF32: Refer to doc/UNICODE_PROPERTIES for a comprehensive list.
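As a rough illustration of property matching and its two negative forms, the following Ruby sketch assumes UTF-8 input and Ruby's Onigmo-backed Regexp; property_demo is a hypothetical helper, not part of any library.

```ruby
# Assumption: UTF-8 strings and Ruby's Regexp (Onigmo) property support.
def property_demo(text)
  {
    hiragana:  text.scan(/\p{Hiragana}/), # characters with the property
    non_alpha: text.scan(/\p{^Alpha}/),   # negative form with ^
    same_neg:  text.scan(/\P{Alpha}/)     # negative form with uppercase P
  }
end
```

For example, property_demo("a\u30421") sees the letter "a", the hiragana character U+3042, and the digit "1"; both negative forms should select only the digit.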
Quantifiers specify how many times a preceding element can occur.
Greedy quantifiers match as much as possible.
- ? - Zero or one time (0 or 1).
- * - Zero or more times (0 to infinity).
- + - One or more times (1 to infinity).
- {n,m} - At least n and at most m times (inclusive, n <= m).
- {n,} - At least n times (n to infinity).
- {,n} - At most n times (0 to n, equivalent to {0,n}).
- {n} - Exactly n times.
Reluctant quantifiers match as little as possible. They are formed by appending ? to a greedy quantifier.
- ?? - Zero or one time, reluctantly.
- *? - Zero or more times, reluctantly.
- +? - One or more times, reluctantly.
- {n,m}? - At least n and at most m times, reluctantly (n <= m).
- {n,}? - At least n times, reluctantly.
- {,n}? - At most n times, reluctantly (equivalent to {0,n}?).
Note: {n}? is a reluctant quantifier only in ONIG_SYNTAX_JAVA and ONIG_SYNTAX_PERL. In ONIG_SYNTAX_ONIGURUMA, /a{n}?/ is equivalent to /(?:a{n})?/.
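The difference between greedy and reluctant matching is easiest to see on the same input. A minimal Ruby sketch, assuming Ruby's Regexp (Onigmo) behaves like Oniguruma for these quantifiers; the names are illustrative only.

```ruby
# Assumption: Ruby's Regexp (Onigmo) matches Oniguruma here.
SAMPLE_MARKUP = "<b>x</b><i>y</i>"

# Greedy: .+ runs to the end and backtracks to the last ">".
def greedy_tag(text)
  text[/<.+>/]
end

# Reluctant: .+? stops at the first ">" it can reach.
def reluctant_tag(text)
  text[/<.+?>/]
end
```

On SAMPLE_MARKUP, the greedy version returns the whole string and the reluctant version returns only "<b>".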
Possessive quantifiers are greedy and do not backtrack once they have matched. They are formed by appending + to a greedy quantifier.
- ?+ - Zero or one time, possessively.
- *+ - Zero or more times, possessively.
- ++ - One or more times, possessively.
- {n,m} (n > m) - At least m and at most n times, possessively.
- {n,m}+, {n,}+, {n}+ - Possessive range quantifiers, available only in ONIG_SYNTAX_JAVA and ONIG_SYNTAX_PERL.
Example: /a*+/ is equivalent to /(?>a*)/ (atomic group).
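The effect of not backtracking can be sketched in Ruby as follows, assuming Ruby's Regexp (Onigmo) treats *+ and (?>...) like Oniguruma does. The helper name is hypothetical.

```ruby
# Assumption: Ruby's Regexp (Onigmo) supports *+ and atomic groups as described.
def possessive_demo(text)
  {
    greedy:     /a*a/.match?(text),     # a* gives one "a" back to the final a
    possessive: /a*+a/.match?(text),    # a*+ keeps every "a", so the final a fails
    atomic:     /(?>a*)a/.match?(text)  # same behavior as the possessive form
  }
end
```

With the input "aaa", only the greedy pattern can succeed, because it is the only one allowed to release a character it has already consumed.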
Anchors assert positions within the text without consuming characters.
- ^ - Beginning of the line.
- $ - End of the line.
- \b - Word boundary: position between a word character (\w) and a non-word character (\W), or at the beginning/end of the string if the first/last character is a word character.
- \B - Non-word boundary: any position that is not a word boundary (\b).
- \A - Beginning of the string.
- \Z - End of the string, or before a newline at the very end of the string.
- \z - End of the string.
- \G - Where the current search attempt begins.
- \K - Keep: keeps the start position of the result string. The portion matched before \K is excluded from the final match result.
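The line anchors, the two end-of-string anchors, and \K can be compared with a small Ruby sketch. It assumes Ruby's Regexp (Onigmo) follows the anchor semantics listed above; anchor_demo is an illustrative name.

```ruby
# Assumption: Ruby's Regexp (Onigmo) uses the anchor semantics listed above.
def anchor_demo
  {
    line_anchors: "foo\nbar" =~ /^bar$/,       # ^ and $ work per line
    upper_z:      "foo\n" =~ /foo\Z/,          # \Z tolerates one trailing newline
    lower_z:      "foo\n" =~ /foo\z/,          # \z requires the true end of string
    keep:         "price: 42"[/price: \K\d+/]  # text before \K is dropped from the result
  }
end
```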
Boundaries related to text segments, affected by the (?y{..}) option.
- \y - Text Segment boundary.
- \Y - Text Segment non-boundary.
The meaning of \y and \Y depends on the Text Segment mode.
- Extended Grapheme Cluster mode (default):
  - Unicode: Follows Unicode Standard Annex #29.
  - Not Unicode: All positions except between \r and \n.
- Word mode:
  - Currently supported in Unicode only.
  - Follows Unicode Standard Annex #29.
Define custom sets of characters.
- ^... - Negative class (negation): matches any character not in the set. Lowest precedence within character classes.
- x-y - Range: specifies a range of characters from x to y (inclusive).
- [...] - Set in character class: allows nesting character classes within character classes.
- ..&&.. - Intersection: matches characters that are in both sets. Low precedence, only higher than negation (^).
  - Example: [a-w&&[^c-g]z] is equivalent to ([a-w] AND ([^c-g] OR z)), resulting in [abh-w].
Note: To use [, -, or ] as literal characters within a character class, escape them with \.
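The intersection example and the escaping note can be checked with a short Ruby sketch. It assumes Ruby's Regexp (Onigmo) applies the same precedence rules; the helper names are hypothetical. Ruby may print a "duplicated range" warning for the first pattern when run with warnings enabled, since z is already covered by [^c-g].

```ruby
# Assumption: Ruby's Regexp (Onigmo) uses the precedence rules described above.

# Keeps only the characters accepted by [a-w&&[^c-g]z], i.e. [abh-w].
def intersection_filter(text)
  text.scan(/[a-w&&[^c-g]z]/).join
end

# Finds literal "[", "-" and "]" by escaping them inside the class.
def literal_brackets(text)
  text.scan(/[\[\-\]]/)
end
```

Note that "z" is rejected by the first pattern: it is allowed by the right-hand side but is not in a-w, so the intersection excludes it.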
Predefined character classes based on POSIX standards. Can be negated using [:^xxxxx:].
Not Unicode Case:
- alnum - Alphanumeric characters.
- alpha - Alphabetical characters.
- ascii - ASCII characters: code values [0 - 127].
- blank - \t, \x20 (space).
- cntrl - Control characters.
- digit - Decimal digits: 0-9.
- graph - Graphic characters (includes all multibyte encoded characters).
- lower - Lowercase alphabetical characters.
- print - Printable characters (includes all multibyte encoded characters).
- punct - Punctuation characters.
- space - \t, \n, \v, \f, \r, \x20.
- upper - Uppercase alphabetical characters.
- xdigit - Hexadecimal digits: 0-9, a-f, A-F.
- word - Alphanumeric characters, _, and multibyte characters.
Unicode Case:
- alnum - Letter | Mark | Decimal_Number.
- alpha - Letter | Mark.
- ascii - Unicode code points 0000 - 007F.
- blank - Space_Separator | U+0009 (Tab).
- cntrl - Control | Format | Unassigned | Private_Use | Surrogate.
- digit - Decimal_Number.
- graph - [[:^space:]] && ^Control && ^Unassigned && ^Surrogate.
- lower - Lowercase_Letter.
- print - [[:graph:]] | [[:space:]].
- punct - Connector_Punctuation | Dash_Punctuation | Close_Punctuation | Final_Punctuation | Initial_Punctuation | Other_Punctuation | Open_Punctuation.
- space - Space_Separator | Line_Separator | Paragraph_Separator | U+0009 | U+000A | U+000B | U+000C | U+000D | U+0085.
- upper - Uppercase_Letter.
- xdigit - U+0030 - U+0039 | U+0041 - U+0046 | U+0061 - U+0066 (0-9, a-f, A-F).
- word - Letter | Mark | Decimal_Number | Connector_Punctuation.
Extended group syntax provides various functionalities beyond basic grouping.
- (?#...) - Comment: ignored during regex processing.
- (?imxWDSPy-imxWDSP:subexp) - Option on/off for subexpression: applies options within subexp. Options listed before - are turned on; options listed after - are turned off.
  - i - Ignore case.
  - m - Multi-line mode (dot . matches newline).
  - x - Extended mode (whitespace and comments are ignored).
  - W - ASCII-only word (\w, \p{Word}, [[:word:]]).
  - D - ASCII-only digit (\d, \p{Digit}, [[:digit:]]).
  - S - ASCII-only space (\s, \p{Space}, [[:space:]]).
  - P - ASCII-only POSIX properties (includes W, D, S): alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit, word.
  - y{?} - Text Segment mode: changes the meaning of \X, \y, \Y. Unicode only.
    - y{g} - Extended Grapheme Cluster mode (default).
    - y{w} - Word mode.
    - See Unicode Standard Annex #29.
- (?imxWDSPy-imxWDSP) - Isolated option: applies options to the rest of the pattern from this point or until the next closing parenthesis ).
  - Example: /ab(?i)c|def|gh/ is equivalent to /ab(?i:c|def|gh)/.
- /(?CIL).../, /(?CIL:...)/ - Whole option: applies options to the entire regular expression. Must be placed at the beginning.
  - C - ONIG_OPTION_DONT_CAPTURE_GROUP.
  - I - ONIG_OPTION_IGNORECASE_IS_ASCII.
  - L - ONIG_OPTION_FIND_LONGEST.
- (?:subexp) - Non-capturing group: groups subexp without capturing the matched substring.
- (subexp) - Capturing group: groups subexp and captures the matched substring. Assigned a number and can optionally be named.
- (?=subexp) - Look-ahead assertion: asserts that subexp matches after the current position, without consuming characters.
- (?!subexp) - Negative look-ahead assertion: asserts that subexp does not match after the current position, without consuming characters.
- (?<=subexp) - Look-behind assertion: asserts that subexp matches before the current position, without consuming characters.
- (?<!subexp) - Negative look-behind assertion: asserts that subexp does not match before the current position, without consuming characters.
- Limitations of look-behind assertions:
  - Cannot use Absent stopper (?~|expr) and Range clear (?~|) operators in look-behind assertions.
  - Limited ignore-case support: only supports conversion between single characters. No support for multi-character Unicode conversions.
- (?>subexp) - Atomic group: non-backtracking group. Once subexp matches, backtracking into it is prevented.
  - Example: /a*+/ is equivalent to /(?>a*)/.
- (?<name>subexp), (?'name'subexp) - Named capturing group: defines a capturing group with a name.
  - name must consist of word characters (\w).
  - Named groups are also assigned numbers like regular capturing groups.
  - Multiple groups can share the same name.
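Several of the extended groups above can be combined in a short Ruby sketch. It assumes Ruby's Regexp (Onigmo) supports named groups, look-around and scoped options in the same way; DATE_PATTERN and group_demo are names invented for this example.

```ruby
# Assumption: Ruby's Regexp (Onigmo) implements these groups like Oniguruma.
DATE_PATTERN = /(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)/

def group_demo
  date = DATE_PATTERN.match("2022-04-11")
  {
    year:           date[:year],                               # named capturing group
    look_behind:    "cost $42"[/(?<=\$)\d+/],                  # digits preceded by "$"
    neg_look_ahead: "foobar foobaz".scan(/foo(?!bar)\w+/),     # "foo" not followed by "bar"
    scoped_option:  ["Ab", "AB"].map { |s| /(?i:a)b/.match?(s) } # i applies to "a" only
  }
end
```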
Mechanism to execute code during regex matching at specific points.
Callouts of contents:
- (?{...contents...}) - Callout in progress: executes contents when the regex engine reaches this point during matching.
- (?{...contents...}D) - Callout with direction flag. D is a direction flag character:
  - 'X' - In progress and retraction (both forward matching and backtracking).
  - '<' - In retraction only (backtracking).
  - '>' - In progress only (forward matching).
- (?{...contents...}[tag]) - Callout with tag: assigns a tag to the callout.
- (?{...contents...}[tag]D) - Callout with tag and direction flag: combines tag and direction flag.
Notes:
- Escape characters have no special meaning within contents.
- contents cannot start with {.
- To allow n consecutive } within contents, use (n+1) consecutive braces as delimiters, as in (?{{{...contents...}}}).
- tag string characters: _, A-Z, a-z, 0-9 (first character: _, A-Z, a-z).
Callouts of name:
- (*name) - Callout by name.
- (*name{args...}) - Callout by name with arguments.
- (*name[tag]) - Callout by name with tag.
- (*name[tag]{args...}) - Callout by name with tag and arguments.
Notes:
- name string characters: _, A-Z, a-z, 0-9 (first character: _, A-Z, a-z).
- tag string characters: _, A-Z, a-z, 0-9 (first character: _, A-Z, a-z).
Operators related to matching ranges that exclude certain patterns.
- (?~absent) - Absent repeater:
  - Works like .* (more precisely \O*), but restricted to ranges that do not include matches of absent.
  - Abbreviation for (?~|(?:absent)|\O*).
  - Uses \O* as the repeater.
- (?~|absent|exp) - Absent expression:
  - Works like exp, but limited to ranges that do not include matches of absent.
  - Example: (?~|345|\d*) on "12345678" results in matches "12", "1", "".
- (?~|absent) - Absent stopper: limits the string range to the right of this operator to exclude any matches of absent.
- (?~|) - Range clear: clears the effects of previous Absent stoppers.
Note: Nested Absent functions are not supported, and their behavior is undefined.
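A common use of the absent repeater is matching a C-style comment without running past the first closing delimiter. The Ruby sketch below assumes Ruby 2.4.1 or later, whose Regexp (Onigmo) supports the (?~absent) form; to my knowledge the other absent forms listed above are not available in Ruby. The names are illustrative.

```ruby
# Assumption: Ruby >= 2.4.1, where Regexp (Onigmo) supports (?~absent).
# (?~\*/) matches any text that does not contain "*/".
C_COMMENT = %r{/\*(?~\*/)\*/}

# Returns the first C-style comment in src, or nil if there is none.
def first_comment(src)
  src[C_COMMENT]
end
```

For the input "/* a */ b */", the match ends at the first "*/" because the absent repeater cannot extend over it.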
Conditional regex constructs based on a condition.
- (?(condition_exp)then_exp|else_exp) - If-then-else.
- (?(condition_exp)then_exp) - If-then.
condition_exp can be:
- A backreference number or name: checks if the backreference is valid (i.e., the group captured something).
- A normal regular expression: evaluates the regex as the condition.
When condition_exp is a backreference, then_exp and else_exp can be omitted. In this case, it works as a backreference validity checker.
Backreference validity checker:
- (?(n)), (?(-n)), (?(+n)), (?(n+level)), ... (number-based backreferences)
- (?(<n>)), (?('-n')), (?(<+n>)), ... (bracketed number-based backreferences)
- (?(<name>)), (?('name')), (?(<name+level>)), ... (named backreferences)
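A typical if-then use is requiring a closing delimiter only when the opening one was matched. This Ruby sketch assumes Ruby's Regexp (Onigmo) accepts the numbered (?(1)then) form; the constant and method names are made up for the example.

```ruby
# Assumption: Ruby's Regexp (Onigmo) supports (?(n)then_exp) as described above.
# Group 1 optionally matches "("; ")" is required only if group 1 matched.
OPTIONALLY_PARENTHESIZED = /\A(\()?\d+(?(1)\))\z/

def well_formed_number?(text)
  OPTIONALLY_PARENTHESIZED.match?(text)
end
```

Under this pattern "12" and "(12)" are accepted, while "(12" and "12)" are rejected.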
Referencing previously captured groups to match the same text again.
Backreference by number:
- \n - Backreference to the nth capturing group (n >= 1).
- \k<n>, \k'n' - Backreference to the nth capturing group (n >= 1).
- \k<-n>, \k'-n' - Backreference to the nth group counting backwards from the current position (n >= 1).
- \k<+n>, \k'+n' - Backreference to the nth group counting forwards from the current position (n >= 1).
Backreference by name:
- \k<name>, \k'name' - Backreference to the named capturing group name.
If multiple groups have the same name, backreferencing checks the last defined group with that name first, then the previous one, and so on, until a match is found.
Important: Backreference by number is forbidden if any named group is defined and ONIG_OPTION_CAPTURE_GROUP is not set.
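Both backreference styles can be sketched in Ruby, assuming its Regexp (Onigmo) behaves like Oniguruma here. The two patterns are kept separate because numbered and named references are not mixed in one expression; backref_demo is a hypothetical helper.

```ruby
# Assumption: Ruby's Regexp (Onigmo) handles \1 and \k<name> as described above.
def backref_demo
  {
    # By number: a word character followed by the same character again.
    doubled: "hello"[/(\w)\1/],
    # By name: a quoted string that closes with the same quote character it opened with.
    quoted:  %q{say "hi" or 'bye'}[/(?<q>['"]).*?\k<q>/]
  }
end
```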
Referencing groups based on the recursion level of the regex engine.
- \k<n+level>, \k'n+level'
- \k<n-level>, \k'n-level'
- \k<name+level>, \k'name+level'
- \k<name-level>, \k'name-level'
level (>= 0) specifies the recursion level relative to the current position.
Examples:
Ex 1:
    /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b>))\z/.match("reee")
    /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")
\k<b+0> refers to the (?<b>.) at the same recursion level.
Ex 2: (XML-like tag matching)
    r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
    (?<element> \g<stag> \g<content>* \g<etag> ){0}
    (?<stag> < \g<name> \s* > ){0}
    (?<name> [a-zA-Z_:]+ ){0}
    (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
    (?<etag> </ \k<name+1> >){0}
    \g<element>
    __REGEXP__
    p r.match("<foo>f<bar>bbb</bar>f</foo>").captures
Re-executing a subexpression defined within a group.
Call by number:
- \g<n>, \g'n' - Call the nth capturing group (n >= 1).
- \g<0>, \g'0' - Call group 0 (the entire regular expression).
- \g<-n>, \g'-n' - Call the nth group counting backwards from the current position (n >= 1).
- \g<+n>, \g'+n' - Call the nth group counting forwards from the current position (n >= 1).
Call by name:
- \g<name>, \g'name' - Call the named capturing group name.
Restrictions:
- Left-most recursive calls are not allowed.
  - Error: (?<name>a|\g<name>b)
  - OK: (?<name>a|b\g<name>c)
- Calls to a name assigned to multiple groups are not allowed.
- Call by number is forbidden if any named group is defined and ONIG_OPTION_CAPTURE_GROUP is not set.
- The option status of the called group is always effective.
  - Example: /(?-i:\g<name>)(?i:(?<name>a)){0}/.match("A")
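Subexpression calls make recursive patterns possible. The following Ruby sketch matches balanced parentheses, assuming Ruby's Regexp (Onigmo) supports \g<name> as described; the names are illustrative. The call is not left-most recursive because a literal "(" is always consumed before \g<paren> is reached.

```ruby
# Assumption: Ruby's Regexp (Onigmo) supports recursive \g<name> calls.
# A "paren" is "(", then any mix of non-parenthesis characters and nested parens, then ")".
BALANCED_PARENS = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

def balanced?(text)
  BALANCED_PARENS.match?(text)
end
```

For instance, "(a(b)c)" and "()" are accepted, while "(a(b)c" is rejected because the outer parenthesis is never closed.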
Behavior of unnamed capturing groups (...) changes based on options and the presence of named groups. Named groups (?<name>...) are not affected.
- Case 1: /.../ (no named groups, no options)
  - (...) is treated as a capturing group.
- Case 2: /.../g (no named groups, g option)
  - (...) is treated as a non-capturing group (?:...).
- Case 3: /..(?<name>..)../ (named groups present, no options)
  - (...) is treated as a non-capturing group.
  - Numbered backreferences/calls are not allowed.
- Case 4: /..(?<name>..)../G (named groups present, G option)
  - (...) is treated as a capturing group.
  - Numbered backreferences/calls are allowed.
Where:
- g: ONIG_OPTION_DONT_CAPTURE_GROUP
- G: ONIG_OPTION_CAPTURE_GROUP
These options control whether unnamed groups should be treated as capturing or non-capturing when named groups are present.
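Cases 1 and 3 can be observed from Ruby, which to my understanding does not expose the g and G options through regex literals. The sketch assumes Ruby's Regexp (Onigmo) follows the same default rules; capture_demo is a hypothetical helper.

```ruby
# Assumption: Ruby's Regexp (Onigmo) applies the default rules of Case 1 and Case 3.
def capture_demo
  {
    # Case 1: no named groups, so both (...) groups capture.
    unnamed_only: /(\d+)-(\d+)/.match("1-2").captures,
    # Case 3: a named group is present, so the plain (...) no longer captures.
    with_named:   /(?<first>\d+)-(\d+)/.match("1-2").captures
  }
end
```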
Options that are syntax-specific.
- ONIG_SYNTAX_ONIGURUMA (default syntax):
  - (?m): Dot (.) also matches newline.
- ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA:
  - (?s): Dot (.) also matches newline.
  - (?m): ^ matches after newline, $ matches before newline.
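The Oniguruma-style meaning of (?m) can be seen in Ruby, whose syntax treats m the same way (dot matches newline). A minimal sketch with a hypothetical helper name:

```ruby
# Assumption: Ruby's Regexp (Onigmo) uses the Oniguruma-style meaning of (?m).
def dot_newline_demo
  text = "a\nb"
  {
    default:  /a.b/.match?(text),      # dot does not match newline by default
    inline_m: /(?m)a.b/.match?(text),  # isolated option for the rest of the pattern
    scoped_m: /(?m:a.)b/.match?(text)  # option limited to the subexpression
  }
end
```

In ONIG_SYNTAX_PERL or ONIG_SYNTAX_JAVA the same effect would be written with (?s) instead.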
Oniguruma's original extensions to regular expression syntax.
- Hexadecimal digit character type: \h, \H.
- True any character: \O.
- Text segment boundary: \y, \Y.
- Backreference validity checker: (?(...)).
- Named group: (?<name>...), (?'name'...).
- Named backreference: \k<name>.
- Subexpression call: \g<name>, \g<group-num>.
- Absent expression: (?~|...|...).
- Absent stopper: (?~|...).
Features present in Perl 5.8.0 but missing in Oniguruma.
- \N{name} (named characters).
- \l, \u, \L, \U (case modification escapes) and \C (single-byte match).
- (??{code}) (deferred code execution).
- \Q...\E (quote metacharacters literally). Unlike the items above, this one is effective in ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.
Key differences between Oniguruma and the regex engine in Ruby 1.8.
- Added character property: \p{property}, \P{property}.
- Added hexadecimal digit character type: \h, \H.
- Added look-behind assertions: (?<=fixed-width-pattern), (?<!fixed-width-pattern).
- Added possessive quantifiers: ?+, *+, ++.
- Added character class operations: [...] nesting, && intersection ([ must be escaped as a literal character in character classes).
- Added named groups and subexpression calls.
- Octal or hexadecimal number sequences can be treated as multibyte code characters in character classes if multibyte encoding is specified.
  - Example: [\xa1\xa2], [\xa1\xa7-\xa4\xa1]
- Allowed ranges between single-byte and multibyte characters in character classes.
  - Example: /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
- Effect range of isolated options extends to the next ).
  - Example: (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
- Isolated options are not transparent to preceding patterns.
  - Example: a(?i)* is a syntax error.
- Allowed unpaired left brace { as a normal character.
  - Examples: /{/, /({)/, /a{2,3/ etc.
- Negative POSIX bracket [:^xxxx:] is supported.
- POSIX bracket [:ascii:] is added.
- Repeat of look-ahead assertions is not allowed.
  - Examples: /(?=a)*/, /(?!b){5}/ are invalid.
- Ignore case option (/i) is effective for escape sequences.
  - Example: /\x61/i =~ "A"
- In range quantifiers, the minimum value is optional.
  - /a{,n}/ is equivalent to /a{0,n}/.
  - Omission of both minimum and maximum values is not allowed: /a{,}/ is invalid.
- /{n}?/ is not a reluctant quantifier. /a{n}?/ is equivalent to /(?:a{n})?/.
- Invalid backreferences are checked and raise errors.
  - Examples: /\1/, /(a)\2/
- Zero-width matches in infinite loops stop the repeat, and changes in capture group status are checked as a stop condition.
  - Examples: /(?:()|())*\1\2/ =~ "", /(?:\1a|())*/ =~ "a"