Version: 6.9.8 (2022/04/11)
Syntax: ONIG_SYNTAX_ONIGURUMA (default)
This document is a reference for Oniguruma regular expression syntax, version 6.9.8. It is intended as a quick guide for AI language models and developers working with Oniguruma regexes.
Oniguruma regular expressions utilize the following fundamental syntax elements:
\ - Escape character: disables the special meaning of a metacharacter so that it is treated as a literal character.
| - Alternation: matches either the expression before or the expression after the | operator.
(...) - Grouping: creates a group to which quantifiers can be applied or whose match can be captured.
[...] - Character class: defines a set of characters to match.
Special character escapes for common characters and code point representations.
\t - Horizontal tab (0x09)
\v - Vertical tab (0x0B)
\n - Newline / line feed (0x0A)
\r - Carriage return (0x0D)
\b - Backspace (0x08). Note: effective as backspace only within a character class [...]; outside, it represents a word boundary.
\f - Form feed (0x0C)
\a - Bell (0x07)
\e - Escape (0x1B)
\nnn - Octal character: nnn represents 1 to 3 octal digits, interpreted as an encoded byte value.
\xHH - Hexadecimal character: HH represents 2 hexadecimal digits, interpreted as an encoded byte value.
\x{7HHHHHHH} - Hexadecimal character: 7HHHHHHH represents 1 to 8 hexadecimal digits, interpreted as a Unicode code point value.
\o{17777777777} - Octal character: 17777777777 represents 1 to 11 octal digits, interpreted as a Unicode code point value.
\uHHHH - Hexadecimal character: HHHH represents 4 hexadecimal digits, interpreted as a Unicode code point value.
\cx, \C-x - Control character: x is a character; interpreted as a Unicode control character code point value.
\M-x - Meta character: represents (x | 0x80), interpreted as a Unicode code point value.
\M-\C-x - Meta control character: a combination of meta and control, interpreted as a Unicode code point value.
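A few of these escapes can be tried out with a short sketch. It uses Ruby, whose regex engine (Onigmo) is a fork of Oniguruma, and assumes Ruby treats these particular escapes the same way. The variable names are illustrative only.

```ruby
# Assumption: Ruby's Onigmo engine handles these escapes like Oniguruma.
tab_found     = "a\tb".match?(/a\tb/)      # \t matches a horizontal tab
hex_found     = "A".match?(/\x41/)         # \xHH: byte value 0x41 is "A"
unicode_found = "A".match?(/\u0041/)       # \uHHHH: code point U+0041
backspace_hit = "\b".match?(/[\b]/)        # inside [...], \b is backspace (0x08)
boundary_hit  = "a cat".match?(/\bcat\b/)  # outside [...], \b is a word boundary

p [tab_found, hex_found, unicode_found, backspace_hit, boundary_hit]
```

The last two lines show the context dependence of \b described above.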
Representing sequences of Unicode code points.
- Hexadecimal code point sequence: \x{7HHHHHHH 7HHHHHHH ... 7HHHHHHH} (space-separated code points, 1 to 8 hexadecimal digits each).
- Octal code point sequence: \o{17777777777 17777777777 ... 17777777777} (space-separated code points, 1 to 11 octal digits each).
Predefined character classes for common character sets.
. - Any character (except newline by default).
\w - Word character:
  - Not Unicode (default): alphanumeric characters, underscore (_), and multibyte characters.
  - Unicode: characters in General Category Letter, Mark, Number, or Connector_Punctuation.
\W - Non-word character: the negation of \w.
\s - Whitespace character:
  - Not Unicode (default): \t, \n, \v, \f, \r, \x20 (space).
  - Unicode: U+0009 (Tab), U+000A (Line Feed), U+000B (Vertical Tab), U+000C (Form Feed), U+000D (Carriage Return), U+0085 (NEL, Next Line), and characters in General Category Line_Separator, Paragraph_Separator, or Space_Separator.
\S - Non-whitespace character: the negation of \s.
\d - Decimal digit character:
  - Unicode: characters in General Category Decimal_Number.
\D - Non-decimal-digit character: the negation of \d.
\h - Hexadecimal digit character: equivalent to [0-9a-fA-F].
\H - Non-hexadecimal-digit character: the negation of \h.
\R - General newline:
  - Cannot be used inside a character class [...].
  - Matches \r\n or \n, \v, \f, \r.
  - Unicode: \r\n or \n, \v, \f, \r, or U+0085, U+2028, U+2029.
  - Note: does not backtrack from \r\n to \r.
\N - Negative newline: equivalent to (?-m:.), that is, any character except newline, with the m option turned off.
\O - True any character: equivalent to (?m:.), that is, any character including newline. This is an Oniguruma original extension.
\X - Text segment:
  - Equivalent to (?>\O(?:\Y\O)*).
  - The meaning depends on the Text Segment mode option (?y{..}).
  - \X does not check whether the match starts at a boundary; write \y\X to require one.
  - Extended Grapheme Cluster mode (default):
    - Unicode: follows Unicode Standard Annex #29.
    - Not Unicode: \X === (?>\r\n|\O).
  - Word mode:
    - Currently supported in Unicode only.
    - Follows Unicode Standard Annex #29.
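The following sketch exercises \h, \H, \R, and \X. It is written in Ruby (Onigmo, an Oniguruma fork) on the assumption that these types behave the same there; \N and \O are left out because this sketch does not assume Ruby supports them.

```ruby
# Assumption: Ruby's Onigmo engine matches Oniguruma for \h, \H, \R and \X.
hex_runs     = "ff-GG-09".scan(/\h+/)      # runs of hexadecimal digits
non_hex_runs = "ff-GG-09".scan(/\H+/)      # runs of everything else
newlines     = "a\r\nb\nc".scan(/\R/)      # \r\n is matched as one unit
lines        = "a\r\nb\nc".split(/\R/)
clusters     = "e\u0301x".scan(/\X/)       # "e" + combining acute, then "x"

p hex_runs, non_hex_runs, newlines, lines, clusters.length
```

Note that \R consumes \r\n as a single newline rather than as two separate ones.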
Using Unicode character properties for more specific character class matching.
\p{property-name} - Matches characters with the specified property.
\p{^property-name}, \P{property-name} - Matches characters without the specified property (negative).
Available property names:
- Works on all encodings: Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower, Print, Punct, Space, Upper, XDigit, Word, ASCII.
- Works on EUC_JP, Shift_JIS: Hiragana, Katakana.
- Works on UTF8, UTF16, UTF32: refer to doc/UNICODE_PROPERTIES for a comprehensive list.
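A minimal sketch of positive and negative property matching follows, in Ruby with UTF-8 strings. It assumes Ruby's Onigmo engine accepts the same property names as Oniguruma; the sample strings are made up for illustration.

```ruby
# Assumption: these property names behave the same in Ruby (Onigmo).
letters      = "a1 b2".scan(/\p{Alpha}/)    # characters with the property
non_letters  = "a1 b2".scan(/\p{^Alpha}/)   # negated with ^
non_letters2 = "a1 b2".scan(/\P{Alpha}/)    # negated with uppercase P
hiragana     = "\u304B\u306Akana".scan(/\p{Hiragana}+/)

p letters, non_letters, non_letters2, hiragana
```

The two negative forms are interchangeable, so non_letters and non_letters2 hold the same result.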
Quantifiers specify how many times a preceding element can occur.
Match as much as possible.
? - Zero or one time.
* - Zero or more times.
+ - One or more times.
{n,m} - At least n and at most m times (inclusive, n <= m).
{n,} - At least n times.
{,n} - At most n times (equivalent to {0,n}).
{n} - Exactly n times.
Match as little as possible. Formed by appending ? to a greedy quantifier.
?? - Zero or one time, reluctantly.
*? - Zero or more times, reluctantly.
+? - One or more times, reluctantly.
{n,m}? - At least n and at most m times, reluctantly (n <= m).
{n,}? - At least n times, reluctantly.
{,n}? - At most n times, reluctantly (equivalent to {0,n}?).
Note: {n}? is a reluctant quantifier only in ONIG_SYNTAX_JAVA and ONIG_SYNTAX_PERL. In ONIG_SYNTAX_ONIGURUMA, /a{n}?/ is equivalent to /(?:a{n})?/.
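The difference between greedy and reluctant matching is easiest to see side by side. This is a small sketch in Ruby (Onigmo, an Oniguruma fork), assuming the quantifiers shown behave identically there; the input strings are arbitrary.

```ruby
# Assumption: greedy/reluctant semantics are the same in Ruby (Onigmo).
text = "<a><b>"
greedy_match    = text[/<.+>/]          # .+ runs to the last ">"
reluctant_match = text[/<.+?>/]         # .+? stops at the first ">"

range_greedy    = "aaaa"[/a{2,3}/]      # takes as many as allowed
range_reluctant = "aaaa"[/a{2,3}?/]     # takes as few as allowed
at_most_two     = "aaaa"[/a{,2}/]       # {,n} is {0,n}
optional_lazy   = "ab"[/ab??/]          # ?? prefers zero occurrences

p greedy_match, reluctant_match, range_greedy, range_reluctant, at_most_two, optional_lazy
```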
Greedy, and do not backtrack once a match is found. Formed by appending + to a greedy quantifier.
?+ - Zero or one time, possessively.
*+ - Zero or more times, possessively.
++ - One or more times, possessively.
{n,m}+ - At least n and at most m times, possessively (n <= m).
{n,}+ - At least n times, possessively.
{n}+ - Exactly n times, possessively.
Note: {n,m}+, {n,}+, and {n}+ are possessive only in ONIG_SYNTAX_JAVA and ONIG_SYNTAX_PERL.
Example: /a*+/ is equivalent to /(?>a*)/ (atomic group).
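To illustrate what "does not backtrack" means in practice, the sketch below compares a greedy, a possessive, and an atomic-group pattern on the same input. It is written in Ruby (Onigmo) and assumes *+ and (?>...) behave as in Oniguruma.

```ruby
# Assumption: possessive *+ and atomic groups work the same in Ruby (Onigmo).
input = "aaa"
greedy_ok     = input.match?(/\Aa*a\z/)       # a* gives back one "a" for the final a
possessive_ok = input.match?(/\Aa*+a\z/)      # a*+ keeps all three, so the final a fails
atomic_ok     = input.match?(/\A(?>a*)a\z/)   # same behaviour as the possessive form

p [greedy_ok, possessive_ok, atomic_ok]
```

The possessive and atomic variants fail for the same reason, which is the equivalence stated in the example above.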
Anchors assert positions within the text without consuming characters.
^ - Beginning of the line.
$ - End of the line.
\b - Word boundary: a position between a word character (\w) and a non-word character (\W), or at the beginning or end of the string if the first or last character is a word character.
\B - Non-word boundary: any position that is not a word boundary (\b).
\A - Beginning of the string.
\Z - End of the string, or before a newline at the very end of the string.
\z - End of the string.
\G - Position where the current search attempt begins.
\K - Keep: sets the start position of the reported match to the current position. Text matched before \K is not included in the final match result.
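The anchors above can be contrasted with a few one-line checks. This sketch uses Ruby (Onigmo, an Oniguruma fork) and assumes the anchors behave the same there; the sample strings are illustrative.

```ruby
# Assumption: anchor semantics are shared between Ruby (Onigmo) and Oniguruma.
line_start   = "foo\nbar" =~ /^bar$/    # ^ and $ work per line
string_start = "foo\nbar" =~ /\Abar/    # \A only matches at the very start
before_nl    = "foo\n" =~ /foo\Z/       # \Z tolerates one trailing newline
strict_end   = "foo\n" =~ /foo\z/       # \z does not
kept         = "foobar"[/foo\Kbar/]     # text before \K is dropped from the result
chained      = "aab".scan(/\Ga/)        # each match must start where the last ended

p line_start, string_start, before_nl, strict_end, kept, chained
```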
Boundaries related to text segments, affected by the (?y{..}) option.
\y - Text segment boundary.
\Y - Text segment non-boundary.
The meaning of \y and \Y depends on the Text Segment mode:
- Extended Grapheme Cluster mode (default):
  - Unicode: follows Unicode Standard Annex #29.
  - Not Unicode: all positions except between \r and \n.
- Word mode:
  - Currently supported in Unicode only.
  - Follows Unicode Standard Annex #29.
Define custom sets of characters.
^... - Negative class (negation): matches any character not in the set. Lowest precedence among the character class operators.
x-y - Range: the characters from x to y (inclusive).
[...] - Set in character class: character classes can be nested inside character classes.
..&&.. - Intersection: matches characters that are in both sets. Low precedence, higher only than negation (^).
  - Example: [a-w&&[^c-g]z] is equivalent to ([a-w] AND ([^c-g] OR z)), resulting in [abh-w].
Note: To use [, -, or ] as a literal character within a character class, escape it with \.
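The intersection example can be checked mechanically by testing every lowercase letter against the class. The sketch is in Ruby (Onigmo), assuming nested classes and && behave as in Oniguruma.

```ruby
# Assumption: nested classes and && intersection are the same in Ruby (Onigmo).
klass   = /[a-w&&[^c-g]z]/
matched = ("a".."z").select { |c| c.match?(klass) }.join

nested_hit  = "y".match?(/[a-c[x-z]]/)     # a nested set adds its members
literal_hit = "-".match?(/[\[\-\]]/)       # escaped [, - and ] are literals

p matched, nested_hit, literal_hit
```

The joined result spells out the members of [abh-w]: z is excluded because it is not in [a-w].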
Predefined character classes based on POSIX standards. Each can be negated using [:^xxxxx:].
Not Unicode case:
alnum - Alphanumeric characters.
alpha - Alphabetical characters.
ascii - ASCII characters: code values [0 - 127].
blank - \t, \x20 (space).
cntrl - Control characters.
digit - Decimal digits: 0-9.
graph - Graphic characters (includes all multibyte encoded characters).
lower - Lowercase alphabetical characters.
print - Printable characters (includes all multibyte encoded characters).
punct - Punctuation characters.
space - \t, \n, \v, \f, \r, \x20.
upper - Uppercase alphabetical characters.
xdigit - Hexadecimal digits: 0-9, a-f, A-F.
word - Alphanumeric characters, _, and multibyte characters.
Unicode case:
alnum - Letter | Mark | Decimal_Number.
alpha - Letter | Mark.
ascii - Unicode code points 0000 - 007F.
blank - Space_Separator | U+0009 (Tab).
cntrl - Control | Format | Unassigned | Private_Use | Surrogate.
digit - Decimal_Number.
graph - [[:^space:]] && ^Control && ^Unassigned && ^Surrogate.
lower - Lowercase_Letter.
print - [[:graph:]] | [[:space:]].
punct - Connector_Punctuation | Dash_Punctuation | Close_Punctuation | Final_Punctuation | Initial_Punctuation | Other_Punctuation | Open_Punctuation.
space - Space_Separator | Line_Separator | Paragraph_Separator | U+0009 | U+000A | U+000B | U+000C | U+000D | U+0085.
upper - Uppercase_Letter.
xdigit - U+0030 - U+0039 | U+0041 - U+0046 | U+0061 - U+0066 (0-9, a-f, A-F).
word - Letter | Mark | Decimal_Number | Connector_Punctuation.
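POSIX brackets are written inside a character class, so the full form is [[:name:]]. A brief sketch in Ruby (Onigmo), assuming the bracket names and the [:^name:] negation work as in Oniguruma:

```ruby
# Assumption: POSIX brackets and their negation are the same in Ruby (Onigmo).
digit_runs     = "a1b22".scan(/[[:digit:]]+/)    # runs of decimal digits
non_digit_runs = "a1b22".scan(/[[:^digit:]]+/)   # negated bracket
hex_runs       = "xyz7Fq".scan(/[[:xdigit:]]+/)  # hexadecimal digits only

p digit_runs, non_digit_runs, hex_runs
```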
Extended group syntax provides various functionalities beyond basic grouping.
(?#...) - Comment: ignored during regex processing.
(?imxWDSPy-imxWDSP:subexp) - Option on/off for subexpression: applies the options within subexp. Options listed after - are turned off.
  - i - Ignore case.
  - m - Multi-line mode (dot . also matches newline).
  - x - Extended mode (whitespace and comments are ignored).
  - W - ASCII-only word (\w, \p{Word}, [[:word:]]).
  - D - ASCII-only digit (\d, \p{Digit}, [[:digit:]]).
  - S - ASCII-only space (\s, \p{Space}, [[:space:]]).
  - P - ASCII-only POSIX properties (includes W, D, S): alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit, word.
  - y{?} - Text Segment mode: changes the meaning of \X, \y, \Y. Unicode only.
    - y{g} - Extended Grapheme Cluster mode (default).
    - y{w} - Word mode.
    - See Unicode Standard Annex #29.
(?imxWDSPy-imxWDSP) - Isolated option: applies the options from this point up to the next closing parenthesis ) of the enclosing group or, outside any group, to the end of the pattern.
  - Example: /ab(?i)c|def|gh/ is equivalent to /ab(?i:c|def|gh)/.
/(?CIL).../, /(?CIL:...)/ - Whole option: applies options to the entire regular expression. Must be placed at the beginning of the pattern.
  - C - ONIG_OPTION_DONT_CAPTURE_GROUP.
  - I - ONIG_OPTION_IGNORECASE_IS_ASCII.
  - L - ONIG_OPTION_FIND_LONGEST.
(?:subexp) - Non-capturing group: groups subexp without capturing the matched substring.
(subexp) - Capturing group: groups subexp and captures the matched substring. The group is assigned a number and can optionally be named.
(?=subexp) - Look-ahead assertion (positive): asserts that subexp matches after the current position, without consuming characters.
(?!subexp) - Negative look-ahead assertion: asserts that subexp does not match after the current position, without consuming characters.
(?<=subexp) - Look-behind assertion (positive): asserts that subexp matches before the current position, without consuming characters.
(?<!subexp) - Negative look-behind assertion: asserts that subexp does not match before the current position, without consuming characters.
  - Limitations (look-behind assertions):
    - The Absent stopper (?~|expr) and Range clear (?~|) operators cannot be used in look-behind assertions.
    - Ignore-case support is limited: only conversions between single characters are supported, not multi-character Unicode conversions.
(?>subexp) - Atomic group: non-backtracking group. Once subexp has matched, the engine does not backtrack into it.
  - Example: /a*+/ is equivalent to /(?>a*)/.
(?<name>subexp), (?'name'subexp) - Named capturing group: defines a capturing group with a name.
  - name must consist of word characters (\w).
  - Named groups are also assigned numbers, like regular capturing groups.
  - Multiple groups can share the same name.
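Several of the group forms above are shown together in the sketch below. It uses Ruby (Onigmo, an Oniguruma fork) and assumes these constructs behave the same there; the W, D, S, P, and y options and the whole options are omitted because the sketch does not assume Ruby supports them.

```ruby
# Assumption: these group constructs are shared by Ruby (Onigmo) and Oniguruma.
amount      = "price: 100 USD"[/\d+(?= USD)/]   # look-ahead: digits followed by " USD"
after_b     = "a1 b2"[/(?<=b)\d/]               # look-behind: digit preceded by "b"
not_after_a = "a1 b2"[/(?<!a)\d/]               # negative look-behind

date      = /(?<year>\d{4})-(?<month>\d{2})/.match("2022-04")
year      = date[:year]                         # named groups are read by name
only_c    = /(?:ab)+(c)/.match("ababc").captures  # (?:...) does not capture

scoped_on  = "Ab".match?(/(?i:a)b/)             # i applies to "a" only
scoped_off = "AB".match?(/(?i:a)b/)             # "b" is still case-sensitive

p amount, after_b, not_after_a, year, only_c, scoped_on, scoped_off
```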
Mechanism to execute code during regex matching at specific points.
Callouts of contents:
(?{...contents...}) - Callout in progress: executes contents when the regex engine reaches this point during matching.
(?{...contents...}D) - Callout with direction flag. D is a direction flag character:
  - X - In progress and in retraction (both forward matching and backtracking).
  - < - In retraction only (backtracking).
  - > - In progress only (forward matching).
(?{...contents...}[tag]) - Callout with tag: assigns tag to the callout.
(?{...contents...}[tag]D) - Callout with tag and direction flag.
Notes:
- Escape characters have no special meaning within contents.
- contents cannot start with {.
- To allow n consecutive } characters within contents, use n+1 braces as delimiters on each side, for example (?{{{...contents...}}}).
- tag string characters: _, A-Z, a-z, 0-9 (first character: _, A-Z, a-z).
Callouts of name:
(*name) - Callout by name.
(*name{args...}) - Callout by name with arguments.
(*name[tag]) - Callout by name with tag.
(*name[tag]{args...}) - Callout by name with tag and arguments.
Notes:
- name string characters: _, A-Z, a-z, 0-9 (first character: _, A-Z, a-z).
- tag string characters: _, A-Z, a-z, 0-9 (first character: _, A-Z, a-z).
Operators related to matching ranges that exclude certain patterns.
(?~absent) - Absent repeater:
  - Works like .* (more precisely \O*), but is restricted to ranges that do not include a match of absent.
  - Abbreviation for (?~|(?:absent)|\O*), with \O* as the repeater.
(?~|absent|exp) - Absent expression:
  - Works like exp, but is limited to ranges that do not include a match of absent.
  - Example: (?~|345|\d*) applied to "12345678" can match "12", "1", or "".
(?~|absent) - Absent stopper: limits the string range to the right of this operator so that it excludes any match of absent.
(?~|) - Range clear: clears the effects of previous Absent stoppers.
Note: Nested Absent functions are not supported, and their behavior is undefined.
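A common use of the absent repeater is matching a C-style comment without running past its terminator. The sketch below is in Ruby and assumes a Ruby version whose Onigmo engine supports (?~absent) (2.4.1 or later, to the best of my knowledge); the other absent forms are not shown because this sketch does not assume Ruby supports them.

```ruby
# Assumption: (?~absent) in Ruby (Onigmo) behaves like Oniguruma's absent repeater.
source = "/* a */ b */"

with_absent = source[%r{/\*(?~\*/)\*/}]   # body may not contain "*/"
with_greedy = source[%r{/\*.*\*/}]        # plain .* runs to the last "*/"

p with_absent, with_greedy
```

The absent repeater stops at the first terminator because any longer body would contain "*/".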
Conditional regex constructs based on a condition.
(?(condition_exp)then_exp|else_exp) - If-then-else.
(?(condition_exp)then_exp) - If-then.
condition_exp can be:
- A backreference number or name: checks whether the backreference is valid (i.e., the group has captured something).
- A normal regular expression: evaluated as the condition.
When condition_exp is a backreference, then_exp and else_exp can both be omitted. In that case the construct works as a backreference validity checker.
Backreference validity checker:
- (?(n)), (?(-n)), (?(+n)), (?(n+level)), ... (number-based backreferences)
- (?(<n>)), (?('-n')), (?(<+n>)), ... (bracketed number-based backreferences)
- (?(<name>)), (?('name')), (?(<name+level>)), ... (named backreferences)
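As an illustration of the if-then form with a backreference condition, the sketch below requires a closing parenthesis only when an opening one was captured. It is written in Ruby (Onigmo), assuming conditionals behave as in Oniguruma; the pattern name is hypothetical.

```ruby
# Assumption: (?(n)then) works the same in Ruby (Onigmo) as in Oniguruma.
# Group 1 captures an optional "("; (?(1)\)) demands ")" only if group 1 matched.
optional_parens = /\A(\()?\d+(?(1)\))\z/

results = ["(12)", "12", "(12", "12)"].map { |s| s.match?(optional_parens) }

p results
```

Only the inputs with balanced or absent parentheses are accepted.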
Referencing previously captured groups to match the same text again.
Backreference by number:
\n - Backreference to the nth capturing group (n >= 1).
\k<n>, \k'n' - Backreference to the nth capturing group (n >= 1).
\k<-n>, \k'-n' - Backreference to the nth group counting backwards from the current position (n >= 1).
\k<+n>, \k'+n' - Backreference to the nth group counting forwards from the current position (n >= 1).
Backreference by name:
\k<name>, \k'name' - Backreference to the named capturing group name.
If multiple groups have the same name, the backreference checks the last defined group with that name first, then the previous one, and so on, until a match is found.
Important: Backreference by number is forbidden if any named group is defined and ONIG_OPTION_CAPTURE_GROUP is not set.
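The three ways of referring back to a group, and the restriction on mixing numbered backreferences with named groups, can be sketched as follows. This uses Ruby (Onigmo) and assumes the same rules apply there; Regexp.new is used for the failing case so that the error can be rescued at run time.

```ruby
# Assumption: backreference rules in Ruby (Onigmo) match those described above.
by_number = "hello"[/(\w)\1/]             # \1 repeats group 1
by_name   = "hello"[/(?<c>\w)\k<c>/]      # \k<name> repeats the named group
relative  = "xabb"[/(a)(b)\k<-1>/]        # \k<-1> is the closest preceding group

# A numbered backreference next to a named group is rejected.
mixed_error =
  begin
    Regexp.new('(?<c>\w)\1')
    nil
  rescue RegexpError => e
    e.class
  end

p by_number, by_name, relative, mixed_error
```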
Referencing groups based on the recursion level of the regex engine.
\k<n+level>, \k'n+level'
\k<n-level>, \k'n-level'
\k<name+level>, \k'name+level'
\k<name-level>, \k'name-level'
level (>= 0) specifies the recursion level relative to the referring position.
Examples:
Ex 1:
/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b>))\z/.match("reee")
/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")
\k<b+0> refers to the (?<b>.) at the same recursion level.
Ex 2 (XML-like tag matching):
r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
(?<element> \g<stag> \g<content>* \g<etag> ){0}
(?<stag> < \g<name> \s* > ){0}
(?<name> [a-zA-Z_:]+ ){0}
(?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
(?<etag> </ \k<name+1> >){0}
\g<element>
__REGEXP__
p r.match("<foo>f<bar>bbb</bar>f</foo>").captures
Re-executing a subexpression defined within a group.
Call by number:
\g<n>, \g'n' - Call the nth capturing group (n >= 1).
\g<0>, \g'0' - Call group 0 (the entire regular expression).
\g<-n>, \g'-n' - Call the nth group counting backwards from the current position (n >= 1).
\g<+n>, \g'+n' - Call the nth group counting forwards from the current position (n >= 1).
Call by name:
\g<name>, \g'name' - Call the named capturing group name.
Restrictions:
- Left-most recursive calls are not allowed.
  - Error: (?<name>a|\g<name>b)
  - OK: (?<name>a|b\g<name>c)
- Calls to a name assigned to multiple groups are not allowed.
- Call by number is forbidden if any named group is defined and ONIG_OPTION_CAPTURE_GROUP is not set.
- The option status of the called group is always effective.
  - Example: /(?-i:\g<name>)(?i:(?<name>a)){0}/.match("A")
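Subexpression calls make recursive patterns possible. The sketch below matches balanced parentheses and also shows the left-most recursion restriction being enforced; it is in Ruby (Onigmo), assuming calls behave as in Oniguruma, and the group name paren is hypothetical.

```ruby
# Assumption: \g<name> calls work the same in Ruby (Onigmo) as in Oniguruma.
# The group calls itself for each nested pair of parentheses.
balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

closed   = "(a(b)c)".match?(balanced)
unclosed = "(a(b)c".match?(balanced)

# A left-most recursive call is rejected when the pattern is compiled.
left_recursion_error =
  begin
    Regexp.new('(?<name>a|\g<name>b)')
    nil
  rescue RegexpError => e
    e.class
  end

p closed, unclosed, left_recursion_error
```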
The behavior of an unnamed group (...) changes based on options and on the presence of named groups. Named groups (?<name>...) are not affected.
- Case 1: /.../ (no named groups, no options)
  - (...) is treated as a capturing group.
- Case 2: /.../g (no named groups, g option)
  - (...) is treated as a non-capturing group (?:...).
- Case 3: /..(?<name>..)../ (named groups present, no options)
  - (...) is treated as a non-capturing group.
  - Numbered backreferences/calls are not allowed.
- Case 4: /..(?<name>..)../G (named groups present, G option)
  - (...) is treated as a capturing group.
  - Numbered backreferences/calls are allowed.
Where:
- g: ONIG_OPTION_DONT_CAPTURE_GROUP
- G: ONIG_OPTION_CAPTURE_GROUP
These options control whether unnamed groups are treated as capturing or non-capturing.
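Cases 1 and 3 can be observed directly from Ruby, whose Onigmo engine is assumed here to follow the same rule. The g and G options are not exposed on Ruby regex literals, so cases 2 and 4 are not shown.

```ruby
# Assumption: Ruby (Onigmo) follows the same capturing rule as Oniguruma.
case1 = /(a)(b)/.match("ab").captures        # no named groups: both capture
case3 = /(a)(?<n>b)/.match("ab").captures    # named group present: (a) does not capture

p case1, case3
```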
Options that are syntax-specific.
ONIG_SYNTAX_ONIGURUMA (default syntax):
- (?m): dot (.) also matches newline.
ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA:
- (?s): dot (.) also matches newline.
- (?m): ^ matches after newline, $ matches before newline.
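Ruby's regex syntax uses the same meaning of m as ONIG_SYNTAX_ONIGURUMA, so the first case can be sketched there. This assumes Ruby (Onigmo) as the engine; the Perl and Java syntaxes are not reachable from Ruby and are not shown.

```ruby
# Assumption: Ruby (Onigmo) gives (?m) the same meaning as ONIG_SYNTAX_ONIGURUMA.
default_dot   = "a\nb".match?(/a.b/)        # dot does not match newline
multiline_dot = "a\nb".match?(/(?m:a.b)/)   # with m, dot also matches newline

p default_dot, multiline_dot
```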
Oniguruma's original extensions to regular expression syntax.
- Hexadecimal digit character type: \h, \H.
- True any character: \O.
- Text segment boundary: \y, \Y.
- Backreference validity checker: (?(...)).
- Named group: (?<name>...), (?'name'...).
- Named backreference: \k<name>.
- Subexpression call: \g<name>, \g<group-num>.
- Absent expression: (?~|...|...).
- Absent stopper: (?~|...).
Features present in Perl 5.8.0 but missing in Oniguruma.
\N{name} (named characters).
\l, \u, \L, \U (case modification escapes) and \C (single-byte match).
(??{code}) (deferred code execution).
\Q...\E (quote metacharacters literally). Note: \Q...\E is effective in ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.
Key differences between Oniguruma and the regex engine in Ruby 1.8.
- Added character property: \p{property}, \P{property}.
- Added hexadecimal digit character type: \h, \H.
- Added look-behind assertions: (?<=fixed-width-pattern), (?<!fixed-width-pattern).
- Added possessive quantifiers: ?+, *+, ++.
- Added character class operations: [...] nesting, && intersection.
  - ([ must be escaped to be used as a literal character in character classes.)
- Added named groups and subexpression calls.
- Octal or hexadecimal number sequences can be treated as multibyte code characters in character classes if a multibyte encoding is specified.
  - Example: [\xa1\xa2], [\xa1\xa7-\xa4\xa1]
- Ranges between single-byte and multibyte characters are allowed in character classes.
  - Example: /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
- The effect range of isolated options extends to the next ).
  - Example: (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
- Isolated options are not transparent to preceding patterns.
  - Example: a(?i)* is a syntax error.
- An unpaired left brace { is allowed as a normal character.
  - Examples: /{/, /({)/, /a{2,3/ etc.
- Negative POSIX bracket [:^xxxx:] is supported.
- POSIX bracket [:ascii:] is added.
- Repeat of look-ahead assertions is not allowed.
  - Examples: /(?=a)*/ and /(?!b){5}/ are invalid.
- The ignore case option (/i) is effective for escape sequences.
  - Example: /\x61/i =~ "A"
- In range quantifiers, the minimum value is optional.
  - /a{,n}/ is equivalent to /a{0,n}/.
  - Omission of both minimum and maximum values is not allowed: /a{,}/ is invalid.
- /{n}?/ is not a reluctant quantifier.
  - /a{n}?/ is equivalent to /(?:a{n})?/.
- Invalid backreferences are checked and raise errors.
  - Examples: /\1/ and /(a)\2/ raise errors.
- Zero-width matches in infinite loops stop the repeat, and changes in capture group status are checked as a stop condition.
  - Examples: /(?:()|())*\1\2/ =~ "" and /(?:\1a|())*/ =~ "a"