Oniguruma Regular Expressions v6.9.8 Reference

Version: 6.9.8 (2022/04/11) Syntax: ONIG_SYNTAX_ONIGURUMA (default)

This document is a reference for Oniguruma regular expression syntax, version 6.9.8. It is intended as a quick guide for AI language models and developers working with Oniguruma regex.

1. Syntax Elements

Oniguruma regular expressions utilize the following fundamental syntax elements:

\ - Escape character:
- Gives a special meaning to some ordinary characters (for example, \d or \w).
- Removes the special meaning of metacharacters so that they match literally (for example, \. or \*).
| - Alternation:
- Matches either the expression before or after the | operator.
(...) - Grouping:
- Creates a group to apply quantifiers or capture substrings.
[...] - Character Class:
- Defines a set of characters to match.

2. Characters

Special character escapes for common characters and code point representations.

\t - Horizontal tab (0x09)
\v - Vertical tab (0x0B)
\n - Newline / Line feed (0x0A)
\r - Carriage return (0x0D)
\b - Backspace (0x08) - Note: Effective as backspace only within character classes [...]. Outside, it represents a word boundary.
\f - Form feed (0x0C)
\a - Bell (0x07)
\e - Escape (0x1B)
\nnn - Octal character:
- nnn represents 1 to 3 octal digits.
- Interpreted as an encoded byte value.
\xHH - Hexadecimal character:
- HH represents 2 hexadecimal digits.
- Interpreted as an encoded byte value.
\x{7HHHHHHH} - Hexadecimal character (Unicode code point):
- 7HHHHHHH represents 1 to 8 hexadecimal digits.
- Interpreted as a Unicode code point value.
\o{17777777777} - Octal character (Unicode code point):
- 17777777777 represents 1 to 11 octal digits.
- Interpreted as a Unicode code point value.
\uHHHH - Hexadecimal character (Unicode code point):
- HHHH represents 4 hexadecimal digits.
- Interpreted as a Unicode code point value.
\cx / \C-x - Control character:
- x is a character.
- Interpreted as a character code point value.
\M-x - Meta character:
- Represents (x | 0x80).
- Interpreted as a Unicode code point value.
\M-\C-x - Meta control character:
- Represents a combination of meta and control characters.
- Interpreted as a Unicode code point value.
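
A few of the escapes above can be tried out as follows. This is a minimal sketch that assumes Ruby, whose Regexp engine (Onigmo) is a fork of Oniguruma and is assumed to treat these particular escapes the same way. The constant names are illustrative only.

```ruby
# Escapes outside a character class.
TAB_RE = /a\tb/       # \t is a horizontal tab (0x09)
HEX_RE = /\x41/       # \xHH: 0x41 is "A"
UNI_RE = /\u00e9/     # \uHHHH: U+00E9 is "e" with an acute accent

# \b depends on context: backspace inside [...], word boundary outside.
BACKSPACE_RE = /[\b]/
BOUNDARY_RE  = /\bcat\b/

p "a\tb".match?(TAB_RE)         # => true
p "A".match?(HEX_RE)            # => true
p "caf\u00e9" =~ UNI_RE         # => 3
p "x\by" =~ BACKSPACE_RE        # => 1
p "concat cat" =~ BOUNDARY_RE   # => 7
```

The last line shows the word-boundary reading of \b: the "cat" inside "concat" is skipped because it is preceded by a word character.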

2.1 Code Point Sequences

Representing sequences of Unicode code points.

Hexadecimal code point sequences:
```
\x{7HHHHHHH 7HHHHHHH ... 7HHHHHHH}
```
- Space-separated hexadecimal code points (1-8 digits each).
Octal code point sequences:
```
\o{17777777777 17777777777 ... 17777777777}
```
- Space-separated octal code points (1-11 digits each).

3. Character Types

Predefined character classes for common character sets.

. - Any character (except newline by default).
\w - Word character:
- Not Unicode (default): Alphanumeric characters, underscore (_), and multibyte characters.
- Unicode: Characters belonging to the General Category: Letter, Mark, Number, or Connector_Punctuation.
\W - Non-word character:
- The negation of \w.
\s - Whitespace character:
- Not Unicode (default): \t, \n, \v, \f, \r, \x20 (space).
- Unicode:
  - U+0009 (Tab), U+000A (Line Feed), U+000B (Vertical Tab), U+000C (Form Feed), U+000D (Carriage Return), U+0085 (NEL - Next Line).
  - Characters from General Category: Line_Separator, Paragraph_Separator, Space_Separator.
\S - Non-whitespace character:
- The negation of \s.
\d - Decimal digit character:
- Unicode: Characters from General Category: Decimal_Number.
\D - Non-decimal-digit character:
- The negation of \d.
\h - Hexadecimal digit character:
- Equivalent to [0-9a-fA-F].
\H - Non-hexadecimal digit character:
- The negation of \h.
\R - General newline:
- Cannot be used in character classes [...].
- Matches \r\n or \n, \v, \f, \r.
- Unicode: \r\n or \n, \v, \f, \r, or U+0085, U+2028, U+2029.
- Note: Does not backtrack from \r\n to \r.
\N - Negative newline:
- Equivalent to (?-m:.): any character except newline, regardless of the m option.
\O - True any character:
- Equivalent to (?m:.): any character including newline, regardless of the m option.
- An Oniguruma original extension.
\X - Text Segment:
- Equivalent to (?>\O(?:\Y\O)*).
- Meaning depends on the Text Segment mode option (?y{..}).
- Does not inherently check for boundary at the start of the match. Use \y\X to ensure boundary matching.
- Extended Grapheme Cluster mode (default):
  - Unicode: Follows Unicode Standard Annex #29.
  - Not Unicode: \X === (?>\r\n|\O).
- Word mode:
  - Currently supported in Unicode only.
  - Follows Unicode Standard Annex #29.
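
The following sketch exercises \h, \R and \X. It assumes Ruby's Regexp engine (Onigmo, an Oniguruma fork), which is assumed to share these three character types. \N, \O and the Word mode of \X are left out because Ruby may not accept them. The constant names are illustrative.

```ruby
HEX_DIGITS = /\h+/   # \h is [0-9a-fA-F]
NEWLINE    = /\R/    # \r\n is consumed as a single newline
GRAPHEME   = /\X/    # one extended grapheme cluster per match

p "#ff0z"[HEX_DIGITS]               # => "ff0"
p "a\r\nb\nc".split(NEWLINE)        # => ["a", "b", "c"]

# "e" followed by U+0301 (combining acute accent) forms one cluster.
p "e\u0301x".scan(GRAPHEME).length  # => 2
```

Splitting on \R produces no empty field between \r and \n, which illustrates that \R does not backtrack from \r\n to \r.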

Character Property

Using Unicode character properties for more specific character class matching.

\p{property-name} - Matches characters with the specified property.
\p{^property-name} / \P{property-name} - Matches characters without the specified property (negative).

Available Property Names:

Works on all encodings: Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower, Print, Punct, Space, Upper, XDigit, Word, ASCII
Works on EUC_JP, Shift_JIS: Hiragana, Katakana
Works on UTF8, UTF16, UTF32: Refer to doc/UNICODE_PROPERTIES for a comprehensive list.
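
As a small illustration of the property syntax, the sketch below uses three of the properties listed as working on all encodings. It assumes Ruby's Regexp engine (Onigmo, an Oniguruma fork) and UTF-8 strings. The constant names are illustrative.

```ruby
DIGIT_RUN     = /\p{Digit}+/
NON_DIGIT_RUN = /\p{^Digit}+/   # negated form, same meaning as \P{Digit}+
NON_UPPER_RUN = /\P{Upper}+/

p "abc123"[DIGIT_RUN]       # => "123"
p "abc123"[NON_DIGIT_RUN]   # => "abc"
p "Hello"[NON_UPPER_RUN]    # => "ello"
```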

4. Quantifiers

Quantifiers specify how many times a preceding element can occur.

Greedy Quantifiers

Match as much as possible.

? - Zero or one time (0 or 1).
* - Zero or more times (0 to infinity).
+ - One or more times (1 to infinity).
{n,m} - At least n and at most m times (inclusive, n <= m).
{n,} - At least n times (n to infinity).
{,n} - At most n times (0 to n, equivalent to {0,n}).
{n} - Exactly n times.

Reluctant (Lazy) Quantifiers

Match as little as possible. Appended with ?.

?? - Zero or one time, reluctantly.
*? - Zero or more times, reluctantly.
+? - One or more times, reluctantly.
{n,m}? - At least n and at most m times, reluctantly (n <= m).
{n,}? - At least n times, reluctantly.
{,n}? - At most n times, reluctantly (equivalent to {0,n}?).

Note: {n}? is a reluctant quantifier only in ONIG_SYNTAX_JAVA and ONIG_SYNTAX_PERL. In ONIG_SYNTAX_ONIGURUMA, /a{n}?/ is equivalent to /(?:a{n})?/.
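
The difference between greedy and reluctant matching is easiest to see side by side. This is a minimal sketch assuming Ruby's Regexp engine (Onigmo, an Oniguruma fork). The constant names are illustrative.

```ruby
GREEDY_TAG    = /<.+>/    # .+ takes as much as possible, then backs off
RELUCTANT_TAG = /<.+?>/   # .+? takes as little as possible

p "<a><b>"[GREEDY_TAG]      # => "<a><b>"
p "<a><b>"[RELUCTANT_TAG]   # => "<a>"

p "aaaa"[/a{2,3}/]    # => "aaa"  (greedy range)
p "aaaa"[/a{2,3}?/]   # => "aa"   (reluctant range)
p "aaaa"[/a{,2}/]     # => "aa"   ({,n} means {0,n})
```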

Possessive Quantifiers

Greedy and do not backtrack once a match is found. Appended with +.

?+ - Zero or one time, possessively.
*+ - Zero or more times, possessively.
++ - One or more times, possessively.
{n,m} (n > m) - At least m and at most n times, possessively. In ONIG_SYNTAX_ONIGURUMA, writing the larger bound first is what marks a range quantifier as possessive.
{n,}+ - At least n times, possessively.
{n}+ - Exactly n times, possessively.

Note: {n,m}+, {n,}+, {n}+ are possessive only in ONIG_SYNTAX_JAVA and ONIG_SYNTAX_PERL. Example: /a*+/ is equivalent to /(?>a*)/ (atomic group).
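
The effect of a possessive quantifier shows up when the rest of the pattern needs a character that the quantifier has already consumed. The sketch below assumes Ruby's Regexp engine (Onigmo, an Oniguruma fork). The constant names are illustrative.

```ruby
BACKTRACKING_A = /a*a/       # a* gives one "a" back so the final a can match
POSSESSIVE_A   = /a*+a/      # a*+ keeps every "a", leaving none for the final a
ATOMIC_A       = /(?>a*)a/   # the equivalent atomic-group spelling

p "aaa".match?(BACKTRACKING_A)   # => true
p "aaa".match?(POSSESSIVE_A)     # => false
p "aaa".match?(ATOMIC_A)         # => false
```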

5. Anchors

Anchors assert positions within the text without consuming characters.

^ - Beginning of the line.
$ - End of the line.
\b - Word boundary:
- Position between a word character (\w) and a non-word character (\W), or at the beginning/end of the string if the first/last characters are word characters.
\B - Non-word boundary:
- Any position that is not a word boundary (\b).
\A - Beginning of the string.
\Z - End of string, or before newline at the very end of the string.
\z - End of string.
\G - Where the current search attempt begins.
\K - Keep:
- Keeps the start position of the result string. The matched portion before \K is effectively discarded from the final match result.
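
The anchors above can be compared with a short sketch. It assumes Ruby's Regexp engine (Onigmo, an Oniguruma fork), where ^ and $ are line anchors as described here. The constant names are illustrative.

```ruby
END_OR_FINAL_NL = /foo\Z/          # end of string, or before a final newline
STRICT_END      = /foo\z/          # end of string only
WHOLE_LINE_CHAR = /^\w$/           # ^ and $ work per line
KEEP_NUMBER     = /price: \K\d+/   # text before \K is dropped from the result
CONTIGUOUS_A    = /\Ga/            # must start where the previous search ended

p "foo\n".match?(END_OR_FINAL_NL)   # => true
p "foo\n".match?(STRICT_END)        # => false
p "a\nb".scan(WHOLE_LINE_CHAR)      # => ["a", "b"]
p "price: 42"[KEEP_NUMBER]          # => "42"
p "aab aa".scan(CONTIGUOUS_A)       # => ["a", "a"]
```

In the last line, scanning stops at "b" because \G only allows a match at the position where the current search attempt begins.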

Text Segment Boundaries

Boundaries related to text segments, affected by the (?y{..}) option.

\y - Text Segment boundary.
\Y - Text Segment non-boundary.

The meaning of \y and \Y depends on the Text Segment mode.
- Extended Grapheme Cluster mode (default):
  - Unicode: Follows Unicode Standard Annex #29.
  - Not Unicode: All positions except between \r and \n.
- Word mode:
  - Currently supported in Unicode only.
  - Follows Unicode Standard Annex #29.

6. Character Class `[...]`

Define custom sets of characters.

^... - Negative class (negation):
- Matches any character not in the set. Lowest precedence within character classes.
x-y - Range:
- Specifies a range of characters from x to y (inclusive).
[...] - Set in character class:
- Allows nesting character classes within character classes.
..&&.. - Intersection:
- Matches characters that are in both sets. Low precedence, only higher than negation (^).
- Example: [a-w&&[^c-g]z] is equivalent to ([a-w] AND ([^c-g] OR z)), resulting in [abh-w].
Note: To use [, -, or ] as literal characters within a character class, escape them with \.
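
The intersection example can be checked by listing which lowercase letters each class accepts. This sketch assumes Ruby's Regexp engine (Onigmo, an Oniguruma fork). The constant names are illustrative.

```ruby
INTERSECTED   = /[a-w&&[^c-g]z]/   # [a-w] AND ([^c-g] OR z)
SPELLED_OUT   = /[abh-w]/          # the equivalent explicit class
ESCAPED_CHARS = /[\[\]\-]/         # literal [, ] and - inside a class

letters = ("a".."z").to_a
p letters.grep(INTERSECTED).join                          # => "abhijklmnopqrstuvw"
p letters.grep(INTERSECTED) == letters.grep(SPELLED_OUT)  # => true
p "a-b]".scan(ESCAPED_CHARS)                              # => ["-", "]"]
```

Note that z is not in the result: it is allowed by the right-hand side but excluded by [a-w] on the left.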

POSIX Bracket Expressions `[:xxxxx:]`

Predefined character classes based on POSIX standards. Can be negated using [:^xxxxx:].

Not Unicode Case:

alnum - Alphanumeric characters.
alpha - Alphabetical characters.
ascii - ASCII characters: code values [0 - 127].
blank - \t, \x20 (space).
cntrl - Control characters.
digit - Decimal digits: 0-9.
graph - Graphic characters (includes all multibyte encoded characters).
lower - Lowercase alphabetical characters.
print - Printable characters (includes all multibyte encoded characters).
punct - Punctuation characters.
space - \t, \n, \v, \f, \r, \x20.
upper - Uppercase alphabetical characters.
xdigit - Hexadecimal digits: 0-9, a-f, A-F.
word - Alphanumeric characters, _, and multibyte characters.

Unicode Case:

alnum - Letter | Mark | Decimal_Number.
alpha - Letter | Mark.
ascii - Unicode code points 0000 - 007F.
blank - Space_Separator | U+0009 (Tab).
cntrl - Control | Format | Unassigned | Private_Use | Surrogate.
digit - Decimal_Number.
graph - [[:^space:]] && ^Control && ^Unassigned && ^Surrogate.
lower - Lowercase_Letter.
print - [[:graph:]] | [[:space:]].
punct - Connector_Punctuation | Dash_Punctuation | Close_Punctuation | Final_Punctuation | Initial_Punctuation | Other_Punctuation | Open_Punctuation.
space - Space_Separator | Line_Separator | Paragraph_Separator | U+0009 | U+000A | U+000B | U+000C | U+000D | U+0085.
upper - Uppercase_Letter.
xdigit - U+0030 - U+0039 | U+0041 - U+0046 | U+0061 - U+0066 (0-9, a-f, A-F).
word - Letter | Mark | Decimal_Number | Connector_Punctuation.
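
POSIX brackets are only valid inside a character class, so they appear as [[:name:]]. A minimal sketch with ASCII input, assuming Ruby's Regexp engine (Onigmo, an Oniguruma fork); the constant names are illustrative.

```ruby
ALPHA_RUN     = /[[:alpha:]]+/
NON_ALPHA_RUN = /[[:^alpha:]]+/   # negated bracket
BLANK_CHAR    = /[[:blank:]]/     # tab or space
XDIGIT_CHAR   = /[[:xdigit:]]/

p "ab12"[ALPHA_RUN]                  # => "ab"
p "ab12"[NON_ALPHA_RUN]              # => "12"
p "a b\tc".scan(BLANK_CHAR).length   # => 2
p "Az09".scan(XDIGIT_CHAR)           # => ["A", "0", "9"]
```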

7. Extended Groups `(?...)`

Extended group syntax provides various functionalities beyond basic grouping.

(?#...) - Comment:
- Ignored during regex processing.
(?imxWDSPy-imxWDSP:subexp) - Option on/off for subexpression:
- Applies options within subexp. Options can be turned on or off by prefixing with -.
  - i - Ignore case.
  - m - Multi-line mode (dot . matches newline).
  - x - Extended mode (whitespace and comments are ignored).
  - W - ASCII-only word (\w, \p{Word}, [[:word:]]).
  - D - ASCII-only digit (\d, \p{Digit}, [[:digit:]]).
  - S - ASCII-only space (\s, \p{Space}, [[:space:]]).
  - P - ASCII-only POSIX properties (includes W, D, S).
    - (alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit, word)
  - y{?} - Text Segment mode:
    - Changes the meaning of \X, \y, \Y. Unicode only.
    - y{g} - Extended Grapheme Cluster mode (default).
    - y{w} - Word mode.
    - See Unicode Standard Annex #29.
(?imxWDSPy-imxWDSP) - Isolated option:
- Applies options to the rest of the pattern from this point or until the next closing parenthesis ).
- Example: /ab(?i)c|def|gh/ is equivalent to /ab(?i:c|def|gh)/.
/(?CIL).../, /(?CIL:...)/ - Whole option:
- Applies options to the entire regular expression. Must be placed at the beginning.
  - C - ONIG_OPTION_DONT_CAPTURE_GROUP.
  - I - ONIG_OPTION_IGNORECASE_IS_ASCII.
  - L - ONIG_OPTION_FIND_LONGEST.
(?:subexp) - Non-capturing group:
- Groups subexp without capturing the matched substring.
(subexp) - Capturing group:
- Groups subexp and captures the matched substring. Assigned a number and can optionally be named.
(?=subexp) - Look-ahead assertion:
- Positive look-ahead. Asserts that subexp matches after the current position, without consuming characters.
(?!subexp) - Negative look-ahead assertion:
- Negative look-ahead. Asserts that subexp does not match after the current position, without consuming characters.
(?<=subexp) - Look-behind assertion:
- Positive look-behind. Asserts that subexp matches before the current position, without consuming characters.
(?<!subexp) - Negative look-behind assertion:
- Negative look-behind. Asserts that subexp does not match before the current position, without consuming characters.
- Limitations (these apply to both look-behind and negative look-behind):
  - Cannot use Absent stopper (?~|expr) and Range clear (?~|) operators in look-behind assertions.
  - Limited ignore-case support: Only supports conversion between single characters. No support for multi-character Unicode conversions.
(?>subexp) - Atomic group:
- Non-backtracking group. Once subexp matches, backtracking into it is prevented.
- Example: /a*+/ is equivalent to /(?>a*)/.
(?<name>subexp), (?'name'subexp) - Named capturing group:
- Defines a capturing group with a name.
- name must consist of word characters (\w).
- Named groups are also assigned numbers like regular capturing groups.
- Multiple groups can share the same name.
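
Some of the groups above are shown in the sketch below: a scoped option, the look-around assertions, and named captures. It assumes Ruby's Regexp engine (Onigmo, an Oniguruma fork). The W, D, S, P and y{..} options and the whole options are not exercised because Ruby may not accept them. The constant names are illustrative.

```ruby
SCOPED_IGNORECASE = /a(?i:b)c/     # only "b" is case-insensitive
LOOKAHEAD_BAR     = /foo(?=bar)/   # "bar" must follow but is not consumed
NEG_LOOKAHEAD_BAR = /foo(?!bar)/   # "bar" must not follow
LOOKBEHIND_DOLLAR = /(?<=\$)\d+/   # digits preceded by a dollar sign
YEAR_MONTH        = /(?<year>\d{4})-(?<month>\d{2})/

p "aBc".match?(SCOPED_IGNORECASE)      # => true
p "ABc".match?(SCOPED_IGNORECASE)      # => false
p "foobar"[LOOKAHEAD_BAR]              # => "foo"
p "foobar".match?(NEG_LOOKAHEAD_BAR)   # => false
p "cost $42"[LOOKBEHIND_DOLLAR]        # => "42"

date = YEAR_MONTH.match("2022-04")
p date[:year]    # => "2022"
p date[:month]   # => "04"
```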

Callouts

Mechanism to execute code during regex matching at specific points.

Callouts of contents:

(?{...contents...}) - Callout in progress:
- Executes contents when the regex engine reaches this point during matching.
(?{...contents...}D) - Callout with direction flag:
- D is a direction flag character:
  - 'X' - In progress and retraction (both forward and backtracking).
  - '<' - In retraction only (backtracking).
  - '>' - In progress only (forward matching).
(?{...contents...}[tag]) - Callout with tag:
- Assigns a tag to the callout.
(?{...contents...}[tag]D) - Callout with tag and direction flag:
- Combines tag and direction flag.
Notes:
- Escape characters have no special meaning within contents.
- contents cannot start with {.
- To allow a run of n consecutive } characters inside contents, delimit the callout with (n+1) consecutive braces on each side. For example, (?{{{...contents...}}}) allows up to two consecutive } in contents.
- tag string characters: _, A-Z, a-z, 0-9 (first character: _, A-Z, a-z).

Callouts of name:

(*name) - Callout by name.
(*name{args...}) - Callout by name with arguments.
(*name[tag]) - Callout by name with tag.
(*name[tag]{args...}) - Callout by name with tag and arguments.

Notes:
- name string characters: _, A-Z, a-z, 0-9 (first character: _, A-Z, a-z).
- tag string characters: _, A-Z, a-z, 0-9 (first character: _, A-Z, a-z).

Absent Functions

Operators related to matching ranges that exclude certain patterns.

(?~absent) - Absent repeater:
- Works like .* (more precisely \O*), but restricted to ranges that do not include matches of absent.
- Abbreviation for (?~|(?:absent)|\O*).
- Uses \O* as the repeater.
(?~|absent|exp) - Absent expression:
- Works like exp, but limited to ranges that do not include matches of absent.
- Example: (?~|345|\d*) on "12345678" results in matches "12", "1", "".
(?~|absent) - Absent stopper:
- Limits the string range to the right of this operator to exclude any matches of absent.
(?~|) - Range clear:
- Clears the effects of previous Absent stoppers.
Note: Nested Absent functions are not supported, and their behavior is undefined.
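
A common use of the absent repeater is matching a delimited block without running past its terminator. The sketch below assumes Ruby 2.4.1 or later, whose Regexp engine (Onigmo, an Oniguruma fork) is assumed to support the (?~absent) form. The other three forms are not shown because Ruby may not accept them. The constant names are illustrative.

```ruby
# (?~\*/) matches any text that does not contain "*/".
C_COMMENT        = %r{/\*(?~\*/)\*/}
GREEDY_C_COMMENT = %r{/\*.*\*/}   # for contrast: .* runs to the last "*/"

SOURCE_LINE = "/* a */ b /* c */"

p SOURCE_LINE[C_COMMENT]          # => "/* a */"
p SOURCE_LINE.scan(C_COMMENT)     # => ["/* a */", "/* c */"]
p SOURCE_LINE[GREEDY_C_COMMENT]   # => "/* a */ b /* c */"
```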

If-Then-Else Conditional Expressions

Conditional regex constructs based on a condition.

(?(condition_exp)then_exp|else_exp) - If-then-else.
(?(condition_exp)then_exp) - If-then.

condition_exp can be:
- A backreference number or name: Checks if the backreference is valid (i.e., the group captured something).
- A normal regular expression: Evaluates the regex as the condition.
When condition_exp is a backreference, then_exp and else_exp can be omitted. In this case, it works as a backreference validity checker.

Backreference validity checker:
- (?(n)), (?(-n)), (?(+n)), (?(n+level)), ... (Number-based backreferences)
- (?(<n>)), (?('-n')), (?(<+n>)), ... (Bracketed number-based backreferences)
- (?(<name>)), (?('name')), (?(<name+level>)), ... (Named backreferences)
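
A conditional on a group number can be sketched as follows, assuming Ruby 2.0 or later (Onigmo, an Oniguruma fork). Only the if-then and if-then-else forms are shown. The bare validity checker, with both branches omitted, is not exercised because Ruby may not accept it. The constant names are illustrative.

```ruby
# If-then: a closing parenthesis is required only if an opening one was captured.
OPTIONAL_PARENS = /\A(\()?\d+(?(1)\))\z/

# If-then-else: ">" closes a "<", otherwise the word must end with "!".
TAG_OR_BANG = /\A(<)?\w+(?(1)>|!)\z/

p "(12)".match?(OPTIONAL_PARENS)   # => true
p "12".match?(OPTIONAL_PARENS)     # => true
p "(12".match?(OPTIONAL_PARENS)    # => false
p "12)".match?(OPTIONAL_PARENS)    # => false

p "<a>".match?(TAG_OR_BANG)        # => true
p "a!".match?(TAG_OR_BANG)         # => true
p "a>".match?(TAG_OR_BANG)         # => false
```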

8. Backreferences

Referencing previously captured groups to match the same text again.

Backreference by number:
- \n - Backreference to the nth capturing group (n >= 1).
- \k<n>, \k'n' - Backreference to the nth capturing group (n >= 1).
- \k<-n>, \k'-n' - Backreference to the nth group counting backwards from the current position (n >= 1).
- \k<+n>, \k'+n' - Backreference to the nth group counting forwards from the current position (n >= 1).
Backreference by name:
- \k<name>, \k'name' - Backreference to the named capturing group name.
If multiple groups have the same name, backreferencing checks the last defined group with that name first, then the previous one, and so on, until a match is found.

Important: Backreference by number is forbidden if any named group is defined and ONIG_OPTION_CAPTURE_GROUP is not set.
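
The three addressing styles can be sketched as below, assuming Ruby's Regexp engine (Onigmo, an Oniguruma fork). Numbered and named references are kept in separate patterns, in line with the restriction just stated. The constant names are illustrative.

```ruby
DOUBLED_CHAR  = /(\w)\1/                  # by number
QUOTED_TEXT   = /(?<q>['"]).*?\k<q>/      # by name: same quote closes the string
RELATIVE_BACK = /(a)(b)\k<-1>/            # relative: the group just before

SENTENCE = 'say "hi" now'

p "hello"[DOUBLED_CHAR]          # => "ll"
p SENTENCE[QUOTED_TEXT]          # => "\"hi\""
p "abb".match?(RELATIVE_BACK)    # => true
p "aba".match?(RELATIVE_BACK)    # => false
```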

Backreference with Recursion Level

Referencing groups based on the recursion level of the regex engine.

\k<n+level>, \k'n+level'
\k<n-level>, \k'n-level'
\k<name+level>, \k'name+level'
\k<name-level>, \k'name-level'

level (>= 0) specifies the recursion level relative to the current position.

Examples:

Ex 1:

/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b>))\z/.match("reee")
/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")

\k<b+0> refers to the (?<b>.) at the same recursion level.

Ex 2: (XML-like tag matching)

r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
(?<element> \g<stag> \g<content>* \g<etag> ){0}
(?<stag> < \g<name> \s* > ){0}
(?<name> [a-zA-Z_:]+ ){0}
(?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
(?<etag> </ \k<name+1> >){0}
\g<element>
__REGEXP__

p r.match("<foo>f<bar>bbb</bar>f</foo>").captures

9. Subexpression Calls (Subexp Calls)

Re-executing a subexpression defined within a group.

Call by number:
- \g<n>, \g'n' - Call the nth capturing group (n >= 1).
- \g<0>, \g'0' - Call group 0 (the entire regular expression).
- \g<-n>, \g'-n' - Call the nth group counting backwards from the current position (n >= 1).
- \g<+n>, \g'+n' - Call the nth group counting forwards from the current position (n >= 1).
Call by name:
- \g<name>, \g'name' - Call the named capturing group name.
Restrictions:
- Left-most recursive calls are not allowed.
  - Error: (?<name>a|\g<name>b)
  - OK: (?<name>a|b\g<name>c)
- Calls to a name assigned to multiple groups are not allowed.
- Call by number is forbidden if any named group is defined and ONIG_OPTION_CAPTURE_GROUP is not set.
- The option status of the called group is always effective.
  - Example: /(?-i:\g<name>)(?i:(?<name>a)){0}/.match("A")
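
A recursive call is the usual way to match nested structures. The sketch below matches one balanced parenthesized group, assuming Ruby's Regexp engine (Onigmo, an Oniguruma fork). The group name paren and the constant name are illustrative. The call is not left-most because a literal ( is consumed before each recursion.

```ruby
BALANCED_PARENS = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

p "(a(b)c)".match?(BALANCED_PARENS)   # => true
p "(a(b)c".match?(BALANCED_PARENS)    # => false (unclosed)
p "()()".match?(BALANCED_PARENS)      # => false (two groups, not one)
```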

10. Captured Group Behavior

Behavior of unnamed capturing groups (...) changes based on options and the presence of named groups. Named groups (?<name>...) are not affected.

Case 1: /.../ (No named groups, no options)
- (...) is treated as a capturing group.
Case 2: /.../g (No named groups, g option - ONIG_OPTION_DONT_CAPTURE_GROUP)
- (...) is treated as a non-capturing group (?:...).
Case 3: /..(?<name>..)../ (Named groups present, no options)
- (...) is treated as a non-capturing group.
- Numbered backreferences/calls are not allowed.
Case 4: /..(?<name>..)../G (Named groups present, G option - ONIG_OPTION_CAPTURE_GROUP)
- (...) is treated as a capturing group.
- Numbered backreferences/calls are allowed.
Where:
- g: ONIG_OPTION_DONT_CAPTURE_GROUP
- G: ONIG_OPTION_CAPTURE_GROUP
ONIG_OPTION_DONT_CAPTURE_GROUP makes unnamed groups non-capturing even when no named group exists. ONIG_OPTION_CAPTURE_GROUP keeps unnamed groups capturing even when named groups are present.
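
Cases 1 and 3 can be observed directly, assuming Ruby's Regexp engine (Onigmo, an Oniguruma fork), which is assumed to compile patterns without ONIG_OPTION_CAPTURE_GROUP. Cases 2 and 4 need the C API options and are not shown. The constant names are illustrative.

```ruby
UNNAMED_ONLY = /(\d+)-(\d+)/           # Case 1: both groups capture
WITH_NAMED   = /(?<first>\d+)-(\d+)/   # Case 3: the unnamed group does not capture

p UNNAMED_ONLY.match("1-2").captures   # => ["1", "2"]
p WITH_NAMED.match("1-2").captures     # => ["1"]
p WITH_NAMED.match("1-2").names        # => ["first"]
```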

A-1. Syntax-Dependent Options

Options that are syntax-specific.

ONIG_SYNTAX_ONIGURUMA (Default Syntax):
- (?m): Dot (.) also matches newline.
ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA:
- (?s): Dot (.) also matches newline.
- (?m): ^ matches after newline, $ matches before newline.
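
The ONIG_SYNTAX_ONIGURUMA meaning of (?m) can be illustrated with Ruby, whose syntax is assumed to follow the same convention (Onigmo, an Oniguruma fork). The constant names are illustrative.

```ruby
DOT_DEFAULT = /a.b/        # dot does not match newline
DOT_WITH_M  = /(?m:a.b)/   # with m, dot also matches newline
LINE_START  = /^b/         # ^ matches after a newline without any option

p "a\nb".match?(DOT_DEFAULT)   # => false
p "a\nb".match?(DOT_WITH_M)    # => true
p "a\nb" =~ LINE_START         # => 2
```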

A-2. Original Extensions (Compared to other regex engines)

Oniguruma's original extensions to regular expression syntax.

Hexadecimal digit character type: \h, \H.
True any character: \O.
Text segment boundary: \y, \Y.
Backreference validity checker: (?(...)).
Named group: (?<name>...), (?'name'...).
Named backreference: \k<name>.
Subexpression call: \g<name>, \g<group-num>.
Absent expression: (?~|...|...).
Absent stopper: (?~|...).

A-3. Missing Features (Compared to Perl 5.8.0)

Features present in Perl 5.8.0 but missing in Oniguruma.

\N{name} (Named Unicode character).
\l, \u, \L, \U (Case modification escapes) and \C (Single-byte match).
(??{code}) (Postponed regular expression, evaluated at match time).
\Q...\E (Quote metacharacters literally) - Not available in ONIG_SYNTAX_ONIGURUMA; effective only in ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.

A-4. Differences with Japanized GNU regex (version 0.12) of Ruby 1.8

Key differences between Oniguruma and the regex engine in Ruby 1.8.

Added character property: \p{property}, \P{property}.
Added hexadecimal digit character type: \h, \H.
Added look-behind assertions: (?<=fixed-width-pattern), (?<!fixed-width-pattern).
Added possessive quantifiers: ?+, *+, ++.
Added character class operations: [...] nesting, && intersection.
- A literal [ must be escaped inside a character class.
Added named groups and subexpression calls.
Octal or hexadecimal number sequences can be treated as multibyte code characters in character classes if multibyte encoding is specified.
- Example: [\xa1\xa2], [\xa1\xa7-\xa4\xa1]
Allowed ranges between single-byte and multibyte characters in character classes.
- Example: /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
Effect range of isolated options extends to the next ).
- Example: (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
Isolated options are not transparent to preceding patterns.
- Example: a(?i)* is a syntax error.
Allowed unpaired left brace { as a normal character.
- Examples: /{/, /({)/, /a{2,3/ etc.
Negative POSIX bracket [:^xxxx:] is supported.
POSIX bracket [:ascii:] is added.
Repeat of look-ahead assertions is not allowed.
- Examples: /(?=a)*/, /(?!b){5}/ are invalid.
Ignore case option (/i) is effective for escape sequences.
- Example: /\x61/i =~ "A"
In range quantifiers, the minimum value is optional.
- /a{,n}/ is equivalent to /a{0,n}/.
- Omission of both minimum and maximum values is not allowed: /a{,}/ is invalid.
/{n}?/ is not a reluctant quantifier.
- /a{n}?/ is equivalent to /(?:a{n})?/.
Invalid backreferences are checked and raise errors.
- Examples: /\1/ and /(a)\2/ are rejected because they refer to groups that do not exist.
Zero-width matches in infinite loops stop the repeat, and changes in capture group status are checked as a stop condition.
- Examples:
  - /(?:()|())*\1\2/ =~ ""
  - /(?:\1a|())*/ =~ "a"
