The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form
symbol ::= expression
Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lowercase letter. Literal strings are quoted.
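As an illustration of the rule form, the following Python sketch splits a rule at the `::=` separator and checks the capitalization convention. The function names `split_rule` and `starts_regular_language` are hypothetical helpers introduced here, and the two sample rules in the usage note are chosen for illustration only.

```python
def split_rule(rule: str) -> tuple[str, str]:
    """Split 'symbol ::= expression' into (symbol, expression)."""
    symbol, sep, expression = rule.partition("::=")
    if not sep:
        raise ValueError(f"not a grammar rule: {rule!r}")
    return symbol.strip(), expression.strip()


def starts_regular_language(symbol: str) -> bool:
    """True if the symbol is written with an initial capital letter."""
    return symbol[:1].isupper()
```

For example, `split_rule("S ::= (#x20 | #x9 | #xD | #xA)+")` yields the symbol `S`, which has an initial capital letter, while a rule such as `document ::= prolog element Misc*` defines a symbol with an initial lowercase letter.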
Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
- #xN
  where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.
- [a-zA-Z], [#xN-#xN]
  matches any Char with a value in the range(s) indicated (inclusive).
- [abc], [#xN#xN#xN]
  matches any Char with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.
- [^a-z], [^#xN-#xN]
  matches any Char with a value outside the range indicated.
- [^abc], [^#xN#xN#xN]
  matches any Char with a value not among the characters given. Enumerations and ranges of forbidden values can be mixed in one set of brackets.
- "string"
  matches a literal string matching that given inside the double quotes.
- 'string'
  matches a literal string matching that given inside the single quotes.
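The character-level expressions above map closely onto regular-expression syntax. The following Python sketch shows one possible translation using the standard `re` module. The helper names (`char_ref`, `char_class_to_regex`, `literal_to_regex`) are hypothetical, and the sketch makes two simplifying assumptions: a hyphen is treated as a range operator only when it sits between two items, and a negated set matches any character Python accepts rather than being restricted to Char.

```python
import re

# One item inside the brackets: a '#xN' reference or any single character.
_ITEM = re.compile(r"#x([0-9A-Fa-f]+)|(.)", re.DOTALL)


def char_ref(token: str) -> str:
    """Turn '#xN' into the character whose code point is hexadecimal N."""
    if not re.fullmatch(r"#x[0-9A-Fa-f]+", token):
        raise ValueError(f"not a #xN expression: {token!r}")
    return chr(int(token[2:], 16))  # int() ignores leading zeros


def _esc(code_point: int) -> str:
    # Emit every character as \UXXXXXXXX so nothing is special in the class.
    return f"\\U{code_point:08X}"


def char_class_to_regex(expr: str) -> str:
    """Translate '[a-zA-Z]', '[^#xN-#xN]', etc. into a Python regex class."""
    if len(expr) < 3 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError(f"not a bracketed set: {expr!r}")
    inner = expr[1:-1]
    negated = inner.startswith("^")
    if negated:
        inner = inner[1:]
    tokens = []  # code points (int), or "-" for a literal hyphen character
    for m in _ITEM.finditer(inner):
        if m.group(1) is not None:
            tokens.append(int(m.group(1), 16))
        elif m.group(2) == "-":
            tokens.append("-")
        else:
            tokens.append(ord(m.group(2)))

    def cp(tok):
        return ord("-") if tok == "-" else tok

    parts, i = [], 0
    while i < len(tokens):
        is_range = i + 2 < len(tokens) and tokens[i + 1] == "-" and tokens[i] != "-"
        if is_range:
            parts.append(f"{_esc(tokens[i])}-{_esc(cp(tokens[i + 2]))}")
            i += 3
        else:
            parts.append(_esc(cp(tokens[i])))
            i += 1
    return "[" + ("^" if negated else "") + "".join(parts) + "]"


def literal_to_regex(expr: str) -> str:
    """Translate "string" or 'string' into a regex matching it literally."""
    if len(expr) < 2 or expr[0] not in "\"'" or expr[-1] != expr[0]:
        raise ValueError(f"not a quoted literal: {expr!r}")
    return re.escape(expr[1:-1])
```

A possible use: `re.fullmatch(char_class_to_regex("[#x20#x9#xD#xA]"), "\t")` succeeds because the tab character (#x9) is among the characters enumerated, and `char_ref("#x0041")` gives the same character as `char_ref("#x41")`.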
These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:
- (expression)
  expression is treated as a unit and may be combined as described in this list.
- A?
  matches A or nothing; optional A.
- A B
  matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).
- A | B
  matches A or B.
- A - B
  matches any string that matches A but does not match B.
- A+
  matches one or more occurrences of A. Concatenation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).
- A*
  matches zero or more occurrences of A. Concatenation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).
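The combining forms can be sketched in the same way. The Python below is a minimal illustration, assuming that A and B are already available as regular-expression strings. All function names are hypothetical. Since Python's `re` module has no direct operator for the difference A - B, the sketch checks it with two separate whole-string matches.

```python
import re


def optional(a: str) -> str:
    """A? : A or nothing."""
    return f"(?:{a})?"


def seq(*parts: str) -> str:
    """A B : each part in order; grouping keeps each part a unit."""
    return "".join(f"(?:{p})" for p in parts)


def alt(*parts: str) -> str:
    """A | B : any one of the parts."""
    return "(?:" + "|".join(f"(?:{p})" for p in parts) + ")"


def plus(a: str) -> str:
    """A+ : one or more occurrences of A."""
    return f"(?:{a})+"


def star(a: str) -> str:
    """A* : zero or more occurrences of A."""
    return f"(?:{a})*"


def matches(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


def matches_difference(a: str, b: str, text: str) -> bool:
    """A - B : text matches A but does not match B."""
    return matches(a, text) and not matches(b, text)
```

With these helpers, `A B | C D` corresponds to `alt(seq("a", "b"), seq("c", "d"))`, which matches `ab` and `cd` but not `abd`, reflecting that concatenation binds more tightly than alternation. For the difference form, taking A as `star("[^<&]")` and B as `seq(star("[^<&]"), re.escape("]]>"), star("[^<&]"))` accepts `plain text` and rejects `x ]]> y`.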
Other notations used in the productions are:
- /* ... */
  comment.