Many strings have a structure, pattern, or logic that can be used to identify and validate data. Regular expressions (regex) are a means of identifying strings that meet some such structure. This tutorial will go through a regex example that identifies strings that are in a valid email structure.
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
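As a quick sanity check, the pattern above can be tried against a few strings. This is only a minimal sketch: the sample addresses are hypothetical and were chosen for illustration.

```javascript
// The email pattern from this tutorial, written as a JavaScript regex literal.
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Hypothetical sample inputs, not taken from any real data set.
const samples = ["jane.doe@example.com", "JANE@EXAMPLE.COM", "no-at-sign.com"];

// test() returns true when the whole string fits the pattern.
const results = samples.map((s) => emailPattern.test(s));
console.log(results); // [ true, false, false ]
```

The uppercase address fails because the pattern only lists lowercase letters and no case-insensitive flag is set.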
- Anchors
- Quantifiers
- OR Operator
- Character Classes
- Flags
- Grouping and Capturing
- Bracket Expressions
- Greedy and Lazy Match
- Boundaries
- Back-references
- Look-ahead and Look-behind
Anchors define the boundaries of the strings you are attempting to capture. Both ^ and $ are present in our example.
^ signifies that a string must begin with the exact character that follows it, or with any character allowed by the bracket expression that follows it. So, for example, ^S would match Spaghetti and Saucy but not Panini, and ^[A-Z] will match any string that begins with a capital letter.
$ works the same way, except that it signifies the character or pattern that a string must end with to be matched. So y$ would match Scary, Sporty, and Baby but not Ginger or Posh, while [a-z]$ will match any of the Spice Girls.
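The anchor examples above can be checked directly in JavaScript. The variable names here are made up for this sketch.

```javascript
// ^S: the string must start with a capital S.
const startsWithS = ["Spaghetti", "Saucy", "Panini"].filter((w) => /^S/.test(w));

// y$: the string must end with a lowercase y.
const names = ["Scary", "Sporty", "Baby", "Ginger", "Posh"];
const endsWithY = names.filter((n) => /y$/.test(n));

// [a-z]$: the string must end with any lowercase letter.
const endsWithLower = names.filter((n) => /[a-z]$/.test(n));

console.log(startsWithS);          // [ 'Spaghetti', 'Saucy' ]
console.log(endsWithY);            // [ 'Scary', 'Sporty', 'Baby' ]
console.log(endsWithLower.length); // 5
```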
Quantifiers let you tune your matches by controlling the quantity of any given pattern in your regex. There are four quantifiers that cover six basic kinds of matches:
* - Optionally match. Returns cases where a pattern is matched 0 or more times.
+ - Must match. Returns cases where the pattern is matched 1 or more times.
? - Matches at most once. Returns cases where the pattern is matched 0 or 1 times.
{} - Sets a minimum, a maximum, or an exact count:
{ n } - sets the exact number of times a match must occur
{ n, } - sets the minimum number of times a match can occur
{ n, x } - sets both the minimum and maximum number of times a match can occur
Our example contains three uses of quantifiers, drawn from two different quantifiers: + and { n, x }.
# Example 1 using +
# This matches any lowercase letter, any number, underscore, period (escaped), or hyphen one or more times
[a-z0-9_\.-]+
# Example 2 using +
# \d is equivalent to [0-9] so this matches any number, any lowercase letter, a period, or hyphen one or more times
[\da-z\.-]+
# Example 3 using {}
# This matches any lowercase letter or period that appears at least 2 times and at most 6 times
[a-z\.]{2,6}
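To see how each quantifier changes what is accepted, here is a small sketch that applies all six forms to the same inputs. The helper name matching and the test strings are assumptions made up for this illustration.

```javascript
// Each input is an "a" followed by zero to four "b" characters.
const inputs = ["a", "ab", "abb", "abbb", "abbbb"];

// Hypothetical helper: which inputs does a given regex accept in full?
function matching(re) {
  return inputs.filter((s) => re.test(s));
}

const zeroOrMore = matching(/^ab*$/);    // all five inputs
const oneOrMore = matching(/^ab+$/);     // everything except "a"
const zeroOrOne = matching(/^ab?$/);     // "a" and "ab"
const exactlyTwo = matching(/^ab{2}$/);  // "abb"
const atLeastTwo = matching(/^ab{2,}$/); // "abb", "abbb", "abbbb"
const twoToThree = matching(/^ab{2,3}$/); // "abb", "abbb"

console.log(zeroOrOne, exactlyTwo, twoToThree);
```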
Regular expressions have a number of built-in character classes that help define common groups of characters you want to match. Three of the most commonly used are \d, \s, and \w.
\d matches any digit character (so any Arabic numeral).
\s matches any whitespace character (spaces, tabs, line breaks).
\w matches any word character (any Latin alphabet letter, Arabic numeral, or underscore).
Additionally, regex has built-in inverse classes for each of the above, which match anything BUT the character class in question. For example:
\D matches any character that is not a digit.
\S matches any character that is not a whitespace character.
\W matches any character that is not a word character (characters from non-Latin alphabets will match this).
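The classes and their inverses can be compared side by side on one string. The sample text is hypothetical and includes a tab and an underscore on purpose.

```javascript
// Hypothetical sample text containing spaces, a tab, digits and an underscore.
const sampleText = "Room 42_b\tis open";

const digits = sampleText.match(/\d/g);      // every digit character
const whitespace = sampleText.match(/\s/g);  // two spaces and one tab
const wordRuns = sampleText.match(/\w+/g);   // runs of word characters

// The inverse class \D makes it easy to strip everything except digits.
const digitsOnly = sampleText.replace(/\D/g, "");

console.log(digits);            // [ '4', '2' ]
console.log(whitespace.length); // 3
console.log(wordRuns);          // [ 'Room', '42_b', 'is', 'open' ]
console.log(digitsOnly);        // 42
```

Note that 42_b stays in one piece because the underscore counts as a word character.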
Flags are additional parameters that affect the search a regex performs. They are written after the closing slash of a regex literal. This tutorial covers seven of the flags that JavaScript supports.
/d indicates that the start and end indices of each match should also be returned.
/g indicates a global search, i.e. all matches should be returned. If it is not present, only the first match will be returned.
/i performs a case-insensitive search, so X is equivalent to x.
/m and /s are similar in that both control how line breaks are handled. /m allows the anchors ^ and $ to match at the start and end of each line, that is, immediately after or before a newline character such as \n or \r, rather than only at the start and end of the whole string. /s allows the . special character to match these newline characters.
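/u enables Unicode mode, so the pattern is treated as a sequence of Unicode code points and a character outside the basic range is matched as a single character.
/y performs a sticky match. It tells your regex to attempt a match only at the position given by its lastIndex property, which can be set manually.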
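The effect of most of these flags is easiest to see by running the same pattern with and without them. The following is a rough sketch on a made-up three-line string; the d flag is left out to keep it short.

```javascript
// Hypothetical three-line input.
const lines = "Cat\ncat\nCAT";

// g returns every match; adding i ignores case.
const exactAll = lines.match(/cat/g);
const anyCaseAll = lines.match(/cat/gi);

// Without m, ^ and $ only match at the ends of the whole string.
const wholeString = lines.match(/^cat$/gi);
const perLine = lines.match(/^cat$/gim);

// Without s, the dot does not match the newline between "Cat" and "cat".
const dotDefault = /Cat.cat/.test(lines);
const dotAll = /Cat.cat/s.test(lines);

// With u, a character outside the basic range counts as one character.
const emoji = "\u{1F600}";
const oneCharDefault = /^.$/.test(emoji);
const oneCharUnicode = /^.$/u.test(emoji);

// y only tries to match at lastIndex (0 unless set manually).
const stickyAtZero = /cat/y.test(lines);
const sticky = /cat/y;
sticky.lastIndex = 4; // index where the lowercase "cat" starts
const stickyAtFour = sticky.test(lines);

console.log(exactAll, anyCaseAll, wholeString, perLine);
console.log(dotDefault, dotAll, oneCharDefault, oneCharUnicode);
console.log(stickyAtZero, stickyAtFour);
```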
A capture group defines a discrete pattern that needs to be matched, and remembers the text it matched. Capture groups are wrapped in parentheses and can contain exact strings and number patterns, bracket expressions, or some combination thereof.
For example, the capture group (abc) will match the string abc but not acb or bac.
Our example has three capture groups that use bracket expressions (see below):
([a-z0-9_\.-]+) #this capture group matches any string with length greater than 0 that can contain any combination of lowercase latin letters, numbers, underscores, dashes or periods
([\da-z\.-]+) #this capture group matches any string with length greater than 0 that may contain a digit, lowercase letter, period, or dash in any combination
([a-z\.]{2,6}) #this capture group matches any string containing only lowercase latin letters or periods that is between two and six characters long.
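In JavaScript, the text captured by each group is returned alongside the full match. A minimal sketch, assuming a hypothetical address:

```javascript
const emailRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// match() returns the whole match at index 0 and each capture group after it.
const found = "jane.doe@example.com".match(emailRegex);
const [whole, user, domain, tld] = found;

console.log(whole);  // jane.doe@example.com
console.log(user);   // jane.doe
console.log(domain); // example
console.log(tld);    // com
```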
Bracket expressions indicate that certain characters are interchangeable for the purpose of matching: the expression matches any one of the characters listed inside the brackets. For example, [abc] will match a, b, or c.
[a-z0-9_\.-] # matches a lowercase Latin letter, digit, underscore, dash, or period
[\da-z\.-] # matches a digit, lowercase letter, period, or dash
[a-z\.] # matches a lowercase Latin letter or period
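A short sketch of how a bracket expression behaves on its own and with a quantifier. The candidate strings are invented for this example.

```javascript
// [abc] stands for exactly one character: a, b, or c.
const singleChar = ["a", "b", "c", "d", "ab"].filter((s) => /^[abc]$/.test(s));

// With +, the first bracket expression from the email pattern accepts whole strings.
const userPart = /^[a-z0-9_\.-]+$/;
const lowerOk = userPart.test("jane_doe-99");
const upperRejected = userPart.test("Jane");

console.log(singleChar);             // [ 'a', 'b', 'c' ]
console.log(lowerOk, upperRejected); // true false
```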
Greedy and lazy are descriptors concerning how a given pattern will search for a match. Often the same regex will have many potential matches within the same text. A regex quantifier is said to be greedy if it seeks the longest possible sequence satisfying its pattern, and lazy if it seeks the shortest possible sequence satisfying its pattern.
For example, given the string
'abcdddddddddddQ'
greedy regex looks like this
regex = 'abc(d*)'
#will match
'abcddddddddddd'
lazy regex looks like this
regex = 'abc(d*?)'
#will match (d*? is satisfied by zero d characters; abc(d+?) would match 'abcd')
'abc'
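The difference is easy to confirm in JavaScript. Note that a lazy d*? is allowed to match zero characters, so this sketch also includes d+? to show the shortest match that still contains a d.

```javascript
const input = "abcdddddddddddQ";

// Greedy: d* takes as many d characters as it can.
const greedy = input.match(/abc(d*)/)[0];

// Lazy: d*? takes as few as it can, which is zero.
const lazyStar = input.match(/abc(d*?)/)[0];

// Lazy with +: at least one d is required, so exactly one is taken.
const lazyPlus = input.match(/abc(d+?)/)[0];

console.log(greedy === input.slice(0, -1)); // true: everything except the Q
console.log(lazyStar); // abc
console.log(lazyPlus); // abcd
```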
Boundaries here refers specifically to the regex special character \b, which matches word boundaries. A word boundary is a position where there is a change from a word character \w to a non-word character \W, or the reverse. Word characters include any uppercase letters, lowercase letters, numbers, and the underscore. Non-word characters are anything else, which in particular includes spaces and punctuation.
This can be used to identify words in blocks of text.
For example, a global search with
\b[\w]+\b
in
"Hello there, General Kenobi."
will match
'Hello'
'there'
'General'
'Kenobi'
Note that the boundary character also matched to the word at the beginning of the text even though it is preceded by no characters at all.
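The word-splitting example can be reproduced with a global match, and \b is also handy for whole-word searches. A small sketch:

```javascript
const sentence = "Hello there, General Kenobi.";

// With the g flag, match() returns every word in the sentence.
const words = sentence.match(/\b[\w]+\b/g);

// \b on both sides means the pattern must be a whole word.
const hasThe = /\bthe\b/.test(sentence);     // "the" only occurs inside "there"
const hasThere = /\bthere\b/.test(sentence);

console.log(words);            // [ 'Hello', 'there', 'General', 'Kenobi' ]
console.log(hasThe, hasThere); // false true
```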
You can reuse previously defined capture groups using back-references. Capture groups are numbered in the order their opening parentheses appear, starting at 1, so \1 refers to the first group and \2 to the second.
There are none present in our main example, so let's create one.
(abc)(ABC)(\1)(\2)
will match
abcABCabcABC
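Here is a sketch of the same pattern in JavaScript, with ^ and $ added (an assumption for this example) so that only the whole string counts. The second pattern is a hypothetical extra use: spotting a word that is accidentally repeated.

```javascript
// \1 must repeat what group 1 matched, \2 what group 2 matched.
const repeated = /^(abc)(ABC)(\1)(\2)$/;
const inOrder = repeated.test("abcABCabcABC");
const swapped = repeated.test("abcABCABCabc");

// A back-reference can also refer to text that is not known in advance.
const doubledWord = /\b(\w+) \1\b/;
const doubled = "it was was fine".match(doubledWord)[0];

console.log(inOrder, swapped); // true false
console.log(doubled);          // was was
```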
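Look-ahead and look-behind are not intuitively named. A look-behind, (?<=pattern), requires that the match be immediately preceded by pattern, and a look-ahead, (?=pattern), requires that it be immediately followed by pattern; in neither case is pattern itself included in the match. I often think of them as starts with and ends with instead: for simple matches, a look-behind is effectively "starts right after pattern" and a look-ahead is effectively "ends right before pattern" in many use cases.
So what are some examples where this is useful?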
#Capturing strings that are in quotation marks, parentheses, or brackets
(?<=["])[a-z ]*(?=")
(?<=[(])[a-z ]*(?=[)])
(?<=[[])[a-z ]*(?=[\]])
#Capturing questions
[a-zA-Z0-9\,()"';: ]*(?=[?])
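These patterns can be tried in JavaScript as follows. The input strings are invented for this sketch, and the closing delimiters are written as [)] and [\]] so that each look-ahead checks for the closing character rather than the opening one.

```javascript
// Text between double quotes: preceded by " and followed by ".
const quoted = 'she said "hello there" twice'.match(/(?<=["])[a-z ]*(?=")/)[0];

// Text between parentheses and between square brackets.
const inParens = "call me (maybe later) ok".match(/(?<=[(])[a-z ]*(?=[)])/)[0];
const inBrackets = "see [the notes] here".match(/(?<=[[])[a-z ]*(?=[\]])/)[0];

// A question: everything up to, but not including, the question mark.
const question = "Is this it? Yes.".match(/[a-zA-Z0-9\,()"';: ]*(?=[?])/)[0];

console.log(quoted);     // hello there
console.log(inParens);   // maybe later
console.log(inBrackets); // the notes
console.log(question);   // Is this it
```

In each case the delimiters are checked but left out of the returned text, which is the main reason to reach for look-around instead of a plain pattern.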
Ryan Schubert is a coder, researcher, statistician, and now developer. He actually first encountered regex in a Computational Biology lab, where it was used to capture valid gene names.