A regular expression is a common way of describing a set of possible strings. It can be seen as a miniature language, although it is expressed in a condensed form. Regular expressions are useful in programming, particularly for checking whether user input matches a desired format. For example, you can imagine a program that asks a user for a ZIP code but which should display an error message when the entered string isn't a 5-digit number. Regular expressions are also quite handy in editing text; any professional-grade word processor or text editor has a find-and-replace feature that accepts regular expressions.
The simplest form of regular expression is a sequence of letters
or digits. For example, the pattern “amb” describes the set
containing just the string amb and no other strings.
But there are a number of special symbols one can use. One is
the set of brackets ‘[’ and ‘]’, which indicates a
choice among several characters.
If we want to allow the a to be capitalized in the above
example, then we could use “[Aa]mb” as our pattern.
Or “b[aeiou]g” would indicate the set
with bag, beg, big, bog, and bug.
You can also indicate a range of characters inside the brackets using a
hyphen; thus “a[a-z]b” will match any string consisting of an
a followed by a lower-case letter followed by a b, as
in aab, amb, arb, and 23 others.
If you want all strings with two capital letters, similar to the
United States state codes, you could use “[A-Z][A-Z]”.
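
The bracket patterns above can be tried out with a short sketch. It assumes Python's re module purely for illustration, and the helper name matches is hypothetical; re.fullmatch is used so that a pattern must account for the entire string, in keeping with the idea of a pattern describing a set of strings.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the whole of text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# A bracketed set allows exactly one character chosen from the set.
print(matches("[Aa]mb", "amb"))      # True
print(matches("[Aa]mb", "Amb"))      # True
print(matches("[Aa]mb", "AMB"))      # False: only the first letter may vary

# b[aeiou]g accepts the five vowel variants and nothing else.
candidates = ["bag", "beg", "big", "bog", "bug", "byg"]
vowel_words = [w for w in candidates if matches("b[aeiou]g", w)]
print(vowel_words)                   # ['bag', 'beg', 'big', 'bog', 'bug']

# A hyphen inside the brackets denotes a range of characters.
print(matches("a[a-z]b", "arb"))     # True
print(matches("a[a-z]b", "aRb"))     # False: R is not a lower-case letter
print(matches("[A-Z][A-Z]", "NY"))   # True
print(matches("[A-Z][A-Z]", "Ny"))   # False
```
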
It is also sometimes useful to allow any character to appear
in a particular spot in a regular expression. A period
‘.’ will match any single character.
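
As a small illustration of the period, here is a sketch assuming Python's re module; the pattern “b.g” is a made-up example rather than one from the text. One detail specific to Python (and an assumption about the engine you use): by default its period matches any character except a newline, unless the re.DOTALL flag is given.

```python
import re

pattern = "b.g"  # b, then any one character, then g (illustrative pattern)

# The period stands for exactly one character, whatever it is.
for text in ["bag", "b9g", "b g", "bg", "baag"]:
    print(repr(text), re.fullmatch(pattern, text) is not None)

# Python-specific detail: the period skips newlines unless DOTALL is set.
newline_default = re.fullmatch(pattern, "b\ng") is not None
newline_dotall = re.fullmatch(pattern, "b\ng", re.DOTALL) is not None
print(newline_default)  # False
print(newline_dotall)   # True
```
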
You can use a set of braces (‘{’ and ‘}’) to
specify that something appear a particular number of times within a
regular expression. You would place the braces directly after the
pattern that should be repeated that number of times, and the number
would appear within the braces. Another way to write our two-letter
state code pattern is as “[A-Z]{2}”.
Or we can match 5-digit ZIP codes using the pattern “[0-9]{5}”.
A set of braces can also contain two numbers inside, specifying that
the number of repetitions of the preceding pattern must be between those
two numbers, inclusive. The set of integers from 100 to 99999, for
example, may be described with “[1-9][0-9]{2,4}”: one nonzero
digit followed by two to four more digits.
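
The brace patterns can be checked with a brief sketch, again assuming Python's re module and a hypothetical helper named matches. The sample strings are made up for illustration.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the whole of text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# {2} repeats the preceding bracketed set exactly twice.
print(matches("[A-Z]{2}", "TX"))       # True
print(matches("[A-Z]{2}", "TEX"))      # False: three letters

# {5} demands exactly five digits.
print(matches("[0-9]{5}", "72032"))    # True
print(matches("[0-9]{5}", "7203"))     # False: only four digits

# {2,4} allows two, three, or four repetitions, so the whole pattern
# covers three- to five-digit numbers with no leading zero.
samples = ["99", "100", "4096", "99999", "100000", "0500"]
in_range = [s for s in samples if matches("[1-9][0-9]{2,4}", s)]
print(in_range)                        # ['100', '4096', '99999']
```
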
You can use a set of parentheses to specify an order of operations.
This can be useful when you want a pattern of potentially several
characters to repeat multiple times.
We can extend our ZIP code pattern to allow either a 5-digit or
9-digit ZIP code using “[0-9]{5}(-[0-9]{4}){0,1}”.
Here, we've used parentheses to enclose the final dash and four digits,
saying that this portion is allowed to occur either 0 or 1 times.
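
To see the effect of the grouping, the following sketch tests the extended ZIP code pattern against a few made-up inputs. It assumes Python's re module; the names ZIP_PATTERN and is_zip are hypothetical.

```python
import re

# Five digits, optionally followed by a dash and four more digits.
ZIP_PATTERN = "[0-9]{5}(-[0-9]{4}){0,1}"

def is_zip(text: str) -> bool:
    """Return True if text is a 5-digit or 9-digit ZIP code (hypothetical helper)."""
    return re.fullmatch(ZIP_PATTERN, text) is not None

print(is_zip("72032"))             # True
print(is_zip("72032-1234"))        # True
print(is_zip("72032-123"))         # False: the group needs four digits
print(is_zip("720321234"))         # False: the dash is part of the group
print(is_zip("72032-1234-5678"))   # False: the group may occur at most once
```

Because the braces follow the closing parenthesis, the repetition count applies to the whole group (dash plus four digits) rather than to the last bracketed set alone.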
There is a shorthand for saying 0 or 1 times: You can use a question
mark (‘?’) in place of {0,1}. Similarly, you can use
an asterisk (‘*’) to indicate that the preceding pattern can
repeat any number of times (including the possibility that it may not
appear at all); and you can use a plus sign (‘+’) to indicate
that the preceding pattern must appear at least once but may be
repeated beyond that. If I wanted to describe the set of strings
representing integers, I could use the regular expression
“-?[0-9]+”: It says that the string may begin with a negative
sign, but that it must contain at least one digit.
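
The three shorthand symbols can be compared side by side in a short sketch. It assumes Python's re module and a hypothetical helper named matches; the pattern “-?[0-9]*” is not from the text and is included only to contrast the asterisk with the plus sign.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the whole of text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# -?[0-9]+ : an optional minus sign, then at least one digit.
samples = ["42", "-42", "007", "-", "", "4-2"]
integers = [s for s in samples if matches("-?[0-9]+", s)]
print(integers)   # ['42', '-42', '007']

# Swapping + for * permits zero digits, so "-" and "" now slip through.
loose = [s for s in samples if matches("-?[0-9]*", s)]
print(loose)      # ['42', '-42', '007', '-', '']

# ? is shorthand for {0,1}: both ZIP patterns accept the same strings.
for text in ["72032", "72032-1234", "72032-"]:
    long_form = matches("[0-9]{5}(-[0-9]{4}){0,1}", text)
    short_form = matches("[0-9]{5}(-[0-9]{4})?", text)
    print(text, long_form, short_form)
```
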
Finally, you'll occasionally want a regular expression to mention one
of the characters that have a special meaning within a regular
expression. You can do this by preceding it with a backslash.
For instance, a regular expression one could use for describing e-mail
addresses is “[a-z]+@[a-z]+(\.[a-z]+)+”.
This says that there must be one or more letters before the ‘@’
sign, followed by one or more letters, followed by one or more instances
of a period followed by one or more letters. The period must be preceded
by a backslash, because otherwise preg_match will allow any
character in place of the period. (This regular expression is deficient
because some institutions use periods before the ‘@’ sign in e-mail
addresses, and some domain names contain hyphens.)
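
The difference the backslash makes can be seen in a short sketch. It assumes Python's re module rather than preg_match, and the names EMAIL, EMAIL_UNESCAPED, and matches are hypothetical, as are the sample addresses. The patterns are written as raw strings (the r prefix) so that Python passes the backslash through to the regular expression unchanged.

```python
import re

EMAIL = r"[a-z]+@[a-z]+(\.[a-z]+)+"           # period escaped: a literal '.'
EMAIL_UNESCAPED = r"[a-z]+@[a-z]+(.[a-z]+)+"  # period unescaped: any character

def matches(pattern: str, text: str) -> bool:
    """Return True only if the whole of text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches(EMAIL, "user@example.com"))            # True
print(matches(EMAIL, "user@mail.example.com"))       # True

# Without the backslash, any character may stand in for the period.
print(matches(EMAIL, "user@example_com"))            # False
print(matches(EMAIL_UNESCAPED, "user@example_com"))  # True

# The deficiencies noted in the text: both of these are rejected.
print(matches(EMAIL, "first.last@example.com"))      # False
print(matches(EMAIL, "user@my-site.com"))            # False
```
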
Below is a summary of all of the special characters we have seen.
()   grouping
[]   range of characters
.    any character
{}   copies of the preceding pattern
?    zero or one of the preceding pattern
*    any number of the preceding pattern (including zero)
+    at least one of the preceding pattern
\    treat next character literally instead of as a special symbol
There are many other options available with regular expressions, but this list is adequate for most purposes.