CSci 150: Foundations of computer science

Regular expressions

A regular expression is a common way of describing a set of possible strings. It can be seen as a miniature language, although it is expressed in a condensed form. Regular expressions are useful in programming, particularly for checking whether user input matches a desired format. For example, imagine a program that asks a user for a ZIP code and should display an error message when the entered string isn't a 5-digit number. Regular expressions are also quite handy in editing text; most professional-grade word processors and text editors have a find-and-replace feature that accepts regular expressions.

Fixed-length expressions

The simplest regular expression is a sequence of letters or digits. For example, the pattern “amb” describes the set containing the single string amb and nothing else.

But there are a number of special symbols one can use. One is the set of brackets ‘[’ and ‘]’, which indicates a choice among several characters. If we want to allow the a to be capitalized in the above example, then we could use “[Aa]mb” as our pattern. Or “b[aeiou]g” would indicate the set with bag, beg, big, bog, and bug.
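As a minimal sketch, assuming Python's built-in re module (this page does not prescribe a language), re.fullmatch tests whether an entire string belongs to the set a pattern describes. The helper name in_set is made up for illustration.

```python
import re

def in_set(pattern, text):
    """Return True if the whole of text is in the set described by pattern."""
    return re.fullmatch(pattern, text) is not None

# Brackets offer a choice among the characters listed inside them.
print(in_set("[Aa]mb", "amb"))     # True
print(in_set("[Aa]mb", "Amb"))     # True
print(in_set("[Aa]mb", "emb"))     # False
print(in_set("b[aeiou]g", "bug"))  # True
print(in_set("b[aeiou]g", "byg"))  # False
```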

You can also indicate a range of characters inside the brackets using a hyphen; thus “a[a-z]b” will match any three-character string consisting of an a, then a lower-case letter, then a b, as in aab, amb, arb, and 23 others. If you want all strings with two capital letters, similar to the United States state codes, you could use “[A-Z][A-Z]”.
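The count of 26 strings can be checked by brute force. The following is a sketch in Python, again assuming the standard re module, that tries every lower-case letter in the middle position and then tests a few candidate state codes.

```python
import re
import string

pattern = re.compile("a[a-z]b")

# Try every lower-case letter in the middle position.
candidates = [f"a{c}b" for c in string.ascii_lowercase]
matches = [s for s in candidates if pattern.fullmatch(s)]
print(len(matches))  # 26
print(matches[:3])   # ['aab', 'abb', 'acb']

state = re.compile("[A-Z][A-Z]")
print(bool(state.fullmatch("MN")))   # True
print(bool(state.fullmatch("Mn")))   # False: second letter is lower-case
print(bool(state.fullmatch("MNX")))  # False: three letters, not two
```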

It is also sometimes useful to allow any character to appear in a particular spot in a regular expression. A period ‘.’ will match any single character.
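A short Python sketch of the period, using the illustrative pattern “b.g” (not one from the text above). One caveat that is specific to Python's re module: by default the period matches any character except a newline.

```python
import re

pattern = re.compile("b.g")

# The period stands for exactly one character: a letter, digit, space, etc.
for candidate in ["bag", "b9g", "b g", "bg", "baag"]:
    print(repr(candidate), bool(pattern.fullmatch(candidate)))
# 'bag' True
# 'b9g' True
# 'b g' True
# 'bg' False
# 'baag' False
```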

Repetition

You can use a set of braces (‘{’ and ‘}’) to specify that a pattern must appear a particular number of times. Place the braces directly after the pattern to be repeated, with the number of repetitions inside them. Another way to write our two-letter state code pattern is as “[A-Z]{2}”. Or we can match 5-digit ZIP codes using the pattern “[0-9]{5}”.
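The ZIP code check described in the introduction might look like the following in Python. This is only a sketch: the function name check_zip and the wording of the error message are assumptions made for illustration.

```python
import re

ZIP_PATTERN = re.compile("[0-9]{5}")

def check_zip(entered):
    """Return an error message, or None if entered is a 5-digit ZIP code."""
    if ZIP_PATTERN.fullmatch(entered) is None:
        return "Please enter a 5-digit ZIP code."
    return None

print(check_zip("56321"))  # None
print(check_zip("5632"))   # Please enter a 5-digit ZIP code.

# “[A-Z]{2}” accepts the same strings as “[A-Z][A-Z]”.
print(bool(re.fullmatch("[A-Z]{2}", "MN")))  # True
```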

A set of braces can also contain two numbers inside, specifying that the number of repetitions of the preceding pattern must be between those two numbers. The set of numbers between 100 and 99999, for example, may be described with “[1-9][0-9]{2,4}”.
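The claim about 100 through 99999 can be verified exhaustively. Here is a sketch in Python that tests the decimal form of every integer below 200000 (an arbitrary upper bound chosen for the check).

```python
import re

pattern = re.compile("[1-9][0-9]{2,4}")

# Collect every integer from 0 to 199999 whose decimal form matches.
matched = [n for n in range(200000) if pattern.fullmatch(str(n))]
print(matched[0], matched[-1], len(matched))  # 100 99999 99900

# A leading zero is rejected because the first character must be 1-9.
print(bool(pattern.fullmatch("042")))  # False
```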

You can use a set of parentheses to specify an order of operations. This can be useful when you want a pattern of potentially several characters to repeat multiple times. We can extend our ZIP code pattern to allow either a 5-digit or 9-digit ZIP code using “[0-9]{5}(-[0-9]{4}){0,1}”. Here, we've used parentheses to enclose the final dash and four digits, saying that this portion is allowed to occur either 0 or 1 times.
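To see the grouping at work, the sketch below (Python's re module assumed, sample ZIP codes invented) runs the extended pattern against a few inputs. In Python a parenthesized part is also remembered as a numbered group, which is a feature of that library rather than something this page relies on.

```python
import re

zip_pattern = re.compile("[0-9]{5}(-[0-9]{4}){0,1}")

for entered in ["56321", "56321-0001", "56321-", "56321-01", "563210001"]:
    print(entered, bool(zip_pattern.fullmatch(entered)))
# 56321 True
# 56321-0001 True
# 56321- False
# 56321-01 False
# 563210001 False

# Group 1 holds the optional part, or None when it did not occur.
print(zip_pattern.fullmatch("56321-0001").group(1))  # -0001
print(zip_pattern.fullmatch("56321").group(1))       # None
```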

There is a shorthand for saying 0 or 1 times: You can use a question mark (‘?’) in place of {0,1}. Similarly, you can use an asterisk (‘*’) to indicate that the preceding pattern can repeat any number of times (including the possibility that it may not appear at all); and you can use a plus sign (‘+’) to indicate that the preceding pattern must appear at least once but may be repeated beyond that. To describe the set of strings that represent integers, I could use the regular expression “-?[0-9]+”: It says that the string may begin with a negative sign, but that it must contain at least one digit.
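A Python sketch of the three shorthands, using the integer pattern from this paragraph. The last two lines contrast ‘+’ with ‘*’ on the empty string.

```python
import re

integer = re.compile("-?[0-9]+")

for text in ["42", "-7", "007", "-", "", "4-2", "+5"]:
    print(repr(text), bool(integer.fullmatch(text)))
# '42' True
# '-7' True
# '007' True
# '-' False   (at least one digit is required)
# '' False
# '4-2' False (the sign may appear only at the start)
# '+5' False  (the pattern allows only a negative sign)

print(bool(re.fullmatch("[0-9]+", "")))  # False: + needs at least one digit
print(bool(re.fullmatch("[0-9]*", "")))  # True: * allows zero repetitions
```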

Summary

Finally, you'll occasionally want a regular expression to mention one of the characters that have a special meaning within a regular expression. You can do this by preceding it with a backslash. For instance, a regular expression one could use for describing e-mail addresses is “[a-z]+@[a-z]+(\.[a-z]+)+”. This says that there must be one or more letters, then the ‘@’ sign, then one or more letters, then one or more instances of a period followed by one or more letters. The period must be preceded by a backslash, because otherwise a matching function such as PHP's preg_match will allow any character in place of the period. (This regular expression is deficient because some institutions use periods in the part of the address before the ‘@’ sign, and some domain names contain hyphens.)
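The effect of the backslash can be seen by comparing the pattern with and without it. The sketch below uses Python's re module rather than preg_match, and the sample addresses are invented. The r prefix makes a raw string, so Python passes the backslash through to the pattern unchanged.

```python
import re

escaped = re.compile(r"[a-z]+@[a-z]+(\.[a-z]+)+")
unescaped = re.compile(r"[a-z]+@[a-z]+(.[a-z]+)+")

print(bool(escaped.fullmatch("pat@example.com")))    # True
print(bool(escaped.fullmatch("pat@example!com")))    # False
print(bool(unescaped.fullmatch("pat@example!com")))  # True: '.' matched '!'

# The two deficiencies noted above: both addresses are rejected.
print(bool(escaped.fullmatch("pat.lee@example.com")))  # False
print(bool(escaped.fullmatch("pat@my-site.org")))      # False
```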

Below is a summary of all of the special characters we have seen.

( )    grouping
[ ]    choice among a set or range of characters
.      any single character
{ }    a specified number of copies of the preceding pattern
?      zero or one of the preceding pattern
*      any number of the preceding pattern (including zero)
+      at least one of the preceding pattern
\      treat the next character literally instead of as a special symbol

There are many other options available with regular expressions, but this list is adequate for most purposes.