January

Introduction

This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep.

The two approaches have wildly different performance characteristics:

[Figure: Time to match a?]

The two graphs plot the time required by each approach to match the regular expression a?

Notice that Perl requires over sixty seconds to match a character string. The other approach, labeled Thompson NFA for reasons that will be explained later, requires twenty microseconds to match the string. That's not a typo.

The trends shown in the graph continue. Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the above graph could just as well have been drawn for Python, or PHP, or Ruby, or many other languages.
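The effect is easy to reproduce on a small scale. The sketch below times Python's standard re module, one of the implementations named above as sharing Perl's algorithm. The pattern family (n optional a's followed by n literal a's, matched against a string of n a's) and the helper name time_match are assumptions chosen for illustration. Timings depend on the machine and the Python version, so none are quoted here.

```python
import re
import time


def time_match(n):
    """Time one match of n optional a's plus n literal a's against n a's."""
    # Assumed pattern family: n=3 gives the pattern a?a?a?aaa and the text aaa.
    pattern = "a?" * n + "a" * n
    text = "a" * n
    start = time.perf_counter()
    matched = re.fullmatch(pattern, text) is not None
    elapsed = time.perf_counter() - start
    return matched, elapsed


if __name__ == "__main__":
    # Keep n small: a backtracking matcher's running time can grow very
    # quickly on this input.
    for n in (5, 10, 15, 18):
        matched, elapsed = time_match(n)
        print(f"n={n:2d} matched={matched} seconds={elapsed:.6f}")
```

Every pattern in the family does match its text, so only the elapsed time changes as n grows. Watching how that time behaves from one n to the next is the point of the exercise.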

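A more detailed graph later in this article presents data for other implementations.

It may be hard to believe the graphs. Most of the time, in fact, regular expression matching in Perl is fast enough. As the graph shows, though, some regular expressions are pathological for Perl, which matches them very slowly.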

In contrast, there are no regular expressions that are pathological for the Thompson NFA implementation.

Historically, regular expressions are one of computer science's shining examples of how using good theory leads to good programs. They were originally developed by theorists as a simple computational model, but Ken Thompson introduced them to programmers in his implementation of the text editor QED for CTSS. Thompson and Dennis Ritchie would go on to create Unix, and they brought regular expressions with them. By the late 1970s, regular expressions were a key feature of the Unix landscape, in tools such as ed, sed, grep, egrep, awk, and lex.

Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools.

This article reviews the good theory. It also puts the theory into practice, describing a simple implementation of Thompson's algorithm.

That implementation, a compact C program, is the one that went head to head with Perl above. The article concludes with a discussion of how theory might yet be converted into practice in the real-world implementations.


Regular Expressions

Regular expressions are a notation for describing sets of character strings. When a particular string is in the set described by a regular expression, we often say that the regular expression matches the string. The simplest regular expression is a single literal character.

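To match a metacharacter, escape it with a backslash; for example, \+ matches a literal plus character. Two regular expressions can be alternated or concatenated to form a new regular expression: if e1 and e2 are regular expressions, then e1|e2 matches whatever either of them matches, and e1e2 matches a string matched by e1 followed by a string matched by e2. The repetition operators *, +, and ? apply to the expression they follow, matching zero or more, one or more, and zero or one occurrences of it. The operator precedence, from weakest to strongest binding, is first alternation, then concatenation, and finally the repetition operators. Explicit parentheses can be used to force different meanings, just as in arithmetic expressions.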
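The precedence rules can be checked directly. The sketch below uses Python's re module as a stand-in, on the assumption that it treats alternation, concatenation, repetition, parentheses, and backslash escapes the same way as the syntax described here. The helper name matches is hypothetical, and re.fullmatch is used so that "matches" means the whole string is in the set.

```python
import re


def matches(pattern, text):
    """Return True when the whole of text is in the set described by pattern."""
    return re.fullmatch(pattern, text) is not None


# Alternation binds weakest: ab|cd means (ab)|(cd), not a(b|c)d.
assert matches(r"ab|cd", "ab")
assert matches(r"ab|cd", "cd")
assert not matches(r"ab|cd", "abd")
assert matches(r"a(b|c)d", "abd")

# Repetition binds strongest: ab* means a(b*), not (ab)*.
assert matches(r"ab*", "abbb")
assert not matches(r"ab*", "abab")
assert matches(r"(ab)*", "abab")

# A backslash turns a metacharacter into a literal character.
assert matches(r"a\+b", "a+b")
assert not matches(r"a\+b", "aab")
```

Each pair of assertions contrasts the default grouping with the one forced by explicit parentheses.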

The syntax described so far is a subset of the traditional Unix egrep regular expression syntax. This subset suffices to describe all regular languages. Newer regular expression facilities (notably Perl and those that have copied it) have added many new operators and escape sequences.

These additions make the regular expressions more concise, and sometimes more cryptic, but usually not more powerful.

One common regular expression extension that does provide additional power is called backreferences. A backreference such as \1 matches the string matched by an earlier parenthesized expression, and only that string. As far as the theoretical term is concerned, regular expressions with backreferences are not regular expressions.
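A small example may make the extra power concrete. The sketch below uses Python's re module, whose \1 is a backreference to the first parenthesized group. The patterns and the helper names is_doubled and is_pair are assumptions made up for illustration.

```python
import re

# \1 must repeat exactly the text captured by the first group.
DOUBLED = re.compile(r"(cat|dog)\1")

# Without a backreference, the second half is free to differ from the first.
PAIR = re.compile(r"(cat|dog)(cat|dog)")


def is_doubled(text):
    """True when text is the same word (cat or dog) written twice."""
    return DOUBLED.fullmatch(text) is not None


def is_pair(text):
    """True when text is any two of the words cat and dog in a row."""
    return PAIR.fullmatch(text) is not None


assert is_doubled("catcat") and is_doubled("dogdog")
assert not is_doubled("catdog")
assert is_pair("catdog")
```

The second pattern accepts catdog because nothing ties its two halves together. The first rejects it because the backreference makes the second half depend on what the first half actually matched.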


The power that backreferences add comes at great cost. Perl and the other languages could not now remove backreference support, of course, but they could employ much faster algorithms when presented with regular expressions that don't have backreferences, like the ones considered above.
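One way to picture that suggestion is a front end that inspects the pattern before choosing a matcher. The sketch below is only an illustration under stated assumptions: has_backreference and choose_strategy are hypothetical names, the scan recognizes only numbered backreferences \1 through \9, and it ignores character classes and named forms such as Python's (?P=name). It returns a label and does not implement either matcher.

```python
def has_backreference(pattern):
    """Roughly detect numbered backreferences such as \\1 in a pattern string."""
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            if pattern[i + 1] in "123456789":
                return True
            i += 2  # skip the escaped character, e.g. an escaped backslash or \+
        else:
            i += 1
    return False


def choose_strategy(pattern):
    """Pick the slower, more general matcher only when the pattern needs it."""
    return "backtracking" if has_backreference(pattern) else "automaton"


print(choose_strategy(r"(cat|dog)\1"))
print(choose_strategy(r"a?a?a?aaa"))
```

Skipping two characters after a backslash keeps an escaped backslash followed by a digit from being mistaken for a backreference.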

