cfRegeX


Matching Mechanics

Knowing how a regex engine interprets the rules of a regex pattern is useful, as it helps you to write more efficient expressions and avoid certain mistakes.

The core regex engine used by cfRegex is the java.util.regex library, which uses an algorithm known as "backtracking NFA".

Whilst you don't need to know the full details of the underlying theory of how this algorithm works and how it compares to other possible implementations, it is useful to understand a number of practical aspects to it, which are described in the sections below.

Regex works Left-to-Right

This is something common to all existing regex engines - matching is done from left-to-right, in all situations. Whilst a regex matcher might skip back to a previous position, it never actually scans from right-to-left

If you need to deal only with something at the end of a string (such as a file extension) it is always best to isolate the end of the string using simpler methods before trying to match a regex against it. (e.g. use Right(string,num) or ListLast(string,delimiter) as these both work by starting from the right and moving towards the left.)

To demonstrate this behaviour, consider the text "the quick fox jumps over the lazy brown dog", and the expression "\b.+?dog". You might think the ".+?" will match at least once but as few characters as possible, and so the closest the "\b" can match to "dog" is after the "n" of "brown", before the space - however this is falling into the trap of applying your own direction to the interpretation of the instructions.

The regex engine starts with the "\b", which matches immediately, before the first "t", then the ".+?" matches the "t", tries to pass control to the next instruction (because it is a lazy quantifier), but this keeps failing and passes back to the lazy quantifier, which keeps expanding it's match one character at a time, until it has eventually found the "dog" at the end and ended up matching the entire text.

So remember: Regex always starts with the first instruction and at the start (left-most position) of a string of characters.

Backtracking

One of the key benefits of regex is its ability to try multiple alternatives, however it is not always obvious what the cost of this may be.

When matching, a regex engine applies the regex instructions one at a time. If the instruction succeeds, it moves forward and tries the next instruction. When an instruction fails, it looks to see if there are alternatives, if there are none it skips back to the last point it had alternatives and tries a different option, and it repeats this whole process until either the regex has matched, or the regex engine has ran out of alternatives to try.

A regex matcher will not skip forward to future parts of the regex and try those first. It will not even check what the next instruction is and keep track of when that might match - it deals with the current instruction completely before even considering what the following instruction might be.

For example, the pattern ^.+(?=://) might be used in order to match the protocol from a URL - basically saying, "start at the beginning, match any character as many times as possible, then look for (but don't match) ://".

However, what happens here is that the + quantifer matches the entire input string until the end, then it tries (and fails) to match :// and it backtracks one character at a time, attempting to match, failing, and backtracking again, until eventually it reaches almost the start of the string, and eventually finds the :// at position four (assuming a string starting "http://...").

The most common solution people use for this is to use the "lazy" +? quantifier instead, however whilst this is significantly faster it will still involve unnecessary backtracking. Lazy quantifiers will match one unit at a time, then immediately say "ok, what's next", and (when the next instruction fails) control is passed back, an additional unit is matched, and then it passes control forward again. For this example that's not going to matter much, but for longer strings this can be just as much as issue as the .+ originally used.

The improved solution is not to use . but to be more specific with what is expected to be found. Using either [^:]+ or \w+ will both prevent the regex going all the way along the string, whilst also avoiding the continuous backtracking that a lazy quantifier can cause.

See the following diagram for a more visible example of what is going on:

Backtracking Example

Sometimes backtracking is unavoidable - without it regex would not be as useful as it is - but the key is to try to keep it to a minimum, and if it can be avoided altogether that's even better.

Alternation uses the earliest match.

When using pipe "|" to allow a number of alternatives, it is important to keep in mind that regex does not even consider later options if an earlier one has matched successfully.

The expression (car|cart|carpet)(\w*) matched against the text "carpets" will place "car" in group one and "pets" in group two.

In other regex implemenations (ones without backtracking), all three alternatives are matched in parallel and the longest match is used (which would result in "carpet" and "s").

If we changed the expression to (car|cart|carpet)([^p]\w*) and still matched against the text "carpets", the first group would still match "car", however (when attempting to match the second group) the [^p] would prevent a succesful match, so the regex engine would backtrack, attempt "cart" which fails, then attempt "carpet" which succeeds, then move forward to the second group again, which would match the "s".

For this reason, you should order alternatives so the most likely to match is first and the least likely of the options is last.