cfRegeX


Positions

When working with regex it is useful to remember that matching can deal with positions as well as (or instead of) matching actual characters. A position is the 'gap' between characters, and is sometimes referred to as a zero-length match.
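As a minimal sketch of what a zero-length match looks like in practice, the following uses java.util.regex directly, on the assumption that cfRegex behaves the same way here. The class and helper names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroLengthMatch {
    // Width (end - start) of the first match, or -1 if nothing matches.
    static int firstMatchWidth(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        // Matching characters consumes them: width 3.
        System.out.println(firstMatchWidth("cat", "cat"));
        // Matching a position consumes nothing: width 0.
        System.out.println(firstMatchWidth("^", "cat"));
        // Replacing a position inserts text without removing any characters.
        System.out.println("cat".replaceAll("^", ">")); // >cat
    }
}
```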

There are two main ways to match positions - by using pre-defined "boundary" metacharacters, or by using ad-hoc "lookaround" expressions.

Boundaries

Start of input

To match the start of the input text - the position before the first character - you can use either "\A" or "^". The former only ever matches this position, whereas "^" additionally matches the start-of-line position when multiline mode is enabled.
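The difference can be seen by counting matches with and without multiline mode. This is a sketch against java.util.regex, assuming cfRegex exposes the same behaviour; the class and helper names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartOfInput {
    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        // \A matches only the start of input, whatever the mode.
        System.out.println(countMatches("\\A", 0, text));                 // 1
        System.out.println(countMatches("\\A", Pattern.MULTILINE, text)); // 1
        // ^ matches once by default, and once per line in multiline mode.
        System.out.println(countMatches("^", 0, text));                   // 1
        System.out.println(countMatches("^", Pattern.MULTILINE, text));   // 3
    }
}
```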

End of input

To match the end of the input text - the position after the last character - you can use either "\z" or "$". The former will only ever match the end of input, whilst the latter can also match the end of lines, if multiline mode is enabled.

There is also "\Z", which matches at the end of input and additionally at the position before a trailing newline, if there is one.
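To compare the end-of-input anchors, the sketch below lists every position each one matches in a string with a trailing newline. It uses java.util.regex directly, assuming cfRegex behaves alike. In java.util.regex a plain "$" (without multiline mode) also matches before a final newline, which the output shows. The helper name is illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EndOfInput {
    // Start index of every match of the regex in the input.
    static List<Integer> positions(String regex, int flags, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "line\n"; // 5 characters, the last is a newline
        // \z matches only at the very end of the input.
        System.out.println(positions("\\z", 0, text)); // [5]
        // \Z also matches just before the trailing newline.
        System.out.println(positions("\\Z", 0, text)); // [4, 5]
        // In java.util.regex, "$" without multiline mode behaves like \Z.
        System.out.println(positions("$", 0, text));   // [4, 5]
    }
}
```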

Start of line

When in multiline mode, you can use caret "^" to match the start of the line, which is defined as the position after a newline.

What is considered a newline can be altered with Unix Lines mode, which lets you include or exclude carriage returns. However, this only affects the start-of-line position if there are individual carriage returns, since a carriage return paired with a line-break comes at the end of the line, not the start.

With multiline mode disabled, there is no explicit start-of-line metacharacter, though a positive lookbehind can be used, e.g. (?<=\n) (note that this does not match at the start of the input, where no newline precedes the position).
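The start-of-line rules above can be checked with a small sketch. It assumes cfRegex's multiline and Unix Lines modes correspond to the MULTILINE and UNIX_LINES flags of java.util.regex; the helper name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartOfLine {
    // Start index of every match of the regex in the input.
    static List<Integer> positions(String regex, int flags, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        // Multiline ^ matches at the start of input and after each newline.
        System.out.println(positions("^", Pattern.MULTILINE, "a\nb\nc")); // [0, 2, 4]
        // The lookbehind alternative needs no mode, but misses position 0.
        System.out.println(positions("(?<=\\n)", 0, "a\nb\nc"));          // [2, 4]
        // A lone carriage return starts a new line by default...
        System.out.println(positions("^", Pattern.MULTILINE, "a\rb"));    // [0, 2]
        // ...but not when Unix Lines mode is also enabled.
        System.out.println(positions("^", Pattern.MULTILINE | Pattern.UNIX_LINES, "a\rb")); // [0]
    }
}
```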

End of line

When in multiline mode, you can use dollar "$" to match the end of the line, which is defined as the position before any newline.

Whether carriage returns are considered part of a newline can be controlled with Unix Lines mode. When it is enabled, only the newline character itself is considered a newline, so the position matched comes after any carriage return that would otherwise be paired with the newline.

With multiline mode disabled, there is no explicit end-of-line metacharacter, though a positive lookahead can be used, e.g. (?=\n)
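The effect of Unix Lines mode on the end-of-line position is easiest to see on a string with a carriage return and newline pair. This sketch again assumes the java.util.regex flags MULTILINE and UNIX_LINES stand in for the cfRegex modes; names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EndOfLine {
    // Start index of every match of the regex in the input.
    static List<Integer> positions(String regex, int flags, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String crlf = "a\r\nb"; // indexes: a=0, \r=1, \n=2, b=3
        // By default \r\n is one line ending, so $ matches before the \r.
        System.out.println(positions("$", Pattern.MULTILINE, crlf)); // [1, 4]
        // With Unix Lines, only \n ends a line, so $ matches after the \r.
        System.out.println(positions("$", Pattern.MULTILINE | Pattern.UNIX_LINES, crlf)); // [2, 4]
        // The lookahead alternative works without multiline mode.
        System.out.println(positions("(?=\\n)", 0, "a\nb")); // [1]
    }
}
```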

Word Boundary

This is slightly different to what you might expect. The "\b" word boundary metacharacter does not match the whitespace between words (remember, whitespace consists of characters, and we're dealing with positions); instead it matches a change between a word character and a non-word character.

There is a word boundary between the two characters in "a-" and also in "-b", but there is no word boundary within "ab" or within "--".

Whilst some regex implementations have distinct "start word" and "end word" boundaries, the engine used by cfRegex does not differentiate them. You can work around this by using lookarounds to imitate start of word (?<!\w)(?=\w) and end of word (?<=\w)(?!\w)
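A quick way to try the two workaround expressions is to insert a marker at each position they match. The sketch uses java.util.regex directly, assuming it matches the engine behind cfRegex; the class and method names are hypothetical.

```java
public class WordEdges {
    // Wraps each word in brackets using the start-of-word and
    // end-of-word lookaround expressions.
    static String bracketWords(String input) {
        return input
            .replaceAll("(?<!\\w)(?=\\w)", "[")  // start of word
            .replaceAll("(?<=\\w)(?!\\w)", "]"); // end of word
    }

    public static void main(String[] args) {
        System.out.println(bracketWords("hi there")); // [hi] [there]
    }
}
```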

You can match the opposite of a word boundary using "\B", which will match between "ab" and between "--" but not between "a-" nor "-b".
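The "a-", "ab" and "--" cases can be verified by listing the positions each metacharacter matches. This is a sketch with java.util.regex, assumed to behave like cfRegex here; note that the start and end of the input count as positions too, so they appear in the results. The helper name is illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundary {
    // Start index of every match of the regex in the input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        // \b: a word character on exactly one side of the position.
        System.out.println(positions("\\b", "a-")); // [0, 1]
        System.out.println(positions("\\b", "ab")); // [0, 2]
        System.out.println(positions("\\b", "--")); // []
        // \B: every position that is not a word boundary.
        System.out.println(positions("\\B", "a-")); // [2]
        System.out.println(positions("\\B", "ab")); // [1]
        System.out.println(positions("\\B", "--")); // [0, 1, 2]
    }
}
```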

Lookarounds

When you need to match an ad-hoc position, you can use lookarounds. A lookaround lets you use a sub-expression to indicate the position that can match. For lookaheads you have the full regex syntax available to you. For lookbehinds you can only use limited-width quantifiers (that is, the standard "*", "+", and variants are unavailable, since they do not have a maximum width).

As you might guess from the name, lookarounds do not actually consume any characters - they simply look at what is ahead or behind and determine if their sub-expression will match or not, and either succeed (and let matching continue) or fail (and the match fails).

Lookarounds can be either positive (their sub-expression must match to succeed), or negative (their sub-expression must not match to succeed), which - combined with lookahead and lookbehind - gives four different lookarounds in total:

Positive lookahead: (?=...)
Negative lookahead: (?!...)
Positive lookbehind: (?<=...)
Negative lookbehind: (?<!...)
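As a rough illustration of the four combinations, the sketch below applies each one to the same input and reports where the match starts. It uses java.util.regex on the assumption that cfRegex follows the same syntax; the input string and helper name are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FourLookarounds {
    // Start index of every match of the regex in the input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "ab ac cb"; // indexes: a=0 b=1 _ a=3 c=4 _ c=6 b=7
        // Positive lookahead: an "a" that is followed by "b".
        System.out.println(positions("a(?=b)", text));  // [0]
        // Negative lookahead: an "a" that is not followed by "b".
        System.out.println(positions("a(?!b)", text));  // [3]
        // Positive lookbehind: a "b" that is preceded by "a".
        System.out.println(positions("(?<=a)b", text)); // [1]
        // Negative lookbehind: a "b" that is not preceded by "a".
        System.out.println(positions("(?<!a)b", text)); // [7]
    }
}
```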

It is useful to remember that - since lookarounds do not consume characters - they can be "stacked" to combine several conditions that might be less maintainable if expressed all together. For example, you can use "(?=\w)(?!x)" to match the position before any word character except the letter "x". As a single lookahead this would need to be "(?=[A-Za-wyz0-9_])", which is more long-winded.
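The equivalence of the stacked and single-lookahead forms can be checked by comparing the positions each one matches. This is a sketch with java.util.regex, assumed to match cfRegex's behaviour; the test string and helper name are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StackedLookaheads {
    // Start index of every match of the regex in the input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "wxyz_X9-"; // indexes: w=0 x=1 y=2 z=3 _=4 X=5 9=6 -=7
        // Stacked: next character is a word character AND is not "x".
        List<Integer> stacked = positions("(?=\\w)(?!x)", text);
        // Single lookahead spelling out the same set of characters.
        List<Integer> single = positions("(?=[A-Za-wyz0-9_])", text);
        System.out.println(stacked);                // [0, 2, 3, 4, 5, 6]
        System.out.println(stacked.equals(single)); // true
    }
}
```

Only the lowercase "x" at index 1 and the hyphen at index 7 are skipped; the uppercase "X" still qualifies because the patterns are case-sensitive.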