A Short History of Regex
In the late 1960s, Ken Thompson of Bell Labs wrote them into the editor QED, and in the 1970s they made it into Unix programs and utilities such as grep, sed and AWK. These tools made text-matching much easier than the alternative: writing custom parsing programs for each task. Naturally, some saw the potential for even more powerful text-matching patterns.

In the 1980s, programmers could not resist the urge to expand the existing regular expression syntax to make its patterns even more useful, most notably Henry Spencer with his regex library, then Larry Wall with the Perl language, which used, then expanded, Spencer's library. The engines that process this expanded regular expression syntax were no longer DFAs; they are called Non-Deterministic Finite Automatons (NFAs). At that stage, regex patterns could no longer be said to be regular in the mathematical sense. This is why a small minority of people today (most of whom have email addresses ending with .edu) will maintain that what we call regex are not regular expressions. For the rest of us… Regex and regular expressions? Same-same.

Perl had a huge influence on the flavors of regular expressions used in most modern engines today. This is why modern regular expressions are often called Perl-style. The differences in features across regex engines are considerable, so in my view speaking of Perl-style regular expressions only makes sense when one wants to make it clear one is not talking about the ivory tower brand of mathematically-correct expressions. But if you really want to avoid ambiguity, just say regex, as that is one word that white-coat computer scientists are not claiming.

Subject: Reiterating that regular = DFA = NFA
Could you please fix this? I see you've replied, but the main text still needs to be corrected. I actually noticed this before reading the comment.
Before: The engines that process this expanded regular expression syntax were no longer DFAs; they are called Non-Deterministic Finite Automatons (NFAs).
After (the wording may need a bit of additional reworking): The engines that process this expanded regular expression syntax were no longer regular. They are certainly recursively enumerable, most likely also context-sensitive, and if so, most likely also context-free.
I am slightly confused by whether or not context-free languages are also context-sensitive, given the wording. I'm pretty sure they are considered to be, though.
Reply to Solomon Ucko
Hi Solomon, I don't have the brain cycles right now to carefully consider the adequate wording, but your comment is up.

Subject: Please check your theory
DFAs and NFAs have equivalent expressive power. In other words, for every NFA, there is an equivalent DFA, and vice versa. So it is incorrect to say that these "regex patterns could no longer be said to be regular in the mathematical sense." They in fact do define regular languages. Some "regex" engines, however, do go beyond regular languages to define "context-free" languages. These cannot be represented by an NFA (or DFA) and require a push-down automata (PDA).
Reply to Thomas McLeod
Thomas, That's sort of right but sort of muddled. Deterministic Finite Automata, Nondeterministic Finite Automata, and regular expressions all generate/recognize exactly the set of regular languages. However, the "regular expressions" in programming languages might better be described as "extended regular expressions". The language of all strings with some number of As followed by a B followed by the same number of As is not a regular language (pumping lemma). However, it is accepted by the following regex: (A*)B\1
"Automata" is plural. The singular is "automaton".

Quick-Start: Regex Cheat Sheet
Characters
Character | Legend | Example | Sample Match |
---|---|---|---|
\d | Most engines: one digit from 0 to 9 | file_\d\d | file_25 |
\d | .NET, Python 3: one Unicode digit in any script | file_\d\d | file_9੩ |
\w | Most engines: "word character": ASCII letter, digit or underscore | \w-\w\w\w | A-b_1 |
\w | Python 3: "word character": Unicode letter, ideogram, digit, or underscore | \w-\w\w\w | 字-ま_۳ |
\w | .NET: "word character": Unicode letter, ideogram, digit, or connector | \w-\w\w\w | 字-ま‿۳ |
\s | Most engines: "whitespace character": space, tab, newline, carriage return, vertical tab | a\sb\sc | a b c |
\s | .NET, Python 3, JavaScript: "whitespace character": any Unicode separator | a\sb\sc | a b c |
\D | One character that is not a digit as defined by your engine's \d | \D\D\D | ABC |
\W | One character that is not a word character as defined by your engine's \w | \W\W\W\W\W | *-+=) |
\S | One character that is not a whitespace character as defined by your engine's \s | \S\S\S\S | Yoyo |
Quantifier | Legend | Example | Sample Match |
---|---|---|---|
+ | One or more | Version \w-\w+ | Version A-b1_1 |
{3} | Exactly three times | \D{3} | ABC |
{2,4} | Two to four times | \d{2,4} | 156 |
{3,} | Three or more times | \w{3,} | regex_tutorial |
* | Zero or more times | A*B*C* | AAACC |
? | Once or none | plurals? | plural |
Character | Legend | Example | Sample Match |
---|---|---|---|
. | Any character except line break | a.c | abc |
. | Any character except line break | .* | whatever, man. |
\. | A period (special character: needs to be escaped by a \) | a\.c | a.c |
\ | Escapes a special character | \.\*\+\?\$\^\/\\ | .*+?$^/\ |
\ | Escapes a special character | \[\{\(\)\}\] | [{()}] |
Logic | Legend | Example | Sample Match |
---|---|---|---|
| | Alternation / OR operator | 22|33 | 33 |
( … ) | Capturing group | A(nt|pple) | Apple (captures "pple") |
\1 | Contents of Group 1 | r(\w)g\1x | regex |
\2 | Contents of Group 2 | (\d\d)\+(\d\d)=\2\+\1 | 12+65=65+12 |
(?: … ) | Non-capturing group | A(?:nt|pple) | Apple |
Character | Legend | Example | Sample Match |
---|---|---|---|
\t | Tab | T\t\w{2} | Tab |
\r | Carriage return character | see below | |
\n | Line feed character | see below | |
\r\n | Line separator on Windows | AB\r\nCD | ABCD |
\N | Perl, PCRE (C, PHP, R…): one character that is not a line break | \N+ | ABC |
\h | Perl, PCRE (C, PHP, R…), Java: one horizontal whitespace character: tab or Unicode space separator | ||
\H | One character that is not a horizontal whitespace | ||
\v | .NET, JavaScript, Python, Ruby: vertical tab | ||
\v | Perl, PCRE (C, PHP, R…), Java: one vertical whitespace character: line feed, carriage return, vertical tab, form feed, paragraph or line separator | ||
\V | Perl, PCRE (C, PHP, R…), Java: any character that is not a vertical whitespace | ||
\R | Perl, PCRE (C, PHP, R…), Java: one line break (carriage return + line feed pair, and all the characters matched by \v) |
Quantifier | Legend | Example | Sample Match |
---|---|---|---|
+ | The + (one or more) is "greedy" | \d+ | 12345 |
? | Makes quantifiers "lazy" | \d+? | 1 in 12345 |
* | The * (zero or more) is "greedy" | A* | AAA |
? | Makes quantifiers "lazy" | A*? | empty in AAA |
{2,4} | Two to four times, "greedy" | \w{2,4} | abcd |
? | Makes quantifiers "lazy" | \w{2,4}? | ab in abcd |
Character | Legend | Example | Sample Match |
---|---|---|---|
[ … ] | One of the characters in the brackets | [AEIOU] | One uppercase vowel |
[ … ] | One of the characters in the brackets | T[ao]p | Tap or Top |
- | Range indicator | [a-z] | One lowercase letter |
[x-y] | One of the characters in the range from x to y | [A-Z]+ | GREAT |
[ … ] | One of the characters in the brackets | [AB1-5w-z] | One of either: A,B,1,2,3,4,5,w,x,y,z |
[x-y] | One of the characters in the range from x to y | [ -~]+ | Characters in the printable section of the ASCII table (space through tilde). |
[^x] | One character that is not x | [^a-z]{3} | A1! |
[^x-y] | One of the characters not in the range from x to y | [^ -~]+ | Characters that are not in the printable section of the ASCII table. |
[\d\D] | One character that is a digit or a non-digit | [\d\D]+ | Any characters, including new lines, which the regular dot doesn't match |
[\x41] | Matches the character at hexadecimal position 41 in the ASCII table, i.e. A | [\x41-\x45]{3} | ABE |
Anchor | Legend | Example | Sample Match |
---|---|---|---|
^ | Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means "not") | ^abc .* | abc (line start) |
$ | End of string or end of line depending on multiline mode. Many engine-dependent subtleties. | .*? the end$ | this is the end |
\A | Beginning of string (all major engines except JS) | \Aabc[\d\D]* | abc (string......start) |
\z | Very end of the string. Not available in Python and JS | the end\z | this is...\n...the end |
\Z | End of string or (except Python) before final line break. Not available in JS | the end\Z | this is...\n...the end\n |
\G | Beginning of string or end of previous match. .NET, Java, PCRE (C, PHP, R…), Perl, Ruby | ||
\b | Word boundary. Most engines: position where one side only is an ASCII letter, digit or underscore | Bob.*\bcat\b | Bob ate the cat |
\b | Word boundary. .NET, Java, Python 3, Ruby: position where one side only is a Unicode letter, digit or underscore | Bob.*\bкошка\b | Bob ate the кошка |
\B | Not a word boundary | c.*\Bcat\B.* | copycats |
Character | Legend | Example | Sample Match |
---|---|---|---|
[:alpha:] | PCRE (C, PHP, R…): ASCII letters A-Z and a-z | [8[:alpha:]]+ | WellDone88 |
[:alpha:] | Ruby 2: Unicode letter or ideogram | [[:alpha:]\d]+ | кошка99 |
[:alnum:] | PCRE (C, PHP, R…): ASCII digits and letters A-Z and a-z | [[:alnum:]]{10} | ABCDE12345 |
[:alnum:] | Ruby 2: Unicode digit, letter or ideogram | [[:alnum:]]{10} | кошка90210 |
[:punct:] | PCRE (C, PHP, R…): ASCII punctuation mark | [[:punct:]]+ | ?!.,:; |
[:punct:] | Ruby: Unicode punctuation mark | [[:punct:]]+ | ,: |
Modifier | Legend | Example | Sample Match |
---|---|---|---|
(?i) | Case-insensitive mode (except JavaScript) | (?i)Monday | monDAY |
(?s) | DOTALL mode (except JS and Ruby). The dot (.) matches new line characters (\r\n). Also known as "single-line mode" because the dot treats the entire input as a single line | (?s)From A.*to Z | From A to Z |
(?m) | Multiline mode (except Ruby and JS). ^ and $ match at the beginning and end of every line | (?m)1\r\n^2$\r\n^3$ | 1 2 3 |
(?m) | In Ruby: the same as (?s) in other engines, i.e. DOTALL mode, i.e. dot matches line breaks | (?m)From A.*to Z | From A to Z |
(?x) | Free-spacing mode (except JavaScript). Also known as comment mode or whitespace mode | (?x) # this is a # comment abc # write on multiple # lines [ ]d # spaces must be # in brackets | abc d |
(?n) | .NET, PCRE 10.30+: named capture only | Turns all (parentheses) into non-capture groups. To capture, use named groups. | |
(?d) | Java: Unix linebreaks only | The dot and the ^ and $ anchors are only affected by \n | |
(?^) | PCRE 10.32+: unset modifiers | Unsets ismnx modifiers |
Lookaround | Legend | Example | Sample Match |
---|---|---|---|
(?=…) | Positive lookahead | (?=\d{10})\d{5} | 01234 in 0123456789 |
(?<=…) | Positive lookbehind | (?<=\d)cat | cat in 1cat |
(?!…) | Negative lookahead | (?!theatre)the\w+ | theme |
(?<!…) | Negative lookbehind | \w{3}(?<!mon)ster | Munster |
Class Operation | Legend | Example | Sample Match |
---|---|---|---|
[…-[…]] | .NET: character class subtraction. One character that is in those on the left, but not in the subtracted class. | [a-z-[aeiou]] | Any lowercase consonant |
[…-[…]] | .NET: character class subtraction. | [\p{IsArabic}-[\D]] | An Arabic character that is not a non-digit, i.e., an Arabic digit |
[…&&[…]] | Java, Ruby 2+: character class intersection. One character that is both in those on the left and in the && class. | [\S&&[\D]] | A non-whitespace character that is a non-digit. |
[…&&[…]] | Java, Ruby 2+: character class intersection. | [\S&&[\D]&&[^a-zA-Z]] | A non-whitespace character that is a non-digit and not a letter. |
[…&&[^…]] | Java, Ruby 2+: character class subtraction is obtained by intersecting a class with a negated class | [a-z&&[^aeiou]] | An English lowercase letter that is not a vowel. |
[…&&[^…]] | Java, Ruby 2+: character class subtraction | [\p{InArabic}&&[^\p{L}\p{N}]] | An Arabic character that is not a letter or a number |
Syntax | Legend | Example | Sample Match |
---|---|---|---|
\K | Keep Out. Perl, PCRE (C, PHP, R…), Python's alternate engine, Ruby 2+: drop everything that was matched so far from the overall match to be returned | prefix\K\d+ | 12 |
\Q…\E | Perl, PCRE (C, PHP, R…), Java: treat anything between the delimiters as a literal string. Useful to escape metacharacters. | \Q(C++ ?)\E | (C++ ?) |
Search pattern: ^(.*\.)[^.]+$
Replace: \1rar
Translation: At the beginning of the file name, greedily match any characters then a dot, capturing this to Group 1.
The greedy dot-star will shoot to the end of the file name, then backtrack to the last dot.
This capture is the stem of the file name.
After this capture, match one or more characters that are not dots: the extension.
Replace all of this with the content of Group 1, expressed as \1 or $1 (this is the captured stem and includes the dot) and "rar".
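To try this substitution outside a rename dialog, here is a minimal Python sketch. The helper name `change_extension` is made up for the illustration, and Python's `re.sub` stands in for the search-and-replace fields above.

```python
import re

# Hypothetical helper: replace a file's extension with "rar" using the pattern above.
def change_extension(filename: str) -> str:
    # Group 1 is the stem up to and including the last dot; [^.]+ is the old extension.
    return re.sub(r"^(.*\.)[^.]+$", r"\1rar", filename)

print(change_extension("backup.2019.zip"))  # backup.2019.rar
print(change_extension("README"))           # no dot, so the name is left unchanged
```

A name without any dot does not match, so it comes back untouched.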
Removing a Character from the File Name
You could do this with a simple search-and-replace, but, to get accustomed to regex, here is a convoluted way to do it.
The aim is to zap all dashes.
Search pattern: ^([^-]*)-(.*)#
Replace: \1\2
Translation: Match and capture any non-dash characters to Group 1, match a dash, then eat up any characters, capturing those in Group 2.
Replace the file name with the first group immediately followed by the second group (the dash is gone).
The hash character on the first line (#) tells the DOpus engine to perform that substitution until the string stops changing.
That way, all dashes are zapped one by one.
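Outside Directory Opus there is no trailing # modifier, but the "repeat until the string stops changing" behavior can be emulated with a loop. The sketch below is in Python and the function name `zap_dashes` is hypothetical.

```python
import re

DASH = re.compile(r"^([^-]*)-(.*)")

# Hypothetical helper that emulates the # modifier with an explicit loop.
def zap_dashes(filename: str) -> str:
    while True:
        # One substitution per pass: removes the first dash only.
        renamed = DASH.sub(r"\1\2", filename, count=1)
        if renamed == filename:  # nothing changed, so no dash is left
            return renamed
        filename = renamed

print(zap_dashes("my-holiday-photo-01.jpg"))  # myholidayphoto01.jpg
```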
Replacing Dots with Spaces in File Names
This pattern is for times when you have 95 files that look something like this:
Search pattern: ([^.]+)\.(.*?)\.([^.]+)$#
Replace: \1 \2.\3
Translation: The pattern is a bit long, so let's unroll it.
([^.]+)  # Greedily eat anything that is not a dot and capture that substring in group 1
\.  # Match a dot
(.*?)  # Lazily eat up anything and capture that substring in group 2
\.  # Match a dot (this is the dot before the file extension)
([^.]+)$  # Greedily eat up anything that is not a dot, until the end of the filename, capturing that in group 3 (this is the extension)
The final hash character (#) tells Opus to repeat the replace operation until there are no dots left to eat and the filename has been cleaned up.
The replace string takes the captured groups and inserts a space in place of each dot.
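Here is how the repeated replacement plays out, sketched in Python. The file name and the helper `dots_to_spaces` are invented for the example, and the loop stands in for the final # of the Opus pattern.

```python
import re

DOTS = re.compile(r"([^.]+)\.(.*?)\.([^.]+)$")

# Hypothetical helper: each pass turns the first dot into a space and
# keeps the last dot (the one before the extension).
def dots_to_spaces(filename: str) -> str:
    while True:
        renamed = DOTS.sub(r"\1 \2.\3", filename, count=1)
        if renamed == filename:  # fewer than two dots left: nothing to replace
            return renamed
        filename = renamed

print(dots_to_spaces("The.Big.Lebowski.1998.avi"))  # The Big Lebowski 1998.avi
```

The loop stops once only the extension dot remains, because the pattern needs two dots to match.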
Swapping Two Parts of a Filename
Suppose you have named a lot of movie files according to this pattern:
8.2 Groundhog Day (1993).avi
The number at the front is the movie's IMDB rating.
The number between parentheses at the end is the movie's release year.
One day, you decide that instead of sorting movies by their rating, you really want to sort them by year, which means that in the file name, you'd like to swap the position of the rating and year.
You want your files to look like this:
(1993) Groundhog Day 8.2.avi
Without regular expressions, you are in trouble.
This is actually a fairly common scenario.
It could happen for any collection of files you have tagged, such as music tracks or topo maps.
Here is a regular expression that works in this case.
Search pattern: ^(\d\.\d)([^(]*)(\([\d]{4}\))(.*)
Replace: \3\2\1\4
Let's unroll the regex:
^(\d\.\d)  # At the beginning, in Group 1, capture a digit, a dot and a digit.
That's the IMDB rating.
([^(]*)  # In the second group, greedily capture anything that is not an opening parenthesis.
(\([\d]{4}\))  # In the third group, capture an opening parenthesis (which needs to be escaped in the regex), four digits, and a closing parenthesis.
(.*)  # In the last group, capture anything.
The replace pattern simply takes the four groups and changes their order.
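The swap can be checked quickly in Python. This is just a sketch: the helper name `year_first` is hypothetical, and `re.sub` plays the role of the rename tool.

```python
import re

SWAP = re.compile(r"^(\d\.\d)([^(]*)(\([\d]{4}\))(.*)")

# Hypothetical helper: move the year to the front and the rating to the back.
def year_first(filename: str) -> str:
    return SWAP.sub(r"\3\2\1\4", filename)

print(year_first("8.2 Groundhog Day (1993).avi"))  # (1993) Groundhog Day 8.2.avi
```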
(?=\b\w{7}\b)\w*?hay\w*
Translation: Look right ahead for a seven-letter word (the \b boundaries are important).
Lazily eat up any word characters followed by "hay", then eat up any word characters.
We know that the greedy match has to stop because the word is seven characters long.
Here, in our word, we allow any characters that regex calls "word characters", which, besides letters, also include digits and underscores.
If we want a more conservative pattern, we just need to change the lookahead:
Traditional word (only letters): (?i)(?=\b[A-Z]{7}\b)\w*?hay\w*
In this pattern, in the lookahead, you can see that I replaced \w{7} with [A-Z]{7}, which matches seven capital letters.
To include lowercase letters, we could have used [A-Za-z]{7}.
Instead, we used the case insensitive modifier (?i).
Thanks to this modifier, the pattern can match "HAY" or "hAy" just as easily as "hay".
It all depends on what you want: regex puts the power in your hands.
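To see the difference between the two lookaheads, here is a small Python sketch. The sample text is invented for the illustration.

```python
import re

ANY_WORD_CHARS = re.compile(r"(?=\b\w{7}\b)\w*?hay\w*")
LETTERS_ONLY = re.compile(r"(?i)(?=\b[A-Z]{7}\b)\w*?hay\w*")

text = "hay hay_123 haywire HAYRIDE"

# \w{7} accepts digits and underscores, and the pattern is case-sensitive.
loose = ANY_WORD_CHARS.findall(text)
# [A-Z]{7} with (?i) accepts seven letters in any case, and nothing else.
strict = LETTERS_ONLY.findall(text)

print(loose)   # ['hay_123', 'haywire']
print(strict)  # ['haywire', 'HAYRIDE']
```

The plain word "hay" is rejected by both patterns because the lookahead demands exactly seven characters between the boundaries.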
Line Contains both "bubble" and "gum"
Search pattern: ^(?=.*?\bbubble\b).*?\bgum\b.*
Translation: While anchored at the beginning of the line, look ahead for any characters followed by the word bubble.
We could use a second lookahead to look for gum, but it is faster to just match the whole line, taking care to match gum on the way.
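As a quick check, the pattern can be used as a line filter in Python. The sample lines are made up, and the pattern is applied to each line separately rather than in multiline mode.

```python
import re

BOTH = re.compile(r"^(?=.*?\bbubble\b).*?\bgum\b.*")

lines = [
    "chewing gum and a bubble",  # both words, in either order
    "bubble bath",               # bubble only
    "gum drops",                 # gum only
    "bubble gum",                # both words
]

# Keep only the lines where both whole words are present.
matching = [line for line in lines if BOTH.search(line)]
print(matching)  # ['chewing gum and a bubble', 'bubble gum']
```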
Line Contains "boy" or "buy"
Search pattern: \bb[ou]y\b
Translation: Inside a word (inside two \b boundaries), match the character b, then one character that is either o or u, then y.
Find Repeated Words, such as "the the"
This is a popular example in the regex literature.
I don't know about you you, but it doesn't happen all that often often that mistakenly repeated words find their way way into my text.
If this example is so popular, it's probably because it's a short pattern that does a great job of showcasing the power of regex.
You can find a million ways to write your repeated word pattern.
In this one, I used POSIX classes (available in Perl and PHP), allowing us to throw in optional punctuation between the words, in addition to optional space.
Search pattern: \b([[:alpha:]]+)[ [:punct:]]+\1
Translation: After a word delimiter, in group one, capture a positive number of letters, then eat up space characters or punctuation marks, then match the same word we captured earlier in group one.
If you don't want the punctuation, just use an \s+ in place of [ [:punct:]]+.
Remember that \s eats up any white-space characters, including newlines, tabs and vertical tabs, so if this is not what you want use [ ]+ to specify space characters.
The brackets are optional, but they make the space character easier to spot, especially in a variable-width font.
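Python's built-in re module does not support POSIX bracket classes such as [[:alpha:]], so the sketch below swaps in [A-Za-z] for the letters and uses the \s+ variant mentioned above. It is an adaptation for illustration, not the exact pattern from this section, and the helper name `find_repeats` is hypothetical.

```python
import re

# Adapted pattern: [A-Za-z] replaces [[:alpha:]], \s+ replaces [ [:punct:]]+
REPEATED = re.compile(r"\b([A-Za-z]+)\s+\1")

# Hypothetical helper: return each word that is immediately repeated.
def find_repeats(text: str) -> list:
    # findall returns the content of Group 1 for every match
    return REPEATED.findall(text)

print(find_repeats("I know you you said it twice twice"))  # ['you', 'twice']
```

Note that, as written, the pattern has no boundary after \1, so "the theory" also counts as a repeat of "the". Appending \b to the pattern tightens it if that matters for your text.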
Line does Not Contain "boy"
Search pattern: ^(?!.*boy).*
Translation: At the beginning of the line, if the negative lookahead can assert that what follows is not "any characters then boy", match anything on the line.
Line Contains "bubble" but Neither "gum" Nor "bath"
Search pattern: ^(?!.*gum)(?!.*bath).*?bubble.*
Translation: At the beginning of the line, assert that what follows is not "any characters then gum", assert that what follows is not "any characters then bath", then match the whole string, making sure to pick up bubble on the way.
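Both negative-lookahead filters can be tried on a few invented lines in Python. Each line is tested on its own, which stands in for a line-by-line search in an editor.

```python
import re

NO_BOY = re.compile(r"^(?!.*boy).*")
BUBBLE_ONLY = re.compile(r"^(?!.*gum)(?!.*bath).*?bubble.*")

lines = ["the boy ran", "the girl ran", "bubble wrap", "bubble gum", "bubble bath"]

# Lines that do not contain "boy" anywhere
without_boy = [line for line in lines if NO_BOY.match(line)]
# Lines that contain "bubble" but neither "gum" nor "bath"
plain_bubble = [line for line in lines if BUBBLE_ONLY.match(line)]

print(without_boy)   # ['the girl ran', 'bubble wrap', 'bubble gum', 'bubble bath']
print(plain_bubble)  # ['bubble wrap']
```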
Email Address
If I ever have to look for an email address in my text editor, frankly, I just search for @.
That shows me both well-formed addresses, as well as addresses whose authors let their creativity run loose, for instance by typing DOT in place of the period.
When it comes to validating user input, you want an expression that checks for well-formed addresses.
There are thousands of email address regexes out there.
In the end, none can really tell you whether an address is valid until you send a message and the recipient replies.
The regex below is borrowed from chapter 4 of Jan Goyvaerts's excellent book.
I'm in tune with Jan's reasoning that what you really want is an expression that works with 999 addresses out of a thousand, an expression that doesn't require a lot of maintenance, for instance by forcing you to add new top-level domains ("dot something") every time the powers in charge of those things decide it's time to launch names ending in something like dot-phone or dot-dog.
Search pattern: (?i)\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b
Let's unroll this one:
(?i)  # Turn on case-insensitive mode
\b  # Position engine at a word boundary
[A-Z0-9._%+-]+  # Match one or more of the characters between brackets: letters, numbers, dot, underscore, percent, plus, minus.
Yes, some of these are rare in an email address.
@  # Match @
(?:[A-Z0-9-]+\.)+  # Match one or more strings followed by a dot, such strings being made of letters, numbers and hyphens.
These are the domains and sub-domains, such as post. and microsoft. in post.microsoft.com
[A-Z]{2,6}  # Match two to six letters, for instance US, COM, INFO.
This is meant to be the top-level domain.
Yes, this also matches DOG.
You have to decide if you want to achieve razor precision, at the cost of needing to maintain your regex when new TLDs are introduced.
\b  # Match a word boundary
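Here is a short Python sketch of the pattern at work, both for pulling addresses out of text and for checking a single input. The sample addresses are invented.

```python
import re

EMAIL = re.compile(r"(?i)\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b")

text = "Write to jane.doe@post.example.com or to bob AT example DOT org."

# Extraction: find every well-formed address embedded in the text.
found = EMAIL.findall(text)
print(found)  # ['jane.doe@post.example.com']

# Validation: fullmatch requires the whole input to be one address.
print(bool(EMAIL.fullmatch("Jane@Example.INFO")))  # True
print(bool(EMAIL.fullmatch("jane@localhost")))     # False: no dot-separated TLD
```

As the section says, a match only tells you the address is well-formed, not that anyone reads mail there.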
RewriteRule old_dir/(.*)$ new_dir/$1 [L,R=301]
Explanation: The part of the old URL that follows old_dir/ is captured in Group 1 and appended to the new path.
Targeting Certain Browsers
BrowserMatch \bMSIE no-gzip
This directive checks if the user's browser name contains "MSIE" (with a word boundary before the "M").
If so, Apache applies what follows on the line.
(In this case, no-gzip tells Apache not to compress content.)
Targeting Certain Files
<FilesMatch "\.html?$">
Header set Cache-Control "max-age=43200"
</FilesMatch>
The first line of this htaccess directive for file caching has a small regex matching files ending with a dot and "htm" or "html".
Other Regular Expressions in Apache
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*?Webster
Purpose: In a rewrite rule, tests for certain user agents.
RewriteCond %{HTTP_REFERER} ^http://www\.google\.com$
Purpose: In a rewrite rule, tests for a specific referrer.
RewriteCond %{REMOTE_ADDR} 192\.168\.\d\d.*
Purpose: In a rewrite rule, tests for an IP range.
RewriteCond %{TIME_HOUR} ^13$
Purpose: In a rewrite rule, tests whether the hour is 1 pm.
There are other uses of regex in Apache.
These examples should give you a taste.
For background information, you may want to look at the manual page for mod_rewrite, the mod_rewrite page, the rewrite guide and the advanced rewrite guide.
Is Apache using the same PCRE version as PHP?
Not necessarily.
To see which version of PCRE PHP uses, look at the result of phpinfo() and search for PCRE.
In addition to the version number, you will find a reference to a directory: something like pcre-regex=/opt/pcre.
Another way to find that folder is to run ldd /some/path/php | grep pcre in the shell, where "some/path" is the path returned by "which php".
You can use that directory in a shell command line to get more information on your PCRE version:
/opt/pcre/bin/pcretest -C
On cPanel, EasyApache installs PCRE in the /opt folder, so if PHP reports the folder above, you can expect that mod_rewrite and PHP are using the same version of PCRE (unless there is a bug in cPanel).
On other installs, you may want to find all the installed versions of pcretest to see which versions are installed:
find / -name pcretest
SELECT * FROM YourDatabase WHERE YourField REGEXP 'ty$';
Here's a second example that selects rows whose field does not contain a digit:
SELECT * FROM YourDatabase WHERE YourField NOT REGEXP "[[:digit:]]";
You can use RLIKE in place of REGEXP, as the two are synonyms.
But why would you?
Regular expressions in MySQL aim to comply with POSIX 1003.2, also known as Henry Spencer's "regex 7" library.
The MySQL documentation page for REGEXP states that it is incomplete, but that the full details are included in MySQL source distributions, in the regex.7 file… Okay, that's a drag, but let's download the source.
Oh, no, you can't, you need a special installer just to download the source.
Never mind, here is a copy of the regex(7) manual page.
If you build queries in a programming language before sending them to MySQL, you have to pay particular attention to escaping contentious characters in the regex string.
Your language probably has a function that will get the string ready to be passed to the database.
If you are used to Perl-like regular expressions, MySQL's POSIX flavor will sound like baby talk.
If you need more power, you may consider the PCRE library for MySQL.
Since I upgrade my MySQL server for major releases, the risk of forward incompatibility is a bit high and I stick with regex(7).
Subject: what about the contexts for sed and egrep?
I'm an old unix guy and lived with the early regular expressions for a very long time.
I still script with sed.
I spent a number of hours today reading and evaluating many of the web-, linux- and windows-based tools to assist in testing and creating regular expressions.
What I find interesting is that I can't find one of these tools that allows one to restrict the engine to the sed (or egrep) contexts.
This would be extremely helpful.
I was actually surprised at seeing all the "other flavors" leaving the stalwarts nowhere to be found.
Why is this? I would like to see these supported because it is best to use the most efficient method and one can't get much more efficient than sed.
Regards
oldunixguy
Reply to rich painter
You're quite right Rich,
The tools I use don't have an egrep or sed mode.
RegexBuddy does have a Perl mode, and there's a lot you can do on *nix with Perl one-liners (there's a page on the site with examples, in case you haven't seen it yet).
The Match is Just Another Capture Group
Basically, you can imagine that there is a set of parentheses around your entire regex. These parentheses are just implied. They capture Group 0, which by convention we call "the match".

In fact, your language may already think that way. In PHP, if $match is the match array, $match[1] will contain Group 1, $match[2] will contain Group 2… and $match[0] ("Group 0") will contain the overall match. Likewise, in JavaScript the overall match will be in matchArray[0]. In Python and C#, you can (although those are not the only options) retrieve the overall match as match.group(0) and matchResult.Groups[0].Value.

Likewise, in regex replace operations, \1, \2, \3 (or $1, $2, $3, depending on the flavor) usually refer to capture groups 1, 2 and 3. Not by coincidence, \0 ($0) usually refers to the overall match.

Once you see that the match is just another group, the question of whether to match or to capture loses importance: You will be capturing anyhow.

What does this mean? You are the one who knows what data you want to match. Knowing this, use whatever means you need (whether it's an overall match or a sneaky capture in Group 3) in order to grab what you want. In the example about keeping the regex in sync with your string, we'll look at a technique that makes many captures (some useful, some not) and then leaves it to the code outside the regex to decide which of the capture groups are important. It's not a particularly efficient technique, but it works.

The only moderation I would add to the advice to "use whatever means you need" is that it's generally considered poor programming practice to spawn unnecessary capture groups, as they create overhead. So if you need parentheses in order to evaluate an expression but don't need to capture the data, make it a non-capturing group by using the (?: … ) syntax.
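A short Python sketch makes the "Group 0" idea concrete. The sample strings are invented. Note that in Python's replacement syntax the overall match is written \g<0> rather than \0.

```python
import re

m = re.search(r"(\d\d)\+(\d\d)", "sum: 12+65")

overall = m.group(0)  # the implied "Group 0": the whole match
first = m.group(1)    # Group 1
second = m.group(2)   # Group 2
print(overall, first, second)  # 12+65 12 65

# In a replacement, \g<0> stands for the overall match.
bracketed = re.sub(r"\d+", r"[\g<0>]", "a1b22")
print(bracketed)  # a[1]b[22]
```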
Matching All and Splitting are two sides of the same coin.
Consider a list of fruits separated by the word "and": apple and banana and orange and pear and cherry. You are interested in obtaining an array with all the names of fruits. To do so, you could match all the words that are not and (something like \b(?!and)\S+ would do).
Another approach would be to split the string using the delimiter " and ".
Both approaches would provide you with the same array: it's a bit like one of those drawings that can be interpreted in different ways depending on whether you focus on the white background or on the inked parts.
When you want to match, I'll split you...
When you want to split, I'll match you...
This is a simple example, but often you will gain considerable advantage by deciding to match rather than to split, or vice-versa.
You'll often find that one way is easy and the other nearly impossible.
Therefore, if someone tells you "I want to match all the…" or "I am trying to split by…", try not to rush down the first alley because they said "split" or "match": remember the other side of the coin.
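The fruit example can be run both ways in Python to confirm that the two approaches yield the same array. This is a minimal sketch using the sentence from the text.

```python
import re

fruits = "apple and banana and orange and pear and cherry"

# Approach 1: match every run of non-space characters that does not start with "and"
matched = re.findall(r"\b(?!and)\S+", fruits)

# Approach 2: split on the delimiter
split = re.split(r" and ", fruits)

print(matched)  # ['apple', 'banana', 'orange', 'pear', 'cherry']
print(matched == split)  # True
```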
To write good regex, say what you mean. Say it clearly.
The more specific your expressions, the faster your regex will match; and, often more importantly, the faster your regex will fail when no match is there to be found. Here are a few "golden rules" that every regex craftsperson should keep in mind. If some of these rules don't make complete sense to you right now, don't worry about it: just come here again after you've read some other sections, or in a couple months' time.
Whenever Possible, Anchor.
Anchors, such as the caret ^ for the beginning of a line and the dollar sign $ for the end of a line often provide the needed clue that ensures the engine finds a match in the right place. For instance, when we validate a string, they ensure that the engine matches the whole string, rather than a substring embedded in the string being examined. And anchors often save the engine a lot of backtracking. Be aware that anchors are not limited to ^ and $. Most engines have other useful built-in anchors, such as \A and \G (see the cheat sheet).
When You Know what You Want, Say It. When You Know what You Don't Want, Say It Too!
When you feed your regex engine a lot of .* "dot-star soup", the engine can waste a lot of energy running down the string then backtracking. Be as specific as possible, whether by using a literal B character, a \d digit class or a \b boundary. Another great way to be specific is to say what you don't want, whether what you don't want is…
… a double quote: [^"]
… a digit: \D
… or for the next three letters to be "boo": (?!boo)[a-z]{3}
Contrast is Beautiful—Use It.
When you can, use consecutive tokens that are mutually exclusive in order to create contrast.
This reduces backtracking and the need for boundaries in the broad sense of the term, in which I include lookarounds.
For instance, let's say you're trying to validate strings that contain exactly three digits located at the end, as in ABC123.
Something like ^.+\d{3}$ would not work, because . and \d are not mutually exclusive: this regex would match ABC123456.
You may think to add a negative lookbehind: ^.+(?<!\d)\d{3}$
But if you use tokens that are mutually exclusive in the first place, you no longer need a lookaround: ^\D+\d{3}$ works straight out of the box.
With time, you come to relish the beautiful contrast between \D and \d, between [^a-z] and [a-z].
This is a variation on When you know what you want, say it.
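The three candidate patterns from this rule can be compared side by side in Python. This is a small sketch and the variable names are made up.

```python
import re

LOOSE = re.compile(r"^.+\d{3}$")              # . and \d overlap
LOOKBEHIND = re.compile(r"^.+(?<!\d)\d{3}$")  # patched with a lookaround
CONTRAST = re.compile(r"^\D+\d{3}$")          # mutually exclusive tokens

for candidate in ("ABC123", "ABC123456"):
    results = [bool(p.search(candidate)) for p in (LOOSE, LOOKBEHIND, CONTRAST)]
    print(candidate, results)
# ABC123 [True, True, True]
# ABC123456 [True, False, False]
```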
Want to Be Lazy? Think Twice.
Let's say you want to match all the characters between a set of curly braces.
At first you might think of {.*?} because the lazy quantifier ensures you don't overshoot the closing brace.
However, a lazy quantifier has a cost: at each step inside the braces, the engine tries the lazy option first (match no character), then tries to match the next token (the closing brace), then has to backtrack.
Therefore, the lazy quantifier causes backtracking at each step (see Lazy Quantifiers Are Expensive).
This is more efficient: {[^}]*}
This is a variation on Use Contrast and When you know what you want, say it.
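On single-line input, both patterns return the same matches and only the amount of work differs. The Python sketch below checks the first half of that claim on an invented string. The braces are escaped here as a precaution, since a bare { can be read as the start of a quantifier.

```python
import re

LAZY = re.compile(r"\{.*?\}")       # lazy dot-star: backtracks at each step
NEGATED = re.compile(r"\{[^}]*\}")  # negated class: runs straight to the brace

text = "config {first} and {second} end"

lazy_hits = LAZY.findall(text)
negated_hits = NEGATED.findall(text)

print(lazy_hits)                  # ['{first}', '{second}']
print(lazy_hits == negated_hits)  # True
```

One difference to keep in mind: the negated class also crosses line breaks, while the dot does not unless DOTALL mode is on.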
A Time for Greed, a Time for Laziness.
A reluctant (lazy) quantifier can make you feel safe in the knowledge that you won't eat more characters than needed and overshoot your match, but since lazy quantifiers cause backtracking at each step, using them can feel like bumping on a country road when you could be rolling down the highway.
Likewise, a greedy quantifier may shoot down the string then backtrack all the way back when all you needed was a few nudges with a lazy quantifier.
On the Edges: Really Need Boundaries or Delimiters? Use Them—or Make Your Own!
Most regex engines provide the \b boundary, and sometimes others, which can be useful to inspect an edge of a substring.
Depending on the engine, other boundaries may be available, but why stop there? In the right context, I believe in DIY boundaries.
For instance, using lookarounds, you can make a boundary to check for changes from upper- to lower-case, which can be useful to split a CamelCase string: (?<=[a-z])(?=[A-Z])
However, do not overuse boundaries, because good contrast often makes them redundant (see Use Contrast).
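The CamelCase boundary can be tried directly with re.split in Python. This is a sketch that assumes Python 3.7 or later, where splitting on a zero-width match is supported; the sample identifier is made up.

```python
import re

# DIY boundary: a lower-case letter behind us, an upper-case letter ahead
camel_boundary = re.compile(r'(?<=[a-z])(?=[A-Z])')

words = camel_boundary.split('splitCamelCaseString')
print(words)
```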
Don't Give Up what You Can Possess.
Atomic groups (?> … ) and the closely-related possessive quantifiers can save you a lot of backtracking.
Structured data often gives you chances to incorporate those in your expressions.
Don't Match what Splits Easily, and Don't Split what Matches Nicely.
I explained this point in the section about splitting vs. matching.
Design to Fail.
As Shakespeare famously wrote, "Any fool can write a regex that matches what it's meant to find. It takes genius to write a regex that knows early that its mission will fail."
Take (?=.*fleas).*.
It does a reasonable job of matching lines that contain fleas.
But what of lines that don't have fleas? At the very start of the string, the engine looks all the way down the line.
The lookahead fails, the regex engine moves to the second position in the string, and once again looks for fleas all the way down the line.
At each position in the string, the engine repeats the lookahead, so that the pattern takes a long time to fail…
In comparison, consider ^(?=.*fleas).*.
The only difference is the caret anchor.
It doesn't look like a big deal, but once the engine fails to find fleas at the start of the string, it stops because the lookahead is anchored at the start.
This pattern is designed for failure, and it is much more efficient: O(N) vs. O(N²) for the first.
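Here is a short sketch of the two patterns in Python. Both give the same answers; what differs is how much scanning happens before a failure is reported. The sample lines are assumptions, and since actual timings depend on the engine and its optimizations, none are claimed here.

```python
import re

unanchored = re.compile(r'(?=.*fleas).*')
anchored = re.compile(r'^(?=.*fleas).*')

with_fleas = 'my dog has fleas today'
without_fleas = 'my dog is perfectly clean'

# Same verdicts from both patterns; only the cost of failing differs
results = {
    'unanchored': (bool(unanchored.search(with_fleas)),
                   bool(unanchored.search(without_fleas))),
    'anchored': (bool(anchored.search(with_fleas)),
                 bool(anchored.search(without_fleas))),
}
print(results)
```

To see the cost of failure, one could time both patterns against a long line without fleas using the timeit module.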
Trust the Dot-Star to Get You to the End of the Line
With all the admonishments against the dot-star, here is one of many cases where it can be useful.
In a string such as @ABC @DEF, suppose you wish to match the last token that starts with @, but only if there is more than one token.
If you simply wanted the last, you could use an anchor: @[A-Z]+$
… but that will match the token even if it is the only one in the string.
You might think to use a lookahead: @[A-Z].*\K@[A-Z]+(?!.*@[A-Z]).
However, there is no need because the greedy .* already guarantees you that you are getting the last token! The dot-star matches all the way to the end of the line then backtracks, but only as far as needed: You can therefore simplify this to @[A-Z].*\K@[A-Z]+
Trust the dot-star to take you to the end of the line!
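Python's standard re module has no \K, so the sketch below substitutes a capture group for it. That is an adaptation, not the pattern from the text, but the greedy dot-star does the same job of carrying us to the last token. The sample strings beyond @ABC @DEF are assumptions.

```python
import re

# Capture group stands in for \K, which the re module does not support
last_token = re.compile(r'@[A-Z].*(@[A-Z]+)')

def last_of_many(s):
    """Return the last @token, but only if the string holds more than one."""
    m = last_token.search(s)
    return m.group(1) if m else None

print(last_of_many('@ABC @DEF'))
print(last_of_many('@ABC @DEF @GHI'))
print(last_of_many('@ABC'))
```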
Greedy atoms anchor again.
"Greedy" reminds you to check if some greedy quantifiers should be made lazy, and vice-versa. It also reminds you of the performance hit of lazy quantifiers (backtracking at each step), and of potential workarounds.
"Atoms" reminds you to check if some parts of the expression should be made atomic (or use a possessive quantifier).
"Anchor" reminds you to check if the expression should be anchored. By extension, it may remind you of boundaries, and whether to add them or remove them.
"Again" reminds you to check if parts of the expression could use the repeating subpattern syntax.
If you prefer short mnemonic devices, you may prefer the acronym AGRA, helpful to build the Taj Mahal of regular expressions, and named after the Indian city Agra, best known for the Taj Mahal:
A for Anchor
G for Greed
R for Repeat
A for Atomic
(?
In the regex tutorials and books I have read, these various points of syntax are introduced in stages.
But (?: … ) looks a lot like (?= … ), so that at some point they are bound to clash in the mind of the regex apprentice.
To facilitate study, I have pulled all the (? … ) usages I know about into one place.
I'll start by pointing out three confusing couples; details of usage will follow.
Lookahead After the Match: \d+(?= dollars)
Sample Match: 100 in 100 dollars
Explanation: \d+ matches the digits 100, then the lookahead (?= dollars) asserts that at that position in the string, what immediately follows is the characters " dollars".
Lookahead Before the Match: (?=\d+ dollars)\d+
Sample Match: 100 in 100 dollars
Explanation: The lookahead (?=\d+ dollars) asserts that at the current position in the string, what follows is digits then the characters "dollars".
If the assertion succeeds, the engine matches the digits with \d+.
Note that this pattern achieves the same result as \d+(?= dollars) from above, but it is less efficient because \d+ is matched twice.
A better use of looking ahead before matching characters is to validate multiple conditions in a password.
Negative Lookahead After the Match: \d+(?!\d| dollars)
Sample Match: 100 in 100 pesos
Explanation: \d+ matches 100, then the negative lookahead (?!\d| dollars) asserts that at that position in the string, what immediately follows is neither a digit nor the characters "dollars"
Negative Lookahead Before the Match: (?!\d+ dollars)\d+
Sample Match: 100 in 100 pesos
Explanation: The negative lookahead (?!\d+ dollars) asserts that at the current position in the string, what follows is not digits then the characters " dollars".
If the assertion succeeds, the engine matches the digits with \d+.
Note that this pattern achieves the same result as \d+(?!\d| dollars) from above, but it is less efficient because \d+ is matched twice.
A better use of looking ahead before matching characters is to validate multiple conditions in a password.
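The four lookahead patterns above can be checked in Python as follows. This is a sketch: the helper name is hypothetical, and the subjects are the sample strings from the text.

```python
import re

def first_match(pattern, subject):
    """Return the matched text, or None when the pattern fails."""
    m = re.search(pattern, subject)
    return m.group() if m else None

lookahead_results = {
    'after':            first_match(r'\d+(?= dollars)', '100 dollars'),
    'before':           first_match(r'(?=\d+ dollars)\d+', '100 dollars'),
    'neg_after':        first_match(r'\d+(?!\d| dollars)', '100 pesos'),
    'neg_before':       first_match(r'(?!\d+ dollars)\d+', '100 pesos'),
    # The negative versions reject the dollar amount entirely
    'neg_after_fails':  first_match(r'\d+(?!\d| dollars)', '100 dollars'),
    'neg_before_fails': first_match(r'(?!\d+ dollars)\d+', '100 dollars'),
}
print(lookahead_results)
```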
Lookbehind Before the Match: (?<=USD)\d{3}
Sample Match: 100 in USD100
Explanation: The lookbehind (?<=USD) asserts that at the current position in the string, what precedes is the characters "USD".
If the assertion succeeds, the engine matches three digits with \d{3}.
Lookbehind After the Match: \d{3}(?<=USD\d{3})
Sample Match: 100 in USD100
Explanation: \d{3} matches 100, then the lookbehind (?<=USD\d{3}) asserts that at that position in the string, what immediately precedes is the characters "USD" then three digits.
Note that this pattern achieves the same result as (?<=USD)\d{3} from above, but it is less efficient because \d{3} is matched twice.
Negative Lookbehind Before the Match: (?<!USD)\d{3}
Sample Match: 100 in JPY100
Explanation: The negative lookbehind (?<!USD) asserts that at the current position in the string, what precedes is not the characters "USD".
If the assertion succeeds, the engine matches three digits with \d{3}.
Negative Lookbehind After the Match: \d{3}(?<!USD\d{3})
Sample Match: 100 in JPY100
Explanation: \d{3} matches 100, then the negative lookbehind (?<!USD\d{3}) asserts that at that position in the string, what immediately precedes is not the characters "USD" then three digits.
Note that this pattern achieves the same result as (?<!USD)\d{3} from above, but it is less efficient because \d{3} is matched twice.
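The lookbehind variants can be exercised the same way. Each lookbehind here has a fixed width, which Python's re module requires. The helper name is hypothetical; the subjects come from the samples above.

```python
import re

def find(pattern, subject):
    """Return the matched text, or None when the pattern fails."""
    m = re.search(pattern, subject)
    return m.group() if m else None

lookbehind_results = {
    'before':           find(r'(?<=USD)\d{3}', 'USD100'),
    'after':            find(r'\d{3}(?<=USD\d{3})', 'USD100'),
    'neg_before':       find(r'(?<!USD)\d{3}', 'JPY100'),
    'neg_after':        find(r'\d{3}(?<!USD\d{3})', 'JPY100'),
    # The negative versions refuse the USD amount
    'neg_before_fails': find(r'(?<!USD)\d{3}', 'USD100'),
    'neg_after_fails':  find(r'\d{3}(?<!USD\d{3})', 'USD100'),
}
print(lookbehind_results)
```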
Support for Lookarounds
All major engines have some form of support for lookarounds—with some important differences.
For instance, JavaScript long supported lookahead but not lookbehind, which only arrived with ECMAScript 2018 (one of the many blotches on its regex scorecard).
Ruby 1.8 suffered from the same condition.
Lookbehind: Fixed-Width / Constrained Width / Infinite Width
One important difference is whether lookbehind accepts variable-width patterns.
At the moment, I am aware of only three engines that allow infinite repetition within a lookbehind—as in (?<=\s*): .NET, Matthew Barnett's outstanding regex module for Python, whose features far outstrip those of the standard re module, and the JGSoft engine used by Jan Goyvaerts' software such as EditPad Pro.
I've also implemented an infinite lookbehind demo for PCRE.
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range.
For instance, (?<=cats?) is valid because it can only match strings of three or four characters.
Likewise, (?<=A{1,10}) is valid.
PCRE (C, PHP, R …), Java and Ruby 2+ allow lookbehinds to contain alternations that match strings of different but pre-determined lengths (such as (?<=cat|raccoon))
Perl and Python require a lookbehind to match strings of a fixed length, so (?<=cat|raccoon) will not work.
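In Python's standard re module, the fixed-width restriction shows up as a compile-time error. The sketch below demonstrates it, along with one common way around it: splitting the alternation into separate lookbehinds, each of fixed width. The trailing s and the sample sentence are illustrative assumptions.

```python
import re

# Alternatives of different lengths inside one lookbehind are rejected
try:
    re.compile(r'(?<=cat|raccoon)s')
    rejected = False
except re.error:
    rejected = True

# Workaround: one fixed-width lookbehind per alternative
plural_s = re.compile(r'(?:(?<=cat)|(?<=raccoon))s')
found = plural_s.findall('cats raccoons dogs')

print(rejected, found)
```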
To master lookarounds, there is a bit more you should really know.
For these finer details, visit the lookaround page.
(?>A|.B)C
This will fail against ABC, whereas (?:A|.B)C would have succeeded.
After matching the A in the atomic group, the engine tries to match the C but fails.
Because it is atomic, it is unable to try the .B part of the alternation, which would also succeed, and allow the final token C to match.
Example 2: With Quantifier
(?>A+)[A-Z]C
This will fail against AAC, whereas (?:A+)[A-Z]C would have succeeded.
After matching the AA in the atomic group, the engine tries to match the [A-Z], succeeds by matching the C, then tries to match the token C but fails as the end of the string has been reached.
Because the group is atomic, it is unable to give up the second A, which would allow the rest of the pattern to match.
If, before the atomic group, there were other options to which the engine can backtrack (such as quantifiers or alternations), then the whole atomic group can be given up in one go.
When are Atomic Groups Important?
When a series of characters only makes sense as a block, using an atomic group can prevent needless backtracking.
This is explored in the section on possessive quantifiers.
In such situations atomic quantifiers can be useful, but not necessarily mission-critical.
On the other hand, there are situations where atomic quantifiers can save your pattern from disaster.
They are particularly useful:
In order to avoid the with patterns that contain lazy quantifiers whose token can eat the delimiter
To avoid certain forms of the
Supported Engines, and Workaround
Atomic groups are supported in most of the major engines: .NET, Perl, PCRE and Ruby.
For engines that don't support atomic grouping syntax, such as JavaScript and Python's re module before version 3.11, see the well-known pseudo-atomic group workaround.
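The pseudo-atomic workaround usually takes the form (?=(…))\1: a lookahead captures the text, and a back-reference then consumes it, leaving the engine nothing to give back. Here is a sketch in Python that mirrors Example 2 above. The names are hypothetical, and it relies on the engine not backtracking into a completed lookahead.

```python
import re

# Ordinary group: A+ can give back an A, so AAC matches
ordinary = re.compile(r'(?:A+)[A-Z]C')

# Pseudo-atomic group: lookahead captures A+, \1 consumes it in one block
pseudo_atomic = re.compile(r'(?=(A+))\1[A-Z]C')

ordinary_aac = ordinary.search('AAC')
atomic_aac = pseudo_atomic.search('AAC')
atomic_aabc = pseudo_atomic.search('AABC')

print(bool(ordinary_aac), bool(atomic_aac), bool(atomic_aabc))
```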
Alternate Syntax: Possessive Quantifier
When an atomic group only contains a token with a quantifier, an alternate syntax (in engines that support it) is a possessive quantifier, where a + is added to the quantifier.
For instance,
(?>A+) is equivalent to A++
(?>A*) is equivalent to A*+
(?>A?) is equivalent to A?+
(?>A{…,…}) is equivalent to A{…,…}+
This works in Perl, PCRE, Java and Ruby 2+.
For more, see the possessive quantifiers section of the quantifiers page.
Non-Capturing
Atomic groups are non-capturing, though as with other non-capturing groups, you can place the group inside another set of parentheses to capture the group's entire match; and you can place parentheses inside the atomic group to capture a section of the match.
Watch out, as the atomic group syntax is confusingly similar to the .
^(?<intpart>\d+)\.(?<decpart>\d+)$
or
^(?P<intpart>\d+)\.(?P<decpart>\d+)$
would both match a string containing a decimal number such as 12.22, storing the integer portion to a group named intpart, and storing the decimal portion to a group named decpart.
To create a back-reference to the intpart group in the pattern, depending on the engine, you'll use \k<intpart> or (?P=intpart).
To insert the named group in a replacement string, depending on the engine, you'll either use ${intpart}, \g<intpart>, $+{intpart} or the group number \1.
For the gory details, see Naming Groups—and referring back to them.
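In Python, the flavor in play is (?P<name>…) for definition, (?P=name) for a back-reference inside the pattern, and \g<name> in the replacement string. This is a short sketch; the doubled-word pattern and its sample are assumptions added for illustration.

```python
import re

decimal = re.compile(r'^(?P<intpart>\d+)\.(?P<decpart>\d+)$')
m = decimal.match('12.22')
int_part, dec_part = m.group('intpart'), m.group('decpart')

# Named groups in a replacement string: swap the two halves
swapped = decimal.sub(r'\g<decpart>.\g<intpart>', '12.22')

# Named back-reference inside a pattern: a doubled word
doubled = re.compile(r'\b(?P<word>\w+) (?P=word)\b')
doubled_word = doubled.search('it is is here').group('word')

print(int_part, dec_part, swapped, doubled_word)
```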
To name, or not to name?
I'll admit that I don't use named groups a whole lot, but some people love them.
Sure, named captures are bulkier than a quick (capture) and reference to \1—but they can save hassles in expressions that contain many groups.
Do they make your patterns easier to read? That's subjective.
For my part, if the regex is short, I always prefer numbered groups.
And if it is long, I would rather read a regex with numbered groups and good comments in free-spacing mode than a one-liner with named groups.
(?i=bob), (?iP<name>bob) and (?i>bob)
Using Inline Modifiers in the Middle of a Pattern
Usually, you'll use your inline modifiers at the start of the regex string to set the mode for the entire pattern.
However, changing modes in the middle of a pattern can be useful, so I'll give you two examples.
(\b[A-Z]+\b)(?i).*?\b\1\b
This ensures that an upper-case word is repeated somewhere in the string, in any letter-case.
First we capture an upper-case word to Group 1 (for instance DOG), then we set case-insensitive mode, then .*? matches any characters up to the back-reference \1, which could be dog or dOg.
As a neat variation, (\b[A-Z]+\b).*?\b(?=[a-z]+\b)(?i)\1\b ensures that the back-reference is in lower-case.
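Recent versions of Python's re module only accept a bare (?i) at the very start of a pattern. The sketch below therefore uses the scoped form (?i:…) to switch modes for the tail of the expression. This is an adaptation of the first example, not the original syntax, and the sample string is an assumption.

```python
import re

# Group 1 is matched case-sensitively; the scoped (?i:...) group
# makes the back-reference \1 match in any letter-case
repeated_any_case = re.compile(r'(\b[A-Z]+\b)(?i:.*?\b\1\b)')

# Same pattern without the mode switch, for comparison
case_sensitive = re.compile(r'(\b[A-Z]+\b).*?\b\1\b')

sample = 'DOG chases dOg'
m = repeated_any_case.search(sample)
strict = case_sensitive.search(sample)

print(m.group(1) if m else None, strict)
```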
^(\w+)\b.*\r?\n(?s).*?\b\1\b
This ensures that the first word of the string is repeated on a different line.
First we capture a word to Group 1, then we get to the end of the line with .*, match a line break, then set DOTALL mode—allowing the .*? to match across lines, which brings us to our back-reference \1.
Unsetting all modifiers: (?^)
As of PCRE 10.32, (?^) unsets all ismnx modifiers.
(\w+) (?1)
will match Hey Ho.
The parentheses in (\w+) not only capture Hey to Group 1—they also define Subroutine 1, whose pattern is \w+.
Later, (?1) is a call to subroutine 1.
The entire regex is therefore equivalent to (\w+) \w+
Subroutines can make long expressions much easier to look at and far less prone to copy-paste errors.
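Python's standard re module has no (?1) syntax, so the sketch below only illustrates the key distinction: a subroutine call reuses the pattern of Group 1, whereas a back-reference \1 reuses the text it captured. Building the expression from a shared string (the hypothetical WORD constant) is one way to get subroutine-like reuse in such engines.

```python
import re

WORD = r'\w+'  # hypothetical shared sub-pattern

# What (\w+) (?1) amounts to: the pattern is repeated
pattern_reuse = re.compile(f'({WORD}) {WORD}')

# A back-reference repeats the captured text instead
text_reuse = re.compile(r'(\w+) \1')

reuse_results = (
    bool(pattern_reuse.fullmatch('Hey Ho')),
    bool(text_reuse.fullmatch('Hey Ho')),
    bool(text_reuse.fullmatch('Hey Hey')),
)
print(reuse_results)
```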
Relative Subroutines
Instead of referring to a subroutine by its number, you can refer to the relative position of its defining group, counting left or right from the current position in the pattern.
For instance, (?-1) refers to the last defined subroutine, and (?+1) refers to the next defined subroutine.
Therefore,
(\w+) (?-1)
and
(?+1) (\w+)
are both equivalent to our first example with numbered group 1.
In Ruby 2+, for relative subroutine calls, you would use \g<-1> and \g<+1>.
Named Subroutines
Instead of using numbered groups, you can use named groups.
In that case, in Perl and PHP the syntax for the subroutine call will be (?&group_name).
In Ruby 2+ the syntax is \g<some_word>.
For instance,
(?<some_word>\w+) (?&some_word)
is equivalent to our first example with numbered group 1.
Pre-Defined Subroutines
So far, when we defined our subroutines, we also matched something.
For instance, (\w+) defines subroutine 1 but also immediately matches some word characters.
It so happens that Perl, PCRE and Python's alternate
Subroutines and Recursion
If you place a subroutine such as (?1) within the very capture group to which it refers—Group 1 in this case—then you have a recursive expression.
For instance, the regex ^(A(?1)?Z)$ contains a recursive sub-pattern, because the call (?1) to subroutine 1 is embedded in the parentheses that define Group 1.
If you try to trace the matching path of this regex in your mind, you will see that it matches strings like AAAZZZ, strings which start with any number of letters A and end with letters Z that perfectly balance the As.
After you open the parenthesis, the A matches an A… then the optional (?1)? opens another parenthesis and tries to match an A… and so on.
We'll look at recursion syntax in the next section.
There is also a page dedicated to recursion.
Warning
Note that the (?1) syntax looks confusingly similar to the ?(1) found in conditionals.
A(?R)?Z
matches strings or substrings such as AAAZZZ, where a number of letters A at the start are perfectly balanced by a number of letters Z at the end.
The initial token A matches an A… Then the optional (?R)? tries to repeat the whole pattern right there, and therefore attempts the token A to match an A… and so on.
Recursion of a Subroutine: (?1) and (?-1)
You also have recursion when a subroutine calls itself.
For instance, in
^(A(?1)?Z)$
subroutine 1 (defined by the outer parentheses) contains a call to itself.
This regex matches entire strings such as AAAZZZ, where a number of letters A at the start are perfectly balanced by a number of letters Z at the end.
As we saw in the section on subroutines, you can also call a subroutine by the relative position of its defining group at the current position in the pattern.
Therefore,
^(A(?-1)?Z)$
performs exactly like the above regex.
There is much more to be said about recursion.
See the page dedicated to recursive regex patterns.
^({)?\d+(?(1)})$
Likewise, (?(foo)…) checks if the capture group named foo has been set.
This pattern matches a string of digits that may or may not be embedded in curly braces.
The optional capture Group 1 ({)? captures an opening brace.
Later, the conditional checks if capture 1 was set, and if so it matches the closing brace.
Let's expand this example to use the "else" part of the syntax:
^(?:({)|")\d+(?(1)}|")$
This pattern matches strings of digits that are either embedded in double quotes or in curly braces.
The non-capture group (?:({)|") matches the opening delimiter, capturing it to Group 1 if it is a curly brace.
After matching the digits, (?(1)}|") checks whether Group 1 was set.
If so, we match a closing curly brace.
If not, we match a double quote.
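Python's re module supports this conditional syntax, so both patterns can be tried as written. This is a sketch in which the braces are escaped as a precaution and the test strings are assumptions.

```python
import re

# Closing brace required only if the opening brace was captured
optional_braces = re.compile(r'^(\{)?\d+(?(1)\})$')

# With an "else" branch: braces or double quotes, but they must pair up
braces_or_quotes = re.compile(r'^(?:(\{)|")\d+(?(1)\}|")$')

braces_ok = [s for s in ['123', '{123}', '{123', '123}']
             if optional_braces.match(s)]
delims_ok = [s for s in ['{123}', '"123"', '{123"', '"123}', '123']
             if braces_or_quotes.match(s)]

print(braces_ok)
print(delims_ok)
```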
Lookaround in Conditions
In (?(A)B), the condition you'll most frequently see is a check as to whether a capture group has been set.
In .NET, PCRE and Perl (but not Python and Ruby), you can also use lookarounds:
\b(?(?<=5D:)\d{5}|\d{10})\b
If the prefix 5D: can be found, the pattern will match five digits.
Otherwise, it will match ten digits.
Needless to say, that is not the only way to perform this task.
Checking if a relative capture group was set
(?(1)A) checks whether Group 1 was set.
In PCRE, instead of hard-coding the group number, we can also check whether a group at a relative position to the current position in the pattern has been set: for instance, (?(-1)A) checks whether the previous group has been set.
Likewise, (?(+1)A) checks whether the next capture group has been set.
(This last scenario would be found within a larger repeating group, so that on the second pass through the pattern, the next capture group may indeed have been set on the previous pass.)
Checking if a recursion level was reached
This is not the place to be talking in depth about recursion, which has a section below and a dedicated page, but for completeness I should mention two other uses of conditionals, available in Perl and PCRE:
(?(R)A) tests whether the regex engine is currently working within a recursion depth (reached from a recursive call to the whole pattern or a subroutine).
(?(R1)A) tests whether the current recursion level has been reached by a recursive call to subroutine 1.
See examples here.
Availability of Regex Conditionals
Conditionals are available in PCRE, Perl, .NET, Python, and Ruby 2+.
In other engines, the work of a conditional can usually be handled by the careful use of lookarounds.
Similar Syntax
Note that the (?(1)B) syntax can look confusingly similar to (?1) which stands for a regex subroutine, where the regex pattern defined by Group 1 must be matched.
(?&noun_phrase)\ (?&verb)\ (?&noun_phrase)
The subroutine noun_phrase is called twice: there is no need to paste a large repeated regex sub-pattern, and if we decide to change the definition of noun_phrase, that immediately trickles to the two places where it is used.
Note also that noun_phrase itself is built by assembling smaller blocks: its code (?&quant)\ (?&adj)\ (?&object) uses the quant, adj and object subroutines.
With this kind of modularity, you can build regex cathedrals.
There is a beautiful example on the page with the regex to match numbers in plain English.
A Note on Group Numbering
Please be mindful that each named subroutine consumes one capture group number, so if you use capture groups later in the regex, remember to count from left to right.
The gory details are on the page about Capture Group Numbering & Naming.
(?|A(\d+)|(\d+)B|C(\d+)D)
After the initial (?|, which introduces a branch reset, the group has a three-piece alternation (two |).
Each of those contains a capture group (\d+).
The number of all of those capture groups is the same: Group 1.
You are not limited to one group.
For instance, if you are also interested in capturing a potential suffix after the number (which can happen in the situations 11B and C55D), place another set of parentheses wherever you find a suffix:
(?|A(\d+)|(\d+)(B)|C(\d+)(D))
Using this regex to match the string A00 11B C22D, you obtain these groups:
Match   Group 1: Number   Group 2: Suffix
-----   ---------------   ---------------
A00     00                (not set)
11B     11                B
C22D    22                D
How Useful is Branch Reset?
When I first read about branch reset in the PCRE documentation a few years ago, I was excited and certain I'd use it often.
Since then, I've written several thousand regular expression patterns, but I've used branch reset less than a handful of times.
It's probably my fault for always jumping on other ways to do things first, but this leaves me with a sense that the feature is not all that useful after all.
That being said, on rare occasions, it's just the most direct and elegant way of doing things.
Let's look at one more example, less contrived than the first—which was pared down in order to explain the feature.
A Branch Reset Example: Tokenization with Variable Formats
To me, this is an example where branch reset seems to offer benefits over competing idioms.
Suppose you want to parse strings such as
song:"Sweet Home Alabama" fruit:apple color:blue motto:"Don't Worry"
into pairs of keys and values.
When the value following the colon is between quotes, you only want the inside of the quotes.
Therefore, you expect something like:
Group 1   Group 2
-------   -------
song      Sweet Home Alabama
fruit     apple
color     blue
motto     Don't Worry
This branch reset regex will get you there:
(\S+):(?|([^"\s]+)|"([^"]+))
Group 1 (\S+) is a straight capture group that captures the key.
In the branch reset, the two sets of capturing parentheses allow you to capture different kinds of values in different formats to the same group, i.e. Group 2.
You can check the group captures in the right pane of this online regex demo.
To me, this alternative with a conditional and a lookbehind…
(\S+):"?((?(?<!")[^"\s]+|[^"]+))
…feels a little less satisfying.
But hey, it works too.
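For comparison, here is what the same task looks like in an engine without branch reset, such as Python's standard re module. It is a sketch of one possible alternative, not the author's method: the two value formats land in separate groups (2 and 3), and the code has to pick whichever one was set.

```python
import re

text = 'song:"Sweet Home Alabama" fruit:apple color:blue motto:"Don\'t Worry"'

# Without (?|...), each branch gets its own group number
pair = re.compile(r'(\S+):(?:([^"\s]+)|"([^"]+))')

# Coalesce Groups 2 and 3 by hand: only one of them is set per match
pairs = [(m.group(1), m.group(2) or m.group(3)) for m in pair.finditer(text)]

for key, value in pairs:
    print(key, value)
```

The extra `or` step is precisely the bookkeeping that branch reset removes.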
(?# the year)\d{4}
\d{4} matches four digits, while (?# the year) tells you what we are trying to match.
How useful is this? Not very.
I almost never use this feature: when I want comments, I just turn on free-spacing mode for the whole regex.
(?(VERSION>=10)YES)
If it matches, the version is 10 or later.
As another example, you could use LATER // EARLIER as your subject, and match it with this:
(?(VERSION>=10.5)LATER|EARLIER)
Depending on your version, PCRE2 will either match LATER or EARLIER.
Assert that immediately to the left of the current position, we can find the left wall, while to the right of the current position we cannot find the left wall.
Yep, in that light, our anchor is a boundary: we look left and right. We'll keep anchors and boundaries on separate pages because there's a lot of ground to cover, but just keep that in mind.
(?<=^> )(?=[a-zA-Z])
After asserting that what precedes the current position is a "greater than" and a space, we assert that what follows is a letter.
Note that the order of the lookahead and the lookbehind does not matter, as they do not consume any characters: they look to the left and to the right with our feet firmly planted in the same spot in the string.
Therefore, the reverse-order boundary
(?=[a-zA-Z])(?<=^> )
works equally well.
After either of these patterns, we can confidently use any regex meta-character—such as the dot—and be sure that it will match a letter: they are true boundaries.
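Here is a quick sketch in Python showing that the two orderings behave identically, and that a dot placed after the boundary can only land on a letter. The sample strings are assumptions.

```python
import re

behind_first = re.compile(r'(?<=^> )(?=[a-zA-Z]).')
ahead_first = re.compile(r'(?=[a-zA-Z])(?<=^> ).')

quoted = '> quoted line'
numeric = '> 42'

# Both orderings find the same letter, and both refuse a digit
boundary_hits = [p.search(quoted).group() for p in (behind_first, ahead_first)]
boundary_misses = [p.search(numeric) for p in (behind_first, ahead_first)]

print(boundary_hits, boundary_misses)
```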
Generalizing the idea: home-made word boundary
We can use this technique to construct any boundary we like.
The coming sections will show some examples in detail, but to whet our appetite, how would you build a word boundary if your regex engine didn't support \b?
When it matches on the left of word characters, a word boundary is able to check that what follows is a word character but what precedes is not.
In lookaround terms, this is (?=\w)(?<!\w).
When it matches on the right of word characters, a word boundary is able to check that what precedes is a word character but what follows is not.
In lookaround terms, this is (?<=\w)(?!\w)
A word boundary must match either of these positions.
Grouping them together inside an alternation, our homemade word boundary becomes:
(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))
Yes, \b is a bit shorter.
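One way to convince yourself that the homemade version is equivalent is to compare the positions where it matches with those of \b. This is a sketch in Python (3.7 or later, for well-defined iteration over empty matches); the sample sentence is an assumption.

```python
import re

homemade = re.compile(r'(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))')
builtin = re.compile(r'\b')

sentence = 'The quick, brown fox!'

# Positions in the string where each boundary matches
homemade_positions = [m.start() for m in homemade.finditer(sentence)]
builtin_positions = [m.start() for m in builtin.finditer(sentence)]

print(homemade_positions == builtin_positions)
```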
(?i)(?<=^|[^a-z])cat(?=$|[^a-z])
The left side asserts that what precedes is either the beginning of the string or a character that is a non-letter.
The right side asserts that what follows is either the end of the string or a non-letter.
Your next step could be to combine the two to form a boundary that can be popped on either side:
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
On the left side of the alternation, we have our earlier left boundary, and we add a lookahead to check that what follows is a letter.
On the right side of the alternation, we have our earlier right boundary, and we add a lookbehind to check that what precedes us is a letter.
Needless to say, if you need to paste this wherever you want a "real word boundary", this is a bit heavy.
With engines that support pre-defined subroutines—Perl, PCRE (PHP, R, …)—you can define the boundary once and for all, then use it wherever you like by referring to its name:
(?x) # free-spacing mode
(?(DEFINE) # Define some subroutines
(?<alphaB> # Define "alphaB" boundary
# This boundary matches when
# only one side is a letter
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
) # End alphaB definition
) # End DEFINE
# The actual regex matching starts here
# We can use our "alphaB" boundary wherever we like
(?&alphaB)cat(?&alphaB)
This would work really well as a component of a large parsing regex.
(?:(?i)(?<=^|\d)(?=[a-z])|(?<=[a-z])(?=$|\d))
(?<![^#])\d(?![^#])
This is a bit of a brain twister.
On the left side, the negative lookbehind (?<![^#]) asserts that what precedes the current position is not one character that is not a hash.
Flipping the double negative back to a positive assertion, this says that if there is a character behind us, it must be a hash.
What is allowed behind us is therefore either a hash character or "not a character" (the beginning of the string).
Why the double negative? Isn't that the same as the positive lookbehind (?<=#)? Well, no: this positive lookbehind requires a hash character—whereas we also want to allow the absence of any character on the left.
The negative lookahead at the end of the string follows the same principle: (?![^#]) asserts that what follows is not a character that is not a hash—i.e., if it is a character, it must be a hash.
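To see the double negative at work, here is a sketch in Python on a made-up sample string. It also shows how the positive lookarounds differ: they miss the digits that sit at either end of the string.

```python
import re

sample = '0#1 2#3#4 #5'

# A hash or nothing at all on each side
double_negative = re.findall(r'(?<![^#])\d(?![^#])', sample)

# Positive versions demand an actual hash on each side
positive_only = re.findall(r'(?<=#)\d(?=#)', sample)

print(double_negative)
print(positive_only)
```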
Limitation
This technique works for single-line strings.
As soon as you move to multiple lines, 0# no longer matches at the beginning of lines 2 and beyond.
That is because there is a character before the 0: the \n, and it is not a hash.
Likewise, #5 no longer matches at the end of any line but the last, because there is now a line break character—not a hash—after the 5.
Extension
To get your eyes accustomed to the technique, let's apply it to other tasks.
To match A, B or E in A0 1B1 2C D3 4E, i.e. capital letters that have either a digit or a string-end on each side, you can use this pattern:
(?<!\D)[A-Z](?!\D)
To match A, C or F in A -B- C -D -E F, i.e. capital letters that have either a space or a string-end on each side, you can use this pattern:
(?<!\S)[A-Z](?!\S)
Finally, an unlikely example: to match the tilde, hash or colon in ~A ? 2! _#4 @5 6:, i.e. special characters that have either a word character or a string-end on each side, you can use this pattern:
(?<!\W)[~#:@?!](?!\W)
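The three patterns above can be run as-is in Python against the sample strings from the text. This sketch only adds the variable names.

```python
import re

digit_sided = re.findall(r'(?<!\D)[A-Z](?!\D)', 'A0 1B1 2C D3 4E')
space_sided = re.findall(r'(?<!\S)[A-Z](?!\S)', 'A -B- C -D -E F')
word_sided = re.findall(r'(?<!\W)[~#:@?!](?!\W)', '~A ? 2! _#4 @5 6:')

print(digit_sided, space_sided, word_sided)
```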
(?:Jane|\G) \w+:(\d+)
There are other ways to do this, especially if you have infinite lookbehind (.NET), but this approach is particularly economical.
How does it work? When the engine tries to match at the beginning of the string, the first token (?:Jane|\G) succeeds because \G matches at the beginning of the string.
However, the next token (a space character) fails against Tarzan's T.
The next chance for the pattern to match is at the position preceding Jane.
The engine matches "Jane A:35", capturing the 35 to Group 1.
At the starting position of the next match attempt, \G matches, and the engine matches " B:33".
Finally, \G matches again, and the engine matches " C:31".
Incidentally, in PCRE, Perl and Ruby, you don't need to retrieve the times from Group 1: you can match them directly with this small variation, where \K tells the engine to drop what it matched so far from the match to be returned:
(?:Jane|\G) \w+:\K\d+
"Beginning of String" Match: Using or Bridling \G
The fact that \G matches at the beginning of the string is neither convenient nor inconvenient.
Half the time, we use that property.
The other half, we work around it.
For our second example, consider this string, which might represent two potential positions for placing a "submarine" on a paper grid in preparation for a naval battle:
A1B1C1vsA1A2A3
Each position (on either side of vs) has three tokens composed of one letter and one digit.
Suppose we want to match the first three tokens, i.e. A1, B1, C1.
We can do this quite easily with this regex:
\G[A-Z]\d
The \G matches at the beginning of the string, allowing us to match A1.
Then \G matches before the next token, so we match it, as well as the following token.
\G succeeds again before the vs, but [A-Z] cannot match the v, so the match fails.
There is no more position for \G to match, and we therefore avoid the tokens to the right, as we wanted.
Now suppose we want to match the second position's tokens, i.e. A1, A2, A3.
Remembering the Tarzan and Jane example, we could try (?:vs|\G)([A-Z]\d)… but the strings in these two examples are not built the same way, and this regex would match all the tokens! Let's see how.
After the \G matches at the beginning of the string, [A-Z]\d is able to match the first token.
Then \G matches again, so we match the second token, and the third.
Then, when we hit vs, \G still matches, but [A-Z] fails against vs.
The engine backtracks and tries the other side of the alternation, vs, which matches.
[A-Z]\d matches the fourth token, then \G helps us with tokens 5 and 6.
Clearly, this time \G is in our way: we wish it didn't match at the beginning of the string.
To solve this, we can "bridle \G" by placing the negative lookahead (?!\A) right next to it.
It asserts that what immediately follows the current position is not the beginning of the string, so \G can no longer match there.
The regex becomes:
(?:vs|\G(?!\A))([A-Z]\d)
It may sound strange that we used (?!\A) to negate the anchor \A.
As it turns out, (?<!\A) would also have worked.
We'll explore this in the section about anchors within a lookaround.
^cat|^mouse
Here we use the beginning-of-string anchor twice, on both sides of an alternation.
This ensures that whichever word we match will be at the beginning of the string.
In contrast, if we used ^cat|mouse, the ^ would only apply to cat.
On the other side of the alternation, mouse is not anchored, so it could match anywhere in the string—perhaps not what we intend.
Another way to ensure the anchor applies to both sides would be to enclose the alternation in a group, as in ^(?:cat|mouse)
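Here is a small sketch in Python contrasting the three forms; the sample strings are assumptions.

```python
import re

both_anchored = re.compile(r'^cat|^mouse')
left_only = re.compile(r'^cat|mouse')
grouped = re.compile(r'^(?:cat|mouse)')

mid_string = 'a mouse ran'
at_start = 'mouse trap'

# Does each pattern find a match? (both_anchored, left_only, grouped)
mid_results = [bool(p.search(mid_string))
               for p in (both_anchored, left_only, grouped)]
start_results = [bool(p.search(at_start))
                 for p in (both_anchored, left_only, grouped)]

print(mid_results, start_results)
```

Only the half-anchored version finds the mouse in the middle of the string.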
Anchor in an Alternation
\bcat(\w+|$)
Here we use the end-of-string anchor $ in an alternation.
After the word boundary \b and the letters cat, the engine must either match some word characters \w+ or be able to assert that the current position is the end of the string.
As a result, the word cat on its own can match, but only at the end of the string.
In contrast, catch can match anywhere in the string.
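A quick way to check this behavior is the following Python sketch (standard re module, made-up sample string).

```python
import re

subject = "cat catch cat"  # hypothetical sample string

# The first "cat" is followed by a space: neither \w+ nor $ can match there.
# "catch" matches through \w+, and the final "cat" matches through $.
matches = [m.group(0) for m in re.finditer(r"\bcat(\w+|$)", subject)]

print(matches)  # ['catch', 'cat']
```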
Anchor Within a Lookaround
\w+(?=,|$)
Here we use the end-of-string anchor $ within a lookahead.
The engine matches word characters, then asserts that what follows is either a comma or the end of the string.
In the string one apple, two peaches, three plums, this regex would match the fruits but not the numbers.
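Here is the same example as a small Python sketch, assuming the standard re module.

```python
import re

subject = "one apple, two peaches, three plums"

# Each word must be followed by a comma or by the end of the string
fruits = re.findall(r"\w+(?=,|$)", subject)

print(fruits)  # ['apple', 'peaches', 'plums']
```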
It's worth taking a moment to examine the meaning of an anchor within a lookaround.
When the engine is standing at the beginning of the string, you could say that what immediately follows or precedes this position is also the beginning of the string.
If immediately refers to an infinitely small distance to the left or to the right, that is indeed the case.
This is a matter of semantics and perspective, of course, but it's a perspective that the engine takes on board: ^, (?<=^) and (?=^) all match at the beginning of a string.
The same applies to other anchors: whenever an anchor is within a lookaround, the meaning of the anchor is the same as if the lookaround weren't there at all.
This is convenient for several reasons:
We can use a negative lookahead or a negative lookbehind to assert that the current position does not correspond to an anchor.
For instance, (?<!^) checks that the current position is not the beginning of the string.
We can place an anchor in an alternation within a lookaround to make a complex delimiter, as in the \w+(?=,|$) example a few paragraphs above.
To take another example, (?<=>>|^) ensures that what precedes the current position is either two "greater than" characters >, or—assuming we're in multiline mode—the beginning of a line.
This could be useful, for instance, in parsing the text of an email.
Note that as in any alternation, the order of tokens is important: (?<=^|>>) matches at the beginning of the string if possible, or past >> if not.
A few lines above, the priority was reversed.
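The equivalence of ^, (?<=^) and (?=^) can be checked with a short Python sketch. The helper name positions is made up for this illustration. Only the ^ variants are shown, since Python's re module insists on fixed-width lookbehinds.

```python
import re

subject = "cat"

def positions(pattern):
    """Return the offsets of every zero-width match of pattern in subject."""
    return [m.start() for m in re.finditer(pattern, subject)]

plain = positions(r"^")               # the bare anchor
in_lookahead = positions(r"(?=^)")    # anchor inside a lookahead
in_lookbehind = positions(r"(?<=^)")  # anchor inside a lookbehind
negated = positions(r"(?<!^)")        # every position except the start

print(plain, in_lookahead, in_lookbehind)  # [0] [0] [0]
print(negated)                             # [1, 2, 3]
```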
\b(\w+)\b\s+\1\b
matches repeated words, such as regex regex, because the parentheses in (\w+) capture a word to Group 1 then the back-reference \1 tells the engine to match the characters that were captured by Group 1.
Yes, capture groups and back-references are easy and fun.
But when it comes to numbering and naming, there are a few details you need to know, otherwise you will sooner or later run into situations where capture groups seem to behave oddly.
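As a reminder of how the capture and the back-reference cooperate, here is a minimal Python sketch of the repeated-word regex (standard re module, made-up sample sentence).

```python
import re

subject = "this is a regex regex test"  # hypothetical sample

m = re.search(r"\b(\w+)\b\s+\1\b", subject)

print(m.group(0))  # regex regex
print(m.group(1))  # regex
```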
([A-Z]_)+
, it never becomes Group 2.
For good and for bad, for all times eternal, Group 2 is assigned to the second capture group from the left of the pattern as you read the regex.
What happens to the number of the first group when it gets captured multiple times? It remains Group 1.
The Returned Value for a Given Group is the Last One Captured
Since a capture group with a quantifier holds on to its number, what value does the engine return when you inspect the group? All engines return the last value captured.
For instance, if you match the string A_B_C_D_ with ([A-Z]_)+, when you inspect the match, Group 1 will be D_.
With the exception of the .NET engine, all intermediate values are lost.
In essence, Group 1 gets overwritten each time its pattern is matched.
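In Python, which is one of the engines that discard intermediate captures, the overwriting can be observed with this small sketch.

```python
import re

m = re.match(r"([A-Z]_)+", "A_B_C_D_")

print(m.group(0))       # A_B_C_D_  (the overall match)
print(m.group(1))       # D_        (only the last capture survives)
print(len(m.groups()))  # 1         (still a single group)
```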
The .NET Exception: Capture Collections
As far as I know, the only engine that doesn't throw away intermediate captures is the .NET engine, available for instance through C# and VB.NET.
In the above example, when you inspect the match and request the value for Group 1, C# will also return D_.
But it will also make available to you a CaptureCollection object that stores all the values captured for Group 1 along the way.
To see how this works, see the CaptureCollection section of the C# page.
Perl, PHP, R, Python: Group Numbering with Subroutines and Recursion
Some engines—such as Perl, PCRE (PHP, R, Delphi…) and Matthew Barnett's regex module for Python—allow you to repeat a part of a pattern (a subroutine) or the entire pattern (recursion).
For instance, ([A-Z])_(?1) could be used to match A_B, as (?1) repeats the pattern inside the Group 1 parentheses, i.e. [A-Z].
The subroutine should be considered as a function call: in a sense, it has its own "local variable", i.e. its own version of Group 1.
Likewise, each depth level of a recursion has its own version of Group 1 (and therefore no matter how many times you recurse, Group 1 is always Group 1 for a given depth).
What this means is that when ([A-Z])_(?1) is used to match A_B, the Group 1 value returned by the engine is A.
It also means that (([A-Z])\2)_(?1) will match AA_BB (Group 1 will be AA and Group 2 will be A).
Whatever Group 1 values were used in the subroutine or recursion are discarded.
In PCRE but not Perl, one interesting twist is that the "local" version of a Group in a subroutine or recursion starts out with the value set at the next depth level up the ladder, until it is overwritten.
This means that in PCRE, ([A-Z]\2?)([A-Z])_(?1) would match AB_CB (but not in Perl).
I covered this point in the section on group contents and numbering in recursive patterns.
Perl, PHP, R: Group Numbering in Pre-Defined Subroutines
Perl and PCRE (PHP, R…) allow you to pre-define and name a subpattern.
This allows you to build beautifully modular expressions.
The main syntax page explains the (?(DEFINE) … ) syntax in detail, but we'll look at a short example to refresh our memory.
There is also a beautiful example on the page on matching numbers in plain English.
(?(DEFINE)(?<CAPS>[A-Z]+)) lets you define a subpattern called CAPS that matches uppercase letters.
Thereafter, you can drop (?&CAPS) anywhere in your expression to match upper-case letters.
How does group numbering work with these defined subpatterns?
You should think of these defined subpatterns as function calls: capture groups used by a subpattern won't be available outside.
To complicate matters, the subroutine itself is assigned a number, so remember to count it when counting group numbers from left to right.
For instance, in (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS), the subpattern (?<TWOCAPS>([A-Z])\2) counts as Group 1 and can in fact be called with (?1) instead of (?&TWOCAPS).
In turn, ([A-Z]) counts as Group 2, so that the entire TWOCAPS pattern matches two identical upper-case letters.
However, the regex (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS)\2 will fail on AAA because the final \2 is outside the subroutine and therefore refers to a group that has not been set at that depth.
By contrast, (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS)(?2) would match AAB, as (?2) refers to the pattern of Group 2, i.e. [A-Z].
Remember that we number groups from left to right: therefore, after the TWOCAPS definition, the next available group number is 3.
As a result, (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS)([BC])\3 will happily match AABB, as Group 3 ([BC]) captured the first B.
Branch Reset (Perl and PCRE, e.g. PHP and R)
There are two exceptions to the strict left-to-right numbering.
One is the numbering with duplicated group names in .NET.
The other is branch reset syntax (?|…(this)…|…(that)…), available in Perl and PCRE.
A branch reset is introduced by (?|.
In the pseudo-example above, the groups capturing (this) and (that) would be assigned the same number.
For details and examples, see the Branch Reset section of the main regex syntax page.
Duplicating Group Names
In .NET, PCRE (C, PHP, R…), Perl and Ruby, you can use the same group name at various places in your pattern.
(In PCRE you need to use the (?J) modifier or PCRE_DUPNAMES option.) In these engines, this regex would be valid:
:(?<token>\d+)|(?<token>\d+)#
This particular example could be handled by branch reset syntax (supported by Perl and PCRE), but in more complex constructions the feature can come in handy.
In PCRE, Perl and Ruby, the two groups still get numbered from left to right: the leftmost is Group 1, the rightmost is Group 2.
In .NET, there is only one group number—Group 1.
This is one of the two exceptions to capture groups' left-to-right numbering, the other being branch reset.
If the group matches in several places, all captures get added to the capture collection for Group 1 (or named group token).
See the section on named group reuse.
Conclusion on Group Numbering
Once you understand the strict left-to-right numbering of capture groups, much potential confusion about "which group should capture what" melts away.
This numbering mode may seem annoying when it stands in the way of intricate logic you would love to inject in your regex.
But it's simple and consistent, and that wins the day.
The two following sections present examples that illustrate the main traps of group numbering we just saw, hopefully helping to drill them in.
(\d)+
When I was a young dinosaur, I romantically hoped that if I applied that regex to the string 1234, the engine would eat each of the digits, one at a time, capturing them into four groups that could later be referenced using \1, \2, \3 and \4.
Was I deluded!
Things don't work that way—although alone among the crowd, .NET capture collections offer something nearly identical.
The capturing parentheses you see in a pattern only capture a single group.
So in (\d)+, capture groups do not magically mushroom as you travel down the string.
Rather, they repeatedly refer to Group 1, Group 1, Group 1… If you try this regex on 1234 (assuming your regex flavor even allows it), Group 1 will contain 4, i.e. the last capture.
In essence, Group 1 gets overwritten every time the regex iterates through the capturing parentheses.
The same happens if you use recursive patterns instead of quantifiers.
And the same happens if you use a named group: (?P<MyDigit>\d)+
The group named MyDigit gets overwritten with each digit it captures.
That is less surprising, and this scenario helps explain the first, because the two phrasings are equivalent.
It may not jump out at you from the raw symbols, but in the regex (\d)+, the set of parentheses refers to a specific group, Group 1, even though that group is not explicitly named, as it is in (?P<MyDigit>\d)+.
The name is implied.
If you were hoping to use a quantifier to spawn multiple captures as the engine travels down the string, forget about it—unless you use .NET capture collections, a feature that lets you inspect the successive captures made by a quantified group.
But there's always a solution.
For instance, the "match all" feature of your language lets you break down the string into chunks, each with its set of captures, which you can put back together if need be.
For instance, you could use:
C#: Matches() or iterate with Match() and NextMatch()
Python: finditer or findall
PHP: preg_match_all()
Java: matcher() with while… find()
JavaScript: match() or iterate with exec()
Perl: $subject =~ m!
Ruby: subject.scan
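To make the Python entry concrete, here is a minimal sketch contrasting the quantified group with the "match all" approach. It reuses the MyDigit group name from above and assumes the standard re module.

```python
import re

subject = "1234"

# Quantified group: one match, and the group keeps only the last digit
last_only = re.match(r"(?P<MyDigit>\d)+", subject).group("MyDigit")

# "Match all": one match per digit, each with its own capture
all_digits = [m.group("MyDigit")
              for m in re.finditer(r"(?P<MyDigit>\d)", subject)]

print(last_only)   # 4
print(all_digits)  # ['1', '2', '3', '4']
```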
var catRegex = new Regex("cat", RegexOptions.IgnoreCase);
Perl
Apart from the (?i) inline modifier, Perl lets you add the i flag after your pattern's closing delimiter.
For instance, you can use:
if ($the_subject =~ m/cat/i) { … }
PCRE (C, PHP, R…)
Note that in PCRE, to use case-insensitive matching with non-English letters that aren't part of your locale, you'll have to turn on Unicode mode—for instance with the (*UTF8) special start-of-pattern modifier.
Apart from the (?i) inline modifier, PCRE lets you set the PCRE_CASELESS mode when calling the pcre_compile() (or similar) function:
cat_regex = pcre_compile( "cat", PCRE_CASELESS,
&error, &erroroffset, NULL );
In PHP, the PCRE_CASELESS option is passed via the i flag, which you can add in your regex string after the closing delimiter.
For instance, you can use:
$cat_regex = '~cat~i';
In R, the PCRE_CASELESS option is passed via the ignore.case=TRUE option.
For instance, you can use:
grep("cat", subject, perl=TRUE, value=TRUE, ignore.case=TRUE);
Python
Apart from the (?i) inline modifier, Python has the IGNORECASE option.
For instance, you can use:
cat_regex = re.compile("cat", re.IGNORECASE)
Java
Apart from the (?i) inline modifier, Java has the CASE_INSENSITIVE option.
For instance, you can use:
Pattern catRegex = Pattern.compile( "cat",
Pattern.CASE_INSENSITIVE |
Pattern.UNICODE_CASE );
The UNICODE_CASE option added here ensures that the case-insensitivity feature is Unicode-aware.
If you're only working with ASCII, you don't have to use it.
JavaScript
In JavaScript, your only option is to add the i flag after your pattern's closing delimiter.
For instance, you can use:
var catRegex = /cat/i;
Ruby
Apart from the (?i) inline modifier, Ruby lets you add the i flag after your pattern's closing delimiter.
For instance, you can use:
cat_regex = /cat/i
if ($the_subject =~ m/BEGIN .*? END/s) { … }
PCRE (C, PHP, R…)
Apart from the (?s) inline modifier, PCRE lets you set the PCRE_DOTALL mode when calling the pcre_compile() (or similar) function:
block_regex = pcre_compile( "BEGIN .*? END", PCRE_DOTALL,
&error, &erroroffset, NULL );
In PHP, the PCRE_DOTALL option is passed via the s flag, which you can add in your regex string after the closing delimiter.
For instance, you can use:
$block_regex = '~BEGIN .*? END~s';
Python
Apart from the (?s) inline modifier, Python has the DOTALL option.
For instance, you can use:
block_regex = re.compile("BEGIN .*? END", re.IGNORECASE | re.DOTALL)
Java
Apart from the (?s) inline modifier, Java has the DOTALL option.
For instance, you can use:
Pattern blockRegex = Pattern.compile( "BEGIN .*? END",
Pattern.CASE_INSENSITIVE |
Pattern.DOTALL );
Ruby: (?m) modifier and m flag
In Ruby, you can use the inline modifier (?m), for instance in (?m)BEGIN .*? END.
This is an odd Ruby quirk as other engines use (?m) for the "^ and $ match on every line" mode.
See the section on inline modifiers for juicy details about three additional features: turning it on in mid-string, turning it off with (?-m), or applying it only to the content of a non-capture group with (?m:foo)
Ruby also lets you add the m flag at the end of your regex string.
For instance, you can use:
block_regex = /BEGIN .*? END/m
Origins of DOTALL
The single-line mode is also often called DOTALL (which stands for "dot matches all") because of the PCRE_DOTALL option in PCRE, the re.DOTALL option in Python and the Pattern.DOTALL option in Java.
I've heard it claimed several times that "DOTALL is a Python thing" but this seemed to come from people who hadn't heard about the equivalent options in PCRE and Java.
Still this made me wonder: where did DOTALL appear first? Looking at the PCRE Change Log and old Python documentation, it seems that it appeared in PCRE with version 0.96 (October 1997), in Python with version 1.5 (February 1998), then in Java 1.4 (February 2002).
The gap between the PCRE and Python introductions wasn't conclusive—the word might have been in circulation in earlier beta versions, or even in other tools—so I asked Philip Hazel (the father of PCRE) about it.
He replied:
I believe I invented it — I certainly had not seen it elsewhere when I was trying to think of a name for the PCRE option that corresponds to Perl's /s option. ("S" there stands for "single-line" (…) so I wanted a better name.)
So there. Those who like a bit of history might enjoy this tasty nugget.
if ($the_subject =~ m/^cat/m) { … }
PCRE (C, PHP, R…)
Apart from the (?m) inline modifier, PCRE lets you set the PCRE_MULTILINE mode when calling the pcre_compile() (or similar) function:
cat_regex = pcre_compile( "^cat",
PCRE_CASELESS | PCRE_MULTILINE,
&error, &erroroffset, NULL );
In PHP, the PCRE_MULTILINE option is passed via the m flag, which you can add in your regex string after the closing delimiter.
For instance, you can use:
$cat_regex = '~^cat~m';
Python
Apart from the (?m) inline modifier, Python has the MULTILINE option.
For instance, you can use:
cat_regex = re.compile("^cat", re.IGNORECASE | re.MULTILINE)
Java
Apart from the (?m) inline modifier, Java has the MULTILINE option.
For instance, you can use:
Pattern catRegex = Pattern.compile( "^cat",
Pattern.CASE_INSENSITIVE |
Pattern.MULTILINE );
JavaScript
In JavaScript, your only option is to add the m flag after your pattern's closing delimiter.
For instance, you can use:
var catRegex = /^cat/m;
^(?=(?!(.)\1)([^\DO:105-93+30])(?-1)(?<!\d(?<=(?![5-90-3])\d))).[^\WHY?]$
Luckily, many engines support a free-spacing mode that allows you to aerate your regex.
For instance, you can add spaces between the tokens.
In PHP, you could write this—note the x flag after the final delimiter ~:
$word_with_digit_and_cap_regex = '~ ^ (?=\D*\d) \w*[A-Z]\w* $ ~x';
But why stay on one line? You can spread your regex over as many lines as you like—indenting and adding comments—which are introduced by a #.
For instance, in C# you can do something like this:
var wordWithDigitAndCapRegex = new Regex(
@"(?x) # Free-spacing mode
^ # Assert that position = beginning of string
######### Lookahead ##########
(?= # Start lookahead
\D* # Match any non-digits
\d # Match one digit
) # End lookahead
######## Matching Section ########
\w* # Match any word chars
[A-Z] # Match one upper-case letter
\w* # Match any word chars
$ # Assert that position = end of string
");
This mode is called free-spacing mode.
You may also see it called whitespace mode, comment mode or verbose mode.
It may be overkill in a simple regex like the one above (although anyone who has to maintain your code will thank you for it).
But if you're building a serious regex pattern like the one in the trick to match numbers in plain English… Unless you're a masochist, you have no choice.
Note that inside a character class, the space character and the # (which otherwise introduces comments) are still honored—except in Java, where they both need to be escaped if you mean to match these characters.
For several engines, there are two ways of turning the free-spacing mode on: as an inline modifier or as an option in the regex method or function.
Free-spacing mode is wonderful, but there are a couple of minor hazards you should be aware of, as they may leave you scratching your head wondering why a pattern is not working as you expect.
Trip Hazard #1: The Meaning of Space
First, you can no longer use Number: \d+ to match a string such as Number: 24.
The reason is that the space in : \d no longer matches a space.
We're in free-spacing mode, remember? That's the whole point.
To match a space character, you need to specify it.
The two main ways to do so are to place it inside a character class, or to escape it with a backslash.
Either of those would work: Number:[ ]\d+ or Number:\ \d+
Of course Number:\s\d+ would also match, but remember that \s matches much more than a space character.
For instance, it could match a tab or a line break.
This may not be what you want.
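This hazard is easy to reproduce in Python, where free-spacing mode is the re.VERBOSE option. The sketch below assumes a sample string with one space after the colon.

```python
import re

subject = "Number: 24"  # hypothetical sample, one space after the colon

# The bare space in the pattern is ignored, so the regex is really Number:\d+
ignored = re.search(r"Number: \d+", subject, re.VERBOSE)

# A space inside a character class, or an escaped space, is honored
in_class = re.search(r"Number:[ ]\d+", subject, re.VERBOSE)
escaped = re.search(r"Number:\ \d+", subject, re.VERBOSE)

print(ignored)            # None
print(in_class.group(0))  # Number: 24
print(escaped.group(0))   # Number: 24
```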
Trip Hazard #2: Late Start
Second, you may get overconfident in the power of free-spacing and try something like this in order to let the regex stand on its own:
var wordWithDigitAndCapRegex = new Regex(@"
(?x) # Free-spacing mode
^ # Beginning of string
etc # Match the literal chars e,t,c
");
The problem with this is that although it may look as though the free-spacing modifier (?x) is the first thing in your regex, it is not.
After the opening double-quote ", we have a line break and a number of spaces.
The engine tries to match those, because at that stage we are not yet in free-spacing mode.
That mode is turned on only when we encounter (?x).
This regex will never match the string etc and more, because by the time we encounter the beginning of string anchor ^, we're supposed to already have matched a line break and space characters!
This is why if you look at the first example, you will see that the free-spacing modifier (?x) is the very first thing after the opening quote character.
Whitespace is not just trimmed out of the pattern
Even though whitespace is ignored, the position of a whitespace still separates the previous token from the next.
For instance,
(A)\1 2 is not the same as (A)\12.
The former matches AA2, the latter matches A\n in .NET, PCRE, Perl and Ruby (12 is the octal code for the linefeed character)
\p{Nd} is valid, but \p{N d} is not—except in Perl and Ruby
JavaScript
JavaScript does not support free-spacing mode.
In JavaScript, to match any character including line breaks, use a construct such as [\D\d].
This character class matches one character that is either a non-digit \D or a digit \d.
Therefore it matches any character.
Another JavaScript solution is to use the XRegExp regex library.
If you've got infinite time on your hands, you can also try porting PCRE to JavaScript using Emscripten, as Firas seems to have done on regex 101.
Inline Modifier (?x)
In .NET, PCRE (C, PHP, R…), Perl, Python, Java and Ruby (but not JavaScript), you can use the inline modifier (?x). For instance, this is an aerated regex to match repeated words:
(?x) (\w+) [ \r\n]+ \1\b
Also see the section on inline modifiers.
.NET
Apart from the (?x) inline modifier, .NET languages have the IgnorePatternWhitespace option.
For instance, in C# you can use:
var repeatedWordRegex = new Regex(@"
(\w+) [ \r\n]+ \1\b",
RegexOptions.IgnorePatternWhitespace
);
Perl
Apart from the (?x) inline modifier, Perl lets you add the x flag after your pattern's closing delimiter.
For instance, you can use:
if ($the_subject =~ m/(\w+) [ \r\n]+ \1\b/x) { … }
PCRE (C, PHP, R…)
Apart from the (?x) inline modifier, PCRE lets you set the PCRE_EXTENDED mode when calling the pcre_compile() (or similar) function:
repeated_word_regex = pcre_compile( "(\\w+) [ \\r\\n]+ \\1\\b",
PCRE_EXTENDED,
&error, &erroroffset, NULL );
In PHP, the PCRE_EXTENDED option is passed via the x flag, which you can add in your regex string after the closing delimiter.
For instance, you can use:
$repeated_word_regex = '~(\w+) [ \r\n]+ \1\b~x';
Python
Apart from the (?x) inline modifier, Python has the VERBOSE option.
For instance, you can use:
repeated_word_regex = re.compile(r"(\w+) [ \r\n]+ \1\b", re.VERBOSE)
Java
Unlike in other engines, inside a Java character class hashes introduce comments and spaces are ignored, so you need to escape them if you want to use these characters in a class, e.g. [\#\ ]+
Apart from the (?x) inline modifier, Java has the COMMENTS option.
For instance, you can use:
Pattern repeatedWordRegex = Pattern.compile(
"(\\w+) [ \\r\\n]+ \\1\\b",
Pattern.COMMENTS );
Ruby
Apart from the (?x) inline modifier, Ruby lets you add the x flag at the end of your regex string.
For instance, you can use:
repeated_word_regex = /(\w+) [ \r\n]+ \1\b/x
Lookaround | Name | What it Does |
---|---|---|
(?=foo) | Lookahead | Asserts that what immediately follows the current position in the string is foo |
(?<=foo) | Lookbehind | Asserts that what immediately precedes the current position in the string is foo |
(?!foo) | Negative Lookahead | Asserts that what immediately follows the current position in the string is not foo |
(?<!foo) | Negative Lookbehind | Asserts that what immediately precedes the current position in the string is not foo |
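The four rows of the table can be tried out with a short Python sketch. The sample string and the helper name starts are made up for this illustration.

```python
import re

subject = "foobar barfoo"  # hypothetical sample string

def starts(pattern):
    """Return the start offset of every match of pattern in subject."""
    return [m.start() for m in re.finditer(pattern, subject)]

followed_by_foo = starts(r"(?=foo)")        # positions where foo comes next
preceded_by_foo = starts(r"(?<=foo)")       # positions right after foo
bar_not_before_foo = starts(r"bar(?!foo)")  # bar not followed by foo
bar_not_after_foo = starts(r"(?<!foo)bar")  # bar not preceded by foo

print(followed_by_foo)     # [0, 10]
print(preceded_by_foo)     # [3, 13]
print(bar_not_before_foo)  # [3]
print(bar_not_after_foo)   # [7]
```

Note that the first two patterns match positions only: they are zero-width, so each match is an empty string.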
\A(?=\w{6,10}\z)
So far, we have an expression that validates that a string is entirely composed of six to ten word characters.
Note that we haven't matched any of these characters yet: we have only looked ahead.
The current position after the lookahead is still the beginning of the string.
To check the other conditions, we just add lookaheads.
Condition 2
For our second condition, we need to check that the password contains one lowercase letter.
To find one lowercase letter, the simplest idea is to use .*[a-z].
That works, but the dot-star first shoots down to the end of the string, so we will always need to backtrack.
Just for the sport, can we think of something more efficient? You might think of making the star quantifier reluctant by adding a ?, giving us .*?[a-z], but that too requires backtracking as a lazy quantifier requires backtracking at each step.
For this type of situation, I recommend you use something like [^a-z]*[a-z] (or even better, depending on your engine, the atomic (?>[^a-z]*)[a-z] or possessive version [^a-z]*+[a-z]—but we'll discuss that in the footnotes).
The negated character class [^a-z] is the counterclass of the lowercase letter [a-z] we are looking for: it matches one character that is not a lowercase letter, and the * quantifier makes us match zero or more such characters.
The pattern [^a-z]*[a-z] is a good example of the principle of contrast recommended by the regex style guide.
Let's use this pattern inside a lookahead: (?=[^a-z]*[a-z])
The lookahead asserts: at this position in the string (i.e., the beginning of the string), we can match zero or more characters that are not lowercase letters, then we can match one lowercase letter: [a-z]
Our pattern becomes:
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])
At this stage, we have asserted that we are at the beginning of the string, and we have looked ahead twice.
We still haven't matched any characters.
Note that on a logical level it doesn't matter which condition we check first.
If we swapped the order of the lookaheads, the result would be the same.
We have two more conditions to satisfy: two more lookaheads.
Condition 3
For our third condition, we need to check that the password contains at least three uppercase letters.
The logic is similar to condition 2: we look for an optional number of non-uppercase letters, then one uppercase letter… But we need to repeat that three times, for which we'll use the quantifier {3}.
We'll use this lookahead: (?=(?:[^A-Z]*[A-Z]){3})
The lookahead asserts: at this position in the string (i.e., the beginning of the string), we can do the following three times: match zero or more characters that are not uppercase letters (the job of the negated character class [^A-Z] with the quantifier *), then match one uppercase letter: [A-Z]
Our pattern becomes:
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})
At this stage, we have asserted that we are at the beginning of the string, and we have looked ahead three times.
We still haven't matched any characters.
Condition 4
To check that the string contains at least one digit, we use this lookahead: (?=\D*\d).
Opposing \d to its counterclass \D makes good use of the regex principle of contrast.
The lookahead asserts: at this position in the string (i.e., the beginning of the string), we can match zero or more characters that are not digits (the job of the "not-a-digit" character class \D and the * quantifier), then we can match one digit: \d
Our pattern becomes:
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)
At this stage, we have asserted that we are at the beginning of the string, and we have looked ahead four times to check our four conditions.
We still haven't matched any characters, but we have validated our string: we know that it is a valid password.
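The "validated but nothing matched" state can be observed in Python. One assumption to flag: this sketch writes the end-of-string anchor as \Z, which is how Python's re module spells what this page writes as \z. The sample passwords are made up.

```python
import re

# Four lookaheads only: the pattern validates without consuming anything
validator = re.compile(
    r"\A(?=\w{6,10}\Z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)"
)

ok = validator.match("aBCD12")   # meets all four conditions
bad = validator.match("abcd12")  # no uppercase letters

print(repr(ok.group(0)), ok.end())  # '' 0
print(bad)                          # None
```

The match object exists, which tells us the password is valid, yet the matched text is empty and the engine is still at offset 0.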
If all we wanted was to validate the password, we could stop right there.
But if for any reason we also need to match and return the entire string—perhaps because we ran the regex on the output of a function and the password's characters haven't yet been assigned to a variable—we can easily do so now.
Matching the Validated String
After checking that the string conforms to all four conditions, we are still standing at the beginning of the string.
The five assertions we have made (the anchor \A and the four lookaheads) have not changed our position.
At this stage, we can use a simple .* to gobble up the string: we know that whatever characters are matched by the dot-star, the string is a valid password.
The pattern becomes:
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*
Fine-Tuning: Removing One Condition
For n conditions, use n-1 lookaheads
If you examine our lookaheads, you may notice that the pattern \w{6,10}\z inside the first one examines all the characters in the string.
Therefore, we could have used this pattern to match the whole string instead of the dot-star .*
This allows us to remove one lookahead and to simplify the pattern to this:
\A(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)\w{6,10}\z
The pattern \w{6,10}\z now serves the double purpose of matching the whole string and of ensuring that the string is entirely composed of six to ten word characters.
Generalizing this result, if you must check for n conditions, your pattern only needs to include n-1 lookaheads at the most.
Often, you are even able to combine several conditions into a single lookahead.
You may object that we were able to use \w{6,10}\z because it happened to match the whole string.
Indeed that was the case.
But we could also have converted any of the other three lookaheads to match the entire string.
For instance, taking the lookahead (?=\D*\d) which checks for the presence of one digit, we can add a simple .*\z to get us to the end of the string.
The pattern would have become:
\A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})\D*\d.*\z
By the way, you may wonder why I bother using the \z after the .*: shouldn't it get me to the end of the string? In general, not so: unless we're in DOTALL mode, the dot doesn't match line breaks.
Therefore, the .* only gets you to the end of the first line.
After this, the string may have line breaks and many more lines.
A \z anchor ensures that after the .* we have reached not only the end of the line, but also the end of the string.
In this particular pattern, the first lookaround (?=\w{6,10}\z) already ensures that there cannot be any line breaks in the string, so the final \z is not strictly necessary.
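As a sanity check that the three versions of the pattern agree, here is a Python sketch that runs them side by side. It assumes Python's \Z spelling for the end-of-string anchor written \z on this page. The variable names and sample passwords are made up for illustration.

```python
import re

# Four lookaheads, then .* to match the string
four_lookaheads = re.compile(
    r"\A(?=\w{6,10}\Z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*"
)
# Three lookaheads, with \w{6,10} doing double duty
three_lookaheads = re.compile(
    r"\A(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)\w{6,10}\Z"
)
# Three lookaheads, with the digit condition doing the matching
digit_variant = re.compile(
    r"\A(?=\w{6,10}\Z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})\D*\d.*\Z"
)

samples = ["aBCD12", "ABC1de", "abcd12", "aBCD1", "aBCDef", "aBCDefghi12"]
patterns = (four_lookaheads, three_lookaheads, digit_variant)

# For each sample, record the verdict of each of the three patterns
verdicts = {s: [bool(p.match(s)) for p in patterns] for s in samples}

for s, v in verdicts.items():
    print(s, v)
```

Only the first two samples should pass. The others fail for lack of uppercase letters, length, a missing digit, and excessive length respectively.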
(?!Q)\w
After the negative lookahead asserts that what follows the current position is not a Q, the \w matches a word character.
Not only is this solution easy to read, it is also easy to maintain if we ever decide to exclude the letter K instead of Q, or to exclude both: (?![QK])\w
Note that we can also perform the same exclusion task with a negative lookbehind:
\w(?<!Q)
After the \w matches a word character, the negative lookbehind asserts that what precedes the current position is not a Q.
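Both forms of the exclusion can be compared in a few lines of Python (standard re module, made-up sample word).

```python
import re

subject = "QUICK"  # hypothetical sample

with_lookahead = re.findall(r"(?!Q)\w", subject)    # look before matching
with_lookbehind = re.findall(r"\w(?<!Q)", subject)  # look back after matching
without_q_or_k = re.findall(r"(?![QK])\w", subject)

print(with_lookahead)   # ['U', 'I', 'C', 'K']
print(with_lookbehind)  # ['U', 'I', 'C', 'K']
print(without_q_or_k)   # ['U', 'I', 'C']
```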
Using the same idea, if we wanted to match one character in the Arabic script as long as it is not a number, we could use this pattern:
(?!\p{N})\p{Arabic}
This would work in Perl, PCRE (C, PHP, R…) and Ruby 2+.
In .NET and Java, you would use (?!\p{N})\p{IsArabic}
Likewise, we can use this technique to perform a DIY character class intersection.
For instance, to match one character in the Arabic script as long as it is a number, we transform the negative lookahead above to a positive lookahead.
In the Perl / PCRE / Ruby version, this gives us:
(?=\p{N})\p{Arabic}
This is basically the password validation technique with two conditions applied to a single character.
Needless to say, you can interchange the content of the lookahead with the token to be matched:
(?=\p{Arabic})\p{N}
Tempering the scope of a token
This use is similar to the last.
Instead of removing characters from a class, it restricts the scope within which a token is allowed to match.
For instance, suppose we want to match any character as long as it is not followed by {END}.
Using a negative lookahead, we can use:
(?:(?!{END}).)*
Each . token is tempered by (?!{END}), which specifies that the dot cannot be the beginning of {END}.
This technique is discussed on the Quantifiers page.
Another technique is:
(?:[^{]++|{(?!END}))*+
On the left side of the alternation, [^{]++ matches characters that are not an opening brace.
On the right side, {(?!END}) matches an opening brace that is not followed by END}.
This technique also appears on the Quantifiers page.
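Here is a hedged Python sketch of both techniques. Two liberties are taken: the braces are escaped to keep them unambiguously literal, and the possessive quantifiers are replaced with plain ones because Python's re module only accepts ++ and *+ from version 3.11. The plain version matches the same text but loses the protection against backtracking.

```python
import re

subject = "a{b}c{END}tail"  # hypothetical sample string

# Tempered dot: each character must not be the start of {END}
tempered = re.match(r"(?:(?!\{END\}).)*", subject).group(0)

# Alternation: non-brace characters, or a brace not followed by END}
alternation = re.match(r"(?:[^{]+|\{(?!END\}))*", subject).group(0)

print(tempered)     # a{b}c
print(alternation)  # a{b}c
```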
Delimiter
Do you have a string where you want to start matching all characters once the first instance of #START# is passed? No problem, just use a lookbehind to make a delimiter:
(?<=#START#).*
After the lookbehind asserts that what immediately precedes the current position is #START#, the dot-star .* matches all the characters to the right.
Or would you like to match all characters in a string up to, but not including the characters #END#? Make a delimiter using a lookahead:
.*?(?=#END#)
You can, of course, combine the two:
(?<=#START#).*?(?=#END#)
See the page on boundaries for advice on building fancy DIY delimiters.
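The three delimiter patterns can be tried in Python as follows. This is only a sketch using the standard re module, and the sample text is invented for the demonstration.

```python
import re

text = 'junk #START#payload#END# more'

# Everything after the first #START#
after_start = re.search(r'(?<=#START#).*', text).group()

# Everything up to, but not including, #END#
before_end = re.search(r'.*?(?=#END#)', text).group()

# Only what sits between the two delimiters
between = re.search(r'(?<=#START#).*?(?=#END#)', text).group()

print(after_start)
print(before_end)
print(between)
```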
Inserting Text at a Position
Someone gave you a file full of film titles in CamelCase, such as HaroldAndKumarGoToWhiteCastle.
To make it easier to read, you want to insert a space at each position between a lowercase letter and an uppercase letter.
This regex matches these exact positions:
(?<=[a-z])(?=[A-Z])
In your text editor's regex replacement function, all you have to do is replace the matches with space characters, and spaces will be inserted in the right spots.
This regex is what's known as a "zero-width match" because it matches a position without matching any actual characters.
How does it work? The lookbehind asserts that what immediately precedes the current position is a lowercase letter.
And the lookahead asserts that what immediately follows the current position is an uppercase letter.
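Outside a text editor, the same replacement can be sketched in Python with re.sub (standard re module assumed):

```python
import re

title = 'HaroldAndKumarGoToWhiteCastle'

# Each match is an empty position between a lowercase and an uppercase letter;
# replacing that position with a space inserts the space there.
spaced = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', title)

print(spaced)  # Harold And Kumar Go To White Castle
```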
Splitting a String at a Position
We can use the exact same regex from the previous example to split the string AppleOrangeBananaStrawberryPeach into a list of fruits.
Once again, the regex
(?<=[a-z])(?=[A-Z])
matches the positions between a lowercase letter and an uppercase letter.
In most languages, when you feed this regex to the function that uses a regex pattern to split strings, it returns an array of words.
Note that Python's re module only splits on zero-width matches from Python 3.7 onward; for earlier versions, the alternate regex module handles them.
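As a quick sketch, here is the split in Python. It assumes Python 3.7 or later, where re.split accepts a pattern that can match the empty string.

```python
import re

basket = 'AppleOrangeBananaStrawberryPeach'

# Split at each position between a lowercase and an uppercase letter.
# Assumption: Python 3.7+, where re.split honors zero-width matches.
fruits = re.split(r'(?<=[a-z])(?=[A-Z])', basket)

print(fruits)
```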
Finding Overlapping Matches
Sometimes, you need several matches within the same word.
For instance, suppose that from a string such as ABCD you want to extract ABCD, BCD, CD and D.
You can do it with this single regex:
(?=(\w+))
When you allow the engine to find all matches, all the substrings will be captured to Group 1.
How does this work?
At the first position in the string (before the A), the engine starts the first match attempt.
The lookahead asserts that what immediately follows the current position is one or more word characters, and captures these characters to Group 1.
The lookahead succeeds, and so does the match attempt.
Since the pattern didn't match any actual characters (the lookahead only looks), the engine returns a zero-width match (the empty string).
It also returns what was captured by Group 1: ABCD
The engine then moves to the next position in the string and starts the next match attempt.
Again, the lookahead asserts that what immediately follows that position is word characters, and captures these characters to Group 1.
The match succeeds, and Group 1 contains BCD.
The engine moves to the next position in the string, and the process repeats itself for CD then D.
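In Python, re.findall returns the content of Group 1 for each match when the pattern has exactly one capture group, which makes this walkthrough easy to reproduce. A minimal sketch:

```python
import re

# Every match is zero-width; what we collect is Group 1 at each position.
overlapping = re.findall(r'(?=(\w+))', 'ABCD')

print(overlapping)  # ['ABCD', 'BCD', 'CD', 'D']
```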
In .NET, which has infinite lookbehind, you can find overlapping matches from the other side of the string.
For instance, on the same string ABCD, consider this pattern:
(?<=(\w+))
It will capture A, AB, ABC and ABCD.
To achieve the same in an engine that doesn't support infinite lookbehind, you would have to reverse the string, use the lookahead version (?=(\w+))
then reverse the captures.
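The reversal workaround might look like this in Python. The helper name prefixes_via_reversal is made up for this sketch; it is not part of any library.

```python
import re

def prefixes_via_reversal(text):
    """Emulate (?<=(\\w+)) in an engine without infinite lookbehind."""
    reversed_text = text[::-1]
    # Lookahead version applied to the reversed string
    captures = re.findall(r'(?=(\w+))', reversed_text)
    # Flip each capture back, and list the shortest one first
    return [capture[::-1] for capture in reversed(captures)]

print(prefixes_via_reversal('ABCD'))  # ['A', 'AB', 'ABC', 'ABCD']
```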
(?<=_(?=\d{2}_))\d+
Wowzy, what does this do? The lookbehind asserts that what immediately precedes the current position in the string is an underscore, then a position where the lookahead (?=\d{2}_) can assert that what immediately follows is two digits and an underscore.
This is interesting for several reasons.
First, we have a lookahead within a lookbehind, and even though we were supposed to look backwards, this lookahead jumps over the current position by matching the two digits and the trailing underscore.
That's acrobatic.
Second, note that even though it looks complex, this is a fixed-width lookbehind (the width is one character, the underscore), so it should work in all flavors of lookbehind.
(However, it does not work in Ruby as Ruby does not allow lookaheads and negative lookbehinds inside lookbehind.)
Another interesting feature is how the notion of "current position in the string" is not the same for the lookbehind and for the lookahead.
You'll remember that lookarounds stand their ground, so that after checking the assertion made by a lookaround, the engine hasn't moved in the string.
Are we breaking that rule?
We're not.
In the string 10 _16_ 20, let's say the engine has reached the position between the underscore and the 1 in 16.
The lookbehind makes an assertion about what can be matched at that position.
When the engine exits the lookbehind, it is still standing in that same spot, and the token \d+ can proceed to match the characters 16.
But within the lookbehind itself, we enter a different little world.
You can imagine that outside that world the engine is red, and inside the little world of the lookbehind, there is another little engine which is yellow.
That yellow engine keeps track of its own position in the string.
In most engines (.NET proceeds differently), the yellow engine is initially dropped at a position in the string that is found by taking the red engine's position and subtracting the width of the lookbehind, which is 1.
The yellow engine therefore starts its work before the leading underscore.
Within the lookbehind's little world, after matching the underscore token, the yellow engine's position in the string is between the underscore and the 1.
It is that position that the lookahead refers to when it asserts that at the current position in the string (according to the little world of the lookbehind and its yellow engine), what immediately follows is two digits and an underscore.
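To try the lookahead-inside-lookbehind pattern, here is a sketch in Python. It assumes that Python's re module treats the nested lookahead as zero-width when it checks that the lookbehind has a fixed width; the second sample string is invented to show a non-match.

```python
import re

pattern = re.compile(r'(?<=_(?=\d{2}_))\d+')

# The number must follow an underscore AND be exactly two digits
# followed by another underscore.
hits = pattern.findall('10 _16_ 20')

# Three digits between underscores: the inner lookahead fails.
misses = pattern.findall('_123_')

print(hits, misses)
```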
After the digits
Here is a second version where the "back-to-the-future lookbehind" comes after the digits:
\d+(?<=_\d{2}(?=_))
The lookbehind states: what immediately precedes this position in the string is an underscore and two digits, then a position where the lookahead (?=_) can assert that what immediately follows the current position in the string (according to the yellow engine and the lookbehind's little world) is an underscore.
This too is a fixed-width lookbehind (the width is three characters, i.e. the leading underscore and the two digits), so it should work in all flavors of lookbehind except Ruby.
\d+(?=_(?!_))
The lookahead asserts: what follows the current position in the string is one underscore, then a position where the negative lookahead (?!_) can assert that what follows is not an underscore.
A less elegant variation would be \d+(?=(?!__)_)
Token preceded by one character, but not more
How can you match a number that is preceded by one underscore, but not more?
You can use this:
(?<=(?<!_)_)\d+
The lookbehind asserts: what precedes the current position in the string is a position where the negative lookbehind (?<!_) can assert that what immediately precedes is not an underscore, then an underscore.
A variation would be (?<=_(?<!__))\d+
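Both "one underscore, but not more" patterns can be checked with a short Python sketch. The sample strings are invented, and the second pattern assumes Python's re module accepts a negative lookbehind nested inside a lookbehind.

```python
import re

# Number followed by exactly one underscore
followed_by_one = re.findall(r'\d+(?=_(?!_))', '55_ 66__')

# Number preceded by exactly one underscore
preceded_by_one = re.findall(r'(?<=(?<!_)_)\d+', '_12 __34 56')

print(followed_by_one)
print(preceded_by_one)
```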
Multiple Compounding
Needless to say, it won't be long until you find occasions to add levels of compounding beyond the two we've just seen.
But that quickly becomes obnoxious, and it becomes simpler to rearrange the regex.
For instance, building on the previous pattern,
(?<=(?<!(?<!X)_)_)\d+
matches a number that is preceded by an underscore that is not preceded by an underscore unless that underscore is preceded by an X.
In .NET, PCRE, Java and Ruby, this could be simplified to (?<=(?<!_)_|X__)\d+
In Perl and Python, you could use (?:(?<=(?<!_)_)|(?<=X__))\d+
_(\w+)\b(?=.*:\1\b)
After matching the underscore, we capture a word to Group 1.
Then the lookahead (?=.*:\1\b) asserts what follows the current position in the string is zero or more characters, then a colon, then the word captured to Group 1.
As hoped, this matches both _dog and _mouse.
Now suppose we try a "reversed" approach:
_(?=.*:(\w+)\b)\1\b
This only matches _mouse.
Why?
First let's try to understand what this regex hopes to accomplish.
It may not be that obvious, but it illustrates an important feature of lookarounds.
After the engine matches the underscore, the lookahead (?=.*:(\w+)\b) asserts that what follows the current position in the string is any number of characters, then a colon, then a word (captured to Group 1).
After passing that assertion, the back-reference \1 matches what was captured into Group 1.
Let's see how this works out.
Remember that our string is
_rabbit _dog _mouse DIC:cat:dog:mouse
After the underscore that precedes rabbit, we expect the lookahead to fail because there is no rabbit in the DIC section—and it does.
The next time we match an underscore is before dog.
At that stage, inside the lookahead (?=.*:(\w+)\b), the dot-star shoots down to the end of the string, then backtracks just far enough to allow the colon to match, after which the word mouse is matched and captured to Group 1.
The lookahead succeeds.
The next token \1 tries to match mouse, but the next character in the string is the d from dog, so the token fails.
At this stage, having learned everything about backtracking, we might assume that the regex engine allows the dot-star to backtrack even more inside the lookahead, up to the previous colon, which would then allow (\w+) to match and capture mouse.
Then the back-reference \1 would match mouse, and the engine would return a successful match.
However, it does not work that way.
Once the regex engine has left a lookaround, it will not backtrack into it if something fails somewhere down the pattern.
On a logical level, that is because the official point of a lookaround is to return one of two values: true or false.
Once a lookahead evaluates to true at a given position in the string, it is always true.
From the engine's standpoint, there is nothing to backtrack.
What would be the point—since the only other available value is false, and that would fail the pattern?
The fact that the engine will not backtrack into a lookaround means that it is an atomic block.
This property of lookarounds will rarely matter, but if someday, in the middle of building an intricate pattern, a lookahead refuses to cooperate… This may be the reason.
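The difference between the two patterns is easy to observe in Python, whose re module also refuses to backtrack into a completed lookahead. A minimal sketch on the same string:

```python
import re

text = '_rabbit _dog _mouse DIC:cat:dog:mouse'

# Capture first, then look ahead for the captured word.
forward = [m.group() for m in re.finditer(r'_(\w+)\b(?=.*:\1\b)', text)]

# Look ahead first: the lookahead always settles on the last word (mouse)
# and is never re-entered when \1 fails.
backward = [m.group() for m in re.finditer(r'_(?=.*:(\w+)\b)\1\b', text)]

print(forward)
print(backward)
```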
(?<=\b\d+_)[A-Z]+
That is because the width of the text matched by the token \d+ can be anything.
Most engines require the width of the subexpression within a lookbehind to be known in advance, as in (?<=\d{3})
Some engines allow the width of the subexpression within a lookbehind to take various pre-determined values found on the various sides of an alternation, as in (?<=0|128|\d{6}).
Yet others allow the width to vary within a pre-determined range, as in (?<=\d{2,6})
For details of what kinds of widths various engines allow in a lookbehind, see the Lookbehind: Fixed-Width / Constrained Width / Infinite Width section of the main syntax page.
To honor the winners, I'll just repeat here that the only two programming-language flavors that support infinite-width lookbehind are .NET (C#, VB.NET, …) and Matthew Barnett's regex module for Python.
I've also implemented an infinite lookbehind demo for PCRE.
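Python's standard re module is one of the engines that insists on a fixed width, which a small sketch can confirm. The helper name compiles is made up for this example.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# \d+ has no fixed width, so the lookbehind is rejected.
variable_ok = compiles(r'(?<=\b\d+_)[A-Z]+')

# \d{3} is always three characters wide, so this one is accepted.
fixed_ok = compiles(r'(?<=\d{3})[A-Z]+')

print(variable_ok, fixed_ok)
```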
Capture Group Inside Variable Lookbehind: Difference between Java and .NET
Both Java and .NET allow this pattern:
(?<=(\d{1,5}))Z
.NET allows it because it supports infinite-width lookbehind.
Java allows it because it supports lookbehind whose width falls within a defined range.
However, they operate differently.
As a result, against the string 123Z, this pattern will return different Group 1 captures in the two engines.
Java captures 3 to Group 1.
The engine sees that the width of the string to be matched inside the lookbehind must fall between one and five characters.
Java tries all the possible fixed-width patterns in the range, from the shortest to the longest, until one succeeds.
The shortest possible fixed-width pattern is (?<=(\d{1})).
The engine temporarily skips back one character in the string, tries to match \d{1} and succeeds.
The lookaround succeeds, and Group 1 contains 3.
.NET captures 123 to Group 1.
The .NET engine has a far more efficient way of processing variable-width lookbehinds.
Instead of trying multiple fixed-width patterns starting at points further and further back in the string, .NET reverses the string as well as the pattern inside the lookbehind, then attempts to match that single pattern on the reversed string.
Therefore, in 123Z, to try the lookbehind at the point before Z, it reverses the portion of string to be tested from 123 to 321.
Likewise, the lookbehind (?<=(\d{1,5})) is flipped into the lookahead (?=(\d{1,5})).
\d{1,5} matches 321.
Reversing that string, Group 1 contains 123.
To only capture 3 as in Java, you would have to make the quantifier lazy: (?<=(\d{1,5}?))Z
Like .NET, the alternate regex module for Python captures 123 to Group 1.
Workarounds
There are two main workarounds to the lack of support for variable-width (or infinite-width) lookbehind:
Capture groups.
Instead of (?<=\b\d+_)[A-Z]+, you can use \b\d+_([A-Z]+), which matches the digits and underscore you don't want to see, then matches and captures to Group 1 the uppercase text you want to inspect.
This will work in all major regex flavors.
The \K "keep out" verb, which is available in Perl, PCRE (C, PHP, R…), Ruby 2+ and Python's alternate regex engine.
\K tells the engine to drop whatever it has matched so far from the match to be returned.
Instead of (?<=\b\d+_)[A-Z]+, you can therefore use \b\d+_\K[A-Z]+
Compared with lookbehinds, both the \K and capture group workarounds have limitations:
When you look for multiple matches in a string, at the starting position of each match attempt, a lookbehind can inspect the characters behind the current position in the string.
Therefore, against 123, the pattern (?<=\d)\d (match a digit preceded by a digit) will match both 2 and 3.
In contrast, \d\K\d can only match 2, as the starting position after the first match is immediately before the 3, and there are not enough digits left for a second match.
Likewise, \d(\d) can only capture 2.
With lookbehinds, you can impose multiple conditions (similar to our password validation technique) by using multiple lookbehinds.
For instance, to match a digit that is preceded by a lower-case Greek letter, you can use (?<=\p{Ll})(?<=\p{Greek})\d.
The first lookbehind (?<=\p{Ll}) ensures that the character immediately to the left is a lower-case letter, and the second lookbehind (?<=\p{Greek}) ensures that the character immediately to the left belongs to the Greek script.
With the workarounds, you could use \p{Greek}\K\d to match a digit preceded by a character in the Greek script (or \p{Greek}(\d) to capture it), but you cannot impose a second condition.
To get over this limitation, you could capture the Greek character and use a second regex to check that it is a lower-case letter.
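The capture group workaround and its first limitation can be sketched in Python. The standard re module does not support \K, so only the capture group variant is shown, and the sample strings are invented.

```python
import re

# Capture group workaround for the variable-width lookbehind
uppercase = re.findall(r'\b\d+_([A-Z]+)', '12_ABC x_DEF 7_GH')

# A lookbehind does not consume the preceding digit...
with_lookbehind = re.findall(r'(?<=\d)\d', '123')

# ...whereas the capture group version consumes it, so 3 is never captured.
with_capture = re.findall(r'\d(\d)', '123')

print(uppercase, with_lookbehind, with_capture)
```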
\A(?=\D*\d)\w+\z
The \A anchor asserts that the current position is the beginning of the string.
The lookahead (?=\D*\d) asserts that at the current position (which is still the beginning of the string), we can match zero or more non-digits, then one digit.
Next, \w+ matches our word.
Finally, the \z anchor asserts that the current position is the end of the string.
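As a sketch, the anchored validation could be wrapped in a small Python function. The function name is made up, and \Z is used because that is how Python's re module has traditionally spelled the end-of-string anchor written \z above.

```python
import re

# \A ... \Z: the whole string must be word characters, and the lookahead
# requires at least one digit somewhere in it.
WORD_WITH_DIGIT = re.compile(r'\A(?=\D*\d)\w+\Z')

def is_word_with_digit(candidate):
    return WORD_WITH_DIGIT.search(candidate) is not None

print(is_word_with_digit('abc1'))   # True
print(is_word_with_digit('abcd'))   # False: no digit
print(is_word_with_digit('ab c1'))  # False: the space is not a word character
```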
Now consider what happens when we forget the anchor \A and use (?=\D*\d)\w+\z.
To make our oversight seem less severe, let's assume we know that our string always contains an uninterrupted string of word characters.
This guarantees that if we find a match, it will have to be the right one—at the beginning of the string, as we wanted.
So what's the problem?
Suppose we use our regex on a string composed of one hundred characters V.
Since the string doesn't contain a single digit, you and I can immediately see that the regex must fail.
Let's see how fast the engine comes to the same conclusion.
As always, the engine begins by trying to match the pattern at the first position in the string.
Starting with the first token (?=\D*\d), it tries to assert that at the current position, i.e. the beginning of the string, it can match zero or more non-digits, then one digit.
Within the subexpression, the \D* matches all the V characters.
The engine then tries to match a digit, but since we have reached the end of the string, that fails.
If we're using a smart engine such as PCRE, at this stage the engine fails the lookaround for this first match attempt.
That's because before starting the match attempt, the engine has studied the pattern and noticed that the \D and \d tokens are mutually exclusive, and it has turned the * quantifier into a possessive quantifier *+, a process known to PCRE as auto-possessification (see footnote).
A less clever engine will backtrack, giving up all the \D characters it has matched one by one, each time attempting to match a \d after giving up a \D.
Eventually, the engine runs out of characters to backtrack, and the lookahead fails.
Once the engine understands that the lookahead must fail (whether it comes to this conclusion cleverly or clumsily), it gives up on the entire first match attempt.
Next, as always in such cases, the engine moves to the next position in the string (past the first V) and starts a new match attempt.
Again, the \D* eats up all the V characters—although this time, there are only 99 of them.
Again, the lookahead fails, either fast if the engine is smart, or, more likely, after backtracking all the way back to the starting position.
After failing a second time, the engine moves past the second V, starts a new match attempt, and fails… And so on, all the way to the end of the string.
Because the pattern is not anchored at the beginning of the string, at each match attempt, the engine checks whether the lookahead matches at the current position.
In doing so, in the best case, it matches 100 V characters, then 99 on the second attempt, and so on—so it needs about 5000 steps before it can see that the pattern will never match.
In the more usual case, the engine needs to backtrack and try the \d at each position, adding two steps at each V position.
Altogether, it needs about 15,000 steps before it can see that the pattern will never match.
In contrast, with the original anchored pattern \A(?=\D*\d)\w+\z, after the engine fails the first match attempt, each of the following match attempts at further positions in the string fail instantly, because the \A fails before the engine gets to the lookahead.
In the best case, the engine takes about 200 steps to fail (100 steps to match all the V characters, then one step at each of the further match attempts.) In the more usual case, the engine takes about 400 steps to fail (300 steps on the first match attempt, then one step at each of the further match attempts.)
Needless to say, the ratio of (15,000 / 400) steps is the kind of performance hit we try to avoid in computing.
This makes a solid case for helping the engine along by minimizing the number of times lookaheads must be attempted, either by using anchors such as ^ and \A, or by matching literal characters immediately before the lookahead.
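The step counts quoted above follow from simple sums. Here is the arithmetic as a rough Python sketch; it is a simplified model that follows the counting used in the text (one step per character matched, two extra steps per character when backtracking), not a measurement of any real engine.

```python
n = 100  # length of the string of V characters

# Unanchored, smart engine: 100 steps, then 99, then 98...
unanchored_best = sum(range(1, n + 1))

# Unanchored, backtracking engine: three steps per character instead of one
unanchored_usual = 3 * unanchored_best

# Anchored: one full attempt, then one failed \A at each later position
anchored_best = n + (n - 1)
anchored_usual = 3 * n + (n - 1)

ratio = unanchored_usual / anchored_usual

print(unanchored_best, unanchored_usual, anchored_best, anchored_usual)
```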
One Exception: Overlapping Matches
There are times when we do want the engine to attempt the lookahead at every single position in the string.
Usually, the purpose of such a maneuver is to match a number of overlapping substrings.
For instance, against the string word, if the regex (?=(\w+)) is allowed to match repeatedly, it will match four times, and each match will capture a different string to Group 1: word, ord, rd, then d.
The section on overlapping matches explains how this works.
.*apple
, the token .* starts out by greedily matching every single character in the string.
The engine then advances to the next token a, which fails to match as there are no characters left in the string.
The engine backtracks into the .*, which gives up the e in apple.
The engine once again advances to the next token, but the a fails to match the e.
The engine again backtracks into the .*, which gives up the l.
The process repeats itself until the .* has given up the a, at which stage the text tokens a, p, p, l and e are all able to match and the overall match is successful.
When you hear that A+ means "one or more A characters", that is therefore not the whole story.
It is "one or more, but as many as possible (greedy), and giving back characters if needed in order to allow the rest of the pattern to match (docile)".
Suppose our entire string is AAA.
Depending on which pattern we use to match the string, the quantified token A+ could end up matching A, AA or AAA.
Consider these three patterns:
A+
—A+ matches AAA (as many as possible).
(A+).
—A+ (captured to Group 1) matches AA, because to allow the dot to match, A+ (which starts out by matching AAA) has to give up one A.
(A+)..
—A+ (captured to Group 1) matches A, because to allow the two dots to match, A+ (which starts out by matching AAA) has to give up two A characters.
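The three patterns can be run side by side in Python to watch A+ give characters back. A minimal sketch:

```python
import re

# Same string, same quantified token, three different outcomes for Group 1
greedy_alone = re.match(r'(A+)', 'AAA').group(1)
one_dot = re.match(r'(A+).', 'AAA').group(1)
two_dots = re.match(r'(A+)..', 'AAA').group(1)

print(greedy_alone, one_dot, two_dots)  # AAA AA A
```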
Lazy: As Few As Possible (shortest match)
In contrast to the standard greedy quantifier, which eats up as many instances of the quantified token as possible, a lazy (sometimes called reluctant) quantifier tells the engine to match as few of the quantified tokens as needed.
As you'll see in the table below, a regular quantifier is made lazy by appending a ? question mark to it.
Since the * quantifier allows the engine to match zero or more characters, \w*?E tells the engine to match zero or more word characters, but as few as needed—which might be none at all—then to match an E.
In the string 123EEE, starting from the very left, "zero or more word characters then an E" could be 123E, 123EE or 123EEE.
Which of these does \w*?E match?
Because the *? quantifier is lazy, \w*? matches as few characters as possible to allow the overall match attempt to succeed, i.e. 123, and the overall match is 123E.
For the match attempt that starts at a given position, a lazy quantifier gives you the shortest match.
Do beware of this notion of "shortest match": it refers to the shortest match that can be found with a match attempt that starts at a given position in the string — not to the shortest possible match that can be found if a pattern is applied repeatedly to various sections of a string.
For more on this, see the section about the longest and shortest match traps.
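A short Python sketch makes the contrast with the greedy version visible on the same string:

```python
import re

# Lazy: stop at the first E that lets the pattern succeed
lazy_match = re.search(r'\w*?E', '123EEE').group()

# Greedy: grab everything, then give back only what is needed
greedy_match = re.search(r'\w*E', '123EEE').group()

print(lazy_match, greedy_match)  # 123E 123EEE
```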
Helpful: Expand When Needed
Lazy… but helpful.
With a lazy quantifier, the engine starts out by matching as few of the tokens as the quantifier allows.
For instance, with A*?, the engine starts out matching zero characters, since * allows the engine to match "zero or more".
But if the quantified token has matched so few characters that the rest of the pattern cannot match, the engine backtracks to the quantified token and makes it expand its match—one step at a time.
After matching each new character or subexpression, the engine tries once again to match the rest of the pattern.
I call this behavior of lazy quantifiers helpful.
For instance, against the string Two_apples, using the regex .*?apples, the token .*? starts out by matching zero characters—the minimum allowed by the * quantifier.
The engine then advances in the pattern and tries to match the a token against the T in Two.
That fails, so the engine backtracks to the .*?, which expands to match the T.
The engine advances both in the pattern and in the string and tries to match the a token against the w in Two.
Once again, the engine has to backtrack.
The .*? expands to match the w, then the a token fails against the o in Two.
This process of advancing, failing, backtracking and expanding repeats itself until the .*? has expanded to match Two_.
At that stage, the following token a is able to match, as are the p and all the tokens that follow.
The match attempt succeeds.
As this example showed, because lazy quantifiers expand their match one step at a time in order to match only as much as needed, they cause the engine to backtrack at each step.
They are expensive.
To fully grasp how lazy quantifiers work, let's look at one more example.
The quantified token A*? matches zero or more A characters—as few as possible, expanding as needed.
Against the string AA, depending on the overall pattern, A*? could end up matching no characters at all, A or AA.
Consider how these three patterns match AA:
^(A*?)AA$
—A*? (captured to Group 1) matches no characters.
After the anchor ^ asserts that the current position is the beginning of the string, A*? tries to match the least number of characters allowed by *, which is zero characters.
The engine moves to the next token: the A, which matches the first A in AA.
The next token matches the second A.
The match attempt succeeds, and Group 1 ends up containing no characters.
^(A*?)A$
—A*? (captured to Group 1) matches one A.
Initially, the A*? matches zero characters.
The next token A matches the first A in AA.
The engine advances to the next token, but the anchor $ fails to match against the second A.
The engine sees that the A*? can expand.
It backtracks and gives up the A, which the A*? now expands to match.
The engine moves to the next token: the A matches the second A in the string.
The $ anchor now succeeds.
Group 1 ends up containing one A.
^(A*?)$
—A*? (captured to Group 1) matches AA.
After the A*? matches zero characters, the $ fails to match.
The engine backtracks and allows the A*? to match one A.
Once again, the $ fails to match (there is one A left in the string).
The engine backtracks again and allows the A*? to expand to match the second A.
This time, the $ anchor matches.
Group 1 ends up containing AA.
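These three outcomes can be checked in Python. In this sketch the last pattern is written with a capture group so that Group 1 can be read in all three cases.

```python
import re

# The lazy A*? expands only as far as the rest of the pattern forces it to.
none_needed = re.search(r'^(A*?)AA$', 'AA').group(1)
one_needed = re.search(r'^(A*?)A$', 'AA').group(1)
all_needed = re.search(r'^(A*?)$', 'AA').group(1)

print(repr(none_needed), repr(one_needed), repr(all_needed))
```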
Possessive: Don't Give Up Characters
In contrast to the standard docile quantifier, which gives up characters if needed in order to allow the rest of the pattern to match, a possessive quantifier tells the engine that even if what follows in the pattern fails to match, it will hang on to its characters.
As you'll see in the table below, a quantifier is made possessive by appending a + plus sign to it.
Therefore, A++ is possessive—it matches as many characters as possible and never gives any of them back.
Whereas the regex A+. matches the string AAA, A++. doesn't.
At first, the token A++ greedily matches all the A characters in the string.
The engine then advances to the next token in the pattern.
The dot . fails to match because there are no characters left to match.
The engine checks whether there is anything to backtrack into.
But A++ is possessive, so it will not give up any characters.
There is nothing to backtrack, and the pattern fails.
In contrast, with A+., the A+ would have given up the final A, allowing the dot to match.
Possessive quantifiers match fragments of string as solid blocks that cannot be backtracked into: it's all or nothing.
This behavior is particularly useful when you know there is no valid reason why the engine should ever backtrack into a section of matched text, as you can save the engine a lot of needless work.
In particular, when a match must fail, a possessive quantifier can help it to fail faster.
For instance, suppose we want to match a string of digits that ends with E, as in 123E.
We can use a possessive quantifier with the \d digit token:
\b\d++E
When we use this pattern against 123E, it matches in the same way as if we had used a non-possessive \d+.
Actually, in theory the match could be a hair faster because the \d++ quantifier doesn't need to remember positions where it may later need to backtrack—it's all or nothing.
Now let's use the same pattern against 13245.
We expect the match to fail because the string doesn't end with an E.
Let's see how the possessive and non-possessive versions compare.
In the possessive version \b\d++E, after matching all the digits, the engine advances in the pattern and attempts the next token E.
There are no characters left in the string, so this fails.
Since the engine has nowhere to backtrack to, the match fails.
In the non-possessive version \b\d+E, after failing to match the E token at the end of the string, unless the engine has been optimized to detect that the \d+ token and the E are mutually incompatible, it has positions to backtrack to.
It backtracks into the \d+ and gives up the last character matched, which was the 5.
It then advances in the pattern and tries the next token E against the 5.
That fails, so the engine backtracks into the \d+ again, gives up the 4, advances in the pattern and tries the E against the 4.
This fails, and the process repeats itself until the \d+ has given up everything except the 1, at which stage there is nothing left to backtrack and the pattern can finally fail.
As you can see, in the regular version the engine spends a lot of time in needless backtracking, whereas in the possessive version the "all or nothing" \d++ allows the match to fail right away.
It's worth noting that certain engines (such as PCRE) study the pattern before starting the match, notice that the token \d is mutually exclusive with the token E, and optimize the pattern by automatically turning the \d+ into a possessive \d++.
This process is called auto-possessification.
PCRE even allows you to turn it off with the special start-of-pattern modifier (*NO_AUTO_POSSESS).
Possessive quantifiers are supported in Java (which introduced the syntax), PCRE (C, PHP, R…), Perl, Ruby 2+ and the alternate regex module for Python.
In .NET, where possessive quantifiers are not available, you can use the atomic group syntax (?>…) instead (this also works in Perl, PCRE, Java and Ruby).
The atomic group (?>A+) tells the engine that if the pattern fails after the A+ token, it is not allowed to backtrack into A+.
This means that A+ will not give up any of its characters—it is like a solid block (hence the name atomic).
As you can see, this is the same behavior as A++.
In fact, A++ is syntactic sugar for (?>A+), as internally most engines convert the first to the second.
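Python's standard re module only gained possessive quantifiers and atomic groups in version 3.11. For a sketch that runs on older versions too, the example below swaps in a different technique: it emulates the atomic block with a capturing lookahead followed by a back-reference, (?=(A+))\1, which works because the engine does not backtrack into a lookahead. This is an emulation for illustration, not the native syntax.

```python
import re

# Docile A+ gives back one A so that the dot can match.
docile = re.search(r'A+.', 'AAA')

# Emulated A++: the lookahead captures all the As, \1 consumes them,
# and nothing can be given back to the dot.
possessive_like = re.search(r'(?=(A+))\1.', 'AAA')

print(docile.group())    # AAA
print(possessive_like)   # None
```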
+ | once or more |
A+ | One or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile) |
A+? | One or more As, as few as needed to allow the overall pattern to match (lazy) |
A++ | One or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive) |
* | zero times or more |
A* | Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile) |
A*? | Zero or more As, as few as needed to allow the overall pattern to match (lazy) |
A*+ | Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive) |
? | zero times or once |
A? | Zero or one A, one if possible (greedy), giving up the character if the engine needs to backtrack (docile) |
A?? | Zero or one A, zero if that still allows the overall pattern to match (lazy) |
A?+ | Zero or one A, one if possible (greedy), not giving up the character if the engine tries to backtrack (possessive) |
{x,y} | x times at least, y times at most |
A{2,9} | Two to nine As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile) |
A{2,9}? | Two to nine As, as few as needed to allow the overall pattern to match (lazy) |
A{2,9}+ | Two to nine As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive) |
A{2,} | Two or more As, greedy and docile as above |
A{2,}? | Two or more As, lazy as above |
A{2,}+ | Two or more As, possessive as above |
A{5} | Exactly five As. Fixed repetition: neither greedy nor lazy. |
{START}.*{END}
Note that Java will require that you escape the opening braces: \{
However, you will find that this pattern matches this entire string from start to finish:
{START} Mary {END} had a {START} little lamb {END}
…whereas we wanted to find two matches:
{START} Mary {END}
{START} little lamb {END}
Here is what happens.
After matching {START}, the engine moves to the next token: .*
Because of the greedy quantifier, the dot-star matches all the characters to the very end of the string.
The engine then moves to the next token: the { at the beginning of {END}.
This fails to match because there are no characters left in the string.
But the engine sees that it can backtrack into the dot-star.
First, the dot-star gives up the very last character in the string, i.e. }.
The engine now tries to match the { token against this character, but fails.
The dot-star then gives up the D.
Again, the engine fails to match the { token against that character.
Repeating this process, the dot-star gives up the N, the E and the {, and the { token can finally match.
Then the rest of the pattern END} matches.
Therefore, the final match is the entire string.
The dot-star has only given up as many characters as were needed to allow an overall match to succeed.
The best-known way to solve this problem is with lazy quantifiers.
But lazy quantifiers have their own problems, and it is worth understanding other techniques to overcome the greed of an unfettered dot-star.
We will look at five distinct solutions, all of which you need to master on your way to your regex black belt.
{START}.*?{END}
The lazy .*? guarantees that the quantified dot only matches as many characters as needed for the rest of the pattern to succeed.
Therefore, the pattern only matches one {START}…{END} item at a time, which is what we want.
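
A quick way to see the difference between the greedy and lazy versions is to run both against the sample string. This is a sketch in Python's re module (an assumption on my part, any Perl-style engine should behave the same way here); the braces are escaped to stay on the safe side.

```python
import re

subject = "{START} Mary {END} had a {START} little lamb {END}"

# Greedy dot-star: runs to the end of the string, then backtracks
# only as far as the LAST {END}.
greedy_matches = re.findall(r"\{START\}.*\{END\}", subject)

# Lazy dot-star: expands one character at a time and stops at the
# FIRST {END} that lets the pattern succeed.
lazy_matches = re.findall(r"\{START\}.*?\{END\}", subject)

print(greedy_matches)
print(lazy_matches)
```

The greedy version returns a single match spanning the whole string, while the lazy version returns the two blocks we were after.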
Containing a Lazy Quantifier that Can Eat the Delimiter: Atomic Group
Suppose our regex pattern must match not only a {START}…{END} block, but some characters beyond that block, for instance \d+ digits.
In such cases, we must tweak the lazy quantifier solution by embedding the lazy dot-star and the {END} delimiter together in an atomic group — like so:
{START}(?>.*?{END})
This is because if tokens (such as \d+) beyond {END} fail to match, the engine will backtrack and require the .*? to expand beyond the first {END}, perhaps reaching a second {END} where a match is possible.
We don't want this.
The atomic group (?>.*?{END}) forbids the engine from backtracking into the lazy .*? after the first {END} has been matched.
There are other ways to solve this problem, which are discussed in the Lazy Trap section.
Whenever the token quantified by a lazy quantifier is able to eat the delimiter, as in the above example or something like \d*?9, remember to embed the token and the delimiter together in an atomic group: (?>\d*?9)
Lazy Quantifiers are Expensive
It's important to understand how the lazy .*? works in this example because there is a cost to using lazy quantifiers.
When it first encounters .*? the engine starts out by matching the minimum number of characters allowed by the quantifier—which is zero.
The engine then advances in the pattern and tries the next token (which is {) against the M in Mary.
This fails, so the engine backtracks and allows the .*? to expand its match by one item, so that it matches the M.
Once again, the engine advances in the pattern.
It now tries the { against the a in Mary.
This fails, so the engine backtracks and allows the .*? to expand and match the a.
The process then repeats itself—the engine advances, fails, backtracks, allows the lazy .*? to expand its match by one item, advances, fails and so on.
As you can see, for each character matched by the .*?, the engine has to backtrack.
From a computing standpoint, this process of matching one item, advancing, failing, backtracking, expanding is "expensive".
On a modern processor, for simple patterns, this will likely not matter.
But if you want to craft efficient regular expressions, you must pay attention to use lazy quantifiers only when they are needed.
Lower on the page, I will introduce you to a far more efficient way of doing things.
{START}[^{]*{END}
The negated character class [^{]* greedily matches zero or more characters that are not an opening curly brace.
Therefore, we are guaranteed that the quantified class will never jump over the {END} delimiter.
This is a more direct and efficient way of matching between {START} and {END}.
Note that in this solution, we can fully trust the * that quantifies the [^{].
Even though it is greedy, there is no risk that [^{] will match too much as it is mutually exclusive with the { that starts {END}.
This is the contrast principle from the regex style guide.
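
As a small illustration of the negated character class at work, here is a sketch in Python's re module (my choice of engine, not something the text prescribes). It shows that [^{]* stops by itself at the first opening brace, so no backtracking is needed to find the delimiter.

```python
import re

subject = "{START} Mary {END} had a {START} little lamb {END}"

# The greedy class can never swallow a "{", so it halts at {END} on its own.
negated_class = re.compile(r"\{START\}[^{]*\{END\}")
matches = negated_class.findall(subject)

# On its own, the class consumes everything up to the first "{".
first_run = re.match(r"[^{]*", " Mary {END} had a").group()

print(matches)
print(repr(first_run))
```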
{START}(?:(?!{END}).)*{END}
If you look closely, you'll see that we still have a kind of dot-star—a more complex one.
In (?:(?!{END}).)*, the * quantifier applies to a dot, but it is now a tempered dot.
The negative lookahead (?!{END}) asserts that what follows the current position is not the string {END}.
Therefore, the dot can never match the opening brace of {END}, guaranteeing that we won't jump over the {END} delimiter.
When Not to Use this Technique
For the task at hand, this technique presents no advantage over the lazy dot-star .*?{END}.
Although their logic differs, at each step, before matching a character, both techniques force the engine to look if what follows is {END}.
The comparative performance of these two versions will depend on your engine's internal optimizations.
The pcretest utility indicates that PCRE requires far fewer steps for the lazy-dot-star version.
On my laptop, when running both expressions a million times against the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the lazy version and 800 milliseconds for the tempered version.
Therefore, if the string that tempers the dot is a delimiter that we intend to match eventually (as with {END} in our example), this technique adds nothing to the lazy dot-star, which is better optimized for this job.
When to Use this Technique
Suppose our boss now tells us that we still want to match up to and including {END}, but that we also need to avoid stepping over a {MID} section, if it exists.
Starting with the lazy dot-star version to ensure we match up to the {END} delimiter, we can then temper the dot to ensure it doesn't roll over {MID}:
{START}(?:(?!{MID}).)*?{END}
If more phrases must be avoided, we just add them to our tempered dot:
{START}(?:(?!{MID})(?!{RESTART}).)*?{END}
This is a useful technique to know about.
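
Here is a sketch of the {MID} scenario in Python's re module. The test strings are made up for the illustration, and the engine is my own choice. The plain lazy dot-star happily rolls over {MID}, whereas the tempered version refuses to.

```python
import re

subject = "{START} a {MID} b {END} {START} c {END}"

# Plain lazy dot-star: nothing stops it from stepping over {MID}.
plain = re.findall(r"\{START\}.*?\{END\}", subject)

# Tempered dot: before each character, check that {MID} does not start here.
tempered = re.findall(r"\{START\}(?:(?!\{MID\}).)*?\{END\}", subject)

# Two phrases to avoid: stack the lookaheads in front of the dot.
two_phrases = re.compile(r"\{START\}(?:(?!\{MID\})(?!\{RESTART\}).)*?\{END\}")
blocked = two_phrases.findall("{START} a {RESTART} b {END}")

print(plain)
print(tempered)
print(blocked)
```

In the tempered run, the first block is rejected altogether because it contains {MID}, and only the second block is returned.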
{START}(?:[^{]|{(?!END}))*{END}
We still have a greedy quantifier *.
This time, it does not apply to a dot but to a non-capturing group (?:…) that contains an alternation.
On the left side of the alternation, [^{] matches one character that is not an opening brace.
We can safely do this because we know that a non-{ character will never make us roll over the {END} delimiter.
On the right side of the alternation, we are allowed to match a { as long as it is not followed by END}: the negative lookahead (?!END}) asserts that what follows the position after { is not END}.
In our language of quantifier techniques, this is a tempered opening brace.
The pattern can be further optimized.
If we have several non-{ characters in a row (which will be the typical case), at the moment we have to enter and exit the alternation for every single character because the quantifier * applies to the non-capturing group (?:[^{]|{(?!END})).
This seems inefficient.
If we also had a quantifier on the [^{], we could match multiple non-{ characters without leaving the alternation.
To do so, the first idea would be to use [^{]+.
However, this leads to a situation where the * quantifier applies to the + quantifier.
If the pattern fails, the engine will explore all the ways that the two quantifiers can divide up the "pie of characters", leading to needlessly long backtracking and the situation I call an explosive quantifier (which we'll explore in a later section).
What we want is to match any non-{ characters as a solid block that cannot be backtracked into.
We do this with a possessive quantifier [^{]++ or an atomic group (?>[^{]+).
While we're at it, we should also lock up the entire quantified alternation once we exit, because if {END} fails to match, backtracking into the alternation won't help.
We also do this either with possessive quantifiers (turning the * into *+) or by wrapping the quantified alternation into an atomic group.
We can use the possessive version in Java, PCRE (C, PHP, R…), Perl, Ruby 2+ and the alternate regex module for Python:
{START}(?:[^{]++|{(?!END}))*+{END}
We can use the atomic version in every major engine except Python and JavaScript:
{START}(?>(?:(?>[^{]+)|{(?!END}))*){END}
In any version of this solution, we do away with the generic dot by explicitly stating what we want: either any number of non-{ characters; or a { as long as it is not followed by END}.
This is a prime example of the Say What You Want (and What You Don't Want) principle from the regex style guide.
Note that for all the "normal" characters matched by the general case [^{]+ on the left side of the alternation, we don't need to look ahead.
Indeed, we only look ahead when we encounter an opening brace—which might only be once, when we hit the {END} delimiter.
Because we avoid the look-ahead-fail-backtrack rigmarole, we should expect this pattern to match faster than both the lazy dot-star and tempered-dot solutions, which both require "looking" at each step.
This is confirmed by pcretest:
Running the patterns a million times each on the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the atomic lazy-dot-star version, 800 for the tempered-dot version and 400 for the Explicit Greedy Alternation solution.
Lengthening the test string to {START} Mary Ate a Little Lamb {END}, the gaps between the versions increase drastically: 800 milliseconds per 10,000 runs for the lazy-dot-star, 2,300 for the tempered-dot, and only 500 for the explicit-greedy-alternation solution.
This solution takes a little more effort to write as you need to separate the brace case from the non-brace case, but it is well worth it if performance matters.
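
The sketch below tries the basic (non-possessive) form of this pattern in Python's re module, which I picked because that form runs on any Python 3 version. The test string is invented: it contains a stray {x} between the delimiters, which the alternation handles through its tempered opening brace.

```python
import re

subject = "{START} a {x} b {END} tail {START} c {END}"

# Either a non-brace character, or a "{" that does not begin {END}.
alternation = re.compile(r"\{START\}(?:[^{]|\{(?!END\}))*\{END\}")
alternation_matches = alternation.findall(subject)

# For comparison, the simple negated class cannot step over the stray "{".
negated_class = re.compile(r"\{START\}[^{]*\{END\}")
negated_class_matches = negated_class.findall(subject)

print(alternation_matches)
print(negated_class_matches)
```

The possessive and atomic variants should return the same matches on engines that accept them; they only change how much backtracking happens on failure.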
{START}[^{]*(?:(?:{(?!END}))+[^{]*)*{END}
This solution has pros and cons.
On the plus side, it is even faster than the Explicit Greedy Alternation solution it unrolls.
pcretest reports that per 10,000 runs, the performance on the short test string is identical, but on the longer test string this solution clocks in at 400 milliseconds, compared to 500 milliseconds for the other.
If you are looking to squeeze out every last drop of performance, this is the way to go.
On the minus side, the pattern is harder to read.
While the intent of the alternation in the original is immediate, it is not the case once the alternation is unrolled.
Moreover, one of the elements of the alternation is now repeated—when (A|B)* becomes A*(?:B+A*)* there are now two As.
If you ever change A in one place, you may forget to do it in the other—a maintenance hazard.
In my view, this is the kind of tweak that should be performed by the engine as an optimization behind the curtain.
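
Because the unrolled pattern is harder to read, it is worth checking that it really matches the same text as the alternation it unrolls. Here is a minimal sketch in Python's re module (my choice of engine) that compares the two on a handful of invented strings.

```python
import re

# The alternation (A|B)* ...
alternation = re.compile(r"\{START\}(?:[^{]|\{(?!END\}))*\{END\}")
# ... and its unrolled form A*(?:B+A*)*
unrolled = re.compile(r"\{START\}[^{]*(?:(?:\{(?!END\}))+[^{]*)*\{END\}")

subjects = [
    "{START} Mary {END} had a {START} little lamb {END}",
    "{START} a {x} b {END} tail {START} c {END}",
    "{START}{END}",
    "{START} {{ {END}",
    "{START} never closed",
]

# Both patterns should return identical matches on every sample.
comparison = [(alternation.findall(s), unrolled.findall(s)) for s in subjects]
for subject, (left, right) in zip(subjects, comparison):
    print(subject, "->", left, left == right)
```

A check like this is also a cheap guard against the maintenance hazard mentioned above: if you edit A in one place and forget the other, the two result lists stop agreeing.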
{START}.*?{END}\d+B
Looking at the bold string a few lines above, what do you think this pattern matches? Keep reading when you've made up your mind.
The pattern matches the entire string from the very beginning to the very end.
Do you see why?
Lazy quantifiers can jump the fence.
The .*? is supposed to expand until {END}\d+B is able to match.
Starting the match at the very start of the string, the .*? has no reason to stop expanding after the first {END} — where \d+B cannot match.
The .*? therefore continues to expand until a position where {END}\d+B is able to match.
Starting the match at the beginning of the string, the shortest match is the whole string.
The lesson: remember that the engine allows a lazy quantifier to expand its match as much as needed to allow an overall match.
If forced to, a lazy quantifier may jump the fence you thought you had made for it.
To contain the .*? in .*?{END} to the section before the first instance of {END}, you need to tweak it or replace it using one of four techniques we have already seen:
Bundle the characters preceding {END} together with {END} into an atomic group, forbidding the engine to backtrack and expand the .*? past the first {END}: (?>.*?{END})
Use a Tempered Greedy Token: (?:(?!{END}).)*{END}
Use an Explicit Greedy Alternation: (?:[^{]++|{(?!END}))*+{END}
Use an Unrolled Star Alternation Solution: [^{]*(?:(?:{(?!END}))+[^{]*)*+{END}
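
To watch the lazy quantifier jump the fence, here is a sketch in Python's re module. The subject string is hypothetical (I made it up so that \d+B can only match after the second {END}), and I use the Tempered Greedy Token as the fix because it runs on any Python 3 version.

```python
import re

# Hypothetical subject: digits followed by B only appear after the second {END}.
subject = "{START} Mary {END}00A {START} little lamb {END}01B"

# Lazy dot-star: after \d+B fails behind the first {END}, the .*? keeps expanding.
lazy = re.search(r"\{START\}.*?\{END\}\d+B", subject).group()

# Tempered dot: cannot cross {END}, so the first block is simply rejected.
tempered = re.search(r"\{START\}(?:(?!\{END\}).)*\{END\}\d+B", subject).group()

print(lazy)
print(tempered)
```

The lazy version returns the whole string, while the tempered version returns only the second block together with its trailing 01B.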
\b\d+E
to match a series of digits ending with an E.
Using this pattern on the string 1234, after the \d+ has finished its work the E token will fail to match.
At that stage, it is wasteful for the engine to backtrack into the \d+ and explore to see if the token E might have matched after the 2 or the 3.
Possessive quantifiers and atomic groups help us handle such situations by turning a quantified token or a subpattern into a block that cannot be backtracked into.
In this example, the syntax for those is \b\d++E and \b(?>\d+)E.
In this last example, the amount of potential backtracking needed is proportional to the length of the string.
The potential damage isn't too severe.
However, you can write regular expressions where the potential for backtracking in relation to the length of the string is exponential.
This is so wildly inefficient that your regex engine may well choke.
It is therefore vital to learn to recognize such expressions.
When there is potential for wild backtracking, quantifiers are always at fault.
To describe these situations, I speak of explosive quantifiers.
In Mastering Regular Expressions, Jeffrey Friedl refers to these situations as exponential matches, while in The Regular Expressions Cookbook Jan Goyvaerts and Steven Levithan speak of catastrophic backtracking.
^(A+)*B
It is not as contrived as it looks: for instance, it could be a window to look at the problem raised by ^(A+|X)*B, where A might stand for [aeiou].
Let's see what happens when we try to match the string AAAC with that pattern ^(A+)*B.
First, A+ matches all the A characters.
The greedy * tries to repeat the A+ token, but there are no characters left to match.
The engine advances to the next token: B fails to match.
The engine backtracks, the A+ gives up the third A.
The greedy * tries to repeat A+ and matches the third A.
The engine advances to the next token: B fails to match.
The greedy * gives up the second A+ token, i.e. the third A.
The engine advances to the next token: B fails to match.
Now the original A+ gives up the second A… Do you see where this is going?
Table of Combinations
The table below shows the combinations the engine will try for (A+)*.
Since the * quantifies the A+ token, several A+ tokens can contribute to what (A+)* matches at any given time.
Each column corresponds to the text matched by one of these A+ substrings.
But don't worry too much about the details of the table.
What matters is the number of rows.
A+ | A+ | A+ |
---|---|---|
AAA | — | — |
AA | A | — |
AA | — | — |
A | AA | — |
A | A | A |
A | A | — |
A | — | — |
— | — | — |
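
The rows of the table can be generated mechanically. The sketch below is a toy model written in Python, not the code of any real engine: it assumes a greedy A+ that tries its longest chunk first, wrapped in a greedy * that tries to repeat before giving up. The function name star_plus_splits is made up for the illustration.

```python
def star_plus_splits(n, pos=0):
    """Yield the ways (A+)* can carve up a run of n As, in backtracking order."""
    # Greedy A+: try the longest available chunk first, then shorter ones.
    for size in range(n - pos, 0, -1):
        # Greedy *: after this chunk, try to repeat A+ on what is left.
        for rest in star_plus_splits(n, pos + size):
            yield [size] + rest
    # Last resort: the * stops repeating at this position.
    yield []


rows = [["A" * size for size in split] for split in star_plus_splits(3)]
for row in rows:
    print(" | ".join(row) if row else "(nothing)")

# The number of rows doubles with every extra A.
row_counts = [sum(1 for _ in star_plus_splits(n)) for n in range(1, 11)]
print(row_counts)
```

For three As this prints the eight rows of the table in the same order, and the row count for n characters comes out as 2 to the power n.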
^(A+)*B
against longer strings.
The number of steps required to fail explodes.
Number of As | Steps to Fail |
---|---|
1, e.g. AC | 7 |
2, e.g. AAC | 14 |
3, e.g. AAAC | 28 |
4 | 56 |
5 | 112 |
10 | 3,584 |
20 | 3,670,016—RegexBuddy has given up. How about your program? |
100 | 4,436,777,100,798,802,905,238,461,218,816 |
(?:\D+|0(?!1))*
.
Unless you pay attention, you can miss that the (\D+…)* constitutes an explosive quantifier.
The lesson here is that when a quantifier needs to apply to another quantifier, we need to prevent the engine from backtracking.
We achieve this either by:
making the outer quantifier possessive, e.g. (?:\D+|0(?!1))*+, or
enclosing the expression in an atomic group, e.g. (?>(?:\D+|0(?!1))*)
^\d+\w*@
The \d and the \w are both able to match digits: they are not mutually exclusive.
Against a string such as 123, the pattern must fail.
While trying all the possibilities in order to find the match, the engine will let the \d+ give up characters that will be matched by the \w*.
Exploring these paths takes time: the engine takes 16 steps to reach failure.
Adding one digit to the test string, e.g. 1234, the engine takes 25 steps to fail.
With ten digits, it takes 121 steps.
With 100 digits, it takes 10,201 steps.
The situation is clearly far better than in the first example.
The number of steps required to fail in relation to the size of the string does not grow exponentially, but it still explodes. Without looking at it closely, its complexity seems to be quadratic or thereabouts, i.e. O(n²).
It takes roughly 1,000 digits to reach a million steps.
That's a lot more than many subject strings but a lot less than others—that's only a page-and-a-half of average text.
The lesson here is to try to use contiguous tokens that are mutually exclusive, following the rule of contrast from the regex style guide.
^(?:\d|\w)+@
This too will fail against 123.
On the first attempt, each digit will be matched by a \d token, as it is the leftmost side of the alternation.
When the @ token fails to match, the engine will backtrack into each alternation and let the \w side match characters that were previously matched by the \d.
The engine takes 60 steps to fail.
Adding one digit to the test string, e.g. 1234, the engine takes 124 steps to fail.
With ten digits, it takes 8,188 steps.
With 16 digits, it takes 524,284.
For longer strings, RegexBuddy maxes out.
The complexity of exploring all the combinations is O(2ⁿ).
Clearly, this is far worse than the previous pattern ^\d+\w*@, which at first sight looks fairly similar.
Why? With the earlier pattern, the engine must find a match that is a series of digits \d, then optionally a series of word characters \w.
The pie is always divided in that order—first \d tokens, then \w tokens.
In contrast, the second pattern ^(?:\d|\w)+@ gives us many more ways to divide up the pie.
The pie can be claimed by word character tokens first, then digit tokens.
Or by word character tokens and digit tokens intermingled in every way imaginable.
In the literature, this symptom is usually shown in the form (A|AA)+, but in my view that's not really helpful.
Why would you ever write such a silly pattern? Of course ^(?:\d|\w)+@ is silly too, but it brings out the salient symptom, which is that various components in a quantified alternation are able to "compete" for the same characters.
The lesson here is that when we build an alternation that is quantified, we must make sure that distinct branches cannot match the same characters.
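
The step counts quoted for the two patterns happen to fit simple closed forms, which makes the quadratic versus exponential contrast easy to check. The formulas in this Python sketch are a curve fit to the figures quoted above, not something the engine reports, so treat them as an assumption that holds for these data points only.

```python
def quadratic_steps(n):
    """Fitted step count for ^\\d+\\w*@ failing on n digits."""
    return (n + 1) ** 2


def exponential_steps(n):
    """Fitted step count for ^(?:\\d|\\w)+@ failing on n digits."""
    return 2 ** (n + 3) - 4


# Figures quoted in the text: number of digits -> steps to fail.
reported_quadratic = {3: 16, 4: 25, 10: 121, 100: 10_201}
reported_exponential = {3: 60, 4: 124, 10: 8_188, 16: 524_284}

quadratic_fits = all(quadratic_steps(n) == s for n, s in reported_quadratic.items())
exponential_fits = all(exponential_steps(n) == s for n, s in reported_exponential.items())
print(quadratic_fits, exponential_fits)

# Side by side, the gap widens very quickly.
for n in (5, 10, 20, 40):
    print(n, quadratic_steps(n), exponential_steps(n))
```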
Do character classes present the same risk?
Our vicious pattern ^(?:\d|\w)+@ could be written with a character class: ^[\d\w]+@
Let's forget for a moment that we would never write such a silly pattern—like the others, it is only meant to help us explore potentially explosive patterns.
On the face of it, this pattern does the exact same thing as the version with the alternation: at each step, the engine can match either a digit or a word character.
Surely it too must explode when the engine fails to find a match, right?
It is not so.
Suppose we try ^[\d\w]+@ against the string 123.
First, [\d\w]+ greedily matches all the digits.
For a moment, let's assume that each of those digits (1, 2, 3) was matched by the \d token inside the character class.
Please note that we don't know this for a fact.
One engine might notice that \d is a subset of \w and optimize the entire character class to \w before even starting the match attempt.
Another engine might have its own set of rules about which tokens in a character class to try first.
After the @ token fails to match, the engine looks for positions to backtrack.
First, the [\d\w]+ gives up the 3.
The engine tries to match the 3 with the token @, and fails.
At this stage, in the alternation version, the engine would have tried to match the 3 with the \w token on the right side of the alternation.
In this case, however, the engine does not attempt the \w inside the [\d\w].
A character class constitutes a solid block, an atomic token.
Once it matches a character, you don't backtrack into it to try different ways to make it match.
When you give it up, you give it up.
Therefore, after the @ token fails to match the 3, the engine's next move is to backtrack once more and force the [\d\w] to give up the 2.
Next, the @ token fails to match the 2.
There is nothing left to backtrack, and the match attempt fails.
In RegexBuddy's way of counting, reaching that point takes seven steps.
The number of steps required to explore all the paths is directly proportional to the length of the string: the complexity is O(n), which is the best you can ask for, short of making the character class's quantifier possessive — [\d\w]++ — or enclosing it in an atomic group (?>[\d\w]+).
^.*A.*AB
Suppose our string is AAAAA.
The first dot-star can match the whole string, nothing at all, or anything in between.
The second dot-star can match a considerable portion of the string, nothing at all, or anything in between.
Before the engine can determine that the match must fail, there will be a tug of war between the two dot-stars.
It takes 53 steps for the RegexBuddy engine to fail on this short string, and 178 steps on a string that contains ten A characters.
The regex in this example is so short—and we are so used to distrusting dot-stars—that it probably jumps out at you that one dot-star can overreach into the other's territory.
But the same situation can arise in less obvious ways.
Consider this pattern, which is only slightly longer than the previous one:
^\d*?1\d*?1B
The lazy \d*? seems to only want to extend up to the first 1, while the second \d*? extends to the second 1.
That seems legit.
But when the engine has trouble finding a match, the first \d*? can in fact jump over the first 1 if there are more ones to swallow.
You may indeed remember from the Lazy Trap that lazy quantifiers can jump over the fence you thought you had made for them, because they expand as far as needed in order to allow a match.
The delimiter 1 is not a true fence because the \d token can match it if it needs to.
For instance, against the string 11111C, where the match must clearly fail, at one stage the first \d*? will match all the ones.
It takes 59 steps for the engine to fail.
With ten ones, it takes 189 steps, and with a hundred ones, it takes 15,354 steps.
Once again, we have an explosive quantifier—although nowhere as bad as in our exponential example.
If you thought the ^\d*?1\d*?1B was easy to spot, consider that the same phenomenon could be embedded in something like this:
.*?{START} (lots of stuff in between) .*?{END}
In my view, this is a lot harder to spot—unless you are sensitive to whether the tokens quantified by a lazy quantifier are able to match their intended delimiters.
The lesson here is to carefully consider whether a quantified token might reach over into a section of the string that you had intended for another token to match.
To contain the .*? in .*?{START} to the section before {START}, you need to tweak it or replace it using one of four techniques we have already seen:
Bundle the characters to be matched before {START} together with {START} into an atomic group, forbidding the engine to backtrack and expand the .*? past the first {START}: (?>.*?{START})
Use a Tempered Greedy Token: (?:(?!{START}).)*{START}
Use an Explicit Greedy Alternation: (?:[^{]++|{(?!START}))*+{START}
Use an Unrolled Star Alternation: [^{]*(?:(?:{(?!START}))+[^{]*)*{START}
(?(A)X|Y)
This means "if proposition A is true, then match pattern X; otherwise, match pattern Y."
Often, you don't need the ELSE case or the THEN case:
(?(A)X) says "if proposition A is true, then match pattern X."
(?(A)X|) means the same, but the alternation bar can be dropped.
(?(A)|X) amounts to saying "if proposition A is not true, then match pattern X." If you translate the IF…THEN…ELSE construction literally, it says "if proposition A is true, then match the empty string (which always matches at every position), otherwise match pattern X."
Proposition A
Proposition A can be one of several kinds of assertions that the regex engine can test and determine to be true or false.
These various kinds of assertions are expressed by small variations in the conditional syntax.
Proposition A can assert that:
a numbered capture group has been set
a named capture group has been set
a capture group at a relative position to the current position in the pattern has been set
a lookaround has been successful
a subroutine call has been made
a recursive call has been made
embedded code evaluates to TRUE
(?(1)foo|bar)
In this exact pattern, if Group 1 has been set, the engine must match the literal characters foo.
If not, it must match the literal characters bar.
But the alternation can contain any regex pattern, for instance (?(1)\d{2}\b|\d{3}\b)
A realistic use of a conditional that checks whether a group has been set would be something like this:
^(START)?\d+(?(1)END|\b)
Here is how this works:
The ^ anchor asserts that the current position is the beginning of the string
The parentheses around (START) capture the string START to Group 1, but the ? "zero-or-one" quantifier makes the capture optional
\d+ matches one or more digits
The conditional (?(1)END|\b) checks whether Group 1 has been set (i.e., whether START has been matched).
If so, the engine must match END.
If not, the engine must match a word boundary.
The net result is that the pattern matches digits that are either embedded within START…END at the beginning of the string, or standing by themselves at the beginning of the string.
To achieve the same effect without a conditional, we could use ^(?:START\d+END|\d+\b), which forces us to repeat the \d+ token.
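
Python's re module supports conditionals that test a capture group, so the two versions can be compared directly. The sample strings in this sketch are invented for the illustration.

```python
import re

# With a conditional: END is required only if START was captured.
conditional = re.compile(r"^(START)?\d+(?(1)END|\b)")
# Without a conditional: the \d+ token has to be spelled out twice.
plain = re.compile(r"^(?:START\d+END|\d+\b)")

samples = ["START123END", "123", "123 apples", "START123", "123END"]

conditional_results = [bool(conditional.search(s)) for s in samples]
plain_results = [bool(plain.search(s)) for s in samples]

for sample, hit in zip(samples, conditional_results):
    print(sample, "->", hit)
```

Both patterns accept the first three samples and reject the last two: START123 lacks its END, and in 123END there is no word boundary after the digits.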
^(?<UC>[A-Z])?\d+(?(UC)_END)$
With (?<UC>[A-Z]) the optional capture group named UC captures one upper-case letter
\d+ matches digits
The conditional (?(UC)_END) checks whether the group named UC has been set.
If so, it matches the characters _END
This pattern would match the string A55_END as well as the string 123.
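
Here is the named-group conditional as a sketch in Python's re module. Python spells the named group (?P<UC>…) rather than (?<UC>…), so the pattern below is adjusted accordingly; the extra test strings are my own.

```python
import re

# Python syntax for the named group; the conditional itself is unchanged.
pattern = re.compile(r"^(?P<UC>[A-Z])?\d+(?(UC)_END)$")

samples = ["A55_END", "123", "A55", "55_END"]
results = [bool(pattern.search(s)) for s in samples]

for sample, hit in zip(samples, results):
    print(sample, "->", hit)
```

A55 fails because the upper-case letter was captured but _END is missing, and 55_END fails because _END is only allowed when the letter was captured.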
(?(-1)X|Y)
says: if the nearest capture group to the left of this conditional has been set, match pattern X; otherwise, match pattern Y.
Using a relative group in a conditional comes in handy when you are working on a large pattern, some of whose parts you may later decide to move.
It can be easier to count a relative position such as -2 than an absolute position.
Checking a relative group to the right
Although this is far less common, you can also use a forward relative group.
This time, we use a + sign in front of an integer:
(?(+1)X|Y)
This says: if the nearest capture group to the right of this conditional has been set, match pattern X; otherwise, match pattern Y.
But how, you may ask, can a capture group to the right of the current position in the pattern already have been set? This can happen in various ways:
The conditional and the group live inside a quantified group.
For instance,
(?:A(?(+1)B)(C))+
matches ACABC.
On the first pass through the repeated group, the conditional fails as C has not yet been captured.
On the second pass, the conditional succeeds.
The conditional has been reached through a subroutine call.
For instance,
(A(?(+1)B)(C))(?1)
matches ACABC.
Inside the parentheses that define Group 1, the conditional fails as C has not been captured.
On the subroutine call (?1), the conditional succeeds.
The conditional has been reached through a recursive call.
For instance,
(A(?(+1)B)(C)(?R)?D)
matches ACABCDD.
At the outer level, the conditional fails as C has not been captured.
At the first depth of recursion, it succeeds.
^(?(?=.*_FRUIT$)(?:apple|banana)|(?:carrot|pumpkin))\b
After the ^ anchor asserts that the current position is the beginning of the string, the conditional (?(?=.*_FRUIT$)…|…) checks whether the lookahead (?=.*_FRUIT$) can succeed.
That lookahead asserts that at the current position, the engine can match any characters, then _FRUIT and the end of the string.
If the lookahead succeeds, we match a fruit: (?:apple|banana).
Otherwise, we match a vegetable: (?:carrot|pumpkin)
Without a conditional, this would be a bit heavier to express:
^(?:(?:apple|banana)(?=.*_FRUIT$)|(?:carrot|pumpkin)(?!.*_FRUIT$))\b
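
Python's built-in re module only accepts a group number or name as the condition, not a lookaround, so the sketch below exercises the conditional-free version instead. The test strings are invented for the illustration.

```python
import re

# Fruit is allowed only if the string ends with _FRUIT; vegetables only if it does not.
pattern = re.compile(
    r"^(?:(?:apple|banana)(?=.*_FRUIT$)|(?:carrot|pumpkin)(?!.*_FRUIT$))\b"
)


def first_word(subject):
    """Return the matched word, or None when the pattern fails."""
    match = pattern.search(subject)
    return match.group() if match else None


samples = ["apple pie_FRUIT", "carrot cake", "carrot cake_FRUIT", "apple pie"]
results = [first_word(s) for s in samples]
print(results)
```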
(A(?(R1)B|C))(?1)
It matches the string ACAB.
The parentheses around (A…) define Group 1 and Subroutine 1.
First, we match the character A.
The conditional (?(R1)B|C) checks whether we are in the middle of a call to subroutine 1.
After matching the string's initial A, it is not true that we have reached this point in the pattern via a subroutine call, so we must match the pattern in the ELSE branch of the conditional, which is the letter C.
(?1) is a call to subroutine 1.
First, we match another A.
The conditional check succeeds as we have reached this point via a call to subroutine 1, so we must match the pattern in the THEN branch, which is the letter B.
Here is the same, but using a named subroutine:
(?<foo>A(?(R&foo)B|C))(?&foo)
The parentheses around (?<foo>A…) define a capture group and subroutine named foo.
First, we match the character A.
The conditional (?(R&foo)B|C) checks whether we are in the middle of a call to the subroutine named foo.
After matching the string's initial A, it is not true that we have reached this point in the pattern via a subroutine call, so we must match the pattern in the ELSE branch of the conditional, which is the letter C.
(?&foo) is a call to the subroutine named foo.
First, we match another A.
The conditional check succeeds as we have reached this point via a call to the subroutine named foo, so we must match the pattern in the THEN branch, which is the letter B.
Nested Subroutine Calls
Suppose a part of the pattern calls subroutine 2, which then calls subroutine 1.
Once inside subroutine 1, the engine encounters a conditional check on whether subroutine 2 has been called.
Even though we are currently within a call to subroutine 2, the conditional test fails because what matters is the last subroutine call that was made—which is the call to subroutine 1.
We can see this with these two patterns:
(A(?(R1)C))(B(?1))(?2)
matches ABACBAC.
Within it, (A(?(R1)C)) matches A, (B(?1)) matches BAC and (?2) matches BAC again.
(A(?(R2)C))(B(?1))(?2)
matches ABABA but not all of ABABAC.
Within it, (A(?(R2)C)) matches A, (B(?1)) matches BA, and (?2) matches BA again.
The conditional (?(R2)C) fails even when reached via (?2), as the most recent subroutine call when it is reached is the one made by (?1).
A(?(R)B)(?R)?C
It matches the string AABCC.
The first time we encounter the conditional, we have not made a recursive call, so we do not have to match a B.
The outer level of the recursive match will be A…C
The second time we encounter the conditional, we are in the middle of a recursive call, so we must match a B.
If we don't recurse again, the depth 1 match is ABC, and the pattern can match AABCC.
\d+(?(?{$currency}) dollars)
matches two kinds of strings.
When $currency is set to FALSE, the conditional test fails and the pattern only matches a series of digits, such as 122.
When $currency is set to TRUE, the conditional test succeeds and the pattern matches strings such as 55 dollars.
(?:(BEGIN:)|({{)).*?(?(1):END)(?(2)}})
This will match {{foo}} and BEGIN:bar:END
The non-capturing group (?:(BEGIN:)|({{)) matches the opening delimiter, either capturing BEGIN: to Group 1 or capturing {{ to Group 2.
.*? matches any characters, lazily expanding up to a point where the rest of the pattern can match.
The conditional (?(1):END) checks if Group 1 has been set.
If so, the engine must match :END
The conditional (?(2)}}) checks if Group 2 has been set.
If so, the engine must match }}
Alternative Solution
This can also be solved with a plain alternation:
BEGIN:.*?:END|{{.*?}}
However, this expression becomes increasingly more complex when
we add potential delimiter pairs, such as <== … ==>, or
the content to be matched between the delimiters turns into a longer pattern—as this pattern must be repeated on each branch of the alternation.
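
The conditional version can be tried as-is in Python's re module, apart from the braces, which I escape to be safe. The mismatched test strings are my own additions, meant to show that each opening delimiter only accepts its own closing delimiter.

```python
import re

pattern = re.compile(r"(?:(BEGIN:)|(\{\{)).*?(?(1):END)(?(2)\}\})")

samples = ["{{foo}}", "BEGIN:bar:END", "{{foo:END", "BEGIN:bar}}"]

# fullmatch requires the whole sample to be consumed by the pattern.
results = [bool(pattern.fullmatch(s)) for s in samples]

for sample, hit in zip(samples, results):
    print(sample, "->", hit)
```

Adding a third delimiter pair would mean one more capture group in the opening alternation and one more conditional at the end, while the .*? in the middle stays written once.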
^(BEG)?\d+(?:END|_end(?(1)(?!)))$
(BEG)? optionally matches BEG, capturing the characters to Group 1.
\d+ matches the digits.
(?:END|_end(?(1)(?!))) matches either END or _end.
On the _end branch, the conditional (?(1)(?!)) checks if Group 1 has been set (i.e., we matched BEG earlier), and if so, the THEN branch (?!) forces the match attempt to fail.
Fail Unless Y
Let's give a slight tweak to the context in which we'd like to match digits.
The digits must still be followed by either END or _end.
However, if they end with END, then BEG is the only allowable prefix.
Therefore, 00END cannot match, whereas BEG00END, BEG12_end and 00_end all match.
We can use this pattern:
^(BEG)?\d+(?:_end|END(?(1)|(?!)))$
(BEG)? optionally matches BEG, capturing the characters to Group 1.
\d+ matches the digits.
(?:_end|END(?(1)|(?!))) matches either _end or END.
On the END branch, the conditional (?(1)|(?!)) checks if Group 1 has been set (i.e., we matched BEG earlier); if not so, the ELSE branch (?!) forces the match attempt to fail.
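The four strings mentioned above can be checked with a short sketch in Python's standard re module. This assumes the module accepts an empty THEN branch in (?(1)|(?!)), which I believe it does.

```python
import re

# On the END branch, an unset Group 1 sends the engine to the ELSE branch (?!).
fail_unless = re.compile(r'^(BEG)?\d+(?:_end|END(?(1)|(?!)))$')

subjects = ['00END', 'BEG00END', 'BEG12_end', '00_end']
results = {s: bool(fail_unless.search(s)) for s in subjects}
for subject, matched in results.items():
    print(subject, '->', 'match' if matched else 'no match')
```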
In the example on self-referencing groups, one of the alternate solutions will show a powerful way to use conditionals to control failure in the context of .NET balancing groups.
\A(A(?:(?1)|[^AB]*)B)\z
(This also works in Ruby if we replace the (?1) with a \g<1>)
But if we want to balance a greater number of tokens, as in AAA foo BBB bar CCC baz DDD, it can become interesting to use self-referencing groups, as seen on the page about Quantifier Capture and on the trick to match line numbers.
For our task of balancing As with Bs in strings such as AAA foo BBB, we could use something like:
^(?:A(?=A*+[^AB]*+((?(1)\1)B)))++[^B]*+\1$
I know… Please don't scream, we'll ease in gently.
One feature of this pattern is that capture Group 1 ((?(1)\1)B) refers to itself with the conditional (?(1)\1).
This conditional says:
If Group 1 has already been set, match the current content of the Group 1 capture buffer.
Match B — regardless of whether Group 1 has been set.
This construction has the effect that with each pass through Group 1, the Group 1 capture buffer gets longer by one character B.
On the first pass, Group 1 has not been set, so the THEN branch of the conditional does not apply, and Group 1 captures one single B.
On the second pass, the conditional applies, so the parentheses must match \1 (a back-reference to Group 1, which at this stage is B) and one additional B.
At this stage, Group 1 contains BB.
On the third pass, \1 is BB, so the parentheses must capture BBB… and so on.
Thanks to this construction, the quantified group (?:A…)+ matches the A characters one by one, and for each A that is matched, the Group 1 capture buffer grows by one B.
By the time we exit (?:A…)+, we have matched as many As as the number of Bs captured in Group 1.
Later in the pattern a simple back-reference \1 to Group 1 matches these Bs.
Alternate Solutions
Inside the self-referencing group ((?(1)\1)B), instead of using a conditional, we could use an optional (but possessive) back-reference to Group 1 \1?+.
If Group 1 is set, it is matched.
And the possessive + forbids the engine from backtracking and giving up the back-reference.
We've already looked at the recursive solution.
Let's look at a beautiful solution in .NET.
Balancing Groups
In .NET, we can use balancing groups.
This solution also uses a conditional, which is another example of a conditional to control failure.
As a reminder, the task is to match strings where the number of As and Bs is balanced, as in AAA foo BBB.
We can use this:
^(?<Count_A>A)+[^AB]*(?<-Count_A>B)+(?(Count_A)(?!))$
(?<Count_A>A)+ matches all the As, adding each individual A to the CaptureCollection named Count_A.
I gave the group that name because we use the group as a virtual counter.
[^AB]* matches all the non-A, non-B characters.
(?<-Count_A>B)+ matches all the B characters, popping individual A characters from the CaptureCollection as it does so ("decrementing the counter").
(?(Count_A)(?!)) checks if the named capture Group Count_A is set, which can only be the case if we have not removed enough As from the CaptureCollection.
This would mean there are fewer Bs than As in the string.
In that case, the engine matches the THEN branch of the conditional, which is the classic trick (?!) to force the regex engine to fail and attempt to backtrack.
For efficiency, each quantified group should be made atomic:
^(?>(?<Count_A>A)+)(?>[^AB]*)(?>(?<-Count_A>B)+)(?(Count_A)(?!))$
I know, the atomic version (which is far preferable for the engine) looks awful… Do you happen to know the people at Microsoft in charge of .NET regex? If so, please lobby them to support possessive quantifiers (and subroutines, and recursion).
And if you don't mind, please shoot me a message as I'd love to know how to reach them.
Subject: About : 3. Not so useful: checking if a lookaround is successful.
Let me add my opinion about the 3rd point : I had a problem for which I found a solution with this syntax, and that not seems to work if I use only the lookaround.
Please consider a matching test : "if the string ends with END, it should contain WORD, otherwise all is permitted" :
- with the conditional regex I write this :
R1 : ^(?(?=.*END$).*WORD.*END|.*)$
with this R1 regex, "abcd" matches, "theWORD is END" matches, but "only END" doesn't match because it ends with END but WORD is missing.
That's what I need : presence of WORD is tested only if string ends with END.
- without the conditional it becomes :
R2 : ^(?=.*END$).*WORD.*END|.*$
with R2 regex, the last test "only END" matches and that's not what I need
So I think that there are cases for which checking if a lookaround is successful is so useful.
Otherwise, please give me another regex that works for my problem (maybe it exists one, I'm not a regex guru ^^).
Regards,
Yosh
Reply to YosheE
Hi Yoshe,
Sorry about the delay, I have been traveling then had to catch up on a million things.
Finally looked at your message today.
Congratulations for building an interesting example!
With two lookarounds, there are several solutions.
Here's a simple solution with a single lookaround:
(?x)
^.*?WORD.*END$
|
^(?:(?!END$).)*$
In a majority of cases, your conditional implementation probably runs faster.
I've added that as an additional example for case 3.
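For readers who want to verify the single-lookaround solution against Yosh's three test strings, here is a minimal sketch in Python. It uses the standard re module, with the re.VERBOSE flag standing in for the inline (?x).

```python
import re

# Single-lookaround solution, written in free-spacing mode.
pattern = re.compile(r'''
    ^.*?WORD.*END$        # the string ends with END and contains WORD
    |
    ^(?:(?!END$).)*$      # or the string does not end with END
''', re.VERBOSE)

subjects = ['abcd', 'theWORD is END', 'only END']
results = {s: bool(pattern.search(s)) for s in subjects}
for subject, matched in results.items():
    print(subject, '->', 'match' if matched else 'no match')
```

The first two strings should match and "only END" should not, which is the behavior Yosh asked for.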
Wishing you a beautiful day,
Rex
Subject: Hi!
Thi is really help full thank you
\w{3}, then three digits: \d{3}.
On our test string, with "aaa111", the expression can match on these first steps.
Next, since the expression has worked thus far, the (?R) "pastes" the whole pattern in its own place.
(Bear in mind that this is only a manner of speaking. The actual regex engine wouldn't know pasting from pasta.)
The question mark at the end of (?R)? is the usual "one or nothing" operator.
It makes it so that if the expression in the "pasted pattern" fails, the engine can match "empty" in that same spot.
At this stage, the expanded expression looks like this:
Pattern 1 (first expansion): \w{3}\d{3}(?:\w{3}\d{3}(?R)?)?
The bold part of the pattern is where the full pattern has been pasted in place of (?R).
As you can see, I have encapsulated the pasted pattern in a non-capturing group.
Why? Because I needed to apply the question mark that was at the end of the (?R)? to the pasted pattern—the non-capturing group is just a way to express that syntax.
Without the final question mark, we would not need the non-capturing group.
By the way, even though the (?R) is inside parentheses, the parentheses do not capture anything.
That is why I used a non-capturing group rather than simple parentheses.
After matching "aaa111", we are now at the beginning of the bold part of the expression.
Our first job is to match three more word characters and three more digits.
Luckily, with "bbb222", our test string supplies these.
Next, we bump against (?R)? once again.
The (?R) pastes itself in place.
If you wrote out the whole expression at this stage, it would look like this:
Pattern 1 (second expansion): \w{3}\d{3}(?:\w{3}\d{3}#(?:\w{3}\d{3}(?R)?)?)?
This time, I have inserted a bold hash character (#) to show where we are in the expression at the moment.
I have also bolded the question mark of the pattern we just pasted in order to emphasize that this new pattern is optional.
At this stage, we try to match three word characters again.
But we are at the end of our subject string (aaa111bbb222), so the new sub-pattern fails.
Thanks to the bold question mark, we can go back to where we were before trying to match the pasted pattern.
In the first expansion, that location is the point just before the (?R)?.
The engine rolls over the (?R)?, successfully completes the expression, and returns a match: "aaa111bbb222".
As you can see, if you approach them like this, recursive expressions are nothing to be scared of.
But the record should show that our recursion accomplished nothing more than the puny plus quantifier in:
Alternate to recursive pattern #1: (?:\w{3}\d{3})+
And as you can imagine, with anything a bit complex, the paste method would quickly become difficult to follow.
Fortunately, the "trace method" eliminates the problem.
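Since the plus-quantifier alternate needs no recursion, it can be tried in any engine. Here is a quick sketch in Python's standard re module; the second test string is my own addition, chosen to show a failure.

```python
import re

# Non-recursive alternate: one or more blocks of three word chars + three digits.
plus_version = re.compile(r'(?:\w{3}\d{3})+')

full = plus_version.fullmatch('aaa111bbb222')
partial = plus_version.fullmatch('aaa111bb')   # incomplete second block

print(full.group(0) if full else None)
print(partial.group(0) if partial else None)
```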
abc(?:$|(?R))
This pattern matches series of the string "abc" strung together.
This series must be located at the end of the subject string, as it is anchored there by the dollar sign.
The pattern matches "abc", "abcabc", but not "abc123".
How does it work?
After each "abc" match, the regex engine meets an alternation.
On the left side, if it finds the end of string position (expressed by the dollar symbol in the regex), that's the end of the expression.
On the other hand, if the end of the string has not yet been reached, the engine moves to the right side of the alternation, goes down one level, and tries to find "abc" once again.
Without some kind of way out, the expression would never match, as eventually any string must run out of "abc"s to feed the regex engine.
(\w)(?:(?R)|\w?)\1
What does this do? This pattern matches palindromes, which are "mirror words" that can be read in either direction, such as "level" and "peep".
Let's unroll it to see how it works.
(?x)# activate comment mode
(\w)# capture one word character in Group 1
(?:(?R)# non-capturing group: match the whole expression again,
|# OR
\w?)# match any word character, or "empty"
\1# match the character captured in Group 1
The pattern starts with one word character.
This character is mirrored at the very end with the Group 1 back reference.
These are the basic mechanics of how we are "building our mirror".
In the very middle of the mirror, we are happy to have either a single character (the \w in the alternation) or nothing (made possible by the question mark after the \w).
Note that the pattern is not anchored, so it can match mirror words inside longer strings.
Here is some PHP code that tests the pattern against a few strings.
<?php
$subjects=array('dontmatchme','kook','book','paper','kayak','okonoko','aaaaa','bbbb');
$pattern='/(\w)(?:(?R)|\w?)\1/';
foreach ($subjects as $sub) {
echo $sub." ".str_repeat('-',15-strlen($sub))."-> ";
if (preg_match($pattern,$sub,$match)) echo $match[0]."<br />";
else echo 'sorry, no match<br />';
}
?>
And here is the output:
dontmatchme -----> sorry, no match
kook ------------> kook
book ------------> oo
paper -----------> pap
kayak -----------> kayak
okonoko ---------> okonoko
aaaaa -----------> aaaaa
bbbb ------------> bbb
It worked perfectly! Well, almost perfectly.
For the last string ("bbbb"), a match is found, but not the one we expected.
What is happening there? This has to do with the atomic nature of PHP recursion levels.
To explain it properly we will need to use the trace method in order to see every little step taken by the PHP regex engine.
For the fully-traced match, click the image.
(\w)(?:(?R)|\w?)\1
, it returns bb, and D0 can match the final b, returning the complete intended match: bbbb.
However, this is not how the PCRE engine used by PHP's preg_match function works.
Instead of going back into D1, the engine gives up D1 as a block.
D0 then completes the match by eating two more "b"s, leaving the last one untouched, and returns "bbb".
May this serve as a warning about the potentially unexpected outcomes of recursive regular expressions!
^(\w)(?:(?R)|\w?)\1$
Wait, not so fast… If you add anchors to the expression, when you hit the (?R), the anchors will be pasted back into the middle of the expression, yielding something like this:
Attempt at anchored recursive pattern (first expansion):
^(\w)(?:^(\w)(?:(?R)|\w?)\1$|\w?)\1$
Now you have two carets preceding two distinct characters.
This pattern can never match.
Fortunately, you can build a recursive expression without using (?R), which repeats the entire expression.
Instead, using the "subroutine expression" (or sub-pattern) syntax, you can paste a sub-pattern specified by a capture group.
For instance, to paste the regex inside the Group 1 parentheses, you would use (?1) instead of (?R).
Here is how our corrected anchored recursive pattern looks:
Anchored recursive pattern: ^((\w)(?:(?1)|\w?)\2)$
Everything between the two anchors now lives in a set of parentheses.
This is Group 1.
Therefore, the captured word at the start is now Group 2.
In the middle, the repeating expression pastes the subpattern defined by the Group 1 parentheses in place of the alternation: the anchors are left out.
At the end, the first character is mirrored by the back reference to Group 2.
This works perfectly.
Except, once again, for the "bbbb" string.
To see exactly why, you can use the trace method as shown in the earlier example.
[\p{InArabic}&&\p{L}]
When negation is involved, I recommend always using brackets because things can get messy.
For instance, what does [^a&&[ab]] mean? For Java, it means [[^a]&&[ab]] and therefore only matches b.
For Ruby, it means [^[a&&[ab]]], i.e. "not the intersection of two subclasses" (which is a), and it therefore matches any character that is not an a.
Combined with negated classes, intersection is very useful to create character class subtraction.
Workaround for Engines that Don't Support Character Class Intersection
For engines that don't support character class intersection, we can simply use a lookahead.
For instance, our Arabic letter regex [\p{InArabic}&&\p{L}] can be written like this in Perl:
(?=\p{InArabic})\p{L}
The lookahead asserts that the following character belongs to the Arabic Unicode block.
Then \p{L} matches a letter, which is guaranteed to be an Arabic letter.
[a-z&&[^aeiou]]
matches characters that are both English lowercase letters and not vowels.
In effect, it subtracts the vowel class [aeiou] from the class of letters [a-z].
The effect is to match all English lowercase consonants.
Subtraction becomes particularly useful with Unicode properties.
For instance,
[\p{InArabic}&&[^\P{L}]]
subtracts non-letters \P{L} from the set of characters in the Arabic Unicode block—guaranteeing we match an Arabic letter.
Character Class Subtraction in .NET
In .NET, with […-[…]], you can specify that the character to be matched belongs to a certain class (everything before the hyphen), except if it belongs to another class (the embedded character class, which is "subtracted" by the hyphen).
For instance, the class
[a-z-[aeiou]]
matches an English lower-case consonant.
Using Unicode properties, you can use this feature to zoom in on a useful character range.
For instance, you could try [\p{IsArabic}-[\d]] to match one character in the Arabic code block, except if it is a digit.
Do not think that gives you an Arabic letter, though, as the Arabic code block also includes punctuation and various marks and symbols.
In contrast, [\p{IsArabic}-[\D]] is much more useful: it gives us one character in the Arabic code block, except if it is a non-digit—guaranteeing that it is an Arabic digit.
You can have nested subtraction—using subtraction within a class being subtracted.
For instance, having defined [a-z-[aeiou]] as English lowercase consonants, we could subtract those from the word character class: [\w-[a-z-[aeiou]]]
Note that [^a-z-[0-9]] is not interpreted as (using pseudoregex) [^{a-z-[0-9]}], but as (pseudoregex) [{^a-z}-[0-9]].
But you would never do that: [^a-z0-9] is much simpler.
Character Class Subtraction in the regex module for Python
The syntax is the same as for .NET, except for one added hyphen.
For instance, the class
[a-z--[aeiou]]
matches an English lower-case consonant.
In addition, when the subtracted class does not include a range, its brackets are optional.
The above can therefore also be written as [a-z--aeiou]
Workaround for Engines that Don't Support Character Class Subtraction
For engines that don't support character class subtraction, we can simply use a negative lookahead.
For instance, our English consonant regex [a-z&&[^aeiou]] can be written like this in PCRE (PHP, R…), Perl, Python and JavaScript:
(?![aeiou])[a-z]
The negative lookahead asserts that the following character is not a lowercase vowel.
Then [a-z] matches a letter, which is guaranteed not to be a vowel.
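Here is a quick way to try the lookahead workaround, sketched with Python's standard re module. The test string is arbitrary.

```python
import re

# Negative lookahead "subtracts" the vowels from [a-z].
consonant = re.compile(r'(?![aeiou])[a-z]')

found = consonant.findall('regex engine')
print(found)  # ['r', 'g', 'x', 'n', 'g', 'n']
```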
(?:0|[^\W\d])
Comedian: (?:B\w+ (*THEN)Murray|E\w+ (*THEN)Murphy|P\w+ Sellers)
We'll use it against this string:
Comedian: Bill Burr -- Comedian: Peter Sellers
The match is Peter Sellers.
Here's a top-level explanation.
What we have here can be seen as a construct of the form
if:then… elseif:then… else:then
The B\w+, E\w+ and P\w+ fragments are meant to match a first name.
The idea is that if the first name starts with a B, THEN the last name must be Murray…
If the engine matches a first name starting with a B but the last name is something else than Murray, the engine is instructed not to slowly backtrack each of the tokens of the failed branch's pattern B\w+, but to give up the entire branch (this might remind you of an atomic group) and skip directly to the next branch in the alternation, which is E\w+ (*THEN)Murphy.
In that branch, the idea is the same: if the first name starts with an E, THEN the last name must be Murphy, otherwise don't bother backtracking across the first name and skip to the next branch: P\w+ Sellers
In pseudo-regex, the expression reads:
- Match Comedian:
- If the first name starts with B, THEN match Murray
- Elseif the first name starts with E, THEN match Murphy
- Else match a first name starting with P, a space, and Sellers
In our example, the engine starts a match attempt at the beginning of this string:
Comedian: Bill Burr -- Comedian: Peter Sellers
After matching Comedian:, a space character, the Bill in Bill Burr and the space character, the engine matches the always-true (*THEN), but at that stage it fails to match the M in Murray.
It cannot backtrack across the (*THEN), so it gives up the content of the alternation's first branch (Bill and a space) and tries to match the E at the start of the middle branch.
That fails, so the engine tries to match the P in the third branch.
That too fails, so the entire match attempt has failed.
The engine advances in the string to the position preceding the o in Comedian and tries a second match attempt.
That fails immediately.
After a number of other immediately-failed match attempts, the engine tries a match attempt at the position before the C in Comedian: Peter Sellers.
This match attempt succeeds.
Here are code snippets if you'd like to try it with the two engines that currently support (*THEN).
# Perl
$comedian_regex =
qr/Comedian: (?:B\w+ (*THEN)Murray|E\w+ (*THEN)Murphy|P\w+ Sellers)/;
if ('Comedian: Bill Burr -- Comedian: Peter Sellers'
=~
$comedian_regex
) { print "\$&='$&'\n"; }
else { print "No match\n"; }
// PHP
$comedian_regex =
'~Comedian: (?:B\w+ (*THEN)Murray|E\w+ (*THEN)Murphy|P\w+ Sellers)~';
echo preg_match($comedian_regex,
'Comedian: Bill Burr -- Comedian: Peter Sellers',
$m
) ? "$m[0]\n" : "No match\n";
Is this useful?
Not for this simple example.
Consider the simpler alternate, where the two (*THEN) have been removed:
Comedian: (?:B\w+ Murray|E\w+ Murphy|P\w+ Sellers)
This matches the same strings.
Autopossessification
The (*THEN) are just meant to speed up the process by cutting down on backtracking.
But the time lost to compile this more complex pattern more than offsets any time gained during matching.
Furthermore, for this example, there is no time gain during matching (at least with the PCRE engine: I haven't timed Perl).
That is because PCRE would never backtrack into the first name in the first place.
Before matching, the engine studies the pattern, and an optimization kicks in that turns the \w+ token into a possessive \w++.
PCRE is able to do this because the token that follows \w+ is a space character.
The \w token and the space character are mutually exclusive: even if it backtracks into \w+, the engine will never be able to match a space character where a word character was matched earlier.
Therefore PCRE can treat \w+ as \w++.
This process is known by a charming term: autopossessification.
Try not to say it while chewing on oatmeal.
When you turn off PCRE optimizations by using PCRE's start of pattern modifiers (*NO_START_OPT) and (*NO_AUTO_POSSESS), the (*THEN) pattern edges out the plain pattern.
But only by a hair: this makes sense because Bill does not give \w+ much to backtrack.
That being said, when the expression to the left of a (*THEN) (within the same branch of the alternation) is particularly complex and time-consuming, I am sure there are situations when the verb would be useful.
For my part, I have never used it except to try it out.
When (*THEN) is not found within an alternation
The (*THEN) verb works by sending the engine to the next branch of an alternation.
When (*THEN) is not found within an alternation, it behaves like (*PRUNE).
\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers
Against the same string Bill Burr -- Peter Sellers, this pattern never matches Bill Burr.
Instead, it matches Peter Sellers.
The difference is that on the first match attempt, in the leftmost side of the alternation, after the M in Murray fails to match, the engine tries to backtrack across the (*PRUNE).
At that stage, the match attempt explodes—the other parts of the alternation are never visited.
The engine advances to the next position in the string (between the initial B and the i) and starts a whole new match attempt.
After this and several more match attempts fail, the engine starts a new match attempt at the string position immediately preceding the P in Peter Sellers, and the match succeeds with the rightmost branch of the alternation.
Here are some code snippets if you'd like to try this in the three engines that currently support (*PRUNE).
# Perl
$actor_regex = qr/\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers/;
if ('Bill Burr -- Peter Sellers'
=~ $actor_regex
) { print "\$&='$&'\n"; }
else { print "No match\n"; }
// PHP
$actor_regex = '~\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers~';
echo preg_match($actor_regex,
'Bill Burr -- Peter Sellers',
$m
) ? "$m[0]\n" : "No match\n";
# Python
# if you don't have the regex package, pip install regex
import regex as mrab
# print(mrab.__version__) should output 2.4.76 or higher
actor_regex = r'\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers'
print(mrab.search(actor_regex, 'Bill Burr -- Peter Sellers'))
# <regex.Match object; span=(13, 26), match='Peter Sellers'>
Is this (*PRUNE) useful?
For moderately complex expressions that may entail a lot of backtracking, (*PRUNE) can save the engine a lot of time.
It's a powerful weapon to have in your regex arsenal: you drop it anywhere, and it causes the match attempt to explode if the engine ever tries to backtrack across it.
Certain conditions must be met before (*PRUNE) becomes efficient:
- the time saved must outweigh the longer time needed to compile the regex.
If the only potential savings is backtracking across two letters (as with a \w{2,4}), this is not a place for (*PRUNE).
- (*PRUNE) must save real-life backtracking.
On paper, it may look like a (*PRUNE) saves you some backtracking, but your engine may have performed some optimizations that would prevent backtracking from ever happening anyway: see the section on autopossessification.
When (*PRUNE) does work, in many cases (*SKIP) works even better.
Teaser: in PCRE, you can accomplish similar results to (*PRUNE) with a callout that returns a positive value.
aa(*SKIP)ard\w+
The engine starts a match attempt at the beginning of the string.
It matches aa, passes over the (*SKIP), matches the third a then chokes on the r token because the next character in the string is a.
Internally, our engines are smart enough to fail the match right away because there is nothing to backtrack (no quantifiers, no alternations).
Externally, the engines behave as though they were conducting a naive path exploration that would cause them to give up the last a then attempt to backtrack across the (*SKIP) in search of a different match.
The (*SKIP) is triggered: the match attempt explodes; the engine skips in the string past the initial aa before starting its second match attempt.
At this position, there is only one a character, so this match attempt fails, as do other attempts until we reach the position preceding aaardwolf.
In contrast, without the (*SKIP), the pattern would match the string aaardvark starting on the second a.
Perl: Inconsistent behavior on later failure
Unlike PCRE and Python, when a pattern looks like it should fail to the right of a (*SKIP), the Perl engine does not always fire the (*SKIP).
For instance, with the earlier pattern aa(*SKIP)ard\w+ Perl matches aaardvark in aaaardvark aaardwolf.
Perhaps even worse, with a{1,2}(*SKIP)ard\w+, when matching against aaaardvark aaardwolf, after matching aaa and failing on the r token, Perl doesn't try to backtrack across (*SKIP) despite the quantified a{1,2}—a backtrackable expression which should clearly lure the engine to trigger the (*SKIP).
The (*SKIP) doesn't fire, and the engine matches aaardvark instead of the expected aaardwolf.
When I reported this as a bug, a Perl team dev kindly explained that (*SKIP) doesn't fire if the "study phase" at the start of a match attempt is able to decide that the match should not even be attempted.
This leads me to guess that at the first position in the string, the optimizer sees that the characters ar or aar must be matched but can't be.
The match attempt aborts before it even starts, so technically (*SKIP) is never crossed and the engine starts the next match attempt at the position following the first a.
This "optimized behavior" contradicts the pattern writer's directions, so it seems undesirable to me.
Since backtracking control verbs are described as experimental, it's entirely possible that the Perl team will decide to change the behavior at some stage.
Here are some code snippets if you'd like to try (*SKIP) in the three engines that currently support it.
# Perl
if ('123ABC' =~
/123(*SKIP)B|.{3}/
) { print "\$&='$&'\n"; }
# $&='ABC'
# with (*PRUNE) instead, the match would be 23A
// PHP
echo preg_match('~aa(*SKIP)ard\w+~',
'aaaardvark aaardwolf',
$m) ?
"$m[0]\n" : "No match\n";
# matches aaardwolf, whereas Perl matches aaardvark
# Python
# if you don't have the regex package, pip install regex
import regex as mrab
# print(mrab.__version__) should output 2.4.76 or higher
print(mrab.search(r'aa(*SKIP)ard\w+', 'aaaardvark aaardwolf'))
# <regex.Match object; span=(11, 20), match='aaardwolf'>
Apart from potentially avoiding lots of fruitless match attempts, (*SKIP) is particularly useful as part of the (*SKIP)(*FAIL) construct, which we'll study shortly.
In the section about (*MARK), we'll see that you can also make (*SKIP) cause the engine to skip to a specific "bookmark" in the string, rather than to the position where (*SKIP) was encountered.
B(?:A|I|O(*ACCEPT))Z
If the engine follows the O branch, it never matches BOZ because it returns BO as soon as (*ACCEPT) is encountered—which is what we wanted.
Here is sample code to try (*ACCEPT) in the two languages that support it.
For each language, the second fragment demonstrates the behavior inside a capture group.
# Perl
if ('BOZ' =~
/B(?:A|I|O(*ACCEPT))Z/
) { print "\$&='$&'\n"; }
# BO
# with 'BAZ' as subject, the match would be BAZ
if ('BOZX' =~
/B((?:A|I|O(*ACCEPT))Z)X/
) { print "\$1='$1'\n"; }
# Group 1: O
# with 'BAZX' as subject, Group 1 would be AZ
// PHP
echo preg_match('~B(?:A|I|O(*ACCEPT))Z~',
'BOZ',
$m) ? "$m[0]\n" : "No match\n";
// BO
// with 'BAZ' as subject, the match would be BAZ
echo preg_match('~B((?:A|I|O(*ACCEPT))Z)X~',
'BOZX',
$m) ? "$m[1]\n" : "No match\n";
// Group 1: O
// with 'BAZX' as subject, Group 1 would be AZ
Are there other ways to factor the B and Z patterns? Sure.
Conditionals come to mind, for instance:
- Option 1: B(?:(A|I)|O)(?(1)Z)
(if A or I have been captured to Group 1, then match Z)
- Option 2: B(?:A|I|(O))(?(1)|Z)
(if O has been captured to Group 1, then match the empty string, otherwise match Z)
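Both conditional options can be tried in Python's standard re module, which supports (?(1)…) even though it has no (*ACCEPT). The sketch below runs them on BOZ and BAZ from the earlier snippets, plus two strings of my own (BIZ and BA).

```python
import re

# Option 1: Z is required only if A or I was captured to Group 1.
option_1 = re.compile(r'B(?:(A|I)|O)(?(1)Z)')
# Option 2: if O was captured to Group 1, match nothing more; otherwise match Z.
option_2 = re.compile(r'B(?:A|I|(O))(?(1)|Z)')

def matched_text(pattern, subject):
    """Return the matched text, or None when there is no match."""
    m = pattern.search(subject)
    return m.group(0) if m else None

for subject in ['BOZ', 'BAZ', 'BIZ', 'BA']:
    print(subject, '->',
          matched_text(option_1, subject), '|',
          matched_text(option_2, subject))
```

As with the (*ACCEPT) version, BOZ should yield BO, while BAZ and BIZ match in full.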
Please note that (*ACCEPT) is not the opposite of (*FAIL)
'abc' =~ /\w+(?{print "$&\n";})(*F)/
This prints abc, ab, a, bc, b, c.
After the \w+ matches the whole string, the code capsule prints the match, then (*F) forces the engine to backtrack.
The engine gives up the c, the callback prints ab, the (*F) forces the engine to backtrack, and so on.
(*FAIL) is not the opposite of (*ACCEPT)
It's natural to imagine that (*FAIL) would be the opposite of (*ACCEPT), but that is not the case.
If it were the opposite of (*ACCEPT), then (*FAIL) would mean at this point in the match, fail the match attempt. The engine would then throw away whatever had been matched thus far, and perhaps begin a new match attempt at the next starting position in the string.
Forcing the Match Attempt to Fail
If you truly want the opposite of (*ACCEPT) in order to abort the match attempt, you will need to use something like (*PRUNE)(*FAIL) or (*COMMIT)(*FAIL):
In the case of (*PRUNE)(*FAIL), once the engine encounters (*FAIL), it tries to backtrack in order to find a successful match.
When it tries to backtrack across the (*PRUNE), the match attempt fails.
The engine then tries a new match attempt at the next starting position in the string, if any.
In the case of (*COMMIT)(*FAIL), once the engine encounters (*FAIL), it tries to backtrack in order to find a successful match.
When it tries to backtrack across the (*COMMIT), the match attempt fails, and the engine also abandons any further match attempts.
Soon we'll study a surprisingly useful variation on these themes: (*SKIP)(*FAIL).
{[^}]*}(*SKIP)(*FAIL)|\b\w+\b
Note that for this pattern, we assume we know that our text can never contain {nested{braces}}.
Here is how the pattern works.
First, notice the central pivot around the alternation |.
On the left side, we use {[^}]*} to try to match what we don't want: anything within a set of curly braces.
For this, we match a left curly brace, then [^}]* matches zero or more characters that aren't a right curly brace (another example of saying what we don't want), then we match a right curly brace.
If the engine matches a set of curly braces, it then encounters the (*SKIP) verb, which always matches.
Next, it encounters the token (*FAIL), which always fails.
At that stage, the engine tries to backtrack across (*SKIP) in the hope of finding a different way to match within the current attempt.
But (*SKIP) causes the match attempt to explode, and (*SKIP) also tells the engine to start the next match attempt at the string position corresponding to where (*SKIP) was encountered.
This means the engine will never again look at the set of curly braces it just matched.
The content we want to avoid has been skipped (matched then thrown away).
At the start of each match attempt, if the engine is unable to match a set of curly braces, it jumps to the right branch of the | alternation.
There, \b\w+\b attempts to match an individual word.
If this fails, the match attempt fails.
The engine then advances to the next position in the string, and tries a whole new match attempt (once again trying to match a set of curly braces first.)
In effect, (*SKIP)(*FAIL) says:
Throw away anything you can match to the left of me.
This is a very useful technique, and it also appears on the page about the best regex trick.
If you'd like to try (*SKIP)(*FAIL), here is sample code for the three engines that support it.
# Perl
while ('good words {and bad} {ones}' =~
/{[^}]*}(*SKIP)(*FAIL)|\b\w+\b/g
) { print "matched: '$&'\n"; }
# matched: 'good'
# matched: 'words'
// PHP
if(preg_match_all('~{[^}]*}(*SKIP)(*FAIL)|\b\w+\b~',
'good words {and bad} {ones}',
$matches))
var_dump($matches[0]);
/*
array(2) {
[0]=> string(4) "good"
[1]=> string(5) "words"
}
*/
# Python
# if you don't have the regex package, pip install regex
import regex as mrab
# print(mrab.__version__) should output 2.4.76 or higher
print(mrab.findall(r'{[^}]*}(*SKIP)(*FAIL)|\b\w+\b',
'good words {and bad} {ones}'))
# ['good', 'words']
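In engines without backtracking control verbs, a similar result can be had with a different technique: match the unwanted content on the left of the alternation, capture the wanted content on the right, and keep only the captures. The sketch below does this with Python's standard re module. Note that this is a swapped-in technique, not (*SKIP)(*FAIL) itself, and the braces are escaped to be safe.

```python
import re

# Left branch: match (and later discard) anything inside curly braces.
# Right branch: capture the words we actually want in Group 1.
pattern = re.compile(r'\{[^}]*\}|(\b\w+\b)')
subject = 'good words {and bad} {ones}'

# Group 1 is None whenever the left (unwanted) branch matched.
words = [m.group(1) for m in pattern.finditer(subject)
         if m.group(1) is not None]
print(words)  # ['good', 'words']
```

The overall matches still include the brace-delimited chunks. The extra step is to inspect Group 1 and ignore the matches where it was not set.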
\bcat\b
Improved word boundary
The regex above will not find cat in _cat25, because there is no boundary between an underscore and a letter, nor between a letter and a digit: these are all what regex defines as word characters.
If you think that digits and underscores should count as a word boundary, \b will therefore not work for you.
If you would like to use a boundary that detects the edge between an ASCII letter and a non-letter, you can make it yourself.
See the section about a DIY "real word boundary" on the page about regex boundaries.
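To make the difference concrete, here is a small Python sketch. The sample strings and the name LETTER_EDGE_CAT are made up for this illustration, and the pair of negative lookarounds is only one possible way to build a letter-based boundary; the linked section may do it differently.

```python
import re

# \b sees no boundary inside "_cat25": "_", letters and digits are all word characters.
plain = re.search(r"\bcat\b", "_cat25")

# Hypothetical DIY boundary: "cat" must not touch an ASCII letter on either side.
LETTER_EDGE_CAT = re.compile(r"(?<![A-Za-z])cat(?![A-Za-z])")
diy = LETTER_EDGE_CAT.search("_cat25")
inside_word = LETTER_EDGE_CAT.search("concatenate")
```

With this boundary, cat is found in _cat25 but still rejected inside concatenate.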
print( re.sub("X*", "Y", "A") )
_(?=.*:(\w+)\b)\1\b
But it only matches _mouse.
It looks like the lookahead is not trying all the options.
Why?
Because lookarounds are atomic (the link explains this example in detail).
Once the engine leaves a lookaround, its assertion has either returned true or false.
From the engine's standpoint, that is all it wants to know.
If a lookaround returns true, the engine tries to match the next tokens.
If something fails further down the pattern, the engine has no reason to revisit the lookaround: true is always true.
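The behavior can be reproduced in Python, whose lookarounds are atomic as well. The subject string below is an assumption invented for this sketch (three tagged words followed by a colon-separated list); it is not necessarily the string used in the linked explanation.

```python
import re

# Hypothetical test string: three tagged words, then a colon-separated word list.
subject = "_rabbit _dog _mouse DIC:cat:dog:mouse"
pattern = re.compile(r"_(?=.*:(\w+)\b)\1\b")

# The greedy .* makes the lookahead settle on the last list entry ("mouse").
# When \1 then fails against "rabbit" or "dog", the engine does not re-enter
# the lookahead to try "dog" or "cat" instead.
matches = [m.group(0) for m in pattern.finditer(subject)]
```

Even though dog is in the list, only _mouse is matched.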
(?!)
Its compactness is unbeatable.
How does it work? (?! … ) is a negative lookahead asserting that what is in the parentheses after the ?! cannot be matched.
In (?!) there is nothing between the (?! and the closing parentheses, an absence that the engine interprets as the empty string.
The regex engine translates (?!) as "assert that at this position, the empty string cannot be matched." The empty string can always be matched, so the engine fails to match the (?!) assertion.
This causes it to backtrack in search of a different match, if any can be found.
I first came across this little gem in the , but I don't know who invented it.
In Perl, PCRE (C, PHP, R…) and Python's alternate regex engine, there is a special token that can never match, forcing the engine to fail and causing it to backtrack in search of another match.
It can be written (*FAIL) or (*F).
Internally, PCRE optimizes (?!) to (*F).
You can read all about (*FAIL) on my page about backtracking control verbs.
Why would you want a token that never matches?
Among other uses, tokens that always fail come in handy:
- in conditionals that check for balancing conditions in the string (e.g. the engine found an "opening construct", now you want it to match the corresponding closing construct),
- to explore the branches of a match tree.
For details, please read the use cases for on my page about backtracking control verbs.
When I use this trick, is the regex match guaranteed to fail?
Not necessarily.
While the (?!) construct and its equivalents never match, that doesn't imply that the patterns containing them won't match: after failing to match (?!), the engine attempts to backtrack, and it may find an alternate path that succeeds.
If your pattern is designed in such a way that backtracking is fruitless, the match attempt will fail.
If you want to guarantee failure in all cases, please read how to force the match attempt to fail on my page about backtracking control verbs.
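A quick way to see both halves of that answer is to try (?!) in Python's re module, which accepts the empty negative lookahead. The subject strings are arbitrary examples chosen for this sketch.

```python
import re

# (?!) on its own can never match, whatever the subject.
never = re.search(r"(?!)", "anything")

# A pattern containing it can still succeed by backtracking into another branch.
rescued = re.search(r"a(?!)|b", "ab")
rescued_text = rescued.group(0) if rescued else None
```

The first search finds nothing, while the second one falls through to the b branch.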
^(?:..)+$
The non-capturing group (?:..) matches two characters.
The + quantifier repeats that one or more times.
To check that a string's length is a multiple of n, a more general version would be ^(?:.{n})+$
If your string spans several lines, you could start out with something like (?s)\A(?:.{n})+\z, using the single-line (DOTALL) mode s to ensure that the dot matches line breaks, and the \A and \z anchors to anchor the whole string.
Each line break character will count as one character, so this is probably less useful.
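Here is a minimal Python sketch of the general version. The helper name length_is_multiple_of is hypothetical, and note that Python's re module traditionally spells the end-of-string anchor \Z rather than \z.

```python
import re

def length_is_multiple_of(text: str, n: int) -> bool:
    """Return True if len(text) is a positive multiple of n, checked by regex."""
    # (?s) lets the dot match line breaks; \A and \Z anchor the whole string.
    return re.search(r"(?s)\A(?:.{%d})+\Z" % n, text) is not None
```

Because of the + quantifier, the empty string is rejected; swap it for * if a length of zero should count.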
(\b[A-Z]+\b)(?=.*(?=\b[a-z]+\b)(?i)\1)
The upper-case word is captured to Group 1.
Then we match characters up to a point where the lookahead (?=\b[a-z]+\b) can assert that what follows is a lower-case word, then we set case-insensitive mode on and we match the content of Group 1.
In .NET we could use a lookbehind instead of the double lookahead:
(\b[A-Z]+\b)(?=.*\b[a-z]+\b(?i)(?<=\1))
(?>A+)[A-Z]C
translates to
(?=(A+))\1[A-Z]C
and
(?>A|.B)C
translates to
(?=(A|.B))\1C
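The lookahead-plus-backreference form works in any Python version, so the loss of backtracking is easy to observe. This is a small sketch with a subject string chosen for the purpose.

```python
import re

subject = "AAAC"

# Ordinary greedy quantifier: A+ gives back one A so that [A-Z] can use it.
backtracking = re.search(r"A+[A-Z]C", subject)

# Emulation of (?>A+): once the lookahead has captured the run of As, \1 must
# consume all of it, so [A-Z] has no A left to match and the attempt fails.
emulated_atomic = re.search(r"(?=(A+))\1[A-Z]C", subject)
```

The plain version matches AAAC, while the emulated atomic version finds no match at all.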
(?:(?<foo> … )(?!))?
Here the name of the subroutine is foo, and the actual subroutine goes in the place of the ellipsis ….
For instance, (?:(?<uc_word>\b[A-Z]+\b)(?!))? defines a subroutine for a word in all-caps.
Once the subroutine is defined, you call it with \g<foo> (Ruby).
This makes it very convenient to call patterns repeatedly (especially if they are long), to maintain them in one place, and so on.
How does the trick work?
Before we look at a full-length example, let's discuss how this works.
The external parentheses in (?:(?<foo> … )(?!))? define a non-capturing group.
The group is made optional by the ? quantifier.
As we'll soon see, the non-capturing group always fails, and the zero-or-one quantifier ? is what allows us to go on after that internal failure.
Within the non-capturing group, we start with a standard named group definition (?<foo> … ).
This both creates a named group and a named subroutine.
The engine doesn't know that at this stage we only aim to define the subroutine without actually matching characters.
Therefore, if it finds characters that match the pattern in the named group, it matches them.
We don't want that—as our goal in our definitions is to simply define without moving our position in the string.
This is why after defining the named group and subroutine, we force the non-capturing group to fail by using (?!), which we saw earlier in the page with the forced failure trick.
Of course we don't want the whole regex to fail, which is why we use the ? quantifier to make the non-capturing group optional.
Full Example
This example is identical to the one in the section on pre-defined subroutines: it defines a basic grammar, allowing us to match sentences such as five blue elephants solve many interesting problems and many large problems resemble some interesting cars
This is for Ruby:
(?x) # whitespace mode
####### DEFINITIONS #########
# pre-define quant subroutine
(?:(?<quant>many|some|five)(?!))?
# pre-define adj subroutine
(?:(?<adj>blue|large|interesting)(?!))?
# pre-define object subroutine
(?:(?<object>cars|elephants|problems)(?!))?
# pre-define noun_phrase subroutine
(?:(?<noun_phrase>\g<quant>\ \g<adj>\ \g<object>)(?!))?
# pre-define verb subroutine
(?:(?<verb>borrow|solve|resemble)(?!))?
####### END DEFINITIONS #######
##### The regex matching starts here #####
\g<noun_phrase>\ \g<verb>\ \g<noun_phrase>
(?:bob_|joe_)*
But can you perform the same task without using an alternation?
This will do it:
(?:bob_)*(?:(?:joe_)+(?:bob_)*)*
After (?:bob_)* matches zero or more bob_ tokens, we match (zero or more times) one or more joe_ tokens followed by zero or more bob_ tokens.
In more general terms, (A|B)* becomes A*(?:B+A*)*.
We use this technique in the Unrolled Star Alternation solution to the Greedy Trap problem.
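If you want to convince yourself that the two forms accept the same strings, a brute-force comparison is one option. In this Python sketch the third token tim_ and the helper same_verdict are made up for the test; it checks every sequence of up to four tokens.

```python
import re
from itertools import product

ALTERNATION = re.compile(r"(?:bob_|joe_)*")
UNROLLED = re.compile(r"(?:bob_)*(?:(?:joe_)+(?:bob_)*)*")

def same_verdict(text: str) -> bool:
    """True if both patterns agree on whether the whole string is acceptable."""
    return bool(ALTERNATION.fullmatch(text)) == bool(UNROLLED.fullmatch(text))

# Every sequence of zero to four tokens drawn from three candidate pieces.
samples = ["".join(p) for n in range(5)
           for p in product(["bob_", "joe_", "tim_"], repeat=n)]
all_agree = all(same_verdict(s) for s in samples)
```

This is a spot check on short inputs, not a proof, but it catches transcription mistakes in the unrolled pattern quickly.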
if ('abc' =~ /\w+(?{print "$&\n";})(*F)/) {}
It prints out substrings of the test string abc:
abc
ab
a
bc
b
c
For the why, please read the explanations.
Here is C# code to do the same.
Granted, it is longer than the Perl one-liner, but you already knew that Perl and C# are different beasts.
using System;
using PCRE;
class Program
{
static void Main()
{
string subject = "abc";
var combo_regex = new PcreRegex(@"\w+(?C'temp: ')(*FAIL)");
combo_regex.Match(subject,
callout =>
{
Console.WriteLine(callout.String + callout.Match.Value);
return PcreCalloutResult.Pass;
}
);
Console.WriteLine("Press Key");
Console.ReadKey();
}
}
Here is the output:
temp: abc
temp: ab
temp: a
temp: bc
temp: b
temp: c
Press Key
Callout Specified as Lambda
The key feature is that when we call the Match method, in addition to the standard subject string, we pass the callout function.
There are several ways to pass the callout function.
In this example, for brevity, we pass a lambda.
If you plan to reuse the callout function, it probably makes sense to pass it as a delegate.
We will see how to do that in a later example.
Argument used as Value
One interesting feature is that the callout's argument is the string "temp: ".
This string is output on every temporary match report via callout.String.
Return Values
Note that the callout returns PcreCalloutResult.Pass
This maps to the zero value that tells the engine to resume the match attempt where it left off.
The other possible return values are:
PcreCalloutResult.Fail, equivalent to 1, telling the engine to fail the current match attempt, after which the engine, as usual, advances to the next position in the string and starts a new match attempt.
PcreCalloutResult.Abort, equivalent to -1, telling the engine to fail the overall match (the current match attempt fails, and the engine does not advance in the string to try other attempts).
Match object discarded
Usually, when we call combo_regex.Match(), we assign the resulting match object to a variable.
In this case, we don't care about the match object, so no assignment was made.
Alternate implementation
In the section on the callout function's return values, I mentioned that a positive value acts like a (*FAIL).
This means we can obtain the same result as above by removing the (*FAIL) and returning a positive value, which PCRE.NET expresses as PcreCalloutResult.Fail.
This fragment outputs the same temporary matches as before:
string subject = "abc";
var combo_regex = new PcreRegex(@"\w+(?C)",
PcreOptions.NoAutoPossess);
combo_regex.Match(subject,
callout =>
{
Console.WriteLine(callout.Match.Value);
return PcreCalloutResult.Fail;
}
);
But there is one subtlety: PcreOptions.NoAutoPossess, coming up next.
The Ghost of Autopossess (and of other Optimizations)
The PcreOptions.NoAutoPossess option sets PCRE's PCRE2_NO_AUTO_POSSESS option, which can also be turned on inline by the (*NO_AUTO_POSSESS) start-of-pattern modifier.
(Except that at the moment of writing there seems to be a bug with this latter syntax.)
As a reminder, the autopossess optimization turns some quantifiers into possessive quantifiers when the token that follows is incompatible with the quantified token (there is no shared ground, so no reason to backtrack).
For instance, \d+\D is automatically optimized to \d++\D.
For the same reason, the \w+ in our pattern is automatically optimized to \w++.
We need to turn that off, otherwise when the callout returns a positive value, the engine cannot backtrack into the atomic \w++, so the match attempt fails without further exploration.
In this case, the engine advances to the next position in the string to try the next match attempt, yielding this much shorter output:
abc
bc
c
For the same reason, if you want to make sure that callouts always work as you expect, you should turn off other optimizations as well.
Putting all the optimization killers in one place:
- PCRE2_NO_AUTO_POSSESS, set inline with (*NO_AUTO_POSSESS) or in PCRE.NET with PcreOptions.NoAutoPossess
- PCRE2_NO_START_OPTIMIZE, set inline with (*NO_START_OPT) or in PCRE.NET with PcreOptions.NoStartOptimize
- PCRE2_NO_DOTSTAR_ANCHOR, set inline with (*NO_DOTSTAR_ANCHOR) or in PCRE.NET with PcreOptions.NoDotStarAnchor
callout_info_regex.Match(subject, callout_info);
Capture Collection
Normally, in PCRE, when a capture group is quantified, as in (?:(\d)\D)+, the engine only returns the last value of the capture group.
That is how most engines work.
In contrast, the standard .NET engine has a feature called capture collections that let you examine all intermediate captures.
PCRE callouts take you some of the way in the direction of capture collections.
In this example, each pass displays the last capture group.
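The "last value only" behavior is not specific to PCRE. As a point of comparison, this Python sketch (with a subject string chosen for illustration) shows the same thing in the standard re module.

```python
import re

m = re.search(r"(?:(\d)\D)+", "1a2b3c")
whole_match = m.group(0)   # the full run that was matched
last_capture = m.group(1)  # only the final iteration's digit survives
```

The group matched 1, then 2, then 3, but only the 3 can be retrieved afterwards.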
I do not know of any regex flavors that allow you to capture a quantifier by itself or that allow you to use a backreference inside a quantifier in any way. If the situation changes, please write at the bottom of the page.
^@+ "[^"]+" =+ "[^"]+" -+ "[^"]+" /+ "[^"]+"$
Balancing the number of {@,-,=,/} is fairly straightforward in languages that use .NET regex thanks to the balancing groups feature, and I give a demo of this lower down.
In PCRE, it is also possible but far less straightforward, as you need to use some neat tricks: the syntax is far too complex and error-prone for it to be useful on a regular basis.
I explain these tricks (which I did not invent) lower on the page.
What we really need is a syntax such as the following:
(*) captures the number of characters the star quantifier is able to match.
Likewise, (+), (?) and ({2,9}) capture the number of characters these quantifiers are able to match.
\q1 refers to the first captured quantifier, \q2 refers to the second captured quantifier and so on.
\q+1 refers to the next captured quantifier.
\q+2 refers to the second-next captured quantifier, \q-3 refers to the third-previous quantifier.
With this syntax, we could easily match our desired pattern with the following regex:
^@(+) "[^"]+" ={\q1} "[^"]+" -{\q1} "[^"]+" /{\q1} "[^"]+"$
Alternately, using relative addressing, you could use either of the following expressions, where the quantifier is captured further down the string.
^@+{\q1} "[^"]+" ={\q1} "[^"]+" -{\q1} "[^"]+" /(+) "[^"]+"$
^@+{\q+2} "[^"]+" ={\q+1} "[^"]+" -(+) "[^"]+" /{\q-1} "[^"]+"$
The syntax would also allow you to use captured quantifiers in range quantifiers such as {\q1,\q2}.
An Alternative to Recursion
In some cases, capturing quantifiers would elegantly replace recursion.
For instance, suppose you want to match a number of zeros and ones framed by the same number of Ls and Rs, as in "LLL100110100RRR".
With recursion, you can write:
L((?R)|[01]*)R
With captured quantifiers, you can write:
L(+)[01]+R{\q1}
Usually, the non-recursive version would be easier to write and read.
A(+)[TG]+C{2*\q1+1}
A(+)[TG]+C{CALL_VERB somefunction(\q1)}
In conclusion, it seems to me that quantifier capture (as a first step) and quantifier arithmetic (as a second step) would nicely enhance the expressiveness of regular expressions and would be a logical extension of the syntax.
^(?:L(?<c1>)(?<c2>))+[^LM]+(?<-c1>M)+[^MR]+(?<-c2>R)+(?(c1)(?!))(?(c2)(?!))$
This is fairly long, but it is simple.
First, notice that the string is anchored by ^ and $ to prevent the engine from matching a balanced string within an unbalanced string.
The non-capturing group (?:L(?<c1>)(?<c2>))+ is quantified with a + and matches all the L characters.
It also contains two named capture groups c1 and c2.
These groups are empty, so they don't match anything; or, rather, they match an empty string.
You may know that .NET deals with quantified capture groups in a special way: it adds the successive captures of a given group to a collection of captures.
This means that each time an L is matched, a capture is added to the capture collection for named groups c1 and c2.
We will use c1 and c2 as counters.
By the time the engine exits the quantified non-capturing group, the c1 and c2 groups both hold the same number of captures, which is the number of Ls that were matched.
The [^LM]+ eats up all the characters up to the first M.
The (?<-c1>M)+ is a quantified group that eats up all the Ms.
It may look like it is a named group called "-c1", but -c1 is .NET regex syntax to pop (and discard) the last capture in group c1.
This means that each time we match an M, one c1 capture is thrown away.
In essence, we are decrementing our c1 counter.
If we have already matched as many Ms as Ls, group c1 is empty.
If there are any Ms left at that point, when the engine tries to pop one capture from the c1 group, it cannot do so, and the regex fails.
This ensures that there cannot be more Ms than Ls in the string.
Later, we will add a check to ensure that there are no more Ls than Ms.
The [^MR]+ eats up all the characters up to the first R.
The (?<-c2>R)+ matches all the Rs while decrementing c2, ensuring that there cannot be more Rs than Ls.
The (?(c1)(?!)) is a conditional that checks if capture group c1 is set.
If c1 is set, the engine tries to match (?!), which is a trick to force the regex to fail.
The conditional therefore forces the regex to fail if there are captures left in group c1, which would mean that we have not matched enough Ms to fully decrement our c1 "counter".
This expression ensures that we cannot have more Ls than Ms.
Likewise, (?(c2)(?!)) ensures that we cannot have more Ls than Rs.
That's a bit of syntax to explain, but I hope you'll agree that once you understand that syntax, writing such an expression is straightforward.
(\2?+M).
What does this do? After the first L is matched, Group 2 is undefined, and the "?" in \2?+ makes \2 optional, allowing that part of the expression to match.
Group 2 then matches the first M, and the value of Group 2 becomes "M".
After the second L is matched, Group 2 is still "M", so the \2 in (\2?+M) matches "M", then we match the second "M", and the value of Group 2 becomes "MM".
The "+" in \2?+ ensures that if we fail to match the M that follows, the engine doesn't backtrack by activating the optional \2? and dropping the first M.
Without the "+", we could match strings such as "LLL1M2R".
See the section on atomic groups.
Please understand that although Group 2 looks like a self-reference, the expression in Group 2 refers to the previously stored value.
Therefore, the value \2 of Group 2 after the closing parenthesis is not what it was inside the parentheses.
Ready? Here we go.
^((?:L(?=[^M]*(\2?+M)[^R]*(\3?+R)))+)\d+\2\d+\3$
Easy, right?… Just kidding.
Here is the commented break-down.
(?xm) # Free-spacing mode, multi-line
^ # Assert Beginning of Line
( # Begin Group 1
(?: # Non-capturing group, which will be repeated
L # Match one L
(?= # Begin Lookahead
[^M]* # Match any chars that are not M
( # Begin Group 2
\2?+ # Match Group 2 if possible, and if so
# do not later give up the match.
# In other words if Group 2 can be matched, match it.
# This could be expressed as (?(2)\2)+
# After we match the first L, Group 2 starts out undefined
# so the ? will be used.
# After we match the 2nd L, Group 2 is M
# so at that point we must match M.
# After we match the 3rd L, Group 2 is MM
# so at that point we must match MM.
M # Match M
) # End Group 2
# After matching the first L, Group 2 is M
# After matching the second L, Group 2 is MM
# After matching the third L, Group 2 is MMM
# etc.
[^R]* # Match any chars that are not R
(\3?+R) # Group 3 follows the same principle as Group 2
# If you have a hard time following, simplify
# the test string
# and remove the Group 3 section
) # End Lookahead
)+ # Repeat the non-capturing group
) # End Group 1
# If we stopped right there, the regex would match strings
# that have x characters L and at least x each of characters {M,R}
# but possibly more: there would be no guarantee of balance
# To validate that we have no more than needed,
# we now match (or lookahead) precisely what we want
# after all the L characters we have matched.
\d+ # Match some digits
\2 # Match the characters captured in Group 2
\d+ # Match some digits
\3 # Match the characters captured in Group 3
$ # Assert End of Line
Hope you enjoyed this one! Working through it is a great exercise.
But if it shows one thing, apart from the cleverness of certain coders, it's that realistically, to balance strings as we have done, you need something like the quantifier capture syntax advocated on this page.
In the unlikely case you'd like to see the same principle applied to the @@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas" example from the top, the code is here.
\?day=(\d)&name=([^&]+)&fruit=(\w+)
The values are captured in Groups 1, 2 and 3.
If not all strings contained all the parameters, you could make the components optional:
\?(?:day=(\d))?&?(?:name=([^&]+))?&?(?:fruit=(\w+))?
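Both expressions can be tried as-is in Python. The URLs below are invented for this sketch; the second one leaves out the name parameter to show what the optional version returns.

```python
import re

STRICT = re.compile(r"\?day=(\d)&name=([^&]+)&fruit=(\w+)")
OPTIONAL = re.compile(r"\?(?:day=(\d))?&?(?:name=([^&]+))?&?(?:fruit=(\w+))?")

# Hypothetical sample URLs
full = STRICT.search("index.php?day=3&name=joe&fruit=apple").groups()
partial = OPTIONAL.search("index.php?day=3&fruit=kiwi").groups()
```

A parameter that is absent shows up as None in its group, so the calling code can tell "missing" apart from "empty".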
How do I capture text inside a set of parentheses?
This is a common request on forums.
You have a file with text such as Acapulco airport (ACA) and you want to grab the text in the parentheses.
Here is a recipe to capture the text inside a single set of parentheses: \(([^()]*)\)
First we match the opening parenthesis: \(.
Then we greedily match any number of any character that is neither an opening nor a closing parenthesis (we don't want nested parentheses for this example): [^()]*.
This is the content of the parentheses, and it is placed within a set of regex parentheses in order to capture it into Group 1.
Last, we match the closing parenthesis: \).
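Applied to the airport example, the recipe looks like this in Python. The second subject string is an extra made up for this sketch, to show that several sets of parentheses are picked up one by one.

```python
import re

IN_PARENS = re.compile(r"\(([^()]*)\)")

# findall returns the content of Group 1 for every match
codes = IN_PARENS.findall("Acapulco airport (ACA)")
several = IN_PARENS.findall("Paris (CDG) to Tokyo (HND)")
```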
How do I match text inside a set of parentheses that contains other parentheses?
This requires a small tweak and a regex flavor that supports recursion.
We're still going to match the opening parenthesis at the very start and the closing parenthesis at the very end.
Inside, we'll match "stuff that's not parentheses" (or nothing), followed by zero or more sequences of (i) a repeat of the whole pattern (expressed below as (?R)), and (ii) more "stuff that's not parentheses" (expressed below as (?2)).
\((([^()]*+)(?:(?R)(?2))*)\)
I can't guarantee that this works in every situation as recursive patterns are fickle, but here's PHP code that tests the expression on various sets of nested parentheses.
<?php
$regex='~\((([^()]*+)(?:(?R)(?2))*)\)~';
$strings=array('Airport: (ACA)','equation1: (1+(a+b))','equation2: (1+(a+b)+c)','equation3: (1+(a+b)+(2+2)+c)',
'equation4: (1+(a+b)+(2+(7/5)-2)+c)');
foreach($strings as $string)
if(preg_match($regex,$string,$match)) echo $string.' <b>capture:</b> '.$match[1].'<br />';
?>
This is a bit different from the expression offered by Jeffrey Friedl in : (?:[^()]++|\((?R)\))*, which you'd have to tweak before you could pop it in the code above in order to capture the contents of the parentheses: \(((?:[^()]++|(?R))*)\).
In my tests, I have found this expression to be up to twenty percent faster when the match works as planned, but slower by the same amount when a parenthesis is missing.
^(?:1[5-9]|[2-9]\d|[1-9]\d\d+)$
With this approach, we progress in numerical order with multiple alternations, first trying to match numbers between 15 and 19, then numbers between 20 and 99, then numbers 100 and above.
^(?:1(?:[5-9]|\d\d+)|[2-9]\d+)$
With this approach, we look at two cases: either the first digit is a 1, or it is anything else.
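Reading the two descriptions together, both expressions appear to target whole numbers of 15 and above, written without leading zeros. Under that assumption, this Python sketch compares each regex with plain arithmetic over a range of integers; the names SEQUENTIAL, BY_FIRST_DIGIT and agrees_with_arithmetic are made up for the check.

```python
import re

SEQUENTIAL = re.compile(r"^(?:1[5-9]|[2-9]\d|[1-9]\d\d+)$")
BY_FIRST_DIGIT = re.compile(r"^(?:1(?:[5-9]|\d\d+)|[2-9]\d+)$")

def agrees_with_arithmetic(n: int) -> bool:
    """True if both patterns accept str(n) exactly when n >= 15."""
    expected = n >= 15
    text = str(n)
    return (bool(SEQUENTIAL.search(text)) == expected
            and bool(BY_FIRST_DIGIT.search(text)) == expected)

all_ok = all(agrees_with_arithmetic(n) for n in range(12000))
```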
How do I validate that a time string is well-formed?
Here's an expression I came up with.
^(?:([0]?\d|1[012])|(?:1[3-9]|2[0-3]))[.:h]?[0-5]\d(?:\s?(?(1)(am|AM|pm|PM)))?$
It matches times in a variety of formats, such as 10:22am, 21:10, 08h55 and 7.15 pm.
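The expression relies on a conditional, (?(1)…), which Python's re module also supports, so it can be tested there unchanged. In this sketch the helper name is_well_formed_time is hypothetical and the rejected samples are my own additions.

```python
import re

TIME = re.compile(
    r"^(?:([0]?\d|1[012])|(?:1[3-9]|2[0-3]))[.:h]?[0-5]\d"
    r"(?:\s?(?(1)(am|AM|pm|PM)))?$"
)

def is_well_formed_time(text: str) -> bool:
    # Group 1 is only set for hours 0 to 12, so am/pm is only allowed there.
    return TIME.search(text) is not None
```

For instance, 21:10 passes but 21:10pm does not, because Group 1 is unset when the hour is 13 or above.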
How do I validate that a list is made of certain items, in any order?
Scenario: you want to make sure that the string only contains items from a list, delimited by a comma (for instance).
These items could be objects, numbers, names.
For instance: 212, 415, 850.
Here is a general solution:
Example 1: ^(?:peas|onions|carrots)(?:,(?:peas|onions|carrots))*+$
Example 2: ^(?:415|212|850)(?::(?:415|212|850))*+$
(note that here the delimiter is a colon.)
Explanation: You need one of the words to be present at least once.
Then it is optionally followed by a comma and another word, multiple times.
If you are using PCRE, you can use the repeating syntax for a more compact, maintainable expression: ^(peas|onions|carrots)(?:,(?1))*+$
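Since the item alternation appears twice, it can be convenient to build the pattern from the list itself. This is a minimal Python sketch of that idea: the function name list_validator is hypothetical, and it uses a plain * instead of the possessive *+ because possessive quantifiers are only available in recent Python versions.

```python
import re

def list_validator(items, delimiter=","):
    """Compile a regex accepting one or more of `items` joined by `delimiter`."""
    one_item = "(?:" + "|".join(re.escape(i) for i in items) + ")"
    return re.compile("^%s(?:%s%s)*$" % (one_item, re.escape(delimiter), one_item))

VEGETABLES = list_validator(["peas", "onions", "carrots"])
AREA_CODES = list_validator(["415", "212", "850"], delimiter=":")
```

Maintaining the list in one place plays the same role here as the (?1) call does in the PCRE version.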
How can I validate that a string contains the text "75", but only once?
This is similar to the password validation presented on the Lookaround page: you set a number of conditions before matching the string.
^(?=.*?75)(?!.*?75.*?75).*$
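Here is the expression at work in Python; the helper name contains_75_exactly_once and the sample strings are made up for this sketch.

```python
import re

ONCE = re.compile(r"^(?=.*?75)(?!.*?75.*?75).*$")

def contains_75_exactly_once(text: str) -> bool:
    # The first lookahead demands one "75"; the second forbids two of them.
    return ONCE.search(text) is not None
```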
How can I validate that a binary string contains ten 1s at the most? ("reverse password validation technique")
This is a variation on the password validation technique: we look ahead to make sure that the string does not contain what we don't want, then we match.
^(?!0*(?:10*){10}1)[01]+$
After anchoring the expression, in the negative lookahead, we build a generic binary string that has at least 11 ones.
This is what we don't want.
To build that string, we state that it can start with any number of zeros.
Once the zeroes are consumed, we have a one, followed by optional zeroes.
That's our first one.
We repeat this ten times, bringing us to ten ones.
Finally, we add one last one to get over the limit.
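The limit is easy to probe in Python by generating strings on either side of it. The helper name at_most_ten_ones is hypothetical.

```python
import re

AT_MOST_TEN = re.compile(r"^(?!0*(?:10*){10}1)[01]+$")

def at_most_ten_ones(bits: str) -> bool:
    # Rejected if the lookahead can find an eleventh 1, or if a non-binary
    # character (or the empty string) makes [01]+ fail.
    return AT_MOST_TEN.search(bits) is not None
```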
\b\d{1,10}\b.
The boundaries are there to make sure you don't match a portion of a twenty-digit number when you really only want to match a number that has ten digits at the most.
For this kind of simple max, I really recommend you print out the cheat sheet.
How can I match all lines except those that contain a certain word?
Typically, this would be used in a case where you want to capture something on each line, except those that present certain features.
Let's go with the simple case where you want to match all lines, except those that contain "BadWord".
This will match your lines:
(?m)^(?!.*?BadWord).*$
If you want to exclude BadWord only when it stands on its own, set it apart with the \b boundary:
(?m)^(?!.*?\bBadWord\b).*$
Also note that this is a potential application of the best regex trick ever, for which I won't repeat the details—but know that you'll need to examine Group 1 captures, for which the page provides you with sample code in various languages.
(?m)^.*?\bBadWord\b.*$|(^.*$)
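To compare the plain and the \b versions of the lookahead approach side by side, here is a Python sketch run on a made-up four-line text.

```python
import re

# Hypothetical sample text
text = "first line\nthis has BadWord in it\nBadWordy neighbours\nlast line"

# Any line containing the letters BadWord is dropped.
kept = re.findall(r"(?m)^(?!.*?BadWord).*$", text)

# Only lines where BadWord stands alone as a word are dropped.
kept_whole_word = re.findall(r"(?m)^(?!.*?\bBadWord\b).*$", text)
```

The line with BadWordy survives only in the second version, because there is no word boundary between BadWord and the y.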
Match numbers followed by letter or end of string
In the string 00-11A22B33_44, suppose you are interested in matching numbers, provided they are followed by a letter or the end of the string.
You can solve that with:
\d+(?=[A-Z]|$)
The lookahead (?=[A-Z]|$) asserts: what follows is either an uppercase character, or the end of the string—exactly what we want.
The trick here is to not be shy to use the $ anchor in a context where it is not on its own, at the end of the string.
Dollars are people too!
If you've only seen basic regex tutorials, you could be forgiven for assuming that the ^ anchor only ever appears at the very beginning of an expression, while the $ anchor always sits quietly at the very end.
You can use anchors anywhere in your pattern.
They are assertions like any other.
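Running the expression on the sample string in Python shows the anchor doing its job inside the lookahead.

```python
import re

# 11 and 22 are followed by a letter; 44 is followed by the end of the string.
numbers = re.findall(r"\d+(?=[A-Z]|$)", "00-11A22B33_44")
```

Here 00 and 33 are skipped because they are followed by a hyphen and an underscore.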
How can I match paragraphs that contain MyWord, but only proper paragraphs starting with two carriage returns?
This question is about finding text within specific formatting.
If a paragraph starts with a single carriage return, you are not interested.
You are only interested in the first paragraph or those set off by two carriage returns.
On systems where a carriage return only inserts a newline character (such as Unix), you could start with this:
(?m)^(?<=^\A|\n\n).*?SomeWord.*$
The lookbehind ensures that the line is either the first in the text, or that it is preceded by two newlines.
On Windows, in the place of \n\n, you would want \r\n\r\n.
For something portable, on PCRE, use \R, which matches any newline sequence.
Your expression would look like this:
(?m)^(?<=^\A|\R\R).*?SomeWord.*$
Match pairs of characters in the correct slots
Suppose you want to match all two-digit numbers that start with a 6.
Further, you think of your string as a series of pairs, so you would want to match 68 in 116822, but not in 168122.
Let's proceed step by step.
To match the first pair that starts with a 6, you could use
^(?:[^6].)*(6\d)
and retrieve the match from Group 1.
The anchor ^ ensures that we start looking at the beginning of the string.
The non-capture group (?:[^6].)* matches zero or more pairs of characters that do not start with a 6 (using the parity trick to stay in sync with the two-character slots in the string), then the parentheses around (6\d) capture our match to Group 1.
In Perl, PCRE (PHP, R…) or Ruby 2+, we could do away with the capturing group and match the string directly by using \K, which forces the engine to drop what was matched previously: ^(?:[^6].)*\K6\d.
Likewise, in .NET, we could use infinite lookbehind: (?<=^(?:[^6].)*)6\d
But we don't want to match just one pair: from 00611122665564, we want to extract 61, 66 and 64.
This is a place where the match continuation anchor \G comes in very handy.
\G matches the beginning of the string, or the position immediately following the previous match.
It is supported in .NET, Perl, PCRE (PHP, R…), Java and Ruby.
It will ensure that our second and next matches do not fall out of sync with the two-character slots in the string.
Here is the general option with capture groups:
\G(?:[^6].)*(6\d)
In engines that support \K, we would use \G(?:[^6].)*\K(6\d) to get a direct match.
And in .NET, we would use an infinite lookbehind: (?<=\G(?:[^6].)*)(6\d)
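Python's standard re module has neither \G nor \K, but the same "resume exactly where the last match ended" behavior can be approximated by hand. This sketch is an assumption-laden workaround rather than a feature of the engine: it leans on the fact that a compiled pattern's match(string, pos) only succeeds if the match starts at pos. The function name pairs_starting_with_6 is made up.

```python
import re

PAIR = re.compile(r"(?:[^6].)*(6\d)")

def pairs_starting_with_6(digits: str) -> list:
    """Collect the two-character slots that start with a 6, staying in sync."""
    found, pos = [], 0
    while True:
        m = PAIR.match(digits, pos)  # anchored at pos, playing the role of \G
        if m is None:
            break
        found.append(m.group(1))
        pos = m.end()                # the next attempt starts right after this match
    return found
```

Each attempt either starts on a slot boundary or fails, so a 6 sitting in the second half of a pair is never picked up.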
(?m)^
and replace with your chosen line prefix.
Likewise, to insert a suffix at the end of lines, you can use this regex to search:
(?m)$
and replace with your chosen line suffix.
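In Python, both replacements are one-liners with re.sub. The sample text, the "> " prefix and the " <" suffix are arbitrary choices for this sketch; the text deliberately has no trailing line break, since $ would otherwise also match at the very end of the string.

```python
import re

text = "alpha\nbeta"

# (?m) makes ^ and $ match at the start and end of every line.
prefixed = re.sub(r"(?m)^", "> ", text)
suffixed = re.sub(r"(?m)$", " <", text)
```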
How do I replace one tag delimiter with another?
Let's say you want to replace [square brackets] with <pointy brackets> without changing the stuff in the brackets.
Search: \[([^]]+)]
This search expression matches an opening bracket, then anything that is not a closing bracket, then a closing bracket.
The content of the brackets is captured in Group 1.
Replace: <\1>
The replacement expression just places the capture (Group 1) within a brand new set of pointy brackets.
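The search-and-replace pair translates directly to Python. In this sketch the closing bracket is escaped inside and outside the character class, which is a spelling choice of mine for readability; the sample sentence is made up.

```python
import re

def square_to_pointy(text: str) -> str:
    # Group 1 holds the content of the brackets; \1 puts it back unchanged.
    return re.sub(r"\[([^\]]+)\]", r"<\1>", text)

converted = square_to_pointy("see [this] and [that one]")
```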
How do I replace the string "//" in a whole file, but only when it is part of a path?
Let's say in a page of text you want to replace all instances of // or \\ with a single forward slash.
No problem, that's what your replace function is designed to do.
In PHP: $string=preg_replace('~//|\\\\~','/',$string); (the backslashes need to be escaped).
By the way, this is a great example of why something like a tilde (~) often works better than / as a delimiter.
With / as a delimiter, the regex would look like this: $string=preg_replace('/\/\/|\\\\/','/',$string);
The real "problem" is if you wanted to replace all instances of //, but only in parts of your text file that look like this: Document=root//folder1//folder2//(maybe_more_folders)//file.extension
You can't do a plain replace, as instances of // that you don't want to touch would also be replaced.
You can't capture the various parts of the file path into groups and build a generic replace string, because you don't know how many subfolders are in the string.
For this kind of problem, I use three distinct solutions depending on the context and my mood.
Solution #1: Variable-Width Lookbehind.
This simple solution works in ABA and RegexBuddy (.NET flavor), which have variable-width lookbehinds.
You search for (?<=Document=.*)// and replace with a single slash.
Solution #2: Replace function with Callback.
This solution works if your programming language has a replace function that allows you to call another function.
The replace function passes the whole match.
The "callback function" works on the match and returns the replacement string.
In this instance, Document=[^/]*+(?>//[^\s]+) matches the type of string we are looking for.
In PHP, we can use:
$string=preg_replace_callback('~Document=[^/]*+(?>//[^\s]+)~',
function ($match) {return str_replace('//','/',$match[0]);},
$string);
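The callback approach carries over to Python, where re.sub accepts a function as the replacement. This is a sketch rather than a port: the possessive quantifier and atomic group are replaced with ordinary ones so that it runs on older Python versions, and the names PATH, collapse_slashes and fix_paths are made up.

```python
import re

PATH = re.compile(r"Document=[^/]*(?://\S+)")

def collapse_slashes(match):
    """Callback: receives the whole match, returns the replacement string."""
    return match.group(0).replace("//", "/")

def fix_paths(text: str) -> str:
    return PATH.sub(collapse_slashes, text)
```

Any // outside a Document= path, such as the one in a URL, is left alone because the regex never matches it.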
Solution #3: Multiple replacements.
This solution works in environments where you can run a replace operation multiple times (until you exhaust any replacements to be made).
For instance, in this case, we can safely assume that no path will have a hundred subfolders, so we can run the replace operation a hundred times.
On my system, I can run this kind of operation in Directory Opus (for file renaming) and EditPad Pro.
The trick here is to build an expression that will continue to match the string you want to alter, even after you have made several replacements.
In our example,
(Document=[^/]*+(?>/(?!/)[^/\s]+)*+)(//)
will capture everything before the first // in Group 1, then capture the first // in Group 2.
You replace the match with \1 and a single /, then you repeat the operation as many times as necessary.
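In an environment with a scripting language, the repetition can be a loop that stops as soon as a pass makes no replacement. The Python sketch below uses non-possessive quantifiers for compatibility with older versions; the function name collapse_repeatedly and the cap of one hundred passes (borrowed from the reasoning above) are assumptions.

```python
import re

STEP = re.compile(r"(Document=[^/]*(?:/(?!/)[^/\s]+)*)(//)")

def collapse_repeatedly(text: str, max_passes: int = 100) -> str:
    for _ in range(max_passes):
        # Each pass turns the first remaining // of every path into a single /.
        text, count = STEP.subn(r"\1/", text)
        if count == 0:
            break
    return text
```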
How do I replace curly Quotes ("smart quotes") with straight quotes?
This is not a hard regex problem: we just want to replace some characters with some other character.
It's a character set problem.
You need to know every Unicode code point (or the few single-byte codes) for curly quotes.
The regex is self-explanatory: I'll just give you the solution, first for UTF-8, then for single-byte text (often loosely called ASCII).
For utf-8 text (which is what I have on my website), I use the two replace lines in the code below.
<?php
$string='“Take me to ‘the station’ ”, he said.';
echo 'Before: '.$string.'<br />';
$string=preg_replace('~[\x{0091}\x{0092}\x{2018}\x{2019}\x{201A}\x{201B}\x{2032}\x{2035}]~u',"'",$string); // single curly quotes
$string=preg_replace('~[\x{0093}\x{0094}\x{201C}\x{201D}\x{201E}\x{201F}\x{2033}\x{2036}]~u','"',$string); // double curly quotes
echo 'After: '.$string;
?>
Output:
Before: “Take me to ‘the station’ ”, he said.
After: "Take me to 'the station' ", he said.
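For comparison, the same replacement can be sketched in Python. This is only an illustration: the function name is made up, and the classes reuse the Unicode code points from the PHP example (leaving out the U+0091 to U+0094 control characters).

```python
import re

# Same curly-quote code points as the PHP character classes above.
SINGLE_QUOTES = re.compile("[\u2018\u2019\u201A\u201B\u2032\u2035]")
DOUBLE_QUOTES = re.compile("[\u201C\u201D\u201E\u201F\u2033\u2036]")

def straighten_quotes(text):
    """Replace curly single and double quotes with their straight versions."""
    text = SINGLE_QUOTES.sub("'", text)
    return DOUBLE_QUOTES.sub('"', text)

curly = "\u201CTake me to \u2018the station\u2019 \u201D, he said."
print(straighten_quotes(curly))
```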
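For text in a single-byte Windows encoding such as Windows-1252 (often loosely called "extended ASCII"; plain ASCII has no curly quotes), you can use this:
<?php
// Assumes this script and $string are saved as Windows-1252, where the curly quotes are bytes 0x91 to 0x94 (decimal 145 to 148).
$string='“Take me to ‘the station’ ”, he said.';
echo 'Before: '.$string.'<br />';
$string=preg_replace('~[\x91\x92]~',"'",$string); // single curly quotes
$string=preg_replace('~[\x93\x94]~','"',$string); // double curly quotes
echo 'After: '.$string;
?>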
How do I convert a whole string to lowercase except certain words?
Input: Tomatoes AND orangeS AND ParsleY
We want to convert the whole sentence to lowercase, except the word AND.
Here are three ways to handle this.
1.
Match all words except AND, and replace them to their lowercase version using a callback function (preg_replace_callback in PHP).
Match: (?!\bAND\b)\s*\b\w+\s*
Here is a working example:
<?php
$string='Tomatoes AND orangeS AND ParsleY';
$regex='~(?!\bAND\b)\s*\b\w+\s*~';
$string=preg_replace_callback($regex,function ($m) {return strtolower($m[0]);} ,$string);
echo $string;
?>
2.
Progressively match the whole string, capturing word groups in Group 1 and 'AND' in Group 2, then rebuild the string.
This is heavier programmatically, but according to my benchmarks (running each piece of code a million times) it is 33% faster, thanks to the averted callbacks.
<?php
$string='Tomatoes AND orangeS AND ParsleY';
$regex=',((?!\bAND\b)\s*\b\w+\s*)(\bAND\b|$),';
preg_match_all($regex, $string, $matches, PREG_PATTERN_ORDER );
$size=count($matches[1]);
$string='';
for ($i=0;$i<$size;$i++) $string.=strtolower($matches[1][$i]).$matches[2][$i];
echo $string."<br />";
?>
3.
Use the best regex trick ever, for which I won't repeat the details—but know that you'll need to examine Group 1 captures, for which the page provides you with sample code in various languages.
(?m)^.*?\bBadWord\b.*$|(^.*$)
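As an illustration of the third approach applied to this problem, here is a minimal Python sketch. The pattern `\bAND\b|(\w+)` and the function name are my own assumptions, not taken from the page being referred to: the left side of the alternation matches the word to leave alone, and the right side captures every other word in Group 1.

```python
import re

def lowercase_except(text, keep="AND"):
    """Lowercase every word except `keep`, by inspecting Group 1."""
    pattern = re.compile(r"\b%s\b|(\w+)" % re.escape(keep))

    def repl(match):
        if match.group(1) is not None:
            return match.group(1).lower()  # a word we want to lowercase
        return match.group(0)              # the protected word, left as-is

    return pattern.sub(repl, text)

print(lowercase_except("Tomatoes AND orangeS AND ParsleY"))
```

The protected word is matched first and thrown back unchanged, so the callback only has to act when Group 1 is set.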
How do I replace all words that appear on the black list, but not those on the white list?
Let's say you want to replace all instances of the word sax with '###', even when it is part of other words such as "saxophone", but not when it is part of "Essax" and other words on a white list.
And let's say you have a whole blacklist of "bad words" besides "sax", each word with its own whitelist of acceptable uses.
Crafting a custom regex for each word is a bit long.
The easier procedure is to replace each instance of the "bad words" that occur in a white list word with something distinctive.
For instance, add "@@@" to the end of every white list word that contains "sax"—turning "Essax" into "Essax@@@".
With a simple lookahead, you can then replace sax everywhere, except when it is part of a word that ends in "@@@":
sax(?!\w*@@@)
Last, all you have to do is zap all the "@@@".
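The three steps can be sketched in Python as follows. The function name, its parameters and the sample sentence are hypothetical, and the sketch assumes the marker "@@@" never occurs in the original text.

```python
import re

def censor(text, bad_word, whitelist, marker="@@@"):
    """Replace bad_word with ### except inside whitelisted words."""
    # Step 1: tag every whitelisted word so the lookahead can recognise it.
    if whitelist:
        allowed = "|".join(re.escape(word) for word in whitelist)
        text = re.sub(r"\b(?:%s)\b" % allowed,
                      lambda m: m.group(0) + marker, text)
    # Step 2: replace the bad word unless the rest of its word ends in the marker.
    guarded = r"%s(?!\w*%s)" % (re.escape(bad_word), re.escape(marker))
    text = re.sub(guarded, "###", text)
    # Step 3: zap the markers.
    return text.replace(marker, "")

print(censor("sax saxophone Essax", "sax", ["Essax"]))
```

For a whole blacklist, you would call the function once per bad word, each time with that word's own whitelist.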
How do I fix unclosed tags?
Here is an example I'm particularly fond of because it's a great use of conditionals.
The problem: in this string
a<1bc<2>3>de<<4f5g
the numbers are supposed to live in complete tags, like so: <1>
Sometimes the opening tag is missing, sometimes the closing tag is missing, sometimes there are multiple opening tags, sometimes the tag is properly formed.
To match these numbers, if we make both tags optional, as in <*(\d+)>*, then we will erroneously match the 5, which has no tag at all.
To ensure there is at least one tag, one solution is to say "match opening tags and optionally match closing tags, OR optionally match opening tags and match closing tags."
This looks like this:
Match: <+(\d+)>*|<*(\d+)>+
Replacement: <\1\2>
This works great, but the alternation can give the engine a lot of work.
Isn't there a way to say "at least one of the tags has to be present"? With conditionals, there is:
Match: (<)*(\d+)(>)*(?(1)|(?(3)|(?!)))
Replacement: <\2>
The first part of the expression matches optional opening tags, a number, and optional closing tags.
The opening tags are captured in Group 1.
The number is captured in Group 2.
The closing tags are captured in Group 3.
After all this matching takes place (without using an alternation), a conditional expression checks that at least one of the two tags was present (and therefore captured).
Here is the logic:
IF Group 1 was captured: (?(1)
…THEN no need to match anything,
OTHERWISE (no Group 1 capture),
IF Group 3 was captured: (?(3)
…THEN no need to match anything,
OTHERWISE (neither tag group was captured), THEN fail: (?!)
The key here is to force the regex to fail unless we are happy with the match.
(See forced regex failure on the tricks page for more about forcing a regex to fail.)
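To see both versions at work, here is a small Python sketch (the function names are made up for illustration). Python's re module supports conditionals of the form (?(1)yes|no), so the conditional pattern can be used as written; for the alternation version, a callback picks whichever group was set.

```python
import re

SUBJECT = "a<1bc<2>3>de<<4f5g"

# Alternation version: the number lands in Group 1 or Group 2.
ALTERNATION = re.compile(r"<+(\d+)>*|<*(\d+)>+")
# Conditional version: fail unless Group 1 (opening) or Group 3 (closing) was set.
CONDITIONAL = re.compile(r"(<)*(\d+)(>)*(?(1)|(?(3)|(?!)))")

def fix_with_alternation(text):
    return ALTERNATION.sub(lambda m: "<%s>" % (m.group(1) or m.group(2)), text)

def fix_with_conditional(text):
    return CONDITIONAL.sub(r"<\2>", text)

print(fix_with_alternation(SUBJECT))
print(fix_with_conditional(SUBJECT))
```

In both cases the untagged 5 is left alone, while every number that had at least one tag ends up in a complete tag.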
Now tell me… how neat is that?
Smiles,
Rex
Hey Dude! This site is awesome! I can tell you've really put a lot of work and thought into it.
It's been a while since I've dipped into Regex, but I'm excited to re-learn it
Under Capturing in the third title you have a spelling error, it says to instead of do.
"How to I match text inside a set of parentheses that contains other parentheses? "
Just wanted to give you a heads up.
I love your site! Jacob Hawks
Reply to Jacob Hawks
Hey man, thanks for the kind words and typo report! Really appreciate it.
Fixed.
Wishing you a great day.
[ -~]
All Printable Characters in the ASCII Table—Except the Space Character
[!-~]
All "Special Characters" in the ASCII Table
(?![a-zA-Z0-9])[!-~]
All "Special Characters" in the ASCII Table—Without Using Lookahead
[!-/:-@\[-`{-~]
All Latin and Accented Characters
(?i)(?:(?![×T÷t])[a-zà-])
All English Consonants
[b-df-hj-np-tv-z]
[^\W_]
This is an interesting class for engines that don't support the POSIX [[:alnum:]].
It makes use of the fact that \w is very close to what we want.
[^\W] is a double negation that matches the same as \w.
By adding _ to the negated class, we are left with ASCII letters and digits.
Watch out, though: in Python and .NET, \w matches any unicode letter.
But frankly...
Just use [a-zA-Z0-9].
See also Any White-Space Except Newline.
Binary Number
[^\D2-9]+
This is the same idea as the regex above to match alphanumeric characters.
In most engines, the character class only matches digits 0 or 1.
The + quantifier makes this an obnoxious regex to match a binary number—if you want to do that, [01]+ is all you need.
Note that in some engines (such as .NET and Python 3), \d matches any digit in any script, so the meaning in those engines would be "any digit in any script, except ASCII digits 2 through 9".
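A quick Python check of the two double-negation classes. This is only a sketch: the sample strings are made up, and the re.ASCII flag is used to restrict \w and \d to ASCII, which is what the explanations above assume.

```python
import re

ALNUM = re.compile(r"[^\W_]+", re.ASCII)     # ASCII letters and digits only
BINARY = re.compile(r"[^\D2-9]+", re.ASCII)  # digits that are not 2-9, i.e. 0 and 1

alnum_runs = ALNUM.findall("ab_c1 d-e")
binary_runs = BINARY.findall("1012 77 0")

# Without re.ASCII, \d is Unicode-aware in Python 3, so a digit from
# another script (here ARABIC-INDIC DIGIT THREE) slips into the class.
unicode_hit = re.fullmatch(r"[^\D2-9]", "\u0663") is not None
ascii_hit = re.fullmatch(r"[^\D2-9]", "\u0663", re.ASCII) is not None

print(alnum_runs, binary_runs, unicode_hit, ascii_hit)
```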
[][]
The crazy thing is that there is a lot of variation among engines as to which brackets need to be escaped.
While [\]\[] will work everywhere, in JavaScript you can use [[\]], and in Java you can use []\[].
Words you can Type with your Left Hand
(But you'll need a QWERTY keyboard.)
(?i)\b[a-gq-tv-xz]+\b
Words you can Type with your Right Hand (QWERTY keyboard)
(?i)\b[uh-py]+\b
Words that only use Letters from the Top Row (QWERTY keyboard)
(?i)\b[eio-rtuwy]+\b
[^\S\n]
Alternative to [\r\n] for Java and Ruby 2+
(?![ \t\cK\f])\s
This rather pointless regex (except as a learning device) relies on the fact that in these three engines \s matches an ASCII space, a tab, a line feed, a carriage return, a vertical tab or a form feed: the negative lookahead removes all of those characters except the newline and carriage return.
[a-zA-Zàaéèêùüàéèêùü]
German Letters
The controversial capital letter for ß (ẞ), now included in Unicode, is missing in many fonts, so it might show on your screen as a question mark.
[a-zA-ZäöüßÄÖÜẞ]
Polish Letters
[a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]
Note that there is no Q, V and X in Polish.
But if you want to allow all English letters as well, use [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]
Italian Letters
[a-zA-ZàèéìíòóùúÀÈÉÌÍÒÓÙÚ]
Spanish Letters
[a-zA-ZáéíñóúüÁÉÍÑÓÚÜ]
Subject: Spanish
Thx!! I couldnt figure out how to keep my spanish characters while cleaning up some tweets.
Reply to cesar
Hi Cesar,
I'm delighted to hear that you were able to solve your problem.
Wishing you a wonderful day, -Rex
(?:\d|[aeiou])*
and \d*(?:[aeiou]+\d*)*
The original pattern (with the alternation) compiles faster: 1.6 millionths of a second vs. 2.2 for the unrolled version.
However, it executes a lot slower: 1.7 millionths of a second vs. 0.8.
This seems like a potentially useful optimization to implement at the engine level (in one's own code, it is a bit hard to maintain).
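Before trusting an unrolled pattern, it is worth checking that it matches exactly what the original does. Here is a minimal Python sketch of such a check; the sample strings and the helper name are made up, and the sketch makes no claim about timings.

```python
import re

ALTERNATION = re.compile(r"(?:\d|[aeiou])*")
UNROLLED = re.compile(r"\d*(?:[aeiou]+\d*)*")

def same_match(subject):
    """True if both patterns match the same prefix of the subject."""
    return ALTERNATION.match(subject).group(0) == UNROLLED.match(subject).group(0)

samples = ["", "123", "a1e22io", "12x", "ae9 u", "b"]
print(all(same_match(s) for s in samples))
print(UNROLLED.match("a1e22io!").group(0))
```

Both patterns can match the empty string, so match() never returns None here.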
$pattern='/\d+\b(day|night)\b/S';
Apparently, this study mode can be useful when parsing long documents such as web pages.
It may not help, but it costs less than a hundred thousandth of a second.
Version | pcregrep | pcretest |
---|---|---|
10.37 (26 May 2021) | download | download |
10.36 (4 Dec 2020) | download | download |
10.35 (9 May 2020) | download | download |
10.34 (21 Nov 2019) | download | download |
10.33 | download | download |
10.22 | download | download |
10.21 | download | download |
10.20 | download | download |
10.10 | download | download |
Version | pcregrep | pcretest |
---|---|---|
8.45 (15 Jun 2021) | download | download |
8.44 (12 Feb 2020) | download | download |
8.43 (23 Feb 2019) | download | download |
8.38 | download | download |
8.37 | download | download |
8.36 | download | download |
8.35 | download | download |
8.34 | download | download |
8.32 | download | download |
8.31 | download | download |
C:\Windows\System32
folder.
This system folder is in the system's path variable, which means that when you try to run a program, Windows looks there.
That will allow you to run pcregrep (now called grep) from any folder.
I copy pcretest to the same folder, that way I don't need to remember where it lives.
To run grep, open a command prompt, which is never more than five keyboard strokes away: Windows key, type "cmd", press Enter.
If you use a marvellous program called Directory Opus (probably the ultimate productivity application for Windows), you can also invoke a command prompt from the current folder by using a keyboard shortcut such as Ctrl + Shift + R.
That's what I do.
From the command prompt… Start grepping, debugging or optimizing!
grep list_of_options regex_pattern files_to_match
The full syntax is in the manual file which is included in the download.
But manual pages can be confusing, so here are some examples that work on my Windows machine.
Command | Description |
---|---|
grep --help | Displays a list of the options you can use with the command. You can send the output to a file with "grep --help > grephelp.txt". But note that the manual page in the download has much more detail. |
grep toto * | Looks for the string "toto" in all files in the current directory. Returns all the lines that match, with a little context. Complains that it cannot open directories. |
grep -s toto * | As above, but the "s" option makes the engine shut up (or silent) about the fact that it cannot open directories. That should probably be the default option on Windows—if you don't want to see the warnings, just get into the habit of adding an s to your options when you are searching all files. |
grep -s toto *.txt | As above, but only looks in all files with a "txt" extension. |
grep -s \btoto\d{3}\b *.txt | As above, except that instead of looking for a simple string, we are using a regex pattern. Note that there is no delimiter. See the rest of the site for pattern syntax. This particular regex will match strings such as "toto123" as long as they are not embedded in a string of "word characters". You get the idea: going forward, to focus on grep features, many examples will use simple text matches instead of regex. |
grep -r toto . | Looks for the string toto in all files, recursively from the current folder. |
grep -r --include=.*\.txt toto . | Looks for the string toto in all files, recursively from the current folder, but only in files with a "txt" extension. Note that pcregrep uses a PCRE regex to specify the names of the files in which to search. |
grep -r --exclude=.*\.dat toto . | Specifies file names to exclude from the search, using a PCRE regex to define the set of files to exclude. |
grep -ri toto . | As four lines above, with the addition of the "i" option, which makes the search case-insensitive and allows the command to match "toTO". The "-ri" also showcases how to combine short options. |
grep -r (?i)toto . | As above, but setting case-insensitivity in the regex itself. See the page about (? syntax. |
grep -f patterns.dat *.txt | Reads patterns from a file called patterns.dat (one pattern per line, up to 100 patterns) and matches each pattern against all files with a "txt" extension! |
grep -v toto *.txt | Inverts the match, so that only lines that do not match the pattern are reported. |
grep -o \btoto\w\b *.txt | The "o" option only reports each line's match, without the context. |
grep -l toto *.txt | The "l" option says to only list the names of the files that contain a match, without showing the matches. |
grep -L toto *.txt | The "L" option says to only list the names of the files that do not contain a match. Not the same as "-vl", which shows files that contain lines that do not match (some lines may match). |
grep -n toto *.txt | Adds the line number to the reported match. |
grep -c toto *.txt | Only reports the number of matches in each file. |
grep -NANYCRLF toto *.txt | By default, because this is Windows, grep treats \r\n (CRLF) as a new line. This option makes grep treat CR, LF or CRLF as new lines, which comes in handy if you are testing Unix files. See the manual for other options such as -NLF |
grep -so1 toto(\d{3}) * | We saw the s option before (silent). The "o1" option tells pcregrep to only report the Group 1 matches, in this case the three digits after "toto". You could likewise specify -o2 to only report Group 2 captures. This option should have an alias "g" for "group", in order to avoid confusion with "o", which means "only report the match (no context)". |
grep --color toto *
Sadly, this does not work in the Windows shell (cmd.exe) and results in this strange output: "←[1;31mtoto←[00m", while no color is displayed.
This is a limitation of the Windows shell rather than pcregrep.
The PCREGREP_COLOR option, set to "1;31" by default in the make files, is an ANSI code that can manipulate colors on terminals that accept these codes, as on Unix.
Windows is a different OS, so we shouldn't expect it to work.
You can easily change the overall background and text color of the cmd shell, either from the menu (click on the icon at the top left), or from the command-line, by passing strings such as "color=1B" when launching cmd ("1" stands for a blue background, "B" stands for very light blue text).
But to my knowledge there is no way to manipulate the color of individual text in cmd.exe.
The work-around is to use a different terminal.
I tried pcregrep in Console, and the color option works, but I haven't yet figured out how to integrate Console in my system so that it launches in the right path.
(I normally launch command shells from Directory Opus with a Ctrl + Shift + R shortcut, and they open in the right folder, in admin mode).
Last Words about pcregrep
I hope you get as much pleasure out of having that powerful grep utility at your fingertips as I do.
Okay, it's time to look at pcretest!
~\btoto(\w+)\b~
, and a subject such as "slkjtototatalkj".
With the right parameters, pcretest would show you the exact path it took in order to produce a match:
--->slkjtototatalkj
+0^\b
+2^t
+3^^o
+4^^t
+5^^o
+6^^(\w+)
+7^^\w+
+10^^)
+11^^\b
+13^^
0:tototata
1:tata
I hope you'll agree that this is rather nifty.
It could come in handy for an expression that fails for unknown reasons.
You'll be able to see exactly what is going on.
Now let's talk about optimization.
The pcretest utility lets you run a regex on some data a million times (or however many times you like), and it reports the average time it needed to find a match.
This makes it easy to compare alternate expressions.
When you read about techniques to optimize your regular expressions, you may be interested in running tests on alternate regex phrasings.
You can do that in your programming language—I used PHP to test many tweaks suggested in Jeffrey Friedl's book—but pcretest gives you an even more powerful test bench.
By the way, PCRE must have been seriously optimized since Jeffrey's book came out, because as mentioned in my page about regex tricks, none of the tweaks I tried seemed to make much difference.
Perhaps partly thanks to Jeffrey's hints?
Before looking at the pcretest command itself, you need to know that it usually operates on an input file.
The file contains the regex to be tested, and the lines of text to test.
The regex is on the first line, and must be enclosed in delimiters.
Here is an example of a file that would work, with the regex delimited by tildes ("~").
The regex is shown in bold for emphasis, but that would not be part of the text file.
~toto\d{3}~
aslkj 242
slkj totos lkj
sdlkj toto444 sdfs
sadflkj
If you wanted, you could add more regexes and data to that file.
Just leave a blank line after the end of the data, then start your next regex, then add the data for that regex.
Here is a file that would work with pcretest and contains two regexes.
You can add as many regexes and lines of data as you like.
~toto\d{3}~
aslkj 242
slkj totos lkj
sdlkj toto444 sdfs
sadflkj
~\btoto(\w+)\b~
aslkj 242
slkj tototata lkj
The regexes can use the entire PCRE syntax, whether in the pattern itself (for instance with (?s) or \K), or after the delimiter, for instance with the G flag.
There is one particularly interesting flag for debugging: "C".
It's the flag that produced the "trace output" I showed you a few paragraphs ago.
To get that output, I just added "C" to the pattern in the IN.txt file: ~\btoto(\w+)\b~C
But enough about the input file.
Let's now talk about the command itself.
Here are some sample uses of pcretest that you could try with either of the files above.
For a full reference, I highly recommend you read the manual page, which is part of the download.
Command | Description |
---|---|
pcretest -help | Displays a list of the options you can use with the command. You can send the output to a file with "pcretest -help > testhelp.txt". But note that the manual page in the download has much more detail. |
pcretest -C | Outputs some information about the version of pcretest you are running, such as whether it supports UTF-8. |
pcretest IN.txt OUT.txt | Reads the regex and data from IN.txt, outputs the result to OUT.txt, reporting the matches for each line. In the case of the second regex, which includes a capture group, pcretest also reports the Group 1 match. |
pcretest -t 100000 IN.txt OUT.txt | As above, but runs 100,000 times, and reports both the matches and the average time per run. Start without the -t option, just in case there is an error in your expression. |
\bc\w+
The idea is to match words that start with the letter c.
In our input, these words are cat, carrot and caramel.
\bc\w+
(which matches words that start with the letter c)
perl -ne 'while(/\bc\w+/g){print "$&\n";}' yourfile
perl -ne 'print if /\bc\w+/' yourfile
perl -ne 'print "$&\n" if /\bc\w+/' yourfile
perl -0777 -ne 'while(m/\bc\w+/g){print "$&\n";}' yourfile
perl -0777 -ne 'print "$&\n" if /\bc\w+/' yourfile
perl -0777 -pe 's/\bc\w+/ZAP/g' yourfile
perl -0777 -pe 's/\bc\w+/ZAP/' yourfile
perl -pe 's/\bc\w+/ZAP/g' yourfile
perl -pe 's/\bc\w+/ZAP/' yourfile
perl -0777 -ne 'if(@r=split(m/\bc\w+/,$_)){foreach(@r){print "$_\n";}}' yourfile
perl -ne 'if(@r=split(m/\bc\w+/,$_)){foreach(@r){print "$_\n";}}' yourfile
(?s)^.*?(?=\n=== SUMMARY ===)
One day your boss tells you that all Python scripts have to be moved over to Ruby.
As you write the code, you pop the regex straight into the new script.
Now you have two problems.
(No, not these two problems.)
1.
The inline modifier (?s) that you had used to flip on DOTALL mode (dot matches line breaks) does not work in Ruby.
2.
The caret anchor ^ that you had used to tell the engine to search at the beginning of the file tells Ruby to search at the beginning of every line.
This is what RegexBuddy's Convert tab is for.
If there is a straightforward conversion, it does it brilliantly.
<a href="regexbuddy:forum/view/1000/1007880">Green title bar</a>
Enjoy!
if subject =~ /(?-i)https?/i
There are simple rules to deal with such situations (namely the case-insensitivity passed as a parameter to the match function is overridden once the pattern turns it off inline), and you can be sure that RegexBuddy knows them.
The regular expression does not match the test subject.
You place the pattern under the microscope, you try it in actual code, and there's nothing wrong with it. In these situations, I've learned that a handful of settings are often the culprit. The highest offender by far is the Line by Line vs Whole file setting. I use both and often forget to check it when working fast. Another big one is free-spacing mode that I've left turned on when I'm typing classic-style. Oh, and case-sensitivity is known to interfere. Did I mention line breaks? Did DOTALL or, more insidiously, CR Only get turned on? It sounds silly but sometimes I've just forgotten to turn Highlight on, or I've turned off the Update automatically setting inside List all. I've never flipped on the Lazy quantifiers toggle by mistake. But you might! Come to think of it, I've never turned it on at all. Who knows what happens when you do? Music starts playing, a magical door opens? For now I won't look. If all of those fail… Well, bugs do happen. If your language is showing you one thing and RB another, and you've double-checked all the settings, then it's definitely worth posting via the Forum tab.
"O Deep Thought computer," he said, "the task we have designed you to perform is this. We want you to tell us...." he paused, "The Answer." "The Answer?" said Deep Thought. "The Answer to what?" "Life!" urged Fook. "The Universe!" said Lunkwill. "Everything!" they said in chorus. Deep Thought paused for a moment's reflection. "Tricky," he said finally. "But can you do it?" "Yes," said Deep Thought, "I can do it. But, I'll have to think about it." "How long?" "Seven and a half million years," said Deep Thought. [Seven and a half million years later.... Fook and Lunkwill are long gone, but their ancestors continue what they started] "Good Morning," said Deep Thought at last. "Er..good morning, O Deep Thought" said Loonquawl nervously, "do you have...er..." "An Answer for you?" interrupted Deep Thought majestically. "Yes, I have." "And you're ready to give it to us?" urged Loonquawl. "I am." "Now?" "Now," said Deep Thought. "Though I don't think," added Deep Thought, "that you're going to like it." "Doesn't matter!" said Phouchg. "We must know it! Now!" "Alright," said Deep Thought. "The Answer to the Great Question..." "Of Life, the Universe and Everything..." said Deep Thought. "Is..." said Deep Thought, and paused. "Yes...!!!...?" "Okay, here it is, let me print it out for you," said Deep Thought, with infinite majesty and calm. Slowly, a narrow tape came out of a small slit in Deep Thought's titanium panels. It read:
^(?=(?!(.)\1)([^\DO:105-93+30])(?-1)(?<!\d(?<=(?![5-90-3])\d))).[^\WHY?]$
"But… What does that mean?" asked Loonquawl. "I don't know," said Deep Thought. "But I can design a more powerful computer that will be able to tell you that." "It will take time, though", added Deep Thought.
Curious?
1. On the following link, you can see a demo of the Meaning of Life Regex at work.
2. … But I highly recommend you try to figure it out for yourself: it's a great exercise!
3. Authors: Douglas Adams in this book, and for the regex, Rex, 7 August 2014
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
Obviously, something there is missing… What two problems are we talking about? Here are some attempts to complete the quote. Please send yours.
The Reversed Quote Hypothesis
Some people, when confronted with a problem, think “I know, I won't use regular expressions.” Now they have two problems.
Author: Rex, 21 October 2015
The Parrot Hypothesis
Sometimes, when confronted with a problem, you think “I know, I'll use regular expressions.” Now you have two problems: 1. figuring out what to do with the many hours of tedious coding you just saved, and 2. having to deal with the trolls who give you an earache parroting some lame quote about having two problems.
Author: Rex, 7 May 2014
The Recursive Hypothesis
Some people, when confronted with a problem, think “I know, I’ll quote Jamie Zawinski.” Now they have two problems.
Source: Martin Liebach, March 4 2009
Variation: I think that reads better with “I know, I’ll call Jamie Zawinski.” Now they have two problems.
Will the well-meaning people of this world bring an end to this awful controversy? People have been discussing the two-problem question quite seriously on Stack Exchange but have failed to reach a consensus. I sincerely hope that with your help, the strict scientific approach on this page will be more fruitful.
And now… A bit of regex entertainment.
\x5C[^\cH],u(I)D\g{1}0t!
(Untested.
Please don't use this in your code.)
Author: Rex, 8 May 2014
.+
This one is too clever for me.
I've read the explanation on Stack Overflow, but I still don't get it.
From Morten Just
Q: What regex are you most likely to see at Christmas?
A: [^L]
Q: Why couldn't Chris try out the regular expressions he created until he left home?
A: His mom wouldn't let him play with matches.
Source: mortenjust
<tag>.*?</tag>
prevents you from steamrolling from the start to the end of a string such as <tag>Tarzan</tag> likes <tag>Jane</tag> may seem like the best regex trick ever.
At other points in your career, you'll surely fall in love with regex bits such as [^"]+
to match all the content between certain delimiters (in this case double quotes), or with atomic groups.
However, as you mature as a regex practitioner, you come to regard these techniques for what they are: language features rather than tricks.
They are neat, to be sure, but they are how regex works, and nothing more.
In contrast, a "trick" is not a single point of syntax such as a negated character class or a lazy quantifier.
A regex trick uses regex grammar to compose a "phrase" that achieves certain goals.
With regex there's always more to learn, and there's always a more clever person than you (unless you're the lone guy sitting on top of the mountain), so I've often been exposed to awesome tricks that were out of my league—for instance the famous regex to validate that a number is prime, or some fiendish uses of recursion.
But however clever these tricks, I would not call any of them the "best regex trick ever", for the simple reason that they are one-off techniques with limited scope.
You are unlikely to ever use them.
In contrast, the reason I drum up the technique on this page as the "best regex trick ever" is that it has several properties:
Anyone can learn it.
You don't have to be a regex master.
It answers not one, but several common and practical regex questions.
These questions are ones that even competent regex coders often have trouble answering gracefully.
It is simple to implement in most programming languages.
It is easy to extend when requirements change.
It is portable over numerous regex flavors.
It is usually more efficient than competing methods.
It is too little-known.
At least, until now.
Do I have your attention yet?
Before we proceed, I should point out some limitations of the technique:
It will not butter the reverse side of a toast.
It will not make small talk with your mother-in-law.
It relies on your ability to inspect Group 1 captures (at least in the generic flavor), so it will not work in a non-programming environment, such as a text editor's search-and-replace function or a grep command.
The point above also means that you may have to write one or two extra lines of code, but that is a light price to pay for a much cleaner, lighter and easier to maintain regex.
Code samples for the six typical situations are provided below.
There is an edge case to keep in mind.
The regex engine dumps unwanted content into a trash can.
In a typical context that is no problem, but if you are working with an enormous file, the trash can may get so large that you could run into memory issues.
Other than that, it's awesome.
Okay, let's dive in.
No need to buckle up, the technique itself is delightfully simple.
(?<!")Tarzan(?!")
However, this does not work because it also excludes valid strings such as "Tarzan and Jane" and "Jane and Tarzan", whereas we only wanted to exclude "Tarzan".
Back to the Future Regex I
To account for this, you might inject a lookahead inside your negative lookbehind.
This is what I call a "Back to the Future" regex.
The lookahead inside the lookbehind asserts that after we've found the opening double quote behind Tarzan, we can find Tarzan (surprise) and a closing double quote.
Since we're inside a negative lookbehind, this whole package is what we don't want.
(?<!"(?=Tarzan"))Tarzan
Step Forward then Backflip
This approach is closely related to the Back to the Future approach.
You match Tarzan, then you exclude the match if it is followed by a double quote (lookahead) that is preceded by the string "Tarzan".
Tarzan(?!"(?<="Tarzan"))
Conditional
Alternately, you might first turn the negative lookbehind into a positive lookbehind that captures the opening quote if found, then tag a conditional at the end to assert that if Group 1 was set, the following character cannot be a double quote.
(?>(?<=(")|))Tarzan(?(1)(?!"))
Logic à la Lewis Carroll
For this simple sample problem, you can modify the faulty lookarounds solution with a bit of logic:
(?<!")Tarzan|Tarzan(?!")
The left side of the alternation excludes "Tarzan, but the right side allows it.
The right side of the alternation excludes Tarzan", but the left side allows it.
As desired, this expression can match Tarzan, "Tarzan, Tarzan" but not "Tarzan".
This is neat, but is it obvious? You might find the logic immediate, but most people will need to think about it for a moment to see how this works (I'm in that camp).
The four options above work… but good luck explaining them to your boss.
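To make the difference concrete, here is a small Python check of the faulty two-lookaround pattern against the "Lewis Carroll" alternation. The test strings and names are made up for illustration; both patterns only use fixed-width lookarounds, so they run unchanged in Python's re module.

```python
import re

NAIVE = re.compile(r'(?<!")Tarzan(?!")')
EITHER_SIDE = re.compile(r'(?<!")Tarzan|Tarzan(?!")')

CASES = ['Tarzan', '"Tarzan', 'Tarzan"', '"Tarzan"',
         '"Tarzan and Jane"', '"Jane and Tarzan"']

# For each subject, record whether each pattern finds a match.
naive_hits = [bool(NAIVE.search(s)) for s in CASES]
either_hits = [bool(EITHER_SIDE.search(s)) for s in CASES]

for subject, naive, either in zip(CASES, naive_hits, either_hits):
    print(repr(subject), naive, either)
```

The naive pattern rejects every subject where a quote touches Tarzan on either side, whereas the alternation only rejects the fully quoted "Tarzan".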
Option 2: Parity Check
You can check that Tarzan is not inside quotes by checking that it is not followed by one quote followed by an even number of quotes.
That's a bit of a hack.
Tarzan(?!"(?:(?:[^"]*"){2})*[^"]*$)
Simple, right? Er… Not really.
There's plenty of room to introduce bugs here.
And indeed, this regex will not properly handle "Jane and Tarzan", where we would like Tarzan to match (you could get around this with a lookbehind and an alternation).
In contrast, the solution that uses the regex trick on this page will be hauntingly simple.
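The weakness of the parity check is easy to reproduce. The Python sketch below assumes the lookahead is anchored at the end of the string with $ (without that anchor, counting the remaining quotes would have no effect); the sample strings are made up.

```python
import re

# Reject Tarzan when it is followed by a quote and then an even number
# of quotes up to the end of the string.
PARITY = re.compile(r'Tarzan(?!"(?:(?:[^"]*"){2})*[^"]*$)')

def is_matched(subject):
    return PARITY.search(subject) is not None

print(is_matched('Tarzan'))                      # bare word
print(is_matched('"Tarzan"'))                    # the string we want to avoid
print(is_matched('He said "Tarzan" to "Jane"'))  # still avoided: two quotes follow
print(is_matched('"Jane and Tarzan"'))           # the case the check gets wrong
```

The last subject is rejected even though Tarzan is not the quoted "Tarzan" string on its own, which is the flaw described above.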
Option 3: The Two- or Three-Step Dance (Replace before Matching)
I'll expand on this option below when we look at cases more complex than "Tarzan".
In the meantime, here is the idea:
1.
Replace all instances of the bad string (here "Tarzan").
If you're just trying to match, your replacement can be "" (you can remove the string).
If you want to replace the good strings but leave the bad strings, replace the bad strings with something distinctive, such as "T~a~r~z~a~n"
2.
Simply match or replace the string you want (here Tarzan), which is now safe to do as you know that all the bad strings have been neutralized.
3.
If you are replacing rather than simply matching, there is one more step: you now need to revert the distinctive strings ("T~a~r~z~a~n") to their original form.
When you're working with a text editor and want to perform replacements, this is often your best bet.
The technique on this page is for when you are working in a programming language that allows you to inspect your Group 1 captures, so it won't help you in EditPad Pro or Notepad++.
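For what it's worth, the three steps above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the sketch assumes the distinctive placeholder never occurs in the input text.

```python
# Placeholder for the bad string; assumed never to appear in real input.
PLACEHOLDER = '"T~a~r~z~a~n"'

def replace_unquoted(text: str, new: str) -> str:
    """Replace Tarzan with `new`, except where it appears as "Tarzan"."""
    step1 = text.replace('"Tarzan"', PLACEHOLDER)   # 1. neutralize the bad strings
    step2 = step1.replace('Tarzan', new)            # 2. replace the good strings
    return step2.replace(PLACEHOLDER, '"Tarzan"')   # 3. revert the placeholders

print(replace_unquoted('Tarzan said "Tarzan" to Tarzan', 'Cheetah'))
```

A text editor does the same thing with three search-and-replace passes.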
Option 4 for Perl, PCRE, Ruby, Python: \K
I'll also expand on this option below when we look at more complex cases than "Tarzan".
This option works in Perl, PCRE (C, PHP, R, …), Ruby 1.9+ and Python's alternate regex engine.
In these regex flavors, the \K token tells the engine to discard the characters matched up to its appearance when preparing the overall match.
We can use this feature to match unwanted content (here "Tarzan" and other characters that are not Tarzan) up to the very point where a wanted string begins (here Tarzan).
At that point, the \K discards the unwanted content, and the engine proceeds to match the content we really want.
This solution looks like this:
(?:"Tarzan".*?)*\KTarzan
This is a compact option if you use an engine that supports it, but it has limitations to be aware of, which you'll see below.
Okay, that was the simple case.
Here the context to avoid had a fixed width: a single double-quote character on either side of the word Tarzan.
Now let's look at the general case, where the content to exclude has a variable width.
B. General Case: variable-width exclusion (for instance between tags)
More often than not, the context we want to exclude has a width we cannot predict.
For instance, suppose we want to avoid matching the string Tarzan somewhere between [a] tags, as in [a]Jane and Tarzan[/a].
Not only will the string between the tags be variable (here Jane and Tarzan), but the tag itself may also vary, as in [a].
In such situations, you often see the big guns come out.
Option 1: Variable-Width Lookbehind
In most regex flavors, a lookbehind must have a fixed number of characters, or at least a number of characters within a specified range.
However, a handful of flavors allow true variable-width lookbehinds.
Among the chosen few are .NET, Matthew Barnett's alternate regex module for Python and JGSoft (available in RegexBuddy and EditPad).
In .NET, the question "match Tarzan except inside curly braces" (e.g., not in "{Jane loved Tarzan's curly hair}") can almost be gracefully handled with:
(?<!{[^}]*)Tarzan
Back to the Future Regex II
Why almost? Because "Tarzan" should be allowed in { Jane and Tarzan..., where the left brace is left open.
To check both sides, we'll need to inject a positive lookahead inside the lookbehind—stepping into Back to the Future territory—to assert that after we've found what we were looking for behind Tarzan, we can find Tarzan (surprise) and optional characters up to a closing curly brace.
Since we're inside a negative lookbehind, this whole package is what we're trying to avoid.
This is the adult version of our earlier lookaround solution, and it looks like this:
(?<!{[^}]*?(?=Tarzan[^{}]*}))Tarzan
What if you need more restrictions, such as also forbidding Tarzan from appearing in [i][/i] tags inside of [p][/p] tags? Yes, you can add more variable-length lookbehinds.
Good luck to you as the restrictions become more numerous and complex.
Also, if the pattern to be matched is more complex than the literal Tarzan, the expression can fast become unmanageable.
And in Java, PHP, Ruby and Python's re module, you can forget about this technique altogether because infinite-width lookbehinds do not exist in these flavors.
Option 2: The Two- or Three-Step Dance (Replace before Matching)
To match all instances of Tarzan unless they are embedded in a string inside curly braces, one fairly heavy but simple solution is to perform a two-step dance: Replace then Match.
If we also want to replace all these matches, we need a third step: a final replacement.
Step 1: You positively match all instances of Tarzan embedded in curly braces.
If you're just trying to match, your replacement can be "" (you can remove the string).
If you want to replace the good strings but leave the bad strings, replace the word Tarzan with something distinctive, such as "T~a~r~z~a~n".
To perform the match, this simple regex would do:
({[^{}]*?)(Tarzan)([^}]*})
The string is captured into three groups: the beginning, Tarzan, and the end.
If you're removing the bad strings before matching, your replacement would be \1\3 or $1$3 depending on your regex flavor.
If you're replacing the bad strings before replacing the good strings, your replacement would be \1T~a~r~z~a~n\3 or $1T~a~r~z~a~n$3.
If there are other contexts in which you want to avoid matching Tarzan, you probably have to repeat Step 1, as attempting to match all the bad strings in one big regex is fraught with risk.
Step 2: All the unwanted instances of Tarzan have been neutralized, so you can now match Tarzan without worrying about context.
I realize that matching Tarzan in a vacuum is not that interesting.
In real life you might be looking for Tarzan and the phone number that follows.
Optional Step 3: If the point of Step 2 was not only to match but also to perform a replacement on the acceptable Tarzan strings, then once that replacement is made we also need to turn all the T~a~r~z~a~n strings back into Tarzan, which is easily accomplished.
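Here is a rough Python sketch of the full three-step dance for the curly-brace case. The function names are hypothetical, and I've escaped the braces in the pattern to be safe. One detail the sketch makes visible: a single pass of the Step 1 regex neutralizes at most one Tarzan per pair of braces, so the sketch simply repeats the pass until nothing changes. It assumes the T~a~r~z~a~n marker never occurs in the input.

```python
import re

# Step 1 regex from the text: beginning, Tarzan, end (braces escaped).
BAD = re.compile(r'(\{[^{}]*?)(Tarzan)([^}]*\})')
MARK = 'T~a~r~z~a~n'  # distinctive stand-in, assumed absent from the input

def neutralize(text: str) -> str:
    """Step 1: turn every Tarzan inside braces into the marker."""
    while True:
        text, count = BAD.subn(r'\g<1>' + MARK + r'\g<3>', text)
        if count == 0:          # nothing left to neutralize
            return text

def replace_outside_braces(text: str, new: str) -> str:
    step1 = neutralize(text)
    step2 = step1.replace('Tarzan', new)   # Step 2: context no longer matters
    return step2.replace(MARK, 'Tarzan')   # Step 3: restore the bad strings

print(replace_outside_braces('Tarzan {Jane loved Tarzan and Tarzan} Tarzan', 'Cheetah'))
```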
Option 3 for Perl, PCRE, Ruby and Python: \K
This option works in Perl, PCRE (C, PHP, R, …), Ruby 1.9+ and Python's alternate regex engine.
In these engines, the \K token causes the engine to drop all it has matched up to the \K from the overall match it returns.
This opens a strategy for us: we can (i) match any unwanted content (if present) up to the beginning of a wanted Tarzan instance, (ii) throw away that portion of the match using \K, then (iii) match Tarzan.
This option could look like this:
(?:(?>{[^}]*?})[^{}]*?)*\KTarzan
Note that while we try to match unwanted content, we swallow entire sets of {strings in curly braces} without bothering to check if they contain Tarzan.
We do not need to care, because we know that if something is inside curly braces, we don't want it.
Compared with the other options we've seen so far, this is fairly economical.
But if you need to add conditions in which Tarzan cannot be matched, it can become very hard to manage.
Besides, it's still too much work compared with… (drum roll…)
((?<=")?)Tarzan(?(1)(?!"))
and
Tarzan(?!"(?:(?:[^"]*"){2})+[^"]*?(?:$|[\r\n]))
and
(?:"Tarzan".*?)*\KTarzan
Well, you'll now see how simple the problem becomes when you use the best regex trick ever:
"Tarzan"|(Tarzan)
Really? That's it?
Yes.
The trick is that we match what we don't want on the left side of the alternation (the |), then we capture what we do want on the right side.
When our programming language returns the results, we ignore the overall matches (that's the trash bin) and instead turn our whole attention to Group 1 matches, which contain what we were after.
Adding exclusions is a breeze
When there's another context we want to exclude, we simply add it as an alternation on the left, where we match it in order to neutralize it—if it's matched, it's in the trash.
For instance, if we also had to exclude Tarzan in Tarzania and --Tarzan--, our regex would become:
Tarzania|--Tarzan--|"Tarzan"|(Tarzan)
Adding exclusions is a breeze, isn't it?
Again, the only instances of Tarzan we care about will be those captured by Group 1.
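To show how mechanical this is, here is a small Python sketch that assembles the regex from a list of exclusions. The helper name build_trick is hypothetical, and the sketch assumes the exclusion patterns contain no capturing groups of their own (otherwise the wanted string would no longer land in Group 1).

```python
import re

def build_trick(exclusions, wanted):
    """Join the exclusions on the left, capture the wanted pattern on the right."""
    bad = '|'.join(f'(?:{e})' for e in exclusions)
    return re.compile(f'{bad}|({wanted})')

pattern = build_trick(['Tarzania', '--Tarzan--', '"Tarzan"'], 'Tarzan')

text = 'Tarzania --Tarzan-- "Tarzan" Tarzan'
starts = [m.start(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(starts)  # only the last, bare Tarzan is captured in Group 1
```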
"Tarzan"|(Tarzan)
against this string:
Now Tarzan says to Jane: "Tarzan".
1.
The engine's string reading head positions itself at the head of the string, before the "N" in "Now".
At this position, the engine attempts to match the entire pattern "Tarzan"|(Tarzan)
2.
At this position, the engine is unable to match the opening double quote in "Tarzan" because the next character is "N", so the left side of the alternation immediately fails.
The engine's pattern reading head then jumps to the right side of the alternation and tries to match the initial T in Tarzan, but fails, again because the next character in the string is "N".
3.
At this position in the string, the match has failed.
The string reading head advances one character in the string (positioning itself between the "N" and the "o" in "Now"), and the pattern reading head resets to the very left.
At this new position, the engine again attempts to match the entire pattern "Tarzan"|(Tarzan)
4.
At this position, the engine is unable to match the opening double quote in "Tarzan" because the next character is "o".
Likewise, the right side of the alternation fails because "T" is not "o".
5.
The string reading head again advances in the string and attempts two matches that fail, the first before the "w" in "Now", the second before the space character preceding "Tarzan".
The string reading head then advances in the string to the position preceding the T.
6.
The left side of the alternation fails because the next character is not a double quote.
The pattern reading head jumps to the right side of the alternation, and the engine is able to match the T.
The string reading head advances by one character, the pattern reading head advances by one token.
The engine is able to match a, then, as the reading heads continue to advance in parallel, the engine matches r, z, a and n.
The match succeeds, Tarzan is added to the list of matches, and since it was in parentheses it is also recorded as the Group 1 capture for this match.
7.
The string reading head advances to the position after the "n" in the initial Tarzan, and the pattern reading head resets to the very left.
At this position the engine starts a new match attempt, and fails.
The string reading head advances to each position in "says to Jane: ", and as it does so, at each position the engine attempts a new match, and fails.
The string reading head then advances to the position preceding the first double quote.
8.
At this position, before the opening double quote, the engine attempts to match a double quote and succeeds.
The string reading head advances by one character, the pattern reading head advances by one token.
The engine matches the T, and both reading heads keep advancing in parallel until all the characters in "Tarzan" have been matched.
9.
The match succeeds, "Tarzan" is added to the list of matches, but it is not captured in any capturing group as it was not surrounded by parentheses.
10.
The engine returns two matches: Tarzan and "Tarzan".
We don't pay attention to the matches, but for each match we look at capturing Group 1 using our programming language.
(You'll see code samples in several languages below.) For the first match, we have a non-empty capturing Group 1: Tarzan.
That is what we were after.
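You can reproduce the outcome of this walkthrough with a few lines of Python. This is just a sketch with the standard re module; the variable names are mine.

```python
import re

TRICK = re.compile(r'"Tarzan"|(Tarzan)')
subject = 'Now Tarzan says to Jane: "Tarzan".'

# For each overall match, record where it starts, what it matched,
# and what (if anything) landed in Group 1.
trace = [(m.start(), m.group(0), m.group(1)) for m in TRICK.finditer(subject)]

for start, overall, group1 in trace:
    print(start, repr(overall), repr(group1))
# 4 'Tarzan' 'Tarzan'
# 25 '"Tarzan"' None
```

The second match is the trash bin at work: the overall match exists, but Group 1 is unset, so we ignore it.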
NotThis|NotThat|GoAway|(WeWantThis)
This is a game of good cop / bad cop.
Bad string
As in any good cop / bad cop routine, the bad cop comes in first.
The idea is to use a series of alternations on the left to specify the contexts we want to exclude.
By doing so, we force the engine to match these "bad strings".
We won't even look at the overall matches—think of the set of overall matches as a garbage bin.
After matching a bad string, the engine attempts the next overall match starting at the string position that immediately follows the bad string.
In effect, that bad string has been skipped: this is how we manage to exclude unwanted context.
Good string
When the engine starts a match attempt at the beginning of a "good string", it can safely match it, because we know that if that string had been embedded in context we want to exclude… the engine would already have matched it and placed it in the garbage bin! Since we do match the good strings, they too go in the garbage bin.
The difference is that by using capturing parentheses when we match the good strings, we capture them into Group 1.
One or two lines of code
In our code, we'll only examine these Group 1 captures.
Examining Group 1 may take one or two more lines of code than examining "Group 0" (the overall matches), but that's a small price to pay for a regex that is crystal-clear and extremely easy to maintain.
The code samples lower in the page will show you how to use this technique in a variety of languages for the six most common regex tasks: (i) checking if there is a match, (ii) counting matches, (iii) retrieving the first match, (iv) retrieving all matches, (v) replacing, and (vi) splitting.
This is a simple but extremely potent regex technique, don't you think?
(GetThis)
is not so broad that it can swallow strings that contain bad strings—specifically, strings that start one or more characters before a bad string.
For instance, suppose you want to match all words that are not inside an <img> tag.
Let's apply our NotThis|(GetThis) recipe.
1.
Your NotThis rule could look like this: <img[^>]+>
2.
What about the GetThis rule? Don't use a dot-star, as on the right side of the alternation in <img[^>]+>|(.*)
Why not? The engine starts a match attempt at the beginning of the string.
First, it tries a < against the first character.
Say the first character is "S": the < fails to match.
The string reading head stays at the start of the string, but the pattern reading head now moves to the right side of the alternation.
The engine tries the .* … and the naughty dot-star swallows the "S" and the rest of the string, exclusions and all.
We are relying on the exclusion rules to remove unwanted context.
But on the GetThis side, you can't have an expression that swallows the same context you are trying to remove! That stands to reason, but it needs to be said—and seen.
I sometimes mess this up when building expressions fast, and it's good to be able to instantly spot what is going wrong.
Note that the problem only arises if the GetThis regex is able to match one or more characters before it matches a bad string.
That is because at a string position that precedes a bad string by one or more characters, the exclusion rule is not able to fire, and the engine switches over to the hungry GetThis.
On the other hand, it is perfectly acceptable for the GetThis expression to have the potential to match a bad string, as long as it only has that potential at the very start of a bad string.
Why? Because this potential never has a chance to come to fruition.
Since the exclusion regex patterns are on the left of the alternation, these patterns neutralize bad strings before the GetThis regex can ever get to them.
In our example, this regex would do the job: <img[^>]+>|(\w+)
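Seeing the failure once makes it easier to spot later. Below is a small Python sketch that runs both versions against a made-up string (the sample HTML and the helper name group1 are my own).

```python
import re

html = 'See <img src="cat.png"> here'  # hypothetical sample

GREEDY = re.compile(r'<img[^>]+>|(.*)')    # the GetThis side is too broad
WORDS = re.compile(r'<img[^>]+>|(\w+)')    # the GetThis side is narrow enough

def group1(pattern, text):
    """Non-empty Group 1 captures, in order."""
    return [m.group(1) for m in pattern.finditer(text) if m.group(1)]

print(group1(GREEDY, html))  # ['See <img src="cat.png"> here'] -- tag swallowed
print(group1(WORDS, html))   # ['See', 'here']
```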
Not_this_context|(WeWantThis)
Okay, first off, we know that (WeWantThis) is simply (Tarzan).
Now how can we express Not_this_context? The unwanted context is Tarzan inside curly braces.
Delightfully, for this, we use something as compact as {[^}]*}, and I'll explain why in a short moment.
This small expression simply matches the entire content of a pair of curly braces.
For this example, we're assuming that braces are {never {nested}}.
This gives us:
{[^}]*}|(Tarzan)
All we have to do is retrieve the matches from Group 1.
Too easy!!
Of course in real life we would probably not look just for the word Tarzan, but for some variable content, such as Tarzan\d+
Please skip it simple!
Please note this trick within a trick: to specify the exclusion rule, we did not bother to write a whole expression to match Tarzan inside curly braces, such as:
{[^}]*?Tarzan[^}]*}
Instead, we just matched the content of any curly braces:
{[^}]*}
Why? Because if something is inside curly braces, we know that we don't want anything to do with it.
So we can go ahead and skip all sets of curly braces without bothering to look inside!
This is what I call "skipping it simple".
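As a quick sanity check, here is a Python sketch comparing the detailed exclusion with the simple one on a made-up string. The sample text and helper name are mine, and braces are escaped in the patterns to be safe. On this sample both versions return the same Group 1 captures.

```python
import re

SPECIFIC = re.compile(r'\{[^}]*?Tarzan[^}]*\}|(Tarzan)')  # looks inside the braces
SIMPLE = re.compile(r'\{[^}]*\}|(Tarzan)')                # skips any braces

text = 'Tarzan {Jane loved Tarzan} {Cheetah} Tarzan'  # hypothetical sample

def starts(pattern, subject):
    """Start offsets of the Group 1 captures."""
    return [m.start(1) for m in pattern.finditer(subject) if m.group(1) is not None]

print(starts(SPECIFIC, text))
print(starts(SIMPLE, text))
```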
Now let's take it up a notch.
\bBEGIN\b.*?\bEND\b|Therefore.*?[.!?]|{[^}]*}|(Tarzan)
Step 2: clean up your inbox for a couple of hours before announcing to your boss that it was curly, but that by gawd… you've wrestled that regex to the ground!
So what have we done? We've just followed the recipe and added two exclusions to the original regex in alternations at the left.
The first exclusion, which could have been a simple BEGIN.*?END, matches any sequence starting with BEGIN and ending with END.
You've added the \b boundaries because you're nice and you want to give your boss a real END, not just any old WENDY.
The second exclusion swallows any string that starts with Therefore and ends with one of the three characters in the [.!?] character class, so chosen because your boss told you to assume that all sentences end with periods, question marks or bangs.
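Here is a short Python sketch of the three-exclusion regex at work. The sample sentence is invented for the occasion, and the braces are escaped to be safe. Note that without a DOTALL flag the dot does not cross line breaks, so in this sketch a BEGIN…END block must sit on one line.

```python
import re

PATTERN = re.compile(
    r'\bBEGIN\b.*?\bEND\b|Therefore.*?[.!?]|\{[^}]*\}|(Tarzan)'
)

# Hypothetical sample: one Tarzan per context, plus two free-standing ones.
text = ('Tarzan one. BEGIN Tarzan two END Therefore Tarzan three! '
        '{Tarzan four} Tarzan five')

starts = [m.start(1) for m in PATTERN.finditer(text) if m.group(1) is not None]
print(starts)  # offsets of the first and the last Tarzan only
```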
Okay, we're feeling great.
What's the next use of our golden technique?
Match X unless it is in contexts a, b and c.
Now let's look at a family of questions that sound quite different but reduce to the same:
Match every word except words a, b and c.
To start easy, let's try to match every word except Tarzan. Hey, that's simple:
\bTarzan\b|(\w+)
By the way, this is an interesting case because by itself, the \w+ would be able to match Tarzan.
However, it is never able to fire in that situation, because by the time we get to an instance of Tarzan, the exclusion rule has already matched it.
This is explained in more detail in the section about one small thing to look out for.
Note also that as it is, the regex will capture antiTarzan and Tarzania.
That's a feature, not a bug (see the \b boundaries).
Let's take it up a notch and talk about blacklists, a commonly requested regex task.
\bTarzan\b|\bJane\b|\bSuperman\b|(\w+)
or, more gracefully:
\b(?:Tarzan|Jane|Superman)\b|(\w+)
You can try it online with "Tarzan, Jane and Superman hopped from vine to vine." Remember that what we're looking at is the Group 1 matches, which are shown in the lower right-hand panel and highlighted differently from the plain matches.
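If you'd rather try the blacklist in code, here is a minimal Python sketch. The function name words_except is hypothetical; it escapes each banned word and builds the graceful version of the regex.

```python
import re

def words_except(blacklist, text):
    """Return every word in text except the blacklisted ones (Group 1 only)."""
    banned = '|'.join(re.escape(word) for word in blacklist)
    pattern = re.compile(rf'\b(?:{banned})\b|(\w+)')
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]

sentence = 'Tarzan, Jane and Superman hopped from vine to vine.'
kept = words_except(['Tarzan', 'Jane', 'Superman'], sentence)
print(kept)  # ['and', 'hopped', 'from', 'vine', 'to', 'vine']

# Thanks to the \b boundaries, longer words that merely contain a banned
# word are still captured.
print(words_except(['Tarzan'], 'antiTarzan Tarzania Tarzan'))
```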
Let's now talk about an application of the technique which, to untrained ears, sounds completely different:
I want to ignore A.
It's useful to notice that this wording is just a variation on
Match everything except A
Didn't we just see that one? We did. Even so, let's stay sharp by practicing one more time, using this assignment: ignore bolded content. Maybe you can convince your boss to reword this as "match all content except anything in bold". By "in bold", let's say we're talking about content within <b> tags. And by "content", let's say we're talking about sequences of word and whitespace characters. Using our recipe, we can translate the assignment like so:
<b>[^<]*</b>|([\w\s]+)
As a reminder (see the lookout section for details), it would not do to use a (.*) in the GetThis section, because at any point in the string prior to a bolded section, the exclusion rule would fail, while the naughty dot-star would swallow the entire string from that point to the end, including any bolded sections.
In that case, how about the lazy quantifier (.*?), you might wonder? You could do that, but make sure to see the section explaining why lazy quantifiers are expensive on the Mastering Quantifiers page.
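Here is the "ignore bolded content" assignment as a small Python sketch. The sample markup is made up, and the sketch assumes, as the regex does, that bolded sections contain no nested tags.

```python
import re

NOT_BOLD = re.compile(r'<b>[^<]*</b>|([\w\s]+)')

html = 'plain <b>bold text</b> more'  # hypothetical sample

kept = [m.group(1) for m in NOT_BOLD.finditer(html) if m.group(1) is not None]
print(kept)  # ['plain ', ' more']
```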
NotThis|NotThat|GoAway|(WeWantThis)
We use:
(KeepThis|KeepThat|KeepTheOther)|DeleteThis
As you can see, the location of the parentheses has been inverted.
We can now replace the match with Group 1.
There are two cases:
- If the match took place on the left branch of the alternation, and therefore captured to Group 1, the match is replaced with itself (no change);
- If the match took place on the right side of the alternation, the match is replaced with Group 1, which is empty: it is therefore deleted.
Here is an interesting variation to do the same:
(KeepThis)|(KeepThat)|(KeepTheOther)|DeleteThis
For the replacement, we concatenate Groups 1, 2 and 3 (in any order).
Since only one of those groups is ever captured (if any), the other two groups contain empty strings.
Once again, the match is replaced with itself (if captured) or with an empty string.
There is no standard for replacement syntax, so depending on the language this may look like \1\2\3, $1$2$3 or m.group(1) + m.group(2) + m.group(3).
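Both deletion variants can be sketched in Python as follows. The keep and delete patterns here are hypothetical stand-ins for KeepThis and DeleteThis: we delete Tarzan everywhere except in the contexts we want to keep. The replacement is written as a function so that an unset group is explicitly turned into an empty string.

```python
import re

# Variant 1: one capturing group wraps everything we keep.
KEEP_QUOTED = re.compile(r'("Tarzan")|Tarzan')

def delete_unquoted(text: str) -> str:
    # Replace each match with Group 1: itself if kept, empty if deleted.
    return KEEP_QUOTED.sub(lambda m: m.group(1) or '', text)

# Variant 2: one capturing group per kept context, glued together.
KEEP_MANY = re.compile(r'("Tarzan")|(\{Tarzan\})|Tarzan')

def delete_elsewhere(text: str) -> str:
    return KEEP_MANY.sub(lambda m: (m.group(1) or '') + (m.group(2) or ''), text)

print(delete_unquoted('Tarzan said "Tarzan" to Tarzan'))
print(delete_elsewhere('Tarzan "Tarzan" {Tarzan}'))
```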
Not_X|(GetThis)
Using Perl, PCRE (PHP, R, C…) or Python's alternate regex engine, we can accomplish the same with either of these:
Not_A(*SKIP)(*FAIL)|GetThis
Not_A(*SKIP)(*F)|GetThis
Not_A(*SKIP)(?!)|GetThis
Note that the parentheses around GetThis have disappeared.
Whenever the engine is able to match Not_A, the (*SKIP)(*FAIL) construct causes it to reject that entire chunk of text and start the next match attempt immediately afterwards.
Whenever the engine is not able to match Not_A, it jumps to the right branch of the alternation | and tries to match GetThis.
If this fails, the engine starts the next match attempt at the next starting position in the subject text, as always.
If we want to avoid three contexts A, B and C, our technique used to do this:
Not_A|Not_B|Not_C|(GetThis)
In Perl and PHP, we can instead say something like one of these:
Not_A(*SKIP)(*FAIL)|Not_B(*SKIP)(*F)|Not_C(*SKIP)(?!)|GetThis
(?:Not_A|Not_B|Not_C)(*SKIP)(*FAIL)|(GetThis)
\d
without attempting to distinguish between ASCII digits and Unicode digits, as that is not the point of the exercise.
Just be aware that in some engines \d only matches the ASCII digits 0 to 9, while in others it also matches digits in other alphabets.
If you want to be consistent, use [0-9].
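Python 3 is one of the engines where the difference shows. A tiny sketch, using the Arabic-Indic digit three as a test character (my choice of example):

```python
import re

arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE

# With a str pattern, \d matches any Unicode decimal digit by default.
unicode_digit = re.fullmatch(r'\d', arabic_three) is not None

# The re.ASCII flag restricts \d to 0-9.
ascii_flag = re.fullmatch(r'\d', arabic_three, flags=re.ASCII) is not None

# An explicit class behaves the same regardless of flags.
explicit_class = re.fullmatch(r'[0-9]', arabic_three) is not None

print(unicode_digit, ascii_flag, explicit_class)  # True False False
```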
The Test Strings
To test the code, we'll use one string that produces two matches and a small variation that should produce none.
1.
The string below should produce two matches: Tarzan11 and Tarzan22
Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}
2.
To test failure cases, I suggest you capitalize two z characters as in the string below, which should produce no matches:
Jane" "Tarzan12" TarZan11@TarZan22 {4 Tarzan34}
The Regex
Here is the regex we'll use:
{[^}]+}|"Tarzan\d+"|(Tarzan\d+)
1.
The first part of the alternation {[^}]+} matches and neutralizes any content between curly braces.
2.
The second part of the alternation "Tarzan\d+" matches and neutralizes instances where the sought string is embedded within double quotes.
You may ask why I didn't simply neutralize any content between double quotes in similar fashion to the first part of the alternation, using "[^"]+".
For most strings, that would have worked, but if you carefully inspect the test string, you'll see that I sneaked in an extra double quote after Jane.
I did so to illustrate a safe regex work practice.
See, if for any reason the subject string has an odd number of double quotes as is the case here, you cannot be sure that two quotes matched by "[^"]+" belong together.
Indeed, for our test string, that code would match a single space within double quotes, and the regex would (wrongly) capture Tarzan12 into Group 1.
Therefore, when working with quotes, being specific as in "Tarzan\d+" is safer.
In the case of braces (where there are distinct characters for the left and right sides), the risk of mismatches is far lower.
3.
The third part of the alternation (Tarzan\d+) matches Tarzan and the following digits and captures the match into Group 1.
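As one possible illustration, here is a compact Python sketch that runs this regex against both test strings and covers the six common tasks. The helper names are hypothetical, and the braces are escaped to be safe. The last line also tries the looser "[^"]+" exclusion to show the stray-quote problem described above.

```python
import re

REGEX = re.compile(r'\{[^}]+\}|"Tarzan\d+"|(Tarzan\d+)')
LOOSE = re.compile(r'\{[^}]+\}|"[^"]+"|(Tarzan\d+)')  # the riskier quote rule

TWO_MATCHES = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}'
NO_MATCHES = 'Jane" "Tarzan12" TarZan11@TarZan22 {4 Tarzan34}'

def hits(pattern, subject):
    """All Group 1 captures; overall matches without Group 1 are trash."""
    return [m.group(1) for m in pattern.finditer(subject) if m.group(1) is not None]

def replace_hits(pattern, subject, new):
    """Replace Group 1 matches; put every other match back unchanged."""
    return pattern.sub(lambda m: new if m.group(1) is not None else m.group(0), subject)

def split_on_hits(pattern, subject):
    """Split the subject wherever Group 1 matched."""
    pieces, last = [], 0
    for m in pattern.finditer(subject):
        if m.group(1) is not None:
            pieces.append(subject[last:m.start()])
            last = m.end()
    pieces.append(subject[last:])
    return pieces

found = hits(REGEX, TWO_MATCHES)
print(bool(found))                                   # (i) is there a match?
print(len(found))                                    # (ii) how many?
print(found[0] if found else None)                   # (iii) first match
print(found)                                         # (iv) all matches
print(replace_hits(REGEX, TWO_MATCHES, 'Superman'))  # (v) replace
print(split_on_hits(REGEX, TWO_MATCHES))             # (vi) split
print(hits(REGEX, NO_MATCHES))                       # failure case: []
print(hits(LOOSE, TWO_MATCHES))                      # wrongly includes Tarzan12
```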
Here are jump points to code samples in various languages.
Implemented
PHP
C#
Python
Java
JavaScript
Ruby
Perl
VB.NET
Not Yet Implemented
Visual C++
Scala
Other language of your choice
(?s)\bblue\b(?=.*(bleu))|\bred\b(?=.*(rouge))
What do we replace our matches with? In the blue match case, the replacement is captured to Group 1.
In the red case, it is captured to Group 2.
When one group is set, the other is empty, so gluing them together with \1\2 just results in the one that is set:
bleu + "" yields bleu
"" + rouge yields rouge
Following this principle, if we had five replacements, our replacement string would be \1\2\3\4\5
Here's an online demo.
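In Python, the same idea might look like the sketch below. The sample text, with its little dictionary on the last line, is invented. Instead of a \1\2 replacement string, the sketch glues the groups together in a function so that the unset group is explicitly treated as an empty string.

```python
import re

TRANSLATE = re.compile(r'(?s)\bblue\b(?=.*(bleu))|\bred\b(?=.*(rouge))')

# Hypothetical sample: the translations sit at the bottom of the text.
text = 'The blue car and the red bike.\nbleu rouge'

def glue(m):
    # Only one of the two groups is set for any given match.
    return (m.group(1) or '') + (m.group(2) or '')

translated = TRANSLATE.sub(glue, text)
print(translated)
```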
Variation: branch reset
In regex flavors that support the (?|...) branch reset syntax, you can capture the replacements to a unique group, so the replacement string becomes a simple \1
In the regex, you just need to wrap the alternation in a branch reset:
(?sx)
(?| # branch reset: both captures go to Group 1
\bblue\b(?=.*(bleu))
|
\bred\b(?=.*(rouge))
)
Even if you have five replacements, the replacement string will still be \1.
Here's an online demo.
(?s)\b(blue|red)\b(?=.*:\1=(\w+)\b)
This matches either color, then looks further in the file for a dictionary entry of the form :original=translation, capturing the translation to Group 2.
Our replacement is therefore \2 (here's a demo).
Of course if there's a chance that the actual text would contain segments that look like dictionary entries, the regex would have to be refined.
Variation when matches are dense (full translation)
In the previous pattern, we specifically look for the literals blue and red because we do not want to give the engine the burden of looking up every word in the dictionary.
However, when nearly every word in your file is a match to be translated, including every word in the regex becomes burdensome.
Instead, we can simplify the regex by just matching any word:
(?s)\b(\w+)\b(?=.*:\1=(\w+)\b)
The replacement is still \2.
Here's a demo.
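For readers who want to try the two dictionary patterns above in code, here is a small Python sketch. The sample text and its :original=translation entries are made up. Words with no dictionary entry are left alone, because the lookahead fails for them.

```python
import re

LISTED = re.compile(r'(?s)\b(blue|red)\b(?=.*:\1=(\w+)\b)')  # named colors only
DENSE = re.compile(r'(?s)\b(\w+)\b(?=.*:\1=(\w+)\b)')        # any word

# Hypothetical sample: text first, dictionary entries on the last line.
text = 'A blue sky, a red sun.\n:blue=bleu :red=rouge'

print(LISTED.sub(r'\2', text))
print(DENSE.sub(r'\2', text))
```

On this sample both patterns give the same output. The dense version just makes the engine attempt a lookup for every word.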
Version | Date | Change |
---|---|---|
10.36 | 4 Dec 2020 | Added CET_CFLAGS option for Intel CET |
10.35 | 9 May 2020 | Added PCRE2_SUBSTITUTE_LITERAL option to turn off the interpretation of the replacement string |
10.35 | 9 May 2020 | Added PCRE2_SUBSTITUTE_MATCHED option |
10.35 | 9 May 2020 | Added PCRE2_SUBSTITUTE_REPLACEMENT_ONLY option |
10.35 | 9 May 2020 | Added (?* and (?<* as synonyms for (*napla: and (*naplb: to match another regex engine |
10.34 | 21 Nov 2019 | Added non-atomic positive lookaround via (*non_atomic_positive_lookahead:…) or (*napla:…), (*non_atomic_positive_lookbehind:…) or (*naplb:…) |
10.34 | 21 Nov 2019 | (*ACCEPT) can now be quantified: an ungreedy quantifier with a zero minimum is potentially useful |
10.34 | 21 Nov 2019 | Add pcre2_get_match_data_size() to the API |
10.34 | 21 Nov 2019 | Add pcre2_maketables_free() to the API |
10.33 | 16 Apr 2019 | Added Perl "script run" features (*script_run:…) a.k.a (*sr:…), and (*atomic_script_run:…) a.k.a (*asr:…) |
10.33 | 16 Apr 2019 | Added Perl 5.28 experimental alphabetic names for atomic groups and lookaround assertions, such as (*pla:…) and (*atomic:…) |
10.33 | 16 Apr 2019 | Added PCRE2_EXTRA_ESCAPED_CR_IS_LF option |
10.33 | 16 Apr 2019 | Added PCRE2_COPY_MATCHED_SUBJECT option |
10.33 | 16 Apr 2019 | Added PCRE2_EXTRA_ALT_BSUX option to support ECMAScript 6 \u{hhh} construct |
10.33 | 16 Apr 2019 | In DOTALL mode, \p{Any} is now the same as . |
10.32 | 10 Sep 2018 | Added (?^), which unsets all imnsx options |
10.32 | 10 Sep 2018 | (*ACCEPT:ARG), (*FAIL:ARG), and (*COMMIT:ARG) are now supported. |
10.30 | 14 Aug 2017 | Added the PCRE2_LITERAL option, telling the compiler to treat the entire pattern as a literal string, including what would normally be metacharacters |
10.30 | 14 Aug 2017 | Added the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL option, telling the compiler to treat an escaped character which isn't a proper token (such as \j) as a literal (in this case the letter j) rather than an error |
10.30 | 14 Aug 2017 | Added the PCRE2_NEWLINE_NUL option, which adds the NUL character (binary zero) to the list of characters which can be set as those to be recognized as new lines, set using pcre2_set_newline() |
10.30 | 14 Aug 2017 | Added the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option, giving finer control over the treatment of Unicode surrogate code points |
10.30 | 14 Aug 2017 | Added the (?n) inline option to disable auto-capture, in the same way as the PCRE2_NO_AUTO_CAPTURE API option |
10.30 | 14 Aug 2017 | Added the (?xx) inline option and the PCRE2_EXTENDED_MORE API option to ignore all unescaped whitespace, including in a character class |
10.30 | 14 Aug 2017 | Added the PCRE2_ENDANCHORED option, telling the engine that the pattern can only match at the end of the subject |
10.30 | 14 Aug 2017 | Added pcre2_pattern_convert() to the API, an experimental foreign pattern conversion function |
10.30 | 14 Aug 2017 | Added pcre2_code_copy_with_tables() to the API |
10.23 | 14 Feb 2017 | Allow backreferences in lookbehind so long as group names or numbers are unambiguous |
10.23 | 14 Feb 2017 | Added forward relative back-reference syntax: \g{+2} (mirroring the existing \g{-2}) |
10.22 | 29 Jul 2016 | Added pcre2_code_copy() to the API |
10.21 | 12 Jan 2016 | Added the PCRE2_SUBSTITUTE_EXTENDED option to enhance replacement syntax |
10.21 | 12 Jan 2016 | Added the ${*MARK} facility to pcre2_substitute() |
10.21 | 12 Jan 2016 | Added the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option to tweak what happens during replacements when the output buffer is too small |
10.21 | 12 Jan 2016 | Added the PCRE2_SUBSTITUTE_UNKNOWN_UNSET and PCRE2_SUBSTITUTE_UNSET_EMPTY options to fine-tune how empty capture groups are treated in replacements |
10.21 | 12 Jan 2016 | Added the PCRE2_ALT_VERBNAMES option to subtly modify marked names that can be used with backtracking control verbs |
10.21 | 12 Jan 2016 | Added pcre2_set_max_pattern_length() to the API, allowing programs to restrict the size of patterns they are prepared to handle |
10.20 | 30 Jun 2015 | Added the PCRE2_ALT_CIRCUMFLEX option to allow ^ to assert position after any newline including a terminating newline |
10.20 | 30 Jun 2015 | Added the PCRE2_NEVER_BACKSLASH_C option to disable \C |
10.20 | 30 Jun 2015 | pcre2_callout_enumerate was added to the API |
10.10 | 6 Mar 2015 | Serialization functions were added to the API |
10.0 | 5 Jan 2015 | Version check available via patterns such as (?(VERSION>=x)…) |
10.0 | 5 Jan 2015 | (*NO_DOTSTAR_ANCHOR) tells the engine not to automatically anchor patterns that start with .* |
10.0 | 5 Jan 2015 | (*NOTEMPTY) and (*NOTEMPTY_ATSTART) tell the engine not to return empty matches |
10.0 | 5 Jan 2015 | By default, PCRE2 builds with Unicode support |
10.0 | 5 Jan 2015 | Name switch to PCRE2 and new API, which includes a replacement function: pcre2_substitute() |
*** | *** | *** |
8.41 | 5 Jul 2017 | Inline comments can now be inserted between ++ and +? quantifiers, as in a+(?# make it possessive)+ or a+(?# up to b)?b |
8.34 | 15 Dec 2014 | Added support for the POSIX word boundaries [[:<:]] and [[:>:]], which are converted to \b(?=\w) and \b(?<=\w) internally |
8.34 | 15 Dec 2014 | Added \o{…} to specify code points in octal |
8.33 | 28 May 2014 | Added \p{Xuc} (PCRE-specific) to match characters that can be expressed using Universal Character Names |
8.10 | 25 Jun 2010 | Added PCRE-specific Unicode properties: \p{Xan} (alphanumeric), \p{Xsp} (Perl space), \p{Xps} (POSIX space) and \p{Xwd} (word) |
8.10 | 25 Jun 2010 | Added support for (*MARK:ARG) and for ARG additions to PRUNE, SKIP, and THEN |
8.10 | 25 Jun 2010 | Added \N (any character that is not a line break) |
8.10 | 25 Jun 2010 | Added the (*UCP) start of pattern modifier, which affects \b, \d, \s and \w |
7.90 | 11 Apr 2009 | Added the (*UTF8) start of pattern modifier |
7.70 | 7 May 2008 | Added Ruby-style subroutine call syntax: \g<2>, \g'name', \g'2' |
7.30 | 28 Aug 2007 | Added backtracking control verbs (*FAIL), (*F), (*SKIP), (*PRUNE), (*THEN), (*COMMIT), (*ACCEPT) |
7.30 | 28 Aug 2007 | Added the (*CR) start of pattern modifier |
7.20 | 19 Jun 2007 | Added (?-2) and (?+2) syntax for relative subroutine calls |
7.20 | 19 Jun 2007 | Added (?(-2)…) and (?(+2)…) conditional syntax to check if a relative capture group has been set |
7.20 | 19 Jun 2007 | Added \K to drop what has been matched so far from the match to be returned |
7.20 | 19 Jun 2007 | Added named back-reference synonyms: \k{foo} and \g{foo} |
7.20 | 19 Jun 2007 | Added |
7.20 | 19 Jun 2007 | Added \h and \v (and their counterclasses \H and \V) to match horizontal and vertical whitespace |
7.00 | 19 Dec 2006 | Added \R to match any Unicode newline sequence |
7.00 | 19 Dec 2006 | Added named group synonyms (?<foo>…) and (?'foo'…) |
7.00 | 19 Dec 2006 | Added named subroutine call synonym (?&foo) |
7.00 | 19 Dec 2006 | Added named back-reference synonyms \k<foo> and \k'foo' |
7.00 | 19 Dec 2006 | Added named conditional synonyms (?(<foo>)…), (?('foo')…) and (?(foo)…) |
7.00 | 19 Dec 2006 | Added |
7.00 | 19 Dec 2006 | Added conditional syntax to check if a subroutine or recursion level has been reached: (?(R2)…), (?(R&foo)…) and (?(R)…) |
7.00 | 19 Dec 2006 | Added \g2 and \g{-2} for relative back-references |
6.70 | 4 Jul 2006 | Added named groups in conditionals: (?(foo)…) |
6.50 | 1 Feb 2006 | Added support for Unicode script names via \p{Arabic} |
6.00 | 7 Jun 2005 | Added pcre_dfa_exec() to the API |
6.00 | 7 Jun 2005 | Added pcre_refcount() to the API |
6.00 | 7 Jun 2005 | Added pcre_compile2() to the API |
5.00 | 13 Sep 2004 | Added support for Unicode categories such as \p{L} and negated Unicode categories such as \P{Nd} |
5.00 | 13 Sep 2004 | Added \X Unicode grapheme token |
4.00 | 17 Feb 2003 | Added [:blank:] to match ASCII space character and tab |
4.00 | 17 Feb 2003 | Added \Q…\E escape sequence |
4.00 | 17 Feb 2003 | Added possessive quantifiers: ?+, *+, ++ and {…,…}+ |
4.00 | 17 Feb 2003 | Added \C to match a single byte, even in UTF-8 mode |
4.00 | 17 Feb 2003 | Added the \G continuation anchor |
4.00 | 17 Feb 2003 | Added callouts (?C), (?C2) etc. which can be used in C but not PHP |
4.00 | 17 Feb 2003 | Added , and subroutine calls (?P>foo) |
3.30 | 1 Aug 2000 | Added pcre_free_substring() and pcre_free_substring_list() to the API |
3.00 | 1 Feb 2000 | Added recursion (?R) |
3.00 | 1 Feb 2000 | Added POSIX classes such as [:alpha:] |
3.00 | 1 Feb 2000 | Added pcre_fullinfo() to the API |
2.00 | 24 Sep 1998 | Atomic groups (?>) can now be quantified |
2.00 | 24 Sep 1998 | Added positive lookbehind (?<=…) |
2.00 | 24 Sep 1998 | Added negative lookbehind (?<!…)
2.00 | 24 Sep 1998 | Added non-capturing groups with inline modifiers (?imsx-imsx:) |
2.00 | 24 Sep 1998 | Added unsetting of inline modifiers: (?-imsx) |
2.00 | 24 Sep 1998 | Added conditional pattern matching (?(cond)re|re) |
1.08 | 27 Mar 1998 | Added PCRE_UNGREEDY to invert the greediness of quantifiers
1.08 | 27 Mar 1998 | Added the (?U) inline modifier to turn on ungreedy mode
1.08 | 27 Mar 1998 | Added the (?X) inline modifier to turn on extras mode
0.99 | 27 Oct 1997 | Added atomic groups (?>…) |
0.96 | 16 Oct 1997 | Added DOTALL mode, including inline modifier (?s) |
0.93 | 15 Sep 1997 | Added pcre_study() to the API |
0.92 | 11 Sep 1997 | Added multiline mode via inline modifier (?m) and PCRE_MULTILINE |
0.92 | 11 Sep 1997 | Added pcre_info() to the API (removed in 8.30) |
if ('abc' =~ /\w+(?{print "$&\n";})(*F)/) {}
The first thing to notice is that the =~ operator (which stands for matches) does the heavy lifting performed by a match function in other languages.
So the regular expression is not an argument in a function—it is specified directly on the right side of the =~ operator, between the / delimiters.
How compact!
Forget the (?{print "$&\n";}) fragment for a moment.
The regex pattern itself is no more than \w+(*F): match some word characters, then fail to match the (*F) token (the forced-failure token, which never matches), causing the engine to backtrack and gradually give up word characters while looking for another way to match.
The magic is that each time the engine passes the \w+, before failing, it reads a capsule that contains a small piece of injected Perl code:
(?{print "$&\n";})
The code itself is inside the braces: a single print statement print "$&\n"; that outputs the current match (it helps to know that $& is a special variable that contains the match, just as $1 contains the content of capture group 1).
As a result, the program prints the list of temporary matches at each point where the engine finishes matching \w+, corresponding to a full path exploration:
abc
ab
a
bc
b
c
And if that doesn't make you in awe of Perl regular expressions… Maybe nothing will.
Please note that via the (?C…) callout syntax, PCRE aims to provide similar functionality to Perl's "code capsules".
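For readers without Perl at hand, the order of that path exploration can be imitated in plain Python. This is only a sketch under an assumption: Python's re module has neither code capsules nor (*F), so the hypothetical helper backtracking_path below simulates what the engine does (start at each position, let \w+ grab as much as it can, then give characters back one at a time) rather than hooking into a real engine.

```python
import re

WORD_RUN = re.compile(r"\w+")

def backtracking_path(subject):
    """Simulate the states a greedy \\w+ goes through when a forced failure follows it."""
    states = []
    for start in range(len(subject)):
        # Greedy attempt anchored at this start position.
        longest = WORD_RUN.match(subject, start)
        if longest is None:
            continue  # no word character here, so the engine just moves on
        # The forced failure makes \w+ give back one character at a time.
        for end in range(longest.end(), start, -1):
            states.append(subject[start:end])
    return states

print(backtracking_path("abc"))  # ['abc', 'ab', 'a', 'bc', 'b', 'c']
```

The list comes out in the same order as the Perl output above, which is the point: every start position is tried, and at each one the quantifier shrinks step by step.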
(?<=\d+\w+)
—extremely convenient when you need to check context.
If you are writing code, the only other engine to offer this feature is Matthew Barnett's experimental regex module for Python.
Jan Goyvaerts' proprietary JGSoft flavor (EditPad, RegexBuddy, PowerGrep) also supports infinite-width lookbehind, but only Jan is allowed to write code with it.
Capture groups that can be quantified.
This means that if you write (\w+\s)+, the engine will return not just one Group 1 capture, but a whole array of them.
This has terrific value for parsing strings with an arbitrary number of tokens.
Character class subtraction.
This allows you to write [a-z0-9-[mp3]], which means you shouldn't be listening to loud music while writing regex.
Err… sorry, I meant, this means you can match all lowercase English letters and digits except the characters m, p and 3.
Optional right-to-left matching.
I'll soon add a trick to demonstrate a situation where this could be handy.
Bear in mind that in other languages, a workaround would be to reverse the string before matching, then to reverse the results.
(?n) modifier (also accessible via the ExplicitCapture option).
This turns all (parentheses) into non-capture groups.
To capture, you must use named groups.
(\d+):(?1)
.
This is a feature I miss.
By extension, neither can you write something like (?(DEFINE)(?<digits>\d+))(?&digits):(?&digits)
.NET regex does not have the \K "keep out" feature.
However, in Perl and PCRE, \K is only a convenience to (partially) make up for the lack of infinite-width lookbehinds.
No possessive quantifiers as in \w++.
I know, this is only a shorthand notation for the atomic group (?>\w+)… But it is much tidier.
No branch reset.
.NET does not allow (?|(cats) and dogs|(pigs) and whistles), where Group 1 can be defined in multiple places in the pattern.
However, .NET lets you achieve the same by recycling a named group: (?<pets>cats) and dogs|(?<pets>pigs) and whistles
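Engines without character class subtraction can usually get the same effect by placing a negative lookahead in front of the class. A minimal Python sketch of that workaround (the sample string is made up for illustration):

```python
import re

# Emulates .NET's [a-z0-9-[mp3]]: a lowercase letter or digit that is not m, p or 3.
SUBTRACTED = re.compile(r"(?![mp3])[a-z0-9]")

kept = SUBTRACTED.findall("mp3 player 2013")
print(kept)  # ['l', 'a', 'y', 'e', 'r', '2', '0', '1']
```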
using System.Text.RegularExpressions;
2. Use Verbatim Strings
Regex patterns are full of backslashes.
In a normal string, you have to escape them, which prevents you from pasting patterns straight from a regex tool.
To get around this problem, use C#'s verbatim strings, whose characters lose any special significance for the compiler.
To make a verbatim string, precede your string with an @ character, like so:
string myPattern = @"Score: \w+: \d+";
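Python faces the same backslash problem and solves it with raw strings (an r prefix), which play the role of C#'s verbatim strings. A small sketch for comparison, using a made-up subject string:

```python
import re

raw_pattern = r"Score: \w+: \d+"        # raw string: backslashes are kept as typed
escaped_pattern = "Score: \\w+: \\d+"   # normal string: each backslash must be doubled

# Both spellings produce exactly the same pattern text.
same_text = raw_pattern == escaped_pattern

match = re.search(raw_pattern, "Final Score: Alice: 42")
print(same_text, match.group(0))  # True Score: Alice: 42
```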
Verbatim strings can span multiple lines.
This is useful for your regex subjects as well as regex patterns that use free-spacing mode.
For instance:
string mySubject = @"Arizona, AZ 100
California, CA 122
South Dakota, SD 33
";
string myPattern = @"(?xm) # free-spacing mode
^([\w\s]+),\s # State
([A-Z]{2}\s) # State abbreviation
(\d+) # Value of a dollar, in cents
";
3. Watch Out for \w and \d
By default, .NET RegularExpressions classes are Unicode-aware (.NET strings are stored as UTF-16, not UTF-8).
The regex tokens \w, \d and \s behave accordingly, matching any Unicode character that is a valid word character, digit or whitespace character in any language.
This means that by default,
\d+ will match 123
\w+ will match abcddられま
\s+ will match all kinds of strange whitespace characters you've never dreamed of.
If all you wanted was English digits for \d, English letters, English digits and underscore for \w and whitespace characters you can understand for \s, then you need to set the ECMAScript option.
Here's how to do it:
var r2 = new Regex(@"\d+", RegexOptions.ECMAScript);
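The same trap exists outside .NET. As a point of comparison, Python's re module is also Unicode-aware for text patterns, and its re.ASCII flag narrows \d, \w and \s in roughly the way the ECMAScript option does here. A short sketch (the sample strings are illustrative):

```python
import re

digits = "123\u0664\u0665\u0666"   # ASCII digits followed by Arabic-Indic digits 4, 5, 6
words = "abcddられま"

unicode_digits = re.findall(r"\d+", digits)            # Unicode-aware by default
ascii_digits = re.findall(r"\d+", digits, re.ASCII)    # \d restricted to 0-9

unicode_words = re.findall(r"\w+", words)
ascii_words = re.findall(r"\w+", words, re.ASCII)      # \w restricted to [A-Za-z0-9_]

print(ascii_digits, ascii_words)  # ['123'] ['abcdd']
```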
(\d+:?)+
, the regex engine doesn't create multiple capture groups for you.
Instead, the capture group returns the string that was captured last.
For instance, if we used the above regex on the string 111:22:33, the engine would match the whole string, and capture Group 1 would be reported as 33.
Well, with .NET regex, all of that changes.
If you just ask, C# will still report that Group 1 is 33.
But if you dig deeper into Group 1, C# will also return a collection of captures with all the values that Group 1 captured in succession because of the + quantifier.
This feature is a game changer, because it lets you easily parse strings with an unknown number of tokens.
For instance, consider a file with a series of word translations for a number of languages, like so:
Italian:one=uno,two=due German:one=ein,two=zwei,three=drei,four=vier Japanese:one=ichi,two=ni,three=san
For each language, you would like to parse the English word (e.g. "two") and its translation (e.g. "zwei") into variables.
If you had the same number of definitions for each language, you could accomplish this with a fixed number of capture groups.
But, as you can see, Italian has two definitions, German has four, Japanese has three.
For normal regex, the task is complex because you cannot create capture groups on the fly.
With .NET, you have a simple solution.
Consider a regex that matches each language at a time.
It could look like this:
\w+:(?:(\w+)=(\w+),?)+
The \w+: corresponds to the language (e.g. Italian:).
Inside of the non-capturing parentheses, we define a dictionary pair, capturing the English word to Group 1 and the translation to Group 2.
The ,? is just an optional comma (there is no comma after each language's last entry).
So far, this is all normal.
The odd thing here is the + quantifier that repeats the expression for a dictionary pair.
What happens to the capture Groups?
In normal regex, if we had just matched the Italian entries, Group 1 and Group 2 would correspond to the last dictionary pair captured for that match, i.e. two for Group 1 and due for Group 2.
In .NET, Group 1 and Group 2 are objects.
Their Value property is the same as in other regex flavors, i.e. the last dictionary pair captured for the current match.
The twist is that each Group has a member called Captures, which is an object that contains all the captures that were made for that group during the match.
Therefore for the first match (the Italian entries), Group 1's Captures member will contain two captures, whose values are "one" and "two".
The code below uses this example and shows you exactly how to implement the feature.
Before you examine the code, have a look at the output, which explains how the groups work.
using System;
using System.Text.RegularExpressions;
class Program {
    static void Main() {
        string ourString = @"Italian:one=uno,two=due Japanese:one=ichi,two=ni,three=san";
        string ourPattern = @"\w+:(?:(\w+)=(\w+),?)+";
        var ourRegex = new Regex(ourPattern);
        MatchCollection AllMatches = ourRegex.Matches(ourString);
        Console.WriteLine("**** Understanding .NET Capture Groups with Quantifiers ****");
        Console.WriteLine("\nOur string today is:" + ourString);
        Console.WriteLine("Our regex today is:" + ourPattern);
        Console.WriteLine("Number of Matches: " + AllMatches.Count);
        Console.WriteLine("\n*** Let's Iterate Through the Matches ***");
        int matchNum = 1;
        foreach (Match SomeMatch in AllMatches) {
            Console.WriteLine("Match number: " + matchNum++);
            Console.WriteLine("Overall Match: " + SomeMatch.Value);
            Console.WriteLine("\nNumber of Groups: " + SomeMatch.Groups.Count);
            Console.WriteLine("Why three Groups, not two? Because the overall match is Group 0");
            // Another way of printing the overall match
            Console.WriteLine("Group 0: " + SomeMatch.Groups[0].Value);
            Console.WriteLine("Since Groups 1 and 2 have quantifiers, the value of Group 1 and Group 2 is the last capture of each group");
            Console.WriteLine("Group 1: " + SomeMatch.Groups[1].Value);
            Console.WriteLine("Group 2: " + SomeMatch.Groups[2].Value);
            // For this match, let's look at all the Group 1 captures manually
            int g1capCount = SomeMatch.Groups[1].Captures.Count;
            Console.WriteLine("Number of Group 1 Captures: " + g1capCount);
            Console.WriteLine("Group 1 Capture 0: " + SomeMatch.Groups[1].Captures[0].Value);
            Console.WriteLine("Group 1 Capture 1: " + SomeMatch.Groups[1].Captures[1].Value);
            // To be safe, we'll check if we have a third capture for Group 1
            // Because the first overall match only has two captures
            if (g1capCount > 2) Console.WriteLine("Group 1 Capture 2: " + SomeMatch.Groups[1].Captures[2].Value);
            // Let's look at Group 2 captures automatically
            int g2capCount = SomeMatch.Groups[2].Captures.Count;
            Console.WriteLine("Number of Group 2 Captures: " + g2capCount);
            int i2 = 0;
            foreach (Capture g2capture in SomeMatch.Groups[2].Captures) {
                Console.WriteLine("Group 2 Capture " + i2 + ": " + g2capture.Value);
                i2++;
            } // end iterate G2 captures
            Console.WriteLine("\n");
        } // end iterate matches
        Console.WriteLine("\nPress Any Key to Exit.");
        Console.ReadKey();
    } // END Main
} // END Program
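To see what the Captures collection saves you, here is a sketch of the same parsing job in an engine that keeps only the last capture of a quantified group. It uses Python's standard re module; the two-step approach and the names ENTRY, PAIR and dictionaries are my own illustration, not something from the .NET API.

```python
import re

text = "Italian:one=uno,two=due Japanese:one=ichi,two=ni,three=san"

# With a quantified group, re keeps only the last capture of each group.
last_only = re.search(r"\w+:(?:(\w+)=(\w+),?)+", text).groups()

# Workaround: capture the whole run of pairs, then scan that run separately.
ENTRY = re.compile(r"(\w+):((?:\w+=\w+,?)+)")
PAIR = re.compile(r"(\w+)=(\w+)")

dictionaries = {
    entry.group(1): PAIR.findall(entry.group(2))
    for entry in ENTRY.finditer(text)
}

print(last_only)                 # ('two', 'due')
print(dictionaries["Japanese"])  # [('one', 'ichi'), ('two', 'ni'), ('three', 'san')]
```

The second pass over each entry is exactly the work that .NET's Captures member does for you in a single match.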
\b\w+\b
, this is the string you would feed to a preg function: '~\b\w+\b~'
For the delimiter, you can choose any character that is not a letter, a digit, a backslash or whitespace.
But choose wisely, because if your delimiter appears in the pattern, it needs to be escaped.
The forward slash is a popular delimiter, and strangely so since it needs to be escaped in all sorts of strings having to do with file paths.
For instance, to match http://, do you really want your regex string to look like '/http:\/\//'? Doesn't '~http://~' look better?
Rare characters such as "~", "%", "#" or "@" are more sensible and fairly popular choices.
I don't like the "#" because it clashes with the # you use in comment mode.
Esthetically, my favorite is the tilde ("~") because it meets three criteria.
First, it is discreet, which allows the actual regex to stand out.
Many delimiters look like they belong to the expression, and that is confusing.
Second, tildes rarely occur in my patterns, so I almost never have to escape them.
Third, it is my favorite, which allows me to introduce some circular logic in this paragraph.
~blob\d+~i
Inline at the start of the pattern: ~(?i)blob\d+~
I tend to prefer inline modifier syntax, first because it jumps out at you when you start reading the regex, second because it is more portable across other regex flavors, and third because you can turn it off further down the string (for instance, (?-i) turns off the case-insensitive modifier).
The modifiers page explains all the flags and shows how to set them.
It also presents PCRE's Special Start-of-Pattern Modifiers, which include little-known modifiers such as (*LIMIT_MATCH=x).
Whatever you do, never use the cursed U flag or the (?U) modifier because they will draw a gang of raptorexes to your cubicle—not a good look! The u flag and (?u) modifier, on the other hand, are fine—they make the engine treat the input as a utf-8 string.
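As an illustration of the portability point, the inline (?i) form carries over unchanged to Python's re module. One caveat, stated as an assumption about your target engine: Python only accepts a bare inline flag at the very start of the pattern, so switching a flag off further down is written with a scoped group such as (?-i:…) rather than a bare (?-i). A minimal sketch:

```python
import re

# Same inline modifier as in the PHP pattern, minus the delimiters.
whole = re.search(r"(?i)blob\d+", "BLOB42")

# Case-insensitive overall, but the x inside the scoped group must be lowercase.
SCOPED = re.compile(r"(?i)blob(?-i:x)\d+")

print(whole.group(0))                    # BLOB42
print(bool(SCOPED.search("BLOBx7")))     # True
print(bool(SCOPED.search("BLOBX7")))     # False
```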
$hits = preg_match_all($regex,$airports,$matches,PREG_PATTERN_ORDER);
The output is below.
Element 0 contains an array with the whole matches; element 1 contains an array with the Group 1 matches; element 2 contains an array with the Group 2 matches; and so on.
This order (whole match, Group 1, Group 2, Group 3) can be said to be "the order of the regex pattern".
The flag for this presentation is PREG_PATTERN_ORDER (think of it as "the order of the regex pattern").
This is actually the function's default behavior, so you can freely drop the PREG_PATTERN_ORDER flag when you call the function.
Array
(
[0] => Array // The Whole Matches
(
[0] => San Francisco (SFO) USA
[1] => Sydney (SYD) Australia
[2] => Auckland (AKL) New Zealand
)
[1] => Array // The Group 1 Matches
(
[0] => San Francisco
[1] => Sydney
[2] => Auckland
)
[2] => Array // The Group 2 Matches
(
[0] => SFO
[1] => SYD
[2] => AKL
)
[3] => Array // The Group 3 Matches
(
[0] => USA
[1] => Australia
[2] => New Zealand
)
)
Second Presentation: ordered by SET (one set for each match)
Again, $hits contains the number of matches found (including 0 if none are found).
$hits = preg_match_all($regex,$airports,$matches,PREG_SET_ORDER);
The output is below.
Note that the outer array is organized "one SET for each match at a time".
Element 0 contains an array with the first match (that array's element 0 is the whole match, element 1 is Group 1, element 2 is Group 2…) Element 1 contains an array with the second match (that array's element 0 is the whole match, element 1 is Group 1, element 2 is Group 2…)
Sometimes, this structure is exactly what you want.
The flag for this presentation is PREG_SET_ORDER (think of it as "ordered by set").
Array
(
[0] => Array // The First Match
(
[0] => San Francisco (SFO) USA
[1] => San Francisco
[2] => SFO
[3] => USA
)
[1] => Array // The Second Match
(
[0] => Sydney (SYD) Australia
[1] => Sydney
[2] => SYD
[3] => Australia
)
[2] => Array // The Third Match
(
[0] => Auckland (AKL) New Zealand
[1] => Auckland
[2] => AKL
[3] => New Zealand
)
)
To remember the flags, try to understand them as "in the order of the regex pattern" (PREG_PATTERN_ORDER), or "ordered by set" (PREG_SET_ORDER)
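If it helps to see the two layouts side by side, they are simply transposes of each other. The sketch below rebuilds both in Python; the subject string and the pattern are my own stand-ins (the originals are not shown in this section), chosen so that the results match the arrays above.

```python
import re

# Assumed sample data and pattern, for illustration only.
airports = "San Francisco (SFO) USA\nSydney (SYD) Australia\nAuckland (AKL) New Zealand"
regex = re.compile(r"^(.+?) \(([A-Z]{3})\) (.+)$", re.MULTILINE)

# "Set order": one tuple per match (whole match, Group 1, Group 2, Group 3).
set_order = [(m.group(0),) + m.groups() for m in regex.finditer(airports)]

# "Pattern order": one list per group, obtained by transposing the sets.
pattern_order = [list(column) for column in zip(*set_order)]

print(set_order[1])      # ('Sydney (SYD) Australia', 'Sydney', 'SYD', 'Australia')
print(pattern_order[2])  # ['SFO', 'SYD', 'AKL']
```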
$string=str_replace('10','20',$string);
The preg_replace function comes in when you need a regex pattern to match the string to be replaced, for instance if you only wanted to replace '10' when it stands alone but not when it is part of "101" or "File10".
By default, the function replaces all of the matches in the original string, so make sure this is what you want.
If you want to replace only 1 or 5 instances, specify this limit as a fourth argument.
Here is an example.
$subject='Give me 12 eggs then 12 more.';
$pattern='~\d+~';
$newstring = preg_replace($pattern, "6", $subject);
echo $newstring;
The Output:
Give me 6 eggs then 6 more.
This code replaces the two instances of "12" with "6".
If you wanted to only replace the first instance, you would set the limit (1) as a fourth argument:
$newstring = preg_replace($pattern, "6", $subject,1);
This would output "Give me 6 eggs then 12 more."
If you want to know how many replacements are made, add a variable as a fifth parameter.
This forces you to set the fourth parameter (the limit number of replacements).
To set no limit, use -1.
For instance, with
$newstring = preg_replace($pattern, "6", $subject,-1, $count);
The value of $count would be 2.
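For readers who also work in Python, the limit and the count map onto re.sub's count argument and onto re.subn. This is a comparison sketch, not PHP; the last line also shows the "stand-alone 10" case mentioned earlier, with a made-up subject string.

```python
import re

subject = "Give me 12 eggs then 12 more."

all_replaced = re.sub(r"\d+", "6", subject)            # no limit
first_only = re.sub(r"\d+", "6", subject, count=1)     # limit of 1
new_string, how_many = re.subn(r"\d+", "6", subject)   # also reports the count

# Replace 10 only when it stands alone, thanks to the word boundaries.
standalone = re.sub(r"\b10\b", "20", "10 101 File10")

print(first_only)   # Give me 6 eggs then 12 more.
print(how_many)     # 2
print(standalone)   # 20 101 File10
```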
Using Captured Groups in the Replacement
In the replacement string, you can refer to capture groups.
Group 1 is \1 or $1, Group 2 is \2 or $2, and so on.
This means that the replacement string "\2###\1" will replace the matched text with the content of Group 2 followed by three hashes and the content of Group 1.
This technique is often used when you want to rearrange the sequence of a string.
You might match a whole big string full of unwanted fluff, capture the portions you are interested in, and rearrange them how you like.
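The rearranging idea looks like this in Python, where the replacement string uses the same \1 and \2 references. The input string and the helper name swap_pairs are made up for the example.

```python
import re

def swap_pairs(text):
    """Swap the two sides of each key=value pair, joining them with ###."""
    return re.sub(r"(\w+)=(\w+)", r"\2###\1", text)

print(swap_pairs("one=uno, two=due"))  # uno###one, due###two
```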
Note that as it makes one replacement after another, the regex engine keeps working on the original string—rather than switching to the latest version of the string.
For instance, using the string abcde, let's use the regex (?<=a)\w, which matches one word character preceded by an a:
$string = preg_replace('~(?<=a)\w~','a','abcde');
This produces aacde: only the "b" was replaced, because in the original string it is the only character that is preceded by an "a".
If, on the other hand, the regex engine switched to the latest version of the string after making each substitution, when it came to "c", that character would also be preceded by an "a", and we would end with aaaaa.
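The contrast can be checked in a few lines. Python's re.sub behaves like preg_replace in this respect, so it serves for the first half; the second half is a hand-written loop (the function replace_seeing_updates is hypothetical) that imitates an engine which would look at the already-modified text.

```python
import re

# Real engine behavior: lookbehinds are tested against the original string.
one_pass = re.sub(r"(?<=a)\w", "a", "abcde")

def replace_seeing_updates(text):
    """Imitate an engine that checks the lookbehind against the updated text."""
    chars = list(text)
    for i in range(1, len(chars)):
        if chars[i - 1] == "a" and re.fullmatch(r"\w", chars[i]):
            chars[i] = "a"  # this change is visible to the next position
    return "".join(chars)

print(one_pass)                         # aacde
print(replace_seeing_updates("abcde"))  # aaaaa
```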
Replacing an Invisible Delimiter
This is a trick that regex lovers are sure to enjoy.
It is closely related to the technique of Splitting with an Invisible Delimiter, so I explain it in that section.
\b(\w+)(\w)\b
This pattern simply matches each word separately (thanks to the \b word boundaries).
As it does so, it captures all of a word's letters except its last into Group 1, and it captures the final letter into Group 2.
(For this task, we're assuming that each word has at least two letters, so we're okay.)
Here's the basic way of doing the replacement.
$string = ("cool kids capitalize final letters");
$regex = "~\b(\w+)(\w)\b~";
$newstring = preg_replace_callback($regex,"LastToUpper",$string);
function LastToUpper($m) {
return $m[1].strtoupper($m[2]);
}
echo $newstring;
The Output: cooL kidS capitalizE finaL letterS
In the example above, you can see how preg_replace_callback specifies the name of the function that produces the replacement strings: "LastToUpper".
The function LastToUpper is then defined.
We know that preg_replace_callback sends one parameter to the substitution function, so we specify it and call it—arbitrarily—$m.
This $m that preg_replace_callback sends to the substitution function is the current match array, in the same form as the match array of preg_match.
This means that $m[0] is the overall match, while $m[1] is Group 1, $m[2] is Group 2, and so on.
This makes it easy for LastToUpper to return the word with the last letter capitalized: it is Group 1 (the initial letters) concatenated with the uppercase version of Group 2 (the last letter).
Here we did something simple, but you can appreciate how easy it would be to infuse our substitution with more logic.
Suppose, for instance, that we want to capitalize the last letter of each word, but that when that letter is an "s", we want to substitute a "Z".
Easily done: we just burn that logic into the callback function.
function LastToUpper($m) {
$last = $m[2]=="s" ? "Z" : strtoupper($m[2]);
return $m[1].$last;
}
The Output: cooL kidZ capitalizE finaL letterZ
Lighter Version: Use an Anonymous Function
Usually, we have no use for the substitution function except for the particular regex we're working on.
The second method is the same, except that instead of passing a function name in the second argument, we define the function "inline" in the call to preg_replace_callback.
$string = ("cool kids capitalize final letters");
$regex = "~\b(\w+)(\w)\b~";
$newstring = preg_replace_callback($regex,
function($m) {return $m[1].strtoupper($m[2]);}
,$string);
echo $newstring;
Same Output: cooL kidS capitalizE finaL letterS
As you can see, our callback function has no name: it's an anonymous function, so we don't pollute the namespace.
With this, you're equipped to make some powerful substitutions.
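The callback technique is not specific to PHP. As a cross-check, here is a sketch of the same three variations in Python, where re.sub accepts a function in place of the replacement string and hands it a match object instead of an array (the function names are mine):

```python
import re

WORD = re.compile(r"\b(\w+)(\w)\b")
text = "cool kids capitalize final letters"

def last_to_upper(m):
    # Group 1 holds the initial letters, Group 2 the final letter.
    return m.group(1) + m.group(2).upper()

def last_to_upper_or_z(m):
    # Same idea, except that a final "s" becomes "Z".
    last = "Z" if m.group(2) == "s" else m.group(2).upper()
    return m.group(1) + last

plain = WORD.sub(last_to_upper, text)
with_z = WORD.sub(last_to_upper_or_z, text)
inline = WORD.sub(lambda m: m.group(1) + m.group(2).upper(), text)  # anonymous version

print(plain)   # cooL kidS capitalizE finaL letterS
print(with_z)  # cooL kidZ capitalizE finaL letterZ
```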
(\d+)
was in parentheses: we include the whole delimiter into the array.
But we don't have to keep the entire delimiter.
Imagine for instance that your delimiter is of the form @@ABC123, where ABC are three capital letters and 123 are three digits.
If you want to fan "ABC" and "123" into the array but lose the "@@", you would do this:
$str = "token1@@ABC123token2@@DEF456token3";
$regex = "~@@([A-Z]{3})(\d{3})~";
print_r(preg_split($regex,$str,-1,PREG_SPLIT_DELIM_CAPTURE));
The Output:
Array: [0]=>token1, [1]=>ABC, [2]=>123, [3]=>token2, [4]=>DEF,
[5]=>456, [6]=> token3
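Python's re.split behaves the same way without needing a flag: whatever the capture groups in the delimiter match is included in the result, and the uncaptured @@ is dropped. A quick sketch with the same data:

```python
import re

text = "token1@@ABC123token2@@DEF456token3"
parts = re.split(r"@@([A-Z]{3})(\d{3})", text)

print(parts)  # ['token1', 'ABC', '123', 'token2', 'DEF', '456', 'token3']
```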
Splitting with an Invisible Delimiter
Here is a lovely feature of splitting strings with regex.
The preg_split function allows you to split a string with an invisible delimiter.
For instance, consider a movie title written in camel case (perhaps because it was in a file name): TheDayMyVoiceBroke.
You're interested in retrieving each word.
But what's the delimiter?
There is an "invisible" delimiter: any position where the next character is a capital letter.
This can be expressed as a simple lookahead: (?=[A-Z]).
You could call that a "zero-width delimiter".
Let's see it at work:
$string = ("TheDayMyVoiceBroke");
$regex = "~(?=[A-Z])~";
$words = preg_split($regex,$string);
print_r($words);
The Output:
Array ( [0] => [1] => The [2] => Day [3] => My [4] => Voice [5] => Broke )
Magical!
But maybe we want to concatenate the words of the movie into a string, with spaces between the words? Before you reach for implode(" ",$words), consider that what we just did with preg_split, we can do with preg_replace.
Here is the code and the output.
Replacing an Invisible Delimiter
$string = ("TheDayMyVoiceBroke");
$regex = "~(?=[A-Z])~";
echo preg_replace($regex," ",$string);
The Output:
The Day My Voice Broke
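Both tricks can be tried in Python as well, with one assumption: the interpreter must be version 3.7 or later, where re.split accepts delimiters that match the empty string. In this sketch the empty first piece is filtered out, and the replacement version adds (?<!^) so that no space is inserted before the first word.

```python
import re

title = "TheDayMyVoiceBroke"

# Splitting: the zero-width match before "T" yields an empty first piece, dropped here.
words = [piece for piece in re.split(r"(?=[A-Z])", title) if piece]

# Replacing: insert a space at each invisible delimiter except at the start.
spaced = re.sub(r"(?<!^)(?=[A-Z])", " ", title)

print(words)   # ['The', 'Day', 'My', 'Voice', 'Broke']
print(spaced)  # The Day My Voice Broke
```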
(?i)marlon \Kbrando
will return "Brando".
Well, you could get "Brando" with a capture group or a lookbehind, so what's the big deal?
The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow you to use quantifiers: the length of what you look for must be fixed.
On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before the \K.
For instance, let's say you want to match "Brando xx" in "Marlon Brando xx" (where xx are digits) but only if the string sits somewhere between a <tag> and a </tag>.
You can't look behind for the start of the tag because you don't know how many characters are before "Marlon Brando", and variable-length lookbehinds are forbidden in PCRE.
One option is to match everything and capture "Brando xx" in a Group.
Option 2 is to use \K, saving us the overhead of a capture group:
(?i)<tag>(?:(?!</tag).)*marlon \Kbrando \d+
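For comparison, here is the first option (match everything, capture the part you want) as a runnable sketch. It is written in Python because the standard re module has no \K, which makes the capture group the natural choice there; the subject strings are invented for the example.

```python
import re

# Same pattern as above, with a capture group standing in for \K.
PATTERN = re.compile(r"(?i)<tag>(?:(?!</tag).)*marlon (brando \d+)")

inside = PATTERN.search("<tag>starring Marlon Brando 24 tonight</tag>")
outside = PATTERN.search("starring Marlon Brando 24 tonight")

print(inside.group(1))  # Brando 24
print(outside)          # None
```

With \K the overall match itself would be "Brando 24"; with the capture group you read Group 1 instead.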
(*SKIP)(*FAIL)
syntax available in Perl and PHP.
Just search throughout the article.
re.split("(?=-)", "a-beautiful-day")
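returns ['a-beautiful-day'] in older Python versions.
To split on zero-width matches in those versions, we need to use the regex module in V1 mode.
For instance, regex.split("(?V1)(?=-)", "a-beautiful-day") will return ['a', '-beautiful', '-day'], which is what we want.
Since Python 3.7, re.split also splits on zero-width matches, so the standard module returns the same list.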
Java regex is an interesting beast.
On the one hand, it has a number of "premium" features, such as:
Lookbehind that allows a variable width within a specified range
Methods that return the starting and ending point of a match in a string.
Support for \R to match any kind of line break, including CRLF pairs.
Support for the \G anchor (which asserts that the current position is the beginning of the string or the position immediately following the previous match)
Support for the \Q … \E (block escape)
Possessive quantifiers.
The (?d) modifier (also accessible via the UNIX_LINES option).
When this is on, the line feed character \n is the only one that affects the dot . (which doesn't match line breaks unless DOTALL is on) and the anchors ^ and $ (which match line beginnings and endings in multiline mode).
On the other hand, Java regex has several unpleasant aspects, such as:
Absence of other important premium features found in .NET, Perl or PCRE, such as \K, (*SKIP)(*F), subroutines and recursion.
The absence of raw strings, forcing us to double escape backslashes in regex patterns
A buggy lookbehind which has a number of undocumented effects.