rexegg



A Short History of Regex

In the late 1960s Ken Thompson of Bell Labs wrote regular expressions into the editor QED, and in the 1970s they made it into Unix programs and utilities such as grep, sed and AWK. These tools made text-matching much easier than the alternative: writing custom parsing programs for each task. Naturally, some saw the potential for even more powerful text-matching patterns. In the 1980s, programmers could not resist the urge to expand the existing regular expression syntax to make its patterns even more useful, most notably Henry Spencer with his regex library, then Larry Wall with the Perl language, which used and then expanded Spencer's library. The engines that process this expanded regular expression syntax were no longer DFAs; they are called Non-Deterministic Finite Automatons (NFAs). At that stage, regex patterns could no longer be said to be regular in the mathematical sense. This is why a small minority of people today (most of whom have email addresses ending with .edu) will maintain that what we call regex are not regular expressions.

For the rest of us… Regex and regular expressions? Same-same.

Perl had a huge influence on the flavors of regular expressions used in most modern engines today. This is why modern regular expressions are often called Perl-style. The differences in features across regex engines are considerable, so in my view speaking of Perl-style regular expressions only makes sense when one wants to make it clear one is not talking about the ivory tower brand of mathematically-correct expressions. But if you really want to avoid ambiguity, just say regex, as that is one word that white-coat computer scientists are not claiming.

Subject: Reiterating that regular = DFA = NFA
Could you please fix this? I see you've replied, but the main text still needs to be corrected. I actually noticed this before reading the comment.
Before: The engines that process this expanded regular expression syntax were no longer DFAs; they are called Non-Deterministic Finite Automatons (NFAs).
After (the wording may need a bit of additional reworking): The engines that process this expanded regular expression syntax were no longer regular. They are certainly recursively enumerable, most likely also context-sensitive, and if so, most likely also context-free.
I am slightly confused by whether or not context-free languages are also context-sensitive, given the wording. I'm pretty sure they are considered to be, though.

Reply to Solomon Ucko
Hi Solomon, I don't have the brain cycles right now to carefully consider the adequate wording, but your comment is up.

Subject: Please check your theory
DFAs and NFAs have equivalent expressive power. In other words, for every NFA, there is an equivalent DFA, and vice versa. So it is incorrect to say that these "regex patterns could no longer [be] said to be regular in the mathematical sense." They in fact do define regular languages. Some "regex" engines, however, do go beyond regular languages to define "context-free" languages. These cannot be represented by an NFA (or DFA) and require a push-down automata (PDA).

Reply to Thomas McLeod
Thomas, That's sort of right but sort of muddled. Deterministic Finite Automata, Nondeterministic Finite Automata, and regular expressions all generate/recognize exactly the set of regular languages. However, the "regular expressions" in programming languages might better be described as "extended regular expressions". The language of all strings with some number of As followed by a B followed by the same number of As is not a regular language (pumping lemma). However, it is accepted by the following regex: (A*)B\1
"Automata" is plural. The singular is "automaton".
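The backreference pattern quoted in the last comment can be tried directly. The sketch below uses Python's built-in re module (an arbitrary choice of engine for illustration; the test strings are made up) to show that (A*)B\1 accepts only strings with the same number of As on both sides of the B, which no true finite automaton can check.

```python
import re

# (A*)B\1: a run of As, a B, then the exact same run again via the backreference.
pattern = re.compile(r"(A*)B\1")

balanced = pattern.fullmatch("AAABAAA")    # three As on each side
unbalanced = pattern.fullmatch("AAABAA")   # three As, then only two
lone_b = pattern.fullmatch("B")            # zero As on each side

print(bool(balanced), bool(unbalanced), bool(lone_b))
```

fullmatch is used rather than search so that the whole string, not just a substring, has to fit the pattern.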

Quick-Start: Regex Cheat Sheet

Characters

Character | Legend | Example | Sample Match
\d | Most engines: one digit from 0 to 9 | file_\d\d | file_25
\d | .NET, Python 3: one Unicode digit in any script | file_\d\d | file_9੩
\w | Most engines: "word character": ASCII letter, digit or underscore | \w-\w\w\w | A-b_1
\w | Python 3: "word character": Unicode letter, ideogram, digit, or underscore | \w-\w\w\w | 字-ま_۳
\w | .NET: "word character": Unicode letter, ideogram, digit, or connector | \w-\w\w\w | 字-ま‿۳
\s | Most engines: "whitespace character": space, tab, newline, carriage return, vertical tab | a\sb\sc | a b c
\s | .NET, Python 3, JavaScript: "whitespace character": any Unicode separator | a\sb\sc | a b c
\D | One character that is not a digit as defined by your engine's \d | \D\D\D | ABC
\W | One character that is not a word character as defined by your engine's \w | \W\W\W\W\W | *-+=)
\S | One character that is not a whitespace character as defined by your engine's \s | \S\S\S\S | Yoyo

Quantifiers

Quantifier | Legend | Example | Sample Match
+ | One or more | Version \w-\w+ | Version A-b1_1
{3} | Exactly three times | \D{3} | ABC
{2,4} | Two to four times | \d{2,4} | 156
{3,} | Three or more times | \w{3,} | regex_tutorial
* | Zero or more times | A*B*C* | AAACC
? | Once or none | plurals? | plural

More Characters

Character | Legend | Example | Sample Match
. | Any character except line break | a.c | abc
. | Any character except line break | .* | whatever, man.
\. | A period (special character: needs to be escaped by a \) | a\.c | a.c
\ | Escapes a special character | \.\*\+\?\$\^\/\\ | .*+?$^/\
\ | Escapes a special character | \[\{\(\)\}\] | [{()}]

Logic

Logic | Legend | Example | Sample Match
| | Alternation / OR operand | 22|33 | 33
( … ) | Capturing group | A(nt|pple) | Apple (captures "pple")
\1 | Contents of Group 1 | r(\w)g\1x | regex
\2 | Contents of Group 2 | (\d\d)\+(\d\d)=\2\+\1 | 12+65=65+12
(?: … ) | Non-capturing group | A(?:nt|pple) | Apple

More White-Space

Character | Legend | Example | Sample Match
\t | Tab | T\t\w{2} | T<tab>ab
\r | Carriage return character | see below |
\n | Line feed character | see below |
\r\n | Line separator on Windows | AB\r\nCD | AB(line break)CD
\N | Perl, PCRE (C, PHP, R…): one character that is not a line break | \N+ | ABC
\h | Perl, PCRE (C, PHP, R…), Java: one horizontal whitespace character: tab or Unicode space separator | |
\H | One character that is not a horizontal whitespace | |
\v | .NET, JavaScript, Python, Ruby: vertical tab | |
\v | Perl, PCRE (C, PHP, R…), Java: one vertical whitespace character: line feed, carriage return, vertical tab, form feed, paragraph or line separator | |
\V | Perl, PCRE (C, PHP, R…), Java: any character that is not a vertical whitespace | |
\R | Perl, PCRE (C, PHP, R…), Java: one line break (carriage return + line feed pair, and all the characters matched by \v) | |

More Quantifiers

Quantifier | Legend | Example | Sample Match
+ | The + (one or more) is "greedy" | \d+ | 12345
? | Makes quantifiers "lazy" | \d+? | 1 in 12345
* | The * (zero or more) is "greedy" | A* | AAA
? | Makes quantifiers "lazy" | A*? | empty in AAA
{2,4} | Two to four times, "greedy" | \w{2,4} | abcd
? | Makes quantifiers "lazy" | \w{2,4}? | ab in abcd
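To see the greedy and lazy rows side by side, here is a minimal sketch in Python's re module (one engine among many; the variable names are only illustrative). Each search starts at the beginning of the sample string, so the difference comes purely from the quantifier.

```python
import re

# Greedy quantifiers take as much as they can; adding ? makes them take as little as they can.
greedy_digits = re.search(r"\d+", "12345").group()      # '12345'
lazy_digits = re.search(r"\d+?", "12345").group()       # '1'
lazy_star = re.search(r"A*?", "AAA").group()            # '' (zero As is enough)
greedy_range = re.search(r"\w{2,4}", "abcd").group()    # 'abcd'
lazy_range = re.search(r"\w{2,4}?", "abcd").group()     # 'ab'

print(greedy_digits, lazy_digits, repr(lazy_star), greedy_range, lazy_range)
```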

Character Classes

Character | Legend | Example | Sample Match
[ … ] | One of the characters in the brackets | [AEIOU] | One uppercase vowel
[ … ] | One of the characters in the brackets | T[ao]p | Tap or Top
- | Range indicator | [a-z] | One lowercase letter
[x-y] | One of the characters in the range from x to y | [A-Z]+ | GREAT
[ … ] | One of the characters in the brackets | [AB1-5w-z] | One of either: A,B,1,2,3,4,5,w,x,y,z
[x-y] | One of the characters in the range from x to y | [ -~]+ | Characters in the printable section of the ASCII table (space through tilde).
[^x] | One character that is not x | [^a-z]{3} | A1!
[^x-y] | One of the characters not in the range from x to y | [^ -~]+ | Characters that are not in the printable section of the ASCII table.
[\d\D] | One character that is a digit or a non-digit | [\d\D]+ | Any characters, including new lines, which the regular dot doesn't match
[\x41] | Matches the character at hexadecimal position 41 in the ASCII table, i.e. A | [\x41-\x45]{3} | ABE

Anchors and Boundaries

Anchor | Legend | Example | Sample Match
^ | Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means "not") | ^abc .* | abc (line start)
$ | End of string or end of line depending on multiline mode. Many engine-dependent subtleties. | .*? the end$ | this is the end
\A | Beginning of string (all major engines except JS) | \Aabc[\d\D]* | abc (string... ...start)
\z | Very end of the string. Not available in Python and JS | the end\z | this is...\n...the end
\Z | End of string or (except Python) before final line break. Not available in JS | the end\Z | this is...\n...the end\n
\G | Beginning of String or End of Previous Match. .NET, Java, PCRE (C, PHP, R…), Perl, Ruby | |
\b | Word boundary. Most engines: position where one side only is an ASCII letter, digit or underscore | Bob.*\bcat\b | Bob ate the cat
\b | Word boundary. .NET, Java, Python 3, Ruby: position where one side only is a Unicode letter, digit or underscore | Bob.*\bкошка\b | Bob ate the кошка
\B | Not a word boundary | c.*\Bcat\B.* | copycats
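A few of the anchor rows can be checked with a short sketch in Python's re module. The sample text is made up, and the behavior shown is Python's; as the table notes, other engines differ, particularly for \Z.

```python
import re

text = "abc\ndef\nthe end\n"

# Without multiline mode, ^ only matches at the very start of the string.
first_only = re.findall(r"^\w+", text)
# With multiline mode, ^ matches at the start of every line.
each_line = re.findall(r"^\w+", text, flags=re.MULTILINE)

# In Python, $ tolerates one trailing newline, but \Z is the very end of the string.
dollar_end = re.search(r"the end$", text)
strict_end = re.search(r"the end\Z", text)

sample = "Bob ate the cat, not the copycats"
whole_word = re.findall(r"\bcat\b", sample)   # only the standalone word
inner_only = re.findall(r"\Bcat\B", sample)   # only the cat inside copycats

print(first_only, each_line, bool(dollar_end), bool(strict_end))
print(whole_word, inner_only)
```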

POSIX Classes

Character | Legend | Example | Sample Match
[:alpha:] | PCRE (C, PHP, R…): ASCII letters A-Z and a-z | [8[:alpha:]]+ | WellDone88
[:alpha:] | Ruby 2: Unicode letter or ideogram | [[:alpha:]\d]+ | кошка99
[:alnum:] | PCRE (C, PHP, R…): ASCII digits and letters A-Z and a-z | [[:alnum:]]{10} | ABCDE12345
[:alnum:] | Ruby 2: Unicode digit, letter or ideogram | [[:alnum:]]{10} | кошка90210
[:punct:] | PCRE (C, PHP, R…): ASCII punctuation mark | [[:punct:]]+ | ?!.,:;
[:punct:] | Ruby: Unicode punctuation mark | [[:punct:]]+ | ,:

Inline Modifiers

None of these are supported in JavaScript. In Ruby, beware of (?s) and (?m).
Modifier | Legend | Example | Sample Match
(?i) | Case-insensitive mode (except JavaScript) | (?i)Monday | monDAY
(?s) | DOTALL mode (except JS and Ruby). The dot (.) matches new line characters (\r\n). Also known as "single-line mode" because the dot treats the entire input as a single line | (?s)From A.*to Z | From A(line break)to Z
(?m) | Multiline mode (except Ruby and JS). ^ and $ match at the beginning and end of every line | (?m)1\r\n^2$\r\n^3$ | 1(line break)2(line break)3
(?m) | In Ruby: the same as (?s) in other engines, i.e. DOTALL mode, i.e. dot matches line breaks | (?m)From A.*to Z | From A(line break)to Z
(?x) | Free-Spacing Mode (except JavaScript). Also known as comment mode or whitespace mode | multi-line pattern shown under this table | abc d
(?n) | .NET, PCRE 10.30+: named capture only | Turns all (parentheses) into non-capture groups. To capture, use named groups. |
(?d) | Java: Unix linebreaks only | The dot and the ^ and $ anchors are only affected by \n |
(?^) | PCRE 10.32+: unset modifiers | Unsets ismnx modifiers |

Example pattern for (?x):
(?x) # this is a
# comment
abc # write on multiple
# lines
[ ]d # spaces must be
# in brackets
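The four common modifiers can be tried in Python's re module, which accepts them inline at the start of a pattern. This is only a sketch for one engine; the sample strings use \n rather than \r\n line breaks, and the variable names are illustrative.

```python
import re

# (?i): case-insensitive matching.
case_insensitive = re.fullmatch(r"(?i)Monday", "monDAY")

# (?s): DOTALL, the dot also matches line breaks.
with_dotall = re.search(r"(?s)From A.*to Z", "From A\nto Z")
without_dotall = re.search(r"From A.*to Z", "From A\nto Z")

# (?m): multiline, ^ and $ match at the start and end of every line.
digits_per_line = re.findall(r"(?m)^\d$", "1\n2\n3")

# (?x): free-spacing, whitespace is ignored and # starts a comment.
free_spacing = re.fullmatch(
    r"""(?x)   # this is a comment
    abc        # write on multiple lines
    [ ]d       # a literal space must sit in brackets
    """,
    "abc d",
)

print(bool(case_insensitive), bool(with_dotall), bool(without_dotall))
print(digits_per_line, bool(free_spacing))
```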

Lookarounds

Lookaround | Legend | Example | Sample Match
(?=…) | Positive lookahead | (?=\d{10})\d{5} | 01234 in 0123456789
(?<=…) | Positive lookbehind | (?<=\d)cat | cat in 1cat
(?!…) | Negative lookahead | (?!theatre)the\w+ | theme
(?<!…) | Negative lookbehind | \w{3}(?<!mon)ster | Munster
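The four lookaround rows translate directly to Python's re module. The sketch below reuses the table's patterns; the extra inputs ("theatre theme", "monster") are added here only to show the negative cases.

```python
import re

# Positive lookahead: assert ten digits ahead, but consume only five.
ahead = re.search(r"(?=\d{10})\d{5}", "0123456789").group()

# Positive lookbehind: cat must be preceded by a digit, which is not part of the match.
behind = re.search(r"(?<=\d)cat", "1cat").group()

# Negative lookahead: words starting with "the", except where "theatre" begins.
not_theatre = re.findall(r"(?!theatre)the\w+", "theatre theme")

# Negative lookbehind: three word characters that are not "mon", followed by "ster".
munster = re.search(r"\w{3}(?<!mon)ster", "Munster")
monster = re.search(r"\w{3}(?<!mon)ster", "monster")

print(ahead, behind, not_theatre, bool(munster), bool(monster))
```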

Character Class Operations

Class Operation | Legend | Example | Sample Match
[…-[…]] | .NET: character class subtraction. One character that is in those on the left, but not in the subtracted class. | [a-z-[aeiou]] | Any lowercase consonant
[…-[…]] | .NET: character class subtraction. | [\p{IsArabic}-[\D]] | An Arabic character that is not a non-digit, i.e., an Arabic digit
[…&&[…]] | Java, Ruby 2+: character class intersection. One character that is both in those on the left and in the && class. | [\S&&[\D]] | A non-whitespace character that is a non-digit.
[…&&[…]] | Java, Ruby 2+: character class intersection. | [\S&&[\D]&&[^a-zA-Z]] | A non-whitespace character that is a non-digit and not a letter.
[…&&[^…]] | Java, Ruby 2+: character class subtraction is obtained by intersecting a class with a negated class | [a-z&&[^aeiou]] | An English lowercase letter that is not a vowel.
[…&&[^…]] | Java, Ruby 2+: character class subtraction | [\p{InArabic}&&[^\p{L}\p{N}]] | An Arabic character that is not a letter or a number

Other Syntax

Syntax | Legend | Example | Sample Match
\K | Keep Out. Perl, PCRE (C, PHP, R…), Python's alternate engine, Ruby 2+: drop everything that was matched so far from the overall match to be returned | prefix\K\d+ | 12
\Q…\E | Perl, PCRE (C, PHP, R…), Java: treat anything between the delimiters as a literal string. Useful to escape metacharacters. | \Q(C++ ?)\E | (C++ ?)
Changing the File Extension

The extension in the replacement pattern below is "rar". Edit it to suit your needs.

Search pattern: ^(.*\.)[^.]+$
Replace: \1rar

Translation: At the beginning of the file name, greedily match any characters then a dot, capturing this to Group 1. The greedy dot-star will shoot to the end of the file name, then backtrack to the last dot. This capture is the stem of the file name. After this capture, match one or more characters that are not dots: the extension. Replace all of this with the content of Group 1, expressed as \1 or $1 (this is the captured stem and includes the dot) and "rar".

Removing a Character from the File Name

You could do this with a simple search-and-replace, but, to get accustomed to regex, here is a convoluted way to do it. The aim is to zap all dashes.

Search pattern: ^([^-]*)-(.*)#
Replace: \1\2

Translation: Match and capture any non-dash characters to Group 1, match a dash, then eat up any characters, capturing those in Group 2. Replace the file name with the first group immediately followed by the second group (the dash is gone). The hash character (#) at the end of the search pattern tells the DOpus engine to perform that substitution until the string stops changing. That way, all dashes are zapped one by one.

Replacing Dots with Spaces in File Names

This pattern is for times when you have 95 files that look something like this:

Search pattern: ([^.]+)\.(.*?)\.([^.]+)$#
Replace: \1 \2.\3

Translation: The pattern is a bit long, so let's unroll it.

([^.]+) # Greedily eat anything that is not a dot and capture that substring in group 1
\. # Match a dot
(.*?) # Lazily eat up anything and capture that substring in group 2
\. # Match a dot (this is the dot before the file extension)
([^.]+)$ # Greedily eat up anything that is not a dot, until the end of the filename, capturing that in group 3 (this is the extension)

The final hash character (#) tells Opus to repeat the replace operation until there are no dots left to eat and the filename has been cleaned up. Each pass of the replace string reinserts the captured groups with a space in place of the first dot, keeping the dot before the extension.

Swapping Two Parts of a Filename

Suppose you have named a lot of movie files according to this pattern:

8.2 Groundhog Day (1993).avi

The number at the front is the movie's IMDB rating. The number between parentheses at the end is the movie's release year. One day, you decide that instead of sorting movies by their rating, you really want to sort them by year, which means that in the file name, you'd like to swap the position of the rating and year. You want your files to look like this:

(1993) Groundhog Day 8.2.avi

Without regular expressions, you are in trouble. This is actually a fairly common scenario. It could happen for any collection of files you have tagged, such as music tracks or topo maps. Here is a regular expression that works in this case.

Search pattern: ^(\d\.\d)([^(]*)(\([\d]{4}\))(.*)
Replace: \3\2\1\4

Let's unroll the regex:

^(\d\.\d) # At the beginning, in Group 1, capture a digit, a dot and a digit. That's the IMDB rating.
([^(]*) # In the second group, greedily capture anything that is not an opening parenthesis.
(\([\d]{4}\)) # In the third group, capture an opening parenthesis (which needs to be escaped in the regex), four digits, and a closing parenthesis.
(.*) # In the last group, capture anything.

The replace pattern simply takes the four groups and changes their order.
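The renaming patterns above are written for Directory Opus, but they can be tried outside it. Here is a rough sketch in Python's re module, under a few assumptions: the function names and sample file names are hypothetical, and the trailing # (an Opus-specific "repeat until nothing changes" marker) is emulated with an explicit loop.

```python
import re

def change_extension(name, new_ext="rar"):
    # Group 1 is the stem including the last dot; \g<1> avoids ambiguity
    # if new_ext happens to start with a digit.
    return re.sub(r"^(.*\.)[^.]+$", r"\g<1>" + new_ext, name)

def dots_to_spaces(name):
    # Emulates the Opus '#' suffix: repeat the substitution until the name stops changing.
    pattern = re.compile(r"([^.]+)\.(.*?)\.([^.]+)$")
    while True:
        renamed = pattern.sub(r"\1 \2.\3", name, count=1)
        if renamed == name:
            return name
        name = renamed

def swap_rating_and_year(name):
    # Groups: 1 = rating, 2 = title, 3 = (year), 4 = the rest, reordered as 3-2-1-4.
    return re.sub(r"^(\d\.\d)([^(]*)(\([\d]{4}\))(.*)", r"\3\2\1\4", name)

print(change_extension("archive.tar.zip"))
print(dots_to_spaces("my.movie.file.avi"))
print(swap_rating_and_year("8.2 Groundhog Day (1993).avi"))
```

These functions only compute the new names; actually renaming files on disk would be a separate step.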

Regex Examples for Text File Search

What good are text editors if you can't perform complex searches? I checked these sample expressions in EditPad Pro, but they would probably work in Notepad++ or a regex-friendly IDE.

Seven-Letter Word Containing "hay"

Some examples may seem contrived, but having a small library of ready-made regex at your fingertips is fabulous.

Search pattern: (?=\b\w{7}\b)\w*?hay\w*

Translation: Look right ahead for a seven-letter word (the \b boundaries are important). Lazily eat up any word characters followed by "hay", then eat up any word characters. We know that the greedy match has to stop because the word is seven characters long. Here, in our word, we allow any characters that regex calls "word characters", which, besides letters, also include digits and underscores. If we want a more conservative pattern, we just need to change the lookahead:

Traditional word (only letters): (?i)(?=\b[A-Z]{7}\b)\w*?hay\w*

In this pattern, in the lookahead, you can see that I replaced \w{7} with [A-Z]{7}, which matches seven capital letters. To include lowercase letters, we could have used [A-Za-z]{7}. Instead, we used the case insensitive modifier (?i). Thanks to this modifier, the pattern can match "HAY" or "hAy" just as easily as "hay". It all depends on what you want: regex puts the power in your hands.

Line Contains both "bubble" and "gum"

Search pattern: ^(?=.*?\bbubble\b).*?\bgum\b.*

Translation: While anchored at the beginning of the line, look ahead for any characters followed by the word bubble. We could use a second lookahead to look for gum, but it is faster to just match the whole line, taking care to match gum on the way.

Line Contains "boy" or "buy"

Search pattern: \bb[ou]y\b

Translation: Inside a word (inside two \b boundaries), match the character b, then one character that is either o or u, then y.

Find Repeated Words, such as "the the"

This is a popular example in the regex literature. I don't know about you you, but it doesn't happen all that often often that mistakenly repeated words find their way way into my text. If this example is so popular, it's probably because it's a short pattern that does a great job of showcasing the power of regex. You can find a million ways to write your repeated word pattern. In this one, I used POSIX classes (available in Perl and PHP), allowing us to throw in optional punctuation between the words, in addition to optional space.

Search pattern: \b([[:alpha:]]+)[ [:punct:]]+\1

Translation: After a word delimiter, in group one, capture a positive number of letters, then eat up space characters or punctuation marks, then match the same word we captured earlier in group one. If you don't want the punctuation, just use an \s+ in place of [ [:punct:]]+. Remember that \s eats up any white-space characters, including newlines, tabs and vertical tabs, so if this is not what you want use [ ]+ to specify space characters. The brackets are optional, but they make the space character easier to spot, especially in a variable-width font.

Line does Not Contain "boy"

Search pattern: ^(?!.*boy).*

Translation: At the beginning of the line, if the negative lookahead can assert that what follows is not "any characters then boy", match anything on the line.

Line Contains "bubble" but Neither "gum" Nor "bath"

Search pattern: ^(?!.*gum)(?!.*bath).*?bubble.*

Translation: At the beginning of the line, assert that what follows is not "any characters then gum", assert that what follows is not "any characters then bath", then match the whole string, making sure to pick up bubble on the way.

Email Address

If I ever have to look for an email address in my text editor, frankly, I just search for @. That shows me both well-formed addresses, as well as addresses whose authors let their creativity run loose, for instance by typing DOT in place of the period.

When it comes to validating user input, you want an expression that checks for well-formed addresses. There are thousands of email address regexes out there. In the end, none can really tell you whether an address is valid until you send a message and the recipient replies. The regex below is borrowed from chapter 4 of Jan Goyvaerts's excellent book. I'm in tune with Jan's reasoning that what you really want is an expression that works with 999 addresses out of a thousand, an expression that doesn't require a lot of maintenance, for instance by forcing you to add new top-level domains ("dot something") every time the powers in charge of those things decide it's time to launch names ending in something like dot-phone or dot-dog.

Search pattern: (?i)\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b

Let's unroll this one:

(?i) # Turn on case-insensitive mode
\b # Position engine at a word boundary
[A-Z0-9._%+-]+ # Match one or more of the characters between brackets: letters, numbers, dot, underscore, percent, plus, minus. Yes, some of these are rare in an email address.
@ # Match @
(?:[A-Z0-9-]+\.)+ # Match one or more strings followed by a dot, such strings being made of letters, numbers and hyphens. These are the domains and sub-domains, such as post. and microsoft. in post.microsoft.com
[A-Z]{2,6} # Match two to six letters, for instance US, COM, INFO. This is meant to be the top-level domain. Yes, this also matches DOG. You have to decide if you want to achieve razor precision, at the cost of needing to maintain your regex when new TLDs are introduced.
\b # Match a word boundary
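Several of the search patterns in this section can be exercised outside a text editor. The sketch below runs them through Python's re module on made-up sample lines and addresses (example.com and example.org are placeholders). The repeated-word pattern is left out because Python's built-in engine does not support POSIX bracket classes.

```python
import re

# Seven-letter words containing "hay".
seven_letter_hay = re.findall(r"(?=\b\w{7}\b)\w*?hay\w*",
                              "a hayseed saw hay in the hayloft")

lines = [
    "bubble gum pops",
    "bubble bath time",
    "chewing gum",
    "the boy blew a bubble",
]

def lines_matching(pattern):
    # Apply a line-oriented pattern to each sample line in turn.
    return [line for line in lines if re.search(pattern, line)]

both_words = lines_matching(r"^(?=.*?\bbubble\b).*?\bgum\b.*")
without_boy = lines_matching(r"^(?!.*boy).*")
bubble_only = lines_matching(r"^(?!.*gum)(?!.*bath).*?bubble.*")

# The email pattern from the text, applied to a made-up sentence.
email = re.compile(r"(?i)\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b")
found = email.findall("Write to jane.doe@post.example.com or bob@example.org.")

print(seven_letter_hay, both_words, without_boy, bubble_only, found, sep="\n")
```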

Regex Examples for Web Server Directives (Apache)

If you are running Apache, chances are you have regular expressions somewhere in your .htaccess file or in your httpd.conf configuration file. Like PHP, Apache uses PCRE-flavor regular expressions. Here are a few examples.

Redirecting to a New Directory

Sometimes, you decide to change your directory structure. Visitors who follow an old link will request the old urls. Here is how a regex in htaccess can help.

RewriteRule old_dir/(.*)$ new_dir/$1 [L,R=301]

Explanation: The part of the old url that follows old_dir/ is captured in Group 1, and appended at the end of the new path.

Targeting Certain Browsers

BrowserMatch \bMSIE no-gzip

This directive checks if the user's browser name contains "MSIE" (with a word boundary before the "M"). If so, Apache applies what follows on the line. (In this case, no-gzip tells Apache not to compress content.)

Targeting Certain Files

<FilesMatch "\.html?$">
Header set Cache-Control "max-age=43200"
</FilesMatch>

The first line of this htaccess directive for file caching has a small regex matching files ending with a dot and "htm" or "html".

Other Regular Expressions in Apache

RewriteCond %{HTTP_USER_AGENT} ^Zeus.*?Webster
Purpose: In a rewrite rule, tests for certain user agents.

RewriteCond %{HTTP_REFERER} ^http://www\.google\.com$
Purpose: In a rewrite rule, tests for a specific referrer.

RewriteCond %{REMOTE_ADDR} 192\.168\.\d\d.*
Purpose: In a rewrite rule, tests for an IP range.

RewriteCond %{TIME_HOUR} ^13$
Purpose: In a rewrite rule, check if the hour is 1pm.

There are other uses of regex in Apache. These examples should give you a taste. For background information, you may want to look at the manual page for mod_rewrite, the mod_rewrite page, the rewrite guide and the advanced rewrite guide.

Is Apache using the same PCRE version as PHP?

Not necessarily. To see which version of PCRE PHP uses, look at the result of phpinfo() and search for PCRE. In addition to the version number, you will find a reference to a directory: something like pcre-regex=/opt/pcre.

Another way to find that folder is to run ldd /some/path/php | grep pcre in the shell, where "some/path" is the path returned by "which php". You can use that directory in a shell command line to get more information on your PCRE version:

/opt/pcre/bin/pcretest -C

On cPanel, EasyApache installs PCRE in the /opt folder, so if PHP reports the folder above, you can expect that mod_rewrite and PHP are using the same version of PCRE (unless there is a bug in cPanel). On other installs, you may want to find all the installed copies of pcretest to see which versions are installed:

find / -name pcretest

Regex Examples to locate Records in a Database (MySQL)

To illustrate the basic use of regex in MySQL, here's an example that selects records whose YourField field ends with "ty".

SELECT * FROM YourDatabase WHERE YourField REGEXP 'ty$';

Here's a second example that selects records whose field does not contain a digit:

SELECT * FROM YourDatabase WHERE YourField NOT REGEXP "[[:digit:]]";

You can use RLIKE in place of REGEXP, as the two are synonyms. But why would you?

Regular expressions in MySQL aim to comply with POSIX 1003.2, also known as Henry Spencer's "regex 7" library. The MySQL documentation page for REGEXP states that it is incomplete, but that the full details are included in MySQL source distributions, in the regex.7 file… Okay, that's a drag, but let's download the source. Oh, no, you can't, you need a special installer just to download the source. Never mind, here is a copy of the regex(7) manual page.

If you build queries in a programming language before sending them to MySQL, you have to pay particular attention to escaping contentious characters in the regex string. Your language probably has a function that will get the string ready to be passed to the database.

If you are used to Perl-like regular expressions, MySQL's POSIX flavor will sound like baby talk. If you need more power, you may consider the PCRE library for MySQL. Since I upgrade my MySQL server for major releases, the risk of forward incompatibility is a bit high and I stick with regex(7).

Subject: what about the contexts for sed and egrep?
I'm an old unix guy and lived with the early regular expressions for a very long time. I still script with sed. I spent a number of hours today reading and evaluating many of the web-, linux- and windows-based tools to assist in testing and creating regular expressions. What I find interesting is that I can't find one of these tools that allows one to restrict the engine to the sed (or egrep) contexts. This would be extremely helpful. I was actually surprised at seeing all the "other flavors" leaving the stalwarts nowhere to be found. Why is this? I would like to see these supported because it is best to use the most efficient method and one can't get much more efficient than sed. Regards oldunixguy

Reply to rich painter
You're quite right Rich, The tools I use don't have an egrep or sed mode. RegexBuddy does have a Perl mode, and there's a lot you can do on 'nix with perl one-liners (there's a page on the site with examples, in case you haven't seen it yet.)

The Elements of Good Regex Style

Knowing the English alphabet doesn't make you Hemingway. Likewise, knowing regex syntax doesn't make you literate with regex. Style is a hard thing to teach aspiring writers, but the rudiments of style can certainly be taught. Likewise, "good regex" is not easy to put into rules, and perhaps that's why I've never seen literature on the subject. This makes it all the more interesting to give it a try. With the disclaimer that I do not present myself as the ultimate authority on efficient, graceful regex, this page presents some "rules" I have gradually distilled with practice. There is always more to learn, so new rules may appear and old ones may be rephrased or subsumed by others. Before diving into the Elements of Regex Style, let's warm up with some considerations about matching strategy.

Should I Match it, or Should I Capture it?

Matches are not sacred. Feel free to throw them away! One topic I've never seen discussed (but maybe I'm not reading widely and carefully enough) is the various strategies available to you in order to retrieve the data you need. Should you try to match it? Should you try to capture it? It's an implicit question in nearly every complex piece of regex you write. In the examples on this site, you'll see a diverse use of matching and capturing. Sometimes, the match returns exactly the data we want. But often, the match is a "throwaway" that just gets us down the string, down to the portions we really care about, which we then capture. Sometimes, then, the captured groups contain the data we're after. But at other times, you only use groups (and back references) to build an intricate expression that adds up to the overall match you're looking for. If this sounds confusing, it doesn't need to be. There is only one thing you need to tell yourself:
The Match is Just Another Capture Group
Basically, you can imagine that there is a set of parentheses around your entire regex. These parentheses are just implied. They capture Group 0, which by convention we call "the match".

In fact, your language may already think that way. In PHP, if $match is the match array, $match[1] will contain Group 1, $match[2] will contain Group 2… and $match[0] ("Group 0") will contain the overall match. Likewise, in JavaScript the overall match will be in matchArray[0]. In Python and C#, you can (although those are not the only options) retrieve the overall match as match.group(0) and matchResult.Groups[0].Value.

Likewise, in regex replace operations, \1, \2, \3 (or $1, $2, $3, depending on the flavor) usually refer to capture groups 1, 2 and 3. Not by coincidence, \0 ($0) usually refers to the overall match.

Once you see that the match is just another group, the question of whether to match or to capture loses importance: You will be capturing anyhow.

What does this mean? You are the one who knows what data you want to match. Knowing this, use whatever means you need (whether it's an overall match or a sneaky capture in Group 3) in order to grab what you want. In the example about keeping the regex in sync with your string, we'll look at a technique that makes many captures, some useful, some not, and then leaves it to the code outside the regex to decide which of the capture groups are important. It's not a particularly efficient technique, but it works.

The only moderation I would add to the advice to "use whatever means you need" is that it's generally considered poor programming practice to spawn unnecessary capture groups, as they create overhead. So if you need parentheses in order to evaluate an expression but don't need to capture the data, make it a non-capturing group by using the (?: … ) syntax.
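The "Group 0" idea is easy to see in code. The following sketch uses Python's re module with a made-up movie string; note that in Python replacement strings the overall match is written \g<0> rather than \0.

```python
import re

m = re.search(r"(\d\.\d) (.+) \((\d{4})\)", "8.2 Groundhog Day (1993)")

overall = m.group(0)   # Group 0: the overall match
rating = m.group(1)    # Group 1
title = m.group(2)     # Group 2
year = m.group(3)      # Group 3

# In a replacement, \g<0> stands for the overall match, just like a numbered group.
bracketed = re.sub(r"\d{4}", r"[\g<0>]", "Released in 1993")

# A non-capturing group lets you group for alternation without creating a capture.
no_captures = re.search(r"A(?:nt|pple)", "Apple").groups()

print(overall, rating, title, year, bracketed, no_captures, sep=" | ")
```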

Should I Split, or should I Match All?

Here is a regex axiom that may come as a surprise:
Matching All and Splitting are two sides of the same coin.
Consider a list of fruits separated by the word "and": apple and banana and orange and pear and cherry. You are interested in obtaining an array with all the names of fruits. To do so, you could match all the words that are not and (something like \b(?!and)\S+ would do). Another approach would be to split the string using the delimiter " and ". Both approaches would provide you with the same array: it's a bit like one of those drawings that can be interpreted in different ways depending on whether you focus on the white background or on the inked parts.

When you want to match, I'll split you... When you want to split, I'll match you...

This is a simple example, but often you will gain considerable advantage by deciding to match rather than to split, or vice-versa. You'll often find that one way is easy and the other nearly impossible. Therefore, if someone tells you "I want to match all the…" or "I am trying to split by…", try not to rush down the first alley because they said "split" or "match": remember the other side of the coin.
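The fruit example can be run both ways. This is a minimal sketch in Python's re module, using the sentence and the pattern from the text; the variable names are illustrative.

```python
import re

sentence = "apple and banana and orange and pear and cherry"

# Side one of the coin: match every word that is not "and".
by_matching = re.findall(r"\b(?!and)\S+", sentence)

# Side two: split on the delimiter " and ".
by_splitting = re.split(r" and ", sentence)

print(by_matching)
print(by_splitting)
```

Both calls produce the same list of five fruits.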

The Elements of Regex Style

In the world of regex, it's appropriate to paraphrase Strunk & White:
To write good regex, say what you mean. Say it clearly.
The more specific your expressions, the faster your regex will match; and, often more importantly, the faster your regex will fail when no match is there to be found. Here are a few "golden rules" that every regex craftsperson should keep in mind. If some of these rules don't make complete sense to you right now, don't worry about it: just come here again after you've read some other sections, or in a couple of months' time.

Whenever Possible, Anchor.
Anchors, such as the caret ^ for the beginning of a line and the dollar sign $ for the end of a line, often provide the needed clue that ensures the engine finds a match in the right place. For instance, when we validate a string, they ensure that the engine matches the whole string, rather than a substring embedded in the string being examined. And anchors often save the engine a lot of backtracking. Be aware that anchors are not limited to ^ and $. Most engines have other useful built-in anchors, such as \A and \G (see the cheat sheet).

When You Know what You Want, Say It. When You Know what You Don't Want, Say It Too!
When you feed your regex engine a lot of .* "dot-star soup", the engine can waste a lot of energy running down the string then backtracking. Be as specific as possible, whether by using a literal B character, a \d digit class or a \b boundary. Another great way to be specific is to say what you don't want, whether what you don't want is… a double quote: [^"]… a digit: \D… or for the next three letters to be "boo": (?!boo)[a-z]{3}.

Contrast is Beautiful: Use It.
When you can, use consecutive tokens that are mutually exclusive in order to create contrast. This reduces backtracking and the need for boundaries in the broad sense of the term, in which I include lookarounds. For instance, let's say you're trying to validate strings that contain exactly three digits located at the end, as in ABC123. Something like ^.+\d{3}$ would not work, because . and \d are not mutually exclusive: this regex would match ABC123456. You may think to add a negative lookbehind: ^.+(?<!\d)\d{3}$. But if you use tokens that are mutually exclusive in the first place, you no longer need a lookaround: ^\D+\d{3}$ works straight out of the box. With time, you come to relish the beautiful contrast between \D and \d, between [^a-z] and [a-z]. This is a variation on When you know what you want, say it.

Want to Be Lazy? Think Twice.
Let's say you want to match all the characters between a set of curly braces. At first you might think of {.*?} because the lazy quantifier ensures you don't overshoot the closing brace. However, a lazy quantifier has a cost: at each step inside the braces, the engine tries the lazy option first (match no character), then tries to match the next token (the closing brace), then has to backtrack. Therefore, the lazy quantifier causes backtracking at each step (see Lazy Quantifiers Are Expensive). This is more efficient: {[^}]*}. This is a variation on Use Contrast and When you know what you want, say it.

A Time for Greed, a Time for Laziness.
A reluctant (lazy) quantifier can make you feel safe in the knowledge that you won't eat more characters than needed and overshoot your match, but since lazy quantifiers cause backtracking at each step, using them can feel like bumping on a country road when you could be rolling down the highway. Likewise, a greedy quantifier may shoot down the string then backtrack all the way back when all you needed was a few nudges with a lazy quantifier.

On the Edges: Really Need Boundaries or Delimiters? Use Them, or Make Your Own!
Most regex engines provide the \b boundary, which can be useful to inspect an edge of a substring. Depending on the engine, other boundaries may be available, but why stop there? In the right context, I believe in DIY boundaries. For instance, using lookarounds, you can make a boundary to check for changes from lower- to upper-case, which can be useful to split a CamelCase string: (?<=[a-z])(?=[A-Z]). However, do not overuse boundaries, because good contrast often makes them redundant (see Use Contrast).

Don't Give Up what You Can Possess.
Atomic groups (?> … ) and the closely-related possessive quantifiers can save you a lot of backtracking. Structured data often gives you chances to incorporate those in your expressions.

Don't Match what Splits Easily, and Don't Split what Matches Nicely.
I explained this point in the section about splitting vs. matching.

Design to Fail.
As Shakespeare famously wrote, "Any fool can write a regex that matches what it's meant to find. It takes genius to write a regex that knows early that its mission will fail." Take (?=.*fleas).*. It does a reasonable job of matching lines that contain fleas. But what of lines that don't have fleas? At the very start of the string, the engine looks all the way down the line. The lookahead fails, the regex engine moves to the second position in the string, and once again looks for fleas all the way down the line. At each position in the string, the engine repeats the lookahead, so that the pattern takes a long time to fail… In comparison, consider ^(?=.*fleas).*. The only difference is the caret anchor. It doesn't look like a big deal, but once the engine fails to find fleas at the start of the string, it stops because the lookahead is anchored at the start. This pattern is designed for failure, and it is much more efficient: O(N) vs. O(N²) for the first.

Trust the Dot-Star to Get You to the End of the Line
With all the admonishments against the dot-star, here is one of many cases where it can be useful. In a string such as @ABC @DEF, suppose you wish to match the last token that starts with @, but only if there is more than one token. If you simply wanted the last, you could use an anchor: @[A-Z]+$… but that will match the token even if it is the only one in the string. You might think to use a lookahead: @[A-Z].*\K@[A-Z]+(?!.*@[A-Z]). However, there is no need, because the greedy .* already guarantees that you are getting the last token! The dot-star matches all the way to the end of the line then backtracks, but only as far as needed. You can therefore simplify this to @[A-Z].*\K@[A-Z]+. Trust the dot-star to take you to the end of the line!
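A few of these rules can be checked quickly in code. What follows is only a minimal sketch, assuming Python's standard re module (the rules themselves are engine-neutral) and sample strings made up for illustration. Since re has no \K, the last example captures the final token in a group instead, and splitting on an empty match assumes Python 3.7 or later.

```python
import re

# Use contrast: . and \d overlap, \D and \d do not.
loose = re.compile(r'^.+\d{3}$')
strict = re.compile(r'^\D+\d{3}$')
loose_hit = bool(loose.search('ABC123456'))    # .+ happily swallows "ABC123"
strict_hit = bool(strict.search('ABC123456'))  # \D+ must stop at the first digit

# Say what you don't want: a negated class instead of a lazy dot-star.
braces = re.findall(r'\{[^}]*\}', 'a {b} c {d e}')

# DIY boundary: split where a lower-case letter meets an upper-case one.
words = re.split(r'(?<=[a-z])(?=[A-Z])', 'camelCaseString')

# Trust the dot-star: the greedy .* backtracks only as far as the last token.
# A capture group stands in for \K, which the re module does not support.
last_token = re.compile(r'@[A-Z].*(@[A-Z]+)')
two_tokens = last_token.search('@ABC @DEF')
one_token = last_token.search('@ABC')

print(loose_hit, strict_hit)
print(braces)
print(words)
print(two_tokens.group(1), one_token)
```

The single-token string produces no match at all, which is the "only if there is more than one token" requirement from the last rule.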

Two Mnemonic Devices to Check your Regexps

Greedy atoms anchor again.
Until you acquire a lot of practice, it's probably impossible to keep all these rules in mind at the same time. But remembering a few is better than remembering none, so if you're starting out, may I suggest a simple phrase to help remind yourself of tweaks that may improve the expression?
Greedy atoms anchor again.
"Greedy" reminds you to check if some greedy quantifiers should be made lazy, and vice-versa. It also reminds you of the performance hit of lazy quantifiers (backtracking at each step), and of potential workarounds.
"Atoms" reminds you to check if some parts of the expression should be made atomic (or use a possessive quantifier).
"Anchor" reminds you to check if the expression should be anchored. By extension, it may remind you of boundaries, and whether to add them or remove them.
"Again" reminds you to check if parts of the expression could use the repeating subpattern syntax.

If you prefer short mnemonic devices, you may prefer the acronym AGRA, helpful to build the Taj Mahal of regular expressions, and named after the Indian city Agra, best known for the Taj Mahal:
A for Anchor
G for Greed
R for Repeat
A for Atomic

Reducing (? … ) Syntax Confusion

What the (? … )
A question mark inside a parenthesis: So many uses! I thought I would bring them all together in one place.

I don't know the fine details of the history of regular expressions. Stephen Kleene and Ken Thompson, who started them, obviously wanted something very compact. Maybe they were into hieroglyphs, maybe they were into cryptography, or maybe that was just the way you did things when you only had a few kilobytes of RAM. The heroes who expanded regular expressions (such as Henry Spencer and Larry Wall) followed in these footsteps.

One of the things that make regexes hard to read for beginners is that many points of syntax that serve vastly different purposes all start with the same two characters: (?

In the regex tutorials and books I have read, these various points of syntax are introduced in stages. But (?: … ) looks a lot like (?= … ), so that at some point they are bound to clash in the mind of the regex apprentice. To facilitate study, I have pulled all the (? … ) usages I know about into one place. I'll start by pointing out three confusing couples; details of usage will follow.

Confusing Couples

Confusing Couple #1: (?: … ) and (?= … )
These false twins have very different jobs. (?: … ) contains a non-capturing group, while (?= … ) is a lookahead.

Confusing Couple #2: (?<= … ) and (?> … )
(?<= … ) is a lookbehind, so (?> … ) must be a lookahead, right? Not so. (?> … ) contains an atomic group. The actual lookahead marker is (?= … ). More about all these guys below.

Confusing Couple #3: (?(1) … ) and (?1)
This pair is delightfully confusing. The first is a conditional expression that tests whether Group 1 has been captured. The second is a subroutine call that matches the sub-pattern contained within the capturing parentheses of Group 1.

Now that these three "big ones" are out of the way, let's drill into the syntax.

Lookarounds: (?<= … ) and (?= … ), (?<! … ) and (?! … )

Collectively, lookbehinds and lookaheads are known as lookarounds. This section gives you basic examples of the syntax, but further down the track I encourage you to read the dedicated regex lookaround page, as it covers subtleties that need to be grasped if you'd like lookaheads and lookbehinds to become your trusted friends.

In the meantime, if there is one thing you should remember, it is this: a lookahead or a lookbehind does not "consume" any characters on the string. This means that after the lookahead or lookbehind's closing parenthesis, the regex engine is left standing on the very same spot in the string from which it started looking: it hasn't moved. From that position, the engine can start matching characters again, or, why not, look ahead (or behind) for something else, which is a useful technique, as we'll later see.

Here is how the syntax works.

Lookahead After the Match: \d+(?= dollars)
Sample Match: 100 in 100 dollars
Explanation: \d+ matches the digits 100, then the lookahead (?= dollars) asserts that at that position in the string, what immediately follows is the characters " dollars".

Lookahead Before the Match: (?=\d+ dollars)\d+
Sample Match: 100 in 100 dollars
Explanation: The lookahead (?=\d+ dollars) asserts that at the current position in the string, what follows is digits then the characters " dollars". If the assertion succeeds, the engine matches the digits with \d+. Note that this pattern achieves the same result as \d+(?= dollars) from above, but it is less efficient because \d+ is matched twice. A better use of looking ahead before matching characters is to validate multiple conditions in a password.

Negative Lookahead After the Match: \d+(?!\d| dollars)
Sample Match: 100 in 100 pesos
Explanation: \d+ matches 100, then the negative lookahead (?!\d| dollars) asserts that at that position in the string, what immediately follows is neither a digit nor the characters " dollars".

Negative Lookahead Before the Match: (?!\d+ dollars)\d+
Sample Match: 100 in 100 pesos
Explanation: The negative lookahead (?!\d+ dollars) asserts that at the current position in the string, what follows is not digits then the characters " dollars". If the assertion succeeds, the engine matches the digits with \d+. Note that this pattern achieves the same result as \d+(?!\d| dollars) from above, but it is less efficient because \d+ is matched twice. A better use of looking ahead before matching characters is to validate multiple conditions in a password.

Lookbehind Before the Match: (?<=USD)\d{3}
Sample Match: 100 in USD100
Explanation: The lookbehind (?<=USD) asserts that at the current position in the string, what precedes is the characters "USD". If the assertion succeeds, the engine matches three digits with \d{3}.

Lookbehind After the Match: \d{3}(?<=USD\d{3})
Sample Match: 100 in USD100
Explanation: \d{3} matches 100, then the lookbehind (?<=USD\d{3}) asserts that at that position in the string, what immediately precedes is the characters "USD" then three digits. Note that this pattern achieves the same result as (?<=USD)\d{3} from above, but it is less efficient because \d{3} is matched twice.

Negative Lookbehind Before the Match: (?<!USD)\d{3}
Sample Match: 100 in JPY100
Explanation: The negative lookbehind (?<!USD) asserts that at the current position in the string, what precedes is not the characters "USD". If the assertion succeeds, the engine matches three digits with \d{3}.

Negative Lookbehind After the Match: \d{3}(?<!USD\d{3})
Sample Match: 100 in JPY100
Explanation: \d{3} matches 100, then the negative lookbehind (?<!USD\d{3}) asserts that at that position in the string, what immediately precedes is not the characters "USD" then three digits. Note that this pattern achieves the same result as (?<!USD)\d{3} from above, but it is less efficient because \d{3} is matched twice.

Support for Lookarounds
All major engines have some form of support for lookarounds, with some important differences. For instance, JavaScript long supported lookahead but not lookbehind (one of the many blotches on its regex scorecard); lookbehind only arrived with ECMAScript 2018. Ruby 1.8 suffered from the same condition.

Lookbehind: Fixed-Width / Constrained Width / Infinite Width
One important difference is whether lookbehind accepts variable-width patterns.
At the moment, I am aware of only three engines that allow infinite repetition within a lookbehind, as in (?<=\s*): .NET, Matthew Barnett's outstanding regex module for Python, whose features far outstrip those of the standard re module, and the JGSoft engine used by Jan Goyvaerts' software such as EditPad Pro. I've also implemented an infinite lookbehind demo for PCRE.
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
PCRE (C, PHP, R …), Java and Ruby 2+ allow lookbehinds to contain alternations that match strings of different but pre-determined lengths, such as (?<=cat|raccoon).
Perl and Python require a lookbehind to match strings of a fixed length, so (?<=cat|racoons) will not work.

To master lookarounds, there is a bit more you should really know. For these finer details, visit the lookaround page.
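To see the four lookarounds at work, here is a small sketch using Python's standard re module (my choice of engine, not something this page prescribes). The password rule is hypothetical and only illustrates stacking several lookaheads at one position. The last check assumes the fixed-width lookbehind restriction of the standard re module described above.

```python
import re

# Positive and negative lookahead: assert what follows, consume nothing.
ahead = re.search(r'\d+(?= dollars)', '100 dollars')
not_ahead = re.search(r'\d+(?!\d| dollars)', '100 pesos')
not_ahead_fail = re.search(r'\d+(?!\d| dollars)', '100 dollars')

# Positive and negative lookbehind: assert what precedes.
behind = re.search(r'(?<=USD)\d{3}', 'USD100')
not_behind = re.search(r'(?<!USD)\d{3}', 'JPY100')

# Several lookaheads stacked at the same position (hypothetical password
# rule: at least 8 characters, one digit, one lower-case letter).
password = re.compile(r'^(?=.*\d)(?=.*[a-z]).{8,}$')

# The standard re module rejects variable-width lookbehind at compile time.
try:
    re.compile(r'(?<=cats?)\d')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(ahead.group(), ahead.end())   # the match ends before " dollars"
print(password.search('abc12345') is not None)
print(variable_width_ok)
```

Because the lookahead consumes nothing, the first match ends right after the digits even though " dollars" had to be present.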

Non-Capturing Groups: (?: … )

In regex as in the (2+3)*(5-2) of arithmetic, parentheses are often needed to group components of an expression together. For instance, the above operation yields 15. Without the parentheses, because the * operator has higher precedence than the + and -, 2+3*5-2 is interpreted as 2+(3*5)-2, yielding… er… 15 (a happy coincidence).

In regex, normal parentheses not only group parts of a pattern, they also capture the sub-match to a capture group. This is often tremendously useful. At other times, you do not need the overhead. In .NET, this capturing behavior of parentheses can be overridden by the (?n) flag or the RegexOptions.ExplicitCapture option. But in all flavors, .NET included, it is far more common to use (?: … ), which is the syntax for a non-capturing group. Watch out, as the syntax closely resembles that for a lookahead (?= … ).

For instance, (?:Bob|Chloe) matches Bob or Chloe, but the name is not captured. Within a non-capturing group, you can still use capture groups. For instance, (?:Bob says: (\w+)) would match Bob says: Go and capture Go in Group 1. Likewise, you can capture the content of a non-capturing group by surrounding it with parentheses. For instance, ((?:Bob|Chloe)\d\d) would capture "Chloe44".

Mode Modifiers within Non-Capture Groups
On engines that support inline modifiers such as (?i), you can blend the non-capture group syntax with mode modifiers (in Python's standard re module, this requires version 3.6 or later). Here are some examples:
(?i:Bob|Chloe) This non-capturing group is case-insensitive.
(?ism:^BEGIN.*?END) This non-capturing group matches everything between "begin" and "end" (case-insensitive), allowing such content to span multiple lines (the s modifier), starting at the beginning of any line (the m modifier allows the ^ anchor to match the beginning of any line).
(?i-sm:^BEGIN.*?END) As above, but turns off the "s" and "m" modifiers.
See below for more on inline modifiers.
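The grouping-without-capturing behavior is easy to confirm. This is a minimal sketch assuming Python's standard re module; the "Smith" surname and the sample strings are made up, and the scoped (?i: … ) form needs Python 3.6 or later.

```python
import re

# A non-capturing group groups the alternation but creates no capture group.
plain = re.compile(r'(?:Bob|Chloe)')
n_groups = plain.groups   # number of capture groups in the pattern

# Capture groups still work inside a non-capturing group...
inner = re.search(r'(?:Bob says: (\w+))', 'Bob says: Go')

# ...and a non-capturing group can itself be wrapped in a capturing one.
outer = re.search(r'((?:Bob|Chloe)\d\d)', 'Hello Chloe44')

# Mode modifier blended into the group: only the names are case-insensitive.
scoped = re.search(r'(?i:bob|chloe) Smith', 'CHLOE Smith')
scoped_miss = re.search(r'(?i:bob|chloe) Smith', 'CHLOE SMITH')

print(n_groups, inner.group(1), outer.group(1))
print(scoped.group(), scoped_miss)
```

The last line shows the scope of the modifier: the part of the pattern outside the group stays case-sensitive.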

Atomic Groups: (?> … )

An atomic group is an expression that becomes solid as a block once the regex leaves the closing parenthesis. If the regex fails later down the string and needs to backtrack, a regular group containing a quantifier would give up characters one at a time, allowing the engine to try other matches. Likewise, if the group contained an alternation, the engine would try the next branch. An atomic group won't do that: it's all or nothing.

Example 1: With Alternation
(?>A|.B)C
This will fail against ABC, whereas (?:A|.B)C would have succeeded. After matching the A in the atomic group, the engine tries to match the C but fails. Because the group is atomic, the engine is unable to try the .B part of the alternation, which would also succeed, and allow the final token C to match.

Example 2: With Quantifier
(?>A+)[A-Z]C
This will fail against AAC, whereas (?:A+)[A-Z]C would have succeeded. After matching the AA in the atomic group, the engine tries to match the [A-Z], succeeds by matching the C, then tries to match the token C but fails as the end of the string has been reached. Because the group is atomic, the engine is unable to give up the second A, which would allow the rest of the pattern to match.

If, before the atomic group, there were other options to which the engine can backtrack (such as quantifiers or alternations), then the whole atomic group can be given up in one go.

When are Atomic Groups Important?
When a series of characters only makes sense as a block, using an atomic group can prevent needless backtracking. This is explored in the section on possessive quantifiers. In such situations atomic groups can be useful, but not necessarily mission-critical. On the other hand, there are situations where atomic groups can save your pattern from disaster. They are particularly useful with patterns that contain lazy quantifiers whose token can eat the delimiter, and to avoid certain forms of runaway backtracking.

Supported Engines, and Workaround
Atomic groups are supported in most of the major engines: .NET, Perl, PCRE and Ruby. For engines that don't support atomic grouping syntax, such as JavaScript and Python's standard re module before version 3.11, see the well-known pseudo-atomic group workaround.

Alternate Syntax: Possessive Quantifier
When an atomic group only contains a token with a quantifier, an alternate syntax (in engines that support it) is a possessive quantifier, where a + is added to the quantifier. For instance,
(?>A+) is equivalent to A++
(?>A*) is equivalent to A*+
(?>A?) is equivalent to A?+
(?>A{…,…}) is equivalent to A{…,…}+
This works in Perl, PCRE, Java and Ruby 2+. For more, see the possessive quantifiers section of the quantifiers page.

Non-Capturing
Atomic groups are non-capturing, though as with other non-capturing groups, you can place the group inside another set of parentheses to capture the group's entire match; and you can place parentheses inside the atomic group to capture a section of the match. Watch out, as the atomic group syntax is confusingly similar to the lookbehind syntax (?<= … ).
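The two examples can be reproduced even in engines without atomic syntax. The sketch below assumes Python's standard re module and uses the common lookahead-plus-back-reference trick, which I take to be the workaround alluded to above: a lookahead captures the text, and since the engine does not backtrack into a completed lookahead, the back-reference that follows consumes exactly that text. The helper name pseudo_atomic is hypothetical, and the trick occupies capture Group 1.

```python
import re

def pseudo_atomic(inner):
    """Emulate (?>inner): capture inside a lookahead, then consume the
    capture with a back-reference. Uses capture Group 1."""
    return '(?=(' + inner + r'))\1'

# Example 1: alternation. The atomic version cannot fall back to .B
normal_alt = re.search(r'(?:A|.B)C', 'ABC')
atomic_alt = re.search(pseudo_atomic(r'A|.B') + 'C', 'ABC')

# Example 2: quantifier. The atomic version cannot give back the second A
normal_q = re.search(r'(?:A+)[A-Z]C', 'AAC')
atomic_q = re.search(pseudo_atomic(r'A+') + '[A-Z]C', 'AAC')

print(normal_alt.group(), atomic_alt)
print(normal_q.group(), atomic_q)
```

On Python 3.11 or later the native (?>A+) syntax could be used directly; the emulation is shown so the sketch also runs on older versions.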

Named Capture: (?<foo> … ), (?P<foo> … ) and (?P=foo)

When you cut and paste a piece of a pattern, Group 3 can suddenly become Group 1. That's a problem if you were using a back-reference \3 or replacement $3. One way around this problem is named capture groups. The syntax varies across engines (see Naming Groups—and referring back to them for the gory details). It's worth noting that named groups also have a number that obeys the left-to-right numbering rules, and can be referenced by their number as well as their name.

In short, the two capturing flavors are (?<foo> … ) and (?P<foo> … ). For instance, in the right engines,
^(?<intpart>\d+)\.(?<decpart>\d+)$
or
^(?P<intpart>\d+)\.(?P<decpart>\d+)$
would both match a string containing a decimal number such as 12.22, storing the integer portion to a group named intpart, and storing the decimal portion to a group named decpart.

To create a back-reference to the intpart group in the pattern, depending on the engine, you'll use \k<intpart> or (?P=intpart). To insert the named group in a replacement string, depending on the engine, you'll either use ${intpart}, \g<intpart>, $+{intpart} or the group number \1. For the gory details, see Naming Groups—and referring back to them.

To name, or not to name?
I'll admit that I don't use named groups a whole lot, but some people love them. Sure, named captures are bulkier than a quick (capture) and reference to \1, but they can save hassles in expressions that contain many groups. Do they make your patterns easier to read? That's subjective. For my part, if the regex is short, I always prefer numbered groups. And if it is long, I would rather read a regex with numbered groups and good comments in free-spacing mode than a one-liner with named groups.
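As an illustration in one specific engine, here is a short sketch assuming Python's standard re module, which uses the (?P<foo> … ) flavor for capture, (?P=foo) for back-references and \g<foo> in replacement strings. The mirror pattern is a made-up example of a named back-reference.

```python
import re

decimal = re.compile(r'^(?P<intpart>\d+)\.(?P<decpart>\d+)$')
m = decimal.search('12.22')
by_name = (m.group('intpart'), m.group('decpart'))
by_number = (m.group(1), m.group(2))   # named groups keep their numbers

# Back-reference by name: the same digits on both sides of the point.
mirror = re.compile(r'^(?P<intpart>\d+)\.(?P=intpart)$')

# Named groups in a replacement string: swap the two parts.
swapped = decimal.sub(r'\g<decpart>.\g<intpart>', '12.22')

print(by_name, by_number)
print(mirror.search('12.12') is not None, mirror.search('12.22') is not None)
print(swapped)
```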

Inline Modifiers: (?isx-m)

All popular regex flavors apart from JavaScript support inline modifiers, which allow you to tell the engine, in a pattern, to change how to interpret the pattern. For instance, (?i) turns on case-insensitivity. Except in Python, (?-i) turns it off. If a modifier appears at the head of the pattern, it modifies the matching mode for the whole pattern, unless it is later turned off. But (except in Python) a modifier can appear in mid-pattern, in which case it only affects the portion of the pattern that follows. Modifiers can be combined: for instance, (?ix) turns on both case-insensitive and free-spacing mode. (?ix-s) does the same, but also turns off single-line (a.k.a. DOTALL) mode.

Summary of inline modifiers
(?i) turns on case-insensitive mode.
Except in Ruby, (?s) activates "single-line mode", a.k.a. DOTALL mode, allowing the dot to match line break characters. In Ruby, the same function is served by (?m).
Except in Ruby, (?m) activates "multi-line mode", which allows the caret ^ and dollar $ assertions to match at the beginning and end of lines. In Ruby, (?m) does what (?s) does in other flavors: it activates DOTALL mode.
(?x) turns on the free-spacing mode (a.k.a. whitespace mode or comment mode). This allows you to write your regex on multiple lines, like on the example on the home page, with comments preceded by a #. Warning: You will usually want to make sure that (?x) appears immediately after the quote character that starts the pattern string. For instance, if you try placing it on a new line because it would look better, the engine will try matching the newline characters before it activates free-spacing mode.
In .NET, (?n) turns on "named capture only" mode, which means that regular parentheses are treated as non-capture groups.
In Java, (?d) turns on "Unix lines" mode, which means that the dot and the anchors ^ and $ only care about line break characters when they are line feeds \n.

Combining Non-Capture Group with Inline Modifiers
As we saw in the section on non-capture groups, you can blend mode modifiers into the non-capture group syntax in engines that support inline modifiers (in Python's standard re module, from version 3.6). For instance, (?i:bob) is a non-capturing group with the case-insensitive flag turned on. It matches strings such as "bob" and "boB". But don't get carried away: you cannot blend inline modifiers with any random bit of regex syntax. For instance, the following are all illegal: (?i=bob), (?iP<name>bob) and (?i>bob)

Using Inline Modifiers in the Middle of a Pattern
Usually, you'll use your inline modifiers at the start of the regex string to set the mode for the entire pattern. However, changing modes in the middle of a pattern can be useful, so I'll give you two examples.

(\b[A-Z]+\b)(?i).*?\b\1\b
This ensures that an upper-case word is repeated somewhere in the string, in any letter-case. First we capture an upper-case word to Group 1 (for instance DOG), then we set case-insensitive mode, then .*? matches any characters up to the back-reference \1, which could be dog or dOg. As a neat variation, (\b[A-Z]+\b).*?\b(?=[a-z]+\b)(?i)\1\b ensures that the back-reference is in lower-case.

^(\w+)\b.*\r?\n(?s).*?\b\1\b
This ensures that the first word of the string is repeated on a different line. First we capture a word to Group 1, then we get to the end of the line with .*, match a line break, then set DOTALL mode, allowing the .*? to match across lines, which brings us to our back-reference \1.

Unsetting all modifiers: (?^)
As of PCRE2 10.32, (?^) unsets all ismnx modifiers.
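Here is a brief sketch of the common modifiers, assuming Python's standard re module. Since Python wants whole-pattern modifiers at the very start of the pattern, every example places them there. The phone-like format in the free-spacing example is hypothetical.

```python
import re

# (?i) at the head of the pattern: case-insensitive for the whole pattern.
ci = re.search(r'(?i)bob', 'Say hi to BOB')

# (?s) lets the dot match line breaks (DOTALL).
no_dotall = re.search(r'BEGIN.*END', 'BEGIN\nEND')
dotall = re.search(r'(?s)BEGIN.*END', 'BEGIN\nEND')

# (?m) lets ^ and $ match at the start and end of every line.
line_starts = re.findall(r'(?m)^\w+', 'one two\nthree four')

# (?x) right after the opening quote: free-spacing mode with comments.
phone = re.compile(r'''(?x)
    (\d{3})   # prefix (hypothetical format)
    -
    (\d{4})   # line number
''')
parts = phone.search('call 555-0199').groups()

print(ci.group(), no_dotall, dotall is not None)
print(line_starts, parts)
```

Note how (?x) sits immediately after the opening quotes, in line with the warning above.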

Subroutines: (?1) and (?&foo)

As you well know by now, when you create a capture group such as (\d+), you can then create a back-reference to that group (for instance \1 for Group 1) to match the very characters that were captured by the group. For instance, (\w+) \1 matches Hey Hey.

In Perl, PCRE (C, PHP, R …) and Ruby 1.9+, you can also repeat the actual pattern defined by a capture group. In Perl and PCRE, the syntax to repeat the pattern of Group 1 is (?1) (in Ruby 2+, it is \g<1>).

For instance, (\w+) (?1) will match Hey Ho. The parentheses in (\w+) not only capture Hey to Group 1, they also define Subroutine 1, whose pattern is \w+. Later, (?1) is a call to subroutine 1. The entire regex is therefore equivalent to (\w+) \w+

Subroutines can make long expressions much easier to look at and far less prone to copy-paste errors.

Relative Subroutines
Instead of referring to a subroutine by its number, you can refer to the relative position of its defining group, counting left or right from the current position in the pattern. For instance, (?-1) refers to the last defined subroutine, and (?+1) refers to the next defined subroutine. Therefore, (\w+) (?-1) and (?+1) (\w+) are both equivalent to our first example with numbered group 1. In Ruby 2+, for relative subroutine calls, you would use \g<-1> and \g<+1>.

Named Subroutines
Instead of using numbered groups, you can use named groups. In that case, in Perl and PHP the syntax for the subroutine call will be (?&group_name). In Ruby 2+ the syntax is \g<some_word>. For instance, (?<some_word>\w+) (?&some_word) is equivalent to our first example with numbered group 1.

Pre-Defined Subroutines
So far, when we defined our subroutines, we also matched something. For instance, (\w+) defines subroutine 1 but also immediately matches some word characters. It so happens that Perl, PCRE and Python's alternate regex module let you pre-define subroutines without matching any characters at that time; see the section on pre-defined subroutines further down.

Subroutines and Recursion
If you place a subroutine such as (?1) within the very capture group to which it refers (Group 1 in this case), then you have a recursive expression. For instance, the regex ^(A(?1)?Z)$ contains a recursive sub-pattern, because the call (?1) to subroutine 1 is embedded in the parentheses that define Group 1.

If you try to trace the matching path of this regex in your mind, you will see that it matches strings like AAAZZZ, strings which start with any number of letters A and end with letters Z that perfectly balance the As. After you open the parenthesis, the A matches an A… then the optional (?1)? opens another parenthesis and tries to match an A… and so on.

We'll look at recursion syntax in the next section. There is also a page dedicated to recursion.

Warning
Note that the (?1) syntax looks confusingly similar to the (?(1) … ) found in conditionals.
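The difference between a back-reference and a subroutine call can be made concrete. Python's standard re module has no (?1) syntax, so in this sketch plain string composition stands in for the subroutine call (a swapped-in technique, not the feature itself); the WORD constant is a hypothetical name. The anchors are added only to make the comparison strict.

```python
import re

WORD = r'\w+'   # plays the role of "subroutine 1"

# Back-reference: the same characters must appear again.
backref = re.compile(r'^(\w+) \1$')

# Subroutine-like reuse: the same pattern is applied again.
subroutine_like = re.compile(rf'^({WORD}) {WORD}$')

print(subroutine_like.pattern)
print(backref.search('Hey Hey') is not None, backref.search('Hey Ho') is not None)
print(subroutine_like.search('Hey Ho') is not None)
```

The printed pattern is the expanded form given above for (\w+) (?1), which is why Hey Ho matches it but not the back-reference version.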

Recursive Expressions: (?R) … and old friends

A recursive pattern allows you to repeat an expression within itself any number of times. This is quite handy to match patterns where some tokens on the left must be balanced by some tokens on the right. Recursive calls are available in PCRE (C, PHP, R…), Perl, Ruby 2+ and the alternate regex module for Python.

Recursion of the Entire Pattern: (?R)
To repeat the entire pattern, the syntax in Perl and PCRE is (?R). In Ruby, it is \g<0>. For instance, A(?R)?Z matches strings or substrings such as AAAZZZ, where a number of letters A at the start are perfectly balanced by a number of letters Z at the end. The initial token A matches an A… Then the optional (?R)? tries to repeat the whole pattern right there, and therefore attempts the token A to match an A… and so on.

Recursion of a Subroutine: (?1) and (?-1)
You also have recursion when a subroutine calls itself. For instance, in ^(A(?1)?Z)$ subroutine 1 (defined by the outer parentheses) contains a call to itself. This regex matches entire strings such as AAAZZZ, where a number of letters A at the start are perfectly balanced by a number of letters Z at the end.

As we saw in the section on subroutines, you can also call a subroutine by the relative position of its defining group at the current position in the pattern. Therefore, ^(A(?-1)?Z)$ performs exactly like the above regex.

There is much more to be said about recursion. See the page dedicated to recursive regex patterns.
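To make the unfolding of A(?R)?Z tangible, the sketch below mirrors it with an ordinary recursive Python function. This is not a regex feature of Python's standard re module, which has no recursion syntax; the function names are hypothetical, and the code only imitates the order in which an engine would try things for this particular pattern.

```python
def match_balanced(s, pos=0):
    """Mimic A(?R)?Z at `pos`: return the end index of the match, or None."""
    if not s.startswith('A', pos):       # the token A
        return None
    pos += 1
    inner_end = match_balanced(s, pos)   # the optional (?R)? is tried first
    if inner_end is not None and s.startswith('Z', inner_end):
        return inner_end + 1             # the token Z after the recursion
    if s.startswith('Z', pos):           # otherwise skip the optional call
        return pos + 1
    return None

def is_balanced(s):
    """Mimic ^(A(?1)?Z)$: the whole string must be consumed."""
    return match_balanced(s) == len(s)

print(match_balanced('AAAZZZ'))   # end index of the balanced run
print(is_balanced('AAAZZZ'), is_balanced('AAAZZ'))
```

Each level of the call stack corresponds to one nested "parenthesis" of the pattern: it consumes one A on the way in and one Z on the way out.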

Conditionals: (?(A)B) and (?(A)B|C)

This section covers the basics on conditional syntax. For more, you'll want to explore the page dedicated to regex conditionals.

In (?(A)B), condition A is evaluated. If it is true, the engine must match pattern B. In the full form (?(A)B|C), when condition A is not true, the engine must match pattern C. Conditionals therefore allow you to inject some if(…) then {…} else {…} logic into your patterns.

Typically, condition A will be that a given capture group has been set. For instance, (?(1)}) says: If capture Group 1 has been set, match a closing curly brace. Likewise, (?(foo)…) checks if the capture group named foo has been set. The first form would be useful in
^({)?\d+(?(1)})$
This pattern matches a string of digits that may or may not be embedded in curly braces. The optional capture Group 1 ({)? captures an opening brace. Later, the conditional checks if capture 1 was set, and if so it matches the closing brace.

Let's expand this example to use the "else" part of the syntax:
^(?:({)|")\d+(?(1)}|")$
This pattern matches strings of digits that are either embedded in double quotes or in curly braces. The non-capture group (?:({)|") matches the opening delimiter, capturing it to Group 1 if it is a curly brace. After matching the digits, (?(1)}|") checks whether Group 1 was set. If so, we match a closing curly brace. If not, we match a double quote.

Lookaround in Conditions
In (?(A)B), the condition you'll most frequently see is a check as to whether a capture group has been set. In .NET, PCRE and Perl (but not Python and Ruby), you can also use lookarounds:
\b(?(?<=5D:)\d{5}|\d{10})\b
If the prefix 5D: can be found, the pattern will match five digits. Otherwise, it will match ten digits. Needless to say, that is not the only way to perform this task.

Checking if a relative capture group was set
(?(1)A) checks whether Group 1 was set. In PCRE, instead of hard-coding the group number, we can also check whether a group at a relative position to the current position in the pattern has been set: for instance, (?(-1)A) checks whether the previous group has been set. Likewise, (?(+1)A) checks whether the next capture group has been set. (This last scenario would be found within a larger repeating group, so that on the second pass through the pattern, the next capture group may indeed have been set on the previous pass.)

Checking if a recursion level was reached
This is not the place to be talking in depth about recursion, which has a section above and a dedicated page, but for completeness I should mention two other uses of conditionals, available in Perl and PCRE:
(?(R)A) tests whether the regex engine is currently working within a recursion depth (reached from a recursive call to the whole pattern or a subroutine).
(?(R1)A) tests whether the current recursion level has been reached by a recursive call to subroutine 1.

Availability of Regex Conditionals
Conditionals are available in PCRE, Perl, .NET, Python, and Ruby 2+. In other engines, the work of a conditional can usually be handled by the careful use of lookarounds.

Similar Syntax
Note that the (?(1)B) syntax can look confusingly similar to (?1) which stands for a regex subroutine, where the regex pattern defined by Group 1 must be matched.
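The two brace examples can be tried as they stand in Python's standard re module, which supports the group-based form of the conditional. In this sketch the braces are escaped for safety, and the helper ok and the test strings are made up for illustration.

```python
import re

# "If Group 1 was set, require a closing brace."
braces_optional = re.compile(r'^(\{)?\d+(?(1)\})$')

# "If Group 1 was set, require a closing brace; else require a double quote."
braces_or_quotes = re.compile(r'^(?:(\{)|")\d+(?(1)\}|")$')

def ok(pattern, text):
    """Return True if the compiled pattern matches the text."""
    return pattern.search(text) is not None

for text in ['123', '{123}', '{123', '123}']:
    print(text, ok(braces_optional, text))

for text in ['"123"', '{123}', '{123"', '"123}']:
    print(text, ok(braces_or_quotes, text))
```

Mismatched delimiters are rejected in both directions, which is the whole point of letting the closing delimiter depend on what was captured at the opening.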

Pre-Defined Subroutines: (?(DEFINE)(?<foo> … )(?<bar> … )) and (?&foo)

Available in Perl, PCRE (and therefore C, PHP, R…) and Python's alternate regex engine, pre-defined subroutines allow you to produce regular expressions that are beautifully modular and start to feel like clean procedural code. Within a (?(DEFINE) … ) block, you can pre-define one or several named subroutines without matching any characters at that time. You can even pre-define subroutines based on other subroutines. When you get to the matching part of the regex, this allows you to match complex expressions with compact and readable syntax—and to match the same kind of expressions in multiple places without needing to repeat your regex code. This makes your regex more maintainable, both because it is easier to understand and because you don't need to fix a sub-pattern in multiple places. But an example is worth a thousand words, so let's dive in. If you like, you can play with the pattern and sample text in this online demo. A quick note first: in case you wonder what the \ are all about, they simply match one space character. The regex is in free-spacing mode—the x flag is implied but could be made part of the pattern using the (?x) modifier. In free-spacing mode, spaces that you do want to match must either be escaped as in \ or specified inside a character class as in [ ]. 
    (?(DEFINE)                              # start DEFINE block
      # pre-define quant subroutine
      (?<quant>many|some|five)
      # pre-define adj subroutine
      (?<adj>blue|large|interesting)
      # pre-define object subroutine
      (?<object>cars|elephants|problems)
      # pre-define noun_phrase subroutine
      (?<noun_phrase>(?&quant)\ (?&adj)\ (?&object))
      # pre-define verb subroutine
      (?<verb>borrow|solve|resemble)
    )                                       # end DEFINE block

    ##### The regex matching starts here #####
    (?&noun_phrase)\ (?&verb)\ (?&noun_phrase)

This regex would match phrases such as:

    five blue elephants solve many interesting problems
    many large problems resemble some interesting cars

Note that the portion that does the matching is extremely compact and readable: (?&noun_phrase)\ (?&verb)\ (?&noun_phrase)

The subroutine noun_phrase is called twice: there is no need to paste a large repeated regex sub-pattern, and if we decide to change the definition of noun_phrase, that immediately trickles to the two places where it is used. Note also that noun_phrase itself is built by assembling smaller blocks: its code (?&quant)\ (?&adj)\ (?&object) uses the quant, adj and object subroutines. With this kind of modularity, you can build regex cathedrals. There is a beautiful example on the page with the regex to match numbers in plain English.

A Note on Group Numbering

Please be mindful that each named subroutine consumes one capture group number, so if you use capture groups later in the regex, remember to count from left to right. The gory details are on the page about Capture Group Numbering & Naming.
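Python's built-in re module does not support (?(DEFINE) … ). A comparable kind of modularity can be approximated there with a different technique: assembling the pattern from smaller strings in the host language. The sketch below mirrors the noun-phrase example; the variable names are hypothetical and chosen only to echo the subroutine names.

```python
import re

# Hypothetical building blocks, mirroring the subroutines in the DEFINE block.
quant = r"(?:many|some|five)"
adj = r"(?:blue|large|interesting)"
obj = r"(?:cars|elephants|problems)"
verb = r"(?:borrow|solve|resemble)"

# noun_phrase is assembled from the smaller blocks, as in the DEFINE version.
noun_phrase = rf"{quant} {adj} {obj}"

sentence = re.compile(rf"{noun_phrase} {verb} {noun_phrase}")

print(bool(sentence.fullmatch("five blue elephants solve many interesting problems")))
print(bool(sentence.fullmatch("many large problems resemble some interesting cars")))
```

Unlike true subroutines, this is plain string substitution: the final pattern contains every block pasted in full, and recursion is not possible.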

Branch Reset: (?| … )

If you've read the page about Capture Group Numbering & Naming, you'll remember that capture groups get numbered from left to right. Therefore, if you have two sets of capturing parentheses, they have two group numbers. Sometimes, you might wish that these two sets of parentheses might capture to the same numbered group. Perl and PCRE (and therefore C, PHP, R…) have a feature that lets you reuse a group number when capturing parentheses are present on different sides of an alternation.

This is rather abstract, so let's take an example. Let's say you want to match a number, but only in three situations:

If it follows an A, as in A00
If it precedes a B, as in 11B
If it is sandwiched between C and D, as in C22D

This poses no problem using lookahead and lookbehind, but the branch reset syntax (?| … ) gives you another, potentially more readable, option:

    (?|A(\d+)|(\d+)B|C(\d+)D)

After the initial (?|, which introduces a branch reset, the group has a three-piece alternation (two |). Each of those contains a capture group (\d+). The number of all of those capture groups is the same: Group 1.

You are not limited to one group. For instance, if you are also interested in capturing a potential suffix after the number (which can happen in the situations 11B and C22D), place another set of parentheses wherever you find a suffix:

    (?|A(\d+)|(\d+)(B)|C(\d+)(D))

Using this regex to match the string A00 11B C22D, you obtain these groups:

    Match   Group 1: Number   Group 2: Suffix
    -----   ---------------   ---------------
    A00     00                (not set)
    11B     11                B
    C22D    22                D

How Useful is Branch Reset?

When I first read about branch reset in the PCRE documentation a few years ago, I was excited and certain I'd use it often. Since then, I've written several thousand regular expression patterns, but I've used branch reset less than a handful of times. It's probably my fault for always jumping on other ways to do things first, but this leaves me with a sense that the feature is not all that useful after all. That being said, on rare occasions, it's just the most direct and elegant way of doing things. Let's look at one more example, less contrived than the first (which was pared down in order to explain the feature).

A Branch Reset Example: Tokenization with Variable Formats

To me, this is an example where branch reset seems to offer benefits over competing idioms. Suppose you want to parse strings such as

    song:"Sweet Home Alabama" fruit:apple color:blue motto:"Don't Worry"

into pairs of keys and values. When the value following the colon is between quotes, you only want the inside of the quotes. Therefore, you expect something like:

    Group 1   Group 2
    -------   -------
    song      Sweet Home Alabama
    fruit     apple
    color     blue
    motto     Don't Worry

This branch reset regex will get you there:

    (\S+):(?|([^"\s]+)|"([^"]+))

Group 1 (\S+) is a straight capture group that captures the key. In the branch reset, the two sets of capturing parentheses allow you to capture different kinds of values in different formats to the same group, i.e. Group 2. You can check the group captures in the right pane of this online regex demo.

To me, this alternative with a conditional and a lookbehind…

    (\S+):"?((?(?<!")[^"\s]+|[^"]+))

…feels a little less satisfying. But hey, it works too.
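For engines without branch reset, such as Python's built-in re module, one possible workaround (a different technique, not branch reset itself) is to let the two value formats capture to separate groups and then pick whichever one took part in the match. A minimal sketch for the key/value example, with illustrative variable names:

```python
import re

text = 'song:"Sweet Home Alabama" fruit:apple color:blue motto:"Don\'t Worry"'

# Without branch reset, the two value formats land in Groups 2 and 3.
pair = re.compile(r'(\S+):(?:([^"\s]+)|"([^"]+))')

# Pick whichever of the two value groups took part in the match.
pairs = [(m.group(1), m.group(2) or m.group(3)) for m in pair.finditer(text)]
print(pairs)
```

The extra "or" step in the host language is exactly the bookkeeping that branch reset saves.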

Inline Comments: (?# … )

By now you must be familiar with the free-spacing mode, which makes it possible to unroll long regexes and comment them, as in the many code boxes on this site. To turn on free-spacing for an entire pattern, the syntax varies:

The (?x) modifier works in .NET, Perl, PCRE, Java, Python and Ruby.
The x flag can be added after the pattern delimiter in Perl, PHP and Ruby.
.NET lets you turn on the RegexOptions.IgnorePatternWhitespace option.
Python lets you turn on re.VERBOSE.

What if you only want to insert a single comment without turning on free-spacing mode for the entire pattern? In Perl, PCRE (and therefore C, PHP, R…), Python and Ruby, you can write an inline comment with this syntax: (?# … )

For instance, in (?# the year)\d{4}, the \d{4} matches four digits, while (?# the year) tells you what we are trying to match.

How useful is this? Not very. I almost never use this feature: when I want comments, I just turn on free-spacing mode for the whole regex.
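To make the comparison concrete, here is a short sketch in Python showing the same four-digit pattern written once with an inline comment and once in free-spacing mode via re.VERBOSE. The sample sentence is made up for the example.

```python
import re

# Inline comment: ignored by the engine, no special mode required.
inline = re.compile(r"(?# the year)\d{4}")

# Free-spacing alternative: whitespace and # comments are ignored.
verbose = re.compile(
    r"""
    \d{4}   # the year
    """,
    re.VERBOSE,
)

print(inline.search("Released in 1999").group())
print(verbose.search("Released in 1999").group())
```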

Code Capsule: (?{…})

Perl regex has a magical feature: the ability to insert fragments of code to be executed in the middle of a pattern evaluation. The syntax (?{…}) creates a code capsule. When the engine encounters it, Perl executes the statements within the curly braces {…}. A delightful example is given in the section about what makes Perl special. Here is a more basic one.

Consider the pattern (?:[a-z])+, which simply matches a sequence of lower-case letters. In the non-capture group, we'll inject a code capsule that prints the temporary match, which Perl represents with the $& variable. The code is {print "Temp match: '$&'\n";}, and the capsule is (?{print "Temp match: '$&'\n";}). This allows you to see the match being built, as shown in the output below.

    if ('abcd' =~ /(?:[a-z](?{print "Temp match: '$&'\n";}))+/ ) {}
    # Output:
    # Temp match: 'a'
    # Temp match: 'ab'
    # Temp match: 'abc'
    # Temp match: 'abcd'

PCRE Callouts: (?C…)

The (?C…) token enables PCRE to provide a similar feature to Perl's code capsules. When the token is encountered, if a callout function has been specified when invoking the match function, the engine temporarily suspends the matching and passes control to the callout function. For details, please see my page about PCRE callouts.

Version Check: (?(VERSION>=x) … )

This is a feature I am proud to have suggested to Philip Hazel, the father of PCRE. In PCRE2 (versions of PCRE 10.0 and beyond), you can check what version of PCRE you are using. This is useful because PCRE is often embedded in environments such as Apache, PHP or text editors such as Notepad++, where you may not know which version of PCRE is being used, and therefore which features are available. To check whether the current version is 10 or later, use something such as YES as your subject, and try to match it with this pattern: (?(VERSION>=10)YES) If it matches, the version is 10 or later. As another example, you could use LATER // EARLIER as your subject, and match it with this: (?(VERSION>=10.5)LATER|EARLIER) Depending on your version, PCRE2 will either match LATER or EARLIER.

Ignore ALL Unescaped Whitespace: (?xx)

As of PCRE 10.30 (and some version of Perl I haven't checked), this flag tells the engine to ignore all unescaped whitespace in the pattern, including inside character classes. It is therefore like (?x) on steroids.

Disable capture groups: (?n)

In .NET and as of PCRE 10.30 (and some version of Perl I haven't checked), this flag tells the engine to treat all groups as non-capture groups, so that (this) becomes equivalent to (?:this)

Subject: Redundant \d

In a paragraph "Negative Lookahead After the Match": I believe, that the second "\d" in a regex \d+(?! \d| dollars) is just unnecessary, as the part "\d+" will eat all the digits in a row, as quantifiers are greedy by default. So it is technically impossible, that "\d+" is followed by yet another digit. Am I right?

Reply to blixen

Hi blixen, The \d in the negative lookahead does serve a purpose: with what you suggest, i.e. \d+(?! dollars) we would match "100" in "1001 dollars" Regards, Rex

Subject: Unbelievable

The most interesting tutorial on subject of the WWW!! "If, before the atomic group, there were other options to which the engine can backtrack (such as quantifiers or alternations), then the whole atomic group can be given up in one go. " What does this line mean? Will it backtrack past the atomic group and try it again fresh when it advances to it again?

Reply to Anthony

time, it give it up in one go (one block). Then of course if it resumes its forward motion and reaches the group again, it tries it again.

Subject: essence of the (?

I found this page while trying to hone in the "essence" of the (? In regex. While I realize that the subsets that all share this mark are widely varied is it safe to say they all share the distinction of being a non-capturing group? Thanks in advance for your reply and… Keep up the good work! Troy D.

Reply to Troy Dalmasso

Hi Troy, I sympathize with your desire to distill, but IMO the direction you're going to try to summarize (?...) will not be a useful conceptual construct to you in the long run. For instance (?i) turns on case-insensitivity. In Perl regex, (?{print "$&\n";}) is a capsule that executes a bit of Perl code. And what about (?(1)(?!)) which means fail if Group 1 is captured? (?...) is just a nail on which a lot of unrelated regex syntax hangs. If you make peace with that fact, I think your experience will be smoother. Kindest regards, Rex

Subject: Removing Confusion Around (? Regex Syntax

This topic is very well written and much appreciated. Distills large works like Friedl's book into an easily digestible quarter of an hour. I look forward to reading the rest!

Subject: RE: Your banner regex

Thanks Rex, you really made me laugh!! I see you always have the same excellent sense of humor as in your (brilliant) articles & tutorials! Thank you for this great site and for the joke :) (and for the new regex) Greetings from (the south of) France! Xavier Tello

Reply to xtello

Hi Xavier, Thank you for your very kind encouragements! If only everyone could be like you. When the technology becomes available, would you mind if I get back in touch in order to clone you? Wishing you a fun weekend, Rex

Subject: Your banner regex

I looked at the regex displayed in your banner… Applying this regex to the string [spoiler] will produce [spoiler] (if I'm not wrong!). What's this easter egg? ;-)

Reply to xtello

Hi Xavier, Thank you for writing, it was a treat to hear from you. Wow, you are the first person to notice! In fact, you made me change the banner to satisfy your sense of completion (and make it harder for the next guy).
> What's this easter egg?
This Easter Egg (pun intended, I presume) is that you are the grand winner of a secret contest. From the time I launched the site, I had planned that the first person to discover this would win a free trip to the South of France. You won!!! :) :) :) Wishing you a beautiful day, Rex

Subject: Little question about capture

Hi Andy. Thank you for all these articles, they are amazing! I learn a lot with this website. So glad to have found it! Like they said: Best resource on internet :) I tried some of your examples, and I'm stuck with one of them: (?:(\()|-)\d{6}(?(1)\)). When I'm trying "(111111)" with "preg_match_all", it captures "(". Do you think it's possible to bypass this capture? When I use "-222222", it catches an empty string… And I don't understand why. Could you please explain this? Thank you Andy! And again: Nice work!

Reply to Nicolas

Hi Nicolas, Run this:

    $regex='~(?:(\()|-)\d{6}(?(1)\))~';
    $string='(such as "(444444)"), or it is preceded by a minus sign (such as "-333333").';
    preg_match_all($regex,$string,$m);
    var_dump( $m );

You will see that the MATCHES are (444444) and -333333. The CAPTURES are "(" and "". The captured left par is what makes the ?(1) work later in the regex. Let me know if this is still unclear.

I enjoyed reading this article and learnt a lot. Thanks for your wonderful work. :)

Subject: Brilliant

Best resource I've found yet on regular expressions. Much appreciate the work you put into this. Why not create an eBook that could be downloaded? I for one would willingly cough up a few dollars. Regards Vin

Reply to Vin

Hi Vin, Thank you very much for your encouragements, and also for your suggestion. I've been itching to make a print-on-demand book with the lowest price possible, to make it easy to read offline. Will probably do that as soon as they extend the length of a day to 49 hours. Wishing you a fun weekend, Andy

Regex Boundaries and Delimiters—Standard and Advanced

Although this page starts with the regex word boundary \b, it aims to go far beyond: it will also introduce less-known boundaries, as well as explain how to make your own (DIY Boundaries).

Boundaries vs. Anchors

Why are ^ and $ called anchors while \b is called a boundary? These tokens have one thing in common: they are assertions about the engine's current position in the string. Therefore, none of them consume characters. Anchors assert that the current position in the string matches a certain position: the beginning, the end, or in the case of \G the position immediately following the last match. In contrast, boundaries make assertions about what can be matched to the left and right of the current position. The distinction is blurry. Typically, you would translate ^ as something like "assert that the current position is the beginning of the string". But if you were in a mood to play with logic, you could say: Imagine that a string is a space between two walls—one to the left and one to the right. All the positions in the string are within that space. Then we could translate the ^ anchor as:
Assert that immediately to the left of the current position, we can find the left wall, while to the right of the current position we cannot find the left wall.
Yep, in that light, our anchor is a boundary—we look left and right. We'll keep anchors and boundaries on separate pages because there's a lot of ground to cover, but just keep that in mind.

Word Boundary: \b

The word boundary \b matches positions where one side is a word character (usually a letter, digit or underscore; see below for variations across engines) and the other side is not a word character (for instance, it may be the beginning of the string or a space character). The regex \bcat\b would therefore match cat in a black cat, but it wouldn't match it in catatonic, tomcat or certificate. Removing one of the boundaries, \bcat would match cat in catfish, and cat\b would match cat in tomcat, but not vice-versa. Both, of course, would match cat on its own.

Word boundaries are useful when you want to match a sequence of letters (or digits) on their own, or to ensure that they occur at the beginning or the end of a sequence of characters. Be aware, though, that \bcat\b will not match cat in _cat or in cat25 because there is no boundary between an underscore and a letter, nor between a letter and a digit: these all belong to what regex defines as word characters. If you want to create a "real word boundary" (where a word is only allowed to have letters), see the recipe below in the section on DIY boundaries.

Difference between Engines

As you can see on the regex cheat sheet, \b behaves differently depending on the engine:

In PCRE (PHP, R…) with the Unicode mode turned off, JavaScript and Python 2.7, it matches where only one side is an ASCII letter, digit or underscore.
In PCRE (PHP, R…) with the Unicode mode turned on, .NET, Java, Perl, Python 3 and Ruby, it matches a position where only one side is a Unicode letter, digit or underscore.
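The cat examples above can be checked with a few lines of Python. This is only a sketch: the helper name finds is made up, and the results reflect Python 3's re module, where word characters are Unicode-aware by default.

```python
import re

def finds(pattern, text):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# \bcat\b needs a boundary on both sides of cat.
for word in ["a black cat", "catatonic", "tomcat", "certificate", "_cat", "cat25"]:
    print(word, finds(r"\bcat\b", word))

# With a single boundary, only one side is constrained.
print(finds(r"\bcat", "catfish"), finds(r"cat\b", "tomcat"))
```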

Not-a-word-boundary: \B

\B matches all positions where \b doesn't match. Therefore, it matches:

When neither side is a word character, for instance at any position in the string $=(@-%++) (including the beginning and end of the string)
When both sides are a word character, for instance between the H and the i in Hi!

This may not seem very useful, but sometimes \B is just what you want. For instance:

\Bcat\B will find cat fully surrounded by word characters, as in certificate, but neither on its own nor at the beginning or end of words.
cat\B will find cat both in certificate and catfish, but neither in tomcat nor on its own.
\Bcat will find cat both in certificate and tomcat, but neither in catfish nor on its own.
\Bcat|cat\B will find cat in embedded situations, e.g. in certificate, catfish or tomcat, but not on its own.

Difference between Engines

In all engines that support it, \B matches positions that are not matched by \b. Since \b behaves differently in various engines, see \b engine variations a few paragraphs above.
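The four \B variations listed above can be tried side by side. A minimal Python sketch, with a made-up helper named matching and the four sample words from the text:

```python
import re

words = ["certificate", "catfish", "tomcat", "cat"]

def matching(pattern):
    """List the sample words in which the pattern finds a match."""
    return [w for w in words if re.search(pattern, w)]

print(matching(r"\Bcat\B"))      # cat surrounded by word characters
print(matching(r"cat\B"))        # cat followed by a word character
print(matching(r"\Bcat"))        # cat preceded by a word character
print(matching(r"\Bcat|cat\B"))  # cat embedded on at least one side
```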

Left- and Right-of-Word Boundaries

The PCRE (PHP, R, …) version 8.34+ and MySQL engines support the POSIX character classes for the beginning-of-word boundary [[:<:]] and the end-of-word boundary [[:>:]].

[[:<:]]cat matches cat in the word on its own as well as in catfish, but neither in tomcat nor in certificate.
cat[[:<:]] never matches, as a word cannot start in the middle of a word.
cat[[:>:]] matches cat in the word on its own as well as in tomcat, but neither in catfish nor in certificate.
[[:>:]]cat never matches, as a word cannot end in the middle of a word.

For MySQL, the definition of a word character is an ASCII letter, digit or underscore, and this set of characters drives the interpretation of these "start of word" and "end of word" boundaries.

PCRE offers these boundaries as a convenience for occasions when someone might want to paste POSIX regex into a PCRE-powered language (or, more likely, switch the regex library used by an old C program), but the engine makes the following substitutions before starting the match:

The start of word boundary [[:<:]] is converted to \b(?=\w)
The end of word boundary [[:>:]] is converted to \b(?<=\w)

Therefore, the "start of word" and "end of word" boundaries derive their meaning from the \b boundary. In non-Unicode mode, \b matches a position where only one side is an ASCII letter, digit or underscore. In Unicode mode, it matches a position where only one side is a Unicode letter, digit or underscore.

Other Engines

I've never yet encountered a situation where I wished I had one of these boundaries. Most likely, if it ever arises, I automatically solve it by using lookarounds. If you ever want to use these specific boundaries in a language that doesn't support them, one solution among several is to copy the patterns (from two paragraphs above) that PCRE uses to convert the boundaries to regular syntax.

Making Your Own Boundaries

Finding a boundary between a word character and a non-word character is convenient, and we can thank \b for that. But there are many other cases where we could use a boundary for which regex does not provide explicit syntax. For instance, how do you match the position between a letter and a digit? We'll make this exact boundary further down, but let's get there at a comfortable pace.

Delimiters

As a first example, let's look at a line in an email reply:

    > and then she told him she wouldn't settle for less than a Hawaiian pizza, and

Let's say we want a boundary that finds the position between the > and an ASCII letter. As a first approach, we could use a lookbehind. Assuming we're in multi-line mode, where the anchor ^ matches at the beginning of any line, the lookbehind (?<=^> ) asserts that what precedes the current position is the beginning of a line, then a "greater-than" symbol > and a space. Therefore, something like (?<=^> )\w+ would find the first word of the line.

This works, but I would not call (?<=^> ) a boundary. Whereas a boundary asserts that there is a difference between what lies to the left and what lies to the right, our lookbehind only looks in one direction. If we used it on its own, it would match after the space character in > >>>: it doesn't care about what follows. It is what I would call a delimiter, rather than a boundary.

Delimiters are very useful, and they are a major source of business for regex lookarounds. For instance, .*?(?=END) would match an entire line up to, but not including, the word END: the lookahead (?=END) serves as an ending delimiter. Likewise, (?<=START) serves as a beginning delimiter in (?<=START).*, which matches an entire line after, but not including, the word START. Further down, we will look at a useful technique: double-negative delimiters.

Boundaries: Look Left and Right

To finish our boundary for the position following the start of an email reply line and preceding a letter, we also need to look to the right. We do that by adding a lookahead after the lookbehind: (?<=^> )(?=[a-zA-Z])

After asserting that what precedes the current position is a "greater than" and a space, we assert that what follows is a letter. Note that the order of the lookahead and the lookbehind does not matter, as they do not consume any characters: they look to the left and to the right with our feet firmly planted in the same spot in the string. Therefore, the reverse-order boundary (?=[a-zA-Z])(?<=^> ) works equally well. After either of these patterns, we can confidently use any regex meta-character (such as the dot) and be sure that it will match a letter: they are true boundaries.

Generalizing the idea: home-made word boundary

We can use this technique to construct any boundary we like. The coming sections will show some examples in detail, but to whet our appetite, how would you build a word boundary if your regex engine didn't support \b?

When it matches on the left of word characters, a word boundary is able to check that what follows is a word character but what precedes is not. In lookaround terms, this is (?=\w)(?<!\w).
When it matches on the right of word characters, a word boundary is able to check that what precedes is a word character but what follows is not. In lookaround terms, this is (?<=\w)(?!\w)

A word boundary must match either of these positions. Grouping them together inside an alternation, our homemade word boundary becomes: (?:(?=\w)(?<!\w)|(?<=\w)(?!\w))

Yes, \b is a bit shorter.
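One way to convince yourself that the homemade word boundary behaves like \b is to list the positions where each one matches. The following Python sketch does that on a made-up sample string; the helper name positions is an illustration, not part of any library.

```python
import re

# The homemade boundary from the text: start-of-word or end-of-word position.
HOMEMADE_B = r"(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))"

def positions(pattern, text):
    """Return the offsets of every (zero-width) match of the pattern."""
    return [m.start() for m in re.finditer(pattern, text)]

sample = "Hi! a black cat_25"
print(positions(r"\b", sample) == positions(HOMEMADE_B, sample))
print(positions(HOMEMADE_B, "a cat"))
```

Comparing positions rather than matched text is the natural check here, since both patterns are zero-width.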

DIY Boundary Workshop: "real word boundary"

With some variations depending on the engine, regex usually defines a word character as a letter, digit or underscore. A word boundary \b detects a position where one side is such a character, and the other is not. In the everyday world, most people would probably say that in the English language, a word character is a letter. Others might allow for hyphens. In some situations, it might therefore be useful to have a "real word boundary" that detects the edge between an ASCII letter and a non-letter. How do we do that?

As a start, with lookarounds you can make a left-side and a right-side boundary: (?i)(?<=^|[^a-z])cat(?=$|[^a-z])

The left side asserts that what precedes is either the beginning of the string or a character that is a non-letter. The right side asserts that what follows is either the end of the string or a non-letter.

Your next step could be to combine the two to form a boundary that can be popped on either side: (?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])

On the left side of the alternation, we have our earlier left boundary, and we add a lookahead to check that what follows is a letter. On the right side of the alternation, we have our earlier right boundary, and we add a lookbehind to check that what precedes us is a letter.

Needless to say, if you need to paste this wherever you want a "real word boundary", this is a bit heavy. With engines that support pre-defined subroutines, namely Perl and PCRE (PHP, R, …), you can define the boundary once and for all, then use it wherever you like by referring to its name:

    (?x)                        # free-spacing mode
    (?(DEFINE)                  # Define some subroutines
      (?<alphaB>                # Define "alphaB" boundary
        # This boundary matches when
        # only one side is a letter
        (?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
      )                         # End alphaB definition
    )                           # End DEFINE

    # The actual regex matching starts here
    # We can use our "alphaB" boundary wherever we like
    (?&alphaB)cat(?&alphaB)

This would work really well as a component of a large parsing regex.
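Python's built-in re module supports neither DEFINE nor a lookbehind that alternates ^ with a character, so the boundary above cannot be pasted there as is. The sketch below is an adaptation under that constraint, not the pattern from the text: it swaps the (?<=^|[^a-z]) and (?=$|[^a-z]) parts for the negative lookarounds (?<![a-z]) and (?![a-z]), which accept the same positions outside multiline mode, and it stores the boundary in a hypothetical constant named ALPHA_B.

```python
import re

# Adaptation for Python's re, whose lookbehind cannot mix ^ with a character:
# (?<![a-z]) means "start of string or a non-letter behind us".
ALPHA_B = r"(?:(?<![a-z])(?=[a-z])|(?<=[a-z])(?![a-z]))"

real_word_cat = re.compile(ALPHA_B + "cat" + ALPHA_B, re.IGNORECASE)

for text in ["_cat", "cat25", "tomcat", "catfish"]:
    print(text, bool(real_word_cat.search(text)))
```

Unlike \bcat\b, this version finds cat in _cat and cat25, because underscores and digits no longer count as part of a word.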

DIY Boundary: between a letter and a digit

Once we have this recipe, producing boundaries is simple. For instance, with minor tweaks, we can produce a boundary that matches between ASCII letters and digits. I called this pre-defined boundary by the descriptive name A1.

    (?x)                        # free-spacing mode
    (?(DEFINE)                  # Define some subroutines
      (?<A1>                    # Define "A1" boundary
        # This boundary matches when
        # one side is a letter and
        # the other is a number
        (?i)(?<=^|\d)(?=[a-z])|(?<=[a-z])(?=$|\d)
      )                         # End A1 definition
    )                           # End DEFINE

    # The actual regex matching starts here
    # We can use our "A1" boundary wherever we like
    (?&A1)cat(?&A1)

If your engine doesn't support pre-defined subroutines, you would have to paste this monster in your regex: (?:(?i)(?<=^|\d)(?=[a-z])|(?<=[a-z])(?=$|\d))
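A zero-width boundary like this one is also handy for splitting. The Python sketch below uses a simplified variant of A1 that drops the string-edge alternatives (^ and $) and only keeps the positions where a letter actually meets a digit; that simplification and the constant name LETTER_DIGIT are assumptions made for the example. Splitting on an empty match requires Python 3.7 or later.

```python
import re

# Positions where a digit meets a letter, in either order (string edges left out).
LETTER_DIGIT = r"(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)"

parts = re.split(LETTER_DIGIT, "abc123def", flags=re.IGNORECASE)
print(parts)
```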

Double Negative Delimiter: Character, or Edge of String

In this section I would like to introduce you to a useful family of delimiters that use a fiendish technique: double negative delimiters. Consider the string 0# 1 #2 #3# 4# #5. In this string, we want to match 0, 3 and 5, i.e. digits where each side is either a hash or one of the edges of the string. One first thought might be to use a capture group: (?:^|#)(\d)(?:$|#). This exactly performs the task specified in the previous paragraph—first matching either the beginning of the string or a hash, then a digit, then either the end of the string or a hash. The desired digits are captured to Group 1. To get rid of the capture group, you will probably think of using lookarounds: (?<=^|#)\d(?=$|#). This is nearly exactly the same as the first regex, except that the sides are no longer matched, but just checked with a lookbehind and a lookahead. This works in .NET, PCRE (C, PHP, R, …), Java and Ruby (or Python with the regex module), but not in other engines as traditional lookbehind must have a fixed width (see Lookbehind: Fixed-Width / Constrained Width / Infinite Width). In Perl, you can get around this problem with (?:^|#\K)\d(?=$|#), where we match the left-side hash (if any) then drop it with the \K. This would also work in PCRE and Ruby. But here is the solution I would like to introduce you to: (?<![^#])\d(?![^#]) This is a bit of a brain twister. On the left side, the negative lookbehind (?<![^#]) asserts that what precedes the current position is not one character that is not a hash. Flipping the double negative back to a positive assertion, this says that if there is a character behind us, it must be a hash. What is allowed behind us is therefore either a hash character or "not a character" (the beginning of the string). Why the double negative? Isn't that the same as the positive lookbehind (?<=#)? Well, no: this positive lookbehind requires a hash character—whereas we also want to allow the absence of any character on the left. 
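The double negative pattern can be tried directly in Python, where both lookarounds are single-character and therefore fixed-width. As a quick sanity check, this sketch compares it with the capture-group version from the start of the section on the same sample string.

```python
import re

sample = "0# 1 #2 #3# 4# #5"

# Double negative delimiters: no non-hash character on either side.
DOUBLE_NEGATIVE = r"(?<![^#])\d(?![^#])"
print(re.findall(DOUBLE_NEGATIVE, sample))

# Capture-group version: the delimiters are consumed, the digit is in Group 1.
CAPTURE_VERSION = r"(?:^|#)(\d)(?:$|#)"
print(re.findall(CAPTURE_VERSION, sample))
```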
The negative lookahead at the end of the pattern follows the same principle: (?![^#]) asserts that what follows is not a character that is not a hash, i.e., if it is a character, it must be a hash.

Limitation

This technique works for single-line strings. As soon as you move to multiple lines, 0# no longer matches at the beginning of lines 2 and beyond. That is because there is a character before the 0: the \n, and it is not a hash. Likewise, #5 no longer matches at the end of any line but the last, because there is now a line break character (not a hash) after the 5.

Extension

To get your eyes accustomed to the technique, let's apply it to other tasks.

To match A, B or E in A0 1B1 2C D3 4E, i.e. capital letters that have either a digit or a string-end on each side, you can use this pattern: (?<!\D)[A-Z](?!\D)
To match A, C or F in A -B- C -D -E F, i.e. capital letters that have either a space or a string-end on each side, you can use this pattern: (?<!\S)[A-Z](?!\S)
Finally, an unlikely example: to match the tilde, hash or colon in ~A ? 2! _#4 @5 6:, i.e. special characters that have either a word character or a string-end on each side, you can use this pattern: (?<!\W)[~#:@?!](?!\W)

Subject: Nicely done

Well presented and great examples, thank you.

Subject: Thanks

Just wanted to say thanks for the good explanation of anchors and boundaries.

Subject: boundaries

Ahh, Finally to find the logical brain that can express concisely.

Regex Anchors

Anchors belong to the family of regex tokens that don't match any characters, but that assert something about the string or the matching process. Anchors assert that the engine's current position in the string matches a well-determined location: for instance, the beginning of the string, or the end of a line. This type of assertion is useful for several reasons. First, it is expressive: it lets you specify that you want to match digits at the end of a line, but not anywhere else. Second, it is efficient: when you tell the engine that you want to find a complex pattern at a given location, it doesn't have to spend time trying to find that pattern at many other locations. This is why the regex style guide recommends using anchors whenever possible, even when your regex would match without them.

Anchors vs. Boundaries and Delimiters

Why is ^ called an anchor while \b is called a boundary and (?<!.) is called (by me) a delimiter? These semantic questions are explored in Boundaries vs. Anchors on the boundaries page, and in Anchors Seen as Delimiters lower on this page.

Caret ^: Beginning of String (or Line)

In most engines (but not Ruby), when multiline mode hasn't been turned on, the caret anchor ^ asserts that the engine's current position in the string is the beginning of the string. Therefore, ^a matches an a at the beginning of the string.

Ruby

In Ruby, ^ always asserts that the engine's current position in the string is the beginning of a line, which is either the beginning of the string or a position following a line feed character \n. Therefore, ^a matches the a on both of these lines:

    apple
    apricot

Multiline Mode: ^ Matches on Every Line

Apart from Ruby, all engines allow you to turn on a mode where the anchors ^ and $ match on every line. That mode is often turned on with an inline modifier (?m), but this varies across engines and applications, and the Modifiers page has instructions for how to turn on multiline mode in various languages. When that mode is set, ^a matches the a on both of these lines:

    apple
    apricot
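The effect of multiline mode on ^ is easy to observe in Python, one of the engines where the mode is off by default. A minimal sketch using the two-line apple / apricot sample:

```python
import re

text = "apple\napricot"

print(re.findall(r"^a", text))                # default: start of string only
print(re.findall(r"^a", text, re.MULTILINE))  # multiline: start of every line
print(re.findall(r"(?m)^a", text))            # same, using the inline modifier
```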

Dollar $: End of String (or Line)

The dollar anchor changes its meaning depending on which engine we use and on whether multiline mode is on. Let's start with the narrowest meaning, then gradually expand.

All Engines: Match at the Very End of the String

All engines agree on one string position where the dollar anchor $ is allowed to match: at the very end of the string. For instance, if your string is the apple, the $ anchor matches the position after the final e, and e$ matches the final e, but not the first. For convenience, most engines build some flexibility into $, allowing it to match before final line breaks; we'll examine this in the following sections. JavaScript, however, does not have this kind of flexibility (which won't surprise you if you remember that JavaScript is by a long shot the worst among all the major regex engines). If you add a line break at the end of your the apple string, as in the apple\n, the JavaScript engine no longer matches the final e.

All Engines Except JavaScript: Match Before One Final Line Break

In all major engines except JavaScript, if the string has one final line break, the $ anchor can match there. For instance, in the apple\n, e$ matches the final e. This is convenient: when you want to match some characters at the end of a string, you don't have to worry about whether it might have a line break at the end. For all engines except PCRE, the final line break character must be a linefeed character \n. But in PCRE (C, PHP, R…), the definition of a line break for the purpose of the $ anchor depends on how PCRE was compiled. Depending on the language and tool, that final newline may be allowed to be a linefeed, a carriage return, a carriage-return-and-linefeed sequence (CRLF), or any of the above.
Apart from these potential differences resulting from how PCRE was compiled, PCRE also lets you override the default newline (and therefore the behavior of the $ anchor) with one of PCRE's special beginning-of-pattern modifiers, such as (*ANYCRLF), (*CR) and (*ANY).

Multiline Mode (and Ruby Default): $ Matches on Every Line

In Ruby, the $ anchor always matches at the end of a line. For instance, e$ matches the e on both of these lines: apple orange Other engines (including JavaScript) also allow the $ to match at the end of a line when you turn on the multiline mode (follow the link for how to turn on multiline in various languages). At first, Ruby's insistence on allowing ^ and $ to match on every line may seem annoying because it is not standard. However, once you learn about anchors such as \A and \Z, you see that this way of proceeding makes a lot of sense as it gives anchors an unambiguous meaning.
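The three meanings of $ can be checked in a few lines. This is only a sketch in Python's re module (one of the engines that lets $ match before a single final line feed); the variable names are mine.

```python
import re

# $ matches at the very end of the string...
at_end = re.search(r"e$", "the apple")

# ...and, in Python, also just before one final line feed.
before_final_newline = re.search(r"e$", "the apple\n")

# With two trailing line feeds, the e no longer sits before the *final* one.
two_newlines = re.search(r"e$", "the apple\n\n")

# In multiline mode, $ matches at the end of every line.
line_ends = [m.start() for m in re.finditer(r"(?m)e$", "apple\norange")]

print(at_end.start(), before_final_newline.start())  # 8 8
print(two_newlines)                                  # None
print(line_ends)                                     # [4, 11]
```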

Beginning of String: \A

In all major regex flavors except JavaScript, the \A anchor asserts that the engine's current position in the string is the beginning of the string. Therefore, in the following string, \Aa matches the a in apple but not apricot: apple apricot Of course the ^ anchor also matches that position. But since ^ changes its meaning when multiline is on, \A gives us an unambiguous anchor to ensure we really only ever match at the beginning of the string.
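As a quick check of that last point, the sketch below (Python's re module, illustrative names) turns multiline mode on and compares ^ with \A on the same two-line string.

```python
import re

text = "apple\napricot"

# With multiline mode on, ^ matches at the start of every line...
caret_hits = [m.start() for m in re.finditer(r"^a", text, re.MULTILINE)]

# ...while \A stays pinned to the start of the string.
backslash_a_hits = [m.start() for m in re.finditer(r"\Aa", text, re.MULTILINE)]

print(caret_hits)        # [0, 6]
print(backslash_a_hits)  # [0]
```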

Very End of String: \z

In all major regex flavors except JavaScript and Python (where this token does not exist), the \z anchor asserts that the engine's current position in the string is the very end of the string (past any potential final line breaks). Therefore, in the following string, e\z matches the e in orange but not apple: apple orange Of course the $ anchor also matches that position. But since $ changes its meaning when multiline is on, \z gives us an unambiguous anchor to ensure we really only ever match at the very end of the string. Since the \A anchor specifies the very beginning of the string, we could have expected another capital letter for the end of the string—rather than the lowercase \z. Instead, the \Z (which we'll see next) gives us some flexibility around the line break (except in Python). Therefore, it's helpful to think of a string as going from \A to \z.

End of String (or Before Optional Line Break): \Z

This flexible "end-of-string" anchor behaves differently depending on the engine.

In Python, the token \Z does what \z does in other engines: it only matches at the very end of the string.

In .NET, Perl and Ruby, \Z is allowed to match before a final line feed. Therefore, in the string apple\norange\n, e\Z will match the final e (the e of orange).

In Java, \Z will also match before a final line break character, which may be a line feed, a carriage return or another line separator.

In PCRE (C, PHP, R…), \Z will also match before a final line break character. As discussed earlier, what PCRE defines as a line break character depends on how it was compiled, as well as on the potential presence of PCRE's special beginning-of-pattern modifiers, such as (*ANYCRLF), (*CR) and (*ANY).
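Since Python is the odd one out here, a short sketch may help. It uses Python's re module on the apple\norange\n string; the lookahead in the last pattern is my own suggestion for imitating the flexible \Z of other engines, not something taken from any engine's documentation.

```python
import re

text = "apple\norange\n"

# In Python, \Z only matches at the very end, so the trailing \n blocks e\Z.
strict_end = re.search(r"e\Z", text)

# $ (without multiline mode) is allowed to match before the final line feed.
flexible_end = re.search(r"e$", text)

# One way to imitate the flexible \Z of other engines: an optional \n, then the end.
emulated = re.search(r"e(?=\n?\Z)", text)

print(strict_end)            # None
print(flexible_end.start())  # 11
print(emulated.start())      # 11
```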

Bringing Some Order: Multiline All the Time

There is a good argument for always leaving multiline mode on in order to give clear roles to all the anchors we've seen so far. If you do this, you know that $ and ^ always match on every line. And if you need to match specifically at the beginning or the end of the string, you use \A, \Z or \z. In fact, this is Ruby's behavior. That is also where Perl 6 is headed. To me, this idea makes a lot of sense: it gets rid of the multiline mode m and makes anchors more meaningful.

Beginning of String or End of Previous Match: \G

.NET, PCRE (C, PHP, R…), Java, Perl and Ruby support \G, a useful anchor that can match at one of two positions: The beginning of the string, The position that immediately follows the end of the previous match. Among other uses, \G can come in handy in tokenized strings when you want to match tokens in certain areas of the string but not in others. Consider for instance this string showing Jane and Tarzan's times on three separate swim tests: Tarzan A:33 B:32 C:36 Jane A:35 B:33 C:31 If we are only interested in matching Jane's scores, we can use: (?:Jane|\G) \w+:(\d+) There are other ways to do this, especially if you have infinite lookbehind (.NET), but this approach is particularly economical. How does it work? When the engine tries to match at the beginning of the string, the first token (?:Jane|\G) succeeds because \G matches at the beginning of the string. However, the next token (a space character) fails against Tarzan's T. The next chance for the pattern to match is at the position preceding Jane. The engine matches "Jane A:35", capturing the 35 to Group 1. At the starting position of the next match attempt, \G matches, and the engine matches " B:33". Finally, \G matches again, and the engine matches " C:31". Incidentally, in PCRE, Perl and Ruby, you don't need to retrieve the times from Group 1: you can match them directly with this small variation, where \K tells the engine to drop what it matched so far from the match to be returned: (?:Jane|\G) \w+:\K\d+ "Beginning of String" Match: Using or Bridling \G The fact that \G matches at the beginning of the string is neither convenient nor inconvenient. Half the time, we use that property. The other half, we work around it. For our second example, consider this string, which might represent two potential positions for placing a "submarine" on a paper grid in preparation for a naval battle: A1B1C1vsA1A2A3 Each position (on either side of vs) has three tokens composed of one letter and one digit. 
Suppose we want to match the first three tokens, i.e. A1, B1, C1. We can do this quite easily with this regex: \G[A-Z]\d The \G matches at the beginning of the string, allowing us to match A1. Then \G matches before the next token, so we match it, as well as the following token. \G succeeds again before the vs, but [A-Z] cannot match the v, so the match fails. There is no more position for \G to match, and we therefore avoid the tokens to the right, as we wanted. Now suppose we want to match the second position's tokens, i.e. A1, A2, A3. Remembering the Tarzan and Jane example, we could try (?:vs|\G)([A-Z]\d)… but the strings in these two examples are not built the same way, and this regex would match all the tokens! Let's see how. After the \G matches at the beginning of the string, [A-Z]\d is able to match the first token. Then \G matches again, so we match the second token, and the third. Then, when we hit vs, \G still matches, but [A-Z] fails against vs. The engine backtracks and tries the other side of the alternation, vs, which matches. [A-Z]\d matches the fourth token, then \G helps us with tokens 5 and 6. Clearly, this time \G is in our way: we wish it didn't match at the beginning of the string. To solve this, we can "bridle \G" by placing the negative lookahead (?!\A) right next to it. It asserts that what immediately follows the current position is not the beginning of the string, so \G can no longer match there. The regex becomes: (?:vs|\G(?!\A))([A-Z]\d) It may sound strange that we used (?!\A) to negate the anchor \A. As it turns out, (?<!\A) would also have worked. We'll explore this in the section about anchors within a lookaround.
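Python's built-in re module has no \G, but the idea of "match again exactly where the previous match ended" can be approximated by calling a compiled pattern's match() method with a start position. The helper below, chained_matches, is a hypothetical name of my own; treat it as a sketch of the chaining idea on the submarine string rather than a full replacement for \G.

```python
import re

def chained_matches(pattern, text, start=0):
    """Collect back-to-back matches, each starting where the previous one ended."""
    compiled = re.compile(pattern)
    pos = start
    found = []
    while True:
        m = compiled.match(text, pos)    # match() is anchored at pos, much like \G
        if m is None or m.end() == pos:  # stop on failure or on an empty match
            break
        found.append(m.group())
        pos = m.end()
    return found

grid = "A1B1C1vsA1A2A3"

# Like \G[A-Z]\d: the chain starts at the beginning and breaks at "vs".
first_side = chained_matches(r"[A-Z]\d", grid)

# Like (?:vs|\G(?!\A))([A-Z]\d): the chain starts right after "vs" instead.
second_side = chained_matches(r"[A-Z]\d", grid, grid.index("vs") + 2)

print(first_side)   # ['A1', 'B1', 'C1']
print(second_side)  # ['A1', 'A2', 'A3']
```

Choosing the start position by hand plays the role of the "bridled" \G: the chain can no longer begin at the start of the string unless we ask it to.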

Anchors Anywhere, Anytime

If you read too much beginner literature, overexposure to patterns such as ^Name: \w+$ could lead you to believe that anchors such as ^ and $ must be used in immutable positions at either end of the string. The (?!\A) in the regex just above hints that it is not so. There is nothing sacred about anchors. A single anchor such as ^ may appear multiple times in a regex pattern. It can also participate in any subexpression, such as a group, an alternation, or a lookbehind. Let's see some examples.

Anchor in Multiple Places

^cat|^mouse Here we use the beginning-of-string anchor twice, on both sides of an alternation. This ensures that whichever word we match will be at the beginning of the string. In contrast, if we used ^cat|mouse, the ^ would only apply to cat. On the other side of the alternation, mouse is not anchored, so it could match anywhere in the string, which is perhaps not what we intend. Another way to ensure the anchor applies to both sides would be to enclose the alternation in a group, as in ^(?:cat|mouse)

Anchor in an Alternation

\bcat(\w+|$) Here we use the end-of-string anchor $ in an alternation. After the word boundary \b and the letters cat, the engine must either match some word characters \w+ or be able to assert that the current position is the end of the string. As a result, the word cat on its own can match, but only at the end of the string. In contrast, catch can match anywhere in the string.

Anchor Within a Lookaround

\w+(?=,|$) Here we use the end-of-string anchor $ within a lookahead. The engine matches word characters, then asserts that what follows is either a comma or the end of the string. In the string one apple, two peaches, three plums, this regex would match the fruits but not the numbers. It's worth taking a moment to examine the meaning of an anchor within a lookaround.
When the engine is standing at the beginning of the string, you could say that what immediately follows or precedes this position is also the beginning of the string. If immediately refers to an infinitely small distance to the left or to the right, that is indeed the case. This is a matter of semantics and perspective, of course, but it's a perspective that the engine takes on board: ^, (?<=^) and (?=^) all match at the beginning of a string. The same applies to other anchors: whenever an anchor is within a lookaround, the meaning of the anchor is the same as if the lookaround weren't there at all. This is convenient for several reasons: We can use a negative lookahead or a negative lookbehind to assert that the current position does not correspond to an anchor. For instance, (?<!^) checks that the current position is not the beginning of the string. We can place an anchor in an alternation within a lookaround to make a complex delimiter, as in the \w+(?=,|$) example a few paragraphs above. To take another example, (?<=>>|^) ensures that what precedes the current position is either two "greater than" characters >, or—assuming we're in multiline mode—the beginning of a line. This could be useful, for instance, in parsing the text of an email. Note that as in any alternation, the order of tokens is important: (?<=^|>>) matches at the beginning of the string if possible, or past >> if not. A few lines above, the priority was reversed.
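Here is a small sketch of these anchor placements in Python's re module (variable names are mine). One caveat: Python's lookbehind must have a fixed width, so a pattern such as (?<=>>|^), whose branches have different widths, is rejected by that engine; the examples below stick to forms it accepts.

```python
import re

# Anchor on one side of the alternation only: mouse may match anywhere.
loose = re.search(r"^cat|mouse", "a mouse")

# Anchor applied to both sides through a group: no match in this string.
strict = re.search(r"^(?:cat|mouse)", "a mouse")

# Anchor inside a lookahead: a word followed by a comma or by the end of the string.
fruits = re.findall(r"\w+(?=,|$)", "one apple, two peaches, three plums")

# Anchor inside a negative lookbehind: every a except one at the start of the string.
not_at_start = [m.start() for m in re.finditer(r"(?<!^)a", "aaa")]

print(loose.group(), strict)  # mouse None
print(fruits)                 # ['apple', 'peaches', 'plums']
print(not_at_start)           # [1, 2]
```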

Anchors seen as Delimiters

Just as you can argue that anchors are a kind of boundary, you can argue that they are a kind of delimiter. Take the beginning of string anchor ^: you can express it with the negative lookbehind (?s)(?<!.)a, which asserts that what precedes the current position is not a character. (In this regex, the mode modifier (?s)—also known as dotall mode—is necessary to allow the dot to match any character including line breaks.) Likewise, in multiline mode, where ^ matches the beginning of every line, you can mimic the anchor's behavior with the negative lookbehind (?<!.)a, which asserts that what precedes the current position is not a character, except perhaps a line break. Expressed in this way, the anchor ^ no longer seems to refer to a specific position in the string: it is just a shorthand notation for a "do-it-yourself delimiter". Ultimately, this will boil down to a question of semantics. But the question is not futile. Rephrasing anchors as DIY delimiters suggests ways to implement these anchors using other meta-characters if you ever decided to write your own engine. More importantly, exploring alternate ways to perform the work of anchors brings out some of their features and suggests possible variations. For instance, in many engines, if you wanted a variation on the multiline anchor ^ that only matches at the beginning of lines two and beyond, you could use something like the lookbehind (?<=\n), which asserts that what precedes the current position is a linefeed character. It really is mind-boggling that such a quality content is available for free. Thanks for your hard work. In the following use cases, the "$" anchor changes the regex behavior when trying to match the simple string "a" : - regex "^b*" matches, - regex "^b*$" does not match. How can we explain that? Thanks, -Yon Reply to Yon Hi Yon, Neither of these patterns matches the string "a".

Capture Group Numbering & Naming: The Gory Details

Capture groups and back-references are some of the more fun features of regular expressions. You place a sub-expression in parentheses, you access the capture with \1 or $1… What could be easier? For instance, the regex \b(\w+)\b\s+\1\b matches repeated words, such as regex regex, because the parentheses in (\w+) capture a word to Group 1 then the back-reference \1 tells the engine to match the characters that were captured by Group 1. Yes, capture groups and back-references are easy and fun. But when it comes to numbering and naming, there are a few details you need to know, otherwise you will sooner or later run into situations where capture groups seem to behave oddly.
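For readers who like to see things run, here is the repeated-word regex from the paragraph above in a short Python sketch (the sample sentences are made up for the illustration).

```python
import re

repeated_word = re.compile(r"\b(\w+)\b\s+\1\b")

hit = repeated_word.search("this regex regex works")
miss = repeated_word.search("this regex works")

print(hit.group())   # regex regex
print(hit.group(1))  # regex
print(miss)          # None
```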

How do Capture Groups Beyond \9 get Referenced?

Normally, within a pattern, you create a back-reference to the content a capture group previously matched by using a backslash followed by the group number—for instance \1 for Group 1. (The syntax for replacements can vary.) In practice, you rarely need to create back-references to groups with numbers above 3 or 4, because when you need to juggle many groups you tend to create named capture groups. However, if you spend time in the smoky corridors of regex, at one time or another you're sure to wonder what is the correct syntax to create back-references to Groups 10 and higher. So in a regular expression, what does \10 mean? It looks ambiguous: on the face of it, that could refer either to Group 10, or to Group 1 followed by a zero. In fact, the meaning does depend on the regex engine. If Group 10 has been set, all major engines treat \10 as a back-reference to Group 10. If there is no Group 10, however, Java translates \10 as a back-reference to Group 1, followed by a literal 0; Python understands it as a back-reference to Group 10 (which will fail); and C#, PCRE, JavaScript, Perl and Ruby understand it as an instruction to match "the backspace character" (whatever that is)… because 10 is the octal code for the backspace character in the ASCII table! To avoid this kind of ambiguity, here is the proper syntax to create a back-reference to Group 10. C#, Ruby: \k<10> PCRE, Perl: \g{10} Java, JavaScript, Python: no special syntax (use \10—knowing that if Group 10 is not set Java will treat this as Group 1 then a literal 0, while JavaScript will treat it as the elusive "backspace character") Replacement Syntax for Group 10 As you probably know, there is no standard across engines to insert capture groups into replacements. Some engines use the \1 syntax, some use $1, some allow both. So how do you insert Group 10 in a replacement? 
C#: ${10}

Java, JavaScript, Perl, PHP: $10 (if Group 10 has not been set, Java and JavaScript insert Group 1 then the literal 0, while Perl and PHP treat this as a back-reference to an undefined group)

Python, Perl, PHP: \10 (if Group 10 has not been set, Python and PHP treat this as a back-reference to an undefined group, while Perl inserts the backspace character, whatever that means)

Ruby does not allow Group numbers above \9 in replacements (use a named group).
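To see how one engine resolves the \10 ambiguity, here is a sketch for Python's re module. The ten_groups pattern is just a convenient way to get ten capture groups. On the replacement side, Python also offers the \g<10> form, which sidesteps the ambiguity, and \g<1>0 for "Group 1 then a literal 0".

```python
import re

ten_groups = "(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)"

# With ten groups defined, \10 refers back to Group 10 (the j).
with_group_10 = re.fullmatch(ten_groups + r"\10", "abcdefghijj")

# Without a Group 10, Python rejects the pattern when compiling it.
try:
    re.compile(r"(a)\10")
    missing_group_error = None
except re.error as exc:
    missing_group_error = exc

# In replacements, \g<10> is unambiguous; \g<1>0 is Group 1 then a literal 0.
group_10 = re.sub(ten_groups, r"\g<10>", "abcdefghij")
group_1_then_zero = re.sub(ten_groups, r"\g<1>0", "abcdefghij")

print(with_group_10 is not None)    # True
print(missing_group_error is None)  # False
print(group_10, group_1_then_zero)  # j a0
```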

Naming Groups—and referring back to them

In this section, to summarize named group syntax across various engines, we'll use the simple regex [A-Z]+ which matches capital letters, and we'll name it CAPS.

.NET (C#, VB.NET…), Java: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference, ${CAPS} inserts the capture in the replacement string.

Python: (?P<CAPS>[A-Z]+) defines the group, (?P=CAPS) is a back-reference, \g<CAPS> inserts the capture in the replacement string.

Perl: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference. The P syntax (see Python) also works. $+{CAPS} inserts the capture in the replacement string.

PHP: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference. The P syntax (see Python) also works. To insert the capture in the replacement string, you must either use the group's number (for instance \1) or use preg_replace_callback() and access the named capture as $match['CAPS'].

Ruby: (?<CAPS>[A-Z]+) defines the group, \k<CAPS> is a back-reference. To insert the capture in the replacement string, you must use the group's number, for instance \1.

JavaScript does not have named groups (along with lookbehind, inline modifiers and other useful features).
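The Python row of that summary can be exercised in a few lines. In this sketch the CAPS group is defined, referred back to, and inserted in a replacement; the hyphenated sample string is my own.

```python
import re

# (?P<CAPS>...) defines the group, (?P=CAPS) refers back to it.
doubled_caps = re.compile(r"(?P<CAPS>[A-Z]+)-(?P=CAPS)")

m = doubled_caps.search("id ABC-ABC end")
caps = m.group("CAPS")

# \g<CAPS> inserts the capture in the replacement string.
collapsed = doubled_caps.sub(r"\g<CAPS>", "id ABC-ABC end")

print(caps)       # ABC
print(collapsed)  # id ABC end
```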

How Capture Groups Get Numbered

This section starts with basics and moves on to more advanced topics. You can skip the topics that don't pertain to your regex engine or to regex features you aren't planning to use in the coming days, but the next three paragraphs (in which I've made sure to insert a streak of yellow) are required reading if you want to understand how group numbering works. In a regex pattern, every set of capturing parentheses from left to right as you read the pattern gets assigned a number, whether or not the engine uses these parentheses when it evaluates the match. This left-to-right numbering is strict—with the exceptions of the branch reset feature in Perl and PCRE (PHP, R…) and of duplicated group names in .NET. If the first capture group as you read the expression from the left never gets captured (probably because it lives on the wrong side of an | alternation operator), it is still Group 1, even though it will be empty. Capture groups get numbered from left to right. End of story. Capture Groups with Quantifiers In the same vein, if that first capture group on the left gets read multiple times by the regex because of a star or plus quantifier, as in ([A-Z]_)+, it never becomes Group 2. For good and for bad, for all times eternal, Group 2 is assigned to the second capture group from the left of the pattern as you read the regex. What happens to the number of the first group when it gets captured multiple times? It remains Group 1. The Returned Value for a Given Group is the Last One Captured Since a capture group with a quantifier holds on to its number, what value does the engine return when you inspect the group? All engines return the last value captured. For instance, if you match the string A_B_C_D_ with ([A-Z]_)+, when you inspect the match, Group 1 will be D_. With the exception of the .NET engine, all intermediate values are lost. In essence, Group 1 gets overwritten each time its pattern is matched. 
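Both rules (numbering by position, and "last capture wins") are easy to verify. The sketch below uses Python's re module, where a group that did not take part in the match is reported as None instead of an empty string.

```python
import re

# Group 1 lives on the side of the alternation that fails: it is still Group 1.
groups = re.match(r"(a)|(b)", "b").groups()

# A quantified group keeps its number and holds only the last value captured.
m = re.match(r"([A-Z]_)+", "A_B_C_D_")
last_capture = m.group(1)

print(groups)        # (None, 'b')
print(last_capture)  # D_
```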
The .NET Exception: Capture Collections As far as I know, the only engine that doesn't throw away intermediate captures is the .NET engine, available for instance through C# and VB.NET. In the above example, when you inspect the match and request the value for Group 1, C# will also return D_. But it will also make available to you a CaptureCollection object that stores all the values captured for Group 1 along the way. To see how this works, see the CaptureCollection section of the C# page. Perl, PHP, R, Python: Group Numbering with Subroutines and Recursion Some engines—such as Perl, PCRE (PHP, R, Delphi…) and Matthew Barnett's regex module for Python—allow you to repeat a part of a pattern (a subroutine) or the entire pattern (recursion). For instance, ([A-Z])_(?1) could be used to match A_B, as (?1) repeats the pattern inside the Group 1 parentheses, i.e. [A-Z]. The subroutine should be considered as a function call: in a sense, it has its own "local variable", i.e. its own version of Group 1. Likewise, each depth level of a recursion has its own version of Group 1 (and therefore no matter how many times you recurse, Group 1 is always Group 1 for a given depth). What this means is that when ([A-Z])_(?1) is used to match A_B, the Group 1 value returned by the engine is A. It also means that (([A-Z])\2)_(?1) will match AA_BB (Group 1 will be AA and Group 2 will be A). Whatever Group 1 values were used in the subroutine or recursion are discarded. In PCRE but not Perl, one interesting twist is that the "local" version of a Group in a subroutine or recursion starts out with the value set at the next depth level up the ladder, until it is overwritten. This means that in PCRE, ([A-Z]\2?)([A-Z])_(?1) would match AB_CB (but not in Perl). I covered this point in the section on group contents and numbering in recursive patterns. Perl, PHP, R: Group Numbering in Pre-Defined Subroutines Perl and PCRE (PHP, R…) allow you to pre-define and name a subpattern. 
This allows you to build beautifully modular expressions. The main syntax page explains the (?(DEFINE) … ) syntax in detail, but we'll look at a short example to refresh our memory. There is also a beautiful example on the page on matching numbers in plain English. (?(DEFINE)(?<CAPS>[A-Z]+)) lets you define a subpattern called CAPS that matches uppercase letters. Thereafter, you can drop (?&CAPS) anywhere in your expression to match uppercase letters. How does group numbering work with these defined subpatterns? You should think of these defined subpatterns as function calls: capture groups used by a subpattern won't be available outside. To complicate matters, the subroutine itself is assigned a number, so remember to count it when counting group numbers from left to right. For instance, in (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS), the subpattern (?<TWOCAPS>([A-Z])\2) counts as Group 1 and can in fact be called with (?1) instead of (?&TWOCAPS). In turn, ([A-Z]) counts as Group 2, so that the entire TWOCAPS pattern matches two identical uppercase letters. However, the regex (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS)\2 will fail on AAA because the final \2 is outside the subroutine and therefore refers to a group that has not been set at that depth. By contrast, (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS)(?2) would match AAB, as (?2) refers to the pattern of Group 2, i.e. [A-Z]. Remember that we number groups from left to right: therefore, after the TWOCAPS definition, the next available group number is 3. As a result, (?(DEFINE)(?<TWOCAPS>([A-Z])\2))(?&TWOCAPS)([BC])\3 will happily match AABB, as Group 3 ([BC]) captured the first B.

Branch Reset (Perl and PCRE, e.g. PHP and R)

There are two exceptions to the strict left-to-right numbering. One is the numbering with duplicated group names in .NET. The other is branch reset syntax (?|…(this)…|…(that)…) available in Perl and PCRE. A branch reset is introduced by (?|.
In the pseudo-example above, the groups capturing (this) and (that) would be assigned the same number. For details and examples, see the Branch Reset section of the main regex syntax page. Duplicating Group Names In .NET, PCRE (C, PHP, R…), Perl and Ruby, you can use the same group name at various places in your pattern. (In PCRE you need to use the (?J) modifier or PCRE_DUPNAMES option.) In these engines, this regex would be valid: :(?<token>\d+)|(?<token>\d+)# This particular example could be handled by branch reset syntax (supported by Perl and PCRE), but in more complex constructions the feature can come in handy. In PCRE, Perl and Ruby, the two groups still get numbered from left to right: the leftmost is Group 1, the rightmost is Group 2. In .NET, there is only one group number—Group 1. This is one of the two exceptions to capture groups' left-to-right numbering, the other being branch reset. If the group matches in several places, all captures get added to the capture collection for Group 1 (or named group token). See the section on named group reuse. Conclusion on Group Numbering Once you understand the strict left-to-right numbering of capture groups, much potential confusion about "which group should capture what" melts away. This numbering mode may seem annoying when it stands in the way of intricate logic you would love to inject in your regex. But it's simple and consistent, and that wins the day. The two following sections present examples that illustrate the main traps of group numbering we just saw, hopefully helping drill them down.

Generating New Capture Groups Automatically (You Can't!)

One of the wonderful features of regular expressions is that you can apply quantifiers to patterns. As you know, \d+ repeatedly eats up digits. As you also know, you can capture pieces of the subject. For instance, the regex (\d) captures a digit into group 1. So what happens if you put the two together into this regex? (\d)+ When I was a young dinosaur, I romantically hoped that if I applied that regex to the string 1234, the engine would eat each of the digits, one at a time, capturing them into four groups that could later be referenced using \1, \2, \3 and \4. Was I deluded! Things don't work that way—although alone among the crowd, .NET capture collections offer something nearly identical. The capturing parentheses you see in a pattern only capture a single group. So in (\d)+, capture groups do not magically mushroom as you travel down the string. Rather, they repeatedly refer to Group 1, Group 1, Group 1… If you try this regex on 1234 (assuming your regex flavor even allows it), Group 1 will contain 4—i.e. the last capture. In essence, Group 1 gets overwritten every time the regex iterates through the capturing parentheses. The same happens if you use recursive patterns instead of quantifiers. And the same happens if you use a named group: (?P<MyDigit>\d)+ The group named MyDigit gets overwritten with each digit it captures. That is less surprising, and this scenario helps explain the first, because the two phrasings are equivalent. It may not jump out at you from the raw symbols, but in the regex (\d)+, the set of parentheses refers to a specific group—Group 1—even though that group is not explicitly named, as it is in (?P<MyDigit>\d)+. The name is implied. If you were hoping to use a quantifier to spawn multiple captures as the engine travels down the string, forget about it—unless you use .NET capture collections, a feature that lets you inspect the successive captures made by a quantified group. But there's always a solution. 
For instance, the "match all" feature of your language lets you break down the string into chunks, each with its set of captures, which you can put back together if need be. For instance, you could use: C#: Matches() or iterate with Match() and NextMatch() Python: finditer or findall PHP: preg_match_all() Java: matcher() with while… find() JavaScript: match() or iterate with exec() Perl: $subject =~ m! Ruby: subject.scan
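As an illustration of the "match all" workaround, here is a sketch using Python's findall and finditer. The swim-test string is borrowed from the \G section; everything else is illustrative.

```python
import re

# A quantified group does not spawn new groups: Group 1 ends up with the last digit.
last_only = re.fullmatch(r"(\d)+", "1234").group(1)

# "Match all" returns one match per digit instead.
all_digits = re.findall(r"\d", "1234")

# With finditer, each match carries its own fresh set of captures.
pairs = [(m.group(1), m.group(2))
         for m in re.finditer(r"(\w+):(\d+)", "A:35 B:33 C:31")]

print(last_only)   # 4
print(all_digits)  # ['1', '2', '3', '4']
print(pairs)       # [('A', '35'), ('B', '33'), ('C', '31')]
```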

Resetting Capture Groups like Variables (You Can't!)

Closely related to the desire (explored above) to spawn new capture groups by using a quantifier, there is the temptation to try and capture a certain named group at different points in the regex, perhaps using a sophisticated machinery of conditionals as you might be used to doing in a programming language. For instance, you might try to write this to set the PriorError group at various places in the pattern, much like a variable or a flag:

(?x) # free-spacing mode
# this regex attempts to match "dog",
# allowing for a one-character error, e.g. dig or bog, but not bug
d? # "d" is the first character. It is optional.
(?(?<!d)(?P<PriorError>o)|.) # If the previous character is not d, set PriorError and require "o". Otherwise (no error so far), accept any character.
(?(PriorError)g # If PriorError is set, require g (no more errors!)
| (?:(?<!o)(?P<PriorError>g)|.)) # Otherwise (no prior error), if the previous character is not o, set PriorError and require g (no more errors!), otherwise accept any character

This is real cool regex code (I wrote it as a juvenile). The only problem is that it doesn't work: you are not allowed to set a capture group at two places in the regex. Regular expressions just aren't a language where you can set and reset variables anywhere you like in a pattern, with the possible exception of a few intricate tricks, for instance using .NET's balancing groups. For some operations, to get what you want, you may need to use more than one regex.
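In Python's re module, at least, the refusal is immediate: a pattern that defines the same group name twice does not even compile. The reduced pattern below is my own minimal stand-in for the PriorError idea, not the full regex above.

```python
import re

# Two definitions of the same named group: Python refuses to compile this.
try:
    re.compile(r"(?P<PriorError>o)|(?P<PriorError>g)")
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message is not None)  # True
```

Engines that accept duplicate names are discussed in the section on duplicating group names, and even there the second definition does not behave like a variable assignment.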

Relative Back-References and Forward-References

Perl and PCRE allow you to make back-references by specifying the relative position of the capture group, for instance: \g{-2} Likewise, PCRE (as of 10.23) but not Perl allows you to make forward-references by specifying the relative position of the capture group, for instance: \g{+2}

Disabling Capture Groups

In Perl (unsure as of when) and PCRE 10.30+, (?n) tells the engine to treat all groups as non-capture groups. This means that (subpattern) becomes equivalent to (?:subpattern).

Subject: Great discussion of a complex topic

Great discussion of a complex topic, thank you! The other sites I found left me more confused than when I started… The "You Can't" topics were very helpful too.

Regex Modifiers—Turning them On

This is a reference page—don't feel you have to read it as it is rather terse. Other sections link to it when needed. Regex modifiers turn up at every corner on this site. Rather than repeatedly explain what they do and the multiple ways to turn them on in every regex flavor, I decided to gather the four common ones (i, s, m and x) in one place. The final section briefly surveys other modifiers, which are usually language-specific.

Case Insensitivity: i

By default, all major regex engines match in case-sensitive mode. If you want patterns such as Name: [a-z]+ to match in case-insensitive fashion, you need to turn that feature on. Yes, but… What does case-insensitive really mean? As long as you stick to the 26 letters of the English alphabet, the definition of upper-case and lower-case is straightforward. When you branch out into typographical niceties or other languages and scripts, things are not always so simple. Here are some questions you may run into. Will ﬂ (one character, i.e. the fl ligature) match FL? Will à (one character) match à? Will à (two characters, i.e. the letter a and the grave accent) match à? All engines seem to handle that correctly. Will ß match ss? No engine seems to do so. Will i match İ (Turkish capital i) as well as I? These questions are just the tip of the iceberg. Even if I knew all the answers, it would be impossible to include them all in this section: if you use a particular script, you'll need to research how your specific engine handles case-insensitive matching in that script. More than one way For several engines, note that there are two ways of turning on case-insensitive matching: as an inline modifier (?i) or as an option in the regex method or function. Inline Modifier (?i) In .NET, PCRE (C, PHP, R…), Perl, Python, Java and Ruby (but not JavaScript), you can use the inline modifier (?i), for instance in (?i)cat. See the section on inline modifiers for juicy details about three additional features (unavailable in Python): turning it on in mid-string, turning it off with (?-i), or applying it only to the content of a non-capture group with (?i:foo) .NET Apart from the (?i) inline modifier, .NET languages have the IgnoreCase option. For instance, in C# you can use: var catRegex = new Regex("cat", RegexOptions.IgnoreCase); Perl Apart from the (?i) inline modifier, Perl lets you add the i flag after your pattern's closing delimiter.
For instance, you can use: if ($the_subject =~ m/cat/i) { … } PCRE (C, PHP, R…) Note that in PCRE, to use case-insensitive matching with non-English letters that aren't part of your locale, you'll have to turn on Unicode mode—for instance with the (*UTF8) special start-of-pattern modifier. Apart from the (?i) inline modifier, PCRE lets you set the PCRE_CASELESS mode when calling the pcre_compile() (or similar) function: cat_regex = pcre_compile( "cat", PCRE_CASELESS, &error, &erroroffset, NULL ); In PHP, the PCRE_CASELESS option is passed via the i flag, which you can add in your regex string after the closing delimiter. For instance, you can use: $cat_regex = '~cat~i'; In R, the PCRE_CASELESS option is passed via the ignore.case=TRUE option. For instance, you can use: grep("cat", subject, perl=TRUE, value=TRUE, ignore.case=TRUE); Python Apart from the (?i) inline modifier, Python has the IGNORECASE option. For instance, you can use: cat_regex = re.compile("cat", re.IGNORECASE) Java Apart from the (?i) inline modifier, Java has the CASE_INSENSITIVE option. For instance, you can use: Pattern catRegex = Pattern.compile( "cat", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE ); The UNICODE_CASE option added here ensures that the case-insensitivity feature is Unicode-aware. If you're only working with ASCII, you don't have to use it. JavaScript In JavaScript, your only option is to add the i flag after your pattern's closing delimiter. For instance, you can use: var catRegex = /cat/i; Ruby Apart from the (?i) inline modifier, Ruby lets you add the i flag after your pattern's closing delimiter. For instance, you can use: cat_regex = /cat/i
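To tie the two switches together, and to probe a couple of the trickier questions raised earlier, here is a sketch for Python's re module. The sample strings are mine; the last two checks only show what this one engine does with the sharp s and the fl ligature.

```python
import re

# Option passed to the function...
cat_regex = re.compile("cat", re.IGNORECASE)
flag_hit = cat_regex.search("A CAT sat")

# ...or inline modifier at the start of the pattern.
inline_hit = re.search(r"(?i)cat", "A CAT sat")

# One character cannot match two: the sharp s does not match ss,
# and the fl ligature (U+FB02) does not match FL.
sharp_s = re.fullmatch("\u00df", "ss", re.IGNORECASE)
ligature = re.fullmatch("\ufb02", "FL", re.IGNORECASE)

print(flag_hit.group(), inline_hit.group())  # CAT CAT
print(sharp_s, ligature)                     # None None
```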

DOTALL (Dot Matches Line Breaks): s (with exceptions)

By default, the dot . doesn't match line break characters such as line feeds and carriage returns. If you want patterns such as BEGIN .*? END to match across lines, you need to turn that feature on. This mode is sometimes called single-line (hence the s) because as far as the dot is concerned, it turns the whole string into one big line: .* will match from the first character to the last, no matter how many line breaks stand in between. The mode is also called DOTALL in PCRE, Python and Java (to be more precise, the PCRE documentation uses PCRE_DOTALL). To me, DOTALL is a sensible name for this mode. The third option, dot-matches-line-breaks, is descriptive but a bit of a mouthful. For several engines, note that there are two ways of turning it on: as an inline modifier or as an option in the regex method or function. JavaScript Before ES2018, JavaScript did not support single-line mode. ES2018 added the s flag, as in /BEGIN .*? END/s. In older engines, to match any character including line breaks, use a construct such as [\D\d]. This character class matches one character that is either a non-digit \D or a digit \d. Therefore it matches any character. Another JavaScript solution is to use the XRegExp regex library. If you've got infinite time on your hands, you can also try porting PCRE to JavaScript using Emscripten, as Firas seems to have done on regex 101. Inline Modifier (?s) In .NET, PCRE (C, PHP, R…), Perl, Python and Java (but not Ruby), you can use the inline modifier (?s), for instance in (?s)BEGIN .*? END. See the section on inline modifiers for juicy details about three additional features (unavailable in Python): turning it on in mid-string, turning it off with (?-s), or applying it only to the content of a non-capture group with (?s:foo). .NET Apart from the (?s) inline modifier, .NET languages have the Singleline option. For instance, in C# you can use: var blockRegex = new Regex( "BEGIN .*? 
END", RegexOptions.IgnoreCase | RegexOptions.Singleline ); Perl Apart from the (?s) inline modifier, Perl lets you add the s flag after your pattern's closing delimiter. For instance, you can use: if ($the_subject =~ m/BEGIN .*? END/s) { … } PCRE (C, PHP, R…) Apart from the (?s) inline modifier, PCRE lets you set the PCRE_DOTALL mode when calling the pcre_compile() (or similar) function: block_regex = pcre_compile( "BEGIN .*? END", PCRE_DOTALL, &error, &erroroffset, NULL ); In PHP, the PCRE_DOTALL option is passed via the s flag, which you can add in your regex string after the closing delimiter. For instance, you can use: $block_regex = '~BEGIN .*? END~s'; Python Apart from the (?s) inline modifier, Python has the DOTALL option. For instance, you can use: block_regex = re.compile("BEGIN .*? END", re.IGNORECASE | re.DOTALL) Java Apart from the (?s) inline modifier, Java has the DOTALL option. For instance, you can use: Pattern blockRegex = Pattern.compile( "BEGIN .*? END", Pattern.CASE_INSENSITIVE | Pattern.DOTALL ); Ruby: (?m) modifier and m flag In Ruby, you can use the inline modifier (?m), for instance in (?m)BEGIN .*? END. This is an odd Ruby quirk as other engines use (?m) for the "^ and $ match on every line" mode. See the section on inline modifiers for juicy details about three additional features: turning it on in mid-string, turning it off with (?-m), or applying it only to the content of a non-capture group with (?m:foo). Ruby also lets you add the m flag at the end of your regex string. For instance, you can use: block_regex = /BEGIN .*? END/m Origins of DOTALL The single-line mode is also often called DOTALL (which stands for "dot matches all") because of the PCRE_DOTALL option in PCRE, the re.DOTALL option in Python and the Pattern.DOTALL option in Java. I've heard it claimed several times that "DOTALL is a Python thing" but this seemed to come from people who hadn't heard about the equivalent options in PCRE and Java. 
Still this made me wonder: where did DOTALL appear first? Looking at the PCRE Change Log and old Python documentation, it seems that it appeared in PCRE with version 0.96 (October 1997), in Python with version 1.5 (February 1998), then in Java 1.4 (February 2002). The gap between the PCRE and Python introductions wasn't conclusive—the word might have been in circulation in earlier beta versions, or even in other tools—so I asked Philip Hazel (the father of PCRE) about it. He replied:
I believe I invented it — I certainly had not seen it elsewhere when I was trying to think of a name for the PCRE option that corresponds to Perl's /s option. ("S" there stands for "single-line" (…) so I wanted a better name.)
So there. Those who like a bit of history might enjoy this tasty nugget.
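To see the DOTALL behavior described in this section in action, here is a small Python sketch. The subject string and variable names are invented for the illustration; the last pattern uses the [\D\d] workaround mentioned for engines without a dot-matches-line-breaks mode.

```python
import re

# A block that spans two lines (sample data made up for this sketch).
subject = "BEGIN line one\nline two END"

# Default mode: the dot stops at the line feed, so the block is not found.
default_match = re.search(r"BEGIN .*? END", subject)

# DOTALL turned on via the function option, then via the inline modifier.
dotall_flag = re.search(r"BEGIN .*? END", subject, re.DOTALL)
dotall_inline = re.search(r"(?s)BEGIN .*? END", subject)

# Workaround without DOTALL: [\D\d] matches any character, line breaks included.
workaround = re.search(r"BEGIN [\D\d]*? END", subject)

print(default_match)
print(dotall_flag.group(0) == subject)
```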

Multiline (^ and $ Match on Every Line): m (except Ruby)

By default, in all major engines except Ruby, the anchors ^ and $ only match (respectively) at the beginning and the end of the string. In Ruby, they match at the beginning and end of each line, and there is no way to turn that feature off. This is actually a reasonable way of doing things, with which Ruby partially redeems itself for using m for DOTALL mode when other engines use s. In other engines, if you want patterns such as ^Define and >>>$ to match (respectively) at the beginning and the end of each line, you need to turn that feature on. This feature is usually called multi-line (hence the m) because the anchors ^ and $ operate on multiple lines. For several engines, note that there are two ways of turning it on: as an inline modifier (?m) or as an option in the regex method or function. Ruby In Ruby, the anchors ^ and $ always match on all lines. There is no way to turn this option off. This is actually quite a nice way to do things, since, as in most flavors, there are separate anchors for the beginning and end of strings: \A, \Z and \z. On the other hand, one can regret Ruby's choice to use the m flag and modifier in a non-standard way (see DOTALL). Inline Modifier (?m) In .NET, PCRE (C, PHP, R…), Perl, Python and Java (but not JavaScript, nor Ruby, where (?m) turns on DOTALL instead), you can use the inline modifier (?m), for instance in (?m)^cat. See the section on inline modifiers for juicy details about three additional features (unavailable in Python): turning it on in mid-string, turning it off with (?-m), or applying it only to the content of a non-capture group with (?m:foo). .NET Apart from the (?m) inline modifier, .NET languages have the Multiline option. For instance, in C# you can use: var catRegex = new Regex("^cat", RegexOptions.IgnoreCase | RegexOptions.Multiline); Perl Apart from the (?m) inline modifier, Perl lets you add the m flag after your pattern's closing delimiter. 
For instance, you can use: if ($the_subject =~ m/^cat/m) { … } PCRE (C, PHP, R…) Apart from the (?m) inline modifier, PCRE lets you set the PCRE_MULTILINE mode when calling the pcre_compile() (or similar) function: cat_regex = pcre_compile( "^cat", PCRE_CASELESS | PCRE_MULTILINE, &error, &erroroffset, NULL ); In PHP, the PCRE_MULTILINE option is passed via the m flag, which you can add in your regex string after the closing delimiter. For instance, you can use: $cat_regex = '~^cat~m'; Python Apart from the (?m) inline modifier, Python has the MULTILINE option. For instance, you can use: cat_regex = re.compile("^cat", re.IGNORECASE | re.MULTILINE) Java Apart from the (?m) inline modifier, Java has the MULTILINE option. For instance, you can use: Pattern catRegex = Pattern.compile( "^cat", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE ); JavaScript In JavaScript, your only option is to add the m flag after your pattern's closing delimiter. For instance, you can use: var catRegex = /^cat/m;
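The effect of multiline mode on the anchors can be checked with a short Python sketch. The sample text and variable names are assumptions made for this illustration. It also shows that \A keeps meaning "beginning of string" even when multiline mode is on.

```python
import re

# Three lines of sample text (made up for this sketch).
subject = "cat one\ncat two\ndog"

# Default: ^ only matches at the beginning of the string.
default_hits = re.findall(r"^cat", subject)

# Multiline mode, via the option and via the inline modifier.
flag_hits = re.findall(r"^cat", subject, re.MULTILINE)
inline_hits = re.findall(r"(?m)^cat", subject)

# \A is unaffected by multiline mode.
string_start_hits = re.findall(r"\Acat", subject, re.MULTILINE)

# The same contrast for $, using the last word of each line.
default_line_ends = re.findall(r"\w+$", subject)
multiline_line_ends = re.findall(r"\w+$", subject, re.MULTILINE)

print(default_hits, flag_hits, string_start_hits)
print(default_line_ends, multiline_line_ends)
```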

Free-Spacing: x (except JavaScript)

By default, any space in a regex string specifies a character to be matched. In languages where you can write regex strings on multiple lines, the line breaks also specify literal characters to be matched. Because you cannot insert spaces to separate groups that carry different meanings (as you do between phrases and paragraphs when you write in English), a regex can become hard to read, as for instance the Meaning of Life regex from the regex humor page: ^(?=(?!(.)\1)([^\DO:105-93+30])(?-1)(?<!\d(?<=(?![5-90-3])\d))).[^\WHY?]$ Luckily, many engines support a free-spacing mode that allows you to aerate your regex. For instance, you can add spaces between the tokens. In PHP, you could write this (note the x flag after the final delimiter ~): $word_with_digit_and_cap_regex = '~ ^ (?=\D*\d) \w*[A-Z]\w* $ ~x'; But why stay on one line? You can spread your regex over as many lines as you like, indenting and adding comments, which are introduced by a #. For instance, in C# you can do something like this:

    var wordWithDigitAndCapRegex = new Regex( @"(?x) # Free-spacing mode
        ^           # Assert that position = beginning of string
        ######### Lookahead ##########
        (?=         # Start lookahead
          \D*       # Match any non-digits
          \d        # Match one digit
        )           # End lookahead
        ######## Matching Section ########
        \w*         # Match any word chars
        [A-Z]       # Match one upper-case letter
        \w*         # Match any word chars
        $           # Assert that position = end of string
        ");

This mode is called free-spacing mode. You may also see it called whitespace mode, comment mode or verbose mode. It may be overkill in a simple regex like the one above (although anyone who has to maintain your code will thank you for it). But if you're building a serious regex pattern like the one in the trick to match numbers in plain English… Unless you're a masochist, you have no choice. 
Note that inside a character class, the space character and the # (which otherwise introduces comments) are still honored, except in Java, where they both need to be escaped if you mean to match these characters. For several engines, there are two ways of turning the free-spacing mode on: as an inline modifier or as an option in the regex method or function. Free-spacing mode is wonderful, but there are a couple of minor hazards you should be aware of, as they may leave you scratching your head wondering why a pattern is not working as you expect. Trip Hazard #1: The Meaning of Space First, you can no longer use Number: \d+ to match a string such as Number: 24. The reason is that the space in : \d no longer matches a space. We're in free-spacing mode, remember? That's the whole point. To match a space character, you need to specify it. The two main ways to do so are to place it inside a character class, or to escape it with a backslash. Either of those would work: Number:[ ]\d+ or Number:\ \d+ Of course Number:\s\d+ would also match, but remember that \s matches much more than a space character. For instance, it could match a tab or a line break. This may not be what you want. Trip Hazard #2: Late Start Second, you may get overconfident in the power of free-spacing and try something like this in order to let the regex stand on its own:

    var wordWithDigitAndCapRegex = new Regex(@"
        (?x) # Free-spacing mode
        ^    # Beginning of string
        etc  # Match the literal chars e,t,c
        ");

The problem with this is that although it may look as though the free-spacing modifier (?x) is the first thing in your regex, it is not. After the opening double-quote ", we have a line break and a number of spaces. The engine tries to match those, because at that stage we are not yet in free-spacing mode. That mode is turned on only when we encounter (?x). 
This regex will never match the string etc and more, because by the time we encounter the beginning of string anchor ^, we're supposed to already have matched a line break and space characters! This is why if you look at the first example, you will see that the free-spacing modifier (?x) is the very first thing after the opening quote character. Whitespace is not just trimmed out of the pattern Even though whitespace is ignored, the position of a whitespace still separates the previous token from the next. For instance, (A)\1 2 is not the same as (A)\12. The former matches AA2, the latter matches A\n in .NET, PCRE, Perl and Ruby (12 is the octal code for the linefeed character). Likewise, \p{Nd} is valid, but \p{N d} is not, except in Perl and Ruby. JavaScript JavaScript does not support free-spacing mode. One solution is to use the XRegExp regex library, which adds a free-spacing x flag. If you've got infinite time on your hands, you can also try porting PCRE to JavaScript using Emscripten, as Firas seems to have done on regex 101. Inline Modifier (?x) In .NET, PCRE (C, PHP, R…), Perl, Python, Java and Ruby (but not JavaScript), you can use the inline modifier (?x). For instance, this is an aerated regex to match repeated words: (?x) (\w+) [ \r\n]+ \1\b Also see the section on inline modifiers. .NET Apart from the (?x) inline modifier, .NET languages have the IgnorePatternWhitespace option. For instance, in C# you can use: var repeatedWordRegex = new Regex(@" (\w+) [ \r\n]+ \1\b", RegexOptions.IgnorePatternWhitespace ); Perl Apart from the (?x) inline modifier, Perl lets you add the x flag after your pattern's closing delimiter. 
For instance, you can use: if ($the_subject =~ m/(\w+) [ \r\n]+ \1\b/x) { … } PCRE (C, PHP, R…) Apart from the (?x) inline modifier, PCRE lets you set the PCRE_EXTENDED mode when calling the pcre_compile() (or similar) function: repeated_word_regex = pcre_compile( "(\w+) [ \r\n]+ \1\b", PCRE_EXTENDED, &error, &erroroffset, NULL ); In PHP, the PCRE_EXTENDED option is passed via the x flag, which you can add in your regex string after the closing delimiter. For instance, you can use: $repeated_word_regex = '~(\w+) [ \r\n]+ \1\b~x'; Python Apart from the (?x) inline modifier, Python has the VERBOSE option. For instance, you can use: repeated_word_regex = re.compile(r"(\w+) [ \r\n]+ \1\b", re.VERBOSE) Java Unlike in other engines, inside a Java character class hashes introduce comments and spaces are ignored, so you need to escape them if you want to use these characters in a class, e.g. [\#\ ]+ Apart from the (?x) inline modifier, Java has the COMMENTS option. For instance, you can use: Pattern repeatedWordRegex = Pattern.compile( "(\\w+) [ \\r\\n]+ \\1\\b", Pattern.COMMENTS ); Ruby Apart from the (?x) inline modifier, Ruby lets you add the x flag at the end of your regex string. For instance, you can use: repeated_word_regex = /(\w+) [ \r\n]+ \1\b/x
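The free-spacing ideas above, including Trip Hazard #1, can be sketched in Python as follows. The pattern names are invented for this illustration, and the behavior shown is that of Python's built-in re module; note that (?x) sits at the very start of the multi-line pattern, as the text recommends.

```python
import re

# A commented, multi-line pattern. (?x) is the very first thing in the string.
word_with_digit_and_cap = re.compile(r"""(?x)   # Free-spacing mode
    ^               # Beginning of string
    (?= \D* \d )    # Lookahead: a digit somewhere ahead
    \w* [A-Z] \w*   # Word chars containing one upper-case letter
    $               # End of string
""")

# Trip Hazard #1: in free-spacing mode a bare space no longer matches a space.
bare_space = re.compile(r"Number: \d+", re.VERBOSE)
class_space = re.compile(r"Number:[ ]\d+", re.VERBOSE)
escaped_space = re.compile(r"Number:\ \d+", re.VERBOSE)

print(word_with_digit_and_cap.search("abC1") is not None)
print(bare_space.fullmatch("Number: 24"))     # the space in the subject is not matched
print(class_space.fullmatch("Number: 24") is not None)
```

The bare-space version silently matches Number:24 (no space) instead, which is exactly the kind of surprise the hazard describes.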

Other Modifiers

Some engines support modifiers and flags in addition to i, s, m and x. I plan to cover those in the pages dedicated to those flavors. For instance, .NET has the (?n) modifier (also accessible via the ExplicitCapture option). This turns all (parentheses) into non-capture groups. To capture, you must use named groups. Java has the (?d) modifier (also accessible via the UNIX_LINES option). When this is on, the line feed character \n is the only one that affects the dot . (which doesn't match line breaks unless DOTALL is on) and the anchors ^ and $ (which match line beginnings and endings in multiline mode.) Perl has several other flags. See the documentation's modifier section. PCRE has the (?J) modifier (also available in code via the PCRE_DUPNAMES option). When set, different capture groups are allowed to use the same name—though they will be assigned different numbers. PCRE has the (?U) modifier (also available in code via the PCRE_UNGREEDY option). When set, quantifiers are ungreedy by default. Appending a ? makes them greedy. PCRE has the (?X) modifier (also available in code via the PCRE_EXTRA option). Historically, this mode has been used to enable new features in development. At the moment, it triggers errors if tokens such as \B are used in a character class (where normally it matches the capital letter B, unlike outside a character class, where it is a not-a-word-boundary assertion). PCRE has a number of special modifiers that can be set at the start of the pattern (these are shown below). In addition, many options can be sent to the pcre_compile() family of functions, if you have access to them. For details on those, get pcre_compile.html from the doc folder by downloading PCRE.

PCRE's Special Start-of-Pattern Modifiers

PCRE has a number of "special modifiers" you can set at the start of a pattern. Instead of the standard (?z) syntax for inline modifiers, the special modifier syntax looks like (*MYMODIFIER). These modifiers are particularly useful in contexts where PCRE is integrated within a tool or a language, as they replace a number of options you would send to pcre_compile(). UTF Strings Assuming PCRE is compiled with the relevant options, you can instruct the engine to treat the subject string as various kinds of UTF strings. (*UTF) is a generic way to treat the subject as a UTF string, detecting whether it should be treated as UTF-8, UTF-16 or UTF-32. (*UTF8), (*UTF16) and (*UTF32) treat the string as one of three specific UTF encodings. Unicode Properties for \d and \w By default, \d only matches ASCII digits, whereas \w only matches ASCII digits, letters and underscores. The (*UCP) modifier (which stands for Unicode Character Properties) allows these tokens to match Unicode digits and word characters. For instance, (*UCP)\d+ :: \w+ matches 1 :: Aれま. In combination with (*UCP), you may also need to use one of the (*UTF) modifiers. To see how this works, consider the output of this program with a standard Xampp PHP:

    $string = '1 :: Aれま';
    $utfregex[0] = "~\d+ :: \w+~";
    $utfregex[1] = "~(*UCP)\d+ :: \w+~";
    $utfregex[2] = "~(*UTF)(*UCP)\d+ :: \w+~";
    $utfregex[3] = "~(*UTF)\d+ :: \w+~";
    $utfregex[4] = "~\d+ :: \w+~u";
    foreach (range(0, 4) as $i) {
        echo "$i: ".preg_match($utfregex[$i],$string)."<br />";
    }
    // Output:
    // 0: 0
    // 1: 0
    // 2: 1 => (*UTF)(*UCP)
    // 3: 0
    // 4: 1 => The u flag produces the same result as (*UTF)(*UCP)

Line Break Modifiers By default, when PCRE is compiled, you tell it what to consider to be a line break. This setting affects the dot . (which doesn't match line breaks unless in dotall mode) as well as the behavior of the ^ and $ anchors in multiline mode. 
You can override this default with the following modifiers:
(*CR) Only a carriage return is considered to be a line break
(*LF) Only a line feed is considered to be a line break (as on Unix)
(*CRLF) Only a carriage return followed by a line feed is considered to be a line break (as on Windows)
(*ANYCRLF) Any of the above three is considered to be a line break
(*ANY) Any Unicode newline sequence is considered to be a line break
For instance, (*CR)\w+.\w+ matches Line1\nLine2 because the dot is able to match the \n, which is not considered to be a line break. Controlling \R By default, the \R metacharacter matches any Unicode newline sequence. When UTF-8 mode is off, these newline sequences are the \r\n pair, as well as the carriage return, line feed, vertical tab, form feed or next line characters. In UTF-8 mode, the token also matches the line separator and the paragraph separator character. Two start-of-pattern modifiers let you change the behavior of \R:
With (*BSR_ANYCRLF), \R only matches the \r\n sequence, \r or \n. This can also be set when PCRE is compiled or requested via the PCRE_BSR_ANYCRLF option.
With (*BSR_UNICODE), \R matches any Unicode newline sequence (overriding the PCRE_BSR_ANYCRLF option if set). This can also be set when PCRE is compiled or requested via the PCRE_BSR_UNICODE option.
Controlling Runaway Patterns To limit the number of times PCRE calls the match() function, use the (*LIMIT_MATCH=x) modifier, setting x to the desired number. To limit recursion, use (*LIMIT_RECURSION=d), setting d to the deepest recursion level allowed. Turning Off Optimizations By default, PCRE studies the pattern and automatically makes a quantified token atomic when the following token is incompatible, for instance turning A+X into A++X. The (*NO_AUTO_POSSESS) modifier disables this optimization. Use this when you want to use pcretest to benchmark two patterns and make yourself feel good about all the cycles auto-possessification is saving you. 
By default, PCRE performs several optimizations to find out faster whether a match will fail. The (*NO_START_OPT) modifier disables these optimizations. Disabling Empty Matches In PCRE2, (*NOTEMPTY) tells the engine not to return empty matches. Likewise, (*NOTEMPTY_ATSTART) tells the engine not to return empty matches found at the start of the subject. Disabling Automatic Anchoring Optimization In PCRE2, PCRE2_NO_DOTSTAR_ANCHOR tells the engine not to automatically anchor patterns that start with .* You can read more about this flag on the PCRE2 API page (search for PCRE2_NO_DOTSTAR_ANCHOR).

Mastering Lookahead and Lookbehind

Lookaround | Name | What it Does
(?=foo) | Lookahead | Asserts that what immediately follows the current position in the string is foo
(?<=foo) | Lookbehind | Asserts that what immediately precedes the current position in the string is foo
(?!foo) | Negative Lookahead | Asserts that what immediately follows the current position in the string is not foo
(?<!foo) | Negative Lookbehind | Asserts that what immediately precedes the current position in the string is not foo

Lookahead Example: Simple Password Validation

Let's get our feet wet right away with an expression that validates a password. The technique shown here will be useful for all kinds of other data you might want to validate (such as email addresses or phone numbers). Our password must meet four conditions: 1. The password must have between six and ten word characters \w 2. It must include at least one lowercase character [a-z] 3. It must include at least three uppercase characters [A-Z] 4. It must include at least one digit \d We'll assume we're working in a regex flavor where \d only matches ASCII digits 0 through 9, unlike .NET and Python where that token can match any Unicode digit. With lookarounds, your feet stay planted on the string. You're just looking, not moving! Our initial strategy (which we'll later tweak) will be to stand at the beginning of the string and look ahead four times—once for each condition. We'll look to check we have the right number of characters, then we'll look for a lowercase letter, and so on. If all the lookaheads are successful, we'll know the string is a valid password… And we'll simply gobble it all up with a plain .* Let's start with condition 1 A string that is made of six-to-ten word characters can be written like this: \A\w{6,10}\z The \A anchor asserts that the current position is the beginning of the string. After matching the six to ten word characters, the \z anchor asserts that the current position is the end of the string. Within a lookahead, this pattern becomes (?=\A\w{6,10}\z). This lookahead asserts: at the current position in the string, what follows is the beginning of the string, six to ten word characters, and the very end of the string. We want to make this assertion at the very beginning of the string. Therefore, to continue building our pattern, we want to anchor the lookahead with an \A. There is no need to duplicate the \A, so we can take it out of the lookahead. 
Our pattern becomes: \A(?=\w{6,10}\z) So far, we have an expression that validates that a string is entirely composed of six to ten word characters. Note that we haven't matched any of these characters yet: we have only looked ahead. The current position after the lookahead is still the beginning of the string. To check the other conditions, we just add lookaheads. Condition 2 For our second condition, we need to check that the password contains one lowercase letter. To find one lowercase letter, the simplest idea is to use .*[a-z]. That works, but the dot-star first shoots down to the end of the string, so we will always need to backtrack. Just for the sport, can we think of something more efficient? You might think of making the star quantifier reluctant by adding a ?, giving us .*?[a-z], but that too requires backtracking as a lazy quantifier requires backtracking at each step. For this type of situation, I recommend you use something like [^a-z]*[a-z] (or even better, depending on your engine, the atomic (?>[^a-z]*)[a-z] or possessive version [^a-z]*+[a-z]—but we'll discuss that in the footnotes). The negated character class [^a-z] is the counterclass of the lowercase letter [a-z] we are looking for: it matches one character that is not a lowercase letter, and the * quantifier makes us match zero or more such characters. The pattern [^a-z]*[a-z] is a good example of the principle of contrast recommended by the regex style guide. Let's use this pattern inside a lookahead: (?=[^a-z]*[a-z]) The lookahead asserts: at this position in the string (i.e., the beginning of the string), we can match zero or more characters that are not lowercase letters, then we can match one lowercase letter: [a-z] Our pattern becomes: \A(?=\w{6,10}\z)(?=[^a-z]*[a-z]) At this stage, we have asserted that we are at the beginning of the string, and we have looked ahead twice. We still haven't matched any characters. 
Note that on a logical level it doesn't matter which condition we check first. If we swapped the order of the lookaheads, the result would be the same. We have two more conditions to satisfy: two more lookaheads. Condition 3 For our third condition, we need to check that the password contains at least three uppercase letters. The logic is similar to condition 2: we look for an optional number of non-uppercase letters, then one uppercase letter… But we need to repeat that three times, for which we'll use the quantifier {3}. We'll use this lookahead: (?=(?:[^A-Z]*[A-Z]){3}) The lookahead asserts: at this position in the string (i.e., the beginning of the string), we can do the following three times: match zero or more characters that are not uppercase letters (the job of the negated character class [^A-Z] with the quantifier *), then match one uppercase letter: [A-Z] Our pattern becomes: \A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3}) At this stage, we have asserted that we are at the beginning of the string, and we have looked ahead three times. We still haven't matched any characters. Condition 4 To check that the string contains at least one digit, we use this lookahead: (?=\D*\d). Opposing \d to its counterclass \D makes good use of the regex principle of contrast. The lookahead asserts: at this position in the string (i.e., the beginning of the string), we can match zero or more characters that are not digits (the job of the "not-a-digit" character class \D and the * quantifier), then we can match one digit: \d Our pattern becomes: \A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d) At this stage, we have asserted that we are at the beginning of the string, and we have looked ahead four times to check our four conditions. We still haven't matched any characters, but we have validated our string: we know that it is a valid password. If all we wanted was to validate the password, we could stop right there. 
But if for any reason we also need to match and return the entire string—perhaps because we ran the regex on the output of a function and the password's characters haven't yet been assigned to a variable—we can easily do so now. Matching the Validated String After checking that the string conforms to all four conditions, we are still standing at the beginning of the string. The five assertions we have made (the anchor \A and the four lookaheads) have not changed our position. At this stage, we can use a simple .* to gobble up the string: we know that whatever characters are matched by the dot-star, the string is a valid password. The pattern becomes: \A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).* Fine-Tuning: Removing One Condition For n conditions, use n-1 lookaheads If you examine our lookaheads, you may notice that the pattern \w{6,10}\z inside the first one examines all the characters in the string. Therefore, we could have used this pattern to match the whole string instead of the dot-star .* This allows us to remove one lookahead and to simplify the pattern to this: \A(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)\w{6,10}\z The pattern \w{6,10}\z now serves the double purpose of matching the whole string and of ensuring that the string is entirely composed of six to ten word characters. Generalizing this result, if you must check for n conditions, your pattern only needs to include n-1 lookaheads at the most. Often, you are even able to combine several conditions into a single lookahead. You may object that we were able to use \w{6,10}\z because it happened to match the whole string. Indeed that was the case. But we could also have converted any of the other three lookaheads to match the entire string. For instance, taking the lookahead (?=\D*\d) which checks for the presence of one digit, we can add a simple .*\z to get us to the end of the string. 
The pattern would have become: \A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})\D*\d.*\z By the way, you may wonder why I bother using the \z after the .*: shouldn't it get me to the end of the string? In general, not so: unless we're in DOTALL mode, the dot doesn't match line breaks. Therefore, the .* only gets you to the end of the first line. After this, the string may have line breaks and many more lines. A \z anchor ensures that after the .* we have reached not only the end of the line, but also the end of the string. In this particular pattern, the first lookaround (?=\w{6,10}\z) already ensures that there cannot be any line breaks in the string, so the final \z is not strictly necessary.
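The password patterns built in this section can be tried out with the Python sketch below. Two adaptations are assumptions of this sketch rather than part of the walkthrough: Python's re module spells the strict end-of-string anchor \Z (standing in for \z here), and the re.ASCII flag is used so that \w and \d only match ASCII characters, as the section assumes. The constant and function names are made up.

```python
import re

# Four lookaheads, then .* to consume the validated string.
FOUR_LOOKAHEADS = re.compile(
    r"\A(?=\w{6,10}\Z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*",
    re.ASCII)

# Fine-tuned version: three lookaheads, with \w{6,10}\Z doing double duty.
THREE_LOOKAHEADS = re.compile(
    r"\A(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)\w{6,10}\Z",
    re.ASCII)


def is_valid_password(candidate):
    """True when the candidate meets all four conditions."""
    return THREE_LOOKAHEADS.match(candidate) is not None


def both_versions_agree(candidate):
    """True when the two patterns give the same verdict."""
    return (FOUR_LOOKAHEADS.match(candidate) is not None) == is_valid_password(candidate)


for sample in ["ABCdef1", "ABcdef1", "ABCd1", "ABCDEF1", "ABCdefg"]:
    print(sample, is_valid_password(sample))
```

Comparing the two versions on the same inputs is a cheap way to convince yourself that removing a lookahead did not change the logic.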

The Order of Lookaheads Doesn't Matter… Almost

In our password validation pattern, since the three lookaheads don't change our position in the string, we can rearrange them in any order without affecting the overall logic. While the order of lookaheads doesn't matter on a logical level, keep in mind that it may matter for matching speed. If one lookahead is more likely to fail than the other two, it makes little sense to place it in third position and expend a lot of energy checking the first two conditions. Make it first, so that if we're going to fail, we fail early—an application of the design to fail principle from the regex style guide. In fact, this is what we do by placing the anchor \A in first position. Since it is an assertion that doesn't consume characters, it too could swap positions with any of the lookaheads. We'll see why this is a bad idea, but first… In passing, consider that \A can be written with lookarounds: in DOTALL mode, where the dot matches any character including line breaks, the negative lookbehind (?<!.) asserts that what precedes the current position is not any character—therefore the position must be the beginning of the string. Without DOTALL mode, the negative lookbehind (?<![\D\d]) asserts the same, since [\D\d] matches one character that is either a digit or a non-digit—in other words, any character. Now imagine we set \A in fourth position, after the three lookaheads. The resulting match would be the same, but it could take a lot more time. For instance, suppose the third lookahead (whose job it is to assert that the string contains at least one digit) fails. After failing to find a match at the first position in the string, the engine advances to the second position and tries the lookaheads again, one after the other. Once more, the third lookahead is bound to fail to find a digit. After each failure, the engine will start a new match attempt starting at the next position in the string. 
Even when the two first lookaheads succeed (and they may fail, as the uppercase or lowercase letter they check for may have been the lone one in the string, and at a position already passed), the third lookahead will always fail to find a digit. Therefore the anchor \A is never even attempted: the pattern fails before the engine reaches that token. In contrast, when \A is first, it can only match at the first position in the string. The third lookahead still fails, but when the engine tries to match at further positions, the \A immediately fails, so the engine doesn't need to waste any more time with the lookaheads.
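The aside about writing \A with lookarounds can be checked with a short Python sketch. The subject string and variable names are invented for this illustration. The last line shows what goes wrong if the (?<!.) version is used without DOTALL: the position after a line break also passes the test.

```python
import re

# "cat" appears at the start of the string and again after a line break.
subject = "cat\ncat"


def start_positions(pattern):
    """Return the start offset of every match of pattern in subject."""
    return [m.start() for m in re.finditer(pattern, subject)]


anchor_positions = start_positions(r"\Acat")
any_char_lookbehind = start_positions(r"(?<![\D\d])cat")
dotall_lookbehind = start_positions(r"(?s)(?<!.)cat")

# Without DOTALL the dot cannot match the line feed, so position 4 slips through.
plain_dot_lookbehind = start_positions(r"(?<!.)cat")

print(anchor_positions, any_char_lookbehind, dotall_lookbehind, plain_dot_lookbehind)
```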

Lookarounds Stand their Ground

If I seem to be flogging a dead horse here, it's only because this point is the most common source of confusion with lookarounds. As the password validation example made clear, lookarounds stand their ground. They look immediately to the left or right of the engine's current position on the string—but do not alter that position. Therefore, do not expect the pattern A(?=5) to match the A in the string AB25. Many beginners assume that the lookahead says that "there is a 5 somewhere to the right", but that is not so. After the engine matches the A, the lookahead (?=5) asserts that at the current position in the string, what immediately follows is a 5. If you want to check if there is a 5 somewhere (anywhere) to the right, you can use (?=[^5]*5). Moreover, don't expect the pattern A(?=5)(?=[A-Z]) to match the A in the string A5B. Many beginners assume that the second lookahead looks to the right of the first lookahead. It is not so. At the end of the first lookahead, the engine is still planted at the very same spot in the string, after the A. When the lookahead (?=[A-Z]) tries to assert that what immediately follows the current position is an uppercase letter, it fails because the next character is still the 5. If you want to check that the 5 is followed by an uppercase letter, just state it in the first lookahead: (?=5[A-Z]) So lookahead and lookbehind don't mean "look way ahead into the distance". They mean "look at the text immediately to the left or to the right". If you want to inspect a piece of string further down, you will need to insert "binoculars" inside the lookahead to get you to the part of the string you want to inspect—for instance a .*, or, ideally, more specific tokens.
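The four patterns discussed above can be tried side by side. This is a small sketch in Python's re module; the variable names are illustrative only.

```python
import re

# The lookahead inspects only what immediately follows the A.
immediate = re.search(r"A(?=5)", "AB25")

# "Binoculars" inside the lookahead: skip non-5 characters, then find a 5.
anywhere = re.search(r"A(?=[^5]*5)", "AB25")

# Two lookaheads in a row both start from the same spot, right after the A.
stacked = re.search(r"A(?=5)(?=[A-Z])", "A5B")

# Stating both conditions in one lookahead checks "5, then an uppercase letter".
combined = re.search(r"A(?=5[A-Z])", "A5B")

for name, m in [("immediate", immediate), ("anywhere", anywhere),
                ("stacked", stacked), ("combined", combined)]:
    print(name, m.group() if m else None)
```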

Various Uses for Lookarounds

Before we dive into interesting but sometimes terse details, let's get excited about lookarounds by surveying some of their terrific uses. Validation The password validation section showed how the combination of several lookaheads can impose a number of conditions on the string to be matched, allowing us to validate it with a single pattern. Restricting a Character Range (Subtraction, Intersection) Suppose you want to match one word character \w as long as it is not the letter Q. There are several ways to do it without lookarounds: In engines that support character class subtraction, you can use [\w-[Q]] (.NET), [\w&&[^Q]] (Java and Ruby 1.9+) or [\w--Q] (Python with the alternate regex module) You can build a character class such as [_0-9a-zA-PR-Z] You can use [^\WQ]—an example of an obnoxious double-negative character range. If your engine doesn't support character class subtraction, the simplest may be to use the workaround shown on the page about class operations. This uses a lookahead to restrict the character class \w: (?!Q)\w After the negative lookahead asserts that what follows the current position is not a Q, the \w matches a word character. Not only is this solution easy to read, it is also easy to maintain if we ever decide to exclude the letter K instead of Q, or to exclude both: (?![QK])\w Note that we can also perform the same exclusion task with a negative lookbehind: \w(?<!Q) After the \w matches a word character, the negative lookbehind asserts that what precedes the current position is not a Q. Using the same idea, if we wanted to match one character in the Arabic script as long as it is not a number, we could use this pattern: (?!\p{N})\p{Arabic} This would work in Perl, PCRE (C, PHP, R…) and Ruby 2+. In .NET and Java, you would use (?!\p{N})\p{IsArabic} Likewise, we can use this technique to perform a DIY character class intersection. 
For instance, to match one character in the Arabic script as long as it is a number, we transform the negative lookahead above to a positive lookahead. In the Perl / PCRE / Ruby version, this gives us: (?=\p{N})\p{Arabic} This is basically the password validation technique with two conditions applied to a single character. Needless to say, you can interchange the content of the lookahead with the token to be matched: (?=\p{Arabic})\p{N}

Tempering the scope of a token

This use is similar to the last. Instead of removing characters from a class, it restricts the scope within which a token is allowed to match. For instance, suppose we want to match any run of characters up to, but not including, {END}. Using a negative lookahead, we can use: (?:(?!{END}).)* Each . token is tempered by (?!{END}), which specifies that the dot cannot be the beginning of {END}. This technique is discussed on the Quantifiers page. Another technique is: (?:[^{]++|{(?!END}))*+ On the left side of the alternation, [^{]++ matches characters that are not an opening brace. On the right side, {(?!END}) matches an opening brace that is not followed by END}. This technique is also covered on the Quantifiers page.

Delimiter

Do you have a string where you want to start matching all characters once the first instance of #START# is passed? No problem, just use a lookbehind to make a delimiter: (?<=#START#).* After the lookbehind asserts that what immediately precedes the current position is #START#, the dot-star .* matches all the characters to the right. Or would you like to match all characters in a string up to, but not including the characters #END#? Make a delimiter using a lookahead: .*?(?=#END#) You can, of course, combine the two: (?<=#START#).*?(?=#END#) See the page on boundaries for advice on building fancy DIY delimiters.

Inserting Text at a Position

Someone gave you a file full of film titles in CamelCase, such as HaroldAndKumarGoToWhiteCastle.
To make it easier to read, you want to insert a space at each position between a lowercase letter and an uppercase letter. This regex matches these exact positions: (?<=[a-z])(?=[A-Z]) In your text editor's regex replacement function, all you have to do is replace the matches with space characters, and spaces will be inserted in the right spots. This regex is what's known as a "zero-width match" because it matches a position without matching any actual characters. How does it work? The lookbehind asserts that what immediately precedes the current position is a lowercase letter. And the lookahead asserts that what immediately follows the current position is an uppercase letter.

Splitting a String at a Position

We can use the exact same regex from the previous example to split the string AppleOrangeBananaStrawberryPeach into a list of fruits. Once again, the regex (?<=[a-z])(?=[A-Z]) matches the positions between a lowercase letter and an uppercase letter. In most languages, when you feed this regex to the function that uses a regex pattern to split strings, it returns an array of words. Note that before Python 3.7, Python's re module did not split on zero-width matches, although the alternate regex module could.

Finding Overlapping Matches

Sometimes, you need several matches within the same word. For instance, suppose that from a string such as ABCD you want to extract ABCD, BCD, CD and D. You can do it with this single regex: (?=(\w+)) When you allow the engine to find all matches, all the substrings will be captured to Group 1. How does this work? At the first position in the string (before the A), the engine starts the first match attempt. The lookahead asserts that what immediately follows the current position is one or more word characters, and captures these characters to Group 1. The lookahead succeeds, and so does the match attempt. Since the pattern didn't match any actual characters (the lookahead only looks), the engine returns a zero-width match (the empty string).
It also returns what was captured by Group 1: ABCD The engine then moves to the next position in the string and starts the next match attempt. Again, the lookahead asserts that what immediately follows that position is word characters, and captures these characters to Group 1. The match succeeds, and Group 1 contains BCD. The engine moves to the next position in the string, and the process repeats itself for CD then D. In .NET, which has infinite lookbehind, you can find overlapping matches from the other side of the string. For instance, on the same string ABCD, consider this pattern: (?<=(\w+)) It will capture A, AB, ABC and ABCD. To achieve the same in an engine that doesn't support infinite lookbehind, you would have to reverse the string, use the lookahead version (?=(\w+)) then reverse the captures.
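Several of the uses surveyed in this section can be sketched in a few lines of Python's re module. The input strings other than the two CamelCase examples are made up for illustration, and the split call assumes Python 3.7 or later.

```python
import re

# Restricting a character class: any word character except Q.
no_q = re.findall(r"(?!Q)\w", "QUIQ")

# Delimiters: everything between #START# and #END#, markers excluded.
between = re.search(r"(?<=#START#).*?(?=#END#)",
                    "junk#START#payload#END#junk").group()

# Inserting text at a position: a space between lowercase and uppercase.
spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", "HaroldAndKumarGoToWhiteCastle")

# Splitting at a position (zero-width split, Python 3.7+).
fruits = re.split(r"(?<=[a-z])(?=[A-Z])", "AppleOrangeBananaStrawberryPeach")

# Overlapping matches: each zero-width match captures a suffix to Group 1.
suffixes = re.findall(r"(?=(\w+))", "ABCD")

print(no_q, between, spaced, fruits, suffixes, sep="\n")
```

Note that findall returns the Group 1 captures rather than the (empty) overall matches, which is exactly what the overlapping-match technique relies on.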

Zero-Width Matches

As we've seen, a lookaround looks left or right but it doesn't add any characters to the match to be returned by the regex engine. Likewise, an anchor such as ^ and a boundary such as \b can match at a given position in the string, but they do not add any characters to the match. Usually, lookaheads, lookbehinds, anchors and boundaries appear in patterns that contain tokens that do match characters, allowing the engine to return a matched string. For instance, in (?<=start_)\d+, the engine matches and returns some digits, but not the prefix start_ However, if a pattern only contains lookarounds, anchors and boundaries, the engine may be able to match the pattern without matching any characters. The resulting match is called a zero-width match because it contains no characters. This can be a useful technique, and we have already seen some applications of zero-width matches in the section on uses for lookarounds. To bring them together under one heading, here are some of their main uses. Validation If you string several lookarounds in a row, you can validate that a string conforms to a set of rules, as in the password validation technique. We saw that when you have n conditions, if you also want to match the string, you usually need n-1 lookarounds at the most as one condition can be removed and used in the matching section of the pattern. But if all you want to do is validate, all the conditions can stay inside lookarounds, giving you a zero-width match. Inserting You can use a zero-width match regex to match a position in a string and insert text at that position. For instance, by matching (?m)^ (the beginning of a line in multiline mode) and replacing the match with // , you can add a prefix to every line of a file. Likewise, we saw how the zero-width pattern (?<=[a-z])(?=[A-Z]) allows you to insert characters in a CamelCase word. Splitting We saw how the same zero-width pattern (?<=[a-z])(?=[A-Z]) allows you to split a CamelCase word into its components. 
Overlapping Matches We saw how an unanchored lookaround that contains capture groups—such as (?=(\w+))—allows you to match overlapping string segments.
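To make the idea of a match that contains no characters concrete, here is a short sketch in Python's re module. The sample strings are hypothetical.

```python
import re

# A zero-width match has identical start and end offsets and an empty text.
m = re.search(r"(?<=[a-z])(?=[A-Z])", "camelCase")
span = m.span()
matched_text = m.group()

# Inserting at a zero-width match: prefix every line with "// ".
source = "first line\nsecond line\nthird line"
commented = re.sub(r"(?m)^", "// ", source)

print(span, repr(matched_text))
print(commented)
```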

Positioning the Lookaround

Often, you have two options for positioning a lookaround: before the text to be matched, or after. Usually, one of the options is more efficient because it requires less work of the engine. To illustrate this, here are examples for each kind of lookaround. I borrowed them from the lookarounds section of the main syntax page, where they are discussed in greater detail.

Lookahead: \d+(?= dollars) and (?=\d+ dollars)\d+ both match 100 in 100 dollars, but the first is more efficient because the engine needs to match \d+ only once.

Negative Lookahead: \d+(?! dollars) and (?!\d+ dollars)\d+ both match 100 in 100 pesos, but the first is more efficient because the engine needs to match \d+ only once.

Lookbehind: (?<=USD)\d{3} and \d{3}(?<=USD\d{3}) both match 100 in USD100, but the first is more efficient because the engine needs to match \d{3} only once.

Negative Lookbehind: (?<!USD)\d{3} and \d{3}(?<!USD\d{3}) both match 100 in JPY100, but the first is more efficient because the engine needs to match \d{3} only once.

What may not be so clear is that each of these lookarounds can be used in two main ways: before the expression to be matched, or after it. When you compare each pair, the two ways have a slightly different feel. Please don't obsess over the differences; rather, just cruise through these simple examples to become familiar with the types of effects you can achieve. The point of the examples is not to make you memorize "the right position", but to expose you to those two basic feels. Once you're familiar with them, you will naturally think of rewriting a lookaround that feels too heavy. With a bit of practice, the efficient way of positioning your lookarounds will probably come to you naturally.
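The four pairs above can be verified to return the same text with a short Python sketch. It only checks the results, not the relative speed, and the table layout is just one convenient way to organize the comparison.

```python
import re

# (subject, lookaround placed efficiently, lookaround placed the other way)
pairs = [
    ("100 dollars", r"\d+(?= dollars)", r"(?=\d+ dollars)\d+"),
    ("100 pesos",   r"\d+(?! dollars)", r"(?!\d+ dollars)\d+"),
    ("USD100",      r"(?<=USD)\d{3}",   r"\d{3}(?<=USD\d{3})"),
    ("JPY100",      r"(?<!USD)\d{3}",   r"\d{3}(?<!USD\d{3})"),
]

# For each pair, collect the text matched by both variants.
results = [
    (re.search(first, subject).group(), re.search(second, subject).group())
    for subject, first, second in pairs
]

for (subject, first, second), found in zip(pairs, results):
    print(f"{subject!r}: {first} -> {found[0]!r}, {second} -> {found[1]!r}")
```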

Lookarounds that Look on Both Sides: Back to the Future

Suppose you want to match a two-digit number surrounded by underscores as in _12_ but not the underscores. We have already seen three ways to do this:
You can match everything and capture the digits to Group 1: _(\d{2})_
You can use a lookbehind and a lookahead: (?<=_)\d{2}(?=_)
You can use \K to drop the first underscore from the match: _\K\d{2}(?=_)

There is a fourth technique I'd like to introduce you to. I call it the "back to the future lookbehind." There shouldn't be any reason to use it on its own, but sometimes within an intricate pattern it may be just what you need, so it's nice to be familiar with it and add it to your repertoire. We can position our back-to-the-future lookbehind before or after the digits. Let's start with the before version: (?<=_(?=\d{2}_))\d+

Wowzy, what does this do? The lookbehind asserts that what immediately precedes the current position in the string is an underscore, then a position where the lookahead (?=\d{2}_) can assert that what immediately follows is two digits and an underscore. This is interesting for several reasons. First, we have a lookahead within a lookbehind, and even though we were supposed to look backwards, this lookahead jumps over the current position by matching the two digits and the trailing underscore. That's acrobatic. Second, note that even though it looks complex, this is a fixed-width lookbehind (the width is one character, the underscore), so it should work in all flavors of lookbehind. (However, it does not work in Ruby as Ruby does not allow lookaheads and negative lookbehinds inside lookbehind.)

Another interesting feature is how the notion of "current position in the string" is not the same for the lookbehind and for the lookahead. You'll remember that lookarounds stand their ground, so that after checking the assertion made by a lookaround, the engine hasn't moved in the string. Are we breaking that rule? We're not.
In the string 10 _16_ 20, let's say the engine has reached the position between the underscore and the 1 in 16. The lookbehind makes an assertion about what can be matched at that position. When the engine exits the lookbehind, it is still standing in that same spot, and the token \d+ can proceed to match the characters 16. But within the lookbehind itself, we enter a different little world. You can imagine that outside that world the engine is red, and inside the little world of the lookbehind, there is another little engine which is yellow. That yellow engine keeps track of its own position in the string. In most engines (.NET proceeds differently), the yellow engine is initially dropped at a position in the string that is found by taking the red engine's position and subtracting the width of the lookbehind, which is 1. The yellow engine therefore starts its work before the leading underscore. Within the lookbehind's little world, after matching the underscore token, the yellow engine's position in the string is between the underscore and the 1. It is that position that the lookahead refers to when it asserts that at the current position in the string (according to the little world of the lookbehind and its yellow engine), what immediately follows is two digits and an underscore.

After the digits

Here is a second version where the "back-to-the-future lookbehind" comes after the digits: \d+(?<=_\d{2}(?=_)) The lookbehind states: what immediately precedes this position in the string is an underscore and two digits, then a position where the lookahead (?=_) can assert that what immediately follows the current position in the string (according to the yellow engine and the lookbehind's little world) is an underscore. This too is a fixed-width lookbehind (the width is three characters, i.e. the leading underscore and the two digits), so it should work in all flavors of lookbehind except Ruby.
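As a rough check, the techniques from this section can be run against the sample string 10 _16_ 20. The sketch below uses Python's re module, on the assumption that it accepts a lookahead nested inside a fixed-width lookbehind; the \K variant is left out because re does not support \K.

```python
import re

text = "10 _16_ 20"

# 1. Match everything, capture the digits to Group 1.
captured = re.findall(r"_(\d{2})_", text)

# 2. Plain lookbehind plus lookahead.
both_sides = re.findall(r"(?<=_)\d{2}(?=_)", text)

# 3. Back-to-the-future lookbehind placed before the digits:
#    one character back is "_", and from there "two digits and _" follow.
before_digits = re.findall(r"(?<=_(?=\d{2}_))\d+", text)

# 4. Back-to-the-future lookbehind placed after the digits:
#    three characters back is "_" plus two digits, and "_" follows.
after_digits = re.findall(r"\d+(?<=_\d{2}(?=_))", text)

print(captured, both_sides, before_digits, after_digits)
```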

Compound Lookahead and Compound Lookbehind

The back-to-the-future lookbehind introduced us to what I call compound lookarounds, i.e., lookarounds that contain other lookarounds. You could also call them nested lookarounds, but for me the idea of compounding captures something more about the feel of working with these constructs. Let's look at some examples.

Token followed by one character, but not more

How can you match a number that is followed by one underscore, but not more? You can use this: \d+(?=_(?!_)) The lookahead asserts: what follows the current position in the string is one underscore, then a position where the negative lookahead (?!_) can assert that what follows is not an underscore. A less elegant variation would be \d+(?=(?!__)_)

Token preceded by one character, but not more

How can you match a number that is preceded by one underscore, but not more? You can use this: (?<=(?<!_)_)\d+ The lookbehind asserts: what precedes the current position in the string is a position where the negative lookbehind (?<!_) can assert that what immediately precedes is not an underscore, then an underscore. A variation would be (?<=_(?<!__))\d+

Multiple Compounding

Needless to say, it won't be long until you find occasions to add levels of compounding beyond the two we've just seen. But that quickly becomes obnoxious, and it becomes simpler to rearrange the regex. For instance, building on the previous pattern, (?<=(?<!(?<!X)_)_)\d+ matches a number that is preceded by an underscore that is not preceded by an underscore unless that underscore is preceded by an X. In .NET, PCRE, Java and Ruby, this could be simplified to (?<=(?<!_)_|X__)\d+ In Perl and Python, you could use (?:(?<=(?<!_)_)|(?<=X__))\d+
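The compound patterns above can be exercised with a brief Python sketch. The two sample strings are invented for the purpose, and the sketch assumes Python's re module accepts lookbehinds nested inside a fixed-width lookbehind.

```python
import re

followed = "12_ 34__"          # hypothetical: 12 has one underscore after it
preceded = "_12 __34 X__56"    # hypothetical: 12 has one underscore before it

# Number followed by exactly one underscore.
one_after = re.findall(r"\d+(?=_(?!_))", followed)

# Number preceded by exactly one underscore, in both spellings.
one_before = re.findall(r"(?<=(?<!_)_)\d+", preceded)
one_before_alt = re.findall(r"(?<=_(?<!__))\d+", preceded)

# Three levels: a doubled underscore is tolerated when an X precedes it.
three_levels = re.findall(r"(?<=(?<!(?<!X)_)_)\d+", preceded)

# The rearranged form given for Perl and Python.
rearranged = re.findall(r"(?:(?<=(?<!_)_)|(?<=X__))\d+", preceded)

print(one_after, one_before, one_before_alt, three_levels, rearranged)
```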

The Engine Doesn't Backtrack into Lookarounds… …because they're atomic

Here's a fun regex task. You have a string like this: _rabbit _dog _mouse DIC:cat:dog:mouse The DIC section at the end contains a list of allowed animals. Our job is to match all the _tokens named after an allowed animal. Therefore, we expect to match _dog and _mouse. A lookaround helps us do this: _(\w+)\b(?=.*:\1\b) After matching the underscore, we capture a word to Group 1. Then the lookahead (?=.*:\1\b) asserts that what follows the current position in the string is zero or more characters, then a colon, then the word captured to Group 1. As hoped, this matches both _dog and _mouse.

Now suppose we try a "reversed" approach: _(?=.*:(\w+)\b)\1\b This only matches _mouse. Why? First let's try to understand what this regex hopes to accomplish. It may not be that obvious, but it illustrates an important feature of lookarounds. After the engine matches the underscore, the lookahead (?=.*:(\w+)\b) asserts that what follows the current position in the string is any number of characters, then a colon, then a word (captured to Group 1). After passing that assertion, the back-reference \1 matches what was captured into Group 1.

Let's see how this works out. Remember that our string is _rabbit _dog _mouse DIC:cat:dog:mouse After the underscore that precedes rabbit, we expect the match attempt to fail because there is no rabbit in the DIC section, and it does. The next time we match an underscore is before dog. At that stage, inside the lookahead (?=.*:(\w+)\b), the dot-star shoots down to the end of the string, then backtracks just far enough to allow the colon to match, after which the word mouse is matched and captured to Group 1. The lookahead succeeds. The next token \1 tries to match mouse, but the next character in the string is the d from dog, so the token fails.
At this stage, having learned everything about backtracking, we might assume that the regex engine allows the dot-star to backtrack even more inside the lookahead, up to the previous colon, which would then allow (\w+) to match and capture dog. Then the back-reference \1 would match dog, and the engine would return a successful match. However, it does not work that way. Once the regex engine has left a lookaround, it will not backtrack into it if something fails somewhere down the pattern. On a logical level, that is because the official point of a lookaround is to return one of two values: true or false. Once a lookahead evaluates to true at a given position in the string, it is always true. From the engine's standpoint, there is nothing to backtrack. What would be the point, since the only other available value is false, and that would fail the pattern? The fact that the engine will not backtrack into a lookaround means that it is an atomic block. This property of lookarounds will rarely matter, but if someday, in the middle of building an intricate pattern, a lookahead refuses to cooperate… This may be the reason.
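The difference between the two patterns is easy to reproduce. Here is a minimal sketch in Python's re module, whose lookarounds are atomic like those described above; the variable names are illustrative.

```python
import re

text = "_rabbit _dog _mouse DIC:cat:dog:mouse"

# Capture first, then look ahead for the captured word in the DIC list.
capture_then_look = [m.group() for m in
                     re.finditer(r"_(\w+)\b(?=.*:\1\b)", text)]

# Look ahead first: the greedy .* settles on the last colon, Group 1 is
# always "mouse", and the engine never re-enters the lookahead to try "dog".
look_then_match = [m.group() for m in
                   re.finditer(r"_(?=.*:(\w+)\b)\1\b", text)]

print(capture_then_look)
print(look_then_match)
```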

Fixed-Width, Constrained-Width and Infinite-Width Lookbehind

In strings such as 123456_ORANGE abc12_APPLE, suppose you are interested in matching uppercase words, provided they are preceded by a prefix composed of digits and an underscore character. Therefore, in this string, you want to match ORANGE but not APPLE. It's worth remembering that in most regex flavors (.NET is one of the few exceptions), the following pattern is invalid: (?<=\b\d+_)[A-Z]+ That is because the width of the text matched by the token \d+ can be anything. Most engines require the width of the subexpression within a lookbehind to be known in advance, as in (?<=\d{3}) Some engines allow the width of the subexpression within a lookbehind to take various pre-determined values found on the various sides of an alternation, as in (?<=0|128|\d{6}). Yet others allow the width to vary within a pre-determined range, as in (?<=\d{2,6}) For details of what kinds of widths various engines allow in a lookbehind, see the Lookbehind: Fixed-Width / Constrained Width / Infinite Width section of the main syntax page. To honor the winners, I'll just repeat here that the only two programming-language flavors that support infinite-width lookbehind are .NET (C#, VB.NET, …) and Matthew Barnett's regex module for Python. I've also implemented an infinite lookbehind demo for PCRE.

Capture Group Inside Variable Lookbehind: Difference between Java and .NET

Both Java and .NET allow this pattern: (?<=(\d{1,5}))Z .NET allows it because it supports infinite-width lookbehind. Java allows it because it supports lookbehind whose width falls within a defined range. However, they operate differently. As a result, against the string 123Z, this pattern will return different Group 1 captures in the two engines.

Java captures 3 to Group 1. The engine sees that the width of the string to be matched inside the lookbehind must fall between one and five characters. Java tries all the possible fixed-width patterns in the range, from the shortest to the longest, until one succeeds.
The shortest possible fixed-width pattern is (?<=(\d{1})). The engine temporarily skips back one character in the string, tries to match \d{1} and succeeds. The lookaround succeeds, and Group 1 contains 3.

.NET captures 123 to Group 1. The .NET engine has a far more efficient way of processing variable-width lookbehinds. Instead of trying multiple fixed-width patterns starting at points further and further back in the string, .NET reverses the string as well as the pattern inside the lookbehind, then attempts to match that single pattern on the reversed string. Therefore, in 123Z, to try the lookbehind at the point before Z, it reverses the portion of string to be tested from 123 to 321. Likewise, the lookbehind (?<=(\d{1,5})) is flipped into the lookahead (?=(\d{1,5})). \d{1,5} matches 321. Reversing that string, Group 1 contains 123. To only capture 3 as in Java, you would have to make the quantifier lazy: (?<=(\d{1,5}?))Z Like .NET, the regex alternate regular expressions module for Python captures 123 to Group 1.

Workarounds

There are two main workarounds to the lack of support for variable-width (or infinite-width) lookbehind:
Capture groups. Instead of (?<=\b\d+_)[A-Z]+, you can use \b\d+_([A-Z]+), which matches the digits and underscore you don't want to see, then matches and captures to Group 1 the uppercase text you want to inspect. This will work in all major regex flavors.
The \K "keep out" verb, which is available in Perl, PCRE (C, PHP, R…), Ruby 2+ and Python's alternate regex engine. \K tells the engine to drop whatever it has matched so far from the match to be returned. Instead of (?<=\b\d+_)[A-Z]+, you can therefore use \b\d+_\K[A-Z]+

Compared with lookbehinds, both the \K and capture group workarounds have limitations: When you look for multiple matches in a string, at the starting position of each match attempt, a lookbehind can inspect the characters behind the current position in the string.
Therefore, against 123, the pattern (?<=\d)\d (match a digit preceded by a digit) will match both 2 and 3. In contrast, \d\K\d can only match 2, as the starting position after the first match is immediately before the 3, and there are not enough digits left for a second match. Likewise, \d(\d) can only capture 2. With lookbehinds, you can impose multiple conditions (similar to our password validation technique) by using multiple lookbehinds. For instance, to match a digit that is preceded by a lower-case Greek letter, you can use (?<=\p{Ll})(?<=\p{Greek})\d. The first lookbehind (?<=\p{Ll}) ensures that the character immediately to the left is a lower-case letter, and the second lookbehind (?<=\p{Greek}) ensures that the character immediately to the left belongs to the Greek script. With the workarounds, you could use \p{Greek}\K\d to match a digit preceded by a character in the Greek script (or \p{Greek}(\d) to capture it), but you cannot impose a second condition. To get over this limitation, you could capture the Greek character and use a second regex to check that it is a lower-case letter.
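Python's built-in re module is one of the fixed-width engines, which makes it a convenient place to see both the restriction and the capture-group workaround. The sketch below uses only the patterns and strings from this section; \K is omitted because re does not support it.

```python
import re

text = "123456_ORANGE abc12_APPLE"

# The variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=\b\d+_)[A-Z]+")
    variable_width_accepted = True
except re.error:
    variable_width_accepted = False

# Workaround: match the unwanted prefix, capture the wanted part to Group 1.
captured = re.findall(r"\b\d+_([A-Z]+)", text)

# Limitation of the workaround: the prefix is consumed, so adjacent
# candidates cannot share characters the way they can with a lookbehind.
with_lookbehind = re.findall(r"(?<=\d)\d", "123")
with_capture = re.findall(r"\d(\d)", "123")

print(variable_width_accepted, captured, with_lookbehind, with_capture)
```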

Lookarounds (Usually) Want to be Anchored

Let's imagine we want to match a string consisting of one word, provided it contains at least one digit. This pattern offers a reasonable solution—one of several: \A(?=\D*\d)\w+\z The \A anchor asserts that the current position is the beginning of the string. The lookahead (?=\D*\d) asserts that at the current position (which is still the beginning of the string), we can match zero or more non-digits, then one digit. Next, \w+ matches our word. Finally, the \z anchor asserts that the current position is the end of the string. Now consider what happens when we forget the anchor \A and use (?=\D*\d)\w+\z. To make our oversight seem less severe, let's assume we know that our string always contains an uninterrupted string of word characters. This guarantees that if we find a match, it will have to be the right one—at the beginning of the string, as we wanted. So what's the problem? Suppose we use our regex on a string composed of one hundred characters V. Since the string doesn't contain a single digit, you and I can immediately see that the regex must fail. Let's see how fast the engine comes to the same conclusion. As always, the engine begins by trying to match the pattern at the first position in the string. Starting with the first token (?=\D*\d), it tries to assert that at the current position, i.e. the beginning of the string, it can match zero or more non-digits, then one digit. Within the subexpression, the \D* matches all the V characters. The engine then tries to match a digit, but since we have reached the end of the string, that fails. If we're using a smart engine such as PCRE, at this stage the engine fails the lookaround for this first match attempt. That's because before starting the match attempt, the engine has studied the pattern and noticed that the \D and \d tokens are mutually exclusive, and it has turned the * quantifier into a possessive quantifier *+, a process known to PCRE as auto-possessification (see footnote). 
A less clever engine will backtrack, giving up all the \D characters it has matched one by one, each time attempting to match a \d after giving up a \D. Eventually, the engine runs out of characters to backtrack, and the lookahead fails. Once the engine understands that the lookahead must fail (whether it comes to this conclusion cleverly or clumsily), it gives up on the entire first match attempt. Next, as always in such cases, the engine moves to the next position in the string (past the first V) and starts a new match attempt. Again, the \D* eats up all the V characters—although this time, there are only 99 of them. Again, the lookahead fails, either fast if the engine is smart, or, more likely, after backtracking all the way back to the starting position. After failing a second time, the engine moves past the second V, starts a new match attempt, and fails… And so on, all the way to the end of the string. Because the pattern is not anchored at the beginning of the string, at each match attempt, the engine checks whether the lookahead matches at the current position. In doing so, in the best case, it matches 100 V characters, then 99 on the second attempt, and so on—so it needs about 5000 steps before it can see that the pattern will never match. In the more usual case, the engine needs to backtrack and try the \d at each position, adding two steps at each V position. Altogether, it needs about 15,000 steps before it can see that the pattern will never match. In contrast, with the original anchored pattern \A(?=\D*\d)\w+\z, after the engine fails the first match attempt, each of the following match attempts at further positions in the string fail instantly, because the \A fails before the engine gets to the lookahead. In the best case, the engine takes about 200 steps to fail (100 steps to match all the V characters, then one step at each of the further match attempts.) 
In the more usual case, the engine takes about 400 steps to fail (300 steps on the first match attempt, then one step at each of the further match attempts.) Needless to say, the ratio of (15,000 / 400) steps is the kind of performance hit we try to avoid in computing. This makes a solid case for helping the engine along by minimizing the number of times lookaheads must be attempted, either by using anchors such as ^ and \A, or by matching literal characters immediately before the lookahead. One Exception: Overlapping Matches There are times when we do want the engine to attempt the lookahead at every single position in the string. Usually, the purpose of such a maneuver is to match a number of overlapping substrings. For instance, against the string word, if the regex (?=(\w+)) is allowed to match repeatedly, it will match four times, and each match will capture a different string to Group 1: word, ord, rd, then d. The section on overlapping matches explains how this works.
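The step counts quoted above follow from simple arithmetic. The sketch below is a toy cost model in Python, not a measurement of any real engine: it assumes one step per character consumed by \D*, two extra steps per character when the engine backtracks, and one step for each attempt that fails on \A. It also confirms that both patterns give the same answers; Python's \Z is used where the text writes \z.

```python
import re

N = 100  # length of the all-V string from the example

def unanchored_steps(n, backtracking):
    # One attempt per start position; the attempt with k characters left
    # costs k steps, or 3k when each character is also given back and retried.
    per_char = 3 if backtracking else 1
    return sum(per_char * k for k in range(1, n + 1))

def anchored_steps(n, backtracking):
    # Full cost on the first attempt only, then one failed \A per position.
    per_char = 3 if backtracking else 1
    return per_char * n + (n - 1)

best_unanchored = unanchored_steps(N, backtracking=False)
usual_unanchored = unanchored_steps(N, backtracking=True)
best_anchored = anchored_steps(N, backtracking=False)
usual_anchored = anchored_steps(N, backtracking=True)

# Both spellings agree on the outcome; only the amount of work differs.
no_digit = "V" * N
one_digit = "V" * (N - 1) + "1"
anchored = re.compile(r"\A(?=\D*\d)\w+\Z")
unanchored = re.compile(r"(?=\D*\d)\w+\Z")
same_failure = anchored.search(no_digit) is None and unanchored.search(no_digit) is None
same_match = anchored.search(one_digit).group() == unanchored.search(one_digit).group()

print(best_unanchored, usual_unanchored, best_anchored, usual_anchored)
```

Under these assumptions the model gives 5050 and 15150 steps for the unanchored pattern against 199 and 399 for the anchored one, in line with the rounded figures in the text.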

Footnotes

Atomic tweak The atomic variation (?>[^a-z]*)[a-z] or possessive version [^a-z]*+[a-z] are tweaks that ensure that if the engine fails to find the lowercase letter, it won't "stupidly" backtrack, giving up the non-lowercase letters one by one to see if a lowercase letter might fit at that stage. Note that before they start matching, some engines notice the mutually exclusive character of [a-z] and its counterclass and automatically make the * quantifier possessive for you. This optimization is what PCRE calls auto-possessification. It allows you to turn it off with the Special Start-of-Pattern Modifier (*NO_AUTO_POSSESS), but why would you ever want to?

Mastering Quantifiers

The behavior of regex quantifiers is a common source of woes for the regex apprentice. You may have heard that they can be "greedy" or "lazy", sometimes even "possessive", but sometimes they don't seem to behave the way you had expected. Is there a bug in your regex engine? As it turns out, there is more to quantifiers than just "greedy" and "lazy". This page digs deep into the details of quantifiers and shows you the traps you need to be aware of and the tricks you need to master in order to wield them effectively.

Quantifier Basics

Before we dive into quantifier tricks and traps, let's have a quick reminder of the basics because I don't know what you've read so far and there's no shortage of incomplete regex tutorials.

Stick to what's immediately to the left

A regex quantifier such as + tells the regex engine to match a certain quantity of the character, token or subexpression immediately to its left. For instance:

- in A+ the quantifier + applies to the character A
- in \w* the quantifier * applies to the token \w
- in carrots? the quantifier ? applies to the character s, not to carrots
- in (?:apple,|carrot,)+ the quantifier + applies to the subexpression (?:apple,|carrot,)

One place where the "stick-to-the-left" rule is not immediately obvious is with the \Q…\E sequence, which escapes all of the characters it contains. If you stick a + after such a sequence, should it apply to the whole sequence, or only to its last character? The engine treats the content of the sequence as a series of literals, so the quantifier only applies to the last character. For instance, \QC+\E+ matches all of C++++, but against C+C+ it only matches C+.

Greedy: As Many As Possible (longest match)

By default, a quantifier tells the engine to match as many instances of its quantified token or subpattern as possible. This behavior is called greedy. For instance, take the + quantifier. It allows the engine to match one or more of the token it quantifies: \d+ can therefore match one or more digits. But "one or more" is rather vague: in the string 123, "one or more digits" (starting from the left) could be 1, 12 or 123. Which of these does \d+ match? Because by default quantifiers are greedy, \d+ matches as many digits as possible, i.e. 123. For the match attempt that starts at a given position, a greedy quantifier gives you the longest match.
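To make the "stick-to-the-left" rule and the greedy default concrete, here is a minimal sketch using Python's built-in re module. Two assumptions are mine rather than the article's: the choice of Python, and the use of re.escape() as a stand-in for \Q…\E, which Python's re does not support.

```python
import re

# The quantifier binds only to what is immediately to its left.
optional_s = re.compile(r"carrots?")             # ? applies to the s only
group_plus = re.compile(r"(?:apple,|carrot,)+")  # + applies to the whole group

print(bool(optional_s.fullmatch("carrot")))               # True
print(bool(optional_s.fullmatch("carrots")))              # True
print(bool(group_plus.fullmatch("apple,carrot,apple,")))  # True

# Stand-in for \QC+\E+ : a literal C, a literal +, and a + quantifier
# that applies only to the last literal character.
literal_then_plus = re.compile(re.escape("C+") + "+")
print(literal_then_plus.match("C++++").group())  # C++++
print(literal_then_plus.match("C+C+").group())   # C+

# Greedy: from a given starting position, \d+ takes every digit it can.
greedy_digits = re.match(r"\d+", "123").group()
print(greedy_digits)                             # 123
```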
Do beware of this notion of "longest match": it refers to the longest match that can be found with a match attempt that starts at a given position in the string, not to the longest possible match that can be found if a pattern is applied repeatedly to various sections of a string. For more on this, see the section about the longest and shortest match traps.

Docile: Give Back When Needed

Greedy… but docile. Because by default a quantifier is greedy, the engine starts out by matching as many of the quantified token as it can swallow. For instance, with A+, the engine swallows as many A characters as it can find. But if the quantified token has matched so many characters that the rest of the pattern cannot match, the engine will backtrack to the quantified token and make it give up characters it matched earlier, one character or chunk at a time, depending on whether the quantifier applies to a single character or to a subpattern that can match chunks of several characters. After giving up each character or chunk, the engine tries once again to match the rest of the pattern. I call this behavior of greedy quantifiers docile.

For instance, against the string A tasty apple, using the regex .*apple, the token .* starts out by greedily matching every single character in the string. The engine then advances to the next token a, which fails to match as there are no characters left in the string. The engine backtracks into the .*, which gives up the e in apple. The engine once again advances to the next token, but the a fails to match the e. The engine again backtracks into the .*, which gives up the l. The process repeats itself until the .* has given up the a, at which stage the text tokens a, p, p, l and e are all able to match and the overall match is successful.

When you hear that A+ means "one or more A characters", that is therefore not the whole story.
It is "one or more, but as many as possible (greedy), and giving back characters if needed in order to allow the rest of the pattern to match (docile)". Suppose our entire string is AAA. Depending on which pattern we use to match the string, the quantified token A+ could end up matching A, AA or AAA. Consider these three patterns:

- With A+ the token A+ matches AAA (as many as possible).
- With (A+). the token A+ (captured to Group 1) matches AA, because to allow the dot to match, A+ (which starts out by matching AAA) has to give up one A.
- With (A+).. the token A+ (captured to Group 1) matches A, because to allow the two dots to match, A+ (which starts out by matching AAA) has to give up two A characters.

Lazy: As Few As Possible (shortest match)

In contrast to the standard greedy quantifier, which eats up as many instances of the quantified token as possible, a lazy (sometimes called reluctant) quantifier tells the engine to match as few of the quantified tokens as needed. As you'll see in the table below, a regular quantifier is made lazy by appending a ? question mark to it. Since the * quantifier allows the engine to match zero or more characters, \w*?E tells the engine to match zero or more word characters, but as few as needed (which might be none at all), then to match an E. In the string 123EEE, starting from the very left, "zero or more word characters then an E" could be 123E, 123EE or 123EEE. Which of these does \w*?E match? Because the *? quantifier is lazy, \w*? matches as few characters as possible to allow the overall match attempt to succeed, i.e. 123, and the overall match is 123E. For the match attempt that starts at a given position, a lazy quantifier gives you the shortest match.

Do beware of this notion of "shortest match": it refers to the shortest match that can be found with a match attempt that starts at a given position in the string, not to the shortest possible match that can be found if a pattern is applied repeatedly to various sections of a string.
For more on this, see the section about the longest and shortest match traps.

Helpful: Expand When Needed

Lazy… but helpful. With a lazy quantifier, the engine starts out by matching as few of the tokens as the quantifier allows. For instance, with A*?, the engine starts out matching zero characters, since * allows the engine to match "zero or more". But if the quantified token has matched so few characters that the rest of the pattern cannot match, the engine backtracks to the quantified token and makes it expand its match, one step at a time. After matching each new character or subexpression, the engine tries once again to match the rest of the pattern. I call this behavior of lazy quantifiers helpful.

For instance, against the string Two_apples, using the regex .*?apples, the token .*? starts out by matching zero characters, the minimum allowed by the * quantifier. The engine then advances in the pattern and tries to match the a token against the T in Two. That fails, so the engine backtracks to the .*?, which expands to match the T. The engine advances both in the pattern and in the string and tries to match the a token against the w in Two. Once again, the engine has to backtrack. The .*? expands to match the w, then the a token fails against the o in Two. This process of advancing, failing, backtracking and expanding repeats itself until the .*? has expanded to match Two_. At that stage, the following token a is able to match, as are the p and all the tokens that follow. The match attempt succeeds.

As this example showed, because lazy quantifiers expand their match one step at a time in order to match only as much as needed, they cause the engine to backtrack at each step. They are expensive.

To fully grasp how lazy quantifiers work, let's look at one more example. The quantified token A*? matches zero or more A characters: as few as possible, expanding as needed. Against the string AA, depending on the overall pattern, A*?
could end up matching no characters at all, A or AA. Consider how these three patterns match AA:

- With ^(A*?)AA$ the A*? (captured to Group 1) matches no characters. After the anchor ^ asserts that the current position is the beginning of the string, A*? tries to match the least number of characters allowed by *, which is zero characters. The engine moves to the next token: the A, which matches the first A in AA. The next token matches the second A. The match attempt succeeds, and Group 1 ends up containing no characters.
- With ^(A*?)A$ the A*? (captured to Group 1) matches one A. Initially, the A*? matches zero characters. The next token A matches the first A in AA. The engine advances to the next token, but the anchor $ fails to match against the second A. The engine sees that the A*? can expand. It backtracks and gives up the A, which the A*? now expands to match. The engine moves to the next token: the A matches the second A in the string. The $ anchor now succeeds. Group 1 ends up containing one A.
- With ^(A*?)$ the A*? (captured to Group 1) matches AA. After the A*? matches zero characters, the $ fails to match. The engine backtracks and allows the A*? to match one A. Once again, the $ fails to match (there is one A left in the string). The engine backtracks again and allows the A*? to expand to match the second A. This time, the $ anchor matches. Group 1 ends up containing AA.

Possessive: Don't Give Up Characters

In contrast to the standard docile quantifier, which gives up characters if needed in order to allow the rest of the pattern to match, a possessive quantifier tells the engine that even if what follows in the pattern fails to match, it will hang on to its characters. As you'll see in the table below, a quantifier is made possessive by appending a + plus sign to it. Therefore, A++ is possessive: it matches as many characters as possible and never gives any of them back. Whereas the regex A+. matches the string AAA, A++. doesn't. At first, the token A++ greedily matches all the A characters in the string.
The engine then advances to the next token in the pattern. The dot . fails to match because there are no characters left to match. The engine checks whether there is anything to backtrack into. But A++ is possessive, so it will not give up any characters. There is nothing to backtrack into, and the pattern fails. In contrast, with A+., the A+ would have given up the final A, allowing the dot to match.

Possessive quantifiers match fragments of string as solid blocks that cannot be backtracked into: it's all or nothing. This behavior is particularly useful when you know there is no valid reason why the engine should ever backtrack into a section of matched text, as you can save the engine a lot of needless work. In particular, when a match must fail, a possessive quantifier can help it to fail faster.

For instance, suppose we want to match a string of digits that ends with E, as in 123E. We can use a possessive quantifier with the \d digit token: \b\d++E

When we use this pattern against 123E, it matches in the same way as if we had used a non-possessive \d+. Actually, in theory the match could be a hair faster because the \d++ quantifier doesn't need to remember positions where it may later need to backtrack: it's all or nothing.

Now let's use the same pattern against 13245. We expect the match to fail because the string doesn't end with an E. Let's see how the possessive and non-possessive versions compare.

In the possessive version \b\d++E, after matching all the digits, the engine advances in the pattern and attempts the next token E. There are no characters left in the string, so this fails. Since the engine has nowhere to backtrack to, the match fails.

In the non-possessive version \b\d+E, after failing to match the E token at the end of the string, unless the engine has been optimized to detect that the \d+ token and the E are mutually incompatible, it has positions to backtrack to. It backtracks into the \d+ and gives up the last character matched, which was the 5.
It then advances in the pattern and tries the next token E against the 5. That fails, so the engine backtracks into the \d+ again, gives up the 4, advances in the pattern and tries the E against the 4. This fails, and the process repeats itself until the \d+ has given up everything except the 1, at which stage there is nothing left to backtrack and the pattern can finally fail. As you can see, in the regular version the engine spends a lot of time in needless backtracking, whereas in the possessive version the "all or nothing" \d++ allows the match to fail right away.

It's worth noting that certain engines (such as PCRE) study the pattern before starting the match, notice that the token \d is mutually exclusive with the token E, and optimize the pattern by automatically turning the \d+ into a possessive \d++. This process is called auto-possessification. PCRE even allows you to turn it off with the special start-of-pattern modifier (*NO_AUTO_POSSESS).

Possessive quantifiers are supported in Java (which introduced the syntax), PCRE (C, PHP, R…), Perl, Ruby 2+ and the alternate regex module for Python. In .NET, where possessive quantifiers are not available, you can use the atomic group syntax (?>…) (this also works in Perl, PCRE, Java and Ruby). The atomic group (?>A+) tells the engine that if the pattern fails after the A+ token, it is not allowed to backtrack into A+. This means that A+ will not give up any of its characters: it is like a solid block (hence the name atomic). As you can see, this is the same behavior as A++. In fact, A++ is syntactic sugar for (?>A+), as internally most engines convert the first to the second.
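The docile, lazy and possessive behaviors described in this section can be checked with a short Python sketch. The variable names are mine. For the possessive case I assume an emulation, (?=(A+))\1, which captures A+ inside a lookahead and replays it with a backreference; I use it only so that the sketch runs on Python versions whose re module lacks the A++ syntax.

```python
import re

# Docile: A+ starts with AAA, then gives characters back so the dots can match.
docile = [re.match(p, "AAA").group(1) for p in (r"(A+)", r"(A+).", r"(A+)..")]
print(docile)  # ['AAA', 'AA', 'A']

# Lazy but helpful: A*? starts with nothing and expands only as far as needed.
lazy = [re.match(p, "AA").group(1)
        for p in (r"^(A*?)AA$", r"^(A*?)A$", r"^(A*?)$")]
print(lazy)  # ['', 'A', 'AA']

# Shortest match from a given position: 123E, not 123EE or 123EEE.
shortest = re.match(r"\w*?E", "123EEE").group()
print(shortest)  # 123E

# Possessive behavior (emulated): once the lookahead has captured AAA,
# the engine cannot go back inside it to release an A for the dot.
docile_dot = re.match(r"A+.", "AAA")              # matches: A+ gives up one A
possessive_dot = re.match(r"(?=(A+))\1.", "AAA")  # None: nothing is given back
print(docile_dot.group(), possessive_dot)         # AAA None
```

On Python 3.11 and later, the re module accepts A++ and (?>A+) directly, so the emulation is only needed for older versions.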

Quantifier Cheat Sheet

+ : once or more
    A+ : One or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
    A+? : One or more As, as few as needed to allow the overall pattern to match (lazy)
    A++ : One or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)
* : zero times or more
    A* : Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
    A*? : Zero or more As, as few as needed to allow the overall pattern to match (lazy)
    A*+ : Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)
? : zero times or once
    A? : Zero or one A, one if possible (greedy), giving up the character if the engine needs to backtrack (docile)
    A?? : Zero or one A, zero if that still allows the overall pattern to match (lazy)
    A?+ : Zero or one A, one if possible (greedy), not giving up the character if the engine tries to backtrack (possessive)
{x,y} : x times at least, y times at most
    A{2,9} : Two to nine As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
    A{2,9}? : Two to nine As, as few as needed to allow the overall pattern to match (lazy)
    A{2,9}+ : Two to nine As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)
    A{2,} : Two or more As, greedy and docile as above.
    A{2,}? : Two or more As, lazy as above.
    A{2,}+ : Two or more As, possessive as above.
    A{5} : Exactly five As. Fixed repetition: neither greedy nor lazy.

The Greedy Trap

The classic trap with greedy quantifiers is that they may match more than you expect. Suppose you want to match tokens that begin with {START} and end with {END}. You may try this pattern: {START}.*{END} Note that Java will require that you escape the opening braces: \{

However, you will find that this pattern matches this entire string from start to finish: {START} Mary {END} had a {START} little lamb {END} …whereas we wanted to find two matches: {START} Mary {END} and {START} little lamb {END}

Here is what happens. After matching {START}, the engine moves to the next token: .* Because of the greedy quantifier, the dot-star matches all the characters to the very end of the string. The engine then moves to the next token: the { at the beginning of {END}. This fails to match because there are no characters left in the string. But the engine sees that it can backtrack into the dot-star. First, the dot-star gives up the very last character in the string, i.e. }. The engine now tries to match the { token against this character, but fails. The dot-star then gives up the D. Again, the engine fails to match the { token against that character. Repeating this process, the dot-star gives up the N, the E and the {, and the { token can finally match. Then the rest of the pattern END} matches. Therefore, the final match is the entire string. The dot-star has only given up as many characters as were needed to allow an overall match to succeed.

The best-known way to solve this problem is with lazy quantifiers. But lazy quantifiers have their own problems, and it is worth understanding other techniques to overcome the greed of an unfettered dot-star. We will look at five distinct solutions, all of which you need to master on your way to your regex black belt.
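The trap is easy to reproduce. The sketch below uses Python's re module (my choice of engine, not the article's) and escapes the braces, which is harmless in Python and required in some other engines.

```python
import re

text = "{START} Mary {END} had a {START} little lamb {END}"

# The greedy dot-star runs to the end of the string and only backs up
# as far as the last {END}, so there is a single match.
greedy_matches = re.findall(r"\{START\}.*\{END\}", text)

print(len(greedy_matches))        # 1
print(greedy_matches[0] == text)  # True
```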

Lazy Quantifier Solution

The easiest way is to make the dot-star lazy by adding a ? question mark: {START}.*?{END} The lazy .*? guarantees that the quantified dot only matches as many characters as needed for the rest of the pattern to succeed. Therefore, the pattern only matches one {START}…{END} item at a time, which is what we want.

Containing a Lazy Quantifier that Can Eat the Delimiter: Atomic Group

Suppose our regex pattern must match not only a {START}…{END} block, but some characters beyond that block, for instance \d+ digits. In such cases, we must tweak the lazy quantifier solution by embedding the lazy dot-star and the {END} delimiter together in an atomic group, like so: {START}(?>.*?{END}) This is because if tokens (such as \d+) beyond {END} fail to match, the engine will backtrack and require the .*? to expand beyond the first {END}, perhaps reaching a second {END} where a match is possible. We don't want this. The atomic group (?>.*?{END}) forbids the engine from backtracking into the lazy .*? after the first {END} has been matched. There are other ways to solve this problem, which are discussed in the Lazy Trap section. Whenever the token quantified by a lazy quantifier is able to eat the delimiter, as in the above example or something like \d*?9, remember to embed the token and the delimiter together in an atomic group: (?>\d*?9)

Lazy Quantifiers are Expensive

It's important to understand how the lazy .*? works in this example because there is a cost to using lazy quantifiers. When it first encounters .*? the engine starts out by matching the minimum number of characters allowed by the quantifier, which is zero. The engine then advances in the pattern and tries the next token (which is {) against the M in Mary. This fails, so the engine backtracks and allows the .*? to expand its match by one item, so that it matches the M. Once again, the engine advances in the pattern. It now tries the { against the a in Mary. This fails, so the engine backtracks and allows the .*? to expand and match the a. The process then repeats itself: the engine advances, fails, backtracks, allows the lazy .*? to expand its match by one item, advances, fails and so on.

As you can see, for each character matched by the .*?, the engine has to backtrack. From a computing standpoint, this process of matching one item, advancing, failing, backtracking, expanding is "expensive". On a modern processor, for simple patterns, this will likely not matter. But if you want to craft efficient regular expressions, you must take care to use lazy quantifiers only when they are needed. Lower on the page, I will introduce you to a far more efficient way of doing things.
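Here is the lazy version run against the same test string, as a minimal Python sketch (the engine choice and variable names are mine; braces are escaped for portability).

```python
import re

text = "{START} Mary {END} had a {START} little lamb {END}"

# The lazy dot-star stops expanding as soon as {END} can match,
# so each {START}...{END} block is returned separately.
lazy_matches = re.findall(r"\{START\}.*?\{END\}", text)

print(lazy_matches)
# ['{START} Mary {END}', '{START} little lamb {END}']
```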

Negated Class Solution

Suppose we know that the character { will never be present between the delimiters {START} and {END}. Instead of the lazy quantifier, we can use a negated character class in our pattern: {START}[^{]*{END} The negated character class [^{]* greedily matches zero or more characters that are not an opening curly brace. Therefore, we are guaranteed that the match will never jump over the {END} delimiter. This is a more direct and efficient way of matching between {START} and {END}. Note that in this solution, we can fully trust the * that quantifies the [^{]. Even though it is greedy, there is no risk that [^{] will match too much as it is mutually exclusive with the { that starts {END}. This is the contrast principle from the regex style guide.
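A short Python sketch shows both the benefit and the limit of the negated class. The second test string, which contains a stray brace, is a hypothetical input I added to show what happens when the "no { between the delimiters" assumption is violated.

```python
import re

negated = re.compile(r"\{START\}[^{]*\{END\}")

text = "{START} Mary {END} had a {START} little lamb {END}"
clean_matches = negated.findall(text)
print(clean_matches)
# ['{START} Mary {END}', '{START} little lamb {END}']

# Hypothetical input with a { between the delimiters: [^{]* stops at the
# stray brace, {END} cannot match there, and the pattern finds nothing.
with_brace = "{START} a {b} c {END}"
brace_matches = negated.findall(with_brace)
print(brace_matches)  # []
```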

Tempered Greedy Token Solution

For the negated character class solution, we assumed that the character { would never be present between the delimiters {START} and {END}. Let's now remove this assumption. Going back to our original naive .* pattern to match between the delimiters, one way to ensure that the .* doesn't jump over the first {END} is to temper it with a negative lookahead. That is what this pattern does: {START}(?:(?!{END}).)*{END} If you look closely, you'll see that we still have a kind of dot-star—a more complex one. In (?:(?!{END}).)*, the * quantifier applies to a dot, but it is now a tempered dot. The negative lookahead (?!{END}) asserts that what follows the current position is not the string {END}. Therefore, the dot can never match the opening brace of {END}, guaranteeing that we won't jump over the {END} delimiter. When Not to Use this Technique For the task at hand, this technique presents no advantage over the lazy dot-star .*?{END}. Although their logic differs, at each step, before matching a character, both techniques force the engine to look if what follows is {END}. The comparative performance of these two versions will depend on your engine's internal optimizations. The pcretest utility indicates that PCRE requires far fewer steps for the lazy-dot-star version. On my laptop, when running both expressions a million times against the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the lazy version and 800 milliseconds for the tempered version. Therefore, if the string that tempers the dot is a delimiter that we intend to match eventually (as with {END} in our example), this technique adds nothing to the lazy dot-star, which is better optimized for this job. When to Use this Technique Suppose our boss now tells us that we still want to match up to and including {END}, but that we also need to avoid stepping over a {MID} section, if it exists. 
Starting with the lazy dot-star version to ensure we match up to the {END} delimiter, we can then temper the dot to ensure it doesn't roll over {MID}: {START}(?:(?!{MID}).)*?{END} If more phrases must be avoided, we just add them to our tempered dot: {START}(?:(?!{MID})(?!{RESTART}).)*?{END} This is a useful technique to know about.
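To see the tempered dot refuse to roll over {MID}, here is a minimal Python sketch. The two sample strings are made up for the illustration, and the braces are escaped for portability.

```python
import re

# Lazy dot-star whose dot is tempered so it can never step onto {MID}.
tempered = re.compile(r"\{START\}(?:(?!\{MID\}).)*?\{END\}")

plain = "{START} a {END}"
blocked = "{START} a {MID} b {END} {START} c {END}"

plain_matches = tempered.findall(plain)
blocked_matches = tempered.findall(blocked)

print(plain_matches)    # ['{START} a {END}']
# The first block contains {MID}, so only the second block is matched.
print(blocked_matches)  # ['{START} c {END}']
```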

Explicit Greedy Alternation Solution

Staying with the idea that the character { may be present between the delimiters {START} and {END}, let's look at an elegant technique that is more efficient than both the tempered greedy dot and the lazy dot star. We'll start with an unoptimized version: {START}(?:[^{]|{(?!END}))*{END} We still have a greedy quantifier *. This time, it does not apply to a dot but to a non-capturing group (?:…) that contains an alternation. On the left side of the alternation, [^{] matches one character that is not an opening brace. We can safely do this because we know that a non-{ character will never make us roll over the {END} delimiter. On the right side of the alternation, we are allowed to match a { as long as it is not followed by END}: the negative lookahead (?!END}) asserts that what follows the position after { is not END}. In our language of quantifier techniques, this is a tempered opening brace. The pattern can be further optimized. If we have several non-{ characters in a row (which will be the typical case), at the moment we have to enter and exit the alternation for every single character because the quantifier * applies to the non-capturing group (?:[^{]|{(?!END})). This seems inefficient. If we also had a quantifier on the [^{], we could match multiple non-{ characters without leaving the alternation. To do so, the first idea would be to use [^{]+. However, this leads to a situation where the * quantifier applies to the + quantifier. If the pattern fails, the engine will explore all the ways that the two quantifiers can divide up the "pie of characters", leading to needlessly long backtracking and the situation I call an explosive quantifier (we'll explore in a later section). What we want is to match any non-{ characters as a solid block that cannot be backtracked into. We do this with a possessive quantifier [^{]++ or an atomic group (?>[^{]+). 
While we're at it, we should also lock up the entire quantified alternation once we exit, because if {END} fails to match, backtracking into the alternation won't help. We also do this either with possessive quantifiers (turning the * into *+) or by wrapping the quantified alternation into an atomic group. We can use the possessive version in Java, PCRE (C, PHP, R…), Perl, Ruby 2+ and the alternate regex module for Python: {START}(?:[^{]++|{(?!END}))*+{END} We can use the atomic version in every major engine except Python and JavaScript: {START}(?>(?:(?>[^{]+)|{(?!END}))*){END} In any version of this solution, we do away with the generic dot by explicitly stating what we want: either any number of non-{ characters; or a { as long as it is not followed by END}. This is a prime example of the Say What You Want (and What You Don't Want) principle from the regex style guide. Note that for all the "normal" characters matched by the general case [^{]+ on the left side of the alternation, we don't need to look ahead. Indeed, we only look ahead when we encounter an opening brace—which might only be once, when we hit the {END} delimiter. Because we avoid the look-ahead-fail-backtrack rigmarole, we should expect this pattern to match faster than both the lazy dot-star and tempered-dot solutions, which both require "looking" at each step. This is confirmed by pcretest: Running the patterns a million times each on the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the atomic lazy-dot-star version, 800 for the tempered-dot version and 400 for the Explicit Greedy Alternation solution. Lengthening the test string to {START} Mary Ate a Little Lamb {END}, the gaps between the versions increase drastically: 800 milliseconds per 10,000 runs for the lazy-dot-star, 2,300 for the tempered-dot, and only 500 for the explicit-greedy-alternation solution. 
This solution takes a little more effort to write as you need to separate the brace case from the non-brace case, but it is well worth it if performance matters.
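As a quick check of the logic, here is the unoptimized form of the Explicit Greedy Alternation in Python. I deliberately leave out the possessive and atomic refinements so that the sketch runs on any Python 3 version; the test string with a stray {x} is my own.

```python
import re

# Either a non-brace character, or a { that does not begin END}.
explicit = re.compile(r"\{START\}(?:[^{]|\{(?!END\}))*\{END\}")

text = "{START} Mary {x} {END} had a {START} little lamb {END}"
explicit_matches = explicit.findall(text)

print(explicit_matches)
# ['{START} Mary {x} {END}', '{START} little lamb {END}']
```

The stray {x} is matched by the right side of the alternation, while the { of {END} is rejected by the lookahead, which ends the loop at the first delimiter.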

Unrolled Star Alternation Solution

The Explicit Greedy Alternation solution just above uses a subpattern of the form (A|B)*. In the trick to mimic an alternation quantified by a star, we see that this can be unrolled into A*(?:B+A*)*: zero or more As, then zero or more repetitions of one or more Bs followed by zero or more As. Applying this formula to (?:[^{]|{(?!END}))*, we obtain this fifth solution: {START}[^{]*(?:(?:{(?!END}))+[^{]*)*{END}

This solution has pros and cons. On the plus side, it is even faster than the Explicit Greedy Alternation solution it unrolls. pcretest reports that per 10,000 runs, the performance on the short test string is identical, but on the longer test string this solution clocks in at 400 milliseconds, compared to 500 milliseconds for the other. If you are looking to squeeze out every last drop of performance, this is the way to go.

On the minus side, the pattern is harder to read. While the intent of the alternation in the original is immediate, it is not the case once the alternation is unrolled. Moreover, one of the elements of the alternation is now repeated: when (A|B)* becomes A*(?:B+A*)* there are now two As. If you ever change A in one place, you may forget to do it in the other, which is a maintenance hazard. In my view, this is the kind of tweak that should be performed by the engine as an optimization behind the curtain.
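Since the unrolled pattern is harder to read, it is worth checking that it matches the same text as the alternation it replaces. The Python sketch below compares the two on a few sample strings of my own; it says nothing about speed, only about equivalence on these inputs.

```python
import re

explicit = re.compile(r"\{START\}(?:[^{]|\{(?!END\}))*\{END\}")
unrolled = re.compile(r"\{START\}[^{]*(?:(?:\{(?!END\}))+[^{]*)*\{END\}")

samples = [
    "{START} Mary {END} had a {START} little lamb {END}",
    "{START} a {b} {{ c {END}",
    "{START}{END}",
    "{START} never closed",
]

# Both patterns should return the same matches for every sample.
same_results = all(explicit.findall(s) == unrolled.findall(s) for s in samples)
print(same_results)  # True

stray_braces = unrolled.findall("{START} a {b} {{ c {END}")
print(stray_braces)  # ['{START} a {b} {{ c {END}']
```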

The Lazy Trap

In the right context, lazy quantifiers solve some problems. But if you become overconfident in their power, you can run into new problems. This section explains a common source of confusion. We will use this string: {START} Mary {END}00A {START} little lamb {END}01B

As before, we want to match an entire {START}…{END} group, except this time we want to extend the match after the {END} to include some digits and the letter B. At first, it may seem reasonable to tack on \d+B at the end of our lazy quantifier solution: {START}.*?{END}\d+B Looking at the test string above, what do you think this pattern matches? Keep reading when you've made up your mind.

The pattern matches the entire string from the very beginning to the very end. Do you see why? Lazy quantifiers can jump the fence. The .*? is supposed to expand until {END}\d+B is able to match. Starting the match at the very start of the string, the .*? has no reason to stop expanding after the first {END}, where \d+B cannot match. The .*? therefore continues to expand until a position where {END}\d+B is able to match. Starting the match at the beginning of the string, the shortest match is the whole string.

The lesson: remember that the engine allows a lazy quantifier to expand its match as much as needed to allow an overall match. If forced to, a lazy quantifier may jump the fence you thought you had made for it. To contain the .*? in .*?{END} to the section before the first instance of {END}, you need to tweak it or replace it using one of four techniques we have already seen:

- Bundle the characters preceding {END} together with {END} into an atomic group, forbidding the engine to backtrack and expand the .*? past the first {END}: (?>.*?{END})
- Use a Tempered Greedy Token: (?:(?!{END}).)*{END}
- Use an Explicit Greedy Alternation: (?:[^{]++|{(?!END}))*+{END}
- Use an Unrolled Star Alternation: [^{]*(?:(?:{(?!END}))+[^{]*)*+{END}
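The fence-jumping is easy to observe. In the Python sketch below, the naive lazy pattern swallows the whole string, while the tempered greedy token keeps each attempt inside a single block. I picked the tempered version for the comparison because it runs on every Python 3 version; the atomic and possessive variants need Python 3.11 or later, or the alternate regex module.

```python
import re

text = "{START} Mary {END}00A {START} little lamb {END}01B"

# Naive lazy version: .*? expands past the first {END} until \d+B can match.
naive = re.search(r"\{START\}.*?\{END\}\d+B", text).group()
print(naive == text)  # True

# Tempered greedy token: the dot can never step onto an {END}, so the
# attempt that starts at the first {START} fails, and the match is found
# at the second block.
contained = re.search(r"\{START\}(?:(?!\{END\}).)*\{END\}\d+B", text).group()
print(contained)  # {START} little lamb {END}01B
```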

The Longest Match and Shortest Match Traps

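Sometimes people will try to match a string such as 12 9876 with a regex such as \d+ and expect the engine to return 9876. The + quantifier is greedy, so isn't that the longest match? It is not what the engine returns. A greedy quantifier gives the longest match for the attempt that starts at a given position, and the first attempt that succeeds starts at the 1, so the engine returns 12.

- A{3} matches 3 As
- A{3,} matches 3 or more As
- A{,5} matches up to 5 As (in engines that accept an omitted minimum; others may treat {,5} as literal text or reject it)

How to match 3 or 5 As? For instance, {3|5} doesn't work. This does: ^A{3}(?:A{2})?$ The A{3} matches the first three As. The optional non-capturing group (?:A{2})? optionally matches another 2 As.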
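Both points can be verified in a few lines of Python. The max() call is my own addition: it is one straightforward way to get the longest run of digits when that is what you actually want, since the regex alone returns the first match, not the longest.

```python
import re

# The first successful attempt starts at the 1, so the match is 12.
first = re.search(r"\d+", "12 9876").group()
print(first)    # 12

# To get the longest run, collect all matches and compare their lengths.
longest = max(re.findall(r"\d+", "12 9876"), key=len)
print(longest)  # 9876

# Three or five As: A{3}, optionally followed by two more.
three_or_five = re.compile(r"^A{3}(?:A{2})?$")
accepted_lengths = [n for n in range(1, 8) if three_or_five.match("A" * n)]
print(accepted_lengths)  # [3, 5]
```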

The Explosive Quantifier Trap

If you're not careful, you can easily build a regular expression that has the potential to backtrack for a billion years, which is longer than most of us are prepared to wait at 10am on a Monday morning. Furthermore, such expressions can be used for regular expression denial of service attacks (ReDoS). This page aims to show you how to detect and troubleshoot such patterns. It's a companion to the tutorial about mastering regex quantifiers.

Introduction

Backtracking is a wonderful feature of modern regex engines: if a token fails to match, the engine backtracks to any position where it could have taken a different path. A greedy quantifier may then give up one character, a lazy quantifier may expand to match one more, or the rightmost side of an alternation may be tried. If a pattern continues to fail, the engine systematically explores all available paths. We have already seen situations where a naive engine can explore more paths than it needs to. For instance, you may remember when we used the pattern \b\d+E to match a series of digits ending with an E. Using this pattern on the string 1234, after the \d+ has finished its work the E token will fail to match. At that stage, it is wasteful for the engine to backtrack into the \d+ and explore to see if the token E might have matched after the 2 or the 3. Possessive quantifiers and atomic groups help us handle such situations by turning a quantified token or a subpattern into a block that cannot be backtracked into. In this example, the syntax for those is \b\d++E and \b(?>\d+)E. In this last example, the amount of potential backtracking needed is proportional to the length of the string. The potential damage isn't too severe. However, you can write regular expressions where the potential for backtracking in relation to the length of the string is exponential. This is so wildly inefficient that your regex engine may well choke. It is therefore vital to learn to recognize such expressions. When there is potential for wild backtracking, quantifiers are always at fault. To describe these situations, I speak of explosive quantifiers. In Mastering Regular Expressions, Jeffrey Friedl refers to these situations as exponential matches, while in The Regular Expressions Cookbook Jan Goyvaerts and Steven Levithan speak of catastrophic backtracking.

A Simple Example

Consider this simple pattern: ^(A+)*B

It is not as contrived as it looks: for instance, it could be a window to look at the problem raised by ^(A+|X)*B, where A might stand for [aeiou].

Let's see what happens when we try to match the string AAAC with that pattern ^(A+)*B.

- First, A+ matches all the A characters.
- The greedy * tries to repeat the A+ token, but there are no characters left to match.
- The engine advances to the next token: B fails to match.
- The engine backtracks, and the A+ gives up the third A.
- The greedy * tries to repeat A+ and matches the third A.
- The engine advances to the next token: B fails to match.
- The greedy * gives up the second A+ token, i.e. the third A.
- The engine advances to the next token: B fails to match.
- Now the original A+ gives up the second A…

Do you see where this is going?

Table of Combinations

The table below shows the combinations the engine will try for (A+)*. Since the * quantifies the A+ token, several A+ tokens can contribute to what (A+)* matches at any given time. Each column corresponds to the text matched by one of these A+ tokens. But don't worry too much about the details of the table. What matters is the number of rows.
A+ | A+ | A+
AAA | - | -
AA | A | -
AA | - | -
A | AA | -
A | A | A
A | A | -
A | - | -
When the engine finally gives up, it has tried to match the final B token eight times while (A+)* contained various strings. In fact, on these eight attempts (A+)* only contained four distinct strings (the empty string, A, AA and AAA), but some of these were tried multiple times as they were matched in different ways. For instance, we reached AAA in four ways:

- With one single A+ token matching AAA
- With one A+ token matching AA and another matching A
- With one A+ token matching A and another matching AA
- With three A+ tokens all matching a single A

Note that we tried the final B token eight times, but it took many more steps for the engine to fail, because each time we reached the final B we had to backtrack. The debugger in RegexBuddy says that it needs 28 steps before it can fail. In contrast, the possessive ^(A+)*+B and the atomic ^(?>(A+)*)B (which are fair comparisons for this simplified pattern) respectively fail after 5 and 7 steps.

I'm sure you can guess what happens when we try the pattern ^(A+)*B against longer strings. The number of steps required to fail explodes.
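If you would like to see the rows of the table generated rather than take my word for them, here is a minimal Python sketch. It does not run a regex engine: it only enumerates, in the order a greedy backtracking engine would reach them, the ways successive A+ tokens can carve up a run of As. The function name combinations is my own invention for this illustration.

```python
def combinations(run_length):
    """List every way successive A+ tokens can split a prefix of the run.

    Each entry is a tuple of chunk lengths: (2, 1) means one A+ matched AA
    and the next A+ matched A. The empty tuple is the final attempt, where
    (A+)* matches nothing at all.
    """
    def explore(used, chunks):
        # Greedy order: the next A+ first grabs as many As as it can,
        # then gives them back one at a time.
        for size in range(run_length - used, 0, -1):
            yield from explore(used + size, chunks + (size,))
        # Only once every longer continuation has failed is B tried
        # with the chunks we currently hold.
        yield chunks

    return list(explore(0, ()))


for chunks in combinations(3):
    print(" | ".join("A" * size for size in chunks) or "(empty)")

print(len(combinations(3)), "attempts at B for AAA")
```

Each extra A doubles the length of the list, which is the whole story of this page in one line.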
Number of As | Steps to Fail
1, e.g. AC | 7
2, e.g. AAC | 14
3, e.g. AAAC | 28
4 | 56
5 | 112
10 | 3,584
20 | 3,670,016 (RegexBuddy has given up. How about your program?)
100 | 4,436,777,100,798,802,905,238,461,218,816
I'm sure you can see the pattern: for each A character we add to the subject string, the number of steps required to reach failure doubles. From a computational standpoint, this exponential growth is a nightmare. For you big-O geeks out there, the complexity of exploring all the combinations is O(2^n). Fortunately, RegexBuddy gives up at three million steps (to compute the last two rows, I just multiplied an earlier row by a power of two). But other tools may not give up, and if your language imposes no limit on backtracking or regex matching time, you could be shipping the kind of software everyone loves to rant about.
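The doubling can be written down as a one-line formula. The sketch below is only an extrapolation of the step counts reported above (7 steps for one A, doubling with each extra A); it is not a measurement of any engine, and steps_to_fail is a name I made up for the occasion.

```python
def steps_to_fail(num_as):
    # Assumption: 7 steps for a single A, doubling for each additional A,
    # as in the step counts quoted in the text.
    return 7 * 2 ** (num_as - 1)


for n in (1, 2, 3, 4, 5, 10, 20, 100):
    print(f"{n:>3} As: {steps_to_fail(n):,} steps")
```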

Identifying Explosive Quantifiers

We need to learn to recognize situations where this kind of explosion can occur. But how? Let us return mentally to the table of combinations, where each row represents a distinct way for various A+ tokens to match some of the characters in AAA. In the simplest terms, you can think of what is happening here as a situation where various A+ tokens (which are spawned by the * quantifier) "negotiate" to divide up the "pie of characters" as the engine explores possible combinations. You can interpret the table's rows as a tug of war between three potential A+ tokens.

Of course this metaphor of "negotiation" or "tug of war" is not how things really happen: what we have is a regex engine systematically exploring all possible paths. But the metaphor is helpful in our search for symptoms. What we're looking to avoid is such tugs of war: situations where multiple tokens (either explicitly present in the pattern or spawned on the fly by a quantifier) can match the same portion of the subject string. That's easier said than done because such situations can arise in a number of ways. The following sections aim to make us alert to four kinds of symptoms.

Narrowing the Definition of Quantifier

In the following sections, when I mention quantifiers, let's agree that I won't be referring to quantifiers that cause minimal or no repetition, such as the gentle ? or the plain {2}. The quantifiers we worry about are those that can repeat a token many times, resulting in an explosion in the number of combinations the engine needs to explore. In that basket, we should include:

- The plus + (one or more)
- The star * (zero or more)
- Range quantifiers with no upper boundary, such as {3,}
- Finite quantifiers with two or more digits, such as {10}. Remember that two to the tenth power is 1024 (in our example, the engine took 3,584 steps to fail a string with ten As). In particularly vicious configurations, even smaller quantifiers can be explosive.
- Range quantifiers with a fixed upper boundary made up of two or more digits, such as {3,10}, for the same reasons as above.

Symptom: A Quantifier Compounds a Quantifier

Whenever you see that a quantifier applies to a token that is already quantified, as in (A+)*, there is potential for the number of steps to explode. Often, the "compounding quantifier" pattern happens when the outside quantifier applies to an alternation, as in (?:\D+|0(?!1))*. Unless you pay attention, you can miss that the (\D+…)* constitutes an explosive quantifier. The lesson here is that when a quantifier needs to apply to another quantifier, we need to prevent the engine from backtracking. We achieve this either by: making the outer quantifier possessive, e.g. (?:\D+|0(?!1))*+ or enclosing the expression in an atomic group, e.g. (?>(?:\D+|0(?!1))*)
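If your engine has neither possessive quantifiers nor atomic groups (Python's re module only gained them in version 3.11), one well-known workaround is to capture inside a lookahead and then consume the capture with a back-reference: (?=(A+))\1. The lookahead is not backtracked into once it has succeeded, so the pair behaves like an atomic A+. This is a swapped-in technique rather than the syntax shown above, and the sketch applies it to the earlier ^(A+)*B example.

```python
import re

# The explosive pattern from the simple example.
explosive = re.compile(r"^(A+)*B")

# Emulated atomic group: the lookahead captures the whole run of As,
# then \1 consumes exactly that text with no way to give part of it back.
contained = re.compile(r"^(?:(?=(A+))\1)*B")

# On short subjects the two patterns agree on whether there is a match.
for subject in ("AAAB", "B", "AAAC", "C"):
    assert bool(explosive.match(subject)) == bool(contained.match(subject))

# On a long failing subject, only the contained version is safe to run.
long_failure = "A" * 60 + "C"
print(contained.match(long_failure))
```

I deliberately do not run the explosive version against the 60-character subject: that is the point of the exercise.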

Symptom: Contiguous Quantified Tokens are Not Mutually Exclusive

Consider this pattern: ^\d+\w*@

The \d and the \w are both able to match digits: they are not mutually exclusive. Against a string such as 123, the pattern must fail. While trying all the possibilities in order to find the match, the engine will let the \d+ give up characters that will be matched by the \w*. Exploring these paths takes time: the engine takes 16 steps to reach failure. Adding one digit to the test string, e.g. 1234, the engine takes 25 steps to fail. With ten digits, it takes 121 steps. With 100 digits, it takes 10,201 steps.

The situation is clearly far better than in the first example. The number of steps required to fail in relation to the size of the string does not grow exponentially, but it still explodes: the step counts above fit (n+1)^2 exactly, so the complexity is quadratic, i.e. O(n^2). At that rate, it takes about 1,000 digits to reach a million steps. That's a lot more than many subject strings but a lot less than others: it is only a page-and-a-half of average text.

The lesson here is to try to use contiguous tokens that are mutually exclusive, following the rule of contrast from the regex style guide.
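To see where the quadratic growth comes from, we can count how many times the @ token gets tried against a string made only of digits. The sketch below models the backtracking by hand rather than instrumenting an engine, so it counts attempts at @ and not RegexBuddy's steps; the function name is hypothetical.

```python
def attempts_at_final_token(num_digits):
    """Count how often '@' is tried when ^\\d+\\w*@ fails on a run of digits."""
    attempts = 0
    # \d+ starts with every digit, then gives them back one at a time
    # (it must keep at least one).
    for digits_len in range(num_digits, 0, -1):
        # \w* grabs whatever \d+ left behind, then gives that back too.
        for word_len in range(num_digits - digits_len, -1, -1):
            attempts += 1  # '@' is tried here, and fails
    return attempts


for n in (3, 4, 10, 100):
    print(n, "digits:", attempts_at_final_token(n), "attempts at @")
```

The count works out to n(n+1)/2, which grows with the square of the length, just like the step counts above.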

Symptom: Tokens in Alternation are Not Mutually Exclusive

Let us now consider a variant of our last faulty pattern: ^(?:\d|\w)+@

This too will fail against 123. On the first attempt, each digit will be matched by a \d token, as it is the leftmost side of the alternation. When the @ token fails to match, the engine will backtrack into each alternation and let the \w side match characters that were previously matched by the \d. The engine takes 60 steps to fail. Adding one digit to the test string, e.g. 1234, the engine takes 124 steps to fail. With ten digits, it takes 8,188 steps. With 16 digits, it takes 524,284. For longer strings, RegexBuddy maxes out. The complexity of exploring all the combinations is O(2^n).

Clearly, this is far worse than the previous pattern ^\d+\w*@, which at first sight looks fairly similar. Why? With the earlier pattern, the engine must find a match that is a series of digits \d, then optionally a series of word characters \w. The pie is always divided in that order: first \d tokens, then \w tokens. In contrast, the second pattern ^(?:\d|\w)+@ gives us many more ways to divide up the pie. The pie can be claimed by word character tokens first, then digit tokens. Or by word character tokens and digit tokens intermingled in every way imaginable.

In the literature, this symptom is usually shown in the form (A|AA)+, but in my view that's not really helpful. Why would you ever write such a silly pattern? Of course ^(?:\d|\w)+@ is silly too, but it brings out the salient symptom, which is that various components in a quantified alternation are able to "compete" for the same characters.

The lesson here is that when we build an alternation that is quantified, we must make sure that distinct branches cannot match the same characters.

Do character classes present the same risk?
Our vicious pattern ^(?:\d|\w)+@ could be written with a character class: ^[\d\w]+@ Let's forget for a moment that we would never write such a silly pattern—like the others, it is only meant to help us explore potentially explosive patterns. On the face of it, this pattern does the exact same thing as the version with the alternation: at each step, the engine can match either a digit or a word character. Surely it too must explode when the engine fails to find a match, right? It is not so. Suppose we try ^[\d\w]+@ against the string 123. First, [\d\w]+ greedily matches all the digits. For a moment, let's assume that each of those digits (1, 2, 3) was matched by the \d token inside the character class. Please note that we don't know this for a fact. One engine might notice that \d is a subset of \w and optimize the entire character class to \w before even starting the match attempt. Another engine might have its own set of rules about which tokens in a character class to try first. After the @ token fails to match, the engine looks for positions to backtrack. First, the [\d\w]+ gives up the 3. The engine tries to match the 3 with the token @, and fails. At this stage, in the alternation version, the engine would have tried to match the 3 with the \w token on the right side of the alternation. In this case, however, the engine does not attempt the \w inside the [\d\w]. A character class constitutes a solid block, an atomic token. Once it matches a character, you don't backtrack into it to try different ways to make it match. When you give it up, you give it up. Therefore, after the @ token fails to match the 3, the engine's next move is to backtrack once more and force the [\d\w] to give up the 2. Next, the @ token fails to match the 2. There is nothing left to backtrack, and the match attempt fails. In RegexBuddy's way of counting, reaching that point takes seven steps. 
The number of steps required to explore all the paths is directly proportional to the length of the string: the complexity is O(n), which is the best you can ask for, short of making the character class's quantifier possessive — [\d\w]++ — or enclosing it in an atomic group (?>[\d\w]+).
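A quick way to convince yourself that the two spellings describe the same set of matches, and differ only in how much work failure costs, is to run both on a handful of subjects. This is a small sketch using Python's re module; the subjects are arbitrary, and the longest one is kept to 16 digits so that the alternation version still finishes quickly.

```python
import re

with_alternation = re.compile(r"^(?:\d|\w)+@")
with_class = re.compile(r"^[\d\w]+@")


def matched_text(pattern, subject):
    """Return the matched text, or None when the pattern fails."""
    match = pattern.match(subject)
    return match.group() if match else None


subjects = ["123", "123@", "abc_9@example", "@", "12 3@", "1" * 16]

for subject in subjects:
    # Same verdict, same matched text, whichever way the pattern is spelled.
    assert matched_text(with_alternation, subject) == matched_text(with_class, subject)
    print(repr(subject), "->", matched_text(with_class, subject))
```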

Symptom: Remote Quantifiers Can Reach into Each Other's Territory

If you're reading this page, I'm sure you can tell what's wrong with a pattern such as ^.*A.*AB Suppose our string is AAAAA. The first dot-star can match the whole string, nothing at all, or anything in between. The second dot-star can match a considerable portion of the string, nothing at all, or anything in between. Before the engine can determine that the match must fail, there will be a tug of war between the two dot-stars. It takes 53 steps for the RegexBuddy engine to fail on this short string, and 178 steps on a string that contains ten A characters. The regex in this example is so short—and we are so used to distrusting dot-stars—that it probably jumps out at you that one dot-star can overreach into the other's territory. But the same situation can arise in less obvious ways. Consider this pattern, which is only slightly longer than the previous one: ^\d*?1\d*?1B The lazy \d*? seems to only want to extend up to the first 1, while the second \d*? extends to the second 1. That seems legit. But when the engine has trouble finding a match, the first \d*? can in fact jump over the first 1 if there are more ones to swallow. You may indeed remember from the Lazy Trap that lazy quantifiers can jump over the fence you thought you had made for them, because they expand as far as needed in order to allow a match. The delimiter 1 is not a true fence because the \d token can match it if it needs to. For instance, against the string 11111C, where the match must clearly fail, at one stage the first \d*? will match all the ones. It takes 59 steps for the engine to fail. With ten ones, it takes 189 steps, and with a hundred ones, it takes 15,354 steps. Once again, we have an explosive quantifier—although nowhere as bad as in our exponential example. 
If you thought the ^\d*?1\d*?1B was easy to spot, consider that the same phenomenon could be embedded in something like this: .*?{START} (lots of stuff in between) .*?{END}

In my view, this is a lot harder to spot, unless you are sensitive to whether the tokens quantified by a lazy quantifier are able to match their intended delimiters.

The lesson here is to carefully consider whether a quantified token might reach over into a section of the string that you had intended for another token to match. To contain the .*? in .*?{START} to the section before {START}, you need to tweak it or replace it using one of four techniques we have already seen:

- Bundle the characters to be matched before {START} together with {START} into an atomic group, forbidding the engine to backtrack and expand the .*? past the first {START}: (?>.*?{START})
- Use a Tempered Greedy Token: (?:(?!{START}).)*{START}
- Use an Explicit Greedy Alternation: (?:[^{]++|{(?!START}))*+{START}
- Use an Unrolled Star Alternation: [^{]*(?:(?:{(?!START}))+[^{]*)*{START}
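Here is a small Python sketch of the fence-jumping behavior and of the tempered greedy token that prevents it. The subject string and the trailing x (standing in for "the rest of the pattern") are made up for the illustration, and I escape the braces to keep them unambiguously literal.

```python
import re

# Two {START} delimiters; only the second one is followed by x.
subject = "a{START}b{START}x"

# Lazy dot-star: when x fails after the first {START}, the .*? simply
# expands past that delimiter and tries again further along.
lazy = re.compile(r"^.*?\{START\}x")

# Tempered greedy token: each character is matched only if it does not
# begin {START}, so the first delimiter is a real fence.
tempered = re.compile(r"^(?:(?!\{START\}).)*\{START\}x")

print(lazy.match(subject))      # a match spanning the whole subject
print(tempered.match(subject))  # None: the first {START} is not followed by x
print(tempered.match("a{START}x"))
```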

Further Reading

- Time Out feature in C#
- Regular Expression Denial of Service (ReDoS) [Wikipedia]
- Static Analysis for Regular Expression Denial-of-Service Attacks [PDF]

Mastering Conditional Regex

Conditionals are one of the least used components of regex syntax. Granted, not all engines support them. But in my view, the main reason for the low use of conditionals is that the situations in which they do a better job than alternate constructs are poorly known. This page aims to explain the details of regex conditional syntax and to present the typical situations where using conditionals makes sense.

Support for Regex Conditionals

You can use conditionals in the following engines:

- .NET (C#, VB.NET etc.)
- PCRE (C, PHP, R…)
- Perl
- Python
- Ruby 2+

Some of these engines are able to test a richer set of conditions than others. We'll see these differences as we go.

Basic Conditional Syntax

The regex conditional is an IF…THEN…ELSE construct. Its basic form is this: (?(A)X|Y)

This means "if proposition A is true, then match pattern X; otherwise, match pattern Y."

Often, you don't need the ELSE case or the THEN case:

- (?(A)X) says "if proposition A is true, then match pattern X."
- (?(A)X|) means the same, but the alternation bar can be dropped.
- (?(A)|X) amounts to saying "if proposition A is not true, then match pattern X." If you translate the IF…THEN…ELSE construction literally, it says "if proposition A is true, then match the empty string (which always matches at every position), otherwise match pattern X."

Proposition A

Proposition A can be one of several kinds of assertions that the regex engine can test and determine to be true or false. These various kinds of assertions are expressed by small variations in the conditional syntax. Proposition A can assert that:

- a numbered capture group has been set
- a named capture group has been set
- a capture group at a relative position to the current position in the pattern has been set
- a lookaround has been successful
- a subroutine call has been made
- a recursive call has been made
- embedded code evaluates to TRUE

Checking if a Numbered Capture Group has been Set

To check if a numbered capture group has been set, we use something like: (?(1)foo|bar)

In this exact pattern, if Group 1 has been set, the engine must match the literal characters foo. If not, it must match the literal characters bar. But the alternation can contain any regex pattern, for instance (?(1)\d{2}\b|\d{3}\b)

A realistic use of a conditional that checks whether a group has been set would be something like this: ^(START)?\d+(?(1)END|\b)

Here is how this works:

- The ^ anchor asserts that the current position is the beginning of the string.
- The parentheses around (START) capture the string START to Group 1, but the ? "zero-or-one" quantifier makes the capture optional.
- \d+ matches one or more digits.
- The conditional (?(1)END|\b) checks whether Group 1 has been set (i.e., whether START has been matched). If so, the engine must match END. If not, the engine must match a word boundary.

The net result is that the pattern matches digits that are either embedded within START…END at the beginning of the string, or standing by themselves at the beginning of the string. To achieve the same effect without a conditional, we could use ^(?:START\d+END|\d+\b), which forces us to repeat the \d+ token.
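Python's re module accepts this syntax as is, so the two versions can be compared side by side. A minimal sketch, with test subjects of my own choosing:

```python
import re

with_conditional = re.compile(r"^(START)?\d+(?(1)END|\b)")
without_conditional = re.compile(r"^(?:START\d+END|\d+\b)")


def matched_text(pattern, subject):
    """Return the matched text, or None when the pattern fails."""
    match = pattern.match(subject)
    return match.group() if match else None


for subject in ("START123END", "123", "123 apples", "START123", "123END"):
    result = matched_text(with_conditional, subject)
    # The conditional and the plain alternation should always agree.
    assert result == matched_text(without_conditional, subject)
    print(repr(subject), "->", result)
```

START123 fails because the END suffix is missing, and 123END fails because there is no word boundary between the 3 and the E.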

Checking if a Named Capture Group has been Set

To check whether the Group named foo has been set, use this syntax:

- (?(foo)…|…) works in .NET, PCRE (C, PHP, R…) and Python
- (?(<foo>)…|…) works in Perl, PCRE (C, PHP, R…) and Ruby
- (?('foo')…|…) works in Perl, PCRE (C, PHP, R…) and Ruby

This example would work in .NET and PCRE: ^(?<UC>[A-Z])?\d+(?(UC)_END)$

- With (?<UC>[A-Z]) the optional capture group named UC captures one upper-case letter.
- \d+ matches digits.
- The conditional (?(UC)_END) checks whether the group named UC has been set. If so, it matches the characters _END.

This pattern would match the string A55_END as well as the string 123.
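In Python's re module the conditional itself is written the same way, but the named group must be declared with (?P<UC>…) rather than (?<UC>…). Here is the same example as a sketch, with a few extra subjects of my own to show the failures:

```python
import re

# Same pattern as above, with Python's (?P<name>...) spelling for the group.
pattern = re.compile(r"^(?P<UC>[A-Z])?\d+(?(UC)_END)$")

for subject in ("A55_END", "123", "A55", "55_END"):
    print(repr(subject), "->", bool(pattern.match(subject)))
```

A55 fails because the captured letter demands the _END suffix, and 55_END fails because without the letter the suffix is not allowed.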

Checking if a Capture Group at a Relative Position has been Set

In PCRE (but not .NET, Perl, Python or Ruby), you can check whether a capture group at a relative position has been set. The relative position can be to the left or to the right of the conditional.

Checking a relative group to the left

To specify that the relative group to be checked is back from our current position in the pattern, we place a minus sign - in front of an integer. For instance, (?(-2)…|…) checks whether the second capture group to the left of our current position in the pattern has been set. Therefore, (?(-1)X|Y) says: if the nearest capture group to the left of this conditional has been set, match pattern X; otherwise, match pattern Y.

Using a relative group in a conditional comes in handy when you are working on a large pattern, some of whose parts you may later decide to move. It can be easier to count a relative position such as -2 than an absolute position.

Checking a relative group to the right

Although this is far less common, you can also use a forward relative group. This time, we use a + sign in front of an integer: (?(+1)X|Y)

This says: if the nearest capture group to the right of this conditional has been set, match pattern X; otherwise, match pattern Y. But how, you may ask, can a capture group to the right of the current position in the pattern already have been set? This can happen in various ways:

- The conditional and the group live inside a quantified group. For instance, (?:A(?(+1)B)(C))+ matches ACABC. On the first pass through the repeated group, the conditional fails as C has not yet been captured. On the second pass, the conditional succeeds.
- The conditional has been reached through a subroutine call. For instance, (A(?(+1)B)(C))(?1) matches ACABC. Inside the parentheses that define Group 1, the conditional fails as C has not been captured. On the subroutine call (?1), the conditional succeeds.
- The conditional has been reached through a recursive call. For instance, (A(?(+1)B)(C)(?R)?D) matches ACABCDD. At the outer level, the conditional fails as C has not been captured. At the first depth of recursion, it succeeds.

Checking if a Lookaround has been Successful

In .NET and PCRE (C, PHP, R…), a conditional can check whether a lookaround can succeed at the current position. For instance, suppose you wish to match the first word of a string, which by default is a vegetable. However, if the string ends with _FRUIT, the first word must be a fruit rather than the default vegetable. You can use this: ^(?(?=.*_FRUIT$)(?:apple|banana)|(?:carrot|pumpkin))\b After the ^ anchor asserts that the current position is the beginning of the string, the conditional (?(?=.*_FRUIT$)…|…) checks whether the lookahead (?=.*_FRUIT$) can succeed. That lookahead asserts that at the current position, the engine can match any characters, then _FRUIT and the end of the string. If the lookahead succeeds, we match a fruit: (?:apple|banana). Otherwise, we match a vegetable: (?:carrot|pumpkin) Without a conditional, this would be a bit heavier to express: ^(?:(?:apple|banana)(?=.*_FRUIT$)|(?:carrot|pumpkin)(?!.*_FRUIT$))\b

Checking if a Subroutine Call has been Made

In Perl and PCRE (C, PHP, R…) you can check whether we are currently in the middle of a call to a specific subroutine. In the case one subroutine call is nested within another, the conditional test succeeds only if the specific subroutine being tested was the last one called.

For these tests, we can use both named and numbered subroutines. For instance, (?(R1)…|…) checks whether we are in the middle of a call to subroutine 1, and (?(R&foo)…|…) checks whether we are in the middle of a call to a subroutine named foo.

Consider this pattern: (A(?(R1)B|C))(?1)

It matches the string ACAB.

- The parentheses around (A…) define Group 1 and Subroutine 1. First, we match the character A.
- The conditional (?(R1)B|C) checks whether we are in the middle of a call to subroutine 1. After matching the string's initial A, it is not true that we have reached this point in the pattern via a subroutine call, so we must match the pattern in the ELSE branch of the conditional, which is the letter C.
- (?1) is a call to subroutine 1. First, we match another A. The conditional check succeeds as we have reached this point via a call to subroutine 1, so we must match the pattern in the THEN branch, which is the letter B.

Here is the same, but using a named subroutine: (?<foo>A(?(R&foo)B|C))(?&foo)

- The parentheses around (?<foo>A…) define a capture group and subroutine named foo. First, we match the character A.
- The conditional (?(R&foo)B|C) checks whether we are in the middle of a call to the subroutine named foo. After matching the string's initial A, it is not true that we have reached this point in the pattern via a subroutine call, so we must match the pattern in the ELSE branch of the conditional, which is the letter C.
- (?&foo) is a call to the subroutine named foo. First, we match another A. The conditional check succeeds as we have reached this point via a call to the subroutine named foo, so we must match the pattern in the THEN branch, which is the letter B.
Nested Subroutine Calls Suppose a part of the pattern calls subroutine 2, which then calls subroutine 1. Once inside subroutine 1, the engine encounters a conditional check on whether subroutine 2 has been called. Even though we are currently within a call to subroutine 2, the conditional test fails because what matters is the last subroutine call that was made—which is the call to subroutine 1. We can see this with these two patterns: (A(?(R1)C))(B(?1))(?2) matches ABACBAC. Within it, (A(?(R1)C)) matches A, (B(?1)) matches BAC and (?2) matches BAC again. (A(?(R2)C))(B(?1))(?2) matches ABABA but not all of ABABAC. Within it, (A(?(R2)C)) matches A, (B(?1)) matches BA, and (?2) matches BA again. The conditional (?(R2)C) fails even when reached via (?2), as the most recent subroutine call when it is reached is the one made by (?1).

Checking if a Recursive Call has been Made

In Perl and PCRE (C, PHP, R…), the conditional (?(R)…|…) checks whether we have reached this point in the pattern via a recursive call. Consider this pattern: A(?(R)B)(?R)?C

It matches the string AABCC.

- The first time we encounter the conditional, we have not made a recursive call, so we do not have to match a B. The outer level of the recursive match will be A…C
- The second time we encounter the conditional, we are in the middle of a recursive call, so we must match a B. If we don't recurse again, the depth 1 match is ABC, and the overall match is AABCC.

Checking that Embedded Code Evaluates to TRUE

In Perl, a conditional can check that an embedded fragment of Perl code evaluates to TRUE. The basic syntax for this is (?(?{Perl code})…|…) For instance, suppose you are using the variable $currency as a Boolean flag. The pattern \d+(?(?{$currency}) dollars) matches two kinds of strings. When $currency is set to FALSE, the conditional test fails and the pattern only matches a series of digits, such as 122. When $currency is set to TRUE, the conditional test succeeds and the pattern matches strings such as 55 dollars.

Conditionals At Work: Balancing Delimiters

Suppose that in a body of text we want to match strings enclosed in two kinds of delimiters:

- If the string starts with {{ it must end with }}
- If the string starts with BEGIN: it must end with :END

We can use this conditional regex: (?:(BEGIN:)|({{)).*?(?(1):END)(?(2)}})

This will match {{foo}} and BEGIN:bar:END

- The non-capturing group (?:(BEGIN:)|({{)) matches the opening delimiter, either capturing BEGIN: to Group 1 or capturing {{ to Group 2.
- .*? matches any characters, lazily expanding up to a point where the rest of the pattern can match.
- The conditional (?(1):END) checks if Group 1 has been set. If so, the engine must match :END
- The conditional (?(2)}}) checks if Group 2 has been set. If so, the engine must match }}

Alternative Solution

This can also be solved with a plain alternation: BEGIN:.*?:END|{{.*?}}

However, this expression becomes increasingly complex when we add potential delimiter pairs, such as <== … ==>, or when the content to be matched between the delimiters turns into a longer pattern, as this pattern must be repeated on each branch of the alternation.
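Both versions can be tried out in Python. In this sketch I escape the braces to keep them unambiguously literal; the sample text and the mismatched subjects are my own.

```python
import re

with_conditionals = re.compile(r"(?:(BEGIN:)|(\{\{)).*?(?(1):END)(?(2)\}\})")
plain_alternation = re.compile(r"BEGIN:.*?:END|\{\{.*?\}\}")

text = "x {{foo}} y BEGIN:bar:END z"


def all_matches(pattern, subject):
    """Return the full text of every match in the subject."""
    return [match.group() for match in pattern.finditer(subject)]


print(all_matches(with_conditionals, text))
print(all_matches(plain_alternation, text))

# Mismatched delimiters must not match: each opener demands its own closer.
for mismatched in ("{{foo:END", "BEGIN:foo}}"):
    print(repr(mismatched), "->", with_conditionals.search(mismatched))
```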

Conditionals At Work: Controlling Failure

This section relies on the classic trick (?!) to force failure. As a reminder, Perl and PCRE (C, PHP, R…) also allow you to use (*F) and (*FAIL)

Just as we can use a conditional to match a certain pattern if (or unless) condition X is met, we can use a conditional to force a match attempt to fail if (or unless) condition Y is met.

Fail If X

Suppose we're interested in matching digits \d+ in certain contexts. The digits must be followed by either END or _end. However, if they are preceded by BEG, then END is the only allowable suffix. Therefore, BEG12_end cannot match, whereas BEG00END, 00END and 00_end all match. We can use this pattern: ^(BEG)?\d+(?:END|_end(?(1)(?!)))$

- (BEG)? optionally matches BEG, capturing the characters to Group 1.
- \d+ matches the digits.
- (?:END|_end(?(1)(?!))) matches either END or _end. On the _end branch, the conditional (?(1)(?!)) checks if Group 1 has been set (i.e., we matched BEG earlier), and if so, the THEN branch (?!) forces the match attempt to fail.

Fail Unless Y

Let's give a slight tweak to the context in which we'd like to match digits. The digits must still be followed by either END or _end. However, if they end with END, then BEG is the only allowable prefix. Therefore, 00END cannot match, whereas BEG00END, BEG12_end and 00_end all match. We can use this pattern: ^(BEG)?\d+(?:_end|END(?(1)|(?!)))$

- (BEG)? optionally matches BEG, capturing the characters to Group 1.
- \d+ matches the digits.
- (?:_end|END(?(1)|(?!))) matches either _end or END. On the END branch, the conditional (?(1)|(?!)) checks if Group 1 has been set (i.e., we matched BEG earlier); if not, the ELSE branch (?!) forces the match attempt to fail.

In the example on self-referencing groups, one of the alternate solutions will show a powerful way to use conditionals to control failure in the context of .NET balancing groups.
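Both patterns run unchanged in Python's re module, which makes it easy to check the four subjects discussed above. A minimal sketch; the helper name accepted is my own.

```python
import re

# Fail If X: _end is forbidden when BEG was matched.
fail_if_beg = re.compile(r"^(BEG)?\d+(?:END|_end(?(1)(?!)))$")

# Fail Unless Y: END is forbidden unless BEG was matched.
fail_unless_beg = re.compile(r"^(BEG)?\d+(?:_end|END(?(1)|(?!)))$")


def accepted(pattern, subjects):
    """Return the subjects that the pattern matches."""
    return [subject for subject in subjects if pattern.match(subject)]


subjects = ["BEG00END", "BEG12_end", "00END", "00_end"]

print(accepted(fail_if_beg, subjects))      # everything except BEG12_end
print(accepted(fail_unless_beg, subjects))  # everything except 00END
```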

Conditionals At Work: Self-Referencing Group

This is an advanced technique that you should feel free to skip if you just want to get the gist of conditionals. However, it is required for the black belt program. :)

Suppose we want to match strings such as AAA foo BBB, which is framed by the same number of As and Bs. In Perl and PCRE (C, PHP, R…) we could use a recursive solution, such as \A(A(?:(?1)|[^AB]*)B)\z (This also works in Ruby if we replace the (?1) with a \g<1>.)

But if we want to balance a greater number of tokens, as in AAA foo BBB bar CCC baz DDD, it can become interesting to use self-referencing groups, as seen on the page about Quantifier Capture and on the trick to match line numbers. For our task of balancing As with Bs in strings such as AAA foo BBB, we could use something like: ^(?:A(?=A*+[^AB]*+((?(1)\1)B)))++[^B]*+\1$

I know… Please don't scream, we'll ease in gently. One feature of this pattern is that capture Group 1 ((?(1)\1)B) refers to itself with the conditional (?(1)\1). In other words, the group says:

- If Group 1 has already been set, match the current content of the Group 1 capture buffer.
- Match B, regardless of whether Group 1 has been set.

This construction has the effect that with each pass through Group 1, the Group 1 capture buffer gets longer by one character B. On the first pass, Group 1 has not been set, so the THEN branch of the conditional does not apply, and Group 1 captures one single B. On the second pass, the conditional applies, so the parentheses must match \1 (a back-reference to Group 1, which at this stage is B) and one additional B. At this stage, Group 1 contains BB. On the third pass, \1 is BB, so the parentheses must capture BBB… and so on.

Thanks to this construction, the quantified group (?:A…)+ matches the A characters one by one, and for each A that is matched, the Group 1 capture buffer grows by one B. By the time we exit (?:A…)+, we have matched as many As as the number of Bs captured in Group 1.
Later in the pattern, a simple back-reference \1 to Group 1 matches these Bs.

Alternate Solutions

Inside the self-referencing group ((?(1)\1)B), instead of using a conditional, we could use an optional (but possessive) back-reference to Group 1: \1?+. If Group 1 is set, it is matched. And the possessive + forbids the engine from backtracking and giving up the back-reference.

We've already looked at the recursive solution. Let's look at a beautiful solution in .NET.

Balancing Groups

In .NET, we can use balancing groups. This solution also uses a conditional, which is another example of a conditional to control failure. As a reminder, the task is to match strings where the number of As and Bs is balanced, as in AAA foo BBB. We can use this: ^(?<Count_A>A)+[^AB]*(?<-Count_A>B)+(?(Count_A)(?!))$

- (?<Count_A>A)+ matches all the As, adding each individual A to the CaptureCollection named Count_A. I gave the group that name because we use the group as a virtual counter.
- [^AB]* matches all the non-A, non-B characters.
- (?<-Count_A>B)+ matches all the B characters, popping individual A characters from the CaptureCollection as it does so ("decrementing the counter").
- (?(Count_A)(?!)) checks if the named capture Group Count_A is set, which can only be the case if we have not removed enough As from the CaptureCollection. This would mean there are fewer Bs than As in the string. In that case, the engine matches the THEN branch of the conditional, which is the classic trick (?!) to force the regex engine to fail and attempt to backtrack.

For efficiency, each quantified group should be made atomic: ^(?>(?<Count_A>A)+)(?>[^AB]*)(?>(?<-Count_A>B)+)(?(Count_A)(?!))$

I know, the atomic version (which is far preferable for the engine) looks awful… Do you happen to know the people at Microsoft in charge of .NET regex? If so, please lobby them to support possessive quantifiers (and subroutines, and recursion).
And if you don't mind, please shoot me a message as I'd love to know how to reach them.

Subject: About: 3. Not so useful: checking if a lookaround is successful.

Let me add my opinion about the 3rd point: I had a problem for which I found a solution with this syntax, and it does not seem to work if I use only the lookaround. Please consider a matching test: "if the string ends with END, it should contain WORD, otherwise all is permitted".

- With the conditional regex I write this:
    R1: ^(?(?=.*END$).*WORD.*END|.*)$
  With this R1 regex, "abcd" matches, "theWORD is END" matches, but "only END" doesn't match because it ends with END but WORD is missing. That's what I need: the presence of WORD is tested only if the string ends with END.
- Without the conditional it becomes:
    R2: ^(?=.*END$).*WORD.*END|.*$
  With the R2 regex, the last test "only END" matches, and that's not what I need.

So I think that there are cases for which checking if a lookaround is successful is useful. Otherwise, please give me another regex that works for my problem (maybe one exists, I'm not a regex guru ^^). Regards, Yosh

Reply to YosheE

Hi Yoshe, Sorry about the delay, I have been traveling then had to catch up on a million things. Finally looked at your message today. Congratulations for building an interesting example! With two lookarounds, there are several solutions. Here's a simple solution with a single lookaround:

    (?x) ^.*?WORD.*END$ | ^(?:(?!END$).)*$

In a majority of cases, your conditional implementation probably runs faster. I've added that as an additional example for case 3. Wishing you a beautiful day, Rex

Subject: Hi!

This is really helpful, thank you.

Recursive Regular Expressions

Recursion is an arcane but immensely helpful feature that only a few regex engines support. Before you invest your time studying this topic, I suggest you start out with the recursion summary on the main syntax page. If at that stage you are still interested, you can return here to dive deeper.

Supported Engines

Recursion is supported in the following engines:

- PCRE (C, PHP, R…)
- Perl
- Ruby 2+
- Python via the alternate regex module
- JGSoft (not available in a programming language)

Historical Note

A bit of history: recursion appeared in PCRE seven years before it made it to Perl regex. This explains the differences in behavior between the two engines, which, if you insist on knowing at this early stage, pertain to atomicity and the carryover of capture buffers down the stack. When Perl introduced recursion into its regex it behaved differently from PCRE's, and PCRE stuck to its original behavior for backward compatibility.

Entire Recursion and Local Recursion

The recursive syntax is simple. In PCRE and Perl, you just add (?R) anywhere you like in your pattern. Basically, (?R) means "paste the entire regular expression right here, replacing the original (?R)". In Ruby, you use \g<0>.

In its simplest form, a recursive pattern is the same as a plus quantifier. Of the few people I've seen mentioning recursive patterns on the net, nearly all use it for the same purpose: to match nested parentheses. Is that just tremendously useful, are they all copying one another, or are they copying Jeffrey Friedl's book? To see how recursion works, let's start with something a little simpler.

The Paste Method

I am going to show you two ways to work with recursive patterns. I call them the "paste method" and the "trace method". The paste method has little practical utility, but in the beginning it can give you an easy way to think about recursive patterns. So we will start with that. Here are a first pattern and a sample subject string.

Recursive Pattern #1
Pattern: \w{3}\d{3}(?R)?
Subject: aaa111bbb222

What does this pattern do? First, it matches three word characters: \w{3}, then three digits: \d{3}. On our test string, with "aaa111", the expression can match on these first steps. Next, since the expression has worked thus far, the (?R) "pastes" the whole pattern in its own place. (Bear in mind that this is only a manner of speaking. The actual regex engine wouldn't know pasting from pasta.) The question mark at the end of (?R)? is the usual "one or nothing" operator. It makes it so that if the expression in the "pasted pattern" fails, the engine can match "empty" in that same spot. At this stage, the expanded expression looks like this:

Pattern 1 (first expansion): \w{3}\d{3}(?:\w{3}\d{3}(?R)?)?

The part inside the non-capturing group (?:…) is where the full pattern has been pasted in place of (?R). Why encapsulate the pasted pattern in a non-capturing group? Because I needed to apply the question mark that was at the end of the (?R)? to the pasted pattern, and the non-capturing group is just a way to express that syntax. Without the final question mark, we would not need the non-capturing group. By the way, even though the (?R) is inside parentheses, the parentheses do not capture anything. That is why I used a non-capturing group rather than simple parentheses.

After matching "aaa111", we are now at the beginning of the pasted part of the expression. Our first job is to match three more word characters and three more digits. Luckily, with "bbb222", our test string supplies these. Next, we bump against (?R)? once again. The (?R) pastes itself in place. If you wrote out the whole expression at this stage, it would look like this:

Pattern 1 (second expansion): \w{3}\d{3}(?:\w{3}\d{3}#(?:\w{3}\d{3}(?R)?)?)?

This time, I have inserted a hash character (#) to show where we are in the expression at the moment. Note the question mark that immediately follows the group we just pasted (the second-to-last question mark of the expression): it makes this new pattern optional. At this stage, we try to match three word characters again. But we are at the end of our subject string (aaa111bbb222), so the new sub-pattern fails. Thanks to that question mark, we can go back to where we were before trying to match the pasted pattern. In the first expansion, that location is the point just before the (?R)?. The engine rolls over the (?R)?, successfully completes the expression, and returns a match: "aaa111bbb222".

As you can see, if you approach them like this, recursive expressions are nothing to be scared of. But the record should show that our recursion accomplished nothing more than the puny plus quantifier in:

Alternate to recursive pattern #1: (?:\w{3}\d{3})+

And as you can imagine, with anything a bit complex, the paste method would quickly become difficult to follow. Fortunately, the "trace method" eliminates the problem.
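The pasting described above can be mimicked mechanically. Python's built-in re module does not support (?R), so the following is only a sketch of the idea, not a picture of what any engine does internally: it performs the "paste" as plain string substitution. The helper names paste and seal are invented for this illustration. After a fixed number of pastes, the innermost (?R)? is simply dropped, which corresponds to the deepest level failing and matching "empty".

```python
import re

PATTERN_1 = r"\w{3}\d{3}(?R)?"  # recursive pattern #1 from the text


def paste(pattern: str, times: int) -> str:
    """Replace (?R)? by the whole pattern, wrapped in (?:...)?, `times` times."""
    expanded = pattern
    for _ in range(times):
        expanded = expanded.replace("(?R)?", "(?:" + pattern + ")?")
    return expanded


def seal(expanded: str) -> str:
    """Drop the innermost (?R)? so that the stdlib re module can compile the result."""
    return expanded.replace("(?R)?", "")


first_expansion = paste(PATTERN_1, 1)
second_expansion = paste(PATTERN_1, 2)
print(first_expansion)
print(second_expansion)

subject = "aaa111bbb222"
match_pasted = re.match(seal(second_expansion), subject)
match_plus = re.match(r"(?:\w{3}\d{3})+", subject)  # the plus-quantifier alternative
print(match_pasted.group(), match_plus.group())
```

Two pastes are enough for this subject; a longer subject would need more, which is one reason the paste method does not scale.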

The Trace Method

As you may have gathered from the other pages on this site, I am a big fan of a regex tool called RegexBuddy, which helps me create and troubleshoot all kinds of expressions. I was thrilled when its author introduced support for recursive regex in version 4 (released in September 2013). With the method I am about to show you, you can analyze recursive patterns of any complexity. All you need is a spreadsheet. The box below traces our recursive pattern #1 as it tries to match our subject.

# | Depth | Position in Regex | String Pos. | Notes | Backtracks
S1 | D0 | \w{3}\d{3}(?R)? | aaa111bbb222 | Match. |
S2 | D0 | \w{3}\d{3}(?R)? | aaa111bbb222 | Match. |
S3 | D0 | \w{3}\d{3}(?R)? | aaa111bbb222 | Try Depth 1. |
S4 | D1 | \w{3}\d{3}(?R)? | aaa111bbb222 | Match. | 3
S5 | D1 | \w{3}\d{3}(?R)? | aaa111bbb222 | Match. | 3
S6 | D1 | \w{3}\d{3}(?R)? | aaa111bbb222# | Try Depth 2. | 3
S7 | D2 | \w{3}\d{3}(?R)? | aaa111bbb222# | No Match. Back to S6. | 6,3
S8 | D1 | \w{3}\d{3}(?R)?# | aaa111bbb222# | D1 succeeds, back to D0 | 3
S9 | D0 | \w{3}\d{3}(?R)?# | aaa111bbb222# | D0 succeeds. Return overall match: aaa111bbb222 |

Here is how it works.

- In the leftmost column of your spreadsheet, enter the step number. For easier reference later, I call these steps S1, S2, S3 etc.
- In the next column, you will have the depth level of the recursion. For easier reference, I call these depths D0, D1, D2 etc. In this example you can see how we start at D0, then go to D1 and D2, then back up to D1, and finally D0, which the expression needs to complete in order to succeed. The other levels are all made optional by the question mark at the end of (?R)?, so they are allowed to fail, as happens at Step S7, Depth D2.
- In the "Position in Regex" column, you have the expression of the current depth level. For instance, at step S4, we reach depth D1, and the expression shown is the pattern from the depth level being evaluated. By only showing the current level, you do not create a spaghetti of patterns and sub-patterns, as the paste method tends to do. At each step, bold the part of the expression that is being evaluated. When we reach the position at the end of the expression, I add a bold hash character (#).
- In the next column, paste the string on each row, and bold the part of the string that is being evaluated or that just matched. When we reach the position at the end of the string, I add a bold hash character (#).
- In the Notes column, explain what is happening.
- In the Backtracks column, keep track of the sequence of steps to which you are allowed to backtrack if the current pattern fails. For instance, at S3, we decide to try D1, so S4 and the following steps on D1 have a backtrack mark to S3 in case D1 fails.
- For expressions that have capture groups, create a column for each capture group, and show the value of each capture group at each step. We will see an example of this later.

Navigating the Depths of Recursion

With the Trace method, you can follow a match through complex recursions. To obtain an overall match, depth 0 (D0) must succeed all the way to the end of the expression. In the middle of D0, the engine may have to dip down a number of levels. These levels all eventually succeed or fail, throwing the engine back to the prior level. At some stage, the engine gets back to D0, and either fails or eventually succeeds in finding a match.
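If you would rather generate such a trace than type it into a spreadsheet, here is a minimal sketch in Python. It is hand-written for pattern #1 only and is not a general regex tracer: it relies on the fact that \w{3} and \d{3} offer no alternatives to backtrack into, so the only "way back" is the optional recursion. The function name trace and the row layout are assumptions made for this illustration.

```python
import re

# Tokens of pattern #1, \w{3}\d{3}(?R)?, that come before the optional recursion.
TOKENS = [r"\w{3}", r"\d{3}"]


def trace(subject, pos=0, depth=0, log=None):
    """Match pattern #1 at `pos`, logging one row per step.

    Returns (end position or None, log). Each row is
    (depth, token, string position before the attempt, note).
    """
    if log is None:
        log = []
    for token in TOKENS:
        m = re.compile(token).match(subject, pos)
        if m is None:
            log.append((depth, token, pos, "No match"))
            return None, log
        log.append((depth, token, pos, "Match"))
        pos = m.end()
    # (?R)? : try one level deeper; a failure down there is allowed.
    log.append((depth, "(?R)?", pos, f"Try depth {depth + 1}"))
    end, _ = trace(subject, pos, depth + 1, log)
    if end is not None:
        pos = end
    log.append((depth, "end of pattern", pos, f"D{depth} succeeds"))
    return pos, log


end, rows = trace("aaa111bbb222")
for step, (depth, token, pos, note) in enumerate(rows, start=1):
    print(f"S{step}  D{depth}  {token:<15} pos={pos:<3} {note}")
```

The nine rows it produces follow the same D0, D1, D2, D1, D0 path as the box above.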

Recursion Depths are Atomic

One feature of PHP recursion that's important to understand is that each recursion level is atomic. What does this mean? Suppose your expression sends you to D1, made optional by a question mark. You complete D1 successfully. But back on D0, the engine fails to match the next character. Now, D1 may have given the engine a number of options in the form of quantifiers and alternations. When D0 fails, the engine does not go back inside D1 to try the unexplored options (different quantities or other sides of alternations). Instead, it discards D1 as a block. This is the behavior of an atomic group. This atomic feature of recursion levels can have profound effects on your match. Later on, we will see an example of that. It is a feature of PHP's PCRE engine. Perl, on the other hand, can backtrack into recursion levels.

Leaving a Way Out of the Recursion

In Pattern 1, you saw that whenever you paste the whole expression in place of the recursion marker (?R), you inherit another (?R). You have to, since it's part of the whole expression! To avoid madness and infinite loops, you need to make sure that at some stage the (?R) will stop breeding. In Pattern 1, this was achieved by adding a question mark after the (?R). This ensures that if a recursion level fails, the engine can continue with the match at the level just above. Another way to make sure you can exit a recursion is to make the (?R) part of an alternation. Consider this: Pattern 2: abc(?:$|(?R)) This pattern matches series of the string "abc" strung together. This series must be located at the end of the subject string, as it is anchored there by the dollar sign. The pattern matches "abc", "abcabc", but not "abc123". How does it work? After each "abc" match, the regex engine meets an alternation. On the left side, if it finds the end of string position (expressed by the dollar symbol in the regex), that's the end of the expression. On the other hand, if the end of the string has not yet been reached, the engine moves to the right side of the alternation, goes down one level, and tries to find "abc" once again. Without some kind of way out, the expression would never match, as eventually any string must run out of "abc"s to feed the regex engine.
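To see the "way out" at work without a recursive engine, here is a small Python sketch. Since the built-in re module has no (?R), the helper unroll (a made-up name) pastes the pattern a fixed number of times and then replaces the last (?R) with (?!), a token that always fails. At that deepest level the only remaining exit is the $ side of the alternation. The comparison with (?:abc)+$ is my own restatement of what Pattern 2 accepts.

```python
import re

PATTERN_2 = r"abc(?:$|(?R))"  # pattern 2 from the text


def unroll(pattern: str, depth: int) -> str:
    """Paste the pattern into (?R) `depth` times, then make the last (?R) fail."""
    expanded = pattern
    for _ in range(depth):
        expanded = expanded.replace("(?R)", "(?:" + pattern + ")")
    # (?!) is an empty negative lookahead: it never matches.
    return expanded.replace("(?R)", "(?!)")


unrolled = re.compile(unroll(PATTERN_2, 3))  # copes with up to four "abc" in a row
plain_loop = re.compile(r"(?:abc)+$")        # same idea written without recursion

for subject in ["abc", "abcabc", "abc123", "xyzabcabc"]:
    a = unrolled.search(subject)
    b = plain_loop.search(subject)
    print(subject, a.group() if a else None, b.group() if b else None)
```

The fixed depth is the price of the emulation: a true recursive engine has no such limit.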

Using Recursion to Match Palindromes (mirror words)

Instead of looking at the classic "match nested parentheses" pattern presented elsewhere, I will now show you a pattern that is just as powerful but easier to read. As an exercise, you can tweak it to match nested parentheses. Here's the pattern:

    (\w)(?:(?R)|\w?)\1

What does this do? This pattern matches palindromes, which are "mirror words" that can be read in either direction, such as "level" and "peep". Let's unroll it to see how it works.

    (?x)       # activate comment mode
    (\w)       # capture one word character in Group 1
    (?:(?R)    # non-capturing group: match the whole expression again,
    |          # OR
    \w?)       # match any word character, or "empty"
    \1         # match the character captured in Group 1

The pattern starts with one word character. This character is mirrored at the very end with the Group 1 back reference. These are the basic mechanics of how we are "building our mirror". In the very middle of the mirror, we are happy to have either a single character (the \w in the alternation) or nothing (made possible by the question mark after the \w). Note that the pattern is not anchored, so it can match mirror words inside longer strings. Here is some PHP code that tests the pattern against a few strings.

    <?php
    $subjects = array('dontmatchme','kook','book','paper','kayak','okonoko','aaaaa','bbbb');
    $pattern = '/(\w)(?:(?R)|\w?)\1/';
    foreach ($subjects as $sub) {
        echo $sub." ".str_repeat('-',15-strlen($sub))."-> ";
        if (preg_match($pattern,$sub,$match)) echo $match[0]."<br />";
        else echo 'sorry, no match<br />';
    }
    ?>

And here is the output:

    dontmatchme -----> sorry, no match
    kook ------------> kook
    book ------------> oo
    paper -----------> pap
    kayak -----------> kayak
    okonoko ---------> okonoko
    aaaaa -----------> aaaaa
    bbbb ------------> bbb

It worked perfectly! Well, almost perfectly. For the last string ("bbbb"), a match is found, but not the one we expected. What is happening there? This has to do with the atomic nature of PHP recursion levels. To explain it properly we need the trace method, in order to see every little step taken by the PHP regex engine.

In the fully traced match, pay close attention to step 22. At this stage, D0 has matched the first b, D1 has matched bbb, completing the string, and D0 cannot continue. If this were Perl, D0 could backtrack into D1, where other D1 matches could be explored: when D1 matches (\w), then nothing for \w?, then \1, it returns bb, and D0 can match the final b, returning the complete intended match: bbbb. However, this is not how the PCRE engine used by PHP's preg_match function works. Instead of going back into D1, the engine gives up D1 as a block. D0 then completes the match by eating two more "b"s, leaving the last one untouched, and returns "bbb". May this serve as a warning about the potentially unexpected outcomes of recursive regular expressions!
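The difference between atomic and backtrackable recursion levels can be explored with a toy model. The Python sketch below is a hand-written simulation of the pattern (\w)(?:(?R)|\w?)\1 and not PCRE's or Perl's actual algorithm. The names is_word, ends and find_palindrome are invented for this illustration, and the atomic flag is my way of modelling "the engine keeps only the first result of a deeper level" versus "the engine may ask the deeper level for its other results".

```python
from itertools import islice


def is_word(ch):
    # Rough stand-in for \w.
    return ch.isalnum() or ch == "_"


def ends(s, i, atomic):
    # Yield, in the order a backtracking engine would try them, every index at
    # which (\w)(?:(?R)|\w?)\1 can end when the attempt starts at index i.
    if i >= len(s) or not is_word(s[i]):
        return
    first = s[i]                      # (\w) captured in Group 1
    inner = ends(s, i + 1, atomic)    # left branch: (?R), one level deeper
    if atomic:
        inner = islice(inner, 1)      # atomic level: only its first result is kept
    for j in inner:
        if j < len(s) and s[j] == first:   # \1
            yield j + 1
    candidates = []                   # right branch: \w? (greedy, then empty)
    if i + 1 < len(s) and is_word(s[i + 1]):
        candidates.append(i + 2)
    candidates.append(i + 1)
    for j in candidates:
        if j < len(s) and s[j] == first:   # \1
            yield j + 1


def find_palindrome(s, atomic):
    # First match, scanning start positions from left to right.
    for start in range(len(s)):
        end = next(ends(s, start, atomic), None)
        if end is not None:
            return s[start:end]
    return None


for sub in ["dontmatchme", "kook", "book", "paper", "kayak", "okonoko", "aaaaa", "bbbb"]:
    print(sub, "->", find_palindrome(sub, atomic=True), "|", find_palindrome(sub, atomic=False))
```

With atomic=True the model gives bbb for "bbbb", like the PHP output above; with atomic=False it finds bbbb, which is the result described for Perl.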

Numbered Recursion

Now suppose we wanted to match mirror words only if they occupy the entire string. We would have to make sure that the match starts at the beginning of the string, and ends at the end. Easy, you might say, add a caret and a dollar anchor:

Attempt at anchored recursive pattern: ^(\w)(?:(?R)|\w?)\1$

Wait, not so fast… If you add anchors to the expression, when you hit the (?R), the anchors will be pasted back into the middle of the expression, yielding something like this:

Attempt at anchored recursive pattern (first expansion): ^(\w)(?:^(\w)(?:(?R)|\w?)\1$|\w?)\1$

Now you have two carets preceding two distinct characters. This pattern can never match through its recursive branch. Fortunately, you can build a recursive expression without using (?R), which repeats the entire expression. Instead, using the "subroutine expression" (or sub-pattern) syntax, you can paste a sub-pattern specified by a capture group. For instance, to paste the regex inside the Group 1 parentheses, you would use (?1) instead of (?R). Here is how our corrected anchored recursive pattern looks:

Anchored recursive pattern: ^((\w)(?:(?1)|\w?)\2)$

Everything between the two anchors now lives in a set of parentheses. This is Group 1. Therefore, the captured word character at the start is now Group 2. In the middle, the repeating expression pastes the subpattern defined by the Group 1 parentheses in place of the (?1): the anchors are left out. At the end, the first character is mirrored by the back reference to Group 2. This works perfectly. Except, once again, for the "bbbb" string. To see exactly why, you can use the trace method as shown in the earlier example.
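A quick way to convince yourself that pasted anchors kill the recursive branch is to write out one level of pasting by hand. The Python sketch below does that with the built-in re module, which has no recursion: both patterns are my own one-level expansions, with the innermost recursive call dropped and the pasted groups renumbered, so they are an approximation limited to words of at most five characters.

```python
import re

# (?R) version after one paste: the pasted ^ and $ land in the middle.
anchors_pasted = re.compile(r"^(\w)(?:^(\w)\w?\2$|\w?)\1$")

# (?1) version after one paste: only the body of Group 1 is pasted,
# so the anchors stay at the two ends of the pattern.
anchors_outside = re.compile(r"^((\w)(?:(\w)\w?\3|\w?)\2)$")

for word in ["pap", "kayak"]:
    print(word,
          bool(anchors_pasted.search(word)),
          bool(anchors_outside.search(word)))
```

"pap" matches both because it only needs the \w? branch; "kayak" needs the pasted level, which can only succeed when the anchors stay outside.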

Groups Contents and Numbering in Recursive Expressions

In a recursive regex, it can seem as though you are "pasting" the entire expression inside itself. In an expression where you have capture groups, as the one above, you might hope that as the regex shifts to deeper recursion levels and the overall expression "gets longer", the engine would automatically spawn new capture groups corresponding to the "pasted" patterns. But it is not so. First off, "pasting" is only a way of speaking. And as explained in the section on group numbering, groups are strictly numbered from left to right as you find capturing parentheses by reading the expression on the screen, and I mean the original expression, not one filled with layers of virtual paste operations. Group numbers are preserved from one depth level to the next.

But what about their contents? There are a few simple rules about how group contents travel from one depth level to another. In PCRE:

- As you go down depth levels, the contents of a group (such as Group 1) stay the same at first…
- But the deeper level can overwrite the contents of a group set above…
- Until you return to the higher level, where the captured groups resume their value.

In contrast, in Perl, as you go down depth levels, the contents of a group (such as Group 1) are wiped out.

To see this, I suggest a simple exercise for which I have prepared an expression and a test string. The expression matches the test string. The hints below explain the value changes of the four capture groups.

Subject: baacdbcd aacdbcda
Expression: ^(.)((.)(?:(d)\1\3\4|\3(?2)))[ ]\2\3

Hints:
- The recursion (?2) calls the Group 2 pattern, i.e.: ((.)(?:(d)\1\3\4|\3(?2)))
- At depth 0, the left side of the alternation fails.
- Value of \1: b throughout the match.
- Value of \2: aacdbcd once depth 1 recursion ends.
- Value of \3: a on depth 0, c on depth 1, a again once we return to depth 0.
- Value of \4: unset on depth 0 (the left branch fails there), d on depth 1, unset again on depth 0.
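One way to check the hints without a recursive engine is to paste the (?2) call once by hand and give the pasted copy its own group numbers. The Python sketch below does this with the built-in re module. Groups 5 and 6 are my stand-ins for the depth-1 values of Groups 3 and 4, and the pasted copy keeps only its left branch, since that is the one the hints say succeeds at depth 1. Letting the pasted copy read \1 from the outer level reflects the PCRE rule above, not the Perl one.

```python
import re

subject = "baacdbcd aacdbcda"

# ^(.)((.)(?:(d)\1\3\4|\3(?2)))[ ]\2\3 with the (?2) call pasted once.
# Group 5 plays the role of Group 3 at depth 1, Group 6 that of Group 4.
unrolled = re.compile(r"^(.)((.)(?:(d)\1\3\4|\3(?:(.)(?:(d)\1\5\6))))[ ]\2\3")

m = unrolled.match(subject)
print(m.group(0))
print("Group 1:", m.group(1))
print("Group 2:", m.group(2))
print("Group 3 at depth 0:", m.group(3), "| at depth 1:", m.group(5))
print("Group 4 at depth 0:", m.group(4), "| at depth 1:", m.group(6))
```

Group 4 comes back as None at depth 0 because the left branch of the alternation never succeeds there.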

More about Recursive Expressions

There are great examples of regex recursion in several sections of the site. These will show you practical ways to use recursion in situations we haven't explored here.

- Quantifier Capture
- Matching Line Numbers

For more information about recursive regexes, you can visit the PHP manual's page on recursive patterns and see if some of the examples posted there speak to you. For me, the best reference on recursive expressions lives in the PCRE documentation written by Philip Hazel, the creator of the PCRE engine.

First I want to thank you for all the effort you made creating a site like rexegg. I've learned so much and now I want to give something back. I found a neat solution for your problem described in the RECURSION section of your page. This one actually overcomes the pcre engine limitation. I'll give you a link to regex101.com and you can decide for yourself if it's cool or not. :)

Thank you for your kind message and most of all congrats on finding a clever way of achieving that classic task! At the moment I don't have the brainspace to study what you've done (surgery in a couple of days), but I've added your message as a comment to the Recursion page. Hope that's okay with you, if not please get in touch.

Subject: a man a canal a plan panama

So I checked if the palindrome matcher would match "amanaplanacanalpanama", and it didn't quite work. You can see what it matches here: http://rubular.com/r/F1jdF0hKpq Seems to be not "greedy" enough, matching several smaller palindromes in that string instead of the largest. Is it the ruby regex engine, or a limitation of the regex?

Reply to Rob

Hi Rob, You're right. Don't know enough about Ruby recursion to offer any insights. Kind regards, Rex

Subject: str_pad

You can use str_pad instead of str_repeat (I know, it doesn't have anything to do with regex). Thanks for this site, it's awesome.

Subject: Recursion number?

Thanks for your articles. I haven't found a better explained website about regex. It seems you can really put yourself in the learner's shoes and don't enjoy seeing us struggling like others with their eloquence. This article helped me to reduce half of the code in my expression already around 500 lines long! I need to [regex question follows]

Reply to Sergio

Hi Sergio, Thank you for your very kind message. Sorry for not being able to reply to your regex question, I am flat out from 6am to midnight at the moment. May I suggest the forums? Kind regards, -A

Character Class Subtraction, Intersection & Union

Some regex engines let you do some fancy operations within your character classes. You can match characters that belong to one class but not to another (subtraction); match characters that belong both to one class and another (intersection), or match characters that belong to either of several classes (union). At the moment, the engines that officially support one or several of these features are .NET, Java, Ruby 1.9+ (Onigmo engine) and Matthew Barnett's regex module for Python. In addition, Perl 5.18 introduced experimental support for set operations with its Extended Bracketed Character Classes, which I plan to document at some point. For other engines, I'll present some workarounds.

Character Class Intersection

In Java, Ruby 1.9+ and the regex module for Python, you can use the AND operator && to specify the intersection of multiple classes within a character class: […&&[…]&&[…]] specifies a character class representing the intersection of three sub-classes, meaning that the character matched by the class must belong to all three sub-classes. For instance, [\S&&[\D]] specifies one character that is both a non-whitespace character and a non-digit.

Character class intersection really comes into its own when using Unicode properties, as it lets us zoom in on a desired set of characters. For instance, the regex [\p{InArabic}&&[\p{L}]] matches a character that is both in the Arabic Unicode block and a letter: in other words, an Arabic letter. Likewise, [\p{ASCII}&&\p{L}] matches an ASCII character that is also a letter.

Brackets are optional but… Use them with negation

As long as the class after the && is not a negated class [^…], the brackets are optional. Therefore, the above regex can be written [\p{InArabic}&&\p{L}]

When negation is involved, I recommend always using brackets because things can get messy. For instance, what does [^a&&[ab]] mean? For Java, it means [[^a]&&[ab]] and therefore only matches b. For Ruby, it means [^[a&&[ab]]], i.e. "not the intersection of two subclasses" (which is a), and it therefore matches any character that is not an a. Combined with negated classes, intersection is very useful to create character class subtraction.

Workaround for Engines that Don't Support Character Class Intersection

For engines that don't support character class intersection, we can simply use a lookahead. For instance, our Arabic letter regex [\p{InArabic}&&\p{L}] can be written like this in Perl:

    (?=\p{InArabic})\p{L}

The lookahead asserts that the following character belongs to the Arabic Unicode block. Then \p{L} matches a letter, which is guaranteed to be an Arabic letter.
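The same lookahead workaround can be tried in Python's built-in re module, which supports neither && nor \p{…}. This small sketch therefore uses the earlier [\S&&[\D]] example instead of the Arabic one; the variable names are made up for the illustration.

```python
import re

# [\S&&[\D]] rewritten with a lookahead: the character must be
# non-whitespace (asserted by the lookahead) and a non-digit (consumed by \D).
non_space_non_digit = re.compile(r"(?=\S)\D")

found = non_space_non_digit.findall("a1 b2\t!")
print(found)
```

Digits fail the \D token and whitespace fails the lookahead, so only the remaining characters are returned.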

Character Class Subtraction

The syntax for character class subtraction differs depending on whether you're using Java, Ruby 1.9+ or .NET.

Character Class Subtraction in Java and Ruby 1.9+

Java and Ruby do not have dedicated syntax for character class subtraction. Rather, the feature is just a logical by-product of their character intersection syntax. For instance, [a-z&&[^aeiou]] matches characters that are both English lowercase letters and not vowels. In effect, it subtracts the vowel class [aeiou] from the class of letters [a-z]. The effect is to match all English lowercase consonants. Subtraction becomes particularly useful with Unicode properties. For instance, [\p{InArabic}&&[^\P{L}]] subtracts non-letters \P{L} from the set of characters in the Arabic Unicode block, guaranteeing we match an Arabic letter.

Character Class Subtraction in .NET

In .NET, with […-[…]], you can specify that the character to be matched belongs to a certain class (everything before the hyphen), except if it belongs to another class (the embedded character class, which is "subtracted" by the hyphen). For instance, the class [a-z-[aeiou]] matches an English lower-case consonant. Using Unicode properties, you can use this feature to zoom in on a useful character range. For instance, you could try [\p{IsArabic}-[\d]] to match one character in the Arabic code block, except if it is a digit. Do not think that gives you an Arabic letter, though, as the Arabic code block also includes punctuation and various marks and symbols. In contrast, [\p{IsArabic}-[\D]] is much more useful: it gives us one character in the Arabic code block, except if it is a non-digit, guaranteeing that it is an Arabic digit.

You can have nested subtraction, using subtraction within a class being subtracted. For instance, having defined [a-z-[aeiou]] as English lowercase consonants, we could subtract those from the word character class: [\w-[a-z-[aeiou]]]

Note that [^a-z-[0-9]] is not interpreted as (using pseudoregex) [^{a-z-[0-9]}], but as (pseudoregex) [{^a-z}-[0-9]]. But you would never do that: [^a-z0-9] is much simpler.

Character Class Subtraction in the regex module for Python

The syntax is the same as for .NET, except for one added hyphen. For instance, the class [a-z--[aeiou]] matches an English lower-case consonant. In addition, when the subtracted class does not include a range, its brackets are optional. The above can therefore also be written as [a-z--aeiou]

Workaround for Engines that Don't Support Character Class Subtraction

For engines that don't support character class subtraction, we can simply use a negative lookahead. For instance, our English consonant regex [a-z&&[^aeiou]] can be written like this in PCRE (PHP, R…), Perl, Python and JavaScript:

    (?![aeiou])[a-z]

The negative lookahead asserts that the following character is not a lowercase vowel. Then [a-z] matches a letter, which is guaranteed not to be a vowel.
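Here is the negative-lookahead workaround run through Python's built-in re module, as a minimal sketch. The sample string is arbitrary.

```python
import re

# [a-z&&[^aeiou]] rewritten as "not a vowel, then a lowercase letter".
consonant = re.compile(r"(?![aeiou])[a-z]")

found = consonant.findall("regex engine")
print(found)
```

Every vowel is rejected by the lookahead before [a-z] gets a chance to consume it, and the space is rejected by [a-z] itself.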

Character Class Union

The syntax for the union of character classes depends on which engine you use.

Character class union in Java and Ruby 1.9+

In Java and Ruby 1.9+, you can embed character classes within a character class like so: […[…][…]] to specify that the character to be matched can belong to either of the specified classes. For instance, [\D[9]] matches one character that is either a non-digit or a 9. Since this can also be written [\D9], unions tend to be useful only in convoluted cases that involve negation or other character class operations (subtraction and intersection). For instance, [0[^\W\d]] is a reduced word character class where the only allowable characters are letters, underscores and the digit 0.

Each embedded class has access to the entire character class syntax, so it too can embed classes, leading to multiple levels of nesting, as in […[…][…[…][…]]]. Looking at it this way, I don't know what situation would ever lead you to specify such a class, but this could come in handy if you build patterns dynamically, concatenating classes that originate in various parts of your application. (Thanks to nhahtdh for bringing this to my attention.)

Character class union in the regex module for Python

In the regex module for Python, to create the union of multiple character classes, we use the OR operator ||. For instance, [0||[^\W\d]] specifies a character that is either 0 or a word character that is not a digit.

Workaround for Engines that Don't Support Character Class Union

For engines that don't support character class unions, we can simply use an alternation within a non-capturing group. For instance, our regex [0||[^\W\d]] can be rewritten like this in all major engines:

    (?:0|[^\W\d])
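As a quick check of the alternation workaround, here is a minimal sketch using Python's built-in re module (which has no || operator inside classes). The sample string is arbitrary.

```python
import re

# [0||[^\W\d]] rewritten as an alternation: the digit 0,
# or a word character that is not a digit.
reduced_word = re.compile(r"(?:0|[^\W\d])")

found = reduced_word.findall("a_0 19!")
print(found)
```

The digits 1 and 9 are word characters, but the negated class excludes them, and neither is the literal 0 of the first branch.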

Backtracking Control Verbs

Several regex flavors have special patterns that instruct the engine about how to match, rather than about what to match, as the other tokens do. In the documentation, these patterns are bundled under the label of special backtracking control verbs, although at first sight some verbs, such as (*ACCEPT) (which tells the engine to return the string it has matched so far) may not seem directly related to backtracking. (Almost) everything you always wanted to know about backtracking control verbs but never dared to ask. In practice, the idioms I find the most useful are and the combination. If you're only looking for regex that you can put to immediate use, feel free to skip to these sections. Otherwise, stay with me as we explore the various backtracking control verbs.

Engines that Support Backtracking Control Verbs

At the moment, backtracking control verbs are a rarity:

- Perl introduced them in version 5.10 (final release: December 18, 2007).
- Four months earlier (28 August 2007), PCRE 7.3 "followed suit" with a back-to-the-future move as Perl 5.10 hadn't yet been officially released. Because of PCRE's wide use, the verbs can be found in contexts such as PHP, R and Apache.
- For Python, Matthew Barnett introduced a limited set of the verbs in the September 14, 2015 release of his regex engine (a brilliant alternate to the feature-poor re). So far the supported verbs are (*PRUNE), (*SKIP) and (*FAIL), probably the most important ones.

The slow speed of adoption of the backtracking control verbs reflects some simple truths: they are rarely used, likely because they are poorly known, and they are little understood by a considerable proportion of those few who have heard of them. Frankly, this lack of awareness is not an issue because there is so much "basic" material that most people who only occasionally use regex need to master before the features offered by backtracking verbs become meaningful. But with this page, I hope to make them more accessible to those who are interested in polishing their regex, and I count myself firmly in that camp, since:

1. There is always more to learn,
2. There is a lot to forget, and
3. The regex landscape doesn't stand still, as new engines are introduced and old ones evolve.

The Main Backtracking Control Verbs

Four verbs can be called the "main" or "real" backtracking control verbs: (*THEN), (*PRUNE), (*SKIP) and (*COMMIT). In fact, the above list arranges them in "order of strength". Although the most important ones are (*PRUNE) and (*SKIP), it will be easier to gain a well-rounded understanding of the verbs if we explore them in this order.

During Forward Matching, the Verbs are "Invisible"

The first thing to understand about these four verbs is that as long as the engine is moving forward in the pattern and the string, the verbs have no influence on the matching. In other words, they always match. This is a zero-width match, much as the lookahead (?=), which asserts that an empty string can be found at the position immediately to the right of the string cursor: it is always true. For instance, (*PRUNE) has no influence whatsoever when the pattern \d+(*PRUNE)\D+ is set to work on the string 12 Monkeys: the regex just matches, and the special semantics of (*PRUNE) are never activated.

The verbs activate when the engine needs to backtrack across them

The verbs are triggered at the point when the engine tries to give them up in order to continue a match attempt. Specifically, all four backtracking control verbs forbid the engine from crossing them while backtracking. In other words, the engine can never go back to their left. What the engine must do at that point depends on the verb, the engine's actions being arranged on a scale of potency from (*THEN) to (*PRUNE), (*SKIP) and (*COMMIT).

To understand how the verbs work, we must therefore understand how backtracking works, which is no small feat considering that various optimizations may influence how different engines backtrack. Nevertheless, treading carefully, you'll usually have a good idea of what behavior to expect.

(*THEN)

This verb is probably the least useful of the four backtracking control verbs. (*THEN) is designed to work within an alternation, on the left side of one or more | tokens. As with the three other backtracking control verbs, when the engine passes (*THEN) from left to right, nothing happens (the token acts as an always-true assertion at that point in the string). If the engine needs to backtrack, it cannot backtrack across (*THEN) (same as with the other backtracking control verbs). At that stage, if the engine is on the left side of an alternation (the | token), it gives up trying to match the current branch and starts trying to match the next branch.

Consider this regex:

Comedian: (?:B\w+ (*THEN)Murray|E\w+ (*THEN)Murphy|P\w+ Sellers)

We'll use it against this string:

Comedian: Bill Burr -- Comedian: Peter Sellers

The match is Comedian: Peter Sellers. Here's a top-level explanation. What we have here can be seen as a construct of the form if:then… elseif:then… else:then. The B\w+, E\w+ and P\w+ fragments are meant to match a first name. The idea is that if the first name starts with a B, THEN the last name must be Murray… If the engine matches a first name starting with a B but the last name is something other than Murray, the engine is instructed not to slowly backtrack each of the tokens of the failed branch's pattern B\w+, but to give up the entire branch (this might remind you of an atomic group) and skip directly to the next branch in the alternation, which is E\w+ (*THEN)Murphy.
In that branch, the idea is the same: if the first name starts with an E, THEN the last name must be Murphy, otherwise don't bother backtracking across the first name and skip to the next branch: P\w+ Sellers

In pseudo-regex, the expression reads:
- Match Comedian:
- If the first name starts with B, THEN match Murray
- Elseif the first name starts with E, THEN match Murphy
- Else match a first name starting with P, a space, and Sellers

In our example, the engine starts a match attempt at the beginning of this string:

Comedian: Bill Burr -- Comedian: Peter Sellers

After matching Comedian:, a space character, the Bill in Bill Burr and the space character, the engine matches the always-true (*THEN), but at that stage it fails to match the M in Murray. It cannot backtrack across the (*THEN), so it gives up the content of the alternation's first branch (Bill and a space) and tries to match the E at the start of the middle branch. That fails, so the engine tries to match the P in the third branch. That too fails, so the entire match attempt has failed. The engine advances in the string to the position preceding the o in Comedian and tries a second match attempt. That fails immediately. After a number of other immediately-failed match attempts, the engine tries a match attempt at the position before the C in Comedian: Peter Sellers. This match attempt succeeds.

Here are code snippets if you'd like to try it with the two engines that currently support (*THEN).

# Perl
$comedian_regex = qr/Comedian: (?:B\w+ (*THEN)Murray|E\w+ (*THEN)Murphy|P\w+ Sellers)/;
if ('Comedian: Bill Burr -- Comedian: Peter Sellers' =~ $comedian_regex ) {
    print "\$&='$&'\n";
} else {
    print "No match\n";
}

// PHP
$comedian_regex = '~Comedian: (?:B\w+ (*THEN)Murray|E\w+ (*THEN)Murphy|P\w+ Sellers)~';
echo preg_match($comedian_regex, 'Comedian: Bill Burr -- Comedian: Peter Sellers', $m )
    ? "$m[0]\n" : "No match\n";

Is this useful? Not for this simple example.
Consider the simpler alternative, where the two (*THEN) have been removed:

Comedian: (?:B\w+ Murray|E\w+ Murphy|P\w+ Sellers)

This matches the same strings.

Autopossessification

The (*THEN) are just meant to speed up the process by cutting down on backtracking. But the time lost to compile this more complex pattern more than offsets any time gained during matching. Furthermore, for this example, there is no time gain during matching (at least with the PCRE engine: I haven't timed Perl). That is because PCRE would never backtrack into the first name in the first place. Before matching, the engine studies the pattern, and an optimization kicks in that turns the \w+ token into a possessive \w++. PCRE is able to do this because the token that follows \w+ is a space character. The \w token and the space character are mutually exclusive: even if it backtracks into \w+, the engine will never be able to match a space character where a word character was matched earlier. Therefore PCRE can treat \w+ as \w++. This process is known by a charming term: autopossessification. Try not to say it while chewing on oatmeal.

When you turn off PCRE optimizations by using PCRE's start of pattern modifiers (*NO_START_OPT) and (*NO_AUTO_POSSESS), the (*THEN) pattern edges out the plain pattern. But only by a hair: this makes sense because Bill does not give \w+ much to backtrack. That being said, when the expression to the left of a (*THEN) (within the same branch of the alternation) is particularly complex and time-consuming, I am sure there are situations when the verb would be useful. For my part, I have never used it except to try it out.

When (*THEN) is not found within an alternation

The (*THEN) verb works by sending the engine to the next branch of an alternation. When (*THEN) is not found within an alternation, it behaves like (*PRUNE).
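Python's engines do not offer (*THEN), but the verb-free version of the comedian pattern discussed above runs anywhere. The sketch below uses only the standard re module to confirm which part of the sample string it matches; the variable names are illustrative.

```python
import re

subject = 'Comedian: Bill Burr -- Comedian: Peter Sellers'

# The comedian pattern with both (*THEN) verbs removed
plain = re.compile(r'Comedian: (?:B\w+ Murray|E\w+ Murphy|P\w+ Sellers)')

match = plain.search(subject)
# The first "Comedian:" fails on every branch (Burr is not Murray),
# so the match is found at the second one.
print(match.group())  # Comedian: Peter Sellers
```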

(*PRUNE)

If you remember to use it, this verb is probably the second most useful of the four. As with the other three backtracking control verbs, when the engine passes (*PRUNE) from left to right, nothing happens (the token acts as an always-true assertion at that point in the string). If the engine needs to backtrack, it cannot cross (*PRUNE) from right to left (same as with the other backtracking control verbs). At that stage, it gives up on the match attempt. As usual, if any characters are left in the string, the engine then advances to the next string position and starts a new match attempt.

Atomic bomb

This behavior probably reminds you of atomic groups. You could design some uses of (*PRUNE) so that it acts the same as an atomic group or a possessive quantifier, but in most cases it will behave differently. This is because you can drop (*PRUNE) anywhere, and once you try to backtrack across it, the whole match attempt explodes. In contrast, with atomic groups and possessive quantifiers, only a fragment of the match is taken out. If it helps, think of (*PRUNE) as an atomic bomb.

When (*PRUNE) is the same as an atomic group

First, let's look at an example where (*PRUNE) behaves the same as an atomic group or a possessive quantifier. Let's say you're a security analyst sifting through your database for people called Jones whose first name has six or more letters. Your search should produce strings such as Quincy Jones: music producer, Rashida Jones: actress, Indiana Jones: movie character. At first, you come up with \w{6,} Jones:.*, then you decide to make the first name \w{6,} atomic since there's no point backtracking once you've established that you have the wrong last name.
The following three expressions would be equivalent:
- (?>\w{6,}) Jones:.* (atomic group)
- \w{6,}+ Jones:.* (possessive quantifier)
- \w{6,}(*PRUNE) Jones:.* ("atomic bomb")

The reason (*PRUNE) behaves the same as an atomic group or possessive quantifier in this situation is that it appears immediately after the first sub-expression. The match attempt explodes at a point that happens to be the same place where the atomic and possessive versions cause the match to fail: once they've given up the \w{6,} there is nothing left to backtrack.

Usually, (*PRUNE) is not the same as an atomic group

Let's now look at the more general case, where (*PRUNE) behaves differently from an atomic group. This time, we want to match actor names. If the first name has two to four letters, the last name must be Murray; however, Bill Burr is acceptable too. First, we try with an atomic group:

(?>\w{2,4} )Murray|Bill Burr|Peter Sellers

Against the string Bill Burr -- Peter Sellers, this regex matches Bill Burr. In the first match attempt (at the beginning of the string), after the first name and the space character match in the leftmost side of the alternation, the M in Murray fails. The engine gives up the atomic group, then restarts the match attempt in the same position, before the initial B. This time the middle part of the alternation, Bill Burr, is a direct match.

Let's now try the (*PRUNE) version of this pattern:

\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers

Against the same string Bill Burr -- Peter Sellers, this pattern never matches Bill Burr. Instead, it matches Peter Sellers. The difference is that on the first match attempt, in the leftmost side of the alternation, after the M in Murray fails to match, the engine tries to backtrack across the (*PRUNE). At that stage, the match attempt explodes: the other parts of the alternation are never visited. The engine advances to the next position in the string (between the initial B and the i) and starts a whole new match attempt.
After this and several more match attempts fail, the engine starts a new match attempt at the string position immediately preceding the P in Peter Sellers, and the match succeeds with the rightmost branch of the alternation.

Here are some code snippets if you'd like to try this in the three engines that currently support (*PRUNE).

# Perl
$actor_regex = qr/\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers/;
if ('Comedian: Bill Burr -- Comedian: Peter Sellers' =~ $actor_regex ) {
    print "\$&='$&'\n";
} else {
    print "No match\n";
}

// PHP
$actor_regex = '~\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers~';
echo preg_match($actor_regex, 'Bill Burr -- Peter Sellers', $m )
    ? "$m[0]\n" : "No match\n";

# Python
# if you don't have the regex package, pip install regex
import regex as mrab
# print(mrab.__version__) should output 2.4.76 or higher
actor_regex = r'\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers'
print(mrab.search(actor_regex, 'Bill Burr -- Peter Sellers'))
# <regex.Match object; span=(13, 26), match='Peter Sellers'>

Is this (*PRUNE) useful?

For moderately complex expressions that may entail a lot of backtracking, (*PRUNE) can save the engine a lot of time. It's a powerful weapon to have in your regex arsenal: you drop it anywhere, and it causes the match attempt to explode if the engine ever tries to backtrack across it. Certain conditions must be met before (*PRUNE) becomes efficient:
- The time saved must outweigh the longer time needed to compile the regex. If the only potential savings is backtracking across two letters (as with a \w{2,4}), this is not a place for (*PRUNE).
- (*PRUNE) must save real-life backtracking. On paper, it may look like a (*PRUNE) saves you some backtracking, but your engine may have performed some optimizations that would prevent backtracking from ever happening anyway: see the section on autopossessification.

When (*PRUNE) does work, in many cases (*SKIP) works even better.
Teaser: in PCRE, you can accomplish similar results to (*PRUNE) with a callout that returns a positive value.
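To see the atomic group and the (*PRUNE) versions of the actor pattern side by side, here is a small sketch. It assumes the third-party regex package (pip install regex), which supports both constructs; the variable names are illustrative.

```python
# Requires the third-party "regex" package: pip install regex
import regex

subject = 'Bill Burr -- Peter Sellers'

# Atomic group: only the group is locked, so the other branches are still tried
atomic = regex.search(r'(?>\w{2,4} )Murray|Bill Burr|Peter Sellers', subject)

# (*PRUNE): backtracking across the verb fails the whole match attempt
pruned = regex.search(r'\w{2,4} (*PRUNE)Murray|Bill Burr|Peter Sellers', subject)

print(atomic.group())  # Bill Burr
print(pruned.group())  # Peter Sellers
```

The only difference between the two patterns is the construct that follows the first name, which makes the contrast between "take out a fragment" and "blow up the attempt" easy to observe.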

(*SKIP)

The next backtracking control verb on our potency scale is my favorite. I'll give a quick explanation of (*SKIP) in relation to (*PRUNE) for those of you who have been following from the top, then I'll give the long version for those who skipped directly to this section.

The short version

Like (*PRUNE), (*SKIP) acts like an "atomic bomb": when the engine tries to backtrack across it, the match attempt explodes. In the case of (*PRUNE), if any characters are left in the string, the engine then advances to the next string position and starts a new match attempt. In the case of (*SKIP), the engine advances to the string position corresponding to the place in the pattern where (*SKIP) was encountered, and starts a new match attempt at that position. In other words, the engine skips to the string position corresponding to where (*SKIP) was matched, potentially saving a lot of fruitless match attempts.

The longer version

As with the other three backtracking control verbs, when the engine passes (*SKIP) from left to right, nothing visible happens: the token acts as an always-true assertion at that point in the string. If the engine needs to backtrack, it cannot cross (*SKIP) from right to left (same as with the other backtracking control verbs). Instead, it gives up on the match attempt. At that stage, the way the engine works, it would normally advance to the next string position and start a new match attempt. Instead, (*SKIP) causes the engine to skip to the position in the string corresponding to where (*SKIP) was matched, and to start the next match attempt at that position. This can save time by avoiding fruitless match attempts.

Double bomb

(*SKIP) is like a bomb that operates on two planes. Not only does backtracking through it blow up the match attempt (as (*PRUNE) does), it also blows up the part of the subject string leading up to it.
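The difference between the two verbs comes down to where the next match attempt begins. The function below is a toy model of that rule, not code from any engine: next_attempt_start and its parameters are hypothetical names, and the fallback used when (*SKIP) sits at the very start of the attempt is an assumption made so that the model always moves forward.

```python
from typing import Optional


def next_attempt_start(fired: Optional[str], attempt_start: int,
                       verb_pos: Optional[int], subject_len: int) -> Optional[int]:
    """Return where the next match attempt starts, or None if there is none.

    fired is 'PRUNE', 'SKIP', or None for an ordinary failed attempt.
    verb_pos is the string position at which the verb was passed.
    """
    if fired == 'SKIP' and verb_pos is not None and verb_pos > attempt_start:
        # Jump to the position where (*SKIP) was matched
        candidate = verb_pos
    else:
        # Ordinary bump-along, also what happens after (*PRUNE).
        # Assumption: a (*SKIP) that made no progress is treated the same way.
        candidate = attempt_start + 1
    return candidate if candidate <= subject_len else None


# A six-character subject; the failed attempt started at 0
# and the verb was passed at position 3.
print(next_attempt_start(None, 0, None, 6))   # 1
print(next_attempt_start('PRUNE', 0, 3, 6))   # 1
print(next_attempt_start('SKIP', 0, 3, 6))    # 3
```

In this model (*PRUNE) changes nothing about the next starting position; its effect is to end the current attempt early. (*SKIP) does that too, and additionally moves the starting position.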
PCRE, Python: later match failure always triggers (*SKIP) (unless it is followed by other verbs)

Here it's important to understand how backtracking works. In theory, when a match fails somewhere after a (*SKIP), although you and I can sometimes tell that there is nothing to backtrack, a regex engine will always try to backtrack to the beginning of the string as it looks for other paths to explore. Therefore, if (*SKIP) is the last backtracking control verb in a pattern, a match failure beyond the (*SKIP) should always trigger the (*SKIP). If another verb occurs to the right of the (*SKIP) and to the left of the failure, that verb gets triggered first.

In practice, if you were writing a regex engine, you would try to avoid backtracking when it's obvious that backtracking would be fruitless (for instance, when the pattern contains no quantifiers or alternations). Internally, among all kinds of clever optimizations, Perl, PCRE and Python's regex package are smart enough to avoid fruitless backtracking. But externally, PCRE and Python behave as though they were backtracking all the way back. Perl's behavior is inconsistent.

For instance, given the string aaaardvark aaardwolf, suppose we run a match attempt with this regex:

aa(*SKIP)ard\w+

The engine starts a match attempt at the beginning of the string. It matches aa, passes over the (*SKIP), matches the third a then chokes on the r token because the next character in the string is a. Internally, our engines are smart enough to fail the match right away because there is nothing to backtrack (no quantifiers, no alternations). Externally, the engines behave as though they were conducting a naive path exploration that would cause them to give up the last a then attempt to backtrack across the (*SKIP) in search of a different match. The (*SKIP) is triggered: the match attempt explodes; the engine skips in the string past the initial aa before starting its second match attempt.
At this position, only two a characters precede the r, one fewer than the pattern requires, so this match attempt fails, as do other attempts until we reach the position preceding aaardwolf. In contrast, without the (*SKIP), the pattern would match the string aaardvark starting on the second a.

Perl: Inconsistent behavior on later failure

Unlike PCRE and Python, when a pattern looks like it should fail to the right of a (*SKIP), the Perl engine does not always fire the (*SKIP). For instance, with the earlier pattern aa(*SKIP)ard\w+ Perl matches aaardvark in aaaardvark aaardwolf. Perhaps even worse, with a{1,2}(*SKIP)ard\w+, when matching against aaaardvark aaardwolf, after matching aaa and failing on the r token, Perl doesn't try to backtrack across (*SKIP) despite the quantified a{1,2}, a backtrackable expression which should clearly lure the engine to trigger the (*SKIP). The (*SKIP) doesn't fire, and the engine matches aaardvark instead of the expected aaardwolf.

When I reported this as a bug, a Perl team dev kindly explained that (*SKIP) doesn't fire if the "study phase" at the start of a match attempt is able to decide that the match should not even be attempted. This leads me to guess that at the first position in the string, the optimizer sees that the characters ar or aar must be matched but can't be. The match attempt aborts before it even starts, so technically (*SKIP) is never crossed and the engine starts the next match attempt at the position following the first a. This "optimized behavior" contradicts the pattern writer's directions, so it seems undesirable to me. Since backtracking control verbs are described as experimental, it's entirely possible that the Perl team will decide to change the behavior at some stage.

Here are some code snippets if you'd like to try (*SKIP) in the three engines that currently support it.
# Perl
if ('123ABC' =~ /123(*SKIP)B|.{3}/ ) { print "\$&='$&'\n"; }
# $&='ABC'
# with (*PRUNE) instead, the match would be 23A

// PHP
echo preg_match('~aa(*SKIP)ard\w+~', 'aaaardvark aaardwolf', $m)
    ? "$m[0]\n" : "No match\n";
# matches aaardwolf, whereas Perl matches aaardvark

# Python
# if you don't have the regex package, pip install regex
import regex as mrab
# print(mrab.__version__) should output 2.4.76 or higher
print(mrab.search(r'aa(*SKIP)ard\w+', 'aaaardvark aaardwolf'))
# <regex.Match object; span=(11, 20), match='aaardwolf'>

Apart from potentially avoiding lots of fruitless match attempts, (*SKIP) is particularly useful as part of the (*SKIP)(*FAIL) construct, which we'll study shortly. In the section about (*MARK), we'll see that you can also make (*SKIP) cause the engine to skip to a specific "bookmark" in the string, rather than to the position where (*SKIP) was encountered.
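To compare the aardvark pattern with no verb, with (*PRUNE) and with (*SKIP) in one run, here is a short sketch. It assumes the third-party regex package (pip install regex); the dictionary and labels are only illustrative.

```python
# Requires the third-party "regex" package: pip install regex
import regex

subject = 'aaaardvark aaardwolf'

patterns = {
    'no verb': r'aaard\w+',
    'PRUNE': r'aa(*PRUNE)ard\w+',
    'SKIP': r'aa(*SKIP)ard\w+',
}

results = {label: regex.search(pattern, subject)
           for label, pattern in patterns.items()}

for label, m in results.items():
    print(label, m.span(), m.group())
# no verb (1, 10) aaardvark
# PRUNE (1, 10) aaardvark
# SKIP (11, 20) aaardwolf
```

With (*PRUNE), the second attempt starts at the second a and succeeds there, exactly as it does without a verb. With (*SKIP), the engine jumps past the first two a characters, so the word starting at the second a is never considered.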

(*COMMIT)

The fourth and "strongest" backtracking control verb is probably the easiest to grasp. As with the other three backtracking control verbs, when the engine passes (*COMMIT) from left to right, nothing happens (the token acts as an always-true assertion at that point in the string). If the engine needs to backtrack, it cannot cross (*COMMIT) from right to left (same as with the other backtracking control verbs). At that stage, it gives up on the match attempt. Normally, the engine would then advance to the next string position and start a new match attempt. But when (*COMMIT) fires, the engine abandons any further match attempt, and the overall match just fails. You can read the (*COMMIT) token as "past this point, we are committed to finding a match in this match attempt, or none at all".

But (*COMMIT) does not always commit

One thing to bear in mind is that other backtracking control verbs may influence the engine's behavior if it fails to match further to the right. For instance, if a (*PRUNE) or a (*SKIP) is crossed to the right of a (*COMMIT), when the match later fails and the engine starts to backtrack, it will try to cross (*PRUNE) or (*SKIP) before (*COMMIT) is reached. This fires the (*PRUNE) or (*SKIP) behavior, so new match attempts can take place despite the (*COMMIT), which never fires. At least that's the theory. Perl behaves inconsistently depending on whether (*PRUNE) or (*SKIP) is backtracked into. This is the object of a bug report.

Here are some code snippets if you'd like to see how this works in the two engines that support (*COMMIT).

# Perl
if ('123ABC' =~ /123(*COMMIT)B|.{3}/ ) { print "\$&='$&'\n"; }
else { print "No match\n"; }
# No match
# with /1(*COMMIT)23(*SKIP)B|.{3}/ the match is ABC:
# (*COMMIT) never fires, but (*SKIP) does.
# with /1(*COMMIT)23(*PRUNE)B|.{3}/ there is no match! (Bug.)

// PHP
echo preg_match('~123(*COMMIT)B|.{3}~', '123ABC', $m)
    ? "$m[0]\n" : "No match\n";
// No match: (*COMMIT) fires
echo preg_match('~1(*COMMIT)23(*PRUNE)B|.{3}~', '123ABC', $m)
    ? "$m[0]\n" : "No match\n";
// 23A: (*COMMIT) never fires
echo preg_match('~1(*COMMIT)23(*SKIP)B|.{3}~', '123ABC', $m)
    ? "$m[0]\n" : "No match\n";
// ABC: (*COMMIT) never fires

Backtracking Control Verbs inside Lookarounds, Subpatterns or Atomic Groups

Within a lookaround, subpattern or atomic group, a backtracking control verb only affects the sub-match. For instance, (*COMMIT) only commits the engine with respect to the match within a lookaround, not with respect to the overall match. Mileage may vary so experimentation is key.

Double Negatives

If you should wish to use backtracking control verbs within negative lookarounds (come on, you know you're asking for trouble!) remember to pay close attention to logic. For instance, a (*SKIP) that causes the failure of the pattern that dwells within a negative lookahead thereby causes the negative lookahead assertion to succeed. Likewise, an (*ACCEPT) (a verb we'll soon see) makes the pattern succeed, resulting in a failure of the negative lookahead assertion.

Backtracking Control Verbs inside Repeated Groups

When a backtracking control verb lives within a repeated group, PCRE fires the verb at the point where it is backtracked. In contrast, Perl, strangely, chooses to ignore verbs when the quantified group has not yet been fully matched. This is the object of a bug report. These examples may make this a little clearer.

# Perl
if ('1213' =~ /(?:1(*COMMIT)2)+./ ) { print "\$&='$&'\n"; }
# $&='121'
# The engine matches the first '12', then it matches
# '1' and the (*COMMIT) token. When the second '2' fails,
# the engine manages to backtrack across (*COMMIT),
# which never fires. Bug?

// PHP
echo preg_match('~(?:1(*COMMIT)2)+.~', '1213', $m)
    ? "$m[0]\n" : "No match\n";
// No match: (*COMMIT) correctly fires after the second '2'
// fails to match.

(*ACCEPT)

The next two verbs are bundled with backtracking control verbs in the documentation, but I would drop the backtracking part of the description and keep the control. (*ACCEPT) doesn't seem to relate to backtracking. (*FAIL) does cause the engine to backtrack, which is not the same as controlling what happens when the engine backtracks, as in the case of the first four.

(*ACCEPT) is delightfully simple, but its utility is limited. When the engine encounters (*ACCEPT), it immediately returns the portion of the string it has matched so far. If the engine matches (*ACCEPT) in the middle of a capturing group, that group is set to whatever characters have been matched up to that point.

The only use case for (*ACCEPT) that I'm aware of is when the branches of an alternation are distributed into a later expression that is not required for all of the branches. For instance, suppose you want to match any of these patterns: BAZ, BIZ, BO. You could simply write BAZ|BIZ|BO, but if B and Z stand for complicated sub-patterns, you'll probably look for ways to factor the B and Z patterns. A first pass might give you B(?:AZ|IZ|O), but that solution doesn't factor the Z. Another option would be B(?:A|I)Z|BO, but it forces you to repeat the B. This pattern allows you to factor both the B and the Z:

B(?:A|I|O(*ACCEPT))Z

If the engine follows the O branch, it never matches BOZ because it returns BO as soon as (*ACCEPT) is encountered, which is what we wanted.

Here is sample code to try (*ACCEPT) in the two languages that support it. For each language, the second fragment demonstrates the behavior inside a capture group.

# Perl
if ('BOZ' =~ /B(?:A|I|O(*ACCEPT))Z/ ) { print "\$&='$&'\n"; }
# BO
# with 'BAZ' as subject, the match would be BAZ
if ('BOZX' =~ /B((?:A|I|O(*ACCEPT))Z)X/ ) { print "\$1='$1'\n"; }
# Group 1: O
# with 'BAZX' as subject, Group 1 would be AZ

// PHP
echo preg_match('~B(?:A|I|O(*ACCEPT))Z~', 'BOZ', $m)
    ? "$m[0]\n" : "No match\n";
// BO
// with 'BAZ' as subject, the match would be BAZ
echo preg_match('~B((?:A|I|O(*ACCEPT))Z)X~', 'BOZX', $m)
    ? "$m[1]\n" : "No match\n";
// Group 1: O
// with 'BAZX' as subject, Group 1 would be AZ

Are there other ways to factor the B and Z patterns? Sure. Conditionals come to mind, for instance:
- Option 1: B(?:(A|I)|O)(?(1)Z) (if A or I have been captured to Group 1, then match Z)
- Option 2: B(?:A|I|(O))(?(1)|Z) (if O has been captured to Group 1, then match the empty string, otherwise match Z)

Please note that (*ACCEPT) is not the opposite of (*FAIL).

(*FAIL)

This verb is available in Perl, PCRE and Python's alternate regex package. (*FAIL) means: at the present position in the string, the current token does not match. The result is no different from trying to match a b character with a p token: the engine chokes on the token, and its next move is to backtrack in order to find a different way to make the current match attempt succeed.

Alternatives to (*FAIL)

If you want to shave three characters, you can write (*F) instead of (*FAIL). And in most languages that don't support (*FAIL), you can simply use the classic (?!), which has the same effect. How does (?!) work? It's a negative lookahead that asserts that at the current position in the string, it is not possible to match the empty string. Since the empty string can be matched at any position, the assertion fails. I recall reading in one of the engines' docs that internally (*FAIL) is translated to (?!), unless it was the other way around. The regex syntax offers many options besides (?!) to force the engine to fail at a certain point in the pattern: consider contradictory pairs such as (?=A)B. But (?!) is the most popular, probably because it is so compact. Some of these ways are explored in the trick about forcing a failure.

Use Cases for (*FAIL)

There are a number of use cases for (*FAIL) and its equivalents.

You can use (*F) within conditionals to enforce certain balancing conditions. For instance, (INTRO)?(?:MAIN1|MAIN2(?(1)|(?!))) is a long-winded way of ensuring that MAIN2 is only matched if INTRO has been matched before. This approach is used elegantly in .NET balancing groups. It is also developed in the section on Conditionals At Work: Controlling Failure.

In my trick to write pre-defined subroutines for engines that don't support it, I use (*FAIL), or its alias (?!), to force the engine to backtrack after defining a capture group without intending to consume characters at that position in the string.
In a moment, we'll look at the (*SKIP)(*FAIL) construct, which is a delightful way of excluding certain patterns from the match.

In Perl, which has extensive callback facilities, (*FAIL) can be used to explore all the branches of a match tree. Consider this example:

'abc' =~ /\w+(?{print "$&\n";})(*F)/

This prints abc, ab, a, bc, b, c. After the \w+ matches the whole string, the code capsule prints the match, then (*F) forces the engine to backtrack. The engine gives up the c, the callback prints ab, the (*F) forces the engine to backtrack, and so on.

(*FAIL) is not the opposite of (*ACCEPT)

It's natural to imagine that (*FAIL) would be the opposite of (*ACCEPT), but that is not the case. If it were the opposite of (*ACCEPT), then (*FAIL) would mean at this point in the match, fail the match attempt. The engine would then throw away whatever had been matched thus far, and perhaps begin a new match attempt at the next starting position in the string.

Forcing the Match Attempt to Fail

If you truly want the opposite of (*ACCEPT) in order to abort the match attempt, you will need to use something like (*PRUNE)(*FAIL) or (*COMMIT)(*FAIL):

In the case of (*PRUNE)(*FAIL), once the engine encounters (*FAIL), it tries to backtrack in order to find a successful match. When it tries to backtrack across the (*PRUNE), the match attempt fails. The engine then tries a new match attempt at the next starting position in the string, if any.

In the case of (*COMMIT)(*FAIL), once the engine encounters (*FAIL), it tries to backtrack in order to find a successful match. When it tries to backtrack across the (*COMMIT), the match attempt fails, and the engine also abandons any further match attempts.

Soon we'll study a surprisingly useful variation on these themes: (*SKIP)(*FAIL).
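The (?!) alias and the conditional use case for forced failure can both be tried with Python's standard re module, which supports negative lookaheads and conditionals even though it has no verbs. This is a minimal sketch; the variable names and the test subjects are illustrative.

```python
import re

# (?!) asserts that the empty string cannot be matched here: it never succeeds
always_fail = re.compile(r'(?!)')
print(always_fail.search('anything'))  # None

# MAIN2 is only allowed when INTRO (Group 1) has been matched before it
guarded = re.compile(r'(INTRO)?(?:MAIN1|MAIN2(?(1)|(?!)))')

outcomes = {}
for subject in ('INTROMAIN2', 'MAIN2', 'MAIN1'):
    m = guarded.search(subject)
    outcomes[subject] = m.group() if m else None

print(outcomes)
# {'INTROMAIN2': 'INTROMAIN2', 'MAIN2': None, 'MAIN1': 'MAIN1'}
```

On the subject MAIN2, Group 1 is unset, so the conditional takes its "no" branch, hits (?!), and the attempt fails at every starting position.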

(*MARK)

This verb is used either on its own or in conjunction with (*SKIP). You use it to tag (and in certain cases "bookmark") a position in the string, as in (*MARK:after_the_digits). Note that (*:some_tag) is an alias for (*MARK:some_tag).

When the verb is used by itself, you typically pepper several instances of it in your code, as in (*MARK:tag1), (*MARK:tag2). You are later able to determine which path the engine has used to return the match by inspecting the $REGMARK variable in Perl or PCRE's pcre_extra data block (please refer to the documentation).

When the verb is used in conjunction with (*SKIP), the (*SKIP:some_tag) syntax specifies a "bookmark" in the string, the position where the engine should start its next match attempt if the match attempt explodes when the engine tries to backtrack across (*SKIP:some_tag). If you'd like to see how this works, here's a code sample.

# Perl
if ('123ABC456' =~ /123(*MARK:past_digits)[A-Z]+(*SKIP:past_digits)9..|.{6}/ ) { print "\$&='$&'\n"; }
# $&='ABC456'
# instead of skipping past ABC (default (*SKIP) behavior),
# the engine only skips past 123

// PHP
echo preg_match( '~123(*MARK:past_digits)[A-Z]+(*SKIP:past_digits)9..|.{6}~', '123ABC456', $m)
    ? "MARK $m[0]\n" : "No match\n";
// match: ABC456
// instead of skipping past ABC (default (*SKIP) behavior),
// the engine only skips past 123

Marking with (*PRUNE) and (*THEN)

You can use (*PRUNE:some_tag) and (*THEN:some_tag) to record the matching path if you'd like to later inspect the $REGMARK variable in Perl or PCRE's pcre_extra data block. One difference between these and (*MARK:some_tag) is that while (*SKIP:some_tag) looks for (*MARK:some_tag), it does not look for (*PRUNE:some_tag) or (*THEN:some_tag).

Using (*SKIP)(*FAIL) to Exclude Unwanted Matches

You might recall that often, saying what you don't want is an important strategy to achieve your regex goals. We typically do that with negative character classes such as \D and [^"], or with negative assertions such as (?!A). Sometimes, we want a bit more sophistication to express what we don't want. For instance, suppose we'd like to match all individual words in a text (defined by the pattern \b\w+\b) except if such words live between curly braces, {like these}.

For this kind of situation, the (*SKIP)(*FAIL) construct is wonderful. With it, instead of cooking up some convoluted negative logic, you express exactly what you want to avoid; if you find it, you skip it; if you don't find it, you match what you want. Here is how we could match single words so long as they don't live inside a set of curly braces.

{[^}]*}(*SKIP)(*FAIL)|\b\w+\b

Note that for this pattern, we assume we know that our text can never contain {nested{braces}}.

Here is how the pattern works. First, notice the central pivot around the alternation |. On the left side, we use {[^}]*} to try to match what we don't want: anything within a set of curly braces. For this, we match a left curly brace, then [^}]* matches zero or more characters that aren't a right curly brace (another example of saying what we don't want), then we match a right curly brace. If the engine matches a set of curly braces, it then encounters the (*SKIP) verb, which always matches. Next, it encounters the token (*FAIL), which always fails. At that stage, the engine tries to backtrack across (*SKIP) in the hope of finding a different way to match within the current attempt. But (*SKIP) causes the match attempt to explode, and (*SKIP) also tells the engine to start the next match attempt at the string position corresponding to where (*SKIP) was encountered. This means the engine will never again look at the set of curly braces it just matched. The content we want to avoid has been skipped (matched then thrown away).
At the start of each match attempt, if the engine is unable to match a set of curly braces, it jumps to the right branch of the | alternation. There, \b\w+\b attempts to match an individual word. If this fails, the match attempt fails. The engine then advances to the next position in the string, and tries a whole new match attempt (once again trying to match a set of curly braces first.) In effect, (*SKIP)(*FAIL) says:
Throw away anything you can match to the left of me.
This is a very useful technique, and it also appears on the page about the best regex trick. If you'd like to try (*SKIP)(*FAIL), here is sample code for the three engines that support it.

    # Perl
    while ('good words {and bad} {ones}' =~ /{[^}]*}(*SKIP)(*FAIL)|\b\w+\b/g ) {
        print "matched: '$&'\n";
    }
    # matched: 'good'
    # matched: 'words'

    // PHP
    if (preg_match_all('~{[^}]*}(*SKIP)(*FAIL)|\b\w+\b~', 'good words {and bad} {ones}', $matches))
        var_dump($matches[0]);
    /* array(2) {
         [0]=> string(4) "good"
         [1]=> string(5) "words"
       } */

    # Python
    # if you don't have the regex package, pip install regex
    import regex as mrab
    # print(mrab.__version__) should output 2.4.76 or higher
    print(mrab.findall(r'{[^}]*}(*SKIP)(*FAIL)|\b\w+\b', 'good words {and bad} {ones}'))
    # ['good', 'words']
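For engines without backtracking control verbs, such as Python's built-in re module, a similar result can be approximated with the related trick mentioned above: match what you don't want in one branch, capture what you do want in the other, and keep only the captures. The following is a minimal sketch of that substitute technique, not of (*SKIP)(*FAIL) itself; the name words_outside_braces is made up for illustration.

```python
import re

# Left branch: match (and thereby consume) a {braced} section.
# Right branch: capture a word in Group 1.
PATTERN = re.compile(r'\{[^}]*\}|\b(\w+)\b')

def words_outside_braces(text):
    """Return the words that do not live inside a set of curly braces."""
    # Group 1 is only set when the right branch matched.
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

print(words_outside_braces('good words {and bad} {ones}'))
```

As with the original pattern, this assumes the text never contains nested braces.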

Documentation

That's about all I have to say about backtracking control verbs for the time being. If you'd like to learn more, feel free to dip into the documentation—though my hope is to have made the verbs far more approachable than in those documents. PCRE documentation on backtracking control verbs Perl documentation on backtracking control verbs Smiles, Rex You have some regex patterns on https://www.rexegg.com/backtracking-control-verbs.html which extend outside of the 'article' column. The first place is below the '#then' anchor, after the text "Here are code snippets if you'd like to try it with the two engines that currently support (*THEN). " The second place is below the '#pruneisdifferent' anchor, after the text "Here are some code snippets if you'd like to try this in the three engines that currently support (*PRUNE). ". I'm able to see this on Chrome in Ubuntu 16 and Chrome in OS X. The code that spills over is within a PRE element within a DIV named "codebox". Could the CSS for this be updated to either wrap that text, allow that div to scroll any overflowing content, or even consider widening the 'article' width slightly? I've been trying to get as well versed in regex as possible and this was a topic that I felt I only had a loose understanding on. I've only skimmed this section a little, but I plan on using this as a more robust reference for future practicing. Very helpful for removing SQL comments, etc. Reply to Mike Hi Mike, Thank you so much for your positive feedback, you made my day. I only finished this page five days ago after putting days of work into it. I thought it was an esoteric topic that would attract few readers, so it's wonderful to see that at least one person is enjoying it already. Wishing you a fun weekend, Rex

Regex Gotchas

On this page, I'd like to collect some regex phenomena that may trip or puzzle you for a moment. Often, such little problems lead the regex apprentice to discover that he hadn't fully understood one aspect of regex he thought he had already mastered. For me, such Gotcha! moments are a great source of satisfaction and learning. I tried to write these problems in a Question and Answer format. The first ones are really simple in the sense that they would only trip you on your first day with regex. As the collection grows, I hope that further down the list even accomplished regexers will find something to arouse their interest. Please note that this page is a recent venture. There aren't many Gotcha! yet, but I intend to flesh out the collection over time. Jumping Points For easy navigation, here are some jumping points to various sections of the page:

The right case

Question: I made this regex: [a-z]+. Why isn't it matching Cat? By default, a regex pattern is case-sensitive. How you turn on case-insensitivity depends on your engine.
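As one illustration of "depends on your engine", here is a small sketch using Python's built-in re module, which accepts either a flag argument or an inline modifier. The variable names are only for the example.

```python
import re

# Case-sensitive by default: "Cat" as a whole is not made of [a-z] characters.
strict = re.fullmatch(r'[a-z]+', 'Cat')

# Option 1: pass a flag to the matching function.
with_flag = re.fullmatch(r'[a-z]+', 'Cat', re.IGNORECASE)

# Option 2: turn the mode on inline, at the start of the pattern.
inline = re.fullmatch(r'(?i)[a-z]+', 'Cat')

print(strict, with_flag.group(), inline.group())
```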

Stuck on a Line

Question: I made this regex: My .* cat. Why isn't it matching this string? My dog and my cat By default, the . does not match line breaks. How you make it match carriage returns, new lines and other line break characters depends on your engine.
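Here is a short sketch of the same gotcha in Python's re module, assuming the test string has a line break after dog. In that module the mode that lets the dot match line breaks is called DOTALL, or (?s) inline.

```python
import re

subject = 'My dog\nand my cat'

# Default mode: the dot stops at the line break, so there is no match.
default_mode = re.search(r'My .* cat', subject)

# DOTALL mode, set with a flag or inline with (?s).
dotall_mode = re.search(r'My .* cat', subject, re.DOTALL)
inline_mode = re.search(r'(?s)My .* cat', subject)

print(default_mode, repr(dotall_mode.group()), repr(inline_mode.group()))
```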

Only the Word, Please

Question: I made this regex: cat. My tool finds a match in the word certificate, but I only want to find cat on its own. What to do? The easiest fix is to use word boundaries \b, which match positions where one side is a word character (letter, digit, underscore) and the other side is not a word character (for instance it is the beginning of the string or a space character). This gives you: \bcat\b Improved word boundary The regex above will not find cat in _cat25, because there is no boundary between an underscore and a letter, nor between a letter and a digit: these are all what regex defines as word characters. If you think that digits and underscores should count as a word boundary, \b will therefore not work for you. If you would like to use a boundary that detects the edge between an ASCII letter and a non-letter, you can make it yourself. See the section about a DIY "real word boundary" on the page about regex boundaries.
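The difference between the three approaches can be checked with a quick Python sketch. The letters-only boundary below, built from two negative lookarounds, is only an assumption about what a DIY boundary might look like; it is not copied from the page about regex boundaries.

```python
import re

text = 'cat certificate _cat25'

def starts(pattern):
    """Return the start offsets of all matches of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

plain = starts(r'cat')            # also finds the cat inside certificate
bounded = starts(r'\bcat\b')      # misses _cat25: _ and 2 are word characters
# Hypothetical DIY boundary: no ASCII letter on either side.
letters_only = starts(r'(?<![A-Za-z])cat(?![A-Za-z])')

print(plain, bounded, letters_only)
```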

Empty-Handed

Question: I made this regex: \d*|\w+. The engine reports that there is a match, but it is empty. Why isn't it matching Cat? On the face of it, this pattern seems to match either digits or some word characters, so we might expect Cat to match. However, when the engine attempts a match at the position before the initial C, the \d* is able to match, since it is true that there are zero or more digits at that position. The match is returned, and the right side of the alternation is never visited. In all engines, if the engine is instructed to find multiple matches, it will also find other "empty" matches after the C and the other letters. One difference among engines is that Perl and PCRE (C, PHP, R…) will also match Cat. That is because after finding a zero-width match, these engines will attempt another match at the same position in the string, and backtrack into alternations and other subexpressions as needed to find a non-zero-width match.

It Doesn't Match Enough

Question: I made this regex: [129]|18 to match the numbers 1, 2, 9 or 18. Why isn't it matching 18? The character class [129] matches the 1 in 18. The match is returned, and the right side of the alternation is never visited. Anchors or boundaries would resolve this problem: ^(?:[129]|18)$ Conclusion: when setting up an alternation, be mindful of what each branch matches, especially on the left side, and all the more so if you are using stars and question marks. If the leftmost alternation can never fail, for instance, you can be sure that other sides of the alternation will never match.
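A quick Python sketch of the problem and of the anchored fix follows. The third pattern, which simply puts the longer branch first, is an extra variation added here for comparison and is not part of the answer above.

```python
import re

# The left branch [129] grabs the 1, and the lone 8 that remains cannot match.
unanchored = re.findall(r'[129]|18', '18')

# With anchors, the left branch fails at $ and the engine tries the 18 branch.
anchored = re.findall(r'^(?:[129]|18)$', '18')

# Variation (an assumption, not from the text): longer branch first.
reordered = re.findall(r'18|[129]', '18')

print(unanchored, anchored, reordered)
```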

Phantom Replacements

Question: I made this regex: X* and want to use it with this replacement string: Y. When I run the replacement on string X, I get YY, and when I run it on A, I get YAY. What is happening? This is the same problem as in Unwanted Matches, but with a replacement. Against string X, the regex X* matches twice. At the position before the X, it matches X. At the position after the X, X* is allowed to match zero X characters, so it matches an empty string. All major languages replace both matches with a Y, giving you YY. The one exception was Python's re module before version 3.7, which skipped an empty match adjacent to a previous match and returned a single Y; since version 3.7, re also returns YY. Matthew Barnett's regex module behaves like the other engines as well: print( regex.sub("(?V1)X*", "Y", "X") ) yields YY. Likewise, against the string A, the regex X* matches twice. At the position before the A, it matches the empty string. At the position after the A, it also matches an empty string. All major languages will replace both matches with a Y, giving you YAY. Try it in Python: print( re.sub("X*", "Y", "A") )
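To see the two matches behind each replacement, here is a small sketch that lists the match spans before substituting. It assumes Python 3.7 or later; older versions of the re module handled the empty match after X differently.

```python
import re

# Spans of every match of X* in each subject (start, end).
spans_x = [m.span() for m in re.finditer(r'X*', 'X')]
spans_a = [m.span() for m in re.finditer(r'X*', 'A')]

# Each match, empty or not, is replaced by one Y.
replaced_x = re.sub(r'X*', 'Y', 'X')
replaced_a = re.sub(r'X*', 'Y', 'A')

print(spans_x, replaced_x)
print(spans_a, replaced_a)
```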

The Engine Doesn't Try All Options Inside the Lookahead

Question: Here is my string: _rabbit _dog _mouse DIC:cat:dog:mouse The DIC section at the end is a list of allowed animals. I want to match all the _tokens named after an allowed animal, so I expect to match _dog and _mouse. I made this regex: _(?=.*:(\w+)\b)\1\b But it only matches _mouse. It looks like the lookahead is not trying all the options. Why? Because lookarounds are atomic (the link explains this example in detail). Once the engine leaves a lookaround, its assertion has either returned true or false. From the engine's standpoint, that is all it wants to know. If a lookaround returns true, the engine tries to match the next tokens. If something fails further down the pattern, the engine has no reason to revisit the lookaround: true is always true.
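The behavior is easy to reproduce in Python. In the sketch below, the second pattern is one possible rearrangement, offered as an assumption rather than as the linked page's solution: it captures the token name first, then uses the lookahead only to check that the name appears in the DIC list, so nothing depends on backtracking into the lookahead.

```python
import re

subject = '_rabbit _dog _mouse DIC:cat:dog:mouse'

# Original pattern: the greedy .* makes Group 1 capture "mouse", and the
# engine never re-enters the lookahead to try "dog" or "cat".
original = [m.group(0) for m in re.finditer(r'_(?=.*:(\w+)\b)\1\b', subject)]

# Hypothetical rearrangement: capture first, then look ahead for that capture.
rearranged = [m.group(0) for m in re.finditer(r'_(\w+)\b(?=.*:\1\b)', subject)]

print(original, rearranged)
```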

Regex Syntax Tricks

On this page, I'd like to collect some useful regex tricks. The idea here is not to assemble a cookbook of regex recipes to match this or that—for that, see the cookbook page and the many pages of tricks linked on the left. Rather, the idea is to present more general regex syntax tricks—by which I mean that each of these tricks can be seen as a way to accomplish something which, if you read the manual to the letter, might first seem impossible as it is not covered by the syntax. The earlier of these tricks will be well-known to regex heads, but I hope that even avid regexers will find something new further down. Jumping Points For easy navigation, here are some jumping points to various sections of the page:

Forcing Failure: Regex Constructs that Never Match

Sometimes, you want to stop the regex engine in its tracks by throwing in an expression that always fails. Once the engine fails to match an expression or token, it attempts to backtrack in the hope of finding another path that succeeds. You might wish to exploit this path exploration, or design your pattern so that this backtracking never succeeds. Below, we'll examine why we might want a token that never matches. Before looking at the "standard solution", let's explore some options. To force the engine to choke and turn back, one strategy is to require two incompatible things, such as using a lookahead to assert that the next character is an a, then immediately trying to match one character that is not an a: (?=a)[^a]. This can simply never match. You can do the same with shorthand classes: (?=\d)\D, (?=\w)\W or (?=\s)\S. And you can twist the logic any way you like, as in (?=[^a])a, or by using a negative lookahead, as in (?!a)a. This latter version asserts that what follows is not an a, then tries to match an a. This last pattern suggests that there is no need to involve entire character classes such as \D or [^a]. Indeed, (?=z)a is just as effective as our original (?=a)[^a]. With word boundaries, you could use \b\B (requiring that a position be both a boundary and not a boundary), A\bA (requiring a boundary between two letters, where there is none) or =\BA (requiring a non-boundary where there is a boundary). Ideas involving anchors ^ and $ are less reliable because of multi-line mode.

Standard Solution

To force a regex engine to fail and turn back, the most-used expression is probably this: (?!) Its compactness is unbeatable. How does it work? (?! … ) is a negative lookahead asserting that what is in the parentheses after the ?! cannot be matched. In (?!) there is nothing between the (?! and the closing parenthesis, an absence that the engine interprets as the empty string. The regex engine translates (?!) as "assert that at this position, the empty string cannot be matched." The empty string can always be matched, so the engine fails to match the (?!) assertion. This causes it to backtrack in search of a different match, if any can be found. I don't know who invented this little gem. In Perl, PCRE (C, PHP, R…) and Python's alternate regex engine, there is a special token that can never match, forcing the engine to fail and causing it to backtrack in search of another match. It can be written (*FAIL) or (*F). Internally, PCRE optimizes (?!) to (*F). You can read all about (*FAIL) on my page about backtracking control verbs.

Why would you want a token that never matches?

Among other uses, tokens that always fail come in handy:
- in conditionals that check for balancing conditions in the string (e.g. the engine found an "opening construct", now you want it to match the corresponding closing construct),
- to explore the branches of a match tree.
For details, please read the use cases on my page about backtracking control verbs.

When I use this trick, is the regex match guaranteed to fail?

Not necessarily. While the (?!) construct and its equivalents never match, that doesn't imply that the patterns containing them won't match: after failing to match (?!), the engine attempts to backtrack, and it may find an alternate path that succeeds. If your pattern is designed in such a way that backtracking is fruitless, the match attempt will fail. If you want to guarantee failure in all cases, please read how to force the match attempt to fail on my page about backtracking control verbs.
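The claims above can be spot-checked with a few lines of Python. This is only a sketch over a handful of sample strings (chosen arbitrarily for the example), so it illustrates the idea rather than proving that the constructs can never match.

```python
import re

# Constructs from this section that should never match anything.
never_match = [r'(?=a)[^a]', r'(?!a)a', r'(?=z)a', r'\b\B', r'(?!)']
samples = ['', 'a', 'aaa', 'za', 'a b', '=A']

# True if none of the constructs matches anywhere in any sample.
all_fail = all(re.search(p, s) is None for p in never_match for s in samples)

# A pattern containing (?!) can still match: the left branch fails,
# the engine backtracks, and the right branch of the alternation succeeds.
rescued = re.search(r'a(?!)|b', 'ab')

print(all_fail, rescued.group())
```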

Parity and beyond: check that string length is a multiple of n

Without using string functions, can you check that a string has an even number of characters? This is easily done: ^(?:..)+$ The non-capturing group (?:..) matches two characters. The + quantifier repeats that one or more times. To check that a string's length is a multiple of n, a more general version would be ^(?:.{n})+$ If your string spans several lines, you could start out with something like (?s)\A(?:.{n})+\z, using the single-line (DOTALL) mode s to ensure that the dot matches line breaks, and the \A and \z anchors to anchor the whole string. Each line break character will count as one character, so this is probably less useful.
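Here is a minimal Python sketch of the general version. Two things in it are adaptations rather than part of the text above: the helper name length_is_multiple_of is made up, and the end anchor is written \Z, which is how Python's re module spells the "very end of the string" anchor.

```python
import re

def length_is_multiple_of(text, n):
    """Check, with a regex only, that len(text) is a positive multiple of n."""
    # re.DOTALL lets the dot match line breaks, so each one counts as a character.
    pattern = r'\A(?:.{%d})+\Z' % n
    return re.search(pattern, text, re.DOTALL) is not None

print(length_is_multiple_of('abcd', 2))      # four characters
print(length_is_multiple_of('abc', 2))       # three characters
print(length_is_multiple_of('ab\ncd\n', 3))  # six characters, line breaks included
```

Because of the + quantifier, the empty string does not pass the test; swap it for * if a length of zero should count as a multiple.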

Find an Upper-Case Word that Later Appears in Lowercase

This idea was shown on the inline modifiers section of the main syntax page. (\b[A-Z]+\b)(?=.*(?=\b[a-z]+\b)(?i)\1) The upper-case word is captured to Group 1. Then we match characters up to a point where the lookahead (?=\b[a-z]+\b) can assert that what follows is a lower-case word, then we set case-insensitive mode on and we match the content of Group 1. In .NET we could use a lookbehind instead of the double lookahead: (\b[A-Z]+\b)(?=.*\b[a-z]+\b(?i)(?<=\1))
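To try the idea in Python, the pattern needs a small adaptation, because recent versions of the re module reject a bare (?i) in the middle of a pattern. The sketch below assumes the scoped form (?i:\1) is an acceptable substitute: it turns case-insensitivity on for the back-reference only.

```python
import re

# Same logic as the pattern above, with (?i)\1 rewritten as the scoped (?i:\1).
pattern = re.compile(r'(\b[A-Z]+\b)(?=.*(?=\b[a-z]+\b)(?i:\1))')

# HELLO later appears in lowercase, so it is reported.
found = pattern.findall('HELLO world hello')

# Here the word only reappears in upper case, so nothing is reported.
not_found = pattern.findall('HELLO there HELLO')

print(found, not_found)
```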

Atomic Groups for Engines that Don't Support It

Among the major engines, JavaScript and Python share the dubious distinction of not supporting atomic grouping syntax, such as (?>A+) (Python's re module only gained it in version 3.11). This hack lets you get around it: (?=(A+))\1. The lookahead asserts that the expression we want to make atomic (i.e. A+) can be found at the current position, and the parentheses capture it. The back-reference \1 then matches it. So far, this is just a long-winded way of matching A+. The magic happens if a token fails to match further down in the pattern. The engine will not backtrack into a lookaround, and for good reason: if its assertion is true at a given position, it's true! Therefore, Group 1 keeps the same value, and the engine has to give up what was matched by A+ in one block. The page on Reducing (? … ) Syntax Confusion looked at two examples of atomic groups. Referring to that page for what they do, here are the equivalent "pseudo-atomic" groups in JavaScript and Python: (?>A+)[A-Z]C translates to (?=(A+))\1[A-Z]C and (?>A|.B)C translates to (?=(A|.B))\1C
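The effect of the hack can be seen by asking for one more A than the group is willing to give back. This is a small sketch using Python's re module; the test string AAA is chosen just for the demonstration.

```python
import re

# Ordinary greedy group: A+ gives back one A so that the final A can match.
greedy = re.search(r'A+A', 'AAA')

# Pseudo-atomic group: the lookahead captures all the As, \1 must match that
# exact capture, and the engine cannot re-enter the lookahead to shorten it.
pseudo_atomic = re.search(r'(?=(A+))\1A', 'AAA')

print(greedy.group(), pseudo_atomic)
```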

Pre-Defined Subroutines for Engines that Don't Support It

Perl, PCRE (C, PHP, R…) and Python's alternate regex engine support the terrific (?(DEFINE) … ) syntax that allows you to pre-define named regex subroutines and to use them to build beautiful modular patterns such as the regex to match numbers in plain English. At present, the only other engine I know that supports subroutines, Ruby 1.9+, does not support the (?(DEFINE) … ) syntax. I came up with this trick in order to use pre-defined subroutines in engines that lack that syntax. As with (?(DEFINE) … ), the idea is to place the definitions in a section at the beginning of the regex, although you can put them anywhere before the subroutine is called. Each definition looks like this: (?:(?<foo> … )(?!))? Here the name of the subroutine is foo, and the actual subroutine goes in the place of the ellipsis …. For instance, (?:(?<uc_word>\b[A-Z]+\b)(?!))? defines a subroutine for a word in all-caps. Once the subroutine is defined, you call it with \g<foo> (Ruby). This makes it very convenient to call patterns repeatedly (especially if they are long), to maintain them in one place, and so on.

How does the trick work?

Before we look at a full-length example, let's discuss how this works. The external parentheses in (?:(?<foo> … )(?!))? define a non-capturing group. The group is made optional by the ? quantifier. As we'll soon see, the non-capturing group always fails, and the zero-or-one quantifier ? is what allows us to go on after that internal failure. Within the non-capturing group, we start with a standard named group definition (?<foo> … ). This creates both a named group and a named subroutine. The engine doesn't know that at this stage we only aim to define the subroutine without actually matching characters. Therefore, if it finds characters that match the pattern in the named group, it matches them. We don't want that, as our goal in our definitions is to simply define without moving our position in the string. This is why after defining the named group and subroutine, we force the non-capturing group to fail by using (?!), which we saw earlier in the page with the forced failure trick. Of course we don't want the whole regex to fail, which is why we use the ? quantifier to make the non-capturing group optional.

Full Example

This example is identical to the one in the section on pre-defined subroutines: it defines a basic grammar, allowing us to match sentences such as five blue elephants solve many interesting problems and many large problems resemble some interesting cars. This is for Ruby:

    (?x) # whitespace mode

    ####### DEFINITIONS #########
    # pre-define quant subroutine
    (?:(?<quant>many|some|five)(?!))?
    # pre-define adj subroutine
    (?:(?<adj>blue|large|interesting)(?!))?
    # pre-define object subroutine
    (?:(?<object>cars|elephants|problems)(?!))?
    # pre-define noun_phrase subroutine
    (?:(?<noun_phrase>\g<quant>\ \g<adj>\ \g<object>)(?!))?
    # pre-define verb subroutine
    (?:(?<verb>borrow|solve|resemble)(?!))?
    ####### END DEFINITIONS #######

    ##### The regex matching starts here #####
    \g<noun_phrase>\ \g<verb>\ \g<noun_phrase>

Mimic an Alternation Quantified by a Star

You want to match a series of joe_ or bob_ tokens, as in: joe_joe_joe_bob_joe_bob_ We allow an empty match. Your first thought is probably (?:bob_|joe_)* But can you perform the same task without using an alternation? This will do it: (?:bob_)*(?:(?:joe_)+(?:bob_)*)* After (?:bob_)* matches zero or more bob_ tokens, we match (zero or more times) one or more joe_ tokens followed by zero or more bob_ tokens. In more general terms, (A|B)* becomes A*(?:B+A*)*. We use this technique in the Unrolled Star Alternation solution to the Greedy Trap problem.
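One way to gain confidence in the rewrite is to compare both patterns on every short sequence of tokens. The sketch below does that in Python; the extra amy_ token and the helper name patterns_agree are made up for the test, to make sure both patterns also reject the same strings.

```python
import re
from itertools import product

alternation = re.compile(r'(?:bob_|joe_)*')
unrolled = re.compile(r'(?:bob_)*(?:(?:joe_)+(?:bob_)*)*')

def patterns_agree(max_tokens=4):
    """Compare both patterns on all sequences of up to max_tokens tokens."""
    tokens = ['bob_', 'joe_', 'amy_']   # amy_ should be rejected by both
    for count in range(max_tokens + 1):
        for combo in product(tokens, repeat=count):
            subject = ''.join(combo)
            same = bool(alternation.fullmatch(subject)) == bool(unrolled.fullmatch(subject))
            if not same:
                return False
    return True

print(patterns_agree())
print(bool(unrolled.fullmatch('joe_joe_joe_bob_joe_bob_')))
```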

Miscellaneous Tricks

In this section, I would like to add links to tricks that are described on other pages but do not fully satisfy this page's specification of regex syntax tricks that perform tasks that may not at first seem like they are covered by the engine's syntax. double-negative delimiters I've implemented an infinite lookbehind demo for PCRE. Subject: A simpler 'Mimic of Alternation'? In your last example on this page, why does (A|B)* becomes A*(? :B+A*)*? Isn't it much simpler that (A|B)* becomes (A*B*)*? P.S. My name is "Oz". Reply to Ozz Hi Oz, Because of this: http://www.rexegg.com/regex-explosive-quantifiers.html

PCRE Callouts

PCRE has a terrific feature: callouts, specified with the syntax (?C…), where the dots stand for an optional argument. For instance, (?C), (?C12) and (?C'beyond the digits') are all valid callouts. If you call PCRE's matching function in the standard way, when the engine encounters (?C…), it ignores it and continues its match attempt. However, if you specify a callout function before calling PCRE's matching function, then when the engine encounters (?C…), it temporarily suspends the match and passes control to that callout function, to which it provides information about the match so far. The callout function then performs any task you see fit, then it returns a code to the engine, letting it know whether to proceed normally with the rest of the match. This feature enables PCRE to supply similar functionality to Perl's code capsules. The goal of this page is to provide you with working code to get you started with callouts. Jumping Points For easy navigation, here are some jumping points to various sections of the page:

Basic Syntax

The basic syntax for a callout within a PCRE pattern is (?C…) The optional argument in the dots takes two forms: either an integer or a string, as in (?C12) or (?C'beyond the digits'). The argument is passed to the callout function, which can choose to use it or ignore it.

Argument as Identifier

One way to use the callout arguments is as identifiers. If you have several callouts in your pattern, the callout function can then test the value of the identifier to handle various cases.

Argument as Value

You can also instruct the callout function to do something directly with the value of the argument. For instance, if you have several callouts with integers (say (?C8), (?C16), (?C32)), the callout function might use the value in an expression. Likewise, if the argument is a string, the callout function might use that value directly, for instance by displaying it.

Argument as Code

In a dynamic language, a string callout argument might even contain a piece of code to be evaluated at run-time… Go for it, implement that on your company's website, everyone will love that security feature!

Form of the Argument

- You can omit the argument and just use (?C), in which case the argument will be set to 0.
- If using an integer argument, the value must be 255 or less.
- If using a string identifier, various delimiters are possible: a set of {curly braces}, or a pair of any one of the characters in the class [`'"^%#]
- The delimiter can be escaped within the identifier by doubling it, as in (?C'What''s Up?')

The Callout Function

The callout function can perform any tasks you see fit. It receives a lot of information about the match: the current position in the string and the pattern, the temporary match, and more; we'll explore these values in Program 2.

Return Values

After the callout function has done its job, you make it return a value to the engine.
- A zero tells the engine to resume its match attempt where it left off.
- A positive value tells the engine to fail at the current position in the pattern, just like a (?!) construct or a (*FAIL). This causes the engine to start backtracking in search of a matching path.
- A negative value tells the engine to fail the overall match (the current match attempt fails, and no further attempts are made).

Testing PCRE Callouts

To work with PCRE callouts, you either need to be using the PCRE library directly or to work in a language or tool that has implemented the callout feature. In a tool (such as Notepad++, which supports PCRE), this is unlikely: how would you specify the callout function? In a language other than C and C++ (which can call PCRE's functions directly), callouts may not be a priority for the developers of the language. For instance, in PHP, the preg_match function makes no room for callouts. If you're using a .NET language such as C#, Visual Basic or F#, you're in luck.

Testing PCRE Callouts using PCRE.NET

Out of the box, .NET has a terrific regex engine. But on the page about C# regex, I also praise PCRE.NET, an alternate engine for .NET, a wrapper around the PCRE library provided by Lucas Trzesniewski. With this interface to PCRE, you can access all of PCRE's rich syntax, including callouts. PCRE.NET is a snap to add to any .NET project. In Visual Studio, Press Ctrl + Q for the Quick Launch window, type nuget and select Manage Nuget Packages for Solution. In the search window, type pcre.net, making sure that the filters pull-down is set to All. Install. To whet your appetite for callouts, I will provide two short but fully functional C# programs that demonstrate PCRE's callout functionality. The notes after each program explain some salient features of the API.

Program 1: Exploring Substrings

This program replicates the delightful Perl capsule explained on this page:

    if ('abc' =~ /\w+(?{print "$&\n";})(*F)/) {}

It prints out substrings of the test string abc:

    abc
    ab
    a
    bc
    b
    c

For the why, please read the explanations. Here is C# code to do the same. Granted, it is longer than the Perl one-liner, but you already knew that Perl and C# are different beasts.

    using System;
    using PCRE;

    class Program {
        static void Main() {
            string subject = "abc";
            var combo_regex = new PcreRegex(@"\w+(?C'temp: ')(*FAIL)");
            combo_regex.Match(subject, callout => {
                Console.WriteLine(callout.String + callout.Match.Value);
                return PcreCalloutResult.Pass;
            });
            Console.WriteLine("Press Key");
            Console.ReadKey();
        }
    }

Here is the output:

    temp: abc
    temp: ab
    temp: a
    temp: bc
    temp: b
    temp: c
    Press Key

Callout Specified as Lambda

The key feature is that when we call the Match method, in addition to the standard subject string, we pass the callout function. There are several ways to pass the callout function. In this example, for brevity, we pass a lambda. If you plan to reuse the callout function, it probably makes sense to pass it as a delegate. We will see how to do that in a later example.

Argument used as Value

One interesting feature is that the callout's argument is the string "temp: ". This string is output on every temporary match report via callout.String.

Return Values

Note that the callout returns PcreCalloutResult.Pass. This maps to the zero value that tells the engine to resume the match attempt where it left off. The other possible return values are:
- PcreCalloutResult.Fail, equivalent to 1, telling the engine to fail at the current position in the pattern, after which the engine, as usual, backtracks in search of another matching path.
- PcreCalloutResult.Abort, equivalent to -1, telling the engine to fail the overall match (the current match attempt fails, and the engine does not advance in the string to try other attempts).
Match object discarded

Usually, when we call combo_regex.Match(), we assign the resulting match object to a variable. In this case, we don't care about the match object, so no assignment was made.

Alternate implementation

In the section on the callout function's return values, I mentioned that a positive value acts like a (*FAIL). This means we can obtain the same result as above by removing the (*FAIL) and returning a positive value, which PCRE.NET expresses as PcreCalloutResult.Fail. This fragment outputs the same temporary matches as before:

    string subject = "abc";
    var combo_regex = new PcreRegex(@"\w+(?C)", PcreOptions.NoAutoPossess);
    combo_regex.Match(subject, callout => {
        Console.WriteLine(callout.Match.Value);
        return PcreCalloutResult.Fail;
    });

But there is one subtlety: PcreOptions.NoAutoPossess, coming up next.

The Ghost of Autopossess (and of other Optimizations)

The PcreOptions.NoAutoPossess option sets PCRE's PCRE2_NO_AUTO_POSSESS option, which can also be turned on inline by the (*NO_AUTO_POSSESS) start of pattern modifier. (Except that at the moment of writing there seems to be a bug with this latter syntax.) As a reminder, the autopossess optimization turns some quantifiers into possessive quantifiers when the token that follows is incompatible with the quantified token (there is no shared ground, so no reason to backtrack). For instance, \d+\D is automatically optimized to \d++\D. For the same reason, the \w+ in our pattern is automatically optimized to \w++. We need to turn that off, otherwise when the callout returns a positive value, the engine cannot backtrack into the atomic \w++, so the match attempt fails without further exploration. In this case, the engine advances to the next position in the string to try the next match attempt, yielding this much shorter output:

    abc
    bc
    c

For the same reason, if you want to make sure that callouts always work as you expect, you should turn off other optimizations as well.
Putting all the optimization killers in one place:
- PCRE2_NO_AUTO_POSSESS, set inline with (*NO_AUTO_POSSESS) or in PCRE.NET with PcreOptions.NoAutoPossess
- PCRE2_NO_START_OPTIMIZE, set inline with (*NO_START_OPT) or in PCRE.NET with PcreOptions.NoStartOptimize
- PCRE2_NO_DOTSTAR_ANCHOR, set inline with (*NO_DOTSTAR_ANCHOR) or in PCRE.NET with PcreOptions.NoDotStarAnchor

Program 2: Exploring Callout Properties

This second program is designed to explore the properties of the PcreCallout object passed to the callout function. The simple pattern (?:([A-Z])\d(?C12))+ matches one uppercase letter followed by one digit, multiple times, for instance Q1G5. After matching each digit, we find the callout token (?C12). The 12 is a simple identifier that is passed to the callout function just in case we want to do something with it, which would come in handy if we had multiple callouts. Please excuse the minimal indentation: I wanted all lines to fit inside the code box.

    using System;
    using PCRE;

    class Program {
    static void Main() {
    // This function shows info about the args it receives
    Func<PcreCallout, PcreCalloutResult> callout_info =
    delegate (PcreCallout info) {
      // In the pattern string, the position after (?C12) is 18
      Console.WriteLine("\nPosition in the Pattern: " + info.PatternPosition);
      // The position in the string when the callout is called:
      // 2, 4, 6
      Console.WriteLine("Position in the String: " + info.CurrentOffset);
      // This will print the 12 in C12
      Console.WriteLine("Callout Number: " + info.Number);
      // If we had a string identifier, as in (?C'combo'), we
      // would access it via info.String. See Program # 1.
      // We didn't call Match with a string offset: 0
      Console.WriteLine("StringOffset: " + info.StringOffset);
      // The last group capture: Group 1
      Console.WriteLine("Last Capture Group: " + info.LastCapture);
      // Value of the last capture
      Console.WriteLine("Last Capture: " + info.Match.Groups[info.LastCapture].Value);
      // Temporary Match
      Console.WriteLine("Temporary Match: " + info.Match.Value);
      return PcreCalloutResult.Pass;
    };

    var callout_info_regex = new PcreRegex(@"(?:([A-Z])\d(?C12))+");
    string subject = "A1B2C3";
    var firstmatch = callout_info_regex.Match(subject, callout_info);
    if (firstmatch.Success) {
      Console.WriteLine("\nOverall Match: " + firstmatch.Value);
    }
    Console.WriteLine("Press Key");
    Console.ReadKey();
    }
    }

Here is the output:

    Position in the Pattern: 18
    Position in the String: 2
    Callout Number: 12
    StringOffset: 0
    Last Capture Group: 1
    Last Capture: A
    Temporary Match: A1

    Position in the Pattern: 18
    Position in the String: 4
    Callout Number: 12
    StringOffset: 0
    Last Capture Group: 1
    Last Capture: B
    Temporary Match: A1B2

    Position in the Pattern: 18
    Position in the String: 6
    Callout Number: 12
    StringOffset: 0
    Last Capture Group: 1
    Last Capture: C
    Temporary Match: A1B2C3

    Overall Match: A1B2C3
    Press Key

Callout Specified as Delegate

In the previous example, we passed the callout function as a lambda. In this example, we create a function that can be reused (it shows information about the callout arguments) so we pass it as a delegate. The Match method has one more argument than usual: the callout delegate.

    callout_info_regex.Match(subject, callout_info);

Capture Collection

Normally, in PCRE, when a capture group is quantified, as in (?:(\d)\D)+, the engine only returns the last value of the capture group. That is how most engines work. In contrast, the standard .NET engine has a feature called capture collections that lets you examine all intermediate captures. PCRE callouts take you some of the way in the direction of capture collections. In this example, each pass displays the last capture group.

Program 3: Debugging with Auto Callout

If you set the PCRE2_AUTO_CALLOUT option, the engine acts as though there were a callout after each token. Each callout has the same argument: 255, as in (?C255). This option can be interesting if you want to inspect the progression of a match attempt, perhaps for debugging. Note that this kind of functionality is also offered in PCRE's bundled utility. Here is a simple example in PCRE.NET, where the option is called AutoCallout. The simple pattern \d+9\b is designed to cause backtracking against our test string 1492 1999.

    using System;
    using PCRE;

    class Program {
        static void Main() {
            var end_with_9 = new PcreRegex(@"\d+9\b", PcreOptions.AutoCallout);
            string subject = "1492 1999";
            var the_match = end_with_9.Match(subject, call => {
                Console.WriteLine(call.Match);
                return PcreCalloutResult.Pass;
            });
            if (the_match.Success) {
                Console.WriteLine("\nOverall Match: " + the_match.Value);
            }
            Console.WriteLine("Press Key");
            Console.ReadKey();
        }
    }

Here is the output: 1492 149 14 149 1 492 49 4 49 92 9 2 1999 199 1999 1999 Overall Match: 1999 Press Key

Program 4: Infinite Lookbehind

One feature famously absent from Perl and PCRE is infinite lookbehind. The following program shows two simple ways of implementing this feature using a callout. The code is shown in C#, but Method 1 would work in any language that provides a full API to PCRE. Note that the code is meant as a stub: for instance, error handling is absent. Before diving in, here are the general ideas.

One Callout to Rule them All

If you're going to use a lot of callouts, and especially some fancy features such as infinite lookbehind, it makes sense to me to make one big callout function that can handle a number of common cases. You might object that a collection of small methods is better. But remember, when we call the match function, we can only pass a single callout. Your pattern, on the other hand, might include several callouts to which you'd like to assign different tasks. This is when you need one callout that can handle multiple cases. If it grows too big, sure, you can let it distribute the work to other methods, but it remains the one entry point.

In the demo program, the CalloutSwitch delegate checks for callouts of this form: (?C'keyword:action'). Of course other forms can be checked as well. We will implement lookbehind with two methods, which will be passed with callouts in these shapes:

Method 1 (pure PCRE): (?C'infinite_lb:c+ba')
Method 2 (Frankenstein): (?C'.net_lb:(?<=abc+)')

Method 1: Pure PCRE Lookbehind Solution

In the position where you want an infinite lookbehind such as (?<=abc+), place a callout such as (?C'infinite_lb:c+ba'). Note that the lookbehind pattern has been reversed. In the callout, we parse the reversed lookbehind regex out of the argument: c+ba. We use the string position argument to extract a substring from the subject start up to that point, and reverse that substring. We attempt to match the reversed regex on the reversed substring, and pass a zero or "force backtrack" return value depending on whether that match attempt succeeds or fails.
Two details are worthy of note:

1. We must anchor the lookbehind pattern, so that the forward matching function only looks for it at the position immediately preceding the cursor. Instead of prepending a ^, we accomplish this with PCRE's PCRE2_ANCHORED option (expressed as PcreOptions.Anchored in PCRE.NET).
2. We disable optimizations (see the Ghost of Autopossess).

Method 2: Frankenstein Solution (PCRE marries .NET regex)

Reversing the pattern in the lookbehind as in the first method is not always obvious (we'll explore some limitations below). For such situations, I provide a Frankenstein solution that calls .NET regex from within a PCRE callout. This time, the callout looks like (?C'.net_lb:(?<=abc+)')

And now… the code.

using System;
using System.Linq;
using PCRE;
using System.Text.RegularExpressions;

class Program {
    static string subject;

    // simple display of match results
    public static void display_match(PcreMatch theMatch) {
        Console.WriteLine(subject + " => " +
            (theMatch.Success ? "Match = " + theMatch.Value : "No Match"));
    }

    // One Callout to Rule them All
    /* Stub of callout delegate to handle lookbehinds and other constructs.
       Checks for callouts of this form: (?C'keyword:action')
       A lookbehind is specified as:
       Want this: (?<=abc+) => Write this (?C'infinite_lb:c+ba')
       Note that the lookbehind pattern is reversed */
    static Func<PcreCallout, PcreCalloutResult> CalloutSwitch =
        delegate (PcreCallout callData) {
            int pos = callData.CurrentOffset;
            // Check if the callout has this form: (?C'keyword:action')
            // Split on the first colon only, so that the action part
            // may itself contain colons, as in (?:...)
            string[] sides = callData.String.Split(new[] { ':' }, 2);
            if (sides.Length > 1) {
                switch (sides[0]) {
                    // Method 1: Pure PCRE
                    case "infinite_lb":
                        string subject_behind = subject.Substring(0, pos);
                        // Reverse the subject
                        string lookbehind_subject = new string(subject_behind
                            .ToCharArray()
                            .Reverse()
                            .ToArray());
                        var lookbehind_regex = new PcreRegex(sides[1], PcreOptions.Anchored);
                        var lookbehind = lookbehind_regex.Match(lookbehind_subject);
                        return lookbehind.Success ? PcreCalloutResult.Pass
                                                  : PcreCalloutResult.Fail;

                    // Method 2: Frankenstein (PCRE marries .NET)
                    case ".net_lb":
                        // In the case of a DotNet lookbehind, we expect
                        // something like (?C'.net_lb:(?<=abc+)')
                        // Ensure the lookbehind operates at the right spot
                        var dotnetRegex = new Regex("^.{" + pos.ToString() + "}" + sides[1]);
                        var dotnetLB = dotnetRegex.Match(subject);
                        return dotnetLB.Success ? PcreCalloutResult.Pass
                                                : PcreCalloutResult.Fail;

                    // implement other interesting callouts
                    case "neg_infinite_lb":
                        break;
                    default:
                        break;
                }
            }
            // We didn't handle the callout: resume
            return PcreCalloutResult.Pass;
        };

    // Test it!
    static void Main() {
        var purePCRELookbehind = new PcreRegex(
            @"(?C'infinite_lb:c+ba')\d+",
            PcreOptions.NoAutoPossess | PcreOptions.NoStartOptimize | PcreOptions.NoDotStarAnchor);
        var frankensteinLookbehind = new PcreRegex(
            @"(?C'.net_lb:(?<=abc+)')\d+",
            PcreOptions.NoAutoPossess | PcreOptions.NoStartOptimize | PcreOptions.NoDotStarAnchor);

        // First subject: this should match 42
        subject = "05 AB99 abcc42 hp16";
        var theMatch = purePCRELookbehind.Match(subject, CalloutSwitch);
        display_match(theMatch);
        theMatch = frankensteinLookbehind.Match(subject, CalloutSwitch);
        display_match(theMatch);

        // Second subject: this should fail
        subject = "05 AB99 abcd42 hp16";
        theMatch = purePCRELookbehind.Match(subject, CalloutSwitch);
        display_match(theMatch);
        theMatch = frankensteinLookbehind.Match(subject, CalloutSwitch);
        display_match(theMatch);

        Console.WriteLine("Press Key");
        Console.ReadKey();
    }
}

Here is the output:

05 AB99 abcc42 hp16 => Match = 42
05 AB99 abcc42 hp16 => Match = 42
05 AB99 abcd42 hp16 => No Match
05 AB99 abcd42 hp16 => No Match

It works!

Some Limitations

For the pure PCRE part of the demo, the lookbehind works so long as the pattern can easily be reversed. For (?<=abc+), the translation to c+ba was direct. But if our lookbehind pattern starts to contain some convoluted syntax, as in (?<=a(?=bc)), the reversal may not be so direct.
This kind of lookbehind creates what I've dubbed a "back to the future regex". It requires that we inspect not just the portion of string before the cursor, but also the portion after the cursor. As a guess, I might approach it by reversing the whole string (or an adequate portion if some smart rules can be found) and passing that string to the match function with an adequate offset. Reversing the regex inside the (?<=a(?=bc)) lookbehind, we would pass (?<=cb)a to the match function: in the reversed string, the a now lies ahead of the cursor and the reversed bc lies behind it. Now suppose the c was instead a c+. Our reversal would now contain an infinite lookbehind. You see the problem. For a case like this, the Frankenstein solution is the way to go.

One limitation that applies to both methods is cases when the lookbehind contains capture groups, as in (?<=(\d+)). There is no mechanism to relay those capture groups to the calling function.

Another limitation is if the lookbehind contains references to previous captures, as in (?<=\1\d+). When building the regex inside the callout, we'll need to replace the reference with the current content of the group, and, in the case of the pure PCRE method, to reverse it.

I'm sure there are many other limitations. Callouts within the lookbehind itself, backtracking control verbs… Let your imagination run wild. The goal of these demos is only to explore some workarounds for infinite lookbehind. The main point is that for "standard cases", it looks like we can implement the feature.

Automatically Reversing the Pattern

For a more general solution, your callout would pass the lookbehind the way you want it, without reversing it: (?C'infinite_lb:abc+'), and you would then call a tokenizer that reverses the regex to c+ba… Easier said than done! If you implement such a pattern reverser, please let me know.
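The reversal idea behind Method 1 does not depend on callouts, so it can be tried in any engine whose lookbehind is limited. Here is a minimal Python sketch, under stated assumptions: Python's re module only accepts fixed-width lookbehind, the function name search_with_lookbehind is hypothetical, and the reversed lookbehind (c+ba for abc+) is written by hand, exactly as in the C# demo. At each position we reverse the text before the cursor, try the reversed pattern anchored at its start, and only then try the main pattern.

```python
import re

def search_with_lookbehind(reversed_lb, main, subject):
    """Emulate (?<=lookbehind)main, with the lookbehind given reversed.

    reversed_lb: the lookbehind pattern written backwards (c+ba for abc+).
    Returns the first match object of `main`, or None.
    """
    lb = re.compile(reversed_lb)
    body = re.compile(main)
    for pos in range(len(subject) + 1):
        # Text before the cursor, reversed, as in Method 1
        behind = subject[:pos][::-1]
        # Pattern.match only matches at the start of the string,
        # which plays the role of the anchored option in the C# demo
        if lb.match(behind) is None:
            continue
        m = body.match(subject, pos)
        if m:
            return m
    return None

hit = search_with_lookbehind(r"c+ba", r"\d+", "05 AB99 abcc42 hp16")
miss = search_with_lookbehind(r"c+ba", r"\d+", "05 AB99 abcd42 hp16")
print(hit.group() if hit else "No Match")
print(miss.group() if miss else "No Match")
```

The loop reverses a fresh prefix at every position, so it is quadratic in the length of the subject. That is acceptable for a demonstration but not for large inputs.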

Uses for Callouts

I'm sure you already see that the potential of callouts is huge. Program 4 showed directions to implement infinite lookbehind. This section mentions two other "missing features" of PCRE.

Capture Collections

Program 2 showed how a callout can use the value stored in a quantified capture group on each pass. In that example, we displayed the value. If we added the values to a list, that would start to feel like a capture collection, except that when the engine backtracks, the list would end up with more elements than actually contributed to the match. Maybe another callout, placed where you enter the quantified group and resetting the list, could bring us closer to the goal.

Balancing Groups

The standard .NET regex engine contains an unusual feature called balancing groups, which can be used instead of recursion (a feature absent from that engine) to check that certain constructs (such as (parentheses)) are properly balanced. You could implement that feature with PCRE callouts. Upon matching an opening parenthesis, you place a callout such as (?C'open'), and upon matching a closing parenthesis, you place a callout such as (?C'close'). At the end of the pattern, you place a callout such as (?C'check_count'). In the callout function, you increment or decrement a counter depending on whether the identifier is open or close, returning a negative value to the engine if the counter falls below zero. At the end of the match, the callout function handles the check_count argument by checking that the counter is back to zero, indicating a balanced count. This is a direct translation of the typical balancing groups recipe.

Further Details

If you plan to use callouts, you may want to be aware of some details that may influence their operation. For instance, PCRE's autopossess optimization may interfere with callouts. This and other details are covered in the callout page of the PCRE documentation. Another place to look for examples is the suite of tests for PCRE.NET. Smiles, Rex

Capturing Quantifiers and Quantifier Arithmetic

This page of the regular expressions "black belt program" discusses a special power that is only available to those who have managed to steal a green egg from the velociraptor by the cave near the mountain top: capturing quantifiers, and quantifier arithmetic. I have not used this feature myself because I got lost on the way to the cave. Joke aside, quantifier capture is a feature I started wondering about in January 2014 and mulled over in the background until drafting this page three months later. It's a simple idea, so I imagined it would be implemented in some engines. However, after making enquiries from several people who write regex engines, it appears that the feature is not implemented at the moment. On April 13 2014, Jan Goyvaerts, who is probably the arch-expert on cross-engine syntax because he has to maintain many engines for RegexBuddy, wrote:
I do not know of any regex flavors that allow you to capture a quantifier by itself or that allow you to use a backreference inside a quantifier in any way.
If the situation changes, please write at the bottom of the page.

Quantifier Capture

Sometimes, it would be handy to be able to "capture a quantifier", meaning some way to capture the number of times a quantifier such as "+" or "*" is able to match. For instance, suppose you were interested in matching balanced lines such as these:

@@ "Snow White" == "1987" -- "Animation" // "Erich Kobler"

or

@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"

but not

@@@ "Groundhog Day" = "1993" ----- "Comedy" // "Harold Ramis"

The point is that you want to make sure you have the same number of "@" characters as of "=" characters, "-" characters and "/" characters. For that purpose, a regex such as the following will not do because it will match all lines.

^@+ "[^"]+" =+ "[^"]+" -+ "[^"]+" /+ "[^"]+"$

Balancing the number of {@,-,=,/} is fairly straightforward in languages that use .NET regex thanks to the balancing groups feature, and I give a demo of this lower down. In PCRE, it is also possible but far less straightforward, as you need to use some neat tricks: the syntax is far too complex and error-prone for it to be useful on a regular basis. I explain these tricks (which I did not invent) lower on the page.

What we really need is a syntax such as the following:

(*) captures the number of characters the star quantifier is able to match.
Likewise, (+), (?) and ({2,9}) capture the number of characters these quantifiers are able to match.
\q1 refers to the first captured quantifier, \q2 refers to the second captured quantifier and so on.
\q+1 refers to the next captured quantifier. \q+2 refers to the second-next captured quantifier, \q-3 refers to the third-previous quantifier.

With this syntax, we could easily match our desired pattern with the following regex:

^@(+) "[^"]+" ={\q1} "[^"]+" -{\q1} "[^"]+" /{\q1} "[^"]+"$

Alternately, using relative addressing, you could use either of the following expressions, where the quantifier is captured further down the string.
^@+{\q1} "[^"]+" ={\q1} "[^"]+" -{\q1} "[^"]+" /(+) "[^"]+"$

^@+{\q+2} "[^"]+" ={\q+1} "[^"]+" -(+) "[^"]+" /{\q-1} "[^"]+"$

The syntax would also allow you to use captured quantifiers in range quantifiers such as {\q1,\q2}.

An Alternative to Recursion

In some cases, capturing quantifiers would elegantly replace recursion. For instance, suppose you want to match a number of zeros and ones framed by the same number of Ls and Rs, as in "LLL100110100RRR". With recursion, you can write:

L((?R)|[01]*)R

With captured quantifiers, you can write:

L(+)[01]+R{\q1}

Usually, the non-recursive version would be easier to write and read.
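Until some engine offers a syntax like this, the effect of {\q1} can be approximated in two steps from the host language. The following is a minimal Python sketch, not an implementation of the proposed syntax: the function name match_balanced_line is hypothetical. It first measures the run of "@" characters, then builds a second pattern in which that count is written out as an ordinary {n} quantifier.

```python
import re

def match_balanced_line(line):
    """Two-step stand-in for ^@(+) ... ={\\q1} ... -{\\q1} ... /{\\q1} ...$"""
    head = re.match(r"@+", line)
    if head is None:
        return None
    n = len(head.group())  # plays the role of the captured quantifier \q1
    # Doubled braces in an f-string produce literal braces, e.g. @{2}
    pattern = (
        rf'^@{{{n}}} "[^"]+" ={{{n}}} "[^"]+" '
        rf'-{{{n}}} "[^"]+" /{{{n}}} "[^"]+"$'
    )
    return re.match(pattern, line)

lines = [
    '@@ "Snow White" == "1987" -- "Animation" // "Erich Kobler"',
    '@@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"',
    '@@@ "Groundhog Day" = "1993" ----- "Comedy" // "Harold Ramis"',
]
for line in lines:
    print(bool(match_balanced_line(line)), line)
```

The cost of this workaround is that the pattern has to be rebuilt for every line, which is exactly the bookkeeping a captured quantifier would remove.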

Quantifier Arithmetic

A natural extension of capturing quantifiers is to play with the captured integers. For instance, it won't be long until you want to match twice as many (i.e., 2x) instances of a certain character or sequence as you matched of another sequence or character. Or you might want to match three more instances (x+3), or some other function (such as 2x+5). This could be implemented either directly in the syntax, or with the help of callbacks.

A biologist looking at genomes comprised of the letters A, T, G and C might have a particular interest in finding sequences where a certain number x of "A"s are followed by any number of T and Gs, then "2x + 1" Cs, as in:

ATGGTTTGTCCC, but not AAATGGTTTGTCCC

Without callbacks, the syntax to accomplish this kind of arithmetic could become cumbersome. Here are two possible implementations, without and with callback:

A(+)[TG]+C{2*\q1+1}

A(+)[TG]+C{CALL_VERB somefunction(\q1)}

In conclusion, it seems to me that quantifier capture (as a first step) and quantifier arithmetic (as a second step) would nicely enhance the expressiveness of regular expressions and would be a logical extension of the syntax.
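The arithmetic can likewise be approximated today by computing the second quantifier in the host language. This Python sketch rests on two assumptions that go beyond the text above: the whole string is validated (hence fullmatch), and x is taken to be the full run of leading As. The function name match_2x_plus_1 is hypothetical.

```python
import re

def match_2x_plus_1(seq):
    """Stand-in for A(+)[TG]+C{2*\\q1+1}, applied to the whole string."""
    head = re.match(r"A+", seq)
    if head is None:
        return None
    x = len(head.group())   # the "captured quantifier"
    c_count = 2 * x + 1     # the quantifier arithmetic
    return re.fullmatch(rf"A{{{x}}}[TG]+C{{{c_count}}}", seq)

print(bool(match_2x_plus_1("ATGGTTTGTCCC")))
print(bool(match_2x_plus_1("AAATGGTTTGTCCC")))
```

Note that an unanchored search with the proposed syntax could still find ATGGTTTGTCCC inside the second string by starting at its third A, which is why this sketch validates the string as a whole.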

Current Solutions to Balance the Number of Certain Characters

If you don't have quantifier capture in your regex flavor, you can still check that strings like the one shown higher on the page are balanced. To make the example easier to understand, I simplified the kind of strings we are trying to validate:

L555M002R or LLL88MMM7281RRR
but not LLL88M7281RRR

The idea is that we want to make sure we have the same number of L, M, and R characters (think of them as "Left", "Middle" and "Right" separators).

.NET: Balancing Groups

In languages that can tap into .NET regex, checking that a string such as LL00MM11RR has the same number of L, M, and R characters is fairly straightforward thanks to balancing groups:

^(?:L(?<c1>)(?<c2>))+[^LM]+(?<-c1>M)+[^MR]+(?<-c2>R)+(?(c1)(?!))(?(c2)(?!))$

This is fairly long, but it is simple. First, notice that the string is anchored by ^ and $ to prevent the engine from matching a balanced string within an unbalanced string.

The non-capturing group (?:L(?<c1>)(?<c2>))+ is quantified with a + and matches all the L characters. It also contains two named capture groups c1 and c2. These groups are empty, so they don't match anything; or, rather, they match an empty string. You may know that .NET deals with quantified capture groups in a special way: it adds the successive captures of a given group to a collection of captures. This means that each time an L is matched, a capture is added to the capture collection for named groups c1 and c2. We will use c1 and c2 as counters. By the time the engine exits the quantified non-capturing group, the c1 and c2 groups both hold the same number of captures, which is the number of Ls that were matched.

The [^LM]+ eats up all the characters up to the first M.

The (?<-c1>M)+ is a quantified group that eats up all the Ms. It may look like it is a named group called "-c1", but -c1 is .NET regex syntax to pop (and discard) the last capture in group c1. This means that each time we match an M, one c1 capture is thrown away. In essence, we are decrementing our c1 counter. If we have already matched as many Ms as Ls, group c1 is empty. If there are any Ms left at that point, when the engine tries to pop one capture from the c1 group, it cannot do so, and the regex fails. This ensures that there cannot be more Ms than Ls in the string. Later, we will add a check to ensure that there are no more Ls than Ms.

The [^MR]+ eats up all the characters up to the first R.
The (?<-c2>R)+ matches all the Rs while decrementing c2, ensuring that there cannot be more Rs than Ls.

The (?(c1)(?!)) is a conditional that checks if capture group c1 is set. If c1 is set, the engine tries to match (?!), which is a trick to force the regex to fail. The conditional therefore forces the regex to fail if there are captures left in group c1, which would mean that we have not matched enough Ms to fully decrement our c1 "counter". This expression ensures that we cannot have more Ls than Ms.

Likewise, (?(c2)(?!)) ensures that we cannot have more Ls than Rs.

That's a bit of syntax to explain, but I hope you'll agree that once you understand that syntax, writing such an expression is straightforward.
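To make the bookkeeping concrete, here is a plain Python model of what the c1 and c2 capture stacks do. It is only an illustration of the push and pop logic described above, not a use of balancing groups (Python's re module has none), and the function name balanced_lmr is hypothetical. An ordinary regex splits the string into its runs, then two lists stand in for the two capture collections.

```python
import re

def balanced_lmr(s):
    """Model of the c1/c2 push and pop logic of the .NET expression."""
    # Same overall shape as the .NET pattern, minus the balancing groups
    m = re.fullmatch(r"(L+)[^LM]+(M+)[^MR]+(R+)", s)
    if m is None:
        return False
    c1, c2 = [], []
    for _ in m.group(1):      # each L pushes an empty capture on both stacks
        c1.append("")
        c2.append("")
    for _ in m.group(2):      # each M pops c1, like (?<-c1>M)
        if not c1:
            return False      # more Ms than Ls: nothing left to pop
        c1.pop()
    for _ in m.group(3):      # each R pops c2, like (?<-c2>R)
        if not c2:
            return False      # more Rs than Ls
        c2.pop()
    # (?(c1)(?!)) and (?(c2)(?!)): fail if any capture is left over
    return not c1 and not c2

for s in ["LL00MM11RR", "LLL88MMM7281RRR", "LLL88M7281RRR"]:
    print(s, balanced_lmr(s))
```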

PCRE: Balancing by Building Capture Groups Accretively

In PCRE, checking that a string such as LL00MM11RR has the same number of L, M, and R characters is possible, but tricky. This trick and the next are shown for a more complicated pattern on this Stack thread. I have modified the recipes slightly for easier comprehension, but not in their essence. Later, if you are interested, you can inspect my version for the "Star Wars pattern".

I gave detailed comments in the code box below, but it may be helpful to have a high-level overview of certain aspects before diving in.

Group 1 will capture all the "L" characters. Group 2 captures the Ms, Group 3 captures the Rs.

The expression is in two parts. First, we match all the Ls, and as we do so, a lookahead checks that we have at least as many Ms and Rs. Second, when we are done matching all the Ls and have satisfied ourselves that we have at least as many Ms and Rs as we want, we continue the match, specifying exactly the characters we want, which (among other effects) ensures we have no more Ms and Rs than needed.

In the first part, the engine matches one L character at a time, and the contents of Groups 1, 2 and 3 change each time a new "L" is matched.

The parentheses of Group 2 actually refer to the current value of Group 2, i.e., \2, in the expression (\2?+M). What does this do? After the first L is matched, Group 2 is undefined, and the "?" in \2?+ makes \2 optional, allowing that part of the expression to match. Group 2 then matches the first M, and the value of Group 2 becomes "M". After the second L is matched, Group 2 is still "M", so the \2 in (\2?+M) matches "M", then we match the second "M", and the value of Group 2 becomes "MM".

The "+" in \2?+ ensures that if we fail to match the M that follows, the engine doesn't backtrack by activating the optional \2? and dropping the first M. Without the "+", we could match strings such as "LLL1M2R". See the section on atomic groups.
Please understand that although Group 2 looks like a self-reference, the expression in Group 2 refers to the previously stored value. Therefore, the value \2 of Group 2 after the closing parenthesis is not what it was inside the parentheses.

Ready? Here we go.

^((?:L(?=[^M]*(\2?+M)[^R]*(\3?+R)))+)\d+\2\d+\3$

Easy, right?… Just kidding. Here is the commented break-down.

(?xm)             # Free-spacing mode, multi-line
^                 # Assert Beginning of Line
(                 # Begin Group 1
  (?:             # Non-capturing group, which will be repeated
    L             # Match one L
    (?=           # Begin Lookahead
      [^M]*       # Match any chars that are not M
      (           # Begin Group 2
        \2?+      # Match Group 2 if possible, and if so
                  # do not later give up the match.
                  # In other words if Group 2 can be matched, match it.
                  # This could be expressed as (?(2)\2)+
                  # After we match the first L, Group 2 starts out undefined
                  # so the ? will be used.
                  # After we match the 2nd L, Group 2 is M
                  # so at that point we must match M.
                  # After we match the 3rd L, Group 2 is MM
                  # so at that point we must match MM.
        M         # Match M
      )           # End Group 2
                  # After matching the first L, Group 2 is M
                  # After matching the second L, Group 2 is MM
                  # After matching the third L, Group 2 is MMM
                  # etc.
      [^R]*       # Match any chars that are not R
      (\3?+R)     # Group 3 follows the same principle as Group 2
                  # If you have a hard time following, simplify
                  # the test string
                  # and remove the Group 3 section
    )             # End Lookahead
  )+              # Repeat the non-capturing group
)                 # End Group 1
                  # If we stopped right there, the regex would match strings
                  # that have x characters L and at least x each of characters {M,R}
                  # but possibly more: there would be no guarantee of balance
                  # To validate that we have no more than needed,
                  # we now match (or lookahead) precisely what we want
                  # after all the L characters we have matched.
\d+               # Match some digits
\2                # Match the characters captured in Group 2
\d+               # Match some digits
\3                # Match the characters captured in Group 3
$                 # Assert End of Line

Hope you enjoyed this one!
Working through it is a great exercise. But if it shows one thing, apart from the cleverness of certain coders, it's that realistically, to balance strings as we have done, you need something like the quantifier capture syntax advocated on this page. In the unlikely case you'd like to see the same principle applied to the @@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas" example from the top, the code is here.

PCRE: Balancing with Recursion

As a reminder, we are trying to check that a string such as LL00MM11RR has the same number of L, M, and R characters. This method uses regex recursion. For those who jumped in to this point from another page, the task at hand is to validate balanced strings such as

L00M123R or LLL22MMM1111RRR
but not LLL22M1111RRR

The idea is that we want to make sure we have the same number of L, M, and R characters (think of them as "Left", "Middle" and "Right" separators).

The overall structure of this expression is that of a password validation regex. We have two lookaheads to validate some conditions, then we match what we want, if possible. The first lookahead validates that the Ms balance with the Ls. The second lookahead validates that the Rs balance with the Ls.

The structure of the first lookahead (which is equivalent to the second one) is as follows. We begin Group 1, which is the group whose pattern we will repeat (recursion). Group 1 matches "L, stuff, then M". The "stuff" in the middle is either another instance of Group 1 (L, stuff, M) or, if there are no more Ls to consume, any characters that are neither L nor M. If you trace this recursion on paper, you will see that for "LLL00MMM123RRR", within this lookahead the engine matches L (level 0), then L (level 1), then L00M (level 2), then M (closing level 1), then M (closing level 0).

Ready? Here we go.

(?xm)             # Free-spacing mode, multi-line
^                 # Assert Beginning of Line
                  # The function of the following lookahead is to check
                  # that the Ms balance with the Ls
(?=               # Begin Lookahead
  (               # Begin Group 1
                  # Group 1 will match "L stuff M", where
                  # "stuff" may recurse to "L stuff M"
                  # The base case for "stuff" when we run out of Ls
                  # will be characters that are neither L nor M.
    L             # Match L
    (?>           # Begin Atomic group
      (?-1)       # Recursion to the next level:
                  # Match the pattern defined by the previous
                  # defined group, i.e. Group 1, i.e. match the
                  # next L then what follows...
      |           # OR (if we cannot match an L)
      [^LM]++     # Match characters that are neither L nor M
                  # but do not give up the match
                  # if what follows fails (possessive)
    )             # End Atomic Group
    M             # Match M, completing the "L stuff M" of Group 1.
  )               # End Group 1
  (?!M)           # Assert that the next character is not "M"
)                 # End Lookahead
                  # The next lookahead has the same structure as the previous one
                  # Its function is to check that the Rs balance with the Ls
(?=(L(?>(?-1)|[^LR]++)*R)(?!R))
                  # We now know that the Ls, Ms and Rs are balanced
                  # What's left to do is to actually match what we want.
L+\d+M+\d+R+$     # Match what we want

Was that awesome? I thought so when I worked through these tricks! To understand them, I carefully commented each line in RegexBuddy, then worked an example on paper. I highly recommend this procedure as a way to understand such complex expressions.

The solution to this problem is trivial in Perl. It's also the simplest. No backtracking, no recursion, no back references. The key is the postponed sub regular expression and the inlined code block.

(?{}) Execute any code based on the progress of the regular expression.
(??{}) Postponed sub expression that is calculated when it is encountered based on any code. The embedded qr creates the sub regex that is used.

Here are two examples. One just in a boolean context and one that captures each segment of the string. The tokens L, M, and R can represent arbitrary regular expressions. ( it won't let me post this ) This process can be used to instrument any regex and to add unlimited complexity to any regex process. This stupid form won't let me post a bracket character so here is the regex code with the character class brackets replaced by double curly brackets. Replace {{ and }} with square brackets.

pp "LLL88MMM7281RRR" =~ m/^(?{$L=0}) (?:L(?{ $L++ }))++ {{^M}}*+ (??{ qr{M{$L}(?!M)} }) {{^R}}* (??{ qr{R{$L}} }) $/x;
1

pp "LLL88MMM7281RRR" =~ m/^(?{$L=0}) ((?:L(?{ $L++ }))++) ({{^M}}*+) ((??{ qr{M{$L}(?!M)} })) ({{^R}}*+) ((??{ qr{R{$L}} })) $/x;
("LLL", 88, "MMM", 7281, "RRR")

Reply to Chris Wagner

Thank you Chris, I very much appreciate your solution for those of us who don't know Perl. Would love to see that in Javascript!

Regex Cookbook

This page presents recipes for regex tasks you may have to solve. If you learn by example, this is a great spot to spend a regex vacation. The page is a work in progress, so please forgive all the gaps: I thought it would be preferable to have an incomplete page now than a complete page in 25 years, if that is possible. I also haven't proofed this page as thoroughly as the others, so please report any bugs using the form at the bottom.

In , many of the recipes focus on ultra-specialized tasks such as matching Canadian postal codes or US social security numbers. I have a lot of respect for Jan Goyvaerts, but for me that's a little weak. To me, many of the "recipes" are a repeat of the same general concept. I don't find this approach conducive to challenging the mind and expanding one's understanding of regular expressions. So here I try the approach I would have liked to see in the book.

This page tries to present expressions that are "topologically different" from one another to expose you to as many uses of regex syntax as possible, hoping to help the regex student improve his or her fluency. Making all examples "different" is not always possible (or desirable), but that's the general idea.

It's hard to place expressions into neat categories as there can be considerable overlap, but here is the general organization of the expressions on this page:

1. Capturing
2. Validating
3. Finding
4. Replacing and Inserting

I'll acknowledge that these distinctions are often a bit arbitrary, but they give you different things to look at.

Capturing

How do I parse the values of a complex string, such as a url's GET parameters? [parsing]
How do I capture text inside a set of parentheses? [parentheses]
How do I match text inside a set of parentheses that contains other parentheses? [complex parentheses]

How do I parse the values of a complex string, such as a url's GET parameters?

Suppose you wanted to extract the values for day, name and fruit from this string:

site.org?day=7&name=adam&fruit=apple

It is very likely you would have ready-made tools to extract these values, such as the GET array. But if you had to do it with regex, you could use this:

\?day=(\d)&name=([^&]+)&fruit=(\w+)

The values are captured in Groups 1, 2 and 3. Note that (\d) only accepts a single-digit day, which is enough for this example. If not all strings contained all the parameters, you could make the components optional:

\?(?:day=(\d))?&?(?:name=([^&]+))?&?(?:fruit=(\w+))?

How do I capture text inside a set of parentheses?

This is a common request on forums. You have a file with text such as Acapulco airport (ACA) and you want to grab the text in the parentheses. Here is a recipe to capture the text inside a single set of parentheses:

\(([^()]*)\)

First we match the opening parenthesis: \(. Then we greedily match any number of any character that is neither an opening nor a closing parenthesis (we don't want nested parentheses for this example): [^()]*. This is the content of the parentheses, and it is placed within a set of regex parentheses in order to capture it into Group 1. Last, we match the closing parenthesis: \).

How do I match text inside a set of parentheses that contains other parentheses?

This requires a small tweak and a regex flavor that supports recursion. We're still going to match the opening parenthesis at the very start and the closing parenthesis at the very end. Inside, we'll match "stuff that's not parentheses" (or nothing), followed by zero or more sequences of (i) a repeat of the whole pattern (expressed below as (?R)), and (ii) more "stuff that's not parentheses" (expressed below as (?2)).
\((([^()]*+)(?:(?R)(?2))*)\)

I can't guarantee that this works in every situation as recursive patterns are fickle, but here's PHP code that tests the expression on various sets of nested parentheses.

<?php
$regex = '~\((([^()]*+)(?:(?R)(?2))*)\)~';
$strings = array(
    'Airport: (ACA)',
    'equation1: (1+(a+b))',
    'equation2: (1+(a+b)+c)',
    'equation3: (1+(a+b)+(2+2)+c)',
    'equation4: (1+(a+b)+(2+(7/5)-2)+c)'
);
// Group 1 holds the full contents of the outermost parentheses
foreach ($strings as $string) {
    if (preg_match($regex, $string, $match)) {
        echo $string . ' <b>capture:</b> ' . $match[1] . '<br />';
    }
}
?>

This is a bit different from the expression offered by Jeffrey Friedl in : (?:[^()]++|\((?R)\))*, which you'd have to tweak before you could pop it in the code above in order to capture the contents of the parentheses: \(((?:[^()]++|(?R))*)\). In my tests, I have found this expression to be up to twenty percent faster when the match works as planned, but slower by the same amount when a parenthesis is missing.
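For readers who want to try the two non-recursive recipes from this section outside PHP, here is a minimal Python sketch. The function names parse_get and inner_parens are hypothetical, and the recursive recipe is left out on purpose, because Python's built-in re module does not support (?R).

```python
import re

def parse_get(url):
    """Return (day, name, fruit) from the query string, or None."""
    m = re.search(r"\?day=(\d)&name=([^&]+)&fruit=(\w+)", url)
    return m.groups() if m else None

def inner_parens(text):
    """Return the contents of every parenthesis pair with no nesting inside."""
    return re.findall(r"\(([^()]*)\)", text)

print(parse_get("site.org?day=7&name=adam&fruit=apple"))
print(inner_parens("Acapulco airport (ACA)"))
# With nested parentheses, only the innermost pair is captured
print(inner_parens("equation1: (1+(a+b))"))
```

The last call shows why the nested case needs a different recipe: the simple pattern can only see a pair of parentheses that contains no other parentheses.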

Validating

What you can validate, you can also search for, so this section is also about finding.

How do I validate that a number is over 15? [values]
How do I validate that a time string is well-formed? [formats]
How do I validate that a list is made of certain items, in any order? [unordered list]
How can I validate that a string contains the text "75", but only once? [password validation technique]
How can I validate that a binary string contains ten 1s at the most? [reverse password validation technique]

How do I validate that a number is over 15?

This example gives you an idea of what you have to do to validate numbers within a certain range using regular expressions, and of why you should probably look for other methods first. Because you are working on the string, rather than values, you have to think of the position of the digits that may be used to create the numbers within your range. Here are two approaches to validating a number over 15. Note that both expressions accept 15 itself, so strictly speaking they validate "15 or more".

^(?:1[5-9]|[2-9]\d|[1-9]\d\d+)$

With this approach, we progress in numerical order with multiple alternations, first trying to match numbers between 15 and 19, then numbers between 20 and 99, then numbers 100 and above.

^(?:1(?:[5-9]|\d\d+)|[2-9]\d+)$

With this approach, we look at two cases: either the first digit is a 1, or it is anything else.

How do I validate that a time string is well-formed?

Here's an expression I came up with.

^(?:([0]?\d|1[012])|(?:1[3-9]|2[0-3]))[.:h]?[0-5]\d(?:\s?(?(1)(am|AM|pm|PM)))?$

It matches times in a variety of formats, such as 10:22am, 21:10, 08h55 and 7.15 pm.

How do I validate that a list is made of certain items, in any order?

Scenario: you want to make sure that the string only contains items from a list, delimited by a comma (for instance). These items could be objects, numbers, names. For instance: 212, 415, 850.
Here is a general solution: Example 1: ^(?:peas|onions|carrots)(?:,(?:peas|onions|carrots))*+$ Example 2: ^(?:415|212|850)(?::(?:415|212|850))*+$ (note that here the delimiter is a colon.) Explanation: You need one of the words to be present at least once. Then it is optionally followed by a comma and another word, multiple times. If you are using PCRE, you can use the repeating syntax for a more compact, maintainable expression: ^(peas|onions|carrots)(?:,(?1))*+$ How can I validate that a string contains the text "75", but only once? This is similar to the password validation presented on the Lookaround page: you set a number of conditions before matching the string. ^(?=.*?75)(?!.*?75.*?75).*$ How can I validate that a binary string contains ten 1s at the most? ("reverse password validation technique") This is a variation on the password validation technique: we look ahead to make sure that the string does not contain what we don't want, then we match. ^(?!0*(?:10*){10}1)[01]+$ After anchoring the expression, in the negative lookakead, we build a generic binary string that has at least 11 ones. This is what we don't want. To build that string, we state that it can start with any number of zeros. Once the zeroes are consumed, we have a one, followed by optional zeroes. That's our first one. We repeat this ten times, bringing us to ten ones. Finally, we add one last one to get over the limit.
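To see the validation patterns above in action, here is a minimal sketch in Python. The choice of Python's re module and the helper name accepts are my assumptions, not something from this page; the patterns themselves are copied unchanged. I left out the list patterns because their possessive quantifier (*+) needs Python 3.11 or later.

```python
import re

# Patterns copied from the section above.
OVER_15_A = re.compile(r'^(?:1[5-9]|[2-9]\d|[1-9]\d\d+)$')
OVER_15_B = re.compile(r'^(?:1(?:[5-9]|\d\d+)|[2-9]\d+)$')
EXACTLY_ONE_75 = re.compile(r'^(?=.*?75)(?!.*?75.*?75).*$')
AT_MOST_TEN_ONES = re.compile(r'^(?!0*(?:10*){10}1)[01]+$')


def accepts(pattern, text):
    """Return True when the anchored pattern validates the text."""
    return pattern.match(text) is not None


if __name__ == '__main__':
    # Both number patterns accept 15 itself and reject 14.
    for sample in ('14', '15', '99', '100'):
        print(sample, accepts(OVER_15_A, sample), accepts(OVER_15_B, sample))
    # One occurrence of 75 passes, two occurrences fail.
    print(accepts(EXACTLY_ONE_75, 'a75b'), accepts(EXACTLY_ONE_75, '7575'))
    # Ten ones pass, eleven ones fail.
    print(accepts(AT_MOST_TEN_ONES, '1' * 10), accepts(AT_MOST_TEN_ONES, '1' * 11))
```

Comparing the two number patterns over a range of integers is a cheap way to convince yourself that they really are equivalent.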

Finding

How do I match a number with one to ten digits? [boundaries] How can I match all lines except those that contain a certain word? [exclusion] How can I match paragraphs that contain MyWord, but only proper paragraphs starting with two carriage returns? [paragraphs] Match numbers followed by letter or end of string Match pairs of characters in the correct slots Okay, let's start easy. How do I match a number with one to ten digits? You could do something like \b\d{1,10}\b. The boundaries are there to make sure you don't match a portion of a twenty-digit number when you really only want to match a number that has ten digits at the most. For this kind of simple max, I really recommend you print out the cheat sheet. How can I match all lines except those that contain a certain word? Typically, this would be used in a case where you want to capture something on each line, except those that present certain features. Let's go with the simple case where you want to match all lines, except those that contain "BadWord". This will match your lines: (?m)^(?!.*?BadWord).*$ If you want to exclude BadWord only when it stands on its own, set it apart with the \b boundary: (?m)^(?!.*?\bBadWord\b).*$ Also note that this is a potential application of the best regex trick ever, for which I won't repeat the details—but know that you'll need to examine Group 1 captures, for which the page provides you with sample code in various languages. (?m)^.*?\bBadWord\b.*$|(^.*$) Match numbers followed by letter or end of string In the string 00-11A22B33_44, suppose you are interested in matching numbers, provided they are followed by a letter or the end of the string. You can solve that with: \d+(?=[A-Z]|$) The lookahead (?=[A-Z]|$) asserts: what follows is either an uppercase character, or the end of the string—exactly what we want. The trick here is to not be shy to use the $ anchor in a context where it is not on its own, at the end of the string. Dollars are people too! 
If you've only seen basic regex tutorials, you could be forgiven for assuming that the ^ anchor only ever appears at the very beginning of an expression, while the $ anchor always sits quietly at the very end. You can use anchors anywhere in your pattern. They are assertions like any other.

How can I match paragraphs that contain MyWord, but only proper paragraphs starting with two carriage returns?

This question is about finding text within specific formatting. If a paragraph starts with a single carriage return, you are not interested. You are only interested in the first paragraph or those set off by two carriage returns. On systems where a carriage return only inserts a newline character (such as Unix), you could start with this:

(?m)^(?<=^\A|\n\n).*?MyWord.*$

The lookbehind ensures that the line is either the first in the text, or that it is preceded by two newlines. On Windows, in the place of \n\n, you would want \r\n\r\n. For something portable, PCRE has \R, which matches any newline sequence. Your expression would look like this:

(?m)^(?<=^\A|\R\R).*?MyWord.*$

Be aware that \R can match either one or two characters, and PCRE generally rejects it inside a lookbehind for that reason. If your version complains, spell out the alternatives instead: (?m)^(?<=^\A|\n\n|\r\n\r\n).*?MyWord.*$

Match pairs of characters in the correct slots

Suppose you want to match all two-digit numbers that start with a 6. Further, you think of your string as a series of pairs, so you would want to match 68 in 116822, but not in 168122. Let's proceed step by step. To match the first pair that starts with a 6, you could use ^(?:[^6].)*(6\d) and retrieve the match from Group 1. The anchor ^ ensures that we start looking at the beginning of the string. The non-capture group (?:[^6].)* matches zero or more pairs of characters that do not start with a 6 (using the parity trick to stay in sync with the two-character slots in the string), then the parentheses around (6\d) capture our match to Group 1. In Perl, PCRE (PHP, R…) or Ruby 2+, we could do away with the capturing group and match the string directly by using \K, which forces the engine to drop what was matched previously: ^(?:[^6].)*\K6\d.
Likewise, in .NET, we could use infinite lookbehind: (?<=^(?:[^6].)*)6\d But we don't want to match just one pair: from 00611122665564, we want to extract 61, 66 and 64. This is a place where the match continuation anchor \G comes in very handy. \G matches the beginning of the string, or the position immediately following the previous match. It is supported in .NET, Perl, PCRE (PHP, R…), Java and Ruby. It will ensure that our second and next matches do not fall out of sync with the two-character slots in the string. Here is the general option with capture groups: \G(?:[^6].)*(6\d) In engines that support \K, we would use \G(?:[^6].)*\K(6\d) to get a direct match. And in .NET, we would use an infinite lookbehind: (?<=\G(?:[^6].)*)(6\d)
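Python's re module supports neither \G nor \K, but the same slot-by-slot walk can be sketched with a compiled pattern's match method, which only matches at the position you hand it. That is my stand-in for \G, an assumption on my part rather than something this page prescribes, and the function name pairs_starting_with_6 is made up for the illustration.

```python
import re

# Skip whole pairs that do not start with 6, then capture one that does.
PAIR_STARTING_WITH_6 = re.compile(r'(?:[^6].)*(6\d)')


def pairs_starting_with_6(text):
    """Collect the in-slot pairs that start with 6, scanning left to right."""
    found = []
    pos = 0
    while True:
        # match() is anchored at pos, which plays the role of \G here:
        # each attempt starts exactly where the previous match ended.
        m = PAIR_STARTING_WITH_6.match(text, pos)
        if m is None:
            break
        found.append(m.group(1))
        pos = m.end()
    return found


if __name__ == '__main__':
    print(pairs_starting_with_6('00611122665564'))
    print(pairs_starting_with_6('116822'))
    print(pairs_starting_with_6('168122'))
```

Because every successful match consumes at least one full pair, the loop always moves forward and stops at the first position where no in-slot pair starting with 6 remains.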

Replacing and Inserting

I suggest you try to think of the regex replace feature of your language or text editor as not only a way to replace text, but also to insert. Remember that a regex pattern can match not only text strings but also positions in text. For instance, the pattern ^ matches the beginning of a string or line (depending on the engine and mode), and (?=@) matches a position preceding an AT—without matching the characters themselves. When you use a replacement function on a position match, where no actual characters are matched, you are not really replacing anything: rather, you are inserting characters at the matched position. Insert text at the beginning (or end) of a line How do I replace one tag delimiter with another? ["surround" replacement] How do I replace the string "//" in a whole file, but only when it is part of a path? [selective replacement] How do I replace curly Quotes ("smart quotes") with straight quotes? [utf8] How do I convert a whole string to lowercase except certain words? [selective transformation] How do I replace all words that appear on the black list, but not those on the white list? [black list] How do I fix unclosed tags? [forced failure] Insert text at the beginning (or end) of a line To insert text at the beginning of a line, we simply match the position at the beginning of the line, without matching any characters. To do so, in all engines except Ruby, we must turn on multi-line mode, which allows the ^ anchor to match at the beginning of lines. For instance, in .NET, Java, Perl, PCRE (PHP, R, …) and Python, you can use this regex to search: (?m)^ and replace with your chosen line prefix. Likewise, to insert a suffix at the end of lines, you can use this regex to search: (?m)$ and replace with your chosen line suffix. How do I replace one tag delimiter with another? Let's say you want to replace [square brackets] with <pointy brackets> without changing the stuff in the brackets. 
Search: \[([^]]+)]

This search expression matches an opening bracket, then anything that is not a closing bracket, then a closing bracket. The content of the brackets is captured in Group 1.

Replace: <\1>

The replacement expression just places the capture (Group 1) within a brand new set of pointy brackets.

How do I replace the string "//" in a whole file, but only when it is part of a path?

Let's say in a page of text you want to replace all instances of // or \\ with a single forward slash. No problem, that's what your replace function is designed to do. In PHP: $string=preg_replace('~//|\\\\~','/',$string); (the backslashes need to be escaped). By the way, this is a great example of why something like a tilde (~) often works better than / as a delimiter. With / as a delimiter, the regex would look like this: $string=preg_replace('/\/\/|\\\\/','/',$string);.

The real "problem" is if you wanted to replace all instances of //, but only in parts of your text file that look like this:

Document=root//folder1//folder2//(maybe_more_folders)//file.extension

You can't do a plain replace, as instances of // that you don't want to touch would also be replaced. You can't capture the various parts of the file path into groups and build a generic replace string, because you don't know how many subfolders are in the string. For this kind of problem, I use three distinct solutions depending on the context and my mood.

Solution #1: Variable-Width Lookbehind. This simple solution works in ABA and RegexBuddy (.NET flavor), which have variable-width lookbehinds. You search for (?<=Document=.*)// and replace with a single slash.

Solution #2: Replace function with Callback. This solution works if your programming language has a replace function that allows you to call another function. The replace function passes the whole match. The "callback function" works on the match and returns the replacement string.
In this instance, Document=[^/]*+(?>//[^\s]+) matches the type of string we are looking for. In PHP, we can use: $string=preg_replace_callback('~Document=[^/]*+(?>//[^\s]+)~', function ($match) {return str_replace('//','/',$match[0]);}, $string); Solution #3: Multiple replacements. This solution works in environments where you can run a replace operation multiple times (until you exhaust any replacements to be made). For instance, in this case, we can safely assume that no path will have a hundred subfolders, so we can run the replace operation a hundred times. On my system, I can run this kind of operation in Directory Opus (for file renaming) and EditPad Pro. The trick here is to build an expression that will continue to match the string you want to alter, even after you have made several replacements. In our example, (Document=[^/]*+(?>/(?!/)[^/\s]+)*+)(//) will capture before the first // in Group 1, then capture the first //. You replace the match with \1 and a single /, then you repeat the operation as many times as necessary. How do I replace curly Quotes ("smart quotes") with straight quotes? This is not a hard regex problem: we just want to replace some characters with some other character. It's a character set problem. You need to know every unicode code point (or the few ASCII codes) for curly quotes. The regex is self-explanatory: I'll just give you the solution, first for utf-8 then for ASCII. For utf-8 text (which is what I have on my website), I use the two replace lines in the code below. <?php $string='“Take me to ‘the station’ ”, he said.'; echo 'Before: '.$string.'<br />'; $string=preg_replace('~[\x{0091}\x{0092}\x{2018}\x{2019}\x{201A}\x{201B}\x{2032}\x{2035}]~u',"'",$string); // single curly quotes $string=preg_replace('~[\x{0093}\x{0094}\x{201C}\x{201D}\x{201E}\x{201F}\x{2033}\x{2036}]~u','"',$string); // double curly quotes echo 'After: '.$string; ?> Output: Before: “Take me to ‘the station’ ”, he said. 
After: "Take me to 'the station' ", he said.

For text in a single-byte encoding such as Windows-1252 (often loosely called "extended ASCII"), where the curly quotes sit at bytes 0x91 to 0x94 (145 to 148 in decimal), you can use this:

<?php
$string='“Take me to ‘the station’ ”, he said.';
echo 'Before: '.$string.'<br />';
$string=preg_replace('~[\x91\x92]~',"'",$string); // single curly quotes
$string=preg_replace('~[\x93\x94]~','"',$string); // double curly quotes
echo 'After: '.$string;
?>

How do I convert a whole string to lowercase except certain words?

Input: Tomatoes AND orangeS AND ParsleY

We want to convert the whole sentence to lowercase, except the word AND. Here are three ways to handle this.

1. Match all words except AND, and replace them with their lowercase version using a callback function (preg_replace_callback in PHP).

Match: (?!\bAND\b)\s*\b\w+\s*

Here is a working example:

<?php
$string='Tomatoes AND orangeS AND ParsleY';
$regex='~(?!\bAND\b)\s*\b\w+\s*~';
$string=preg_replace_callback($regex,function ($m) {return strtolower($m[0]);} ,$string);
echo $string;
?>

2. Progressively match the whole string, capturing word groups in Group 1 and 'AND' in Group 2, then rebuild the string. This is heavier programmatically, but, according to my benchmarks (running each piece of code a million times), it is 33% faster, thanks to the averted callbacks.

<?php
$string='Tomatoes AND orangeS AND ParsleY';
$regex=',((?!\bAND\b)\s*\b\w+\s*)(\bAND\b|$),';
preg_match_all($regex, $string, $matches, PREG_PATTERN_ORDER );
$size=count($matches[1]);
$string='';
for ($i=0;$i<$size;$i++) $string.=strtolower($matches[1][$i]).$matches[2][$i];
echo $string."<br />";
?>

3. Use the best regex trick ever, for which I won't repeat the details, but know that you'll need to examine Group 1 captures, for which the page provides you with sample code in various languages. For this task, the pattern would look something like \bAND\b|(\w+): the left side swallows the word we want to leave alone, and the right side captures every other word in Group 1.

How do I replace all words that appear on the black list, but not those on the white list?
Let's say you want to replace all instances of the word sax with '###', even when it is part of other words such as "saxophone", but not when it is part of "Essax" and other words on a white list. And let's say you have a whole blacklist of "bad words" besides "sax", each word with its own whitelist of acceptable uses. Crafting a custom regex for each word is tedious. The easier procedure is to replace each instance of the "bad words" that occur in a white list word with something distinctive. For instance, add "@@@" to the end of every white list word that contains "sax", turning "Essax" into "Essax@@@". With a simple lookahead, you can then replace sax everywhere, except when it is part of a word that ends in "@@@": sax(?!\w*@@@). Last, all you have to do is zap all the "@@@".

How do I fix unclosed tags?

Here is an example I'm particularly fond of because it's a great use of conditionals. The problem: in this string

a<1bc<2>3>de<<4f5g

the numbers are supposed to live in complete tags, like so: <1>

Sometimes the opening tag is missing, sometimes the closing tag is missing, sometimes there are multiple opening tags, sometimes the tag is properly formed. To match these numbers, if we make both tags optional, as in <*(\d+)>*, then we will erroneously match the 5, which has no tag at all and should be left alone. To ensure there is at least one tag, one solution is to say "match opening tags and optionally match closing tags, OR optionally match opening tags and match closing tags". This looks like this:

Match: <+(\d+)>*|<*(\d+)>+
Replacement: <\1\2>

This works great, but the alternation can give the engine a lot of work. Isn't there a way to say "at least one of the tags has to be present"? With conditionals, there is:

Match: (<)*(\d+)(>)*(?(1)|(?(3)|(?!)))
Replacement: <\2>

The first part of the expression matches optional opening tags, a number, and optional closing tags. The opening tags are captured in Group 1. The number is captured in Group 2.
The closing tags are captured in Group 3. After all this matching takes place (without using an alternation), a conditional expression checks that at least one of the two tags was present (and therefore captured). Here is the logic:

IF Group 1 was captured: (?(1)… THEN no need to match anything,
OTHERWISE (no Group 1 capture), IF Group 3 was captured: (?(3)… THEN no need to match anything,
OTHERWISE (neither tag group was captured), fail: (?!).

The key here is to force the regex to fail unless we are happy with the match. (See forced regex failure on the tricks page for more about forcing a regex to fail.) Now tell me… how neat is that?

Smiles,
Rex

Hey Dude! This site is awesome! I can tell you've really put a lot of work and thought into it. It's been a while since I've dipped into Regex, but I'm excited to re-learn it. Under Capturing in the third title you have a spelling error, is says to instead of do. "How to I match text inside a set of parentheses that contains other parentheses? " Just wanted to give you a heads up. I love your site!
Jacob Hawks

Reply to Jacob Hawks
Hey man, thanks for the kind words and typo report! Really appreciate it. Fixed. Wishing you a great day.
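Coming back to the Replacing and Inserting recipes: three of them (inserting at a position, the "surround" replacement and the conditional tag fix) can be tried out with a short sketch. It uses Python's re module, which is my choice for the illustration, and the function names are hypothetical. The patterns are the ones shown in this section.

```python
import re

# Conditional pattern from the unclosed-tags recipe above.
FIX_TAGS = re.compile(r'(<)*(\d+)(>)*(?(1)|(?(3)|(?!)))')


def add_line_prefix(text, prefix):
    """Insert a prefix at the start of every line.

    (?m)^ matches a position, not characters, so nothing is replaced.
    """
    return re.sub(r'(?m)^', lambda m: prefix, text)


def square_to_pointy(text):
    """Swap [square] delimiters for <pointy> ones, keeping the content."""
    return re.sub(r'\[([^]]+)]', r'<\1>', text)


def fix_unclosed_tags(text):
    """Wrap each number that has at least one tag in a complete tag."""
    return FIX_TAGS.sub(r'<\2>', text)


if __name__ == '__main__':
    print(add_line_prefix('first\nsecond', '> '))
    print(square_to_pointy('see [this] and [that one]'))
    print(fix_unclosed_tags('a<1bc<2>3>de<<4f5g'))
```

The lambda in add_line_prefix is there so that a prefix containing backslashes is inserted literally instead of being read as a replacement template.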

Interesting Character Classes

My goal with this page is to assemble a collection of interesting (and potentially useful) regex character classes. I will try to organize the collection into themes.

How do these Character Classes Work?

Before we start, I want to make sure you don't feel confused when you stumble on something like [!-~]. Remember that the hyphen defines a range between two characters in the ASCII table (or between two Unicode code points, depending on the engine). But a range does not have to look like [a-z]… If you consult the ASCII table, you will see that [!-~] is a valid range—and a useful one too. Sometimes, instead of a straight character class, you'll see something like (?![aeiou])[a-z]. The first part is a negative lookahead that asserts that the following character is not one of those in a given range. This is a way to perform character class subtraction in regex engines that don't support that operation—and that's most of them. In this example, the resulting character class is that of English lower-case consonants, since we have removed the vowels [aeiou] from the range of letters [a-z]. You may, by the way, notice that the letter a appears in both classes: we could have written this (?![eiou])[b-z]
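If you want to check a lookahead-based subtraction against a class written out by hand, a small sketch like the one below does the job. Python's re module and the helper name matched_letters are my assumptions for the illustration. The explicit consonant class is a standard way to write English consonants as ranges.

```python
import re
import string

# Class subtraction through a negative lookahead, as described above.
SUBTRACTED = re.compile(r'(?![aeiou])[a-z]')
SUBTRACTED_TRIMMED = re.compile(r'(?![eiou])[b-z]')
# The same set spelled out as plain ranges.
EXPLICIT = re.compile(r'[b-df-hj-np-tv-z]')


def matched_letters(pattern):
    """Return every lowercase ASCII letter the pattern matches on its own."""
    return ''.join(c for c in string.ascii_lowercase if pattern.fullmatch(c))


if __name__ == '__main__':
    print(matched_letters(SUBTRACTED))
    print(matched_letters(SUBTRACTED) == matched_letters(EXPLICIT))
```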

Useful ASCII Ranges

All Printable Characters in the ASCII Table: [ -~]
All Printable Characters in the ASCII Table, Except the Space Character: [!-~]
All "Special Characters" in the ASCII Table: (?![a-zA-Z0-9])[!-~]
All "Special Characters" in the ASCII Table, Without Using Lookahead: [!-/:-@\[-`{-~]
All Latin and Accented Characters: (?i)(?:(?![×Þ÷þ])[a-zà-ÿ])
All English Consonants: [b-df-hj-np-tv-z]
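A quick way to trust ranges like these is to enumerate the ASCII table and see what each class lets through. The sketch below does that in Python (my choice of language, with a made-up helper called members) and confirms that the lookahead version and the lookahead-free version of the "special characters" class pick out the same characters.

```python
import re

ASCII_CHARS = [chr(code) for code in range(128)]


def members(pattern):
    """Return the ASCII characters that the single-character pattern matches."""
    compiled = re.compile(pattern)
    return ''.join(c for c in ASCII_CHARS if compiled.fullmatch(c))


PRINTABLE = members(r'[ -~]')
PRINTABLE_NO_SPACE = members(r'[!-~]')
SPECIAL_WITH_LOOKAHEAD = members(r'(?![a-zA-Z0-9])[!-~]')
SPECIAL_EXPLICIT = members(r'[!-/:-@\[-`{-~]')

if __name__ == '__main__':
    print(len(PRINTABLE), len(PRINTABLE_NO_SPACE))
    print(SPECIAL_EXPLICIT)
    print(SPECIAL_WITH_LOOKAHEAD == SPECIAL_EXPLICIT)
```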

Obnoxious Ranges

Alphanumeric Characters: [^\W_]
This is an interesting class for engines that don't support the POSIX [[:alnum:]]. It makes use of the fact that \w is very close to what we want. [^\W] is a double negation that matches the same as \w. By adding _ to the negated class, we are left with ASCII letters and digits. Watch out, though: in Python and .NET, \w matches any unicode letter. But frankly... Just use [a-zA-Z0-9]. See also Any White-Space Except Newline.

Binary Number: [^\D2-9]+
This is the same idea as the regex above to match alphanumeric characters. In most engines, the character class only matches digits 0 or 1. The + quantifier makes this an obnoxious regex to match a binary number; if you want to do that, [01]+ is all you need. Note that in some engines, such as .NET and Python 3, \d matches any digit in any script, so the meaning in those engines would be "any digit in any script, except ASCII digits 2 through 9".
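The Unicode caveat is easy to see in Python 3, where \w and \d are Unicode-aware by default and the re.ASCII flag restores the narrow meaning. This is a small sketch under that assumption. The helper name members and the sample characters are mine.

```python
import re

ASCII_CHARS = [chr(code) for code in range(128)]


def members(pattern, flags=0):
    """Return the ASCII characters that the single-character pattern matches."""
    compiled = re.compile(pattern, flags)
    return ''.join(c for c in ASCII_CHARS if compiled.fullmatch(c))


ALNUM = members(r'[^\W_]', re.ASCII)
BINARY_DIGITS = members(r'[^\D2-9]', re.ASCII)

# Without re.ASCII, Python 3 treats \w and \d as Unicode-aware.
ACCENTED_LETTER_MATCHES = re.fullmatch(r'[^\W_]', 'é') is not None
ARABIC_DIGIT_MATCHES = re.fullmatch(r'[^\D2-9]', '\u0663') is not None

if __name__ == '__main__':
    print(ALNUM)
    print(BINARY_DIGITS)
    print(ACCENTED_LETTER_MATCHES, ARABIC_DIGIT_MATCHES)
```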

Strange or Beautiful Ranges

Square Brackets: [][]
This will work in .NET, Perl, PCRE and Python. The crazy thing is that there is a lot of variation among engines as to which brackets need to be escaped. While [\]\[] will work everywhere, in JavaScript you can use [[\]], and in Java you can use []\[].

Words you can Type with your Left Hand (but you'll need a QWERTY keyboard): (?i)\b[a-gq-tv-xz]+\b
Words you can Type with your Right Hand (QWERTY keyboard): (?i)\b[uh-py]+\b
Words that only use Letters from the Top Row (QWERTY keyboard): (?i)\b[eio-rtuwy]+\b

Line-Break-Related

Any Character Including Line Breaks
These are ways to replicate the behavior of the dot in DOTALL mode (by default, the dot does not match line breaks): [\S\s] or [\D\d] or [\w\W]. Note that in each of these classes, I have tried to place in first position the token that has the greatest chance of matching first (which of course would depend on the target text).

Any White-Space Character Except the Newline Character: [^\S\n]
You may not have a use for this, but it's an interesting class making use of double negation. We're negating \S, so that's the same as all white-space characters \s. But adding \n to the negated class removes the newline from the set.

Alternative to [\r\n] for Java and Ruby 2+: (?![ \t\cK\f])\s
This rather pointless regex (except as a learning device) relies on the fact that in these engines \s matches an ASCII space, a tab, a line feed, a carriage return, a vertical tab or a form feed: the negative lookahead removes all of those characters except the newline and carriage return.
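Here is a short Python sketch of the first two classes. The language and the helper name collapse_inline_whitespace are my own assumptions. One practical use for [^\S\n] is tidying the spacing inside lines while leaving the line structure alone.

```python
import re

ANY_CHAR = re.compile(r'a[\S\s]b')
PLAIN_DOT = re.compile(r'a.b')
SPACE_NOT_NEWLINE = re.compile(r'[^\S\n]')


def collapse_inline_whitespace(text):
    """Squeeze runs of non-newline whitespace into one space per run."""
    return re.sub(r'[^\S\n]+', ' ', text)


if __name__ == '__main__':
    # The dot stops at the line break, [\S\s] does not.
    print(PLAIN_DOT.fullmatch('a\nb'), ANY_CHAR.fullmatch('a\nb') is not None)
    # Everything in the sample is white-space, but the newline is skipped.
    print(SPACE_NOT_NEWLINE.findall(' \t\n\r\f'))
    print(repr(collapse_inline_whitespace('one \t two\nthree')))
```

Note that the carriage return is still matched: the class only excludes the line feed, exactly as its name says.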

Language-Related

French Letters: [a-zA-ZàéèêùüÀÉÈÊÙÜ]

German Letters: [a-zA-ZäöüÄÖÜßẞ]
The controversial capital letter for ß, now included in unicode, is missing in many fonts, so it might show on your screen as a question mark.

Polish Letters: [a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]
Note that there is no Q, V and X in Polish. But if you want to allow all English letters as well, use [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]

Italian Letters: [a-zA-ZàèéìíòóùúÀÈÉÌÍÒÓÙÚ]

Spanish Letters: [a-zA-ZáéíóúüñÁÉÍÓÚÜÑ]

Subject: Spanish
Thx!! I couldnt figure out how to keep my spanish characters while cleaning up some tweets.

Reply to cesar
Hi Cesar, I'm delighted to hear that you were able to solve your problem. Wishing you a great day, -Rex

Regex Optimizations

The best regex optimization, in my opinion, is to start with good style. So I highly recommend you visit the page on regex style if you haven't already done so.

Optimizing Your Regular Expressions

Suppose you travel to an exotic country. With no prior knowledge of the exotic language spoken in that land, you learn to say such things as "Take me to the East Gate". Once you get into a cab, you realize that requesting your destination is not enough. Unless you also say how you want to get there, you might be taken all over town. That seems a bit unfair, doesn't it? You could feel the same about optimizations in regex. They feel like a bit of a scam. A bad regex engine is like a rude cab driver. Let me explain what I mean. In a programming language, you can use high level functions without really having to worry about how they work. You say "open this, find that, print the other", and it just works. When you start to learn regex, you may think that it's enough to know how to say "find that", naively assuming that the whole point of the syntax is to let you specify what you are looking for. But that's not the case. In some situations, if you don't tell the regex engine how to find what you want, it may take you all over town. A search that might resolve in a hundred steps can take ten thousand if written the wrong way. Now a well-optimized regex engine will be able to look at your pattern and be a little forgiving. It sees a lot of foreigners in town, and even if you didn't express yourself most elegantly, it appreciates the effort, tries to understand what you really meant, and takes you to your hotel as fast as possible. So in some sense, optimizing your regex expressions means learning tricks to speak to impolite or uncivilized regex engines, i.e., engines that haven't been optimized. And that's the scam. Why should I use your programming language if you haven't taken the trouble to make it efficient? What's more, optimizing your regex can force you to write expressions that are harder to write and read. That too seems a tad unfair. Nevertheless, studying optimizations is fun and useful. 
When you study optimizations, you deepen your understanding of how the engine works, and that knowledge helps you write your expressions faster and more accurately.

What is an Optimization, Exactly?

Before we start, I should define what I mean by optimization. In this section, I am not talking about the "big picture" regex writing practices we discussed in the Elements of Regex Style, practices that hinge on fundamental features of your regex engine, such as the different paths down which a greedy and a lazy quantifier may lead your engine before you end up matching the same string. Here, we are talking about "micro-optimizations" akin to turning a bolt by a quarter of a turn on your V-6. Frankly, I'm not sure I always know what should qualify as a "big-picture practice" or as a "micro-optimization". In the end, I suppose the results produced by each technique make that decision for us.

About the optimization tests on this page

On this page, I took a series of tricks from Jeffrey Friedl's book and tested how one particular version of one particular engine (PCRE 8.12) used within one particular version of one particular language (PHP 5.3.8) responds to each of them. At some stage, I would like to run the same tests in .NET, Python, Java, JavaScript and Ruby. In the meantime, even if you don't use PHP, the optimizations are still interesting to study.

Unrolling an Alternation Quantified by a Star

In the trick to Mimic an Alternation Quantified by a Star, we see that (A|B)* can be unrolled to A*(?:B+A*)*. Is there a time benefit to doing away with the alternation? Using pcretest, I benchmarked two patterns against this string:

14e987eaie7e7e7e9876ei14ou

The patterns: (?:\d|[aeiou])* and \d*(?:[aeiou]+\d*)*

The original pattern (with the alternation) compiles faster: 1.6 millionths of a second vs. 2.2 for the unrolled version. However, it executes a lot slower: 1.7 millionths of a second vs. 0.8. This seems like a potentially useful optimization to implement at the engine level (in one's own code, it is a bit hard to maintain).
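If you want to repeat this kind of comparison in another engine, the first thing to check is that the unrolled pattern still matches the same text. Here is a minimal Python sketch (my choice of language, with hypothetical helper names same_match and time_pattern). Python's re is a different engine from PCRE, so whatever timings it produces say nothing about the pcretest figures above.

```python
import re
import timeit

SUBJECT = '14e987eaie7e7e7e9876ei14ou'
ALTERNATION = re.compile(r'(?:\d|[aeiou])*')
UNROLLED = re.compile(r'\d*(?:[aeiou]+\d*)*')


def same_match(text):
    """Check that both patterns match the same prefix of the text."""
    return ALTERNATION.match(text).group() == UNROLLED.match(text).group()


def time_pattern(pattern, text=SUBJECT, runs=100_000):
    """Return the seconds taken to run the precompiled pattern `runs` times."""
    return timeit.timeit(lambda: pattern.match(text), number=runs)


if __name__ == '__main__':
    print(same_match(SUBJECT))
    print('alternation:', time_pattern(ALTERNATION))
    print('unrolled:   ', time_pattern(UNROLLED))
```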

In Alternations, put the Most Likely Cases First

The engine reads from left to right. In the world of web addresses, dot-com is more frequent than dot-net and dot-biz, so if you were checking for these three domains in a pool of random names, in theory you would write \.(?:com|net|biz)\b rather than \.(?:biz|net|com)\b. In practice, in PCRE there doesn't seem to be a difference: I ran the two patterns a million times each and compared the results. In case you're interested in doing some benchmarking of your own, here is the (very basic) code I used for this test.

<?php
// microtime(true) returns fractional seconds; time() only resolves whole seconds
$start=microtime(true);
for ($i=0;$i<1000000;$i++) $res = preg_match('~\.(?:com|net|biz)\b~', 'apache.com');
$lap=microtime(true);
for ($i=0;$i<1000000;$i++) $res = preg_match('~\.(?:biz|net|com)\b~', 'apache.com');
$end=microtime(true);
$time1 = $lap - $start;
$time2 = $end - $lap;
echo $time1."<br />";
echo $time2."<br />";
?>

Expose Literal Characters

Regex engines match fastest when anchors and literal characters are right there in the main pattern, rather than buried in sub-expressions. Hence the advice to "expose" literal characters whenever you can take them out of an alternation or quantified expression. Let's look at two examples. Example 1: AA* should be faster than A+. They mean the same: at least one A, possibly followed by more A characters. I ran these two patterns two million times on the string BBBCCC. Both took the same amount of time. This tells me that the PHP PCRE engine must be "polite" as far as this optimization is concerned—meaning, it does it for you. Just use A+. Example 2: th(?:is|at) should be faster than this|that. I ran these two patterns two million times on the string that. The second ("less optimal") pattern was actually faster by eight percent, earning me half a millionth of a second per run, nothing to write home about. Again, the optimization must be built into PHP's PCRE regex engine. Maybe the theoretically "more optimal" pattern loses out somewhere in the compilation. Should you use it? Where speed is concerned, the answer depends on the engine you're using. For me, regardless of which engine I happen to use, the decision is one of readability—and therefore maintainability. For instance, if I am matching all numbers from 10 to 19, I will always bring out the 1 and use 1[0-9]. For me, that is easier to read and maintain than spelling out each number, all the more so when working with a more complex range of numbers. On the other hand, if I were crafting a regex for all the two-letter abbreviations of States in the USA, I would spell them out: \b(?:AL|AK|AZ|…)\b. I would do so even though I have tools at my fingertips that will automatically compress long alternations into their optimized counterparts (regex-opt in C, Regexp::Assemble and Regexp::Assemble::Compressed in Perl). 
That is because any day of the week, I would rather have to debug this:

\b(?:AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY)\b

than that:

\b(?:AZ|TX|[AFI]L|[CGILMPVW]A|[CM]O|[CMUV]T|[DMN]E|[HMRW]I|[IMNS]D|[IMT]N|[KM]S|[KNW]Y|[NW]V|[NO]H|[NS]C|N[JM]|[AO][KR])\b
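The debugging argument cuts both ways: if you do adopt a compressed alternation, you will want proof that it accepts the same strings as the readable one. A brute-force check over every two-letter combination settles it. This is a sketch in Python (my choice, with a hypothetical helper named accepted_codes), and both patterns are the ones shown above.

```python
import re
from itertools import product
from string import ascii_uppercase

SPELLED_OUT = re.compile(
    r'\b(?:AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|'
    r'MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|'
    r'WA|WV|WI|WY)\b'
)
COMPRESSED = re.compile(
    r'\b(?:AZ|TX|[AFI]L|[CGILMPVW]A|[CM]O|[CMUV]T|[DMN]E|[HMRW]I|[IMNS]D|'
    r'[IMT]N|[KM]S|[KNW]Y|[NW]V|[NO]H|[NS]C|N[JM]|[AO][KR])\b'
)


def accepted_codes(pattern):
    """Return the set of two-letter uppercase strings the pattern accepts."""
    candidates = (a + b for a, b in product(ascii_uppercase, repeat=2))
    return {code for code in candidates if pattern.fullmatch(code)}


if __name__ == '__main__':
    readable = accepted_codes(SPELLED_OUT)
    compact = accepted_codes(COMPRESSED)
    print(len(readable), len(compact), readable == compact)
```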

Use and Expose Anchors

The beginning- and end-of-string anchors ^ and $ can save your regex a lot of backtracking in cases where the match is bound to fail. In theory, ^.*abc fails faster than .*abc. I ran this two million times on a "failure string" (forty z characters in a row). As in the last example, the "less optimal" pattern was faster by eight percent, earning me half a millionth of a second per run. Again, PCRE sounds polite. The lost time might have to do with processing the extra anchor. It is also advised to expose anchors, which means taking anchors out of alternation parentheses when possible. For instance, ^(?:abc|def) is preferred to ^abc|^def. I ran each of these two preg_match functions two million times. $res = preg_match('~^(?:abc|def)~', 'definition'); $res = preg_match('~^abc|^def~', 'definition'); The first saved me one second (out of fourteen). On the one hand, that's a seven percent improvement. On the other hand, on a single run, that's only an improvement of half a millionth of a second. Should you use it? I use anchors wherever I can as a matter of good style—and saving unneeded backtracking. As to whether to expose them outside of alternations, I usually do so as well, not because it is faster but because it tends to be more readable.

Distributing into the Alternation

The last technique I tried is what Jeffrey calls "distributing into the alternation": \b(?:com\b|net\b) vs. \b(?:com|net)\b This technique did speed up the script by seven percent, saving a millionth of a second per run. Will I use it? Probably not. I like exposing boundaries.

Automated Optimization: Study Mode

PCRE has a "Study" modifier you can tag at the end of a pattern. For patterns that do not start with a fixed character and are not anchored, this modifier causes the regex engine to study the pattern a little more before applying it, just in case some optimizations can be discovered. To use this mode, add a capital S after the closing delimiter, as in: $pattern='/\d+\b(day|night)\b/S'; Apparently, this study mode can be useful when parsing long documents such as web pages. It may not help, but it costs less than a hundred thousandth of a second.

Optimizations: In Conclusion

When I tested them in PHP, the "micro-optimizations" from Jeffrey's book did not seem to speed up the code. Does that mean they are bad? On the contrary. Maybe Philip Hazel and the other contributors to the PCRE engine read Jeffrey's book, found that the optimizations improved matching speed (or match failure speed, which is often more important) and incorporated them into the engine. In his book, Jeffrey hints that he wouldn't be sad for that to happen. At least in PHP, I suggest you forget "micro-optimizations" such as the ones presented in this section. Good style matters more. PCRE won't "scam you" when you board the cab: it's a polite engine. Just write regex that works, focusing on the big picture to avoid patterns that slow you down by orders of magnitude. Mainly, this means making judicious use of anchors, quantifiers (lazy vs. greedy), groups (atomic or backtracking-savvy), and anything that can make your regex more specific than the all-too-common match-all dot-star soup, such as literal characters and negative classes. For more, see the Elements of Regex Style.

Two Tools: Grep in PCRE, Test your PCRE patterns

This page focuses on two little-known but delightful tools that are part of the official PCRE distribution: pcregrep and pcretest. It also lets you download Windows versions of both tools. The first tool is a grep utility that uses the PCRE engine. For those who haven't spent much time in unix, grep is a command-line utility to search text patterns (using regular expressions of course) inside a set of files—all the files on your hard drive if that's your pleasure. The second is another command-line utility, one that will allow you to debug or test the performance of PCRE patterns far more thoroughly than you could through an interface such as PHP. Like the rest of the PCRE library, the tools are available in source form on the PCRE website. On this page, I aim to make available Windows executables of the current version of the tools as soon as they are released.

PCRE1

A lot of tools and languages are still using the PCRE1 branch (which is currently in bugfix mode), so please don't think you need to poopoo it! I try to keep up with the new versions that get released (bug fixes).

PCRE2

Versions of PCRE 10.0 and above are called PCRE2. In PCRE2 releases, the grep and test utilities are called pcre2grep and pcre2test. Whenever you see pcregrep or pcretest on this page, substitute with pcre2grep or pcre2test if you are using PCRE2.

Downloading the Tools

In this section, you can download Windows binaries of pcregrep and pcretest that I've compiled from the PCRE source code. In other words, these programs are ready to run. I would like to always have the latest version for you to download. There are a number of compile options for pcregrep and pcretest. The ones below are "fully loaded" and include the "just-in-time", "utf" and "unicode_properties" options. As of 8.35, they are also compiled with the ANYCRLF option.

Download pcregrep (or pcre2grep) and pcretest (or pcre2test)

PCRE2 Branch

Version | pcregrep | pcretest
10.37 (26 May 2021) | download | download
10.36 (4 Dec 2020) | download | download
10.35 (9 May 2020) | download | download
10.34 (21 Nov 2019) | download | download
10.33 | download | download
10.22 | download | download
10.21 | download | download
10.20 | download | download
10.10 | download | download

PCRE1 Branch

Version | pcregrep | pcretest
8.45 (15 Jun 2021) | download | download
8.44 (12 Feb 2020) | download | download
8.43 (23 Feb 2019) | download | download
8.38 | download | download
8.37 | download | download
8.36 | download | download
8.35 | download | download
8.34 | download | download
8.32 | download | download
8.31 | download | download

Before we dive into how to use pcregrep and pcretest, some thanks are in order.

A Few Words of Thanks

This "little page" is only possible because of a long chain of work by a lot of highly-skilled people. I will start in reverse order, with the last (in that long chain) but not least.

When I started on this page, it had been twenty years since I'd compiled any of my own C code, and I was anxious at the thought of compiling in the age of Windows 7. I tried compiling PCRE with the CMake option mentioned in the help file, but got stuck for a silly reason. Of course I had no way of knowing that it was a silly reason. Daniel Richard took the time to examine my workflow and pinpoint what I was doing wrong. Thanks to him, I was able to solve the problem in minutes, which gave me the great joy of compiling PCRE on my own machine. Thank you, Daniel.

Before Daniel, there are the makers of MinGW and CMake, the two open-source projects that allowed me to so easily compile the source code into Windows executables. I don't know who these guys are, but they're awesome.

On the PCRE side, some people (I don't know their names) maintain the CMake file that makes it possible for jokers like me to compile PCRE. Thanks, guys!

Finally, to Philip Hazel (the father of the PCRE project), Zoltan Herczeg (the keeper of the Just-In-Time compiler) and others on their team, I am immensely grateful for that wonderful engine that has given me so much pleasure. May you all live long lives on beautiful streets lined with chocolate fountains.

Installation Notes

No installation is required for either pcregrep or pcretest. However, if you want the grep tool to be at your fingertips when you need it, here is what I suggest you do.

Rename pcregrep.exe to grep.exe. Life is too short to type extra letters.

Copy grep.exe to the C:\Windows\System32 folder. This system folder is in the system's path variable, which means that when you try to run a program, Windows looks there. That will allow you to run pcregrep (now called grep) from any folder. I copy pcretest to the same folder, so that I don't need to remember where it lives.

To run grep, open a command prompt, which is never more than five keyboard strokes away: Windows key, type "cmd", press Enter. If you use a marvellous program called Directory Opus (probably the ultimate productivity application for Windows), you can also invoke a command prompt from the current folder by using a keyboard shortcut such as Ctrl + Shift + R. That's what I do.

From the command prompt… Start grepping, debugging or optimizing!

Quick Outline

The page is rather long, so I want to give you an idea of the outline. The section about pcregrep starts immediately below. The section about pcretest follows it.

What's special about pcregrep?

There are other versions of grep for Windows floating around. It's only natural, as so many unix people are attached to their command-line tools. This is no recent phenomenon: I remember a time in the mid-1980s when I was given a floppy disk with a number of unix-like commands, such as "ls" (filled with options) or "cp" or "mv". These were designed to replace the DOS commands we all used at the time. There was no "move" command in DOS. These unix-like utilities were awesome, and I used them for many years.

One grep version I tried lately is the one bundled with Cygwin. I don't like it because the regular expression syntax it uses is rudimentary. It has a "-P" flag for Perl-like regex, but the manual states that it is experimental, and indeed it worked poorly when I tried it. So I looked for a command-line Windows grep with solid regex matching, but I couldn't find anything… Until I remembered pcregrep, which I had once come across and hoped to compile some day. I spend more time in PHP than in other coding environments, so PCRE is my "home" regex flavor and I have come to love it. So what could beat pcregrep?

I should mention that pcregrep searches, but it does not replace. For replacing text across files on Windows, my workhorse is "ABA Replace", an amazing GUI tool with solid regex matching. You can read my review of ABA Replace on the Tools page. And yes, there are other GUI tools, such as the search function of Directory Opus, or PowerGrep, which is "not for me", even though I love some of Jan's other software.

Using pcregrep

Remember, we renamed "pcregrep" to "grep" to save on keyboard strokes. For the most part, the pcregrep utility has the same syntax as the usual GNU grep. If you don't know that syntax, don't worry, we'll start from scratch. Here is the basic pcregrep syntax.
grep list_of_options regex_pattern files_to_match
The full syntax is in the manual file which is included in the download. But manual pages can be confusing, so here are some examples that work on my Windows machine.
Command | Description
grep --help | Displays a list of the options you can use with the command. You can send the output to a file with "grep --help > grephelp.txt". But note that the manual page in the download has much more detail.
grep toto * | Looks for the string "toto" in all files in the current directory. Returns all the lines that match, with a little context. Complains that it cannot open directories.
grep -s toto * | As above, but the "s" option makes the engine shut up (or silent) about the fact that it cannot open directories. That should probably be the default option on Windows. If you don't want to see the warnings, just get into the habit of adding an s to your options when you are searching all files.
grep -s toto *.txt | As above, but only looks in files with a "txt" extension.
grep -s \btoto\d{3}\b *.txt | As above, except that instead of looking for a simple string, we are using a regex pattern. Note that there is no delimiter. See the rest of the site for pattern syntax. This particular regex will match strings such as "toto123" as long as they are not embedded in a string of "word characters". You get the idea: going forward, to focus on grep features, many examples will use simple text matches instead of regex.
grep -r toto . | Looks for the string toto in all files, recursively from the current folder.
grep -r --include=.*\.txt toto . | Looks for the string toto in all files, recursively from the current folder, but only in files with a "txt" extension. Note that pcregrep uses a PCRE regex to specify the names of the files in which to search.
grep -r --exclude=.*\.dat toto . | Specifies file names to exclude from the search, using a PCRE regex to define the set of files to exclude.
grep -ri toto . | Like the plain "grep -r toto ." example above, with the addition of the "i" option, which makes the search case-insensitive and allows the command to match "toTO". The "-ri" also showcases how to combine short options.
grep -r (?i)toto . | As above, but setting case-insensitivity in the regex itself. See the page about (? syntax.
grep -f patterns.dat *.txt | Reads patterns from a file called patterns.dat (one pattern per line, up to 100 patterns) and matches each pattern against all files with a "txt" extension!
grep -v toto *.txt | Inverts the match, so that only lines that do not match the pattern are reported.
grep -o \btoto\w\b *.txt | The "o" option only reports each line's match, without the context.
grep -l toto *.txt | The "l" option says to only list the names of the files that contain a match, without showing the matches.
grep -L toto *.txt | The "L" option says to only list the names of the files that do not contain a match. Not the same as "-vl", which shows files that contain lines that do not match (some lines may match).
grep -n toto *.txt | Adds the line number to the reported match.
grep -c toto *.txt | Only reports the number of matching lines in each file.
grep -NANYCRLF toto *.txt | By default, because this is Windows, grep treats \r\n (CRLF) as a new line. This option makes grep treat CR, LF or CRLF as new lines, which comes in handy if you are testing Unix files. See the manual for other options such as -NLF.
grep -so1 toto(\d{3}) * | We saw the s option before (silent). The "o1" option tells pcregrep to only report the Group 1 matches, in this case the three digits after "toto". You could likewise specify -o2 to only report Group 2 captures. This option should have an alias "g" for "group", in order to avoid confusion with "o", which means "only report the match (no context)".
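The difference between -l, -L and -vl in the table is easy to get wrong, so here is a small JavaScript sketch of what each option reports. It does not call pcregrep: the "files" are hypothetical in-memory arrays of lines, and the option semantics are my reading of the table above and of the usual grep behaviour.

```javascript
// Hypothetical in-memory "files": name -> array of lines.
const files = {
  "a.txt": ["toto one", "nothing here", "toto two toto"],
  "b.txt": ["toto only"],
  "c.txt": ["no match", "still none"],
};
const pattern = /toto/;

// Names of the files for which keep(lines) is true.
function listFiles(keep) {
  return Object.keys(files).filter((name) => keep(files[name]));
}

// -l: files with at least one matching line.
const withMatch = listFiles((lines) => lines.some((l) => pattern.test(l)));
// -L: files with no matching line at all.
const withoutMatch = listFiles((lines) => !lines.some((l) => pattern.test(l)));
// -vl: files with at least one line that does NOT match.
const withNonMatchingLine = listFiles((lines) => lines.some((l) => !pattern.test(l)));
// -c: number of matching lines per file (a line with two hits counts once).
const counts = Object.fromEntries(
  Object.entries(files).map(([name, lines]) => [
    name,
    lines.filter((l) => pattern.test(l)).length,
  ])
);

console.log(withMatch, withoutMatch, withNonMatchingLine, counts);
```

Here a.txt shows up under both -l and -vl because it mixes matching and non-matching lines, while only c.txt shows up under -L.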
Note that you don't have to use quotes around patterns, but you can, and indeed you must if your pattern happens to contain white space. Quotes are fine, but remember not to use delimiters on your patterns: this is not PHP. There are many other cool options. For instance, you can specify how much context to display around each match. For a full reference, see the manual page.

About the --color Option

Feel free to skip ahead to the much more exciting pcretest section, as these are just notes so I can remember a feature that I haven't yet managed to use the way I'd like. Like GNU grep, pcregrep has an option to color the match, making it stand out from the context:

grep --color toto *

Sadly, this does not work in the Windows shell (cmd.exe) and results in this strange output: "←[1;31mtoto←[00m", while no color is displayed. This is a limitation of the Windows shell rather than pcregrep. The PCREGREP_COLOR option, set to "1;31" by default in the make files, is an ANSI code that can manipulate colors on terminals that accept these codes, as on unix. Windows is a different OS, so we shouldn't expect it to work.

You can easily change the overall background and text color of the cmd shell, either from the menu (click on the icon at the top left), or from the command-line, by passing strings such as "color=1B" when launching cmd ("1" stands for a blue background, "B" stands for very light blue text). But to my knowledge there is no way to manipulate the color of individual text in cmd.exe.

The work-around is to use a different terminal. I tried pcregrep in Console, and the color option works, but I haven't yet figured out how to integrate Console in my system so that it launches in the right path. (I normally launch command shells from Directory Opus with a Ctrl + Shift + R shortcut, and they open in the right folder, in admin mode).

Last Words about pcregrep

I hope you get as much pleasure out of having that powerful grep utility at your fingertips as I do.
Okay, it's time to look at pcretest!

Using pcretest

In my mind, pcretest has two great uses: optimization and debugging.

First, let's talk about debugging. You could feed pcretest a pattern such as ~\btoto(\w+)\b~, and a subject such as "slkjtototatalkj". With the right parameters, pcretest would show you the exact path it took in order to produce a match:

--->slkjtototatalkj
+0^\b
+2^t
+3^^o
+4^^t
+5^^o
+6^^(\w+)
+7^^\w+
+10^^)
+11^^\b
+13^^
0:tototata
1:tata

I hope you'll agree that this is rather nifty. It could come in handy for an expression that fails for unknown reasons. You'll be able to see exactly what is going on.

Now let's talk about optimization. The pcretest utility lets you run a regex on some data a million times (or however many times you like), and it reports the average time it needed to find a match. This makes it easy to compare alternate expressions. When you read about techniques to optimize your regular expressions, you may be interested in running tests on alternate regex phrasings. You can do that in your programming language (I used PHP to test many tweaks suggested in Jeffrey Friedl's book) but pcretest gives you an even more powerful test bench. By the way, PCRE must have been seriously optimized since Jeffrey's book came out, because as mentioned in my page about regex tricks, none of the tweaks I tried seemed to make much difference. Perhaps partly thanks to Jeffrey's hints?

Before looking at the pcretest command itself, you need to know that it usually operates on an input file. The file contains the regex to be tested, and the lines of text to test. The regex is on the first line, and must be enclosed in delimiters. Here is an example of a file that would work, with the regex delimited by tildes ("~"). The regex is shown in bold for emphasis, but that would not be part of the text file.

~toto\d{3}~
aslkj 242 slkj totos lkj sdlkj toto444 sdfs sadflkj

If you wanted, you could add more regexes and data to that file. Just leave a blank line after the end of the data, then start your next regex, then add the data for that regex. Here is a file that would work with pcretest and contains two regexes. You can add as many regexes and lines of data as you like.

~toto\d{3}~
aslkj 242 slkj totos lkj sdlkj toto444 sdfs sadflkj

~\btoto(\w+)\b~
aslkj 242 slkj tototata lkj

The regexes can use the entire PCRE syntax, whether in the pattern itself (for instance with (?s) or \K), or after the delimiter, for instance with the G flag. There is one particularly interesting flag for debugging: "C". It's the flag that produced the "trace output" I showed you a few paragraphs ago. To get that output, I just added "C" to the pattern in the IN.txt file: ~\btoto(\w+)\b~C.

But enough about the input file. Let's now talk about the command itself. Here are some sample uses of pcretest that you could try with either of the files above. For a full reference, I highly recommend you read the manual page, which is part of the download.
Command | Description
pcretest -help | Displays a list of the options you can use with the command. You can send the output to a file with "pcretest -help > testhelp.txt". But note that the manual page in the download has much more detail.
pcretest -C | Outputs some information about the version of pcretest you are running, such as whether it supports UTF-8.
pcretest IN.txt OUT.txt | Reads the regex and data from IN.txt, outputs the result to OUT.txt, reporting the matches for each line. In the case of the second regex, which includes a capture group, pcretest also reports the Group 1 match.
pcretest -t 100000 IN.txt OUT.txt | As above, but runs 100,000 times, and reports both the matches and the average time per run. Start without the -t option, just in case there is an error in your expression.
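To make the idea behind the -t option concrete, here is a rough JavaScript analogue: run the same regex many times and divide the elapsed time by the number of runs. This is only a sketch of the principle, not of how pcretest measures time. The function name is hypothetical, the subject is the sample data from the input file treated as a single string, and the timings come from the JavaScript engine, not from PCRE.

```javascript
// Average time per run of a regex against a subject, in milliseconds.
// `runs` plays the role of the number passed to -t.
function averageMatchTime(regex, subject, runs) {
  const start = Date.now();
  let matched = false;
  for (let i = 0; i < runs; i++) {
    matched = regex.test(subject);
  }
  const elapsed = Date.now() - start;
  return { matched, averageMs: elapsed / runs };
}

// Sample data from the input file above, used here as one subject string.
const subject = "aslkj 242 slkj totos lkj sdlkj toto444 sdfs sadflkj";
const timing = averageMatchTime(/toto\d{3}/, subject, 100000);
console.log(timing.matched, timing.averageMs);
```

As with pcretest, it is worth checking first that the pattern matches what you expect, and only then looking at the timing.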
There are a number of other options, some of which I don't even understand. So if you feel so inclined, dig into the manual page, and have fun with it!

Subject: a correction to the article
Hello, Rex, and thank you very much for the site. It should probably be noted in the article that in pcre2test the modifier "C" is changed to "auto_callout", so if one downloads the latest compiled version (it's 10.22 at the moment) of pcretest linked from the page and tries the example pattern (~\btoto(\w+)\b~C) you gave, pcre2test throws the "** Unrecognized modifier 'C' in 'C'" error. It took me a while to figure out what the problem is :) Please don't publish this message, just correct the article if you think you should. Ilya.

Subject: Many thanks
I was having problems with Cygwin's grep and multi-line patterns, so thanks for the recommendation to use pcregrep instead. I love your site; everything is explained clearly. I finally learnt how to use lookaheads and behinds thanks to your clear instructions. Really great site. Thanks again.

I tried the latest version on your site, saying: pcregrep the pcregrep.txt It prints every line of the input. If I use an invalid pattern like "zzz", it prints nothing.

Reply to Glen
That sounds like expected behavior... if each line of the input contains "the". Is that not the case? Remember that a "line" can take several screen lines up to the next carriage return. I tested what you said, and if I remove "the" from one of the lines of input, that line does not show in the output (as expected). If you want to see more sober output, you can try something like pcregrep -no the test.txt For how this works, please see the examples on the page, as well as the documentation inside the zip file. Warm regards, Andy

I'm using pcregrep on Windows 7. I have a folder with text files in utf-8 encoding, which contain the word "español". If I open cmd on that folder and do: pcregrep -u -I español *.txt I don't get anything. Many thanks in advance, Cesar

Reply to Cesar Romani
Hi Cesar, pcregrep works. Your PCRE syntax is not correct. No time to help you, please download the regexbuddy demo from the right column of my website to troubleshoot your PCRE syntax. Cordially, Rex

There are a few ways to use the --color flag in cmd.exe. I used to use rlwrap.exe from cygwin but it had a few cygwin dependencies and was a bit heavy for just wrapping grep. If you wanted to wrap a full readline in cmd.exe it works great though. Now I use ansicon and it's very simple and much faster than rlwrap was. You can get the source from github under adoxa/ansicon and compile or they have a prebuilt one you can download. It builds a single exe and 2 dll's (x86 & x64). I just run it like "ansicon.exe grep --color toto *" which just uses it for that one command. You can also run ansicon.exe by itself to enable for the current cmd.exe or you can install to autorun so it runs on every new cmd.exe. You can also use color prompts or other tools with ansi color output. I enjoyed your article and the info on pcretest, which was exactly what I was looking for.

Reply to Gary
Hi Gary, Thanks for the great tip! Wishing you a fun weekend, Rex

How to Make Perl Regex One-Liners

Perl one-liners are power tools that get a lot of work done with a single command. You can use them on the Unix command line, or indeed on the command line of any OS where Perl is installed. There are many pages about Perl one-liners. This one is only interested in Perl one-liners in the context of regular expression matching, replacing and splitting. The site also has a page about Perl regex, but this one is only about one-liners.

Perl One-Liner Recipes, not Regex Recipes

Other pages about Perl regex one-liners focus on showing you the regular expressions to accomplish certain tasks. In contrast, this page assumes you know regex, as teaching you regex is the focus of the rest of the site. What this page shows you is, given a certain regex, the Perl syntax to write one-liners to accomplish various tasks. The idea is to get you up-and-running with Perl one-liners, not to get you up-and-running with regex.

Our Input

All our one-liners assume we are working with a file called yourfile containing these lines: cat bat carrot book true blue red caramel To test the one-liners, I encourage you to create this file with nano or whichever tool you use to edit files.

Our Regex

All our one-liners assume we are working with this regex: \bc\w+ The idea is to match words that start with the letter c. In our input, these words are cat, carrot and caramel.

General Notes on Syntax

These are for reference only, feel free to skip if you're not interested. Rather than spread out the explanations among the recipes, I gathered them in this short section.

The -e (execute) flag is what allows us to specify the Perl code we want to run right on the command line. The Perl code is within quotes. Although it's possible to string several -e statements in a row, we won't do it here.

The -n flag feeds the input to Perl line by line.

-0777 changes the line separator to undef, letting us slurp the file, feeding all the lines to Perl in one go.

All the examples assume we're working on one file called yourfile, but you could specify multiple files with yourfile yourfile2 yourfile3 or with *.txt

All the examples assume we're working on one file called yourfile, but you could instead pipe some output to the one-liner, e.g. echo $PATH | perl…

The -p flag (printing loop) processes the file line by line and prints the output.

To replace directly in the file you can use the -i flag… but first test your one-liner without the -i to make sure it's what you want.

If you're planning to use (*SKIP)(*F), remember this only works in Perl 5.10 and above: check your Perl version with perl -v

The perl command is in apostrophes, and escaping those is hard work… So if your regex happens to contain apostrophes, first place it in an env variable then refer to it by name, e.g.

env mypattern="'\w+" perl -0777 -ne 'while(m/$ENV{mypattern}/g){print "$&\n";}' yourfile

Tasks for Perl Regex One-Liners

We're ready to dive into the various types of tasks for Perl regex one-liners. If you have an idea for a task that's missing here, send me a comment. As a reminder, we'll be working with this input: cat bat carrot book true blue red caramel and this regex: \bc\w+ (which matches words that start with the letter c)

Task 1: Process the file line by line, and return all matches

We expect to match cat, carrot, caramel. Use this: perl -ne 'while(/\bc\w+/g){print "$&\n";}' yourfile

Task 2: Process the file line by line, and return all matching lines

We expect to match lines 1 and 3. Use this: perl -ne 'print if /\bc\w+/' yourfile

Task 3: Process the file line by line, and return the first match of each line

We expect to match cat and caramel. Use this: perl -ne 'print "$&\n" if /\bc\w+/' yourfile

Task 4: Process the file as a block, and return all matches

We expect to match cat, carrot, caramel. Use this: perl -0777 -ne 'while(m/\bc\w+/g){print "$&\n";}' yourfile

Task 5: Process the file as a block, and return the first match

We expect to match cat. Use this: perl -0777 -ne 'print "$&\n" if /\bc\w+/' yourfile
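If you want to double-check the expected results of Tasks 1 to 5 without Perl at hand, the same regex can be exercised in JavaScript. This is only an analogue of the one-liners, not a replacement for them. The line breaks of yourfile are an assumption: the three lines below are a guess that is consistent with the expected results stated in the tasks (lines 1 and 3 match, and the first matches per line are cat and caramel).

```javascript
// Assumed layout of "yourfile" (the exact line breaks are a guess).
const fileText = "cat bat carrot book\ntrue blue\nred caramel\n";
const lines = fileText.split("\n").filter((l) => l.length > 0);

// All matches in a piece of text, like the while(/.../g) loop.
function allMatches(text) {
  return text.match(/\bc\w+/g) || [];
}

// First match only, like print "$&\n" if /.../; null when there is none.
function firstMatch(text) {
  const m = /\bc\w+/.exec(text);
  return m ? m[0] : null;
}

const task1 = lines.flatMap((l) => allMatches(l));             // per line, all matches
const task2 = lines.filter((l) => /\bc\w+/.test(l));           // matching lines
const task3 = lines.map(firstMatch).filter((m) => m !== null); // per line, first match
const task4 = allMatches(fileText);                            // whole file, all matches
const task5 = firstMatch(fileText);                            // whole file, first match

console.log(task1, task2, task3, task4, task5);
```

Tasks 1 and 4 return the same words for this regex. The two modes only differ for patterns that can match across line breaks, or when you ask for the first match only, as Tasks 3 and 5 show.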

Task 6: Process the file as a block, and replace all matches

To replace with ZAP, use this: perl -0777 -pe 's/\bc\w+/ZAP/g' yourfile

Task 7: Process the file as a block, and replace the first match

To replace with ZAP, use this: perl -0777 -pe 's/\bc\w+/ZAP/' yourfile

Task 8: Process the file line by line, and replace all matches

To replace with ZAP, use this: perl -pe 's/\bc\w+/ZAP/g' yourfile

Task 9: Process the file line by line, and replace the first match

To replace with ZAP, use this: perl -pe 's/\bc\w+/ZAP/' yourfile
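The four replacement tasks differ only in scope (whole file or single line) and in whether the g flag is present. Here is a JavaScript analogue that makes the four outcomes visible side by side. As before, the line breaks of the input are an assumption, and the helper name is made up for the sketch.

```javascript
// Assumed layout of "yourfile" (the exact line breaks are a guess).
const input = "cat bat carrot book\ntrue blue\nred caramel\n";

// Apply a replacement to each line separately, keeping the line breaks.
function perLine(text, fn) {
  return text.split("\n").map(fn).join("\n");
}

const task6 = input.replace(/\bc\w+/g, "ZAP");                    // block, all matches
const task7 = input.replace(/\bc\w+/, "ZAP");                     // block, first match
const task8 = perLine(input, (l) => l.replace(/\bc\w+/g, "ZAP")); // per line, all matches
const task9 = perLine(input, (l) => l.replace(/\bc\w+/, "ZAP"));  // per line, first match

console.log(task6);
console.log(task7);
console.log(task8);
console.log(task9);
```

With this input, Task 7 changes only cat, while Task 9 changes cat and caramel, because "first match" is evaluated once per line rather than once per file.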

Task 10: Process the file as a block, and split

Use this: perl -0777 -ne 'if(@r=split(m/\bc\w+/,$_)){foreach(@r){print "$_\n";}}' yourfile

Task 11: Process the file line by line, and split

Use this: perl -ne 'if(@r=split(m/\bc\w+/,$_)){foreach(@r){print "$_\n";}}' yourfile
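To get a feel for what the two split tasks produce, here is a JavaScript analogue. Two caveats: Perl's split drops trailing empty fields while JavaScript's keeps them, so the hypothetical helper below strips them to mimic Perl, and the lines here carry no trailing newline, whereas with -n Perl sees each line with its newline still attached. The line breaks of the input are again an assumption.

```javascript
// Split roughly like Perl's split: JavaScript keeps trailing empty
// strings, Perl drops them, so remove them here.
function perlLikeSplit(regex, text) {
  const parts = text.split(regex);
  while (parts.length > 0 && parts[parts.length - 1] === "") {
    parts.pop();
  }
  return parts;
}

// Assumed lines of "yourfile", without their trailing newlines.
const lines = ["cat bat carrot book", "true blue", "red caramel"];

// Line by line: one array of fields per line.
const perLineFields = lines.map((l) => perlLikeSplit(/\bc\w+/, l));
// As a block: one array of fields for the whole text.
const blockFields = perlLikeSplit(/\bc\w+/, lines.join("\n"));

console.log(perLineFields);
console.log(blockFields);
```

Note the empty first field on the first line: the text starts with a match (cat), so there is nothing before the first delimiter.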

Amazing Firefox Shortcuts Using Regex

As a keyboard shortcut maniac, I love the keyconfig Firefox extension, which lets you manage conflicting shortcuts, as well as create fancy shortcuts by binding keys to pieces of JavaScript code. Recently, inspired by Mingyi Liu's terrific Fastest Search extension, I resolved to make my Firefox usage even faster by creating some fancy shortcuts using regex. On this page, I share the ones I've come up with so far. Please note that my approach here has been "quick and dirty", so I make no representations that either the JavaScript or the regex are ideal. I've included some related shortcuts that do not require regex. If you have suggestions, please send them along at the bottom of this page.

How to Set Up these Shortcuts

Install keyconfig, then run it (Ctrl + Shift + F12).
To activate the "Add a new key" button, you may need to use the pull-down at the top to change the selection, then use the same menu to return to where you were.
Click "Add a new key".
In the name field, insert a name that starts with a "0" so your shortcuts will stay together when you sort them by name.
In the code box, paste the code. Press OK.
Enter a key combination in the small box at the bottom and click Apply.
If there is a conflict, sort by shortcut and resolve.

Navigation Shortcuts

Navigate Up from the Current URL (suggested shortcut: Ctrl + Shift + U)

Pressing this shortcut repeatedly navigates up and up. First, the anchor part of the string (what follows the #), if any, is stripped. Next, the query part of the string (what follows the ?) is stripped. Then we navigate up the file path.

    var root = content.document.location.origin;
    var path = content.document.location.pathname; // includes the leading "/"
    // Do we have a hash anchor?
    if (content.document.location.hash) {
      // Go up by stripping the anchor
      var newloc = root + path + content.document.location.search;
      openUILinkIn(newloc,"current");
    }
    // Do we have a query string?
    else if (content.document.location.search) {
      // Go up by stripping the query string
      var newloc = root + path;
      openUILinkIn(newloc,"current");
    }
    else {
      // Go up one level: match everything up to the last "/" that is not at the very end
      var upRegex = /^.*\/(?!$)/;
      var matchArray = upRegex.exec(path);
      if (matchArray != null) {
        var upPath = matchArray[0];
        var newloc = root + upPath;
        openUILinkIn(newloc,"current");
      }
    }

Navigate Up Domain Hierarchy (suggested shortcut: Ctrl + Alt + U)

Pressing this shortcut removes the leftmost subdomain from the domain name. For instance, http://mail.google.com becomes http://google.com. Sure, you can end up with domains that don't exist. So what? The point is to use the shortcut when it can save you time.

    var host = content.document.location.hostname;
    var protocol = content.document.location.protocol;
    // Group 1 captures the host without its leftmost label
    var upRegex = /^[^.]+\.([^.]+\..*)/;
    var matchArray = upRegex.exec(host);
    if (matchArray != null) {
      var upPath = matchArray[1];
      var newloc = protocol + "//" + upPath;
      openUILinkIn(newloc,"current");
    }

Navigate to the Root of the Current URL (suggested shortcut: Ctrl + Shift + R)

For instance, if the url is http://www.rexegg.com/regex-firefox-shortcuts.html, the browser will navigate to http://www.rexegg.com. No regex needed.

    openUILinkIn(content.document.location.origin, "current");

Increment / Decrement URL (suggested shortcut: Ctrl + Shift + plus / minus)

This is for sites that have pages with incrementing numbers. Pressing the shortcut increments or decrements the last number of the url by one. Clearly, the regex needs to be adapted to particular cases (for instance, you may want to increment the next to last number).

Code to decrement:

    var url = content.document.location.href; // the entire url
    // Group 1: everything before the last number, Group 2: the number, Group 3: the rest
    var numRegex = /^(.*\D)(\d+)(.*)/;
    var matchArray = numRegex.exec(url);
    if (matchArray != null) {
      var num = parseInt( matchArray[2], 10 );
      var newnum = (num-1).toString();
      var newloc = matchArray[1] + newnum + matchArray[3];
      openUILinkIn(newloc,"current");
    }

Code to increment:

    var url = content.document.location.href; // the entire url
    var numRegex = /^(.*\D)(\d+)(.*)/;
    var matchArray = numRegex.exec(url);
    if (matchArray != null) {
      var num = parseInt( matchArray[2], 10 );
      var newnum = (num+1).toString();
      var newloc = matchArray[1] + newnum + matchArray[3];
      openUILinkIn(newloc,"current");
    }

Google this Site (suggested shortcut: Ctrl + G)

This shortcut opens a Google tab that searches for the word "gold" (to be replaced by what you need at the time) on the current site. No regex required.

    var site = content.document.location.hostname;
    var newloc = "https://www.google.com/?gws_rd=ssl#q=site:" + site + " gold";
    openUILinkIn(newloc,"tab");

Transform URL: toggle this site between the .com version and the .co.nz version

I took New Zealand addresses as an example: Adapt to your needs.

    var protocol = content.document.location.protocol; // e.g. "http:", without the slashes
    var host = content.document.location.hostname;
    var tail = content.document.location.pathname + content.document.location.search + content.document.location.hash;
    // Group 1: host without its ending, Group 2: optional ".co", Group 3: last label
    var tldRegex = /^(.*?)(\.co)?\.([^.]+)$/;
    var matchArray = tldRegex.exec(host);
    if (matchArray != null) {
      if (matchArray[3] != "nz") {
        var newloc = protocol + "//" + matchArray[1] + ".co.nz" + tail;
        openUILinkIn(newloc,"current");
      }
      else {
        var newloc = protocol + "//" + matchArray[1] + ".com" + tail;
        openUILinkIn(newloc,"current");
      }
    }

Transform URL: find Subtitles for this IMDB movie ID

The idea is to take one piece from the current url and to use it to visit a different url. The idea should be adapted for your needs. In this example, if you are on this IMDB page: http://www.imdb.com/title/tt0366551/, the shortcut will extract the IMDB ID and search for it on the Open Subtitles website.

    var path = content.document.location.pathname;
    var ttRegex = /tt\d+/;
    var matchArray = ttRegex.exec(path);
    if (matchArray != null) {
      var tt = matchArray[0];
      var newloc = "http://www.opensubtitles.org/en/search2?MovieName=" + tt + "&action=search&SubLanguageID=eng";
      openUILinkIn(newloc,"tab");
    }
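The regexes in the navigation shortcuts can be tried outside Firefox. The sketch below wraps each one in a plain function that takes a string and returns a string, with no call to openUILinkIn or content.document. The function names and the sample URLs are made up for illustration.

```javascript
// Parent of a path: everything up to the last "/" that is not the final character.
function upPath(path) {
  const m = /^.*\/(?!$)/.exec(path);
  return m ? m[0] : null;
}

// Host without its leftmost label, or null when only two labels remain.
function upDomain(host) {
  const m = /^[^.]+\.([^.]+\..*)/.exec(host);
  return m ? m[1] : null;
}

// Add delta (+1 or -1) to the last number found in the url.
function stepNumber(url, delta) {
  const m = /^(.*\D)(\d+)(.*)/.exec(url);
  if (m === null) return null;
  return m[1] + (parseInt(m[2], 10) + delta).toString() + m[3];
}

// Swap a host between its ".com" and ".co.nz" forms.
function toggleTld(host) {
  const m = /^(.*?)(\.co)?\.([^.]+)$/.exec(host);
  if (m === null) return null;
  return m[1] + (m[3] !== "nz" ? ".co.nz" : ".com");
}

console.log(upPath("/a/b/c.html"), upPath("/a/b/"), upPath("/"));
console.log(upDomain("mail.google.com"), upDomain("google.com"));
console.log(stepNumber("http://example.com/img7.png", -1));
console.log(toggleTld("www.example.com"), toggleTld("www.example.co.nz"));
```

Keeping the regex logic in pure functions like these makes it easy to adapt a pattern to a particular site and check the result before pasting it into a keyconfig shortcut.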

Shortcuts to Copy URL Fragments

Copy Page IDs (with Amazon & IMDB examples) (suggested shortcut: Ctrl + I) This shortcut aims to copy page identifiers for various sites. If specific sites interest you, you need to add them in the code. As is, the shortcut handles IMDB's tt and nm fields (such as tt0366551) as well as Amazon product (dp) codes. // Copies Identifiers ("Tags") for Various Sites // See location properties // http://www.w3schools.com/jsref/obj_location.asp var host = content.document.location.hostname; var path = content.document.location.pathname; var pathsearch = content.document.location.pathname + content.document.location.search; var tag; var tagRegex; var matchArray; // For each site we're interested in, set up a regex // The regex captures the tag to Group 1 // Depending on the site, we may match different components of the url if( /\bimdb\.com/.test(host) ) { tagRegex = /(?:title|name)\/([^\/?]+)/; matchArray = tagRegex.exec(path); } else if( /\bamazon\./.test(host) ) { tagRegex = /(?:dp|product)\/([^\/?]+)/; matchArray = tagRegex.exec(path); } // Don't know this site? Grab some digits else { tagRegex = /(\d+)/; matchArray = tagRegex.exec(pathsearch); } if (matchArray != null) { tag = matchArray[1]; var clipboard = Cc["@mozilla.org/widget/clipboardhelper;1"] .getService(Ci.nsIClipboardHelper); clipboard.copyString(tag); } Copy the entire url (suggested shortcut: Ctrl + Ins) No regex required. var clipboard = Cc["@mozilla.org/widget/clipboardhelper;1"] .getService(Ci.nsIClipboardHelper); clipboard.copyString(content.document.location.href); Copy the website's name (suggested shortcut: Ctrl + Alt + Ins) For instance, if the url is http://www.rexegg.com/regex-firefox-shortcuts.html, the clipboard will contain www.rexegg.com. No regex required. 
var clipboard = Cc["@mozilla.org/widget/clipboardhelper;1"] .getService(Ci.nsIClipboardHelper); clipboard.copyString(content.document.location.hostname); Copy the file name (without anchors and query fragments) (suggested shortcut: Ctrl + Shift + Ins) For instance, if the url is http://www.rexegg.com/regex-firefox-shortcuts.html#filename, the clipboard will contain regex-firefox-shortcuts.html var path = content.document.location.pathname; var pageRegex = /[^\/]+$/; var matchArray = pageRegex.exec(path); if (matchArray != null) { var page = matchArray[0]; var clipboard = Cc["@mozilla.org/widget/clipboardhelper;1"] .getService(Ci.nsIClipboardHelper); clipboard.copyString(page); } Copy the anchor fragment (suggested shortcut: Alt + Ins) For instance, if the url is http://www.example.com/main.html?s=1#top, the clipboard will contain #top. No regex required. var clipboard = Cc["@mozilla.org/widget/clipboardhelper;1"] .getService(Ci.nsIClipboardHelper); clipboard.copyString(content.document.location.hash); Copy the query ("search") fragment (suggested shortcut: Shift + Ins) For instance, if the url is http://www.example.com/main.html?s=1#top, the clipboard will contain ?s=1. No regex required. 
var clipboard = Cc["@mozilla.org/widget/clipboardhelper;1"] .getService(Ci.nsIClipboardHelper); clipboard.copyString(content.document.location.search); Copy the file name (including anchors and query fragments) (suggested shortcut: Ctrl + Alt + Shift + Ins) For instance, if the url is http://www.rexegg.com/regex-firefox-shortcuts.html#filename, the clipboard will contain regex-firefox-shortcuts.html#filename. In a url the query string comes before the anchor, so the pieces are joined in that order. var path = content.document.location.pathname; var hash = content.document.location.hash; var query = content.document.location.search; var pageRegex = /[^\/]+$/; var matchArray = pageRegex.exec(path); if (matchArray != null) { var page = matchArray[0]; var wholepage = page + query + hash; var clipboard = Cc["@mozilla.org/widget/clipboardhelper;1"] .getService(Ci.nsIClipboardHelper); clipboard.copyString(wholepage); }
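The regex parts of the copy shortcuts in this section do not depend on the clipboard, so they can be tested on their own. The sketch below is one way to do that. `extractTag` and `fileName` are hypothetical names, and the sample hosts and paths in the calls are made up for illustration.

```javascript
// Hypothetical helper: returns the page identifier captured in Group 1,
// or null when nothing matches. Mirrors the site-by-site logic of the
// "Copy Page IDs" shortcut.
function extractTag(host, path, search) {
  var matchArray;
  if (/\bimdb\.com/.test(host)) {
    matchArray = /(?:title|name)\/([^\/?]+)/.exec(path);
  } else if (/\bamazon\./.test(host)) {
    matchArray = /(?:dp|product)\/([^\/?]+)/.exec(path);
  } else {
    // Unknown site: grab the first run of digits in the path or query
    matchArray = /(\d+)/.exec(path + search);
  }
  return matchArray ? matchArray[1] : null;
}

// Hypothetical helper: the last path segment, as in the "Copy the file name" shortcut.
function fileName(path) {
  var matchArray = /[^\/]+$/.exec(path);
  return matchArray ? matchArray[0] : null;
}

console.log(extractTag("www.imdb.com", "/title/tt0366551/", ""));
console.log(extractTag("www.amazon.com", "/Some-Book/dp/B000123456/ref=sr_1_1", ""));
console.log(extractTag("example.org", "/item", "?id=42"));
console.log(fileName("/regex-firefox-shortcuts.html"));
```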

Other Useful Firefox Shortcuts

Next / Previous Tab (suggested shortcut: Ctrl + Alt + Left/Right Arrows) These shortcuts navigate to the next or previous tab. No regex required. Next Tab: gBrowser.mTabContainer.advanceSelectedTab(1,true); Previous Tab: gBrowser.mTabContainer.advanceSelectedTab(-1,true); Preferences (suggested shortcuts: F12 or Ctrl + K or Ctrl + , ) Opens the Firefox preferences. No regex required. openPreferences(); Add-Ons (suggested shortcut: X) Opens the Firefox Add-on page. No regex required. BrowserOpenAddonsMgr(); Back (suggested shortcut: Ctrl + Backspace) Navigates to the previous page. Browser:Back Location Bar (suggested shortcut: F4) I never liked Ctrl + L to target the location bar. This lets you program an alternate shortcut. openLocation();

Regex Tools

By "regex tools", I mean tools that either help you build your regular expressions, or where regex is at the core of the tool's function. In the page on regex uses, we saw other tools that happen to use regular expressions—but those are not the focus of this page. We'll be looking at three excellent tools. A few jump points:

RegexBuddy: the Rolls-Royce of regex tools

RegexBuddy is my absolute favorite regex tool. I can't do serious regular expressions work without it. This page used to have a short intro to the tool. Since then, I wrote the huge RegexBuddy tutorial I always dreamed of (the most comprehensive RegexBuddy tut I'm aware of), which now lives on my dedicated RegexBuddy tutorial page. Don't have time to read it? Download a free RegexBuddy trial.

Regex101

Although RegexBuddy is my regex building tool of choice, some people do not want to invest in a standalone tool, do not run Windows, or prefer the community features of online tools. As of April 2014, my favorite online regex tool by far is regex101. The coding world is awash with online regex testers. New tools come out all the time, and old favorites often fail to keep up with the release of their supported engines' new versions. Recently, I have been impressed with regex101. Don't get me wrong. The features cannot possibly rival RegexBuddy's depth. But regex101 is clean, it's easy to use, it has community features that RegexBuddy lacks, and, for many people, it may be the only regex tester they need. Here are the main features:
- Support for multiple flavors. Regex101 supports PCRE, JavaScript and Python.
- Regex tester with syntax highlighting. The match panel does a great job of colorizing the regex syntax. It shows you capture groups, allows substitutions and provides a detailed explanation of each token.
- Debugger. If your regex matches, the debugger panel shows you the steps taken by the engine.
- Code Samples. The tool supplies you with samples to use your expression in the flavors supported.
- Saving a regex to a short link. When you click the "save regex" button, you are supplied with a short link such as http://regex101.com/r/tP9zZ6 (try it!) This makes it easy to share expressions on forums.
- Chat. There is a chat room via an integrated IRC web client (qwebirc).
- Expressions submitted by users. When you click the "community" button, you see a list of expressions submitted by users, ranked by number of votes.

ABA Search and Replace

In late 2011, I fell in love with a compact tool called ABA Search and Replace. Searching for text (and replacing it) across multiple text files is not a new idea: grep was born in 1973. The best grep version for Windows (as far as I know) is actually available on this site, on the next page. What's new is that Peter Kankowski, ABA's talented programmer, has nailed the interface. This program is a joy to work with. In fact, it is so good that I find myself inventing tasks just so I have the chance to use it. First, let's briefly review the alternatives. Directory Opus, the stellar file manager for those who use a computer more than once a week, lets you build amazingly intricate searches (with or without regex) that can look at a file's contents, size, metadata and other attributes—but the interface does not respond like ABA's. Jan Goyvaerts has several tools with grep-like functionality, among which the expensive PowerGrep, whose interface I have not managed to understand. The grep-style features in Jan's RegexBuddy and EditPad Pro also leave me cold. The PCRE Grep command-line tool on the next page is light-weight and delightful, but it does not replace. Now a quick orientation to the ABA interface. At the top, three tabs: Search, Replace, Undo. (It's very cool that you can undo major replacements across many files. ABA does that by backing up your files. You specify the size of the file cache—only 20MB by default.) ABA Search Replace At the top of the picture, you can see the Search box, where you type or paste your regex. Then the Replace box, where you enter your replacement expression. Next, in the file box, you enter the name of the file you want to search, or a wildcard, such as *.txt. Finally, in the bottom pane, you see the matches and replacements. Now here's what I love about ABA: As you tweak your expression, the matches in the bottom panel change on the fly! So do the replacements. That is truly magical. 
The check boxes let you deselect instances you don't want to replace. You can copy all the lines that contain matches, or just the matching text. That's amazingly convenient when you are trying to trim down a huge file to a dozen lines of interesting data. The program supports variable-width lookbehinds (often helpful!). And basically, it just works. The other search-and-replace tools have an interface I don't find intuitive. Peter wrote his own regex engine so that ABA would support various encodings aside from ASCII and UTF-8, for instance UTF-16 LE. If you want to support good programming and treat yourself to a very cool power tool that is bound to save you hours of work sooner or later, I highly encourage you to spend the thirty bucks for a license, which at the moment includes free lifetime upgrades.
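If you have not met variable-width lookbehind before, here is a minimal sketch of the idea in JavaScript, whose modern engines (ES2018 and later) also accept quantifiers inside a lookbehind. It only illustrates the concept: ABA runs its own engine, whose syntax details may differ, and the sample string is made up.

```javascript
// The lookbehind contains \s*, so the text it checks has no fixed width:
// "total:" may be followed by any number of spaces before the digits.
var amountAfterLabel = /(?<=total:\s*)\d+/g;

var sample = "total: 15, count:12, total:    40";
var amounts = sample.match(amountAfterLabel);

console.log(amounts); // [ '15', '40' ]
```

Only the digits are matched. The label and the spaces are checked but not consumed, which is what makes lookbehind handy in search-and-replace work.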

TextDistill

The regex-oriented tool that most impressed me recently is TextDistill. This (currently) free program firmly sits in a fascinating niche that seems largely ignored by software developers: text processing. Sure, you can search-and-replace in your text editor, or perform a number of advanced actions in a professional publishing package such as InDesign. But there is so much more you can do to a body of text. TextDistill fills the need to apply a stack of regex-directed actions to an input text. Pushing your original raw text (and its consecutive transformations) through that stack, you may end up at the other end with an unrecognizable polished jewel. This metamorphosis is the fruit of distilling the text through the various filters (called recipes), hence the product's name: TextDistill. The main window lets you create tabs, one for each text-processing project. When you first install TextDistill, several sample tabs are open: by exploring each of them, you can learn a lot about how the program works. In each tab, there are four panes: the input text; the stack of recipes to apply to the text; details for the currently selected recipe; and the output text. For your stack of transformations, you can pick and choose from a number of pre-programmed recipes. Sure, you can apply a simple regex find-and-replace, and this will probably be your staple transformation. But there is a host of other transformations to pick from. For instance:
- Remove lines containing any of…
- Skip a number of lines
- Select first unique line matching…
and many others. The only other product I'm aware of in this space is Text Pipe, which is offered at the kind of price where you start to think "Thanks but I'll just write a quick script": in one word, obscene. TextDistill uses .NET regular expressions, one of the most powerful regex engines available (together with PCRE, Perl and Matthew Barnett's regex module for Python).
If you ever need to transform text in non-obvious ways, I highly recommend it.
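To make the "stack of recipes" idea concrete, here is a rough sketch in JavaScript. It is not TextDistill's implementation (that program uses .NET regular expressions). The recipe names `skipLines`, `removeLinesContaining` and `regexReplace`, the `distill` function and the sample text are all hypothetical, loosely modeled on the recipes named above.

```javascript
// Each "recipe" is a function that takes text and returns transformed text.
function skipLines(count) {
  return function (text) {
    return text.split("\n").slice(count).join("\n");
  };
}

function removeLinesContaining(words) {
  return function (text) {
    return text.split("\n").filter(function (line) {
      return !words.some(function (word) { return line.indexOf(word) !== -1; });
    }).join("\n");
  };
}

function regexReplace(regex, replacement) {
  return function (text) { return text.replace(regex, replacement); };
}

// Pushes the text through the stack, feeding each recipe the previous output.
function distill(text, recipes) {
  return recipes.reduce(function (current, recipe) {
    return recipe(current);
  }, text);
}

var input = "Report header\nid=7 name=ann\n# comment\nid=12 name=bob";
var output = distill(input, [
  skipLines(1),
  removeLinesContaining(["#"]),
  regexReplace(/id=(\d+) name=(\w+)/g, "$2 ($1)")
]);

console.log(output);
```

Because every recipe has the same text-in, text-out shape, recipes can be reordered or removed without touching the others, which is the appeal of the stack model.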

Other Regex Tools

Here are some regex creation tools I've tried or heard about. Software by Jan Goyvaerts: official website, free trials As you already know, I'm a big fan of the software by Jan Goyvaerts, the author of RegexBuddy. His products revolve around his outstanding regex engine. Here's a list of his products and free trials.
- RegexBuddy: the most powerful regex tool on Earth. A free RegexBuddy trial is available.
- EditPad Pro: the most regex-aware text editor. A free EditPad Pro trial is available.
- PowerGREP: perhaps the most powerful text search and manipulation tool. A free PowerGREP trial is available.
- RegexMagic: build regex without knowing regex. A free RegexMagic trial is available.
- AceText: your text needs from a central location. A free AceText trial is available.
- HelpScribble: create documentation files and websites. A free HelpScribble trial is available.
- DeployMaster: no-nonsense installation builder. A free DeployMaster trial is available.
Online Tools I'm quite fond of RegexPlanet because it supports an astounding array of flavors: Go, Haskell, Java, JavaScript, .NET, Perl, PHP, Python, Ruby, Tcl, XRegExp. Debuggex is an interesting tool because it generates visualizations of your expressions. You can then share the links, as you'll see if you click the image. I'm not what one might call a "visual person", but it seems to me that the Debuggex visualizations could be a powerful way of walking such a person through a regular expression. For JavaScript coders, RxInput is a very interesting component that validates as you type, completing the field when no other matching paths are left. For Ruby, Rubular is also well-liked. I used to like Lars Olav's tester for its three flavors, but it seems to have fallen a bit behind. Stand-Alone Regex Testers For desktop-side testing, I'll be perfectly frank: I don't see any tool improving on my RegexBuddy experience, so I haven't tested the tools below properly.
But not everyone wants to pay for a regex tool. There are several free alternatives. I haven't tried any of them because I love RegexBuddy and I'm lazy. Here are links to tools I've heard of: Expresso, The Regulator, Rad Software Regular Expression Designer.

Another tool you may want to list is ReX-T (https://apps.apple.com/de/app/rex-t/id1120211452). We just published version 4 and think it would complete your list (disclaimer: I'm one of the developers).

Have you looked at verbal expressions? It's a verbose way to define regex using a jQuery-like chaining style. I think it's quite elegant and goes a long way to make regex human understandable. You should check it out and see if you can integrate it with your site here.

Of course there are a number of ways people abstract regular expressions, from popular parsing libraries that use regex under the covers to RegexMagic, a product that aims to let you express the search with pull-down menus.

As an author of regex related material, I thought you might be interested in a new tool I have created. Textpression is a Windows application that allows users to avoid regex syntax and work visually with regex. Appreciate you probably get thousands of these requests, but any feedback would be welcome; especially via the Facebook page. Please also feel free to discuss Textpression with anyone and everyone you think might be interested. Many thanks for your time. David Howes (Textpression Software's Creator)

Thanks for writing. I am flat out in the middle of a coding project at the moment, but I'd love to have a look when the skies part. In the meantime, I've added your message to the Regex Tools page.

Subject: Another free regex tool option This tool is similar to RegexBuddy and built on the (fairly robust) .NET regex libraries: http://www.ultrapico.com/Expresso.htm Worth looking at, especially while learning.

The Huge RegexBuddy Tutorial

TL;DR Simply put, RegexBuddy (RB for short) may be the only program you'll ever need to create and test your regular expressions, whichever programming language and regex engine you use. To get started, get the free RegexBuddy trial. With that said, if you participate on forums and StackOverflow, you may like to share links to patterns you've tested, and if so you may also want to use tools mentioned on my main regex tools page, as RegexBuddy does not have a regex sharing service at this time. When he finished resting, God created RegexBuddy RegexBuddy on other platforms RegexBuddy is a Windows program. There's nothing like it on 'nix or OSX, and there probably never will be—for the simple reason that being intimate with that many regular expressions flavors is a life's work. If you're on linux or OSX and want to do some serious regex work, your best option is to run RegexBuddy with virtualization software such as VMWare, Parallels or Wine. For Wine, read this first. About this page I've been one of RegexBuddy's many rabid fans for a while now. This tutorial aims to give a comprehensive introduction to RB's features. I hope to share that constant feeling of awesome one has when working in RB. I set out with the goal of writing a thorough review of a product I know intimately. A week later, I ended up with something that feels more like a second manual. At the top of most topics, I've added a link that looks like this: . This is to enable you to point others to that particular topic if you need to. Jumping Points For easy navigation, here are some jumping points to various sections of the page:

Can you afford RegexBuddy?

Let's get the money question out of the way. Can you afford RegexBuddy? That depends. If you're a teenager trying to scrape the web pages of your high-school's year book, maybe not. If you're a programmer who gets paid to write code, there's a good chance. RB costs about forty bucks. But you don't have to decide now: you can get started with the free RegexBuddy trial. In my view RegexBuddy is one of those programs that are worth their weight in golden bytes. Nearly all of the examples on this website were tested in RB. That being said, RB is not perfect—what is?—and a number of committed users pester Jan with minor feature requests, to which he usually responds.

Who is behind RegexBuddy?

RegexBuddy is the brainchild of a cool dude called Jan Goyvaerts. You can see him here proudly wearing what seems to be a fake RexEgg shirt while (I'm guessing) holding a RexEgg mug in his left hand and a RexEgg pen in the right (off-camera). To implement RegexBuddy's cross-language, cross-engine features, Jan has to stay abreast of several dozen regular expression flavors. This makes him one of the world's foremost experts on regular expressions. Among other things, Jan is…
- the co-author of the Regular Expressions Cookbook, 2nd Ed, which is one of the only two regex books worth reading.
- the author of the regular-expressions.info website, which is the top-rated regular expressions website in Google, a rank to which RexEgg is unlikely to catch up before Jan retires.
- the author of the proprietary JGsoft regex engine, one of the most powerful regex engines around, which powers RegexBuddy and Jan's other software products.
- the author of a brilliant regex-friendly text editor called EditPad Pro (a free EditPad Pro trial is available) and of a powerful text-processing tool called PowerGREP (you can download a free PowerGREP trial). Both of these use the aforementioned JGsoft regex engine.
- the owner of two of the legendary (and now rare) DataHand keyboards.
- the co-star (next to Johnny Depp) of the long-awaited Lord of the Regex.
One of the items on this list is fictional. Can you tell which? To earn a free RegexBuddy trial, fedex your answer to Bill Gates.

What is RegexBuddy?

RegexBuddy is a tool that helps you at all stages of a regular expression's life. It can help you: Create and test regular expressions for the specific regex engine of your target platform—Python, Java, .NET, JavaScript, C, PHP, Perl and countless others. Perform real actions on real text and even web pages: extract all matches in a document; decorate extracted information to build a useful list; perform complex replacements; split a file into several components. Either build patterns with expert assistance as you tell the program the kind of tokens you'd like to insert, or build patterns by typing them in yourself. Either way, RegexBuddy gives you instant feedback on the matches in your text. Switch back and forth between potential regex variations using a pattern "scratch pad". Manage your favorite expressions by saving them to a library. Export such explanations in various formats such as text and html. Match or replace text in files thanks to the integrated GREP tool. And more (really!)

RegexBuddy speaks your language

When you create or test a regular expression in RegexBuddy, you select its target environment. Since different languages use different regex engines which differ in features and behavior, you'll want to specify whether you're working in C#, Java, Python, Ruby etc. But it gets more specific. Regex engines are constantly updated, so you'll also want to tell RegexBuddy which version of the engine or language you're working with, for instance Java 13 or PCRE 25.6. (Yeah I know, these might not be out yet.) RegexBuddy knows all these details! Through these combinations of engines, languages and versions, RegexBuddy supports over two hundred target environments. You can even add flavors whose features you define yourself. You don't have to look at this smörgåsbord of languages in the drop-down list: at the top of the list, after you select More applications and languages (highlighted in yellow on the image), the full list appears with check boxes allowing you to populate the everyday drop-down you'll be using. Translating between languages Sometimes, you have a PCRE regex and need to make it work in JavaScript. Will it work? JavaScript's regex flavor is so crippled that there's a good chance a token won't work and you will have to look for a different idiom. RegexBuddy lets you compare what tokens mean in different regex flavors, as well as translate from one flavor to another. We'll look at these features later.
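As a small illustration of the translation problem, the sketch below asks the JavaScript engine itself whether it accepts a pattern. `compilesInJavaScript` is a hypothetical helper name. The sample patterns use possessive quantifiers and atomic groups, two PCRE features that JavaScript's `RegExp` rejects, plus a lookahead-and-backreference idiom that is often used as a stand-in for an atomic group.

```javascript
// Hypothetical helper: true if the JavaScript engine can compile the pattern.
function compilesInJavaScript(source) {
  try {
    new RegExp(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

var candidates = [
  "a+b",          // plain syntax, fine everywhere
  "a++b",         // PCRE possessive quantifier
  "(?>a+)b",      // PCRE atomic group
  "(?=(a+))\\1b"  // lookahead + backreference idiom
];

candidates.forEach(function (source) {
  console.log(source, compilesInJavaScript(source));
});
```

A check like this only tells you whether a pattern compiles. It says nothing about whether a token that compiles in both flavors means the same thing in each, which is the harder part of translating.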

Quick Start: A few tweaks so you can follow along

To call this section a quick start might be false advertising as this tutorial is taking you on a journey into the depths of RegexBuddy. By the end, you'll be well on your way to becoming a RegexBuddy ninja. But if you're planning to follow along, I thought I'd mention some quick adjustments so that your screen looks much like mine. Full-screen mode Go ahead, maximize the RegexBuddy window if you haven't already done so. When working with regular expressions, there's a lot you need to keep track of. I can't work in RB unless it's in full-screen mode. Default layout RegexBuddy layout menu On most of my screenshots, all eight of the main tabs are in a single row. If you're in a different layout, you may want to switch back to the default. Click the View menu (circled on the image) and save your layout if you like it (Custom layouts / Save layout), then pick Restore default layout. Unable to restore the default layout? See this tip. Still in the View menu, make sure that Lock Toolbars is checked. This will prevent you from repositioning interface elements inadvertently. Show controls as toggles RegexBuddy preferences menu Some toolbar controls can either be shown as drop-down menus or as toggle buttons. To see the same as on my screenshots, click the Prefs menu (circled on the right), click the Operation tab and select the third option (Show options that can be changed with toggle buttons that indicate the "on" state). Pimping up the interface There are more tweaks we can do to the interface (colors schemes! fonts!), and we'll get into those later, but for now this will do. Let's dive into the interface.

RegexBuddy Interface 101: the Main Tabs

The RegexBuddy interface looks innocent and simple, but it hides a lot of power. Let's dive in! The core of the interface, the section you need to understand first, is the set of eight tabs shown on the picture below. You can reposition the tabs, so your list might be in a different order. These eight tabs determine what we are doing:
1. explaining a regular expression (Create),
2. translating between flavors (Convert),
3. building a pattern and running it against sample text or files (Test),
4. tracing the engine's matching path (Debug),
5. integrating the pattern in your target language (Use),
6. storing and retrieving patterns from custom libraries (Library),
7. using the pattern to search your files (GREP),
8. interacting with Jan and RegexBuddy users (Forum).
In later sections, we'll look at each of these tabs.

RegexBuddy Interface 102: the Three Modes

When you work on a regular expression in RB, you select one of three modes, corresponding to the three main tasks we set out to perform with regular expressions: 1. Match 2. Replace 3. Split This selection takes place in the very top toolbar. RegexBuddy's three modes: match, replace and split In Match mode, RegexBuddy highlights the matches in the text and builds additional views showing extracted components. In addition to matching, Replace mode lets you specify a replacement pattern, while Split lets you divide the sample text using the pattern as a delimiter. Depending on which mode you select, the section below the toolbar presents a slightly different face. For instance, in Replace mode, there is a text box for the replacement pattern.
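The three modes correspond to three operations that most regex APIs expose. As a rough illustration in JavaScript (the subject string and the delimiter pattern are made up for this example), the same pattern can drive all three:

```javascript
var subject = "apples, pears;plums ,  figs";

// A comma or semicolon, with optional whitespace on either side
var delimiter = /\s*[,;]\s*/g;

// Match: list every piece of text the pattern matches
var matches = subject.match(delimiter);

// Replace: substitute each match with a replacement pattern
var replaced = subject.replace(delimiter, " | ");

// Split: use the matches as delimiters and keep what lies between them
var pieces = subject.split(delimiter);

console.log(matches);  // [ ', ', ';', ' ,  ' ]
console.log(replaced); // apples | pears | plums | figs
console.log(pieces);   // [ 'apples', 'pears', 'plums', 'figs' ]
```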

RegexBuddy Interface 103: the Pattern Boxes

Below the buttons for the three operating modes, you'll find the box where you develop your regular expression patterns. There are two main ways to work in this window:
1. Freehand regex writing. You just type your expressions in the window.
2. Assisted token insertion. You insert tokens by picking them from a list.
You can use both methods together. For instance, while I work in freehand mode 99% of the time, when I've forgotten the syntax for, say, a named capture group in C#, I just get RegexBuddy to insert it. Assisted Token Insertion: two ways There are two ways to have RB insert tokens for you.
- You can right-click in the text box for the pattern.
- You can go to the Create tab and use its "Insert Token" pull-down menu.
Unicode is one place where I find token insertion particularly useful. The insert pull-down allows you to insert:
- specific characters (such as \x{20AA} for ₪) which you select on a character map
- tokens to match specific scripts, such as \p{Hiragana}
- tokens to match specific categories, such as \p{Sm} to match a math symbol.
Syntax Highlighting The various tokens are colored according to their semantic value. For instance, on the picture, you can see that character classes have a lovely shade of ocher. In addition, when you move around the regex with the cursor, the closest nesting construct (such as parentheses or square brackets) is highlighted in aqua. See the aqua parentheses at the end of the line? The aqua makes it easier to know where you are and to identify closing parentheses. You can customize colors in the Preferences menu, which is accessed by clicking the icon shown on the right (it lives at the top right of the interface). I keep the colors the way they are, but a decent easy tweak would be to use the provided "white on black" theme. Love that color? Use it everywhere. Anytime you see colored syntax in RegexBuddy, if you paste it in programs that understand rich text formatting, the coloring comes with it. The screenshot shows an RB regex in MS Word. Jan tells me that Word messes up the colors by reducing them to a 16-color VGA palette while simple old WordPad does not mess them up. Replace Text Box In Replace mode, the regex pattern's text box shrinks to make space for a second text box. That is where you specify the replacement pattern. If you've forgotten the back-reference syntax of your target engine, right-click inside the text box: RegexBuddy offers to insert the token for you.
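Unicode tokens are a good example of syntax that changes from flavor to flavor. As a sketch, here is how the three kinds of token listed above can be written in JavaScript, where they require the `u` flag and where the script property is spelled out as `Script=`. The variable names are made up for this illustration.

```javascript
// A specific code point: JavaScript writes \u{20AA} where the example above uses \x{20AA}
var sheqel = /\u{20AA}/u;

// A script: JavaScript needs the property name, \p{Script=Hiragana}
var hiragana = /\p{Script=Hiragana}/u;

// A general category: \p{Sm} (math symbol) works as is
var mathSymbol = /\p{Sm}/u;

console.log(sheqel.test("\u20AA"));   // the sheqel sign itself
console.log(hiragana.test("\u3042")); // HIRAGANA LETTER A
console.log(hiragana.test("\u30A2")); // KATAKANA LETTER A, a different script
console.log(mathSymbol.test("+"));    // PLUS SIGN is in category Sm
console.log(mathSymbol.test("-"));    // HYPHEN-MINUS is punctuation, not Sm
```

This is the kind of detail that is easy to forget, which is why letting a tool insert the token for the selected flavor saves time.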

RegexBuddy Interface 104: the History Pane

At the top right of RegexBuddy, you'll find a little box that I use all the time: the History pane. Jan intended for the History pane to be a scratch pad where you temporarily stash various versions of expressions you're working on. For that purpose, it works brilliantly. You can add, remove and rename patterns, and you can reorder the list. What Jan didn't foresee is that some people would use the History pane to stash expressions on a semi-permanent basis. For that purpose, he had created the Library tab. But using the Library usually means you have to shift to another tab, and people are lazy. (You can actually create a layout where the library replaces the history pane.) I spend most of my RB hours in the Test tab, and the History pane makes it easy to manage multiple expressions I'm working on. When I build something that I feel I might reuse, I move it to the library. One way I like using the History pane is to name a number of presets after different regex engines. That way when I need to build something for C#, I just click the C# item and I'm ready to go. As a result of the History pane's mission as a scratch pad, it lacks features that you'll soon miss if you use it a lot, such as saving different subjects for different patterns. Backups also involve some leg work, such as moving items to a library or messing with the INI file. Warning: don't lose all your history One thing that's happened to me too many times is that to delete a pattern, I've mistakenly clicked the "Clear All" button (circled on the right) instead of the Delete button. As of RB 4.6.1, there's no confirmation dialog. In my view that's a big problem that needs to be fixed as hours of work can too easily be lost. Don't be that guy! In the meantime, I've provided a solution that you'll find later on the page: an AutoHotkey script that, in addition to providing some cool keyboard shortcuts, makes it impossible to click on the buttons of the History pane.
In my view it's well worth spending the two minutes to install AutoHotkey just for this. My two cents: the History pane should just be a special Library Given the way many people are using the History pane, it seems to me that the history pane should just be a special library (history.rbl)—with all the benefits of a library, such as the ability to store dedicated subjects for each pattern. In fact Jan has announced that this is his exact plan. Often, you want to use the same text for multiple patterns. Cloning text and patterns across multiple entries of a special ScratchPad.rbl library could be gracefully handled by an item on the revised History pane's menu. A number of controls and menus complete the interface, but it's time to dive into the tabs. We'll visit the rest of the interface later. Alright, let's dive into the tabs.

Create tab: Explain regular expressions

The title of this tab—Create—is a bit of a misnomer, perhaps a leftover from earlier versions. I've used this tab a lot, but not once have I used it to create a regex pattern! Sure, you can use the Insert Token pull-down menu to have RB's gentle hand pick the right syntax for a Unicode titlecase letter. That's wonderful, but you can also do that directly by right-clicking in the regex text box, without ever leaving the comfort of the Test tab. So what's the point of the Create tab? My view is is that in the next major RB version it should be renamed the Explain tab. I'll warn you now that I'll be thick-headed about this for the rest of the tut. Besides, who doesn't like the word Explain? It reminds me of happy times spent in a SQL console—SQL, another declarative language. Besides inserting tokens, here is what the Explain tab (mmm, sorry, the Create tab) does really well. Explain a Regular Expression and a Replacement Pattern RegexBuddy explains a regular expression Most of the real estate in this tab is dedicated to explaining the regex, as well as the replacement pattern if there is one. The explanation lines "talk" to the regex and the replacement text boxes. When you click somewhere in the regex or replacement pattern, the corresponding line is highlighted in the explanation. When you click on a line of the explanation, the cursor moves in the regex or replacement box. This makes it really easy to pinpoint whatever section you need to clarify. Even if you think you know regex syntax backwards and forwards, this can be surprisingly helpful, as RegexBuddy provides way more detail than most of us keep in our heads, especially if you select the detailed mode (circled on the pull-down). Feel like taking the challenge? Select the PCRE regex flavor from the pull-down at the top left, then paste (?m)^ in the regex window. In the mode selectors at the top, make sure "LF Only" is not selected. 
In the Explain tab (cough… Create) select Detailed mode, click on the ^ in the regex window, look at the explanation. You knew all that, so what's the big deal? Now switch the regex flavor to Java. Holy smokes! The explanation is nearly twice as long. Sharing Regex Explanations Share regex explanations from RegexBuddy What I've used the Explain tab (yeah, I know) the most is to share regex explanations. For this, there are two magic buttons: Export and Copy. For some reason, Copy doesn't yet have a label, but I'm sure Jan will fix that down the line. The two buttons offer the same three ways of formatting the explanation (plus two variations). The difference is that Export outputs the explanation to a file, while Copy outputs it to your clipboard. Here are the three core formats: RegexBuddy outputs html explanation 1. Plain text. As the name suggests, with pleasant indentation. 2. Markdown. This is perfect if you're answering questions on StackOverflow and don't feel like crafting your own explanation. (Not sure if that's the right strategy but it will give you lots of ink very fast.) 3. Html. This creates code for a beautiful interactive webpage. Thanks to some JavaScript, the pattern and the explanation talk to each other: depending on where you hover, both the token and the explanation line are highlighted. Neat! Compare the meaning of a pattern in various regex flavors RegexBuddy compare meaning of pattern in various regex engines This is another feature of the Explain tab (no apologies this time) that I've found immensely helpful: it enables you to compare what your regex would mean (if anything) in a different engine. In the pull-down menu, you select one target regex flavor—or, by selecting the More applications item at the top, you check boxes in order to run a multi-language comparison. The explanations are often illuminating: you realize that syntax that seems identical on the surface handles a number of details differently. And where is the devil? 
In the details. No wonder some people say the devil invented regex. For me, the compare function might be at its most useful when comparing the meaning of single tokens rather than full patterns, for which explanations explode into more lines than I can handle. For instance, if you'd like to take the Compare function for a spin, have a go at running it for a regex containing this single token: ^. Hint: include Ruby in the list of engines you select.

One last thought about renaming the Create tab to Explain

One added benefit is that each of the eight tabs would have a unique initial: Test, Use, Library, Explain, Convert, Debug, Grep, Forum. That's a dream situation for keyboard shortcuts (at the moment the C in Create clashes with Convert). The custom keyboard shortcuts script I share at the bottom takes advantage of this (in the meantime, E is also in crEate).
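To see why a lone ^ can deserve such a long explanation, here is a minimal Python sketch. Python's re module is used purely as an illustration (it is not what RegexBuddy runs), and the sample string is made up. It shows that ^ changes meaning with multiline mode, and that Python only treats \n as a line break, which is exactly the kind of detail on which engines differ.

```python
import re

# Hypothetical subject mixing three line-break styles: \r\n, lone \r, lone \n.
subject = "one\r\ntwo\rthree\nfour"

# Without multiline mode, ^ only matches at the very start of the string.
start_only = re.findall(r"^\w+", subject)
print(start_only)        # ['one']

# With (?m), ^ also matches right after each \n. Python's re does not treat
# a lone \r as a line break, so "three" is not considered a line start.
every_line = re.findall(r"(?m)^\w+", subject)
print(every_line)        # ['one', 'two', 'four']
```

Other engines make different choices here (Ruby's ^ behaves as if multiline mode were always on, as the Convert section mentions), which is what the Compare function spells out.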

Test tab: Write and apply regular expressions

The Test tab is where I spend most of my time. I love it. On the screenshot, you can see that we work in three zones: the regex, the text, and the results. What you do in the regex box immediately affects what you see in the other two, as long as you have the right options selected (more about this shortly). Likewise, if you change the text, the highlighted matches change, as do the results in the bottom pane.

On the screenshot, the middle pane is the subject text. One of the matches is highlighted. In the bottom pane (the results pane), I have asked RB to display the text captured by group 2.

Some options you MUST know about in the Test tab

In order to see beautiful results as in the screenshot, some things have to be in order. Sure, the regex has to match, you must select the right mode (Match, Replace, Split) and you must make sure the various matching controls are right. But even when all of this is right, the results may not show what you expect unless you pay attention to the following.

Highlight. Toggling this button turns the highlighting on and off in the text pane. The pull-down menu offers to highlight one of the capture groups as well. I leave highlighting on most of the time.

List All / Replace All. The screenshot lies: you only see the List All part when in Match mode and the Replace part while in Replace mode. And of course they have a cousin: Split mode. The List All and Replace pulldowns are extremely important because they drive what you see in the Results pane at the bottom. In both cases, you usually want to click the pull-down menu and ensure that the Update Automatically option is checked. Within the List All pulldown, you can choose to show the matches, selected groups or two composite views. In the Full details view, pay close attention to the + symbol in the left margin, as you'll need to click it to see match details.
If you want to count matches, at the moment, this view is the most convenient way to do so; since the matches are numbered, just navigate to the bottom. In the Replace All pulldown, you can choose either to replace all the matches directly in the text pane, or to build a list of replacements in the Results pane. You can copy text from either pane.

Whole File / Line by Line. This pull-down determines whether the regex is applied one line at a time, one page at a time, or to the whole file. This makes a big difference! Often, when I scratch my head as to why a pattern is not working, this setting is the culprit.

Line Breaks. Sometimes, line breaks really matter to your regex, for instance when you'd prefer the dot and $ anchor to only consider the \r\n combination as a line break, but neither lone carriage returns nor new lines. The line breaks pull-down is for such times: it can override automatic conversions. For everyday work, select Automatic line breaks. The details are intricate: if you want to know more, read the docs.

Other controls in the Test tab

I'll briefly mention the other controls in the tab.

The Open pull-down at the very left lets you load a text file into the test field.
The Save pull-down lets you save test text or results.
The Paste pull-down offers an array of ways to paste your clipboard. I've never used it but I'm sure it's valuable to some.
The Open url pull-down is the scraper's friend. It lets you enter a url, then downloads the page and dumps it into the text pane. The pull-down remembers sites you've visited.
The Debug button activates the Debug tab. Its usage is explained in the next section.
The back and forward arrows navigate the highlighted matches.
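For readers who think in code, the Match, Replace and Split modes and the Whole File / Line by Line setting map onto familiar operations. The sketch below uses Python's re module with a made-up subject and pattern. It only illustrates the concepts and is not how RegexBuddy works internally.

```python
import re

# Hypothetical subject and pattern, for illustration only.
subject = "width=10\nheight=20\ndepth=30"
pattern = re.compile(r"(\w+)=(\d+)")

# Match mode, listing the text captured by group 2 for every match.
group2 = [m.group(2) for m in pattern.finditer(subject)]

# Replace mode: swap the two groups in every match.
replaced = pattern.sub(r"\2=\1", subject)

# Split mode: cut the subject wherever the regex matches (here, line breaks).
pieces = re.split(r"\n", subject)

# "Whole file" versus "line by line": the same anchored pattern gives
# different results depending on what counts as the subject.
anchored = re.compile(r"^\w+")
whole_file = anchored.findall(subject)                      # one subject
line_by_line = [m.group() for line in subject.splitlines()
                for m in anchored.finditer(line)]           # one subject per line

print(group2, replaced, pieces, whole_file, line_by_line, sep="\n")
```

The last two results differ (one match versus three), which is the kind of surprise the Whole File / Line by Line pull-down can cause.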

Debug tab: Trace the engine's matching path

The Debug tab shows you the engine's path as it attempts matches on the subject. This is handy for two kinds of situations:
- when you want to show someone how an engine works,
- when you're at a loss as to the engine's behavior.

When using the Debug tab, I switch to my Debug custom layout (see the section on layouts), which has the Test and Create tabs to one side and the Debug tab to the other.

How to launch the debug trace

If you switch to the Debug tab, you'll notice that it shows you nothing like what's on the screenshot. Instead, it tells you to first switch to the Test tab, because that's where debug operations are launched. If you want to debug the match attempts starting at a particular position in the subject string, place the cursor at the corresponding position in the text box. If you intend to test everywhere, no need for that step. Next select the Debug pull-down menu that lives within the Test tab. There are three options:
- Debug Here is for when you want the debugger to show exactly one match attempt starting at the position in the text where you have your cursor (or at the start of the highlighted match if the cursor is on a highlighted match),
- Debug Till End is for when you want the debugger to show all match attempts from your cursor position to the end of the string,
- Debug Everywhere is for when you want the debugger to show all match attempts from the beginning of the string till the end.

Why wouldn't you want to always use Debug Everywhere? Because it can generate a lot of noise in the results of the Debug tab. Usually, when you debug, you know exactly where you expect the match attempt to succeed, so you can skip the other match attempts. If that's the case, place your cursor at the string position where you expect a match and select Debug Here. In fact, that's too many clicks: without touching the pull-down menu, click the Debug button directly as Debug Here is its default action.
Understanding Debug Results Reading the results of the debug output takes a bit of getting used to. If you have multiple match attempts, you first need to expand them and pick one to examine (one of the reasons why Debug Here is often a better option). The key to using the Debug pane is to know that it "talks" to three other components. When you click somewhere in the debug window, the relevant sections of these elements (if currently displayed) are highlighted: - the regex pattern, - the text box in the Test tab, - the explanation in the Create tab. That's why the side-by-side layout is helpful. Debugging in RegexBuddy Static View RegexBuddy's debug trace is a wonderful feature. And yet I don't use it all that much because the presentation is not ideal for how I grasp things. Ideally, I would like to see two columns in the Debug tab: one showing a "string pyramid" (as it currently does), one showing the token being attempted on each line (or the beginning of the sub-expression if it's too long to display). For many debugging tasks, this would make it easier for me to understand what is happening because I would be looking at a static view (an unchanging picture), in contrast to the current view, which is "cinematic": you have to move through the match attempt to see the tokens being attempted at each step. Hopefully, such a view will be added in a future release.
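If the idea of an engine's "path" feels abstract, a toy backtracking matcher can make it concrete. The sketch below is my own illustration in Python, not RegexBuddy's debugger or any real engine: the function name trace_attempt is hypothetical, and it only understands literal characters, the dot, and a greedy star after a single character. Like Debug Here, it traces exactly one match attempt from a given position.

```python
def trace_attempt(pattern, text, start=0):
    """Run one match attempt at `start` and record every step.

    Toy engine: literals, '.', and a greedy '*' after a single character.
    Returns (end_offset or None, list of (remaining_pattern, matched_so_far)).
    """
    steps = []

    def char_ok(p, t):
        # Does the pattern character at p accept the text character at t?
        return t < len(text) and pattern[p] in (".", text[t])

    def attempt(p, t):
        steps.append((pattern[p:], text[start:t]))
        if p == len(pattern):
            return t                          # success: end offset of the match
        if p + 1 < len(pattern) and pattern[p + 1] == "*":
            n = 0
            while char_ok(p, t + n):          # greedy: grab as much as possible
                n += 1
            while n >= 0:                     # then give back one character at a time
                end = attempt(p + 2, t + n)
                if end is not None:
                    return end
                n -= 1
            return None
        if char_ok(p, t):
            return attempt(p + 1, t + 1)
        return None

    return attempt(0, start), steps


end, steps = trace_attempt("a*ab", "aaab")
for remaining, matched in steps:
    print(f"matched {matched!r:8} next token(s): {remaining!r}")
```

Each printed line pairs the text matched so far with the part of the pattern about to be tried, which is roughly the two-column static view wished for above.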

Use tab: Generate code for your programming language

The Use tab helps you integrate your regex into the programming language of your choice by generating code. In the Function pull-down, you select one of several templates, such as Iterate over all matches and capturing groups. RegexBuddy Use tab Depending on the template, RegexBuddy may present text boxes for you to enter custom names for variables used in the generated code. In the above screenshot, notice how I typed news_post and have_title, and how RB used these names in the code. Perfectly formatted regex strings One of the benefits of using RegexBuddy's templates is that RB handles the sometimes gruesome formatting your regex must undergo before it can live in your target language. For instance, some characters may need to be escaped; or, as in the screenshot, some options may need to be set in a matching function. Customizing and adding code templates (just do it!) Every coder has her own coding style. The templates that ship with RB are only meant as a starting point. Make them your own! You can either tweak the existing templates or add new ones. I find this tremendously useful. Here are some examples of templates I write: 1. House style and documentation. You may not write your method calls exactly how Jan has provided them. A custom template lets you feel at home, and also allows you to insert any comments you like. 2. Complete scripts. Sometimes you want to give someone a full working demo of a regex operation. With a template, I can generate a self-standing C# command-line program or Python script. 3. Same engine, new contexts. RegexBuddy fully understands PCRE, but PCRE is integrated in many projects, and naturally RB cannot provide templates for every possible context. With custom templates, I can generate code using the PCRE engine in contexts such as Apache, MariaDB and AutoHotkey. Here's how to create a template. RegexBuddy template editor button Click the Template Editor button. The default template for the chosen language appears. 
Navigate to the General tab. Add a personal suffix to the language name, turning it into something such as Python 2.7 [Rex]. This will make things easier later when you want the ability to select between your template and the original. Click save. Note that all custom templates go to your RegexBuddy settings folder, i.e. %AppData%\JGsoft\RegexBuddy 4, without disturbing the originals in %programfiles%\Just Great Software\RegexBuddy 4. You cannot save them anywhere else. I usually don't mess much with the rest of the tab and tend to leave the Modes tab alone (I'll explain it later). The Functions tab is where all the action happens. New regex template icon In the Functions tab, click the New icon. On the left, you'll see a new function called <new function>. On the right, the form lets you edit the new function. Start by editing the name. As you do, the name changes on the left. Hit the Save icon at the top. Note that you can always remove a function using the Delete button. In the text box to the right, type or paste your template. Note the preview on the left. Note that the form lets you specify some input parameters. For instance, for the Parameter 1 pull-down, select Subject text. In the text box on the right, give the variable a default name, such as news_post. In the code template, locate the variable you had for the subject, and replace it with %PARAM1%, which you can either type or paste by using the Placeholders menu at the bottom. When you use the template, you will be able to change the name of the subject, and the proper name will appear in the code. That's the gist of editing templates. Have a play with the Placeholders menu. Apart from the numbered parameters, the placeholders insert values you define on the Modes tab. The Regex Tree menu lets you embellish your code with a full explanation of the regex in one of three formats (comment, string, xml). The Conditionals menu lets you tweak the generated code depending on conditions being met. 
Click Save when done. You can see the results by selecting the function in the Use tab. When you have added several functions, you can organize them using the Move Up and Move Down buttons. Pairing a language with a code template By default, RegexBuddy uses your modified code template if available for a language. If you've deleted all of Jan's original functions and would like to see what they look like, create a new language based on the target language, giving it a clear and distinctive name. In the language creation form, in the Template for source code snippets pull-down at the bottom, you'll be able to select the original template, which you'll recognize because its name won't be decorated like yours (see the Python 2.7 [Rex] example above). Select code template
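To give an idea of what a function such as Iterate over all matches and capturing groups boils down to, here is a hand-written Python sketch. It is not RegexBuddy's generated output. The pattern and sample text are made up, and the role given to have_title (a boolean for a simple match test) is my assumption. Only the variable names news_post and have_title come from the example above.

```python
import re

# news_post plays the role of the Subject text parameter (%PARAM1% in a template).
news_post = "Title: Regex news\nTitle: More regex news\nBody: nothing to see"

# Hypothetical pattern: capture a label and the rest of the line.
regex = re.compile(r"^(\w+): (.*)$", re.MULTILINE)

# Iterate over all matches and, within each match, over all capturing groups.
found = []
for match in regex.finditer(news_post):
    for number in range(1, regex.groups + 1):
        found.append((match.start(), number, match.group(number)))
        print(f"match at {match.start()}, group {number}: {match.group(number)!r}")

# have_title (assumed meaning): does the subject contain a title line at all?
have_title = re.search(r"^Title: ", news_post, re.MULTILINE) is not None
```

In a custom template, the literal name news_post would be replaced by the %PARAM1% placeholder so that the name typed in the Use tab ends up in the code.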

Convert tab: Translate a pattern to another regex flavor

Suppose you have a file parser written in, say, Python, whose job it is to extract the beginning of each file, up to a section starting with "=== SUMMARY ===". For this job, it would be reasonable to use a simple pattern such as this: (?s)^.*?(?=\n=== SUMMARY ===) One day your boss tells you that all Python scripts have to be moved over to Ruby. As you write the code, you pop the regex straight into the new script. Now you have two problems. (No, not these two problems.) 1. The inline modifier (?s) that you had used to flip on DOTALL mode (dot matches line breaks) does not work in Ruby. 2. The caret anchor ^ that you had used to tell the engine to search at the beginning of the file tells Ruby to search at the beginning of every line. This is what RegexBuddy's Convert tab is for. If there is a straightforward conversion, it does it brilliantly. RegexBuddy convert tab On the conversion pane, there are pull-down menus giving you the options, if applicable, to strip comments and to use free-spacing mode. If RegexBuddy has doubts about the conversion, it will issue several degrees of warnings, such as this one: regex conversion warning RegexBuddy Accept Conversion Once you're happy with the conversion, if you want to keep working with the pattern in the target language, click Accept Conversion. This changes the selected regex flavor, copies the pattern to the regex text box and sets all the required flags. Converting between Exact Spacing and Free-Spacing One great bonus of the Convert tab is that you can use it to strip a free-spacing regex of all comments and whitespace, and a regular regex of its inline comments. To perform this feat, in the Conversion target pulldown, select the same language as the one used by the pattern. The spacing and comment pulldowns do the rest. Conversely, the same method allows you to set up a plain regex for free-spacing, assuming of course that the chosen flavor supports it. 
This may sound less useful, but it's actually a great time-saver for patterns that contain lots of spaces, as you otherwise need to manually escape them or insert them in character classes such as [ ] (less efficient but more readable).

Understanding conversions

If you're unsure of some of the syntax in the conversion, you can look at it in the Explain tab (mmm… it's still called Create?). To do so, start by backing up the current regex by adding it to the History pane. Then accept the conversion and switch to the Explain tab. In that same tab, you can also set up a comparison between the two flavors. At times, it may be helpful to view the Convert and Explain tabs together. If that's of interest, you can set up a custom layout such as the Conversion layout shown in the screenshot.
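Coming back to the Python pattern that opened this section, the sketch below runs it on a made-up file body. It then tries one possible rewrite that relies only on constructs Python and Ruby read the same way (\A for the start of the string, [\s\S] for any character including line breaks). That rewrite is my own illustration and not necessarily what the Convert tab outputs.

```python
import re

# Hypothetical file content, for illustration only.
content = "Report title\nFirst paragraph.\n=== SUMMARY ===\nThe short version."

# Original pattern: (?s) lets the dot cross line breaks, and in Python a
# ^ without MULTILINE only matches at the start of the string.
head = re.search(r"(?s)^.*?(?=\n=== SUMMARY ===)", content)
print(repr(head.group()))   # 'Report title\nFirst paragraph.'

# A flavor-neutral spelling of the same intent: no (?s) and no ^,
# so neither of the two Ruby pitfalls listed above applies.
portable = re.search(r"\A[\s\S]*?(?=\n=== SUMMARY ===)", content)
print(portable.group() == head.group())
```

Both searches return the same text in Python. Checking the rewritten pattern in the target flavor is still worthwhile, and that is what the Convert and Explain tabs are for.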

Library tab: Store and retrieve patterns

In this tab, you create libraries to store patterns you might want to use again. When you store a pattern, you can store the whole thing: the kind of operation (match, replace, split), the language, the regex, the modifiers and the subject text. RegexBuddy library RegexBuddy comes with a built-in library, which contains a vast range of regular expressions dealing with topics ranging from IP addresses and credit cards to Romanian national ID numbers. Even though I've never needed these patterns, I still find the default library to be a great illustration of what a mature library can look like. Create your first custom library now The power of libraries is really in creating your own. You can do anything with regex, and your libraries should reflect that! Just look at one of my libraries in the screenshot above. Regex library open icon To create a custom library, press the New icon at the very left. But that's only the first step.
Immediately after pressing New, press the Save icon. That's because RegexBuddy doesn't offer to enter a name for the new library. Naming and saving the library ensures that everything you create will get properly saved. I strongly recommend you save your custom libraries in a location that you recognize and frequently back up, so that they don't get left behind by oversight when you migrate to a different machine. Some of you use that company with a box in its name, right?

Two things you need to know first

The key to having a good experience with the RegexBuddy library is to be aware of the Open button and the Read-only checkbox (both circled on the screenshot). The Open button lets you switch to a different library. If, like me, you're not very good at understanding icons, you might miss it. The Read-only checkbox will save you a lot of aggravation when RB refuses to add a pattern to the library. Yeah, you probably need to uncheck that box.

Using the library: the three main buttons

The library is easy to use. You operate it with three buttons.

When you select a library item, the Use icon lets you copy that item to the test pane. You can copy the regex, the text or both. After applying this action, you need to switch to another tab such as Test to get to work.

While using an item from the library, if you modify the regex or the subject, the Update button lets you save either or both to the stored pattern (assuming the Read-only box is unchecked). No need to touch the Save button: the library file is updated on disk.

When you'd like to preserve a pattern, use the Add button to save it and, optionally, the subject text. RB doesn't ask you for a name, so you need to name your pattern yourself. I don't find this super intuitive: the name lives in the top-right-most text box of the Library tab.
Trick: saving boiler-plate text

How many times do you craft a pattern then realize you don't have suitable boiler-plate to test it on? One idea is to create some library entries that only contain text, either creating a dedicated Boiler-Plate library or saving them to your custom library with a prefix such as [Boiler-Plate]. I like the first solution best. You can use the regex field (and the replacement field!) to jot down notes about what the text represents, its source, and so on. An alternative solution is to use the Open button in the Test pane and let RegexBuddy manage a list of files to use as input.

Saving Changes to the Library

When you've changed a regex and want to ensure the library reflects the change, use the Update button. You don't need to worry about saving the library itself. When you click the Add or Update buttons, the library is updated on disk. The floppy icon is a bit confusing. It's not to save pending changes, but to make a copy of the current library, with or without the expressions' subject text.

Organizing your Libraries

Don't try dragging items up and down the list: that won't work. You'll have to rely on alphabetization, for instance by adding numbered prefixes as on my screenshot at the top of the section.

Library Layout

There's not a whole lot more to the library. It's a great feature and it works as advertised. But here's one final trick you might like: a library layout. It shoves the History pane out of the way, replacing it with a dedicated Scratch Pad library. That way you can save a subject with each pattern and no longer need to fear you might click the Clear All button by mistake.

To create this layout, start by clicking the red button at the top right of the History pane in order to close it. Now click the Library tab and drag it to the right of the screen, until you see a ghost outline roughly corresponding to the section of the screen where the History pane used to be. Release the mouse button.
If everything worked the way you wanted, save the layout under the View button's Custom Layouts menu.

GREP tab: Use a pattern to search your files

This tab offers a powerful tool to search (and replace!) inside files using regular expressions. Jan is also the author of PowerGREP, possibly the most powerful text-processing tool on the market (click here for a free PowerGREP trial). If your job involves manipulating large amounts of text, then PowerGREP is probably a great tool for you—even more so if you craft data extraction tools for others, as you can produce text pipelines for your clients to plug into their own copies of PowerGREP. For everyone else, RegexBuddy's GREP tab offers a great balance of features. Not only can you search and replace, you can also save and reload an action i.e., a search or replacement job. RegexBuddy Grep Basic Workflow First, we need to decide what we're looking for in the target files. 0. Enter a regex in the usual regex text box. If you want to perform a replace operation, enter that pattern too. Alternately, you can specify what not to match. To do so, check the Invert results box on the right. Second, we need to tell the tool which folders and files we're interested in. 1. In the folders box, either click the browse button to the right or paste a folder path. 2. At the very right, check the Recurse subfolders box if you'd like to look in subfolders as well. 3. If you want to restrict the search to certain files, enter a DOS-style wildcard pattern such as f*.html in the file mask field (see the circled text on the picture above). Alternately, enter a pattern for the files you want to exclude, and check the invert masks box on the right. Yeah, I know… A DOS-style wildcard in a regex tool? Really? Why not a regex? To be fair, there's actually an expanded syntax for those wildcards—you can look it up in the docs if you like. For my part, I don't feel like learning another matching syntax, so I'll hold out for regex file-matching in RB5. Quick tip: the pull-downs to the right of the file and folder text fields contain paths and masks you've used before. 4. 
If this is a Replace operation, decide whether you'd like to change the original files or work on a copy. These options are in the Target pull-down. If you'd like to replace in copies, you'll have to specify a path to the right of the pull-down. 5. For Replace operations, decide whether to back up the originals before proceeding. That's highly recommended because you'll then be able to perform an Undo on your last replace operation. There are multiple options in the Backup pull-down: see the docs for details. Next, we're ready to search (and perhaps replace). 6. If you're just searching, go ahead and press the Grep button on the left. Results appear in pleasant highlighting. You can double-click files or line numbers to open them in the Test pane. Or you can click inside the match results then use the Edit pull-down menu to edit the match results in a text editor.
6 bis. For replacement operations, click on the button's pull-down menu. Generally you'll want Preview. The Execute option lets you review the changes. Quick Execute is acceptable if you've selected to back up the originals. Before launching a massive replacement operation, I strongly recommend you try a Replace and Undo on a sample file in order to make sure that it performs as you expect and as advertised. Bugs happen. 7. If you've performed a replacement, now it may be time to either Undo or delete the backup files. These actions live in the Grep button's pull-down menu.
Save and Open Grep Operations Save Grep operation One neat feature of the GREP tab is that it lets you save and reload what it calls actions, i.e. search and replace operations. The menus are self-explanatory. Miscellaneous refinements On the right, Include Binary Files does what it says. Likewise, the Line-based checkbox processes the input line by line. The Clear button (blank sheet) quickly resets the form. The Export button saves the results to a text file (which you can also achieve via copy-paste).
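The workflow above (a regex, a folder, an optional recursion into subfolders, a wildcard file mask, and a backup before replacing) can be sketched in a few lines of Python. This is a rough stand-in of my own to show the moving parts, not how RegexBuddy or PowerGREP are implemented. The function names grep_folder and replace_in_file and the .bak suffix are hypothetical.

```python
import fnmatch
import re
import shutil
from pathlib import Path


def grep_folder(folder, regex, mask="*", recurse=True):
    """Yield (path, line_number, line) for every line that matches `regex`."""
    pattern = re.compile(regex)
    root = Path(folder)
    paths = root.rglob("*") if recurse else root.glob("*")
    for path in sorted(paths):
        # Keep only regular files whose name fits the wildcard mask.
        if not path.is_file() or not fnmatch.fnmatch(path.name, mask):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                yield path, number, line


def replace_in_file(path, regex, replacement, backup=True):
    """Replace all matches in one file; keep a .bak copy so it can be undone."""
    path = Path(path)
    if backup:
        shutil.copy2(path, path.with_name(path.name + ".bak"))
    text = path.read_text(encoding="utf-8")
    new_text, count = re.subn(regex, replacement, text)
    path.write_text(new_text, encoding="utf-8")
    return count
```

A call such as grep_folder("some/folder", r"reg(ex)?", mask="f*.html") would mirror steps 0 to 3, and restoring the .bak copy plays the role of Undo.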

Forum tab: Interact with other RB users

RegexBuddy forum login The RegexBuddy forum is a brilliant touch. It's integrated within the program itself, and it feels like a software product of its own. In fact, it's received a lot of attention as it's a component of Jan's other products. Being used to web forums nowadays, it took a few tries to get comfortable with the bulletin-board-style software, but I've really come to appreciate it. For one thing, since you need the product in order to post, there is no spam whatsoever. And since the forum is behind a "private screen", you can run into some pretty cool people. Login Options The login button is at the top left. It pays to look carefully at the options, which stay selected from one session to the next. If you use EditPad (free EditPad Pro trial here) or other JGsoft products, you may want to select "Show all groups", as that will allow you to navigate to these fora. If you want to be alerted when someone replies to your beautiful posts, select "Email replies to conversations you participate in".
The forum can take your screen's full width, so it can be very readable. Still, I often prefer to access the RB forum from within EditPad Pro, where I can launch it with a hotkey and where I have better luck getting it to run in its own window. In addition, EPP also seems to give more control over the font size.

Orientation

I'll let you play with the features, but here's some quick orientation as the forum has a unique interface (unlike familiar web-based fora). If you selected "Show all groups" at login, you can switch between fora at the top right. The two panes on the left let you navigate conversations. The one at the top shows the overall thread title, while the one at the bottom shows posts within a given thread.

Make sure you locate the Refresh button. It's your friend. While the New button creates a new thread (shown in the top-left pane), the Reply button adds a post to an existing thread (shown in the bottom-left pane).

It's not enough to compose a post. You need to click the Send button. Until you do, your post actually stays as a draft in the forum pane, even if you close and re-open RB. It may look like it's been sent, but it hasn't. Watch out for the two Delete buttons. The first deletes an entire thread you've created, while the second deletes a single post. The Edit button lets you edit your posts. Use responsibly, as threads become illegible when someone completely alters the original.

The Attach pull-down is terrific. Use it! Not only can you attach random files, you can also attach RegexBuddy's current pattern and subject text without having to paste them. Doing this ensures others get an exact copy of the regex and all its settings or of the test subject. It also keeps the message body short and readable. You can also attach a library or a source code template.
RegexBuddy forum Use button Conversely, the Use button lets you use a pattern or subject text that someone attached, copying it into the proper RegexBuddy fields. RegexBuddy forum Feeds button In the Feeds menu, don't miss the Open buttons. I won't explain what they do but they're very cool. Their output depends on the options you select at the top. Don't miss the 'Search here' box. There's a good chance you can learn a lot about your question before you even ask. Tip: Bookmarking and Organizing Forum Threads If you don't use a news reader, you can still bookmark and organize forum threads, either in a text file or in an html page. 1. In the top or bottom left pane, click a thread or individual message you would like to bookmark. 2. Press Ctrl + C to copy. 3. Press Ctrl + V to paste in a text file for your bookmarks. This will create a regexbuddy: link. In EditPad Pro, the links will be clickable. The first time you click such a link, you may need to tell Windows to let RegexBuddy handle it: the post should open in the RegexBuddy forum. 4. For an html file of bookmarks, wrap the regexbuddy: links in html, as in: <a href="regexbuddy:forum/view/1000/1007880">Green title bar</a> Enjoy!

RegexBuddy Interface 201: Miscellaneous controls

RegexBuddy controls We never really wrapped up the tour of the interface, remember? Let's do so now by looking at the controls and menus at the very top. Helpful and Strict Modes RegexBuddy helpful mode It's good to know about the pull-down menu that lets you choose between "Helpful" and "Strict" modes. The strict mode simply behaves like the chosen engine is supposed to behave, without alerting you to potential problems. The helpful mode alerts you to points of syntax that may be specific to that engine and that you may not have considered. For instance, in most engines, \s matches a whitespace character. But if you use this token in MySQL, it simply matches a literal "s". In strict mode, if you use \s (presumably with the intention of matching whitespace) RegexBuddy does not flag the token as problematic because it is valid MySQL regex. In helpful mode, RegexBuddy highlights the token in red, and if you switch to the Explain tab to investigate the error, it tells you that "MySQL does not support any shorthand character classes". This is a valuable feature and speaks for leaving helpful mode turned on. Copy and Paste RegexBuddy copy and paste formatted regex I don't use the Copy and Paste menus much, but I can see how someone else with a different workflow might use them all the time. Copy to Code You know how a simple regex such as "https?://[Rr]ex[Ee]gg\.com\b" can give you a headache when you try to use it in your code? You'll have to do things like… /"https?:\/\/[Rr]ex[Ee]gg\.com\b"/ in JavaScript, @"""https?://[Rr]ex[Ee]gg\.com\b""" in C#, and so on. The Copy button takes that pain away: you tell RegexBuddy what your target language is, and it does the rest. It fills your clipboard with an ugly string that you didn't have to write, and you can go paste it verbatim in your environment. 
Paste from Code Likewise, you can lift a regular expression straight from a piece of code, and navigate the Paste From menu to let RB handle the burden of peeling away the ugliness to reveal the beautiful regex core, which it pastes to the pattern text box. Undo and Redo RegexBuddy Undo and Redo buttons Nothing to report here—these work as advertised.
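To make the Copy to Code and Paste from Code idea concrete in one more language, here is how the same regex (surrounding double quotes included) could be written by hand as a Python string. This is my own sketch rather than RegexBuddy's output, and the test sentence is made up.

```python
import re

# The bare regex from the text, quotes included: "https?://[Rr]ex[Ee]gg\.com\b"
# A raw string keeps every backslash as typed, so nothing needs doubling.
as_raw = r'"https?://[Rr]ex[Ee]gg\.com\b"'

# Without the raw prefix, each backslash has to be doubled; a lone \b
# would otherwise turn into a backspace character.
as_plain = '"https?://[Rr]ex[Ee]gg\\.com\\b"'

same = as_raw == as_plain
urls = re.findall(as_raw, 'visit "http://rexegg.com" or "https://RexEgg.com"')
print(same, urls)
```

Going from the bare regex to either string is the Copy direction, and recovering the bare regex from the string is the Paste direction.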
View Menu

Within the View menu, there are several items of interest. The minor ones first:
- Large Toolbar Icons makes the menu more usable,
- Lock Toolbars prevents you from dragging interface items inadvertently.

Custom Layouts

The most important part of the View menu is the Layouts section at the bottom. If you ever mess up your layout, you can quickly restore the default. You can also experiment with the side-by-side layout. Best of all, you can save and restore some Custom Layouts.

To create a custom layout, start by resizing the text panes to your liking by hovering over the pane separator until the mouse cursor changes to a "drag" icon. If you also want to move menus around, make sure the Lock Toolbars toggle in the View menu is unchecked: handles appear next to some menu items, allowing you to drag them into new positions. Tabs can also be moved. It takes a moment to get used to, but if you drag a tab to the right and a little bit below, a shadow appears where the tab will be placed. Only drag to the right, shuffling tabs until you get the desired order: I've had no success dragging in the other direction.

I've found it useful to save several layouts:
1. One-Line Layout. This is for patterns that fit on one line. The regex text box is minimized, giving maximum space to the Test and Results panes.
2. Multi-Line Layout. This is for when I work in free-spacing mode, when patterns can span many lines. This gives more space to the regex box, taking away from the Test and Results fields.
3. Debug Layout. This is a side-by-side layout, helpful when using the debug tab. The Test and Explain tabs are on the left, the Debug tab is on the right. See the screenshot in the debug section.
4. Conversion Layout. This layout shows the Convert and Explain tabs together. See the screenshot in the section about conversions.
5. Library Layout.
In this layout, the Library tab replaces the History pane, allowing you to use a personal library as a scratch pad. You can see a screenshot in the section about libraries. Unable to restore the default layout? See this tip. Preferences Menu RegexBuddy preferences menu The Preferences dialog has seven tabs. I won't walk you through everything (see the docs) but I'll highlight items to which I pay attention. 1. Editors tab. - Regular Expression: check all boxes - Test: Visualize line breaks, word wrap, Ctrl + Wheel changes font size (awesome!) - Use: Ctrl + Wheel changes font size - GREP: all unchecked - Configure Text Tools / Main font: Consolas, 16 2. Operation tab. - Show options that can be changed with toggle buttons that indicate the "on" state - Preserve state: check all - Statusbar: show 3. GREP tab: no tweaks. 4. Regex Colors tab: default. 5. Text Colors tab: at the moment, for contrast with the regex box, I'm enjoying the white on black. 6. Use Colors tab: the Borland classic scheme brings me back to Turbo C and Clipper/dBase. 7. GREP Colors tab: experimenting with white on black Help Menu RegexBuddy help menu Surprisingly, the Help menu is packed with goodies. Do yourself a favor and take a leisurely look!
xkcd 208 regex troubleshooting The star of the Help menu is Create Portable Installation. Insert a flash drive, install, and you're ready for some serious off-site troubleshooting. The first four items of the Help menu open strategically chosen sections of the help file—a chm document that is a true treasure. It contains far more than RegexBuddy-related help: most of the content from Jan's regular-expressions.info website is there, making it pleasant to read this outstanding reference about regular expressions offline. See this tip to read the RegexBuddy help on your Kindle or other e-reader. Three other items are worthy of note: - Since RB doesn't yet have an option to check for updates automatically, you may find it helpful to peruse the Check for New Version item every once in a while. - The About RegexBuddy item shows you which version you're running. - The Support and Feedback item generates a chunk of text that you can paste in an email or forum message if you need help. Regex modifiers and settings RegexBuddy pattern modifiers Just above the text box for the regex pattern, there is a line of controls that dictate how the match is made. These controls can either be shown as toggles (as on the image) or as drop-downs. If you have drop-downs and want toggles, see the quick-start section. Even if you have toggles, you may not see the same control as on my screenshot. These controls depend on the context—chiefly which regex engine and which mode (match, replace, split) are selected. By and large (but not exclusively), the controls correspond to the classic regex pattern modifiers. On the picture, the case-insensitive modifier is selected. This would be equivalent to setting it as a parameter in a match function, rather than inline. In fact, when you go to the Use tab to integrate the regex into your language, RegexBuddy places the modifiers in the match, replace or split function. Likewise when you copy a language-ready regex via the Copy menu. 
On the picture, note that the DOTALL mode (dot matches line breaks) is set inline by the (?s) token, but that the dot matches line breaks control is off. That's because the controls work independently of what you write in the pattern—in the same way as you could turn on a mode in a function while you turn it off inline, such as in: if subject =~ /(?-i)https?/i There are simple rules to deal with such situations (namely the case-insensitivity passed as a parameter to the match function is overridden once the pattern turns it off inline), and you can be sure that RegexBuddy knows them.
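To make the override rule concrete, here is a minimal sketch in Python rather than in the Ruby-style syntax of the example above. One assumption to flag: Python's re module does not accept a bare (?-i) in the middle of a pattern, so the sketch uses the scoped form (?-i:...) (available since Python 3.6) to switch case-insensitivity off inline. The variable names are made up for illustration.

```python
import re

subject_upper = "HTTPS://EXAMPLE.COM"
subject_lower = "https://example.com"

# Case-insensitivity passed as a parameter only: both subjects match.
flag_only = re.compile(r"https?", re.IGNORECASE)

# The flag is still passed to the function, but the scoped group
# (?-i:...) turns it off inline for the tokens it wraps.
flag_overridden = re.compile(r"(?-i:https?)", re.IGNORECASE)

# DOTALL set inline with (?s), without passing re.DOTALL to the function.
dotall_inline = re.compile(r"(?s)a.b")

print(bool(flag_only.search(subject_upper)))        # flag applies
print(bool(flag_overridden.search(subject_upper)))  # inline setting wins
print(bool(flag_overridden.search(subject_lower)))  # lowercase still matches
print(bool(dotall_inline.search("a\nb")))           # dot matches the line break
```

The point is the same as with RegexBuddy's toggles: what is passed as a parameter and what is written inline are two separate settings, and the inline one has the last word for the part of the pattern it covers.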

Adding a Regex Flavor

Sometimes, you may be working in a variation of an existing regex flavor that doesn't happen to match anything offered in the Languages list. At such times, you may find it handy to create a custom flavor. You won't be able to control every aspect of the engine's behavior, but you'll have a starting point. RegexBuddy new language editor On the above screenshot, I have created a new flavor, which now appears in the list of languages to the left. To the right, you can see the form that you fill in to set up a language. You select a base flavor (in this case POSIX BRE), and you select a number of parameters. I haven't had a need to do this myself, but I've used a template that someone else made and shared on the RegexBuddy forum. You can be that guy! It was particularly helpful because it came with a code template. I can't recall the details now as it was a few years back, but I believe it was for some flavor of an Apple-related language.

I know this matches! Is RegexBuddy broken?

When you're in a hurry, it sometimes happens that a regex that you know matches gets you a stern
The regular expression does not match the test subject.
You place the pattern under the microscope, you try it in actual code, and there's nothing wrong with it. In these situations, I've learned that a handful of settings are often the culprit. The worst offender by far is the Line by Line vs Whole file setting. I use both and often forget to check it when working fast. Another big one is free-spacing mode that I've left turned on when I'm typing classic-style. Oh, and case-sensitivity is known to interfere. Did I mention line breaks? Did DOTALL or, more insidiously, CR Only get turned on? It sounds silly but sometimes I've just forgotten to turn Highlight on, or I've turned off the Update automatically setting inside List all. I've never flipped on the Lazy quantifiers toggle by mistake. But you might! Come to think of it, I've never turned it on at all. Who knows what happens when you do? Music starts playing, a magical door opens? For now I won't look. If all of those fail… Well, bugs do happen. If your language is showing you one thing and RB another, and you've double-checked all the settings, then it's definitely worth posting via the Forum tab.
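The same culprits can bite you in code. Here is a rough Python sketch of how each setting flips a match into a non-match. It only approximates the RegexBuddy toggles with Python's re flags, and the two helper functions are hypothetical names of my own, not anything RB exposes.

```python
import re

text = "start\nmiddle\nend"

def matches_whole_file(pattern, subject, flags=0):
    """Test the pattern once against the entire subject."""
    return re.search(pattern, subject, flags) is not None

def matches_line_by_line(pattern, subject, flags=0):
    """Test the pattern against each line separately."""
    return any(re.search(pattern, line, flags) for line in subject.splitlines())

# 1. Line by line vs whole file: a match that spans a line break
#    can only succeed when the whole subject is searched at once.
whole = matches_whole_file(r"start\s+middle", text)
per_line = matches_line_by_line(r"start\s+middle", text)

# 2. Free-spacing mode: the literal space in the pattern is ignored,
#    so the pattern effectively becomes "helloworld".
classic = re.search(r"hello world", "hello world") is not None
free_spacing = re.search(r"hello world", "hello world", re.VERBOSE) is not None

# 3. Case sensitivity.
case_sensitive = re.search(r"end", "END") is not None
case_insensitive = re.search(r"end", "END", re.IGNORECASE) is not None

# 4. DOTALL: by default the dot stops at line breaks.
dot_default = re.search(r"start.+end", text) is not None
dot_all = re.search(r"start.+end", text, re.DOTALL) is not None

print(whole, per_line)                   # True False
print(classic, free_spacing)             # True False
print(case_sensitive, case_insensitive)  # False True
print(dot_default, dot_all)              # False True
```

When RB and your code disagree, running through these four pairs is a quick way to see which setting differs between the two.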

How to Copy your RegexBuddy Settings to Another Machine

Migrating a RegexBuddy installation is easy. There are two steps. 1. Settings. In the source machine, paste this folder path in Windows Explorer or your favorite file manager (Directory Opus, right?) %AppData%\JGsoft\RegexBuddy 4 Copy everything. After installing RB on the new machine, once again paste that folder name to navigate to it, and overwrite the contents of the folder with the backed up data. 2. Libraries. In the section about the Library tab, I advised you to save your libraries in a personal folder that often gets backed up. You did that, right? Copy the libraries to a personal folder on the new machine, go to the Library tab and open each of the files to populate the Open menu.
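If you migrate often, the settings step can be scripted. Below is a minimal sketch in Python, assuming the settings live in the %AppData%\JGsoft\RegexBuddy 4 folder mentioned above and that RB is closed while you copy. The function names and the backup folder name are hypothetical, and the sketch needs Python 3.8 or later for dirs_exist_ok.

```python
import os
import shutil
from pathlib import Path

def default_settings_dir():
    """Return the settings folder given in the text: %AppData%\\JGsoft\\RegexBuddy 4."""
    return Path(os.environ.get("APPDATA", "")) / "JGsoft" / "RegexBuddy 4"

def copy_settings(source, destination):
    """Copy everything under source into destination, overwriting existing files.

    Returns the sorted names of the top-level entries now in destination.
    """
    shutil.copytree(source, destination, dirs_exist_ok=True)
    return sorted(entry.name for entry in Path(destination).iterdir())

if __name__ == "__main__":
    settings = default_settings_dir()
    # Hypothetical backup location; pick a folder that gets backed up.
    backup = Path.cwd() / "RegexBuddy4-settings-backup"
    if settings.is_dir():
        print(copy_settings(settings, backup))
    else:
        print("Settings folder not found:", settings)
```

On the new machine, the same function works in the other direction: call copy_settings with the backup folder as the source and the settings folder as the destination.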

Tweaking your RegexBuddy setup with AutoHotkey

In version 4, one of my gripes with RB is that you can't assign your own keyboard shortcuts. One easy way around this is to run an AutoHotkey script. If you don't already have AutoHotkey (a.k.a. AHK) installed, it's a program that lets you tweak how Windows works. For instance, it lets you assign keyboard shortcuts that work in groups of programs that you define, saving you the pain of tweaking each application individually. It's a joy moving from Visual Studio to PyCharm (or Thunderbird!) and being able to use the same shortcuts to duplicate or move lines. Having a hotkey script also lowers the pain of setting up a new machine. You take your ahk script with you and a lot of the customizations that make you feel at home are done. There's a lot more you can do in AHK, but to get started I suggest you create a simple file for keyboard shortcuts. In that file, you can have shortcuts for multiple programs. I'll give you a file that only applies to RegexBuddy; you can expand it with shortcuts for other applications. In a second I'll tell you what it does, but in the meantime here is the link to download this simple AutoHotkey script to customize RegexBuddy shortcuts. To install, save the script to a safe place. Make a shortcut to that file in your startup folder. To go there, paste this in Windows Explorer or your favorite file manager: C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp Install AutoHotkey if needed. That's it, you're good to go. The script does three things: RegexBuddy keyboard shortcuts with AutoHotkey 1. It assigns a few keyboard shortcuts that make sense to me. For instance, take the main row of tabs: Test, Use, Library, Explain (a.k.a. Create), Convert, Grep, Forum. To switch to a tab, the formula is: Alt + Initial of the tab's name For instance, for the Test tab, press Alt + T. The exception is Create because its C collides with Convert. I think of it as Explain anyway. 2. 
The script has a special key: F12 pops up a window that shows all the programmed shortcuts, like on the screenshot. 3. The script disables clicking on the icons of the History pane. This is because at the moment it is way too easy to click on Clear History, causing you to lose all the patterns in the pane. There was no way to disable only the Clear History button, but right-clicking in the pane gives you all the actions, which also have convenient shortcuts assigned by the script: Ctrl + Plus / Minus to add a pattern, Ctrl + Alt + Up / Down to move it up or down the list. If you don't like that click-disabling behavior, comment it out using /*C-style comments,*/ making sure that the beginning and ending markers each live at the beginning of a line. This is only a starting point. It's up to you to modify the script to tweak the shortcuts and add behavior. Do note that some of the shortcuts mean that some default RB shortcuts will not work. For instance, in the Use tab, Alt + U no longer opens the function drop-down list because it switches to that tab in the first place.

Misc Tips

Here are some miscellaneous tips about the program. Default Layout not restoring? Normally, you can solve layout issues via View button's Restore Default Layout item. If that is not working even after restarting RegexBuddy, one file may be corrupted. Exit RB and use your file manager to navigate to the %APPDATA%\JGsoft\RegexBuddy 4 folder. There, delete the Dock.ini file. When you restart, the default layout will be shown. Debug Build You can download a special version of RegexBuddy called the debug build. The main time to do this is when your program regularly crashes. The debug build will output a file that you can email to Jan. Sometimes, the debug build is one step ahead of the current build, so that new features or bug fixes may be baked in. In that sense you could consider it a pre-release, except that you'll usually have no idea what's been added (unless you read about it on the forum). You'll also have to put up with the debug build's peculiar (but deliberate) way of crashing without attempting to recover from errors. Read the RegexBuddy help on your e-reader RB's chm file is much more than a RegexBuddy reference: it contains a huge amount of information about regular expressions, mirroring Jan's regular-expressions.info website. Reading on the screen can be tiring. Luckily, you can read the entire chm file: 1. as a PDF by downloading the RegexBuddy PDF directly from JGsoft, 2. online on the RegexBuddy manual pages, 3. as an ebook file to peruse on your e-reader: for instructions, read on. I was going to upload a version for you, but given that the help file changes with every release, staying on top of it would be tedious. Instead, here are simple instructions. 1. Paste %programfiles%\Just Great Software\RegexBuddy 4 into Windows Explorer or your favorite file manager. 2. Locate RegexBuddy4.chm. If your e-reader handles chm files natively, you know what to do. Otherwise, … 3. Open Calibre or install that wonderful ebook manager if you haven't yet. 
Drag the file onto Calibre. 4. Edit the metadata so the ebook correctly shows the author (Jan Goyvaerts). For the title, include the version, e.g. RegexBuddy 4.6.1 5. Click Convert. At the top right, select your output format (for Kindle, select mobi). Press OK. 6. Once the conversion finishes, navigate to your Calibre library to pick up the converted ebook and transfer it to your e-reader. If you don't know the location of the library, click the Calibre button and read the top line: Your Calibre library is currently located at… Note that the sixth step might be simpler for you: I have disabled Calibre's auto-sync option.
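If you would rather not click through the conversion after every release, Calibre also ships a command-line tool called ebook-convert. The sketch below is a minimal Python wrapper around it, assuming ebook-convert is on your PATH and that the chm file has been copied next to the script. The function name, the file names and the title are placeholders of mine, so adjust them to your version.

```python
import shutil
import subprocess
from pathlib import Path

def build_convert_command(chm_path, output_path, title, author="Jan Goyvaerts"):
    """Build the ebook-convert command line, setting title and author metadata."""
    return [
        "ebook-convert",
        str(chm_path),
        str(output_path),
        "--title", title,
        "--authors", author,
    ]

if __name__ == "__main__":
    chm = Path("RegexBuddy4.chm")    # assumed to be copied next to this script
    out = Path("RegexBuddy4.mobi")   # the output format follows the extension
    command = build_convert_command(chm, out, "RegexBuddy 4.6.1")
    if shutil.which("ebook-convert") and chm.exists():
        subprocess.run(command, check=True)
    else:
        print("Nothing converted. The command would have been:", command)
```

This skips the Calibre library altogether: the converted file lands wherever the output path points, ready to be transferred to the e-reader.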

Wish List

RegexBuddy is awesome, but no one is perfect, and there are a few things I would like our champion to work on. 1. Regex Features The only language features you'll find on this list are fairly obscure—but regex addicts use them, and what do regex addicts use? That's right, they use RegexBuddy. Support for the (*SKIP)(*FAIL) construct and other backtracking control verbs. Support for Matthew Barnett's regex engine (by far the best regex module for Python). 2. Program Features Option to check for updates automatically. Shortcut for the Preferences menu. Rename the Create tab to Explain, as that's what the tab is for in RB4. The "Clear History" icon is a time bomb. Click it instead of the "Delete" icon next to it, and you'll lose all your patterns. This button urgently needs a confirmation dialog. Library: after clicking Use, directly switch to the Test tab as our job in the library is done. Forum: ability to unsubscribe from a single thread (currently you have to unsubscribe from the whole forum.) Forum: "Subscribe" button so that we can follow a conversation by email even when we haven't participated. In the Use tab, ability to choose delimiters for languages that require them. In the Template editor, ability to set a default delimiter for languages that require them. Create tab: for consistency, a label for the Copy button? We need a way to customize hotkeys. In Windows, F2 stands for rename. I want to use F2 to rename my regex patterns in the History pane. Instead, F2 adds a pattern. Test tab: ability to copy a pattern to all the test patterns in the History pane. Test panel: ability to copy the test text to each pattern in the History pane. Code template editor: a parameter to use the text currently present in the Test pane. In the Debug tab, addition of a static view mode, allowing us to grasp the debug on a static picture, rather than in a cinematic way. 
On this view, a second column would be added, so that we can see, side by side, the engine's position in the string and the token or expression being attempted.

Related Links

Here are links related to RegexBuddy. Software by Jan Goyvaerts: official website, free trials RegexBuddy: the most powerful regex tool on Earth. A free RegexBuddy trial is available. EditPad Pro: the most regex-aware text editor. A free EditPad Pro trial is available. PowerGREP: perhaps the most powerful text search and manipulation tool. A free PowerGREP trial is available. RegexMagic: build regex without knowing regex. A free RegexMagic trial is available. AceText: your text needs from a central location. A free AceText trial is available. HelpScribble: create documentation files and website. A free HelpScribble trial is available. DeployMaster: no-nonsense installation builder. A free DeployMaster trial is available. RegexBuddy in the wild For a laugh, have a look at this StackOverflow thread, which reveals that some people are threatened by RegexBuddy (and regex in general, but that's not a revelation.) That's all I have! Hope you found this tutorial useful. Please feel free to print it out. If you're helping out on the forum, remember that the direct links can be used to direct someone to a specific section in the tut. If you find typos or want to share your thoughts on RB, I'll be excited to hear from you. Please use the comment form below. Keep on regexxing! Smiles, Rex

Regex Humor

On this page, I aim to collect all the tidbits of regex humor I manage to muster. Some of these are the fruit of well-known brilliant minds, some of it I've started to produce, and lots of it will, I hope, be contributed by you guys, who as we know are not only geniuses but also fine humorists. There's a comment form at the bottom so please fire away. The bits that get good feedback will move to the main section of the page, and your work will get proper attribution. Lower down there's a collection of all of Randall Munroe's xkcd strips that mention regex.[citation needed] But first things first: what's the meaning of life?

The Meaning of Life

With gratitude and apologies to Douglas Adams (may he rest in peace):
"O Deep Thought computer," he said, "the task we have designed you to perform is this. We want you to tell us...." he paused, "The Answer." "The Answer?" said Deep Thought. "The Answer to what?" "Life!" urged Fook. "The Universe!" said Lunkwill. "Everything!" they said in chorus. Deep Thought paused for a moment's reflection. "Tricky," he said finally. "But can you do it?" "Yes," said Deep Thought, "I can do it. But, I'll have to think about it." "How long?" "Seven and a half million years," said Deep Thought. [Seven and a half million years later.... Fook and Lunkwill are long gone, but their ancestors continue what they started] "Good Morning," said Deep Thought at last. "Er..good morning, O Deep Thought" said Loonquawl nervously, "do you have...er..." "An Answer for you?" interrupted Deep Thought majestically. "Yes, I have." "And you're ready to give it to us?" urged Loonquawl. "I am." "Now?" "Now," said Deep Thought. "Though I don't think," added Deep Thought, "that you're going to like it." "Doesn't matter!" said Phouchg. "We must know it! Now!" "Alright," said Deep Thought. "The Answer to the Great Question..." "Of Life, the Universe and Everything..." said Deep Thought. "Is..." said Deep Thought, and paused. "Yes...!!!...?" "Okay, here it is, let me print it out for you," said Deep Thought, with infinite majesty and calm. Slowly, a narrow tape came out of a small slit in Deep Thought's titanium panels. It read:
^(?=(?!(.)\1)([^\DO:105-93+30])(?-1)(?<!\d(?<=(?![5-90-3])\d))).[^\WHY?]$
"But… What does that mean?" asked Loonquawl. "I don't know," said Deep Thought. "But I can design a more powerful computer that will be able to tell you that." "It will take time, though", added Deep Thought.
Curious? 1. On the following link, you can see a demo of the Meaning of Life Regex at work. 2. … But I highly recommend you try to figure it out for yourself—it's a great exercise! 3. Authors: Douglas Adams in this book—and for the regex, Rex—7 August 2014 (to share this: direct link)

The Incomplete Two-Problem Quote (…what two problems?…)

Odds are ten to one that you've already heard the famous quote about the two regex problems. Sadly, the quote is incomplete. Jeffrey Friedl did a great job tracking down the original author Jamie Zawinski, and on my side I've been trying to find out the lost words from the complete quote. Maybe you can help. The Original Quote:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
Obviously, something there is missing… What two problems are we talking about? Here are some attempts to complete the quote. Please send yours. The Reversed Quote Hypothesis
Some people, when confronted with a problem, think “I know, I won't use regular expressions.” Now they have two problems.
Author: Rex, 21 October 2015 The Parrot Hypothesis
Sometimes, when confronted with a problem, you think “I know, I'll use regular expressions.” Now you have two problems: 1. figuring out what to do with the many hours of tedious coding you just saved, and 2. having to deal with the trolls who give you an earache parrotting some lame quote about having two problems.
Author: Rex, 7 May 2014 The Recursive Hypothesis
Some people, when confronted with a problem, think “I know, I’ll quote Jamie Zawinski.” Now they have two problems.
Source: Martin Liebach, March 4 2009 Variation: I think that reads better with “I know, I’ll call Jamie Zawinski.” Now they have two problems. Will the well-meaning people of this world bring an end to this awful controversy? People have been discussing the two-problem question quite seriously on Stack Exchange but have failed to reach a consensus. I sincerely hope that with your help, the strict scientific approach on this page will be more fruitful. And now… A bit of regex entertainment.

What's the control character for "I smoke dope?"

regex bell character Author: Rex, 8 May 2014. Source: the marvelous Regex Cookbook, 2nd Ed.

Password Validation

.NET password validation It's also true in Python 3! Author: Rex, 8 May 2014

Messy Editor

\x5C[^\cH],u(I)D\g{1}0t! (Untested. Please don't use this in your code.) Author: Rex, 8 May 2014

Regex Harassment

Boss: (?![0-57-9])\d[^\D0-8]\? Employee: (slaps boss) Author: Rex, 20 July 2014

Regex Humor in the Wild

This section presents tidbits of regex humor found here and there. From Multiple Sources Q: What did one regex say to the other? A: .+ This one is too clever for me. I've read the explanation on Stack Overflow, but I still don't get it. From Morten Just Q: What regex are you most likely to see at Christmas? A: [^L] Q: Why couldn't Chris try out the regular expressions he created until he left home? A: His mom wouldn't let him play with matches. Source: mortenjust

xkcd regex

It seems like regex is now an official xkcd theme. It's fortunate that Randall has an on-and-off obsession with regex (if it's fair to call it that), because since you're reading this page, you probably do too. Randall seems to have a pretty permissive license, but I'll be seeking permission to use his art here. If you like his work, consider supporting him by buying his merch. I bought his first book, xkcd: volume 0 and already look forward to What If?, named after the eponymous column. xkcd #208: Regular Expressions This one is my favorite (and probably everyone else's). I prefer it in this abridged form, but the link will take you to the original xkcd 208 regular expressions xkcd #1031: Leopard regex xkcd 1031 regular expressions leopard regex xkcd #1171: Perl Problems I'm not crazy about this one, perhaps because Randall seems to be endorsing the original version of the two-problem quote, which as we've seen is sadly incomplete. But I'm probably missing the irony. xkcd 1171 Perl Problems xkcd #1313: Regex Golf xkcd 1313 regular expressions I hope you are well. I was wondering if you could assist me with my regex query. I am trying to compute the necessary minimum required strength for a copulative inter coalescent stack-based password management experience (a mouthful, I know). I would like it to restrict emoji and only allow alnum characters from 0-16, no more and no less. We would also like no non English language characters as if we use weird characters e.g Abrahamic characters the database sort of breaks (idk why). WOULD like some help with this as have been struggling on stack overflow no one has came up with good response, having studied your site for quite some time (and found it quite helpful - especially the segment on humour. Keeps me going at work, haha! ) I have this to start and don't know where to go, if you could provide me with a bus ticket to the regex mobile that'd be great. Thanks & Best Reggards (See what I did there, rexegg! 
), you ARE the best! Franklin Wish I could help but I'm swamped… There's a decent Abrahamic engine available as a Python module (produced by a Jordanian developer IIRC), can't find it right now but since your comment is live on the site, hopefully someone else will be able to jump in. \bS(h)?al(?(1)o|aa)m\b Subject: explanations Assuming you are not pulling our leg about not understanding these: 1) for /. */ — since the expression matches "any and everything", so "whatever" you can think of. "whatever" is slang today for "some" agreement whether you really agree or not. 2) the 2 problems — by solving it with a regex, you've added another problem — the problem of writing a regex that solves the problem. Doing that is at least as difficult as the original problem. Like I said — assuming you were not pulling out leg! Subject: humor proposal How about this: What do you ask for if it's not enough? M|e Subject: RE: Now they have two problems. Some people, when confronted with a problem, think "I know, I'll use regular expressions. " Now they have a job. Subject: Subject: RE: Now they have two problems Is the request to complete this quote advanced irony? The quote is complete. You have problem 1. And you're using regexes, problem 2. Perhaps it is too negative about regexes for you to parse. Reply to Peter O Hi Peter, Aha, you mentioned irony. Maybe there is life out there after all. I've been shocked at the number of people who didn't get the joke and wrote in to explain the so-called problems over the past year. I now regret binning those comments as they would make a fun read, one after the other. > Perhaps it is too negative about regexes for you to parse. Nah, I don't think I'm the one with parsing problems. Kind regards, R The joke is complete in itself. Using regex restates the original problem, it does not provide a solution. 
The original problem now exists in two forms, its original expression and its regular expression form. Reply to James Hi James, Your idea that the regex "does not provide a solution" is highly original. In my experience of thousands of successful data extractions in response to real problems, the regex always seemed to provide a solution, but I must have been deluded. And the extraction strategies almost never felt like restatements, which is probably another illusion. Kind regards, Rex

Regex Books and Resources

This page contains two main sections:

Regex Books

Of the four books about regular expressions I have seen, two O'Reilly books are well worth reading. They are different, and if you fall in love with regex, you will probably want to read both. The one to start with is Jan's Regular Expressions Cookbook. The first two chapters give you a quick ramp-up to regular expressions. The third chapter shows you how to perform a number of regex operations in various programming languages. (If you use RB, you may recognize the kind of code output by the Use panel.) In cookbook fashion, the remaining five chapters present recipes for many of the tasks you might want to accomplish with regex. If you use RegexBuddy, you will see a parallel between the choice of recipes and the patterns in the RB library. Eventually, the book you will want to study is Jeffrey Friedl's Mastering Regular Expressions. The first three chapters make a solid introduction to regex. Chapters 4 and 5 are excellent reads about advanced regex. Chapter 6 contains a fascinating discussion of techniques to optimize your expressions. The four remaining chapters each focus on using regular expressions in a particular context: Perl, .NET, Java and PHP. O'Reilly also has a Regular Expression Pocket Reference which I find uninteresting. If you have read this site (or Jan's tutorial), Sams Teach Yourself Regular Expressions in 10 Minutes is a waste of time.

Online Regex Resources

Regex Engine Benchmark. I'll put that first because I find it fascinating: regex-redux regex engine benchmark. JG Soft In the world of regex-ware, there is one name to remember: JGSoft, or the man behind it, Jan Goyvaerts. You might think that JGSoft stands for "Jan Goyvaerts Software"—but no, it stands for "Just Great Software". Jan seems to have infiltrated the world of regex to its very core. He stands behind: - a top-notch online regex tutorial - a regex engine (JGSoft) - a top text editor (EditPad Pro) that uses the JGSoft regex engine - very cool regex tools, particularly the absolutely awesome RegexBuddy - an O'Reilly regex cookbook I mention that now so you can connect the dots as you cruise through this page and the Tools page. Online Regex Tutorials and Resources Apart from this site, for a comprehensive introduction to regular expressions, my favorite is Jan's regex tutorial. The one reproach I would have is that it tends not to document the regex features not yet implemented in Jan's tools such as RegexBuddy. To read this beautiful tutorial offline without copying and pasting lots of pages into a Word document, you can send five bucks to Jan for a pdf. But if you're going to buy or try one of Jan's products, such as EditPad Pro (which has a free trial), you would be paying for a redundant feature as Jan's regex tutorial is conveniently included in his Help files. A tip if you have a Kindle or other ebook reader: using Calibre, a free program, you can convert the CHM file (the help file, e.g. EditPadPro7.chm or RegexBuddy.chm) into an ebook. That's a great way to read the manual (and the tutorial) and to make the most of Jan's products. See this page of Kindle tricks for more details. Apart from Jan's tutorial… I find this quick intro to a few "advanced" regex topics quite clean. Smashing Magazine has a great collection of regex links. StackOverflow has a brilliant Regex FAQ. 
If you don't want to reinvent the wheel, you can try your luck at the regex library, which had over 3,000 expressions last time I checked. Online regex checkers are mentioned under "Other Tools" on the tools page. Regex Forums There are a few places online where you can get answers to regex questions. The most active by far is the regex tag on StackOverflow. The page I sent you to lists the questions that had the most votes. Once you land, you can also click the other tabs to see the newest questions. As a huge fan of RegexBuddy, I love its forum on regular expressions. It is a private forum that you can only access if you own EditPad Pro or RegexBuddy. (Click the links to download a free trial of RegexBuddy or EditPad Pro). I like the regex forum on the PHP Developers Network. The regex board at PHPfreaks is good too. So is the regex forum on devshed, though my eyes find it a tad harder to read. Other Regex Links This page is a place to start for MySQL regex. But you'll also want to look at the regex(7) man page. Well, unless you want to go back to the beginning of my sprawling regex tutorial, that's it for now. I wish you a lot of fun on your journey with regular expressions!

The Greatest Regex Trick Ever

So you're doubtful at the mention of a "best regex trick"? Fine. I'll concede right away that deciding what constitutes the best technique in any field is a curly matter. When you start out with regex, learning that the lazy question mark in <tag>.*?</tag> prevents you from steamrolling from the start to the end of a string such as <tag>Tarzan</tag> likes <tag>Jane</tag> may seem like the best regex trick ever. At other points in your career, you'll surely fall in love with regex bits such as [^"]+ to match all the content between certain delimiters (in this case double quotes), or with atomic groups.

However, as you mature as a regex practitioner, you come to regard these techniques for what they are: language features rather than tricks. They are neat, to be sure, but they are how regex works, and nothing more. In contrast, a "trick" is not a single point of syntax such as a negated character class or a lazy quantifier. A regex trick uses regex grammar to compose a "phrase" that achieves certain goals.

With regex there's always more to learn, and there's always a more clever person than you (unless you're the lone guy sitting on top of the mountain), so I've often been exposed to awesome tricks that were out of my league, for instance the famous regex to validate that a number is prime, or some fiendish uses of recursion. But however clever these tricks, I would not call any of them the "best regex trick ever", for the simple reason that they are one-off techniques with limited scope. You are unlikely to ever use them.

In contrast, the reason I drum up the technique on this page as the "best regex trick ever" is that it has several properties:
- Anyone can learn it. You don't have to be a regex master.
- It answers not one, but several common and practical regex questions.
- These questions are ones that even competent regex coders often have trouble answering gracefully.
- It is simple to implement in most programming languages.
- It is easy to extend when requirements change.
- It is portable over numerous regex flavors.
- It is usually more efficient than competing methods.
- It is too little-known. At least, until now.

Do I have your attention yet? Before we proceed, I should point out some limitations of the technique:
- It will not butter the reverse side of a toast.
- It will not make small talk with your mother-in-law.
- It relies on your ability to inspect Group 1 captures (at least in the generic flavor), so it will not work in a non-programming environment, such as a text editor's search-and-replace function or a grep command.
- The point above also means that you may have to write one or two extra lines of code, but that is a light price to pay for a much cleaner, lighter and easier to maintain regex. Code samples for the six typical situations are provided below.
- There is an edge case to keep in mind. The regex engine dumps unwanted content into a trash can. In a typical context that is no problem, but if you are working with an enormous file, the trash can may get so large that you could run into memory issues.

Other than that, it's awesome. Okay, let's dive in. No need to buckle up, the technique itself is delightfully simple.

Excluding certain Contexts while Matching or Replacing

Here are some of the questions that our regex trick is able to answer with speed and grace:
- How do I match a word unless it's surrounded by quotes?
- How do I match xyz except in contexts a, b or c?
- How do I match every word except those on a blacklist (or other contexts)?
- How do I ignore all content that is bolded (… and other contexts)?

Once you grasp the technique, you will see that under a certain light, these are all nearly the same question.

For convenience, here are some jumping points. For full potency, I recommend you read the whole article in sequence. But if you don't care about the typical solutions to the problems addressed by the technique, you can skip directly to the description of the trick.
- What do you mean by "Best Regex Trick"?
- Typical Solutions to "Unless" Problems
- The Trick
- Match a Word Unless it's in Quotes
- How Does the Technique Work?
- The Technique in Pseudo-Regex
- One Small Thing to Look Out For
- Match Pattern Except in Context X
- Match Pattern Except in Contexts A, B, C
- Match Everything Except X
- Match Every Word Except Black List
- Ignore Content of This Kind
- A Variation: Deleting the Matches
- Variation for Perl, PCRE and Python
- Code Samples
- Code Translators Needed

This is a long page. It's sure to have typos and perhaps bugs. Will you do me a favor and report any typos or bugs you find? Thanks!

The Typical Solutions

To see how convenient the trick is, it helps to first see how inconvenient some matching tasks can be when you don't know it. So let's see what other solutions exist. We'll look at two broad cases:
A. The "simple" case
B. The general case

A. Simple Case: fixed-width non-match, as in "Tarzan"

First, let's examine a "simple case": we want to match Tarzan except when this exact word is in double-quotes. In other words, we want to exclude "Tarzan".

Option 1: Lookarounds

At first you may think of framing Tarzan between a negative lookbehind and a negative lookahead:
(?<!")Tarzan(?!")
However, this does not work because it also rejects Tarzan in valid strings such as "Tarzan and Jane" and "Jane and Tarzan", whereas we only wanted to exclude "Tarzan".

Back to the Future Regex I

To account for this, you might inject a lookahead inside your negative lookbehind. This is what I call a "Back to the Future" regex. The lookahead inside the lookbehind asserts that after we've found the opening double quote behind Tarzan, we can find Tarzan (surprise) and a closing double quote. Since we're inside a negative lookbehind, this whole package is what we don't want.
(?<!"(?=Tarzan"))Tarzan

Step Forward then Backflip

This approach is closely related to the Back to the Future approach. You match Tarzan, then you exclude the match if it is followed by a double quote (lookahead) that is preceded by the string "Tarzan".
Tarzan(?!"(?<="Tarzan"))

Conditional

Alternatively, you might first turn the negative lookbehind into a positive lookbehind that captures the opening quote if found, then tag a conditional at the end to assert that if Group 1 was set, the following character cannot be a double quote.
(?>(?<=(")|))Tarzan(?(1)(?!"))

Logic à la Lewis Carroll

For this simple sample problem, you can modify the faulty lookarounds solution with a bit of logic:
(?<!")Tarzan|Tarzan(?!")
The left side of the alternation excludes "Tarzan, but the right side allows it.
The right side of the alternation excludes Tarzan", but the left side allows it. As desired, this expression can match Tarzan, "Tarzan, Tarzan" but not "Tarzan". This is neat, but is it obvious? You might find the logic immediate, but most people will need to think about it for a moment to see how this works (I'm in that camp). The four options above work… but good luck explaining them to your boss.

Option 2: Parity Check

You can check that Tarzan is not inside quotes by checking that it is not followed by one quote followed by an even number of quotes. That's a bit of a hack.
Tarzan(?!"(?:(?:[^"]*"){2})*[^"]*)
Simple, right? Er… Not really. There's plenty of room to introduce bugs here. And indeed, this regex will not properly handle "Jane and Tarzan", where we would like Tarzan to match (you could get around this with a lookbehind and an alternation). In contrast, the solution that uses the regex trick on this page will be hauntingly simple.

Option 3: The Two- or Three-Step Dance (Replace before Matching)

I'll expand on this option below when we look at cases more complex than "Tarzan". In the meantime, here is the idea:
1. Replace all instances of the bad string (here "Tarzan"). If you're just trying to match, your replacement can be "" (you can remove the string). If you want to replace the good strings but leave the bad strings, replace the bad strings with something distinctive, such as "T~a~r~z~a~n".
2. Simply match or replace the string you want (here Tarzan), which is now safe to do as you know that all the bad strings have been neutralized.
3. If you are replacing rather than simply matching, there is one more step: you now need to revert the distinctive strings ("T~a~r~z~a~n") to their original form.

When you're working with a text editor and want to perform replacements, this is often your best bet.
The technique on this page is for when you are working in a programming language that allows you to inspect your Group 1 captures, so it won't help you in EditPad Pro or Notepad++.

Option 4 for Perl, PCRE, Ruby, Python: \K

I'll also expand on this option below when we look at more complex cases than "Tarzan". This option works in Perl, PCRE (C, PHP, R, …), Ruby 1.9+ and Python's alternate regex engine. In these regex flavors, the \K token tells the engine to discard the characters matched up to its appearance when preparing the overall match. We can use this feature to match unwanted content (here "Tarzan" and other characters that are not Tarzan) up to the very point where a wanted string begins (here Tarzan). At that point, the \K discards the unwanted content, and the engine proceeds to match the content we really want. This solution looks like this:
(?:"Tarzan".*?)*\KTarzan
This is a compact option if you use one of the engines that support it, but as you'll see below, it is still more work than the technique on this page.

Okay, that was the simple case. Here the context to avoid had a fixed width: a single double-quote character on either side of the word Tarzan. Now let's look at the general case, where the content to exclude has a variable width.

B. General Case: variable-width exclusion (for instance between tags)

More often than not, the context we want to exclude has a width we cannot predict. For instance, suppose we want to avoid matching the string Tarzan somewhere between [a] tags, as in [a]Jane and Tarzan[/a]. Not only will the string between the tags be variable (here Jane and Tarzan), but the tag itself may also vary, as in [a]. In such situations, you often see the big guns come out.

Option 1: Variable-Width Lookbehind

In most regex flavors, a lookbehind must have a fixed number of characters, or at least a number of characters within a specified range. However, a handful of flavors allow true variable-width lookbehinds.
Among the chosen few are .NET, Matthew Barnett's alternate regex module for Python and JGSoft (available in RegexBuddy and EditPad). In .NET, the question "match Tarzan except inside curly braces" (e.g., not in "{Jane loved Tarzan's curly hair}") can almost be gracefully handled with:
(?<!{[^}]*)Tarzan

Back to the Future Regex II

Why almost? Because "Tarzan" should be allowed in { Jane and Tarzan..., where the left brace is left open. To check both sides, we'll need to inject a positive lookahead inside the lookbehind, stepping into Back to the Future territory, to assert that after we've found what we were looking for behind Tarzan, we can find Tarzan (surprise) and optional characters up to a closing curly brace. Since we're inside a negative lookbehind, this whole package is what we're trying to avoid. This is the adult version of our earlier Back to the Future regex, and it looks like this:
(?<!{[^}]*?(?=Tarzan[^{}]*}))Tarzan
What if you need more restrictions, such as also forbidding Tarzan from appearing in [i][/i] tags inside of [p][/p] tags? Yes, you can add more variable-length lookbehinds. Good luck to you as the restrictions become more numerous and complex. Also, if the pattern to be matched is more complex than the literal Tarzan, the expression can fast become unmanageable. And in Java, PHP, Ruby and Python's re module, you can forget about this technique altogether because infinite-width lookbehinds do not exist in these flavors.

Option 2: The Two- or Three-Step Dance (Replace before Matching)

To match all instances of Tarzan unless they are embedded in a string inside curly braces, one fairly heavy but simple solution is to perform a two-step dance: Replace then Match. If we also want to replace all these matches, we need a third step: a final replacement.

Step 1: You positively match all instances of Tarzan embedded in curly braces. If you're just trying to match, your replacement can be "" (you can remove the string).
If you want to replace the good strings but leave the bad strings, replace the word Tarzan with something distinctive, such as "T~a~r~z~a~n". To perform the match, this simple regex would do: ({[^{}]*?)(Tarzan)([^}]*}) The string is captured into three groups: the beginning, Tarzan, and the end. If you're removing the bad strings before matching, your replacement would be \1\3 or $1$3 depending on your regex flavor. If you're replacing the bad strings before replacing the good strings, your replacement would be \1T~a~r~z~a~n\3 or $1T~a~r~z~a~n$3. If there are other contexts in which you want to avoid matching Tarzan, you probably have to repeat Step 1, as attempting to match all the bad strings in one big regex is fraught with risk. Step 2: All the unwanted instances of Tarzan have been neutralized, so you can now match Tarzan without worrying about context. I realize that matching Tarzan in a vacuum is not that interesting. In real life you might be looking for Tarzan and the phone number that follows. Optional Step 3: If the point of Step 2 was not only to match but also to perform a replacement on the acceptable Tarzan strings, then once that replacement is made we also need to turn all the T~a~r~z~a~n strings back into Tarzan, which is easily accomplished. Option 3 for Perl, PCRE, Ruby and Python: \K This option works in Perl, PCRE (C, PHP, R, …), Ruby 1.9+ and Python's alternate regex engine. In these engines, the \K token causes the engine to drop all it has matched up to the \K from the overall match it returns. This opens a strategy for us: we can (i) match any unwanted content (if present) up to the beginning of a wanted Tarzan instance, (ii) throw away that portion of the match using \K, then (iii) match Tarzan. This option could look like this: (?:(?>{[^}]*?})[^{}]*?)*\KTarzan Note that while we try to match unwanted content, we swallow entire sets of {strings in curly braces} without bothering to check if they contain Tarzan. 
We do not need to care, because we know that if something is inside curly braces, we don't want it. Compared with the other options we've seen so far, this is fairly economical. But if you need to add conditions in which Tarzan cannot be matched, it can become very hard to manage. Besides, it's still too much work compared with… (drum roll…)
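For comparison with what follows, here is a minimal Python sketch of the Replace-then-Match dance (Option 2 of the general case). The sample text and the replacement word "Cheetah" are my own assumptions, and the loop is there because one pass of the three-group regex only disguises one Tarzan per pair of braces.

```python
import re

text = "Tarzan met {Jane and Tarzan} before Tarzan left."

# Step 1: disguise every Tarzan that sits inside curly braces.
# Repeat until nothing changes, since a single pass rewrites at most
# one Tarzan inside each pair of braces.
inside_braces = re.compile(r"(\{[^{}]*?)Tarzan([^}]*\})")
masked = text
while True:
    masked, changes = inside_braces.subn(r"\1T~a~r~z~a~n\2", masked)
    if changes == 0:
        break

# Step 2: every remaining Tarzan is a wanted one, so replace freely.
replaced = re.sub(r"Tarzan", "Cheetah", masked)

# Step 3: turn the disguised strings back into their original form.
result = replaced.replace("T~a~r~z~a~n", "Tarzan")
print(result)
```

It works, but it takes three passes and a placeholder that must never occur naturally in the text.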

The Best Regex Trick Ever (at last!)

If you've read up to here, well done! Without further ado, let's plunge into this technique I have been relentlessly selling you. I'm hopeful that this won't be anticlimactic in the least. One key to this technique, a key to which I'll return several times, is that we completely disregard the overall matches returned by the regex engine: that's the trash bin. Instead, we inspect the Group 1 matches, which, when set, contain what we are looking for. This means that you may have to write one or two extra lines of code, but that is a light price to pay for a much cleaner, lighter and easier to maintain regex. Code samples for the six typical situations are provided below. An example is worth a picture-and-a-half, so let's revisit our first example.

Match Tarzan but not "Tarzan"

You remember the simple case where we tried to match all instances of Tarzan except those enclosed in double quotes? It turned out to yield solutions in varying shades of obscure, such as ((?<=")?)Tarzan(?(1)(?!")) and Tarzan(?!"(?:(?:[^"]*"){2})+[^"]*?(?:$|[\r\n])) and (?:"Tarzan".*?)*\KTarzan Well, you'll now see how simple the problem becomes when you use the best regex trick ever: "Tarzan"|(Tarzan) Really? That's it? Yes. The trick is that we match what we don't want on the left side of the alternation (the |), then we capture what we do want on the right side. When our programming language returns the results, we ignore the overall matches (that's the trash bin) and instead turn our whole attention to Group 1 matches, which contain what we were after. Adding exclusions is a breeze When there's another context we want to exclude, we simply add it as an alternation on the left, where we match it in order to neutralize it—if it's matched, it's in the trash. For instance, if we also had to exclude Tarzan in Tarzania and --Tarzan--, our regex would become: Tarzania|--Tarzan--|"Tarzan"|(Tarzan) Adding exclusions is a breeze, isn't it? Again, the only instances of Tarzan we care about will be those captured by Group 1.
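As a rough illustration in Python's re module, the extended pattern can be used like this. The test sentence is an assumption of mine; the only part that matters is the filter on Group 1.

```python
import re

pattern = re.compile(r'Tarzania|--Tarzan--|"Tarzan"|(Tarzan)')
text = 'Tarzan saw "Tarzan" in Tarzania, --Tarzan-- and "Tarzan and Jane".'

# Overall matches are the trash bin: keep only matches where Group 1 is set.
wanted = [m for m in pattern.finditer(text) if m.group(1) is not None]

words = [m.group(1) for m in wanted]
starts = [m.start(1) for m in wanted]
print(words, starts)
```

The Tarzan inside "Tarzan and Jane" is kept, because the exclusion only targets the exact quoted word.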

How Does the Technique Work?

This is simple, but it may not be entirely intuitive, so it's worth reviewing how the regex engine handles this pattern. If you feel very confident you understand the mechanics of this match, feel free to skip to the next section, the Technique in Pseudo-Regex. A quick refresher about the regex engine Remember that the engine has two "reading heads" which both move from left to right: one moves in the string, one moves in the regex pattern. The main thing to understand is that at the start, with the string reading head at the very beginning of the string, the engine tries to match the entire pattern at that position. If that fails, the string reading head advances by one character, and the engine again tries to match the entire pattern. Thus the engine can advance in the string one character at a time, and at each of these characters attempt an overall match, and fail, until at one starting point in the string, perhaps, an overall match is returned. Let's look at this in more detail. When we fire up the engine, both reading heads are at the very left. At the string reading head's current position (i.e. the very left), the engine attempts to match the entire pattern. To do so, it tries to match the pattern's first token against the string's first character. If that fails, the engine's string reading head advances to the position immediately past the first character (i.e. between the first and the second character), and the pattern reading head resets to the very left. At that position, the engine once again attempts to match the whole pattern. At that stage, if the first token matches the second character, both reading heads advance, and the engine tries to match the second token against the next character. Of course some tokens have quantifiers and match multiple characters; and within a match attempt from a given starting position in the string, the pattern reading heads often have to backtrack. 
But the principle remains the same: if an overall match fails, then the string reading head moves to the next position, the pattern reading head resets to the very left, and the engine once again attempts an entire match. In "multiple matches mode", when the engine succeeds in matching the entire pattern, it records the current match, then attempts the next match starting from the position that immediately follows the last character that was just included in the match.

A Walk-Through

So let's say we are trying the original pattern "Tarzan"|(Tarzan) against this string: Now Tarzan says to Jane: "Tarzan".

1. The engine's string reading head positions itself at the head of the string, before the "N" in "Now". At this position, the engine attempts to match the entire pattern "Tarzan"|(Tarzan)
2. At this position, the engine is unable to match the opening double quote in "Tarzan" because the next character is "N", so the left side of the alternation immediately fails. The engine's pattern reading head then jumps to the right side of the alternation and tries to match the initial T in Tarzan, but fails, again because the next character in the string is "N".
3. At this position in the string, the match has failed. The string reading head advances one character in the string (positioning itself between the "N" and the "o" in "Now"), and the pattern reading head resets to the very left. At this new position, the engine again attempts to match the entire pattern "Tarzan"|(Tarzan)
4. At this position, the engine is unable to match the opening double quote in "Tarzan" because the next character is "o". Likewise, the right side of the alternation fails because "T" is not "o".
5. The string reading head again advances in the string and attempts two matches that fail, the first before the "w" in "Now", the second before the space character preceding "Tarzan". The string reading head then advances in the string to the position preceding the T.
6. The left side of the alternation fails because the next character is not a double quote. The pattern reading head jumps to the right side of the alternation, and the engine is able to match the T. The string reading head advances by one character, the pattern reading head advances by one token. The engine is able to match a, then, as the reading heads continue to advance in parallel, the engine matches r, z, a and n. The match succeeds, Tarzan is added to the list of matches, and since it was in parentheses it is also recorded as the Group 1 capture for this match.
7. The string reading head advances to the position after the "n" in the initial Tarzan, and the pattern reading head resets to the very left. At this position the engine starts a new match attempt, and fails. The string reading head advances to each position in " says to Jane: ", and as it does so, at each position the engine attempts a new match, and fails. The string reading head then advances to the position preceding the first double quote.
8. At this position, before the opening double quote, the engine attempts to match a double quote and succeeds. The string reading head advances by one character, the pattern reading head advances by one token. The engine matches the T, and both reading heads keep advancing in parallel until all the characters in "Tarzan" have been matched.
9. The match succeeds, "Tarzan" is added to the list of matches, but it is not captured in any capturing group as it was not surrounded by parentheses.
10. The engine returns two matches: Tarzan and "Tarzan". We don't pay attention to the matches, but for each match we look at capturing Group 1 using our programming language. (You'll see code samples in several languages below.) For the first match, we have a non-empty capturing Group 1: Tarzan. That is what we were after. For the second match, Group 1 is not set, so we ignore it.
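If you'd like to watch the outcome of this walk-through yourself, here is a small Python sketch (Python's re module is my choice here, not something the walk-through depends on). It lists each overall match with its start position and its Group 1 capture.

```python
import re

pattern = re.compile(r'"Tarzan"|(Tarzan)')
text = 'Now Tarzan says to Jane: "Tarzan".'

# One tuple per overall match: (start position, overall match, Group 1).
# Python reports an unset group as None.
trace = [(m.start(), m.group(0), m.group(1)) for m in pattern.finditer(text)]

for start, overall, group1 in trace:
    print(start, repr(overall), repr(group1))
```

The first match has Group 1 set to Tarzan; the second (the quoted one) has no Group 1 capture, which is how we know to throw it away.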

The Technique in Pseudo-Regex

Here is the recipe in "pseudo-regex": NotThis|NotThat|GoAway|(WeWantThis) This is a game of good cop / bad cop. Bad string As in any good cop / bad cop routine, the bad cop comes in first. The idea is to use a series of alternations on the left to specify the contexts we want to exclude. By doing so, we force the engine to match these "bad strings". We won't even look at the overall matches—think of the set of overall matches as a garbage bin. After matching a bad string, the engine attempts the next overall match starting at the string position that immediately follows the bad string. In effect, that bad string has been skipped: this is how we manage to exclude unwanted context. Good string When the engine starts a match attempt at the beginning of a "good string", it can safely match it, because we know that if that string had been embedded in context we want to exclude… the engine would already have matched it and placed it in the garbage bin! Since we do match the good strings, they too go in the garbage bin. The difference is that by using capturing parentheses when we match the good strings, we capture them into Group 1. One or two lines of code In our code, we'll only examine these Group 1 captures. Examining Group 1 may take one or two more lines of code than examining "Group 0" (the overall matches), but that's a small price to pay for a regex that is crystal-clear and extremely easy to maintain. The code samples lower in the page will show you how to use this technique in a variety of languages for the six most common regex tasks: (i) checking if there is a match, (ii) counting matches, (iii) retrieving the first match, (iv) retrieving all matches, (v) replacing, and (vi) splitting. This is a simple but extremely potent regex technique, don't you think?
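To give a feel for the "one or two lines of code", here is a hedged Python sketch of three of the six tasks (checking, counting and replacing). The helper name wanted, the sample text and the replacement word "Cheetah" are all assumptions made for the illustration.

```python
import re

# The quoted form is the bad string, the bare word is the good string.
recipe = re.compile(r'"Tarzan"|(Tarzan)')
text = 'Tarzan and "Tarzan" and Tarzan'

def wanted(match):
    """True when the match came from the capturing (good cop) branch."""
    return match.group(1) is not None

# (i) Is there a match? (ii) How many matches are there?
has_match = any(wanted(m) for m in recipe.finditer(text))
count = sum(1 for m in recipe.finditer(text) if wanted(m))

# (v) Replace the good strings only; bad strings are put back unchanged.
replaced = recipe.sub(lambda m: "Cheetah" if wanted(m) else m.group(0), text)

print(has_match, count, replaced)
```

Testing for None rather than for an empty string keeps the check valid even if the wanted pattern can match zero characters.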

One small thing to look out for

There are not many bewares with this technique, but there is one small thing to look out for. It may sound obvious, but do make sure that the expression in (GetThis) is not so broad that it can swallow strings that contain bad strings—specifically, strings that start one or more characters before a bad string. For instance, suppose you want to match all words that are not inside an <img> tag. Let's apply our NotThis|(GetThis) recipe. 1. Your NotThis rule could look like this: <img[^>]+> 2. What about the GetThis rule? Don't use a dot-star, as on the right side of the alternation in <img[^>]+>|(.*) Why not? The engine starts a match attempt at the beginning of the string. First, it tries a < against the first character. Say the first character is "S": the < fails to match. The string reading head stays at the start of the string, but the pattern reading head now moves to the right side of the alternation. The engine tries the .* … and the naughty dot-star swallows the "S" and the rest of the string, exclusions and all. We are relying on the exclusion rules to remove unwanted context. But on the GetThis side, you can't have an expression that swallows the same context you are trying to remove! That stands to reason, but it needs to be said—and seen. I sometimes mess this up when building expressions fast, and it's good to be able to instantly spot what is going wrong. Note that the problem only arises if the GetThis regex is able to match one or more characters before it matches a bad string. That is because at a string position that precedes a bad string by one or more characters, the exclusion rule is not able to fire, and the engine switches over to the hungry GetThis. On the other hand, it is perfectly acceptable for the GetThis expression to have the potential to match a bad string, as long as it only has that potential at the very start of a bad string. Why? Because this potential never has a chance to come to fruition. 
Since the exclusion regex patterns are on the left of the alternation, these patterns neutralize bad strings before the GetThis regex can ever get to them. In our example, this regex would do the job: <img[^>]+>|(\w+)
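The difference between the hungry dot-star and the safer \w+ is easy to see in a quick Python sketch. The sample string is an assumption of mine.

```python
import re

text = 'Say <img src="cheese.png"> cheese'

too_broad = re.compile(r'<img[^>]+>|(.*)')   # GetThis swallows the bad string
safe = re.compile(r'<img[^>]+>|(\w+)')       # GetThis cannot run into the tag

# Keep non-empty Group 1 captures only.
broad_hits = [m.group(1) for m in too_broad.finditer(text) if m.group(1)]
safe_hits = [m.group(1) for m in safe.finditer(text) if m.group(1)]

print(broad_hits)
print(safe_hits)
```

With the dot-star, the one and only capture is the whole line, tag included. With \w+, the words inside the tag never show up, because the exclusion rule fires at the opening < before \w+ gets a chance.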

More Applications of the Technique

Let's now explore other examples using the technique. At the very end, we'll also look at a neat variation for Perl and the PCRE engine (which PHP and Apache use).

Match Tarzan but not in {Tarzan's curly braces}

Remember how complex the typical case was before? The task was to match Tarzan, except when it lives somewhere between curly braces. Now all we have to do is apply our recipe: Not_this_context|(WeWantThis) Okay, first off, we know that (WeWantThis) is simply (Tarzan). Now how can we express Not_this_context? The unwanted context is Tarzan inside curly braces. Delightfully, for this, we use something as compact as {[^}]*}, and I'll explain why in a short moment. This small expression simply matches the entire content of a pair of curly braces. For this example, we're assuming that braces are {never {nested}}. This gives us: {[^}]*}|(Tarzan) All we have to do is retrieve the matches from Group 1. Too easy!! Of course in real life we would probably not look just for the word Tarzan, but for some variable content, such as Tarzan\d+ Please skip it simple! Please note this trick within a trick: to specify the exclusion rule, we did not bother to write a whole expression to match Tarzan inside curly braces, such as: {[^}]*?Tarzan[^}]*} Instead, we just matched the content of any curly braces: {[^}]*} Why? Because if something is inside curly braces, we know that we don't want anything to do with it. So we can go ahead and skip all sets of curly braces without bothering to look inside! This is what I call "skipping it simple". Now let's take it up a notch.
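Here is a short Python sketch of the curly-brace recipe, using the Tarzan\d+ variation mentioned above. The test string is made up, and I have escaped the braces in the pattern as a precaution, since some engines treat a bare { as the start of a quantifier.

```python
import re

# Skip any {...} block without looking inside; capture Tarzan plus digits.
pattern = re.compile(r'\{[^}]*\}|(Tarzan\d+)')
text = 'Tarzan11 {Jane loved Tarzan22} Tarzan33 {Tarzan44}'

hits = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(hits)
```

As in the text, this assumes that braces are never nested.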

Match Tarzan but not in contexts A, B and C

Your boss just told you that not only do we want to avoid Tarzan inside curly braces, we also want to leave the muscular vine hopper in his jungle when he appears within sections that start with BEGIN and end with END. Also, sentences starting with "Therefore" are definitely excluded. How is that for a change of specs? Is she trying to make you break a sweat? You must have solved the first assignment too fast. If you had done it with one of the typical techniques, at this stage you might be pulling your hair out. Instead, this is what you do:

Step 1: spend 57 seconds revising the original expression to this:
\bBEGIN\b.*?\bEND\b|Therefore.*?[.!?]|{[^}]*}|(Tarzan)

Step 2: clean up your inbox for a couple of hours before announcing to your boss that it was curly, but that by gawd… you've wrestled that regex to the ground!

So what have we done? We've just followed the recipe and added two exclusions to the original regex in alternations at the left. The first exclusion, which could have been a simple BEGIN.*?END, matches any sequence starting with BEGIN and ending with END. You've added the \b boundaries because you're nice and you want to give your boss a real END, not just any old WENDY. The second exclusion swallows any string that starts with Therefore and ends with one of the three characters in the [.!?] character class, so chosen because your boss told you to assume that all sentences end with periods, question marks or bangs. Okay, we're feeling great. What's the next use of our golden technique?
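A quick Python check of the three-exclusion regex might look like the sketch below. The test sentence is invented, and the braces are escaped as a precaution. Note that the dot does not cross line breaks here; if your BEGIN…END sections span several lines, you would need your engine's "dot matches newline" mode (re.DOTALL in Python).

```python
import re

pattern = re.compile(
    r'\bBEGIN\b.*?\bEND\b'   # context A: BEGIN ... END sections
    r'|Therefore.*?[.!?]'    # context B: sentences starting with Therefore
    r'|\{[^}]*\}'            # context C: anything in curly braces
    r'|(Tarzan)'             # what we actually want, in Group 1
)
text = ('Tarzan swings. BEGIN Tarzan sleeps END '
        'Therefore Tarzan is late! {Tarzan} Is Tarzan here?')

wanted = [m for m in pattern.finditer(text) if m.group(1) is not None]
starts = [m.start(1) for m in wanted]
print(len(wanted), starts)
```

Only the first and the last Tarzan survive; the three in the middle each sit in one of the excluded contexts.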

Match every word except Tarzan

So far, we've been looking at questions of the form:
Match X unless it is in contexts a, b and c.
Now let's look at a family of questions that sound quite different but reduce to the same:
Match every word except words a, b and c.
To start easy, let's try to match every word except Tarzan. Hey, that's simple: \bTarzan\b|(\w+) By the way, this is an interesting case because by itself, the \w+ would be able to match Tarzan. However, it is never able to fire in that situation, because by the time we get to an instance of Tarzan, the exclusion rule has already matched it. This is explained in more detail in the section about one small thing to look out for. Note also that as it is, the regex will capture antiTarzan and Tarzania. That's a feature, not a bug (see the \b boundaries.) Let's take it up a notch and talk about blacklists, a commonly requested regex task.

Match every word except those on a blacklist

This time we want to blacklist the words Tarzan, Jane and Superman. Hey, no problem. We add exclusions on the left, and our regex becomes: \bTarzan\b|\bJane\b|\bSuperman\b|(\w+) or, more gracefully: \b(?:Tarzan|Jane|Superman)\b|(\w+) You can try it online with "Tarzan, Jane and Superman hopped from vine to vine." Remember that what we're looking at is the Group 1 matches, which are shown in the lower right-hand panel and highlighted differently from the plain matches. Let's now talk about an application of the technique which, to untrained ears, sounds completely different:
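If the blacklist lives in your code rather than in the regex, you can build the pattern from it. The helper below is a hypothetical sketch (words_except is my own name, and it assumes a non-empty blacklist); re.escape is there in case a blacklisted word contains regex metacharacters.

```python
import re

def words_except(blacklist, text):
    """Return every word in text that is not on the (non-empty) blacklist."""
    banned = "|".join(re.escape(word) for word in blacklist)
    pattern = re.compile(r"\b(?:%s)\b|(\w+)" % banned)
    return [m.group(1) for m in pattern.finditer(text)
            if m.group(1) is not None]

sentence = "Tarzan, Jane and Superman hopped from vine to vine."
kept = words_except(["Tarzan", "Jane", "Superman"], sentence)
print(kept)

# The \b boundaries mean that longer words containing Tarzan are kept.
longer = words_except(["Tarzan"], "Tarzan antiTarzan Tarzania")
print(longer)
```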

Ignore Content of This Kind

Sometimes someone may present you with a regex problem and phrase it in this manner:
I want to ignore A.
It's useful to notice that this wording is just a variation on
Match everything except A
Didn't we just see that one? We did. Even so, let's stay sharp by practicing one more time, using this assignment: ignore bolded content. Maybe you can convince your boss to reword this as "match all content except anything in bold". By "in bold", let's say we're talking about content within <b> tags. And by "content", let's say we're talking about sequences of word and whitespace characters. Using our recipe, we can translate the assignment like so: <b>[^<]*</b>|([\w\s]+) As a reminder (see the lookout section for details), it would not do to use a (.*) in the GetThis section, because at any point in the string prior to a bolded section, the exclusion rule would fail, while the naughty dot-star would swallow the entire string from that point to the end, including any bolded sections. In that case, how about the lazy quantifier (.*?), you might wonder? You could do that, but make sure to see the section explaining why lazy quantifiers are expensive on the Mastering Quantifiers page.
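In Python, the "ignore bolded content" recipe could be sketched as follows. The sample string is an assumption, and the captures are left untrimmed so you can see exactly what [\w\s]+ picks up.

```python
import re

pattern = re.compile(r'<b>[^<]*</b>|([\w\s]+)')
text = 'Keep this <b>skip this</b> and this'

# Bolded sections land in the trash bin; everything else is in Group 1.
kept = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(kept)
```

If you don't want the surrounding spaces, strip each capture in code rather than complicating the regex.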

A Variation: Deleting the Matches

Sometimes, you want to match content in order to delete it. In this case, there is a simple tweak to our usual recipe that allows us to delete the matches directly without inspecting Group 1 captures. To search, instead of our usual recipe:
NotThis|NotThat|GoAway|(WeWantThis)
We use:
(KeepThis|KeepThat|KeepTheOther)|DeleteThis
As you can see, the location of the parentheses has been inverted. We can now replace the match with Group 1. There are two cases:
- If the match took place on the left branch of the alternation, and therefore captured to Group 1, the match is replaced with itself (no change);
- If the match took place on the right side of the alternation, the match is replaced with Group 1, which is empty: it is therefore deleted.

Here is an interesting variation to do the same:
(KeepThis)|(KeepThat)|(KeepTheOther)|DeleteThis
For the replacement, we concatenate Groups 1, 2 and 3 (in any order). Since only one of those groups is ever captured (if any), the other two groups are empty. Once again, the match is replaced with itself (if captured) or with an empty string. There is no standard for replacement syntax, so depending on the language this may look like \1\2\3, $1$2$3 or m.group(1) + m.group(2) + m.group(3). Be aware that some languages report a group that did not take part in the match as null rather than as an empty string (Python's m.group(n) returns None, for instance), in which case each group needs an empty-string fallback before you concatenate.
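Here is a small Python sketch of both deletion variants. The concrete patterns and the sample string are my own assumptions: the goal is to delete Tarzan unless it is quoted or inside curly braces. The \1 template relies on Python 3.5 or later, where an unset group in a replacement template becomes an empty string; the callback version does not depend on that behavior.

```python
import re

text = 'Tarzan "Tarzan" {Tarzan} Tarzan'

# Variant 1: (KeepThis|KeepThat)|DeleteThis, replaced with Group 1.
one_group = re.compile(r'("Tarzan"|\{[^}]*\})|Tarzan')
deleted_a = one_group.sub(r'\1', text)

# Variant 2: (KeepThis)|(KeepThat)|DeleteThis, replaced with the
# concatenation of all groups. Unset groups are None in Python, so each
# one falls back to an empty string before joining.
many_groups = re.compile(r'("Tarzan")|(\{[^}]*\})|Tarzan')
deleted_b = many_groups.sub(
    lambda m: "".join(group or "" for group in m.groups()), text)

print(repr(deleted_a))
print(repr(deleted_b))
```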

Variation for Perl, PCRE and Python: (*SKIP)(*FAIL)

Perl, PCRE (C, PHP, R…) and Python's alternate regex engine have a variation that uses almost entirely the same syntax, but that returns the desired matches as the overall match instead of returning them in capture Group 1. In these flavors, this is a neat trick to know as it can save us one or two lines of code.

Remember that in our technique, when we express a series of unwanted contexts in alternations to be matched and thrown in the garbage bin, such as NotThis|NotThat, the key to success is that when such undesirable areas of the strings are matched, they are in effect SKIPPED. After matching them, the engine attempts the next match starting at the position immediately following the preceding match. The entire area to be excluded has been gobbled up, and therefore skipped.

Well, with Perl, PCRE and Python's alternate regex engine, you can use a construct that makes the engine match that undesirable content, then fail the match… after which the engine skips the entire substring that just failed and starts the next match attempt at the position immediately following the bad string. This allows us to do the same as we've been doing, but we no longer need parentheses to capture the content we want because there is no longer a garbage bin full of unwanted matches to be ignored. We can inspect the matches directly, because the pattern only matches what we want.

That syntax can be written as (*SKIP)(*FAIL), (*SKIP)(*F) or (*SKIP)(?!). That's because (*FAIL) and (*F) are both synonyms for (?!), which, as we saw on the tricks page, is an expression that never matches, forcing the engine to backtrack in search of a different match. As for (*SKIP), it's a backtracking control verb in Perl, PCRE and Python's alternate regex engine. You can read all about it on my page about backtracking control verbs. When the engine tries to backtrack across (*SKIP), the match attempt explodes.
Instead of starting the next match attempt at the next starting position in the string, the engine advances to the string position corresponding to where (*SKIP) was encountered. This means that anything to the left of (*SKIP) is never visited again. Apart from time-saving benefits, this technique allows us to reject entire chunks of text in one go. Remember the overall recipe to avoid context X? It was Not_X|(GetThis) Using Perl, PCRE (PHP, R, C…) or Python's alternate regex engine, we can accomplish the same with either of these: Not_A(*SKIP)(*FAIL)|GetThis Not_A(*SKIP)(*F)|GetThis Not_A(*SKIP)(?!)|GetThis Note that the parentheses around GetThis have disappeared. Whenever the engine is able to match Not_A, the (*SKIP)(*FAIL) construct causes it to reject that entire chunk of text and start the next match attempt immediately afterwards. Whenever the engine is not able to match Not_A, it jumps to the right branch of the alternation | and tries to match GetThis. If this fails, the engine starts the next match attempt at the next starting position in the subject text, as always. If we want to avoid three contexts A, B and C, our technique used to do this: Not_A|Not_B|Not_C|(GetThis) In Perl and PHP, we can instead say something like one of these: Not_A(*SKIP)(*FAIL)|Not_B(*SKIP)(*F)|Not_C(*SKIP)(?!)|GetThis (?:Not_A|Not_B|Not_C)(*SKIP)(*FAIL)|(GetThis) Back the the article's Table of Contents

Code Samples

To complete this article, I'd like to provide a full implementation in several common languages.

A Call to Help

May 2014. I'm calling for your help to translate the examples provided to languages in which you are fluent (see code translators needed). In advance, thank you.

The six tasks performed by the code samples

The code performs the six most common regex tasks. The first four tasks answer the most common questions we use regex for:

- Does the string match?
- How many matches are there?
- What is the first match?
- What are all the matches?

The last two tasks perform two other common regex tasks:

- Replace all matches
- Split the string

Learn a new engine!

The code samples should allow even complete beginners to pick code fragments that suit their needs and tweak them to their liking. Please rest assured that beginner is not a condescending term here, and I am expecting "advanced beginners" to take advantage of the code. If you are proficient in regex in the context of one programming language, you may be curious to test out other engines, but also worried about the learning curve. Apart from illustrating various uses of the technique, the code samples allow you to start experimenting in a variety of regex flavors.

The assignment for the code samples

All the code samples tackle the same assignment. Our assignment is to match Tarzan followed by any number of digits, for instance Tarzan111, except:

1. between quotes, as in "Tarzan123",
2. somewhere inside curly braces, as in { Jane Tarzan123 }

For this assignment, I will use \d without attempting to distinguish between ASCII digits and Unicode digits, as that is not the point of the exercise. Just be aware that in some engines \d only matches the ASCII digits 0 to 9, while in others it also matches digits in other scripts. If you want to be consistent, use [0-9].

The Test Strings

To test the code, we'll use one string that produces two matches and a small variation that should produce none.

1. The string below should produce two matches: Tarzan11 and Tarzan22

    Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}

2. To test failure cases, I suggest you capitalize two z characters as in the string below, which should produce no matches:

    Jane" "Tarzan12" TarZan11@TarZan22 {4 Tarzan34}

The Regex

Here is the regex we'll use:

    {[^}]+}|"Tarzan\d+"|(Tarzan\d+)

1. The first part of the alternation {[^}]+} matches and neutralizes any content between curly braces.

2. The second part of the alternation "Tarzan\d+" matches and neutralizes instances where the sought string is embedded within double quotes. You may ask why I didn't simply neutralize any content between double quotes in similar fashion to the first part of the alternation, using "[^"]+". For most strings, that would have worked, but if you carefully inspect the test string, you'll see that I sneaked in an extra double quote after Jane. I did so to illustrate a safe regex work practice. See, if for any reason the subject string has an odd number of double quotes, as is the case here, you cannot be sure that two quotes matched by "[^"]+" belong together. Indeed, for our test string, that pattern would match a single space within double quotes, and the regex would (wrongly) capture Tarzan12 into Group 1. Therefore, when working with quotes, being specific as in "Tarzan\d+" is safer. In the case of braces (where there are distinct characters for the left and right sides), the risk of mismatches is far lower.

3. The third part of the alternation (Tarzan\d+) matches Tarzan and the following digits and captures the match into Group 1.

Here are jump points to code samples in various languages.

Implemented: PHP, C#, Python, Java, JavaScript, Ruby, Perl, VB.NET

Not Yet Implemented: Visual C++, Scala, other language of your choice
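The quote pitfall described in point 2 of the regex explanation can be checked with a short script. The sketch below uses Python's standard re module and compares the article's regex with the riskier "[^"]+" alternative on the test string. The variable and function names are only illustrative.

```python
import re

subject = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}'

# The article's regex: quotes are neutralized only around Tarzan + digits.
specific = re.compile(r'{[^}]+}|"Tarzan\d+"|(Tarzan\d+)')
# The riskier alternative: neutralize anything between two double quotes.
generic = re.compile(r'{[^}]+}|"[^"]+"|(Tarzan\d+)')

def group1_captures(pattern, text):
    # With a single capture group, findall returns the Group 1 text for
    # each match, or '' when the match came from a "garbage bin" branch.
    return [g for g in pattern.findall(text) if g]

print(group1_captures(specific, subject))  # ['Tarzan11', 'Tarzan22']
print(group1_captures(generic, subject))   # ['Tarzan12', 'Tarzan11', 'Tarzan22']
```

With the generic version, the stray quote after Jane pairs up with the opening quote of "Tarzan12", so Tarzan12 is left exposed and lands in Group 1.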

PHP Code Sample

For PHP, I'll provide two samples. The first illustrates the main technique on this page. The second illustrates the (*SKIP)(*F) variation specific to Perl and PHP, which is a little lighter.

Sample #1: The Core Technique

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

    <?php
    $regex = '~{[^}]+}|"Tarzan\d+"|(Tarzan\d+)~';
    $subject = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}';
    $count = preg_match_all($regex, $subject, $m);
    // build array of non-empty Group 1 captures
    $matches=array_filter($m[1]);

    ///////// The six main tasks we're likely to have ////////

    // Task 1: Is there a match?
    echo "*** Is there a Match? ***<br />\n";
    if(empty($matches)) echo "No<br />\n";
    else echo "Yes<br />\n";

    // Task 2: How many matches are there?
    echo "\n<br />*** Number of Matches ***<br />\n";
    echo count($matches)."<br />\n";

    // Task 3: What is the first match?
    echo "\n<br />*** First Match ***<br />\n";
    if(!empty($matches)) echo array_values($matches)[0]."<br />\n";

    // Task 4: What are all the matches?
    echo "\n<br />*** Matches ***<br />\n";
    if(!empty($matches)) {
        foreach ($matches as $match) echo $match."<br />\n";
    }

    // Task 5: Replace the matches
    $replaced = preg_replace_callback(
        $regex,
        // in the callback function, if Group 1 is empty,
        // set the replacement to the whole match,
        // i.e. don't replace
        function($m) {
            if(empty($m[1])) return $m[0];
            else return "Superman";
        },
        $subject);
    echo "\n<br />*** Replacements ***<br />\n";
    echo $replaced."<br />\n";

    // Task 6: Split
    // Start by replacing by something distinctive,
    // as in Step 5. Then split.
    $splits = explode("Superman",$replaced);
    echo "\n<br />*** Splits ***<br />\n";
    echo "<pre>";
    print_r($splits);
    echo "</pre>";
    ?>

Sample #2: The (*SKIP)(*F) Variation

This sample implements the technique explained in the Variation for Perl and PCRE section. Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

    <?php
    $regex = '~(?:{[^}]+}|"Tarzan\d+")(*SKIP)(*F)|Tarzan\d+~';
    $subject = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}';
    $count = preg_match_all($regex, $subject, $matches);
    // $matches[0] contains the matches, if any

    ///////// The six main tasks we're likely to have ////////

    // Task 1: Is there a match?
    echo "*** Is there a Match? ***<br />\n";
    if($count) echo "Yes<br />\n";
    else echo "No<br />\n";

    // Task 2: How many matches are there?
    echo "\n<br />*** Number of Matches ***<br />\n";
    if($count) echo count($matches[0])."<br />\n";
    else echo "0<br />\n";

    // Task 3: What is the first match?
    echo "\n<br />*** First Match ***<br />\n";
    if($count) echo $matches[0][0]."<br />\n";

    // Task 4: What are all the matches?
    echo "\n<br />*** Matches ***<br />\n";
    if($count) {
        foreach ($matches[0] as $match) echo $match."<br />\n";
    }

    // Task 5: Replace the matches
    $replaced = preg_replace($regex,"Superman",$subject);
    echo "\n<br />*** Replacements ***<br />\n";
    echo $replaced."<br />\n";

    // Task 6: Split
    $splits = preg_split($regex,$subject);
    echo "\n<br />*** Splits ***<br />\n";
    echo "<pre>";
    print_r($splits);
    echo "</pre>";
    ?>

C# Code Sample

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

    using System;
    using System.Text.RegularExpressions;
    using System.Linq;
    using System.Collections.Generic;

    class Program {
        static void Main() {
            string s1 = @"Jane"" ""Tarzan12"" Tarzan11@Tarzan22 {4 Tarzan34}";
            var myRegex = new Regex(@"{[^}]+}|""Tarzan\d+""|(Tarzan\d+)");
            var group1Caps = new List<string>();

            Match matchResult = myRegex.Match(s1);
            // put Group 1 captures in a list
            while (matchResult.Success) {
                if (matchResult.Groups[1].Value != "") {
                    group1Caps.Add(matchResult.Groups[1].Value);
                }
                matchResult = matchResult.NextMatch();
            }

            ///////// The six main tasks we're likely to have ////////

            // Task 1: Is there a match?
            Console.WriteLine("*** Is there a Match? ***");
            if(group1Caps.Any()) Console.WriteLine("Yes");
            else Console.WriteLine("No");

            // Task 2: How many matches are there?
            Console.WriteLine("\n" + "*** Number of Matches ***");
            Console.WriteLine(group1Caps.Count);

            // Task 3: What is the first match?
            Console.WriteLine("\n" + "*** First Match ***");
            if(group1Caps.Any()) Console.WriteLine(group1Caps[0]);

            // Task 4: What are all the matches?
            Console.WriteLine("\n" + "*** Matches ***");
            if (group1Caps.Any()) {
                foreach (string match in group1Caps) Console.WriteLine(match);
            }

            // Task 5: Replace the matches
            string replaced = myRegex.Replace(s1, delegate(Match m) {
                // m.Value is the same as m.Groups[0].Value
                if (m.Groups[1].Value == "") return m.Value;
                else return "Superman";
            });
            Console.WriteLine("\n" + "*** Replacements ***");
            Console.WriteLine(replaced);

            // Task 6: Split
            // Start by replacing by something distinctive,
            // as in Step 5. Then split.
            string[] splits = Regex.Split(replaced,"Superman");
            Console.WriteLine("\n" + "*** Splits ***");
            foreach (string split in splits) Console.WriteLine(split);

            Console.WriteLine("\nPress Any Key to Exit.");
            Console.ReadKey();
        } // END Main
    } // END Program

Python Code Sample

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

    import re
    # import regex # if you like good times
    # intended to replace `re`, the regex module has many advanced
    # features for regex lovers. http://pypi.python.org/pypi/regex

    # In a single-quoted Python string, double quotes need no doubling
    subject = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}'
    regex = re.compile(r'{[^}]+}|"Tarzan\d+"|(Tarzan\d+)')
    # put Group 1 captures in a list
    matches = [group for group in re.findall(regex, subject) if group]

    ######## The six main tasks we're likely to have ########

    # Task 1: Is there a match?
    print("*** Is there a Match? ***")
    if len(matches)>0:
        print("Yes")
    else:
        print("No")

    # Task 2: How many matches are there?
    print("\n" + "*** Number of Matches ***")
    print(len(matches))

    # Task 3: What is the first match?
    print("\n" + "*** First Match ***")
    if len(matches)>0:
        print(matches[0])

    # Task 4: What are all the matches?
    print("\n" + "*** Matches ***")
    if len(matches)>0:
        for match in matches:
            print(match)

    # Task 5: Replace the matches
    def myreplacement(m):
        if m.group(1):
            return "Superman"
        else:
            return m.group(0)
    replaced = regex.sub(myreplacement, subject)
    print("\n" + "*** Replacements ***")
    print(replaced)

    # Task 6: Split
    # Start by replacing by something distinctive,
    # as in Step 5. Then split.
    splits = replaced.split('Superman')
    print("\n" + "*** Splits ***")
    for split in splits:
        print(split)

Java Code Sample

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

    import java.util.*;
    import java.util.regex.*;

    class Program {
        public static void main (String[] args) throws java.lang.Exception {
            String subject = "Jane\" \"Tarzan12\" Tarzan11@Tarzan22 {4 Tarzan34}";
            Pattern regex = Pattern.compile("\\{[^}]+\\}|\"Tarzan\\d+\"|(Tarzan\\d+)");
            Matcher regexMatcher = regex.matcher(subject);
            List<String> group1Caps = new ArrayList<String>();

            // put Group 1 captures in a list
            while (regexMatcher.find()) {
                if(regexMatcher.group(1) != null) {
                    group1Caps.add(regexMatcher.group(1));
                }
            } // end of building the list

            ///////// The six main tasks we're likely to have ////////

            // Task 1: Is there a match?
            System.out.println("*** Is there a Match? ***");
            if(group1Caps.size()>0) System.out.println("Yes");
            else System.out.println("No");

            // Task 2: How many matches are there?
            System.out.println("\n" + "*** Number of Matches ***");
            System.out.println(group1Caps.size());

            // Task 3: What is the first match?
            System.out.println("\n" + "*** First Match ***");
            if(group1Caps.size()>0) System.out.println(group1Caps.get(0));

            // Task 4: What are all the matches?
            System.out.println("\n" + "*** Matches ***");
            if(group1Caps.size()>0) {
                for (String match : group1Caps) System.out.println(match);
            }

            // Task 5: Replace the matches
            // if only replacing, delete the line with the first matcher
            // also delete the section that creates the list of captures
            Matcher m = regex.matcher(subject);
            StringBuffer b = new StringBuffer();
            while (m.find()) {
                if(m.group(1) != null) m.appendReplacement(b, "Superman");
                // quoteReplacement keeps any $ or \ in the match literal
                else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
            }
            m.appendTail(b);
            String replaced = b.toString();
            System.out.println("\n" + "*** Replacements ***");
            System.out.println(replaced);

            // Task 6: Split
            // Start by replacing by something distinctive,
            // as in Step 5. Then split.
            String[] splits = replaced.split("Superman");
            System.out.println("\n" + "*** Splits ***");
            for (String split : splits) System.out.println(split);
        } // end main
    } // end Program

JavaScript Code Sample

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

    <script>
    var subject = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34} ';
    var regex = /{[^}]+}|"Tarzan\d+"|(Tarzan\d+)/g;
    var group1Caps = [];
    var match = regex.exec(subject);

    // put Group 1 captures in an array
    while (match != null) {
        if( match[1] != null ) group1Caps.push(match[1]);
        match = regex.exec(subject);
    }

    ///////// The six main tasks we're likely to have ////////

    // Task 1: Is there a match?
    document.write("*** Is there a Match? ***<br>");
    if(group1Caps.length > 0) document.write("Yes<br>");
    else document.write("No<br>");

    // Task 2: How many matches are there?
    document.write("<br>*** Number of Matches ***<br>");
    document.write(group1Caps.length);

    // Task 3: What is the first match?
    document.write("<br><br>*** First Match ***<br>");
    if(group1Caps.length > 0) document.write(group1Caps[0],"<br>");

    // Task 4: What are all the matches?
    document.write("<br>*** Matches ***<br>");
    if (group1Caps.length > 0) {
        for (var key in group1Caps) document.write(group1Caps[key],"<br>");
    }

    // Task 5: Replace the matches
    // see callback parameters http://tinyurl.com/ocddsuk
    var replaced = subject.replace(regex, function(m, group1) {
        // pick one of those two depending on JS version
        // if (group1 == "" ) return m;
        if (group1 == undefined ) return m;
        else return "Superman";
    });
    document.write("<br>*** Replacements ***<br>");
    document.write(replaced);

    // Task 6: Split
    // Start by replacing by something distinctive,
    // as in Step 5. Then split.
    var splits = replaced.split("Superman");
    document.write("<br><br>*** Splits ***<br>");
    for (var key in splits) document.write(splits[key],"<br>");
    </script>

Ruby Code Sample

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

    # In a single-quoted Ruby string, double quotes need no doubling
    subject = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}'
    regex = /{[^}]+}|"Tarzan\d+"|(Tarzan\d+)/
    # put Group 1 captures in an array
    group1Caps = []
    subject.scan(regex) {|m| group1Caps << $1 if !$1.nil? }

    ######## The six main tasks we're likely to have ########

    # Task 1: Is there a match?
    puts("*** Is there a Match? ***")
    if group1Caps.length > 0
        puts "Yes"
    else
        puts "No"
    end

    # Task 2: How many matches are there?
    puts "\n*** Number of Matches ***"
    puts group1Caps.length

    # Task 3: What is the first match?
    puts "\n*** First Match ***"
    if group1Caps.length > 0
        puts group1Caps[0]
    end

    # Task 4: What are all the matches?
    puts "\n*** Matches ***"
    if group1Caps.length > 0
        group1Caps.each { |x| puts x }
    end

    # Task 5: Replace the matches
    replaced = subject.gsub(regex) {|m|
        if $1.nil?
            m
        else
            "Superman"
        end
    }
    puts "\n*** Replacements ***"
    puts replaced

    # Task 6: Split
    # Start by replacing by something distinctive,
    # as in Step 5. Then split.
    splits = replaced.split(/Superman/)
    puts "\n*** Splits ***"
    splits.each { |x| puts x }

Perl Code Sample

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

    #!/usr/bin/perl
    $regex = '{[^}]+}|"Tarzan\d+"|(Tarzan\d+)';
    $subject = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}';

    # put Group 1 captures in an array
    my @group1Caps = ();
    while ($subject =~ m/$regex/g) {
        if (defined $1) {push(@group1Caps,$1);}
    }

    ######## The six main tasks we're likely to have ########

    # Task 1: Is there a match?
    print "*** Is there a Match? ***\n";
    if ( @group1Caps > 0) { print "Yes\n"; }
    else { print ("No\n"); }

    # Task 2: How many matches are there?
    print "\n*** Number of Matches ***\n";
    print scalar(@group1Caps);

    # Task 3: What is the first match?
    print "\n\n*** First Match ***\n";
    if ( @group1Caps > 0) { print $group1Caps[0]; }

    # Task 4: What are all the matches?
    print "\n\n*** Matches ***\n";
    if ( @group1Caps > 0) {
        foreach(@group1Caps) { print "$_\n"; }
    }

    # Task 5: Replace the matches
    # or: s/$regex/$1? "Superman":$&/eg
    ($replaced = $subject) =~ s/$regex/
        if (defined $1) { "Superman"; } else {$&;}
        /eg;
    print "\n*** Replacements ***\n";
    print $replaced . "\n";

    # Task 6: Split
    # Start by replacing by something distinctive,
    # as in Step 5. Then split.
    @splits = split(/Superman/, $replaced);
    print "\n*** Splits ***\n";
    foreach(@splits) { print "$_\n"; }

VB.NET Code Sample

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter. (The code compiles perfectly in VS2015, but no online demo supplied because the VB.NET in ideone chokes on anonymous functions.)

    Imports System.Text.RegularExpressions

    Module Module1
        Sub Main()
            Dim MyRegex As New Regex("{[^}]+}|""Tarzan\d+""|(Tarzan\d+)")
            Dim Subject As String = "Jane"" ""Tarzan12"" Tarzan11@Tarzan22 {4 Tarzan34} "
            Dim Group1Caps As New List(Of String)()
            Dim MatchResult As Match = MyRegex.Match(Subject)

            ' put Group 1 captures in a list
            While MatchResult.Success
                If MatchResult.Groups(1).Value <> "" Then
                    Group1Caps.Add(MatchResult.Groups(1).Value)
                End If
                MatchResult = MatchResult.NextMatch()
            End While

            '///////// The six main tasks we're likely to have ////////

            '// Task 1: Is there a match?
            Console.WriteLine("*** Is there a Match? ***")
            If(Group1Caps.Any()) Then
                Console.WriteLine("Yes")
            Else
                Console.WriteLine("No")
            End If

            '// Task 2: How many matches are there?
            Console.WriteLine(vbCrLf & "*** Number of Matches ***")
            Console.WriteLine(Group1Caps.Count)

            '// Task 3: What is the first match?
            Console.WriteLine(vbCrLf & "*** First Match ***")
            If(Group1Caps.Any()) Then Console.WriteLine(Group1Caps(0))

            '// Task 4: What are all the matches?
            Console.WriteLine(vbCrLf & "*** Matches ***")
            If (Group1Caps.Any()) Then
                For Each match as String in Group1Caps
                    Console.WriteLine(match)
                Next
            End If

            '// Task 5: Replace the matches
            Dim Replaced As String = myRegex.Replace(Subject,
                Function(m As Match)
                    If (m.Groups(1).Value = "") Then
                        Return m.Groups(0).Value
                    Else
                        Return "Superman"
                    End If
                End Function)
            Console.WriteLine(vbCrLf & "*** Replacements ***")
            Console.WriteLine(Replaced)

            ' Task 6: Split
            ' Start by replacing by something distinctive,
            ' as in Step 5. Then split.
            Dim Splits As Array = Regex.Split(replaced,"Superman")
            Console.WriteLine(vbCrLf & "*** Splits ***")
            For Each Split as String in Splits
                Console.WriteLine(Split)
            Next

            Console.WriteLine(vbCrLf & "Press Any Key to Exit.")
            Console.ReadKey()
        End Sub
    End Module

Code Translators Needed

I would love to enlist your help so the page can provide working code in more languages. Please see the list of languages for languages currently implemented and missing. If you wish, you will be duly acknowledged with your name or an alias of your choice.

Are you willing to help? Fantastic. To make things easy for me, your code needs to mirror the specs of the other samples. Here are the requirements that come to mind:

- Completeness. The idea is to provide code that someone who has never used your language is able to plug into an IDE, compile (if needed) and run. So please include any opening braces and the few needed lines to import any relevant libraries.
- Conciseness. By the same token, please omit any unneeded fluff, such as unneeded libraries and classes.
- Same example. To keep things consistent, please use the regex and subject string provided.
- Six tasks. The code must include separate sections that could be run separately if needed, addressing the six common tasks illustrated by the code already on the page: (i) checking whether there is a match, (ii) counting the matches, (iii) returning the first match, (iv) returning all matches, (v) replacing all matches, (vi) splitting the string.
- Formatted output. If you run the existing demos, you'll see that they output certain strings at each step to inform us of where we are in the code. Your code should output those same strings.
- Link to a working demo. For consistency, if ideone.com supports your language, please use it.

If you paste your code in the comment form it may not make it to me intact, but you can paste an ideone.com link or a brief message. I'll reply. Html won't work in the comment form. A million thanks in advance!

Well, I think that's about all I have to say about this technique at the moment. Writing it was a big journey. I hope you had a blast reading it.

Wishing you loads of fun on your travels in regexland,

Rex
Hi, Line 37 of the JS code, "if (group1 == "" ) return m;" should be "if (group1 == undefined ) return m;" for the code to work correctly.

Reply to Ivan
Thank you Ivan. The code worked when I wrote it, but JS specs change over time and vary from platform to platform, so I'm glad you let me know about the latest. Warm regards, Rex

Hi Toomas, Thank you so much for your nitpicks, man! I really appreciate them. Perl is not my idiom so I'm sure what I have is quite heavy. Added your Perl code as a comment line above what was there. Fixed the others. Wishing you a fun week, Rex

Hi Omer, Thank you very much for reporting typos. I really appreciate it. Fixed. Wishing you a fun weekend, Rex

Subject: Typo
First sentence after option 3 heading: s/that/than/

Very well done. All your step-by-step examples make this article superb.

That is so simple that it's genius! I just started to learn regex and am glad I found this site so I don't waste time struggling with it when you cut right to the chase. Thanks!

Subject: Thank you!
You are awesome. Thanks for the trick and the page.

Conditional Regex Replacement in Text Editor

Often, the need arises to replace matches with different strings depending on the match itself. For instance, let's say you want to match words for colors, such as red and blue, and replace them with their French equivalents, rouge and bleu. Therefore, the string:

    blue cheese, a red nose

should turn into:

    bleu cheese, a rouge nose

Using regex, this is no problem in most programming languages, where you can call a function to compute replacements. (Depending on context, such functions may be called lambdas, delegates or callbacks.) In fact, if you wanted you could compute replacements by talking to a NASA server and requesting a piece of data from a machine on the moon. In this light, regex replacements are really flexible. You can do whatever you like.

In a text editor, however, it's a different story. You can insert new text, you can insert text matched by capture groups, but there is no syntax for conditional insertion, e.g., inserting bleu if you matched blue, at least not in any of the tools I know.

The purpose of this page is to show you a trick I came up with that allows you to do just that. Note that this technique will only work in flavors that allow you to set a capture within a lookahead. You'll also need an editor with strong regex capabilities, such as EditPad Pro or Notepad++.

I'll show you two similar ideas (using a replacement pool or a dictionary) and some variations.

Conditional Replacement using a Replacement Pool

In this version, at the bottom of the file, we temporarily paste a "pool" with the possible replacement texts. Keeping our example, we just paste bleu rouge. In the editor, the text looks like this:

    blue cheese, a red nose.
    bleu rouge

We then use a pattern that matches blue or red. If it matches blue, we look ahead for bleu, which we know we'll find in the pool at the bottom, and capture it. We do the same for red. In free-spacing mode, the regex looks like this:

    (?sx)
    \bblue\b(?=.*(bleu))
    |
    \bred\b(?=.*(rouge))

Or, on one line:

    (?s)\bblue\b(?=.*(bleu))|\bred\b(?=.*(rouge))

What do we replace our matches with? In the blue match case, the replacement is captured to Group 1. In the red case, it is captured to Group 2. When one group is set, the other is empty, so gluing them together with \1\2 just results in the one that is set:

    bleu + "" yields bleu
    "" + rouge yields rouge

Following this principle, if we had five replacements, our replacement string would be \1\2\3\4\5

Variation: branch reset

In regex flavors that support the (?|...) branch reset syntax, you can capture the replacements to a unique group, so the replacement string becomes a simple \1

In the regex, you just need to wrap the alternation in a branch reset:

    (?sx)
    (?|    # branch reset: both captures go to Group 1
      \bblue\b(?=.*(bleu))
      |
      \bred\b(?=.*(rouge))
    )

Even if you have five replacements, the replacement string will still be \1.
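The trick is meant for text editors, but it can be tried out in code too. Here is a minimal sketch with Python's standard re module, which allows captures inside lookaheads but has no branch reset, so it uses the \1\2 style of replacement. The function name and the way the pool is appended and stripped are assumptions made for the demonstration, not part of the original technique.

```python
import re

POOL = "bleu rouge"
# One-line version of the pool regex from the text above.
PATTERN = re.compile(r"(?s)\bblue\b(?=.*(bleu))|\bred\b(?=.*(rouge))")

def translate_with_pool(text: str) -> str:
    # Temporarily paste the pool at the bottom, as we would in an editor.
    padded = text + "\n" + POOL
    # Equivalent of the \1\2 replacement: the unset group counts as "".
    replaced = PATTERN.sub(
        lambda m: (m.group(1) or "") + (m.group(2) or ""), padded
    )
    # The pool itself is never matched, so it is still intact at the end
    # and can be removed by length (pool plus the newline before it).
    return replaced[: -(len(POOL) + 1)]

print(translate_with_pool("blue cheese, a red nose."))
```

In real code a replacement function with a lookup table would be simpler. The point here is only to show that the lookahead capture behaves as described.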

Conditional Replacement using a Dictionary

In this version of my conditional replacement trick, instead of pasting a replacement pool at the bottom, we paste a "dictionary".

    Dictionary:blue=bleu:red=rouge:green=vert

My choice of the term "dictionary" is not innocent. Of course in this case we have a dictionary in the everyday sense. But this could also be a dictionary in the computing sense, i.e. a data structure that contains pairs of unique keys and not-necessarily unique values. In some languages, this is called a hash table or an associative array.

Our text input becomes something like this:

    blue cheese, a red nose.
    Dictionary:blue=bleu:red=rouge:green=vert

For our regex, let's start with this:

    (?s)\b(blue|red)\b(?=.*:\1=(\w+)\b)

This matches either color, then looks further in the file for a dictionary entry of the form :original=translation, capturing the translation to Group 2. Our replacement is therefore \2. Of course if there's a chance that the actual text would contain segments that look like dictionary entries, the regex would have to be refined.

Variation when matches are dense (full translation)

In the previous pattern, we specifically look for the literals blue and red because we do not want to give the engine the burden of looking up every word in the dictionary. However, when nearly every word in your file is a match to be translated, including every word in the regex becomes burdensome. Instead, we can simplify the regex by just matching any word:

    (?s)\b(\w+)\b(?=.*:\1=(\w+)\b)

The replacement is still \2.
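The dictionary variation can be sketched the same way. The snippet below, again using Python's standard re module, applies the "any word" pattern with a \2 replacement. The function name and the append-then-strip handling of the dictionary line are assumptions for the demonstration.

```python
import re

DICTIONARY = "Dictionary:blue=bleu:red=rouge:green=vert"
# "Dense" variation: try every word, keep it only if the dictionary
# further down contains an entry of the form :word=translation
PATTERN = re.compile(r"(?s)\b(\w+)\b(?=.*:\1=(\w+)\b)")

def translate_with_dictionary(text: str) -> str:
    # Temporarily paste the dictionary at the bottom of the text.
    padded = text + "\n" + DICTIONARY
    # Group 2 holds the translation captured inside the lookahead.
    replaced = PATTERN.sub(r"\2", padded)
    # Keys inside the dictionary line never match, because no second
    # ":key=" entry follows them, so the line can be removed by length.
    return replaced[: -(len(DICTIONARY) + 1)]

print(translate_with_dictionary("blue cheese, a red nose."))
print(translate_with_dictionary("a green door"))
```

Words without a dictionary entry (cheese, nose, door) fail the lookahead and are left alone.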

Example with Ten Replacements: Translating Japanese Digits

Several years after writing this trick, this very question came up on the RegexBuddy forum. In Japanese, digits can either be represented with the native Kanji characters (imported from Chinese), or with Arabic numerals. Unicode has code points dedicated to the fullwidth forms of these digits used in Japanese text (U+FF10 to U+FF19), separate from their ASCII code points, and the goal of the question was to translate these fullwidth code points to their ASCII counterparts.

This seems like a perfect chance to showcase the technique with more than the two replacements of our simple example. All four versions are shown. This is the exact same technique as above, so no explanation is needed. (If you have a question, please use the form at the bottom.)

Here is the text to be transformed, and the desired output. You might not see the difference until you stare at the shape of the digits.

    ==== Original ====
    ０ zero １ ichi ２ ni ３ san ４ shi ５ go ６ roku ７ shichi ８ hachi ９ kyuu

    ==== Desired Output ====
    0 zero 1 ichi 2 ni 3 san 4 shi 5 go 6 roku 7 shichi 8 hachi 9 kyuu

1. Pool trick

At the bottom of the text, we paste this pool:

    0123456789

Our first regex is:

    (?sx)
     ０(?=.*(0))
    |１(?=.*(1))
    |２(?=.*(2))
    |３(?=.*(3))
    |４(?=.*(4))
    |５(?=.*(5))
    |６(?=.*(6))
    |７(?=.*(7))
    |８(?=.*(8))
    |９(?=.*(9))

Our replacement is \1\2\3\4\5\6\7\8\9${10}

2. Pool trick, branch reset version

With a branch reset, our replacement is just \1

The regex becomes:

    (?sx)
    (?|
       ０(?=.*(0))
      |１(?=.*(1))
      |２(?=.*(2))
      |３(?=.*(3))
      |４(?=.*(4))
      |５(?=.*(5))
      |６(?=.*(6))
      |７(?=.*(7))
      |８(?=.*(8))
      |９(?=.*(9))
    )

3. Dictionary trick, specific matches

For the dictionary trick, we'll paste this at the bottom of our text:

    Dictionary:０=0:１=1:２=2:３=3:４=4:５=5:６=6:７=7:８=8:９=9

We can use this regex:

    (?sx)(０|１|２|３|４|５|６|７|８|９) (?=.*:\1=(\w+)\b)

For the replacement, we use \2.

4. Dictionary trick, variable matches

Here we simplify the regex to:

    (?sx)(\b\w+\b) (?=.*:\1=(\w+)\b)

Note that the \w+ caters for cases where the dictionary contains more than digits. If we know the dictionary only contains digits, we can use:

    (?sx)(\p{Nd}) (?=.*:\1=(\w+)\b)

The replacement is still \2.

Subject: compliment
That idea is just very nice! I Just came across it, and it solved my problem. Very likely I'll be using this idea again in the future. Just wanted to say, I like the way you think! Congrats!

Subject: Conditional replace in regex
This is absolute genius! Thank you for sharing it!

On Which Line Number is the First Match?

Using only regex, can you tell on which line a match was found? You could do that with a Perl one-liner using $. to print the line, but in pure regex, the answer should be "No: a regex cannot give you a line number."

And that is probably a fair answer. But this page presents tricks that allow you to return the line number using only regular expressions. They may not be tricks you want to put into practice, but they're a great excuse to look at three forms of advanced regex syntax (which form the backbone of the three solutions): recursion, self-referencing groups and balancing groups.

Input for the techniques

To demonstrate the techniques, we will use this input:

    Paint it white
    Paint it black
    Why not blue?
    Or red or brown?

Our aim will be to match the line number where the first instance of blue can be found. The techniques rely on a hack: at the bottom of the input, we will paste a list of numbers, separated by a unique delimiter (something that will not appear anywhere else in the file). For our tests, we will use a ~ as a delimiter. Our input becomes:

    Paint it white
    Paint it black
    Why not blue?
    Or red or brown?
    ~1~2~3~4~5~6~7~8~9~10

If need be, generating that list of numbers programmatically would be a simple matter.

Inspiration for these tricks: SQL

The inspiration for the main idea behind all three solutions is a classic database hack. Some databases, such as MySQL before version 8.0, do not provide syntax to return a row number, so a well-known workaround is to join to a table of integers. Another use for a table of integers is to provide the equivalent of a for loop within a SELECT statement, letting you for instance generate a list of the 30 dates after the current date.
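For instance, the counter line could be generated with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes that one number per line of input is enough, since the first match cannot be on a line beyond the last one.

```python
def append_line_counter(text: str, delimiter: str = "~") -> str:
    # The delimiter must not appear anywhere else in the input.
    if delimiter in text:
        raise ValueError("delimiter already occurs in the text")
    line_count = len(text.splitlines())
    # Build ~1~2~3... with one entry per line of input.
    counter = "".join(f"{delimiter}{n}" for n in range(1, line_count + 1))
    # Paste the counter on its own line at the bottom.
    return text.rstrip("\n") + "\n" + counter

sample = "Paint it white\nPaint it black\nWhy not blue?\nOr red or brown?"
print(append_line_counter(sample))
```

The article's sample input pastes ten numbers, which also works; any list at least as long as the number of lines will do.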

Outline

Here are jumping points to the techniques we'll look at.
Match Line Number Using Recursion
Match Line Number Using Self-Referencing Group
Self-Reference Variation: Reverse the Line Numbers
Match Line Number Using Balancing Groups

Match Line Number Using Recursion

This solution uses recursion, which is available in Perl, PCRE and Matthew Barnett's regex module for Python. In turn, PCRE is used in contexts such as PHP, R and Delphi. You can test this solution in Notepad++ or EditPad Pro.

The point of the recursion is not immediately obvious, so I'll give an overview before diving into the regex. The idea of the recursive structure, which lives inside a lookahead, is to balance each non-blue line with a digit. This is similar to what we do when we balance nested parentheses ((( … ))) using recursion, except that here we have: non-blue-line non-blue-line non-blue-line ~1~2~3. The last ~digit segment is captured to Group 2. Group 1, which contains the recursion, is optional, which makes the surrounding lookahead optional. This is because if blue is on the first line, no lines are skipped. After the lookahead, we match blue, then if Group 2 was set, we match it. Either way, we look for the next ~digit segment and return the digit as the match.

(?xsm)                      # free-spacing mode, DOTALL, multi-line
(?=.*?blue)                 # if blue isn't there, fail without delay

###### Recursive Section ######
# This section aims to balance non-blue lines with digits, i.e.
# nonBlueLine,nonBlueLine,nonBlueLine ... ~1~2~3
# The last digit block is captured to Group 2, e.g. ~3
(?=                         # lookahead
   (                        # Group 1
      (?:                   # skip one line that doesn't contain blue
         ^                  # start of line
         (?:(?!blue)[^\r\n])*  # zero or more chars
                            # that do not start blue
         (?:\r?\n)          # newline
      )
      (?:(?1)|[^~]+)        # recurse Group 1 OR match all non-tilde chars
      (~\d+)                # match a sequence of digits
   )?                       # End Group 1
)                           # End lookahead.
# Group 2, if set, now contains the number of lines skipped
.*?                         # lazily match chars up to...
blue                        # match blue
.*?                         # lazily match chars up to...
(?(2)\2)                    # if Group 2 is set, match Group 2
~                           # Match the next tilde
\K                          # drop what was matched so far
\d+                         # match the next digits: this is the match

In this live regex demo, you can see that the match is 3 (blue is on line 3). You can also inspect the content of Groups 1 and 2, and play with the input (move the first blue to other lines).
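To make the balancing idea more concrete, here is a procedural sketch in Python. It is not the regex itself (Python's built-in re module has no recursion): it is an emulation in which each level of a recursive function takes one non-blue line on the way in and one ~number token on the way out, the way Group 1 does. The names balance and first_match_line are hypothetical, and the sketch assumes the ~ delimiter appears only in the number list and that the list is long enough.

```python
import re

def balance(lines, tokens, word):
    """Emulate Group 1: one non-matching line going in, one ~N token coming out."""
    if not lines or word in lines[0]:
        return []                          # innermost level: nothing left to skip
    consumed = balance(lines[1:], tokens, word)
    return consumed + [tokens[len(consumed)]]

def first_match_line(text, word="blue"):
    """Return the line number of the first line containing word, as a string."""
    body, _, tail = text.partition("~")
    if word not in body:
        return None                        # same early exit as (?=.*?blue)
    tokens = re.findall(r"~\d+", "~" + tail)
    skipped = balance(body.split("\n"), tokens, word)
    # skipped[-1], when present, plays the role of Group 2;
    # the answer is the token that follows it
    return tokens[len(skipped)].lstrip("~")

sample = ("Paint it white\nPaint it black\nWhy not blue?\n"
          "Or red or brown?\n~1~2~3~4~5~6~7~8~9~10")
print(first_match_line(sample))
```

With two skipped lines, the recursion unwinds through ~1 then ~2, so the token returned is the one after ~2.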

Match Line Number Using Self-Referencing Group

This technique uses a self-referencing capture group, that is, a group that refers to itself. It's not hard, but it may not be immediately clear if you haven't seen the technique before, so I'll give you an overview. We match the non-blue lines one by one. For each line we match, we look ahead to the string of digits at the bottom, and we use Group 1 to capture a portion of that string. This is Group 1: ((?(1)\1)~\d+). Group 1 says "if Group 1 is already set, match what Group 1 has captured so far. Then, regardless, match a tilde and some digits." This means that with each non-blue line we match, Group 1 grows to capture an ever-longer portion of the digit string.

(?xsm)                      # free-spacing mode, DOTALL, multi-line
(?=.*?blue)                 # if blue isn't there, fail without delay

########### LINE SKIPPER / COUNTER ############
(?:                         # start non-capture group
                            # the aim is to skip lines that don't contain blue
                            # and capture a corresponding digit sequence
   (?:                      # skip one line that doesn't contain blue
      ^                     # beginning of line
      (?:(?!blue)[^\r\n])*  # zero or more chars
                            # that do not start blue
      (?:\r?\n)             # newline chars
   )
   # With each line skipped, let Group 1 capture
   # an ever-growing portion of the string of numbers
   (?=                      # lookahead
      [^~]+                 # skip all chars that are not tildes
      (                     # start Group 1
         (?(1)\1)           # if Group 1 is set, match Group 1
         # (?>\1?)          # alternate phrasing for the above
         ~\d+               # match a tilde and digits
      )                     # end Group 1
   )                        # end lookahead
)*+                         # end counter-line-skipper: zero or more times
                            # the possessive + forbids backtracking
.*?                         # lazily match any chars up to...
blue                        # match blue
[^~]+                       # match any non-tilde chars
(?(1)\1)                    # if Group 1 has been set, match it
# \1?                       # alternate phrasing for the above
~                           # match a tilde
\K                          # drop what we matched so far
\d+                         # match digits. This is the match!

In this live regex demo, you can see that the match is 3 (blue is on line 3). You can also inspect the content of Group 1 and play with the input (move the first blue to other lines).

Self-Referencing Group Variation: Reverse the Line Numbers

In this interesting variation, we reverse the line numbers at the bottom of the file: ~10~9~8~7~6~5~4~3~2~1

This has several benefits. First, we can shoot all the way to the back of the file with a simple .* and know we have reached the digits section. That is more satisfying than looking for the digits section with [^~]+. Second, we don't have to worry that our "unique" delimiter (here a simple tilde ~) might be used somewhere else in the input: we shoot down to the end and backtrack from there. This makes the situation even more similar to being able to inspect a separate table or file.

The code is nearly the same: in the self-referencing group, instead of appending digits to the existing capture with ((?(1)\1)~\d+), we prepend them with (~\d+(?(1)\1)). Our input becomes:

Paint it white
Paint it black
Why not blue?
Or red or brown?
~10~9~8~7~6~5~4~3~2~1

(?xsm)                      # free-spacing mode, DOTALL, multi-line
(?=.*?blue)                 # if blue isn't there, fail without delay

########### LINE SKIPPER / COUNTER ############
(?:                         # start non-capture group
                            # the aim is to skip lines that don't contain blue
                            # and capture a corresponding digit sequence
   (?:                      # skip one line that doesn't contain blue
      ^                     # beginning of line
      (?:(?!blue)[^\r\n])*  # zero or more chars
                            # that do not start blue
      (?:\r?\n)             # newline chars
   )
   # With each line skipped, let Group 1 capture
   # an ever-growing portion of the string of numbers
   (?=                      # lookahead
      .*                    # Go to the end of the file
      (                     # start Group 1
         ~\d+               # match a tilde and digits
         (?(1)\1)           # if Group 1 is set, match Group 1
      )                     # end Group 1
   )                        # end lookahead
)*+                         # end counter-line-skipper: zero or more times
                            # the possessive + forbids backtracking
.*?                         # lazily match any chars up to...
blue                        # match blue
.*                          # Get to the end of the data
~                           # match a tilde
\K                          # drop what we matched so far
\d+                         # match digits. This is the match!
(?=                         # Lookahead (this positions us in the right place)
   (?(1)\1)                 # If Group 1 has been set, match it
)                           # End lookahead

In this live regex demo, you can see that the match is 3 (blue is on line 3). You can also inspect the content of Group 1 and play with the input (move the first blue to other lines).

Match Line Number Using Balancing Groups

This version uses an outstanding regex feature exclusive to the .NET engine: balancing groups. We use a group named c to serve as a counter of lines that don't contain blue. Of course there is no such thing as a "counter"… But each capture for Group c is added to the Capture Collection stack, and that stack has a length (which you can later inspect with match.Groups["c"].Captures.Count).

After "incrementing the counter" while skipping the non-blue lines (meaning that for each non-blue line we add a capture to the Group c collection), we match the line with blue and get to the beginning of the digit sequence. There the fun begins: as long as we can decrement c (i.e., pop an element from the Group c captures), we match a digit sequence. The digits matched at this point (if any) therefore correspond to the skipped lines. And the digit for the line containing blue is the next one in the sequence.

Don't let the explanation scare you: the code is probably simpler than the explanation! For some reason .NET doesn't seem to do well with the capture-popping syntax (?<-c> … ) when it's inside a lookbehind, so instead of matching the line number directly, we will capture it to Group 1.

(?xsm)                      # free-spacing, DOTALL, multi-line
(?=.*?blue)                 # if blue isn't there, fail without delay
\A                          # Assert position at the beginning of the input

########### LINE SKIPPER / COUNTER ############
(?<c>                       # Add a capture to Group c for each line that
                            # doesn't contain blue. Think of Group c as
                            # a counter (we are only interested in the
                            # number of captures it contains)
   ^                        # beginning of line
   (?:(?!blue)[^\r\n])*     # zero or more chars
                            # that do not start blue
   (?:\r?\n)                # newline chars
)                           # end Group c
*                           # repeat Group c as long as we can
                            # find non-blue lines

########### AFTER SKIPPING ############
.*?                         # lazily match any chars up to...
blue                        # match blue
[^~]+                       # match any non-tilde chars

########### Number of Skipped Lines (if any) ############
# To get to the number of skipped lines in the digit sequence,
# for each Group c capture (each skipped line), we pop one
# element from Group c ("decrement c") and match the next digits
(?(c)                       # Conditional: If Group c has been set
   (?<-c>                   # Pop one capture from Group c / "decrement c"
      ~\d+                  # Match the next tilde and digits
   )                        # End of popping / "decrementing" group
   *                        # Zero or more times: We will only pop elements
                            # (and therefore match new digits) as long as
                            # Group c still contains captures
)                           # End Conditional checking if Group c has been set

####### Finally: the next digits are the right one ########
~                           # Match the tilde
(\d+)                       # Capture the digits to Group 1

In this live regex demo, inspect the captures: you will see that Group 1 is 3.

The special variable that contains "current line number" is $. , not &. Also, in Perl, another alternative to get the line number is to use pos() inside embedded code. pos() gives the character offset, and we can scan $_ for newlines to get line number + column number.

Reply to perlancar
Typo fixed, thank you -- and thanks for your other idea as well.
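The offset-based alternative suggested in the comment above (take the character offset of the match, then count newlines before it) can be sketched outside of regex. The version below is written in Python rather than Perl, and locate_first_match is a hypothetical helper name; it illustrates the idea and does not reproduce Perl's pos().

```python
import re

def locate_first_match(pattern: str, text: str):
    """Return (line, column), both 1-based, of the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    offset = match.start()
    line = text.count("\n", 0, offset) + 1        # newlines before the match
    last_newline = text.rfind("\n", 0, offset)    # -1 when the match is on line 1
    column = offset - last_newline
    return line, column

sample = "Paint it white\nPaint it black\nWhy not blue?\nOr red or brown?"
print(locate_first_match(r"blue", sample))
```

This needs no list of numbers appended to the input, at the cost of stepping outside pure regex.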

Regex to Match Numbers in Plain English

This page presents a regular expression to match numbers in plain English, such as:

one trillion
seven hundred twenty two
zero point nine five
nine hundred ninety nine thousand two hundred thirteen

This pattern was an excuse to build a beautiful regex using the (?(DEFINE)…) syntax. As such, it only works in engines that support that syntax: currently Perl, PCRE (PHP, Delphi, R…) and Python's alternate regex engine.

The regex is built in a modular way, like lego. We define some named subroutines (one_to_9, ten_to_19), then more named subroutines that build on the earlier ones: one_to_99, one_to_999… At the bottom, after defining all the groups, is where the real matching begins. There you can decide to match a big number by calling the bignumber subroutine with (?&bignumber), or to match smaller numbers using subroutines such as (?&one_to_99), and so on.

There are plenty of comments in the regex, so if you read the defined subroutine page, you shouldn't need further explanations.

(?x)           # free-spacing mode
(?(DEFINE)
  # Within this DEFINE block, we'll define many subroutines
  # They build on each other like lego until we can define
  # a "big number"

  (?<one_to_9>
  # The basic regex:
  # one|two|three|four|five|six|seven|eight|nine
  # We'll use an optimized version:
  # Option 1: four|eight|(?:fiv|(?:ni|o)n)e|t(?:wo|hree)|
  #           s(?:ix|even)
  # Option 2:
  (?:f(?:ive|our)|s(?:even|ix)|t(?:hree|wo)|(?:ni|o)ne|eight)
  ) # end one_to_9 definition

  (?<ten_to_19>
  # The basic regex:
  # ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|
  # eighteen|nineteen
  # We'll use an optimized version:
  # Option 1: twelve|(?:(?:elev|t)e|(?:fif|eigh|nine|(?:thi|fou)r|
  #           s(?:ix|even))tee)n
  # Option 2:
  (?:(?:(?:s(?:even|ix)|f(?:our|if)|nine)te|e(?:ighte|lev))en|
  t(?:(?:hirte)?en|welve))
  ) # end ten_to_19 definition

  (?<two_digit_prefix>
  # The basic regex:
  # twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety
  # We'll use an optimized version:
  # Option 1: (?:fif|six|eigh|nine|(?:tw|sev)en|(?:thi|fo)r)ty
  # Option 2:
  (?:s(?:even|ix)|t(?:hir|wen)|f(?:if|or)|eigh|nine)ty
  ) # end two_digit_prefix definition

  (?<one_to_99>
  (?&two_digit_prefix)(?:[- ](?&one_to_9))?|(?&ten_to_19)|
  (?&one_to_9)
  ) # end one_to_99 definition

  (?<one_to_999>
  (?&one_to_9)[ ]hundred(?:[ ](?:and[ ])?(?&one_to_99))?|
  (?&one_to_99)
  ) # end one_to_999 definition

  (?<one_to_999_999>
  (?&one_to_999)[ ]thousand(?:[ ](?&one_to_999))?|
  (?&one_to_999)
  ) # end one_to_999_999 definition

  (?<one_to_999_999_999>
  (?&one_to_999)[ ]million(?:[ ](?&one_to_999_999))?|
  (?&one_to_999_999)
  ) # end one_to_999_999_999 definition

  (?<one_to_999_999_999_999>
  (?&one_to_999)[ ]billion(?:[ ](?&one_to_999_999_999))?|
  (?&one_to_999_999_999)
  ) # end one_to_999_999_999_999 definition

  (?<one_to_999_999_999_999_999>
  (?&one_to_999)[ ]trillion(?:[ ](?&one_to_999_999_999_999))?|
  (?&one_to_999_999_999_999)
  ) # end one_to_999_999_999_999_999 definition

  (?<bignumber>
  zero|(?&one_to_999_999_999_999_999)
  ) # end bignumber definition

  (?<zero_to_9>
  (?&one_to_9)|zero
  ) # end zero to 9 definition

  (?<decimals>
  point(?:[ ](?&zero_to_9))+
  ) # end decimals definition

) # End DEFINE

####### The Regex Matching Starts Here ########
(?&bignumber)(?:[ ](?&decimals))?

### Other examples of groups we could match ###
#(?&bignumber)
# (?&one_to_99)
# (?&one_to_999)
# (?&one_to_999_999)
# (?&one_to_999_999_999)
# (?&one_to_999_999_999_999)
# (?&one_to_999_999_999_999_999)

You can play with the regex and sample text in this live regex demo.

As a convenience to PCRE users, with the permission of Philip Hazel, I aim to provide a mirror to the latest PCRE documentation whenever it is released. To download the latest PCRE, see pcre.org. Apart from links to various versions of the PCRE documentation, this page presents a curated list of new feature introductions in PCRE's pattern syntax, as well as links to other PCRE-related material on RexEgg.

Index

For easy navigation, here are some jumping points to various sections of the page:

Change Log

For the latest official PCRE2 revision history (ChangeLog), follow the link, which should remain the same when new versions are released. For the official "PCRE 1" revision history (ChangeLog), follow the link, which shows all changes up to the latest version of PCRE1. For a brief, curated history of additions to the syntax, see Additions to PCRE further down.

Documentation

Versions 10.0 and higher are called PCRE2. PCRE2 contains a new API, which includes a replacement function: pcre2_substitute(). The latest PCRE2 documentation should always be available at this link. If you are mostly interested in PCRE's regex syntax, the most important file in the PCRE2 documentation is the pcre2pattern man page. The pcre2api file has the replacement syntax.

Versions below 10.0, sometimes known as "PCRE 1", are the original PCRE library, still widely used but now in bug-fix mode only (no new features to be introduced). The latest "PCRE 1" documentation should always live at this link. If you are mostly interested in PCRE's regex syntax, the most important file in the "PCRE 1" documentation is the pcrepattern man page.

Feature Additions to the PCRE Pattern Syntax

This section is not the full PCRE change log. Instead, it presents the version and date when new features were added to the pattern syntax. This is a curated collection that does not claim to be exhaustive. For the full story, see the change log for PCRE and the change log for PCRE2.
Version | Date | Change
10.36 | 4 Dec 2020 | Added CET_CFLAGS option for Intel CET
10.35 | 9 May 2020 | Added PCRE2_SUBSTITUTE_LITERAL option to turn off the interpretation of the replacement string
10.35 | 9 May 2020 | Added PCRE2_SUBSTITUTE_MATCHED option
10.35 | 9 May 2020 | Added PCRE2_SUBSTITUTE_REPLACEMENT_ONLY option
10.35 | 9 May 2020 | Added (?* and (?<* as synonyms for (*napla: and (*naplb: to match another regex engine
10.34 | 21 Nov 2019 | Added non-atomic positive lookaround via (*non_atomic_positive_lookahead:…) or (*napla:…), (*non_atomic_positive_lookbehind:…) or (*naplb:…)
10.34 | 21 Nov 2019 | (*ACCEPT) can now be quantified: an ungreedy quantifier with a zero minimum is potentially useful
10.34 | 21 Nov 2019 | Added pcre2_get_match_data_size() to the API
10.34 | 21 Nov 2019 | Added pcre2_maketables_free() to the API
10.33 | 16 Apr 2019 | Added Perl "script run" features (*script_run:…) a.k.a. (*sr:…), and (*atomic_script_run:…) a.k.a. (*asr:…)
10.33 | 16 Apr 2019 | Added Perl 5.28 experimental alphabetic names for atomic groups and lookaround assertions, such as (*pla:…) and (*atomic:…)
10.33 | 16 Apr 2019 | Added PCRE2_EXTRA_ESCAPED_CR_IS_LF option
10.33 | 16 Apr 2019 | Added PCRE2_COPY_MATCHED_SUBJECT option
10.33 | 16 Apr 2019 | Added PCRE2_EXTRA_ALT_BSUX option to support ECMAScript 6 \u{hhh} construct
10.33 | 16 Apr 2019 | In DOTALL mode, \p{Any} is now the same as .
10.32 | 10 Sep 2018 | (?^) unsets all imnsx options
10.32 | 10 Sep 2018 | (*ACCEPT:ARG), (*FAIL:ARG), and (*COMMIT:ARG) are now supported
10.30 | 14 Aug 2017 | Added the PCRE2_LITERAL option, telling the compiler to treat the entire pattern as a literal string, including what would normally be metacharacters
10.30 | 14 Aug 2017 | Added the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL option, telling the compiler to treat an escaped character which isn't a proper token (such as \j) as a literal (in this case the letter j) rather than an error
10.30 | 14 Aug 2017 | Added the PCRE2_NEWLINE_NUL option, which adds the NUL character (binary zero) to the list of characters which can be set as those to be recognized as new lines, set using pcre2_set_newline()
10.30 | 14 Aug 2017 | Added the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option, giving finer control over the treatment of Unicode surrogate code points
10.30 | 14 Aug 2017 | Added the (?n) inline option to disable auto-capture, in the same way as the PCRE2_NO_AUTO_CAPTURE API option
10.30 | 14 Aug 2017 | Added the (?xx) inline option and the PCRE2_EXTENDED_MORE API option to ignore all unescaped whitespace, including in a character class
10.30 | 14 Aug 2017 | Added the PCRE2_ENDANCHORED option, telling the engine that the pattern can only match at the end of the subject
10.30 | 14 Aug 2017 | Added pcre2_pattern_convert() to the API, an experimental foreign pattern conversion function
10.30 | 14 Aug 2017 | Added pcre2_code_copy_with_tables() to the API
10.23 | 14 Feb 2017 | Allow backreferences in lookbehind so long as group names or numbers are unambiguous
10.23 | 14 Feb 2017 | Added forward relative back-reference syntax: \g{+2} (mirroring the existing \g{-2})
10.22 | 29 Jul 2016 | Added pcre2_code_copy() to the API
10.21 | 12 Jan 2016 | Added the PCRE2_SUBSTITUTE_EXTENDED option to enhance replacement syntax
10.21 | 12 Jan 2016 | Added the ${*MARK} facility to pcre2_substitute()
10.21 | 12 Jan 2016 | Added the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option to tweak what happens during replacements when the output buffer is too small
10.21 | 12 Jan 2016 | Added the PCRE2_SUBSTITUTE_UNKNOWN_UNSET and PCRE2_SUBSTITUTE_UNSET_EMPTY options to fine-tune how empty capture groups are treated in replacements
10.21 | 12 Jan 2016 | Added the PCRE2_ALT_VERBNAMES option to subtly modify marked names that can be used with backtracking control verbs
10.21 | 12 Jan 2016 | Added pcre2_set_max_pattern_length() to the API, allowing programs to restrict the size of patterns they are prepared to handle
10.20 | 30 Jun 2015 | Added the PCRE2_ALT_CIRCUMFLEX option to allow ^ to assert position after any newline including a terminating newline
10.20 | 30 Jun 2015 | Added the PCRE2_NEVER_BACKSLASH_C option to disable \C
10.20 | 30 Jun 2015 | pcre2_callout_enumerate was added to the API
10.10 | 6 Mar 2015 | Serialization functions were added to the API
10.0 | 5 Jan 2015 | Version check available via patterns such as (?(VERSION>=x)…)
10.0 | 5 Jan 2015 | New option telling the engine not to automatically anchor patterns that start with .*
10.0 | 5 Jan 2015 | New options telling the engine not to return empty matches
10.0 | 5 Jan 2015 | By default, PCRE2 builds with Unicode support
10.0 | 5 Jan 2015 | Name switch to PCRE2 and new API, which includes a replacement function: pcre2_substitute()
*********
8.41 | 5 Jul 2017 | Inline comments can now be inserted between ++ and +? quantifiers, as in a+(?# make it possessive)+ or a+(?# up to b)?b
8.34 | 15 Dec 2014 | Added support for the POSIX word boundary markers [[:<:]] and [[:>:]], which are converted to \b(?=\w) and \b(?<=\w) internally
8.34 | 15 Dec 2014 | Added \o{…} to specify code points in octal
8.33 | 28 May 2014 | Added \p{Xuc} (PCRE-specific) to match characters that can be expressed using Universal Character Names
8.10 | 25 Jun 2010 | Added PCRE-specific Unicode properties: \p{Xan} (alphanumeric), \p{Xsp} (Perl space), \p{Xps} (POSIX space) and \p{Xwd} (word)
8.10 | 25 Jun 2010 | Added support for (*MARK:ARG) and for ARG additions to PRUNE, SKIP, and THEN
8.10 | 25 Jun 2010 | Added \N (any character that is not a line break)
8.10 | 25 Jun 2010 | Added the (*UCP) start of pattern modifier, which affects \b, \d, \s and \w
7.90 | 11 Apr 2009 | Added the (*UTF8) start of pattern modifier
7.70 | 7 May 2008 | Added Ruby-style subroutine call syntax: \g<2>, \g'name', \g'2'
7.30 | 28 Aug 2007 | Added backtracking control verbs such as (*PRUNE), (*THEN), (*COMMIT), (*ACCEPT)
7.30 | 28 Aug 2007 | Added the (*CR) start of pattern modifier
7.20 | 19 Jun 2007 | Added (?-2) and (?+2) syntax for relative subroutine calls
7.20 | 19 Jun 2007 | Added (?(-2)…) and (?(+2)…) conditional syntax to check if a relative capture group has been set
7.20 | 19 Jun 2007 | Added \K to drop what has been matched so far from the match to be returned
7.20 | 19 Jun 2007 | Added named back-reference synonyms: \k{foo} and \g{foo}
7.20 | 19 Jun 2007 | Added
7.20 | 19 Jun 2007 | Added \h and \v (and their counterclasses \H and \V) to match horizontal and vertical whitespace
7.00 | 19 Dec 2006 | Added \R to match any Unicode newline sequence
7.00 | 19 Dec 2006 | Added named group synonyms (?<foo>…) and (?'foo'…)
7.00 | 19 Dec 2006 | Added named subroutine call synonym (?&foo)
7.00 | 19 Dec 2006 | Added named back-reference synonyms \k<foo> and \k'foo'
7.00 | 19 Dec 2006 | Added named conditional synonyms (?(<foo>)…), (?('foo')…) and (?(foo)…)
7.00 | 19 Dec 2006 | Added
7.00 | 19 Dec 2006 | Added conditional syntax to check if a subroutine or recursion level has been reached: (?(R2)…), (?(R&foo)…) and (?(R)…)
7.00 | 19 Dec 2006 | Added \g2 and \g{-2} for relative back-references
6.70 | 4 Jul 2006 | Added named groups in conditionals: (?(foo)…)
6.50 | 1 Feb 2006 | Added support for Unicode script names via \p{Arabic}
6.00 | 7 Jun 2005 | Added pcre_dfa_exec() to the API
6.00 | 7 Jun 2005 | Added pcre_refcount() to the API
6.00 | 7 Jun 2005 | Added pcre_compile2() to the API
5.00 | 13 Sep 2004 | Added support for Unicode categories such as \p{L} and negated Unicode categories such as \P{Nd}
5.00 | 13 Sep 2004 | Added \X Unicode grapheme token
4.00 | 17 Feb 2003 | Added [:blank:] to match ASCII space character and tab
4.00 | 17 Feb 2003 | Added \Q…\E escape sequence
4.00 | 17 Feb 2003 | Added possessive quantifiers: ?+, *+, ++ and {…,…}+
4.00 | 17 Feb 2003 | Added \C to match a single byte, even in UTF-8 mode
4.00 | 17 Feb 2003 | Added the \G continuation anchor
4.00 | 17 Feb 2003 | Added callouts (?C), (?C2) etc. which can be used in C but not PHP
4.00 | 17 Feb 2003 | Added Python-style named groups (?P<foo>…) and subroutine calls (?P>foo)
3.30 | 1 Aug 2000 | Added pcre_free_substring() and pcre_free_substring_list() to the API
3.00 | 1 Feb 2000 | Added recursion (?R)
3.00 | 1 Feb 2000 | Added POSIX classes such as [:alpha:]
3.00 | 1 Feb 2000 | Added pcre_fullinfo() to the API
2.00 | 24 Sep 1998 | Atomic groups (?>) can now be quantified
2.00 | 24 Sep 1998 | Added positive lookbehind (?<=…)
2.00 | 24 Sep 1998 | Added negative lookbehind (?<!…)
2.00 | 24 Sep 1998 | Added non-capturing groups with inline modifiers (?imsx-imsx:)
2.00 | 24 Sep 1998 | Added unsetting of inline modifiers: (?-imsx)
2.00 | 24 Sep 1998 | Added conditional pattern matching (?(cond)re|re)
1.08 | 27 Mar 1998 | Added PCRE_UNGREEDY to invert the greediness of quantifiers
1.08 | 27 Mar 1998 | Added the (?U) inline modifier to turn on ungreedy mode
1.08 | 27 Mar 1998 | Added the (?X) inline modifier to turn on extras mode
0.99 | 27 Oct 1997 | Added atomic groups (?>…)
0.96 | 16 Oct 1997 | Added DOTALL mode, including inline modifier (?s)
0.93 | 15 Sep 1997 | Added pcre_study() to the API
0.92 | 11 Sep 1997 | Added multiline mode via inline modifier (?m) and PCRE_MULTILINE
0.92 | 11 Sep 1997 | Added pcre_info() to the API (removed in 8.30)

When PCRE precedes Perl

For the most part, PCRE tries to stay in step with Perl regex syntax, but the two engines' behaviors are not always identical. As is bound to happen in communities with many active users, it can happen that an idea makes it to the PCRE engine before it gets adopted by Perl. This kind of friendly exchange is a good thing for all regexers. Parochial "not invented here" postures wouldn't serve us: we just want the best regex engines.

Here are examples of features where PCRE preceded Perl:
Recursion was first implemented in PCRE by a contributor and appeared in version 3.0 (February 2000). Perl introduced recursion in version 5.10 (officially released in December 2007), which explains why certain details function differently in the two engines.
PCRE implemented Python's named group syntax (?P<foo>…) in version 4.0 (February 2003). Perl started supporting named groups in version 5.10 (officially released in December 2007).

Links to other PCRE-related Material on RexEgg

PCRE-related material is peppered throughout the site. Below, I try to maintain a list of the most significant "PCRE pockets" on the site.
Reducing (?…) Syntax Confusion explains all the (?…) syntax.
Other points of PCRE syntax can be found on the pages about anchors, boundaries, capture groups and others (see the "Black Belt Program" in the left-side menu at the top of the page).
The page on flags and modifiers has a section about PCRE's Special Start-of-Pattern Modifiers.
I've implemented an infinite lookbehind demo for PCRE.
pcregrep and pcretest presents two PCRE-specific tools and includes the latest Windows binaries.
My page on backtracking control verbs shows useful constructs such as (*SKIP)(*FAIL).
The PHP regex page shows the PHP interface to the PCRE engine.
The trick about matching line numbers shows an interesting example of self-referencing groups and of recursion.
The trick about matching numbers in plain English shows a full-scale example of how (?(DEFINE)…) can be used to produce modular, maintainable patterns.

If you're less interested in Perl regex in itself than in using Perl to build powerful command-line regex one-liners, visit the page on that topic.

A Word about Perl Delimiters

Before we start, a quick word about delimiters around Perl patterns is in order. You'll usually see Perl regex patterns expressed between forward slashes, as in /this pattern/, which is short for m/this pattern/. But you don't have to use forward slashes. In the "long form", where m for "match" or s for "substitute" precedes the pattern, you can use almost any delimiter you like. For instance, m~some pattern~ is valid. As discussed elsewhere, this is extremely convenient when your input would otherwise require you to escape forward slashes, as in anything involving HTML tags or URLs.

What makes Perl regex special?

By now, the bulk of Perl regex syntax has drifted to other engines. For instance the PCRE engine, while not identical to Perl regex, supports esoteric Perl syntax such as backtracking control verbs and branch reset. In some cases, the flow runs the other way: recursion and named groups started in PCRE and were later adopted by Perl. Other engines have also extended regex syntax in useful directions and can, in some respects, be said to be ahead of Perl. In that category, I would place .NET's infinite lookbehind, capture collections, character class subtraction, balancing groups and right-to-left matching mode; and the fuzzy matching from Matthew Barnett's regex engine for Python.

So what makes Perl regex special today is not its syntax, unless we are talking about Perl 6 regex, which is another planet altogether and miles away from mainstream adoption. I'll fully admit to not being fluent in Perl (I fumble around every time I need to do something more complicated than a Perl regex one-liner), but my impression as an outsider is that what makes Perl regex special today is two things:
Regex integrates intimately within Perl code
You can use code inside your regular expressions

These two things, of course, reduce to one: regex is tightly interwoven into the fabric of Perl. Indeed, to an outsider, Perl code often looks like one big regular expression.

Let me give you what I consider an exquisite example of the power afforded by integrating code within regular expressions. Consider this line of code:

if ('abc' =~ /\w+(?{print "$&\n";})(*F)/) {}

The first thing to notice is that the =~ operator (which stands for matches) does the heavy lifting performed by a match function in other languages. So the regular expression is not an argument in a function; it is specified directly on the right side of the =~ operator, between the / delimiters. How compact!

Forget the (?{print "$&\n";}) fragment for a moment. The regex pattern itself is no more than \w+(*F): match some word characters, then fail to match the (*F) token (the forced-failure token, which never matches), causing the engine to backtrack and gradually give up word characters while looking for another way to match.

The magic is that each time the engine passes the \w+, before failing, it reads a capsule that contains a small piece of injected Perl code: (?{print "$&\n";})

The code itself is inside the braces: a single print statement print "$&\n"; that outputs the current match (it helps to know that $& is a special variable that contains the match, just as $1 contains the content of capture group 1). As a result, the program prints the list of temporary matches at each point where the engine finishes matching \w+, corresponding to a full path exploration:

abc
ab
a
bc
b
c

And if that doesn't make you in awe of Perl regular expressions… Maybe nothing will.

Please note that via the (?C…) callout syntax, PCRE aims to provide similar functionality to Perl's "code capsules".
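To see why the output comes out in that order, here is a small emulation in Python. It does not run code inside a regex (Python's re module has no equivalent of Perl's code capsules); it simply walks the same path by hand: for each starting position, take the longest run of word characters, then give characters back one at a time. The function name explored_matches is an assumption made for this sketch.

```python
import re

def explored_matches(subject: str):
    """List what the word-character run holds each time the forced failure is reached."""
    attempts = []
    for start in range(len(subject)):                # each starting position is tried in turn
        run = re.match(r"\w+", subject[start:])      # greedy match from this position
        if run is None:
            continue
        for length in range(run.end(), 0, -1):       # backtracking gives up one char at a time
            attempts.append(subject[start:start + length])
    return attempts

for attempt in explored_matches("abc"):
    print(attempt)
```

The outer loop corresponds to the engine advancing its starting position after every path from the previous position has failed.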

A Perl program that shows how to perform common regex tasks

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks, and some not-so-common ones as well. This is what I have for you in the following complete Perl regex program. It's taken from my page about the best regex trick ever, and it performs the six most common regex tasks.

The first four tasks answer the most common questions we use regex for:
Does the string match?
How many matches are there?
What is the first match?
What are all the matches?

The last two tasks perform two other common regex tasks:
Replace all matches
Split the string

If you study this code, you'll have a terrific starting point for tweaking and testing your own expressions in Perl. Bear in mind that the code inspects values captured in Group 1, so you'll have to tweak… but you'll have a solid base to understand how to do basic things, and fairly advanced ones as well.

As you can imagine, I am not fluent in all of the ten or so languages showcased on the site. This means that although the sample code works, a Perl pro might look at the code and see a more idiomatic way of testing an empty value or iterating a structure. If some idiomatic improvements jump out at you, please shoot me a comment.

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

#!/usr/bin/perl
$regex = '{[^}]+}|"Tarzan\d+"|(Tarzan\d+)';
$subject = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}';

# put Group 1 captures in an array
my @group1Caps = ();
while ($subject =~ m/$regex/g) {
    print $1 . "\n";
    if (defined $1) {push(@group1Caps,$1);}
}

######## The six main tasks we're likely to have ########

# Task 1: Is there a match?
print "*** Is there a Match? ***\n";
if ( @group1Caps > 0) { print "Yes\n"; }
else { print ("No\n"); }

# Task 2: How many matches are there?
print "\n*** Number of Matches ***\n";
print scalar(@group1Caps);

# Task 3: What is the first match?
print "\n\n*** First Match ***\n";
if ( @group1Caps > 0) { print $group1Caps[0]; }

# Task 4: What are all the matches?
print "\n\n*** Matches ***\n";
if ( @group1Caps > 0) {
    foreach(@group1Caps) { print "$_\n"; }
}

# Task 5: Replace the matches
($replaced = $subject) =~ s/$regex/
    if (defined $1) { "Superman"; } else {$&;}
/eg;
print "\n*** Replacements ***\n";
print $replaced . "\n";

# Task 6: Split
# Start by replacing by something distinctive,
# as in Step 5. Then split.
@splits = split(/Superman/, $replaced);
print "\n*** Splits ***\n";
foreach(@splits) { print "$_\n"; }

Using Regular Expressions with C#

The C# regex tutorial is not as fully fleshed out as I would like, but I'm working on it. In the meantime, I have some great material to get you started in C#'s particular flavor of regex.

What's on this Page

With the C# page and the other language pages, my goal is not to teach you regex. That's what the rest of the site is for! My goal here is to get you fully up and running in your language by:

- Explaining the features that are specific to your language and regex flavor
- Giving you full working programs that demonstrate these features

I believe in learning features by seeing working code. This is what most of this page is about.

Table of Contents

Here are some jump points to the content on this page.

- What does the .NET regex flavor taste like?
- What's missing from .NET regex?
- C# regex: the first three things you must know
- Two programs for all common regex tasks
- The "simple" program
- The "advanced" program
- Differences in .NET Regex across .NET Versions
- Capture Groups that can be Quantified
- Named Capture Reuse
- Quantified Reused Named Groups
- Balancing Groups
- An Alternate engine: PCRE.NET

What does the .NET regex flavor taste like?

If you hate Windows, you're going to hate .NET regular expressions. Why… Because it sucks? No. Because feature-for-feature, it may well be the best regex engine out there, or at least one of the top two contenders for the spot. What's more, if for any reason you just don't like it, you can use the brilliant PCRE.NET interface to the PCRE library.

Among other features, .NET regular expressions have:

- Infinite-width lookbehind. This means you can write something like (?<=\d+\w+), which is extremely convenient when you need to check context. If you are writing code, the only other engine to offer this feature is Matthew Barnett's experimental regex module for Python. Jan Goyvaerts' proprietary JGSoft flavor (EditPad, RegexBuddy, PowerGrep) also supports infinite-width lookbehind, but only Jan is allowed to write code with it.
- Capture groups that can be quantified. This means that if you write (\w+\s)+, the engine will return not just one Group 1 capture, but a whole array of them. This has terrific value for parsing strings with an arbitrary number of tokens.
- Character class subtraction. This allows you to write [a-z0-9-[mp3]], which means you shouldn't be listening to loud music while writing regex. Err… sorry, I meant, this means you can match all lowercase English letters and digits except the characters m, p and 3.
- Optional right-to-left matching. I'll soon add a trick to demonstrate a situation where this could be handy. Bear in mind that in other languages, a workaround would be to reverse the string before matching, then to reverse the results.
- (?n) modifier (also accessible via the ExplicitCapture option). This turns all (parentheses) into non-capture groups. To capture, you must use named groups.
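To make three of the features above concrete, here is a minimal sketch. The class name and the sample strings are made up for illustration; the only API used is the standard System.Text.RegularExpressions namespace.

```csharp
using System;
using System.Text.RegularExpressions;

class FlavorTour
{
    static void Main()
    {
        // Infinite-width lookbehind: \d+ and \w+ inside (?<=...) have no fixed length
        Match m1 = Regex.Match("42abc!", @"(?<=\d+\w+)!");
        Console.WriteLine(m1.Success);              // True

        // Right-to-left matching: the search starts at the end of the string
        Match m2 = Regex.Match("one two three", @"\w+", RegexOptions.RightToLeft);
        Console.WriteLine(m2.Value);                // three

        // (?n): plain parentheses stop capturing, named groups still capture
        Match m3 = Regex.Match("apple:7", @"(?n)(\w+):(?<qty>\d+)");
        Console.WriteLine(m3.Groups.Count);         // 2 (Group 0 and "qty")
        Console.WriteLine(m3.Groups["qty"].Value);  // 7
    }
}
```

Quantified capture groups and character class subtraction have their own sections further down the page.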

What's Missing from .NET regex?

.NET regex certainly doesn't have it all, though some features that seem lacking are cleanly achieved through other means. If one of your favorite features from Perl or PCRE is really missing, don't despair yet, as you can use the brilliant PCRE.NET interface to the PCRE engine.

- Subroutines. In .NET, you cannot write (\d+):(?1). This is a feature I miss. By extension, neither can you write something like (?(DEFINE)(?<digits>\d+))(?&digits):(?&digits).
- .NET regex does not have the \K "keep out" feature. However, in Perl and PCRE, \K is only a convenience to (partially) make up for the lack of infinite-width lookbehinds.
- No possessive quantifiers as in \w++. I know, this is only a shorthand notation for the atomic group (?>\w+)… But it is much tidier.
- No branch reset. .NET does not allow (?|(cats) and dogs|(pigs) and whistles), where Group 1 can be defined in multiple places in the pattern. However, .NET lets you achieve the same by recycling a named group: (?<pets>cats) and dogs|(?<pets>pigs) and whistles
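The workarounds mentioned in the list can be sketched as follows. This is an illustration, not a full equivalence proof: the class name and test strings are invented, and the lookbehind only stands in for the typical use of \K.

```csharp
using System;
using System.Text.RegularExpressions;

class WorkaroundDemo
{
    static void Main()
    {
        // Instead of \K: Perl/PCRE's foo\Kbar becomes a lookbehind
        Console.WriteLine(Regex.Match("foobar", @"(?<=foo)bar").Value);   // bar

        // Instead of \w++: the atomic group refuses to give back characters,
        // so the \d that follows has nothing left to match
        Console.WriteLine(Regex.IsMatch("abc1", @"\w+\d"));       // True
        Console.WriteLine(Regex.IsMatch("abc1", @"(?>\w+)\d"));   // False

        // Instead of branch reset: reuse one group name in both branches
        var pets = new Regex(@"(?<pets>cats) and dogs|(?<pets>pigs) and whistles");
        Console.WriteLine(pets.Match("pigs and whistles").Groups["pets"].Value);  // pigs
        Console.WriteLine(pets.Match("cats and dogs").Groups["pets"].Value);      // cats
    }
}
```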

The first three things you must know to use regex in C#

Before you start using regex in C#, at a bare minimum, you should know these three things.

1. Import the .NET Regex Library

To access the regex classes, you need this line at the top of your code:

using System.Text.RegularExpressions;

2. Use Verbatim Strings

Regex patterns are full of backslashes. In a normal string, you have to escape them, which prevents you from pasting patterns straight from a regex tool. To get around this problem, use C#'s verbatim strings, whose characters lose any special significance for the compiler. To make a verbatim string, precede your string with an @ character, like so:

string myPattern = @"Score: \w+: \d+";

Verbatim strings can span multiple lines. This is useful for your regex subjects as well as regex patterns that use free-spacing mode. For instance:

string mySubject = @"Arizona, AZ 100
California, CA 122
South Dakota, SD 33
";

string myPattern = @"(?xm)  # free-spacing mode
^([\w\s]+),\s   # State
([A-Z]{2}\s)    # State abbreviation
(\d+)           # Value of a dollar, in cents
";

3. Watch Out for \w and \d

By default, the .NET RegularExpressions classes are Unicode-aware (.NET strings are sequences of UTF-16 code units, not UTF-8 bytes). The regex tokens \w, \d and \s behave accordingly, matching any Unicode character that is a valid word character, digit or whitespace character in any language. This means that by default:

- \d+ will match 123
- \w+ will match abcddられま
- \s+ will match all kinds of strange whitespace characters you've never dreamed of.

If all you wanted was English digits for \d, English letters, English digits and underscore for \w and whitespace characters you can understand for \s, then you need to set the ECMAScript option. Here's how to do it:

var r2 = new Regex(@"\d+", RegexOptions.ECMAScript);
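The effect of the ECMAScript option on \d can be checked with a short sketch. The Arabic-Indic digits are my own choice of sample (any character in the Unicode "decimal digit" category would do), and the class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class DigitDemo
{
    static void Main()
    {
        // U+0663 U+0664 U+0665: Arabic-Indic digits three, four, five
        string arabicIndic = "\u0663\u0664\u0665";

        // Default behavior: \d matches any Unicode decimal digit
        Console.WriteLine(Regex.IsMatch(arabicIndic, @"^\d+$"));                           // True

        // ECMAScript option: \d is restricted to 0-9
        Console.WriteLine(Regex.IsMatch(arabicIndic, @"^\d+$", RegexOptions.ECMAScript));  // False
        Console.WriteLine(Regex.IsMatch("345", @"^\d+$", RegexOptions.ECMAScript));        // True
    }
}
```

If you only need to restrict digits, writing [0-9] instead of \d gives the same result without changing the meaning of \w and \s.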

Two C# programs that show how to perform common regex tasks

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks, and some not-so-common ones as well. This is what I have for you in the two following complete C# regex programs.

There are two programs: a "simple" one and an "advanced" one. Yes, these terms are subjective. Both programs perform the same six most common regex tasks, but in different contexts.

The first four tasks answer the most common questions we use regex for:

- Does the string match?
- How many matches are there?
- What is the first match?
- What are all the matches?

The last two tasks perform two other common regex tasks:

- Replace all matches
- Split the string

If you study this code, you'll have a terrific starting point for tweaking and testing your own expressions with C#.

Differences between the "Simple" and "Advanced" program

Here is the difference in a nutshell. The simple program assumes that the overall match and its capture groups are the data we're seeking. This is what you would expect. The advanced program assumes that we have no interest in the overall matches, but that the data we're seeking is in capture Group 1, if it is set.

Have fun tweaking

With these two programs in hand, you should have a solid base to understand how to do basic things, and fairly advanced ones as well. I hope you'll have fun changing the pattern, deleting code fragments you don't need and tweaking those you do need.

Disclaimer

As you can imagine, I am not fluent in all of the ten or so languages showcased on the site. This means that although the sample code works, a C# pro may look at the code and see a more idiomatic way of testing an empty value or iterating a structure. If some idiomatic improvements jump out at you, please shoot me a comment.

C# Regex Program #1: Simple

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;

class Program {
static void Main() {
    string s1 = @"apple:green:3 banana:yellow:5";
    var myRegex = new Regex(@"(\w+):(\w+):(\d+)");

    ///////// The six main tasks we're likely to have ////////

    // Task 1: Is there a match?
    Console.WriteLine("*** Is there a Match? ***");
    if (myRegex.IsMatch(s1)) Console.WriteLine("Yes");
    else Console.WriteLine("No");

    // Task 2: How many matches are there?
    MatchCollection AllMatches = myRegex.Matches(s1);
    Console.WriteLine("\n" + "*** Number of Matches ***");
    Console.WriteLine(AllMatches.Count);

    // Task 3: What is the first match?
    Console.WriteLine("\n" + "*** First Match ***");
    Match OneMatch = myRegex.Match(s1);
    if (OneMatch.Success) {
        Console.WriteLine("Overall Match: "+ OneMatch.Groups[0].Value);
        Console.WriteLine("Group 1: " + OneMatch.Groups[1].Value);
        Console.WriteLine("Group 2: " + OneMatch.Groups[2].Value);
        Console.WriteLine("Group 3: " + OneMatch.Groups[3].Value);
    }

    // Task 4: What are all the matches?
    Console.WriteLine("\n" + "*** Matches ***");
    if (AllMatches.Count > 0) {
        foreach (Match SomeMatch in AllMatches) {
            Console.WriteLine("Overall Match: " + SomeMatch.Groups[0].Value);
            Console.WriteLine("Group 1: " + SomeMatch.Groups[1].Value);
            Console.WriteLine("Group 2: " + SomeMatch.Groups[2].Value);
            Console.WriteLine("Group 3: " + SomeMatch.Groups[3].Value);
        }
    }

    // Task 5: Replace the matches
    // simple replacement: reverse groups
    string replaced = myRegex.Replace(s1,
        delegate(Match m) {
            return m.Groups[3].Value + ":" +
                   m.Groups[2].Value + ":" +
                   m.Groups[1].Value;
        }
    );
    Console.WriteLine("\n" + "*** Replacements ***");
    Console.WriteLine(replaced);

    // Task 6: Split
    // Let's split at colons or spaces
    string[] splits = Regex.Split(s1, @":|\s");
    Console.WriteLine("\n" + "*** Splits ***");
    foreach (string split in splits) Console.WriteLine(split);

    Console.WriteLine("\nPress Any Key to Exit.");
    Console.ReadKey();
} // END Main
} // END Program

C# Regex Program #2: Advanced

The second full C# regex program is featured on my page about the best regex trick ever. That article has its own table of contents, an explanation of the code, and the C# code itself.
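Since the advanced program lives on another page, here is a rough idea of what "the data we're seeking is in capture Group 1" looks like in C#. This is not the author's program: it is a sketch ported from the Perl listing on this site, using the same pattern and subject, with the braces escaped and a hypothetical class name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Group1Sketch
{
    static void Main()
    {
        // Same pattern and subject as the Perl listing (braces escaped for clarity)
        var regex = new Regex(@"\{[^}]+\}|""Tarzan\d+""|(Tarzan\d+)");
        string subject = @"Jane"" ""Tarzan12"" Tarzan11@Tarzan22 {4 Tarzan34}";

        // Keep a match only when Group 1 took part in it
        var group1Caps = new List<string>();
        foreach (Match m in regex.Matches(subject))
            if (m.Groups[1].Success) group1Caps.Add(m.Groups[1].Value);

        Console.WriteLine(group1Caps.Count);               // 2
        Console.WriteLine(string.Join(", ", group1Caps));  // Tarzan11, Tarzan22

        // Replace only the Group 1 matches; put every other match back unchanged
        string replaced = regex.Replace(subject,
            m => m.Groups[1].Success ? "Superman" : m.Value);
        Console.WriteLine(replaced);  // Jane" "Tarzan12" Superman@Superman {4 Tarzan34}
    }
}
```

The first two alternatives match the contexts we want to skip (braces and quoted strings), so only a bare Tarzan followed by digits ever reaches Group 1.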

Differences in .NET Regex across .NET Versions

If you're using the latest version of .NET, don't worry about this section. After 2.0 (and you're unlikely to target an earlier version), the new features are few.

New regex features in .NET 4.5

The timeout feature lets you control the risk of catastrophic backtracking. When you initialize a Regex, you can now specify a third argument to control the timeout. For instance,

var myregex = new Regex( @"(A+)+", RegexOptions.IgnoreCase, TimeSpan.FromSeconds(1) )

ensures the engine searches for a match for one second at the most, after which it throws a RegexMatchTimeoutException. This timeout is observed by IsMatch, Match, Matches, Replace, Split and Match.NextMatch.
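In practice the timeout is only useful if you catch the exception. The sketch below assumes a deliberately risky pattern and a made-up subject; whether a given runtime actually times out on it or rejects it quickly depends on the .NET version and its optimizations, so no output is promised here.

```csharp
using System;
using System.Text.RegularExpressions;

class TimeoutDemo
{
    static void Main()
    {
        // (A+)+B cannot match a string with no B, and the nested quantifiers
        // give a backtracking engine a huge number of ways to split the A's
        var risky = new Regex(@"(A+)+B", RegexOptions.None, TimeSpan.FromSeconds(1));
        string subject = new string('A', 40);

        try
        {
            Console.WriteLine(risky.IsMatch(subject));
        }
        catch (RegexMatchTimeoutException ex)
        {
            // MatchTimeout is the limit that was exceeded
            Console.WriteLine("Gave up after " + ex.MatchTimeout);
        }
    }
}
```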

Character Class Subtraction

The syntax […-[…]], which allows you to subtract one character class from another, is unique to .NET, though Java and Ruby 2+ have syntax that allows a similar operation. For details, see character class subtraction in .NET on the page about character class operations.
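As a quick illustration, here is the [a-z0-9-[mp3]] class from earlier on this page applied to a made-up string (the class name is hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

class SubtractionDemo
{
    static void Main()
    {
        // Lowercase letters and digits, minus the characters m, p and 3
        var noMp3 = new Regex(@"[a-z0-9-[mp3]]+");

        foreach (Match m in noMp3.Matches("stomp 1234"))
            Console.WriteLine(m.Value);
        // prints sto, 12 and 4, one per line
    }
}
```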

Capture Groups that can be Quantified

You'll recall from the page about regex capture groups that when you place a quantifier after a capturing group, as in (\d+:?)+, the regex engine doesn't create multiple capture groups for you. Instead, the capture group returns the string that was captured last. For instance, if we used the above regex on the string 111:22:33, the engine would match the whole string, and capture Group 1 would be reported as 33.

Well, with .NET regex, all of that changes. If you just ask, C# will still report that Group 1 is 33. But if you dig deeper into Group 1, C# will also return a collection of captures with all the values that Group 1 captured in succession because of the + quantifier.

This feature is a game changer, because it lets you easily parse strings with an unknown number of tokens. For instance, consider a file with a series of word translations for a number of languages, like so:

Italian:one=uno,two=due
German:one=ein,two=zwei,three=drei,four=vier
Japanese:one=ichi,two=ni,three=san

For each language, you would like to parse the English word (e.g. "two") and its translation (e.g. "zwei") into variables. If you had the same number of definitions for each language, you could accomplish this with a fixed number of capture groups. But, as you can see, Italian has two definitions, German has four, Japanese has three. For normal regex, the task is complex because you cannot create capture groups on the fly.

With .NET, you have a simple solution. Consider a regex that matches one language at a time. It could look like this:

\w+:(?:(\w+)=(\w+),?)+

The \w+: corresponds to the language (e.g. Italian:). Inside of the non-capturing parentheses, we define a dictionary pair, capturing the English word to Group 1 and the translation to Group 2. The ,? is just an optional comma (there is no comma after each language's last entry). So far, this is all normal. The odd thing here is the + quantifier that repeats the expression for a dictionary pair. What happens to the capture Groups?

In normal regex, if we had just matched the Italian entries, Group 1 and Group 2 would correspond to the last dictionary pair captured for that match, i.e. two for Group 1 and due for Group 2.

In .NET, Group 1 and Group 2 are objects. Their Value property is the same as in other regex flavors, i.e. the last dictionary pair captured for the current match. The twist is that each Group has a member called Captures, which is an object that contains all the captures that were made for that group during the match. Therefore for the first match (the Italian entries), Group 1's Captures member will contain two captures, whose values are "one" and "two".

The code below uses this example and shows you exactly how to implement the feature. When you run it, have a look at the output, which explains how the groups work.

using System;
using System.Text.RegularExpressions;

class Program {
static void Main() {
    string ourString = @"Italian:one=uno,two=due
Japanese:one=ichi,two=ni,three=san";
    string ourPattern = @"\w+:(?:(\w+)=(\w+),?)+";
    var ourRegex = new Regex(ourPattern);
    MatchCollection AllMatches = ourRegex.Matches(ourString);

    Console.WriteLine("**** Understanding .NET Capture Groups with Quantifiers ****");
    Console.WriteLine("\nOur string today is:" + ourString);
    Console.WriteLine("Our regex today is:" + ourPattern);
    Console.WriteLine("Number of Matches: " + AllMatches.Count);

    Console.WriteLine("\n*** Let's Iterate Through the Matches ***");
    int matchNum = 1;
    foreach (Match SomeMatch in AllMatches) {
        Console.WriteLine("Match number: " + matchNum++);
        Console.WriteLine("Overall Match: " + SomeMatch.Value);
        Console.WriteLine("\nNumber of Groups: " + SomeMatch.Groups.Count);
        Console.WriteLine("Why three Groups, not two? Because the overall match is Group 0");

        // Another way of printing the overall match
        Console.WriteLine("Group 0: " + SomeMatch.Groups[0].Value);
        Console.WriteLine("Since Groups 1 and 2 have quantifiers, the value of Group 1 and Group 2 is the last capture of each group");
        Console.WriteLine("Group 1: " + SomeMatch.Groups[1].Value);
        Console.WriteLine("Group 2: " + SomeMatch.Groups[2].Value);

        // For this match, let's look at all the Group 1 captures manually
        int g1capCount = SomeMatch.Groups[1].Captures.Count;
        Console.WriteLine("Number of Group 1 Captures: " + g1capCount);
        Console.WriteLine("Group 1 Capture 0: " + SomeMatch.Groups[1].Captures[0].Value);
        Console.WriteLine("Group 1 Capture 1: " + SomeMatch.Groups[1].Captures[1].Value);
        // To be safe, we'll check if we have a third capture for Group 1
        // Because the first overall match only has two captures
        if (g1capCount > 2)
            Console.WriteLine("Group 1 Capture 2: " + SomeMatch.Groups[1].Captures[2].Value);

        // Let's look at Group 2 captures automatically
        int g2capCount = SomeMatch.Groups[2].Captures.Count;
        Console.WriteLine("Number of Group 2 Captures: " + g2capCount);
        int i2 = 0;
        foreach (Capture g2capture in SomeMatch.Groups[2].Captures) {
            Console.WriteLine("Group 2 Capture " + i2 + ": " + g2capture.Value);
            i2++;
        } // end iterate G2 captures
        Console.WriteLine("\n");
    } // end iterate matches

    Console.WriteLine("\nPress Any Key to Exit.");
    Console.ReadKey();
} // END Main
} // END Program

Named Capture Reuse

C# allows you to reuse the same named group multiple times. For a given group, you retrieve the array of captures in the same way as with quantified capture groups. The following program shows you how.

using System;
using System.Text.RegularExpressions;

class Program {
static void Main() {
    string ourString = @"one:uno dos:two three:tres";
    string ourPattern = @"(?<someword>\w+):(?<someword>\w+)";
    var ourRegex = new Regex(ourPattern);
    MatchCollection AllMatches = ourRegex.Matches(ourString);

    Console.WriteLine("**** Understanding Named Capture Reuse ****");
    Console.WriteLine("\nOur string today is:" + ourString);
    Console.WriteLine("Our regex today is:" + ourPattern);
    Console.WriteLine("Number of Matches: " + AllMatches.Count);

    Console.WriteLine("\n*** Let's Iterate Through the Matches ***");
    int matchNum = 1;
    foreach (Match SomeMatch in AllMatches) {
        Console.WriteLine("Match number: " + matchNum++);
        Console.WriteLine("Overall Match: " + SomeMatch.Value);
        Console.WriteLine("\nNumber of Groups: " + SomeMatch.Groups.Count);
        Console.WriteLine("Why two Groups, not one? Because the overall match is Group 0");

        // Another way of printing the overall match
        Console.WriteLine("Groups[0].Value = " + SomeMatch.Groups[0].Value);
        Console.WriteLine(@"Since the 'someword' group appears more than once in the pattern, the value of Groups[1] and Groups[""someword""] is the last capture of each group");
        Console.WriteLine("Groups[1].Value = " + SomeMatch.Groups[1].Value);
        Console.WriteLine(@"Groups[""someword""].Value = " + SomeMatch.Groups["someword"].Value);

        // Let's look at the first capture manually
        Console.WriteLine("Someword Capture 0: " + SomeMatch.Groups["someword"].Captures[0].Value);

        // Let's look at someword captures automatically
        int somewordCapCount = SomeMatch.Groups["someword"].Captures.Count;
        Console.WriteLine("Number of someword Captures: " + somewordCapCount);
        int i2 = 0;
        foreach (Capture someword in SomeMatch.Groups["someword"].Captures) {
            Console.WriteLine("someword Capture " + i2 + ": " + someword.Value);
            i2++;
        } // end iterate someword captures
        Console.WriteLine("\n");
    } // end iterate matches

    Console.WriteLine("\nPress Any Key to Exit.");
    Console.ReadKey();
} // END Main
} // END Program

Quantified Reused Named Groups

What if you were to combine quantified capture groups with reused named groups? No problem. For the given named capture, C# just keeps adding captured strings in the order they are captured. The following program shows you how this works.

using System;
using System.Text.RegularExpressions;

class Program {
static void Main() {
    string ourString = @"one-two-three:uno-dos-tres one-two-three:ichi-ni-san";
    string ourPattern = @"(?:(?<someword>\w+)-?)+:(?:(?<someword>\w+)-?)+";
    var ourRegex = new Regex(ourPattern);
    MatchCollection AllMatches = ourRegex.Matches(ourString);

    Console.WriteLine("**** Understanding Named Capture Reuse ****");
    Console.WriteLine("\nOur string today is:" + ourString);
    Console.WriteLine("Our regex today is:" + ourPattern);
    Console.WriteLine("Number of Matches: " + AllMatches.Count);

    Console.WriteLine("\n*** Let's Iterate Through the Matches ***");
    int matchNum = 1;
    foreach (Match SomeMatch in AllMatches) {
        Console.WriteLine("Match number: " + matchNum++);
        Console.WriteLine("Overall Match: " + SomeMatch.Value);
        Console.WriteLine("\nNumber of Groups: " + SomeMatch.Groups.Count);
        Console.WriteLine("Why two Groups, not one? Because the overall match is Group 0");

        // Another way of printing the overall match
        Console.WriteLine("Groups[0].Value = " + SomeMatch.Groups[0].Value);
        Console.WriteLine(@"Since the 'someword' group appears more than once in the pattern, the value of Groups[1] and Groups[""someword""] is the last capture of each group");
        Console.WriteLine("Groups[1].Value = " + SomeMatch.Groups[1].Value);
        Console.WriteLine(@"Groups[""someword""].Value = " + SomeMatch.Groups["someword"].Value);

        // Let's look at the first capture manually
        Console.WriteLine("Someword Capture 0: " + SomeMatch.Groups["someword"].Captures[0].Value);

        // Let's look at someword captures automatically
        int somewordCapCount = SomeMatch.Groups["someword"].Captures.Count;
        Console.WriteLine("Number of someword Captures: " + somewordCapCount);
        int i2 = 0;
        foreach (Capture someword in SomeMatch.Groups["someword"].Captures) {
            Console.WriteLine("someword Capture " + i2 + ": " + someword.Value);
            i2++;
        } // end iterate someword captures
        Console.WriteLine("\n");
    } // end iterate matches

    Console.WriteLine("\nPress Any Key to Exit.");
    Console.ReadKey();
} // END Main
} // END Program

Balancing Groups

I haven't yet written this section, but there are great examples of this feature in several sections of the site. These will show you everything you need to know to get started with balancing groups:

- Matching Line Numbers
- Quantifier Capture
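Until this section is written, here is a small taste of the feature. It is a commonly seen idiom for checking that parentheses are balanced, not an example taken from the pages listed above; the group name "open", the class name and the test strings are arbitrary.

```csharp
using System;
using System.Text.RegularExpressions;

class BalancedDemo
{
    static void Main()
    {
        // (?<open>\()    pushes a capture onto the "open" stack
        // (?<-open>\))   pops one capture, and fails if the stack is empty
        // (?(open)(?!))  fails at the end if anything is left on the stack
        var balanced = new Regex(
            @"^(?:[^()]|(?<open>\()|(?<-open>\)))*(?(open)(?!))$");

        Console.WriteLine(balanced.IsMatch("(a(b)c)"));  // True
        Console.WriteLine(balanced.IsMatch("(a(b)c"));   // False
        Console.WriteLine(balanced.IsMatch("a)b("));     // False
    }
}
```

The stack here is simply the Captures collection described in the section on quantified capture groups: each opening parenthesis adds a capture, and each closing parenthesis removes the most recent one.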

An Alternate engine: PCRE.NET

PCRE is another of my favorite engines. In fact, this site probably has the most comprehensive resources about PCRE on the web, from the PCRE documentation and changelog to PCRE's special start of pattern modifiers, backtracking control verbs and the pcregrep and pcretest utilities.

So when I found out about Lucas Trzesniewski's .NET wrapper around the PCRE library, I was excited. This means you can get around .NET's lack of a few features such as recursion.

In Visual Studio 2015, installation is a snap:

1. Create a project.
2. Press Ctrl + Q for the Quick Launch window, type nuget and select Manage NuGet Packages for Solution.
3. In the search window, type pcre.net, making sure that the filters pull-down is set to All.
4. Install.

The Visual C++ Redistributable for Visual Studio 2015 is a requirement, but you probably won't need to install it if you installed all of VS2015.

Compared with using .NET regex, one difference to keep in mind is that on top of the .exe file, you'll have to distribute PCRE.NET.dll (which will be in your build folder). It only weighs 350kB so that's not a big deal. Still, if for some reason you're shooting for the size of a small console program such as the one below (about 7kB once compiled), this will blow up the budget. Of course in the case of a pure .NET solution you're probably "paying" a similar weight, but it's hidden in the framework's System.Text.RegularExpressions.dll (29kB) and (I assume) its parents.

To get you started, I'll give you a simple but fully functioning program that showcases the main methods. Beyond that, please visit my page about PCRE callouts, which shows more code examples in PCRE.NET, and see the GitHub repo if you'd like more information.

using System;
using PCRE;
using System.Linq;

class Program {
static void Main() {
    string subject = "<000> 111 <222> 333 4444";
    // Match three digits, unless they live inside angle brackets
    var digits_regex = new PcreRegex(@"<[^>]+(*SKIP)(*F)|\b\d(\d)\d\b");

    // Does the pattern match?
    Console.WriteLine("=== Does it Match? ===");
    Console.WriteLine(digits_regex.IsMatch(subject));

    // What is the first match?
    Console.WriteLine("=== First Match ===");
    var onematch = digits_regex.Match(subject);
    if (onematch.Success) {
        // onematch.Value is the same as onematch.Groups[0].Value
        Console.WriteLine(onematch.Value);
    }

    // What is Capture Group 1?
    Console.WriteLine("=== Capture Group 1 ===");
    if (onematch.Success) {
        Console.WriteLine(onematch.Groups[1].Value);
    }

    // What are all the matches?
    Console.WriteLine("=== Matches ===");
    var matches = digits_regex.Matches(subject);
    if (matches.Any()) {
        foreach (var match in matches) {
            Console.WriteLine(match.Value);
        }
    }

    // Replace: surround with angle brackets
    Console.WriteLine("=== Replacements ===");
    string replaced = digits_regex.Replace(subject, "<$0>");
    Console.WriteLine(replaced);

    // Replace using callback
    Console.WriteLine("=== Replacements with Callback ===");
    string replaced2 = digits_regex.Replace(subject, delegate (PcreMatch m) {
        if (m.Value == "111") return "<ones>";
        else return m.Value;
    });
    Console.WriteLine(replaced2);

    // Split
    Console.WriteLine("=== Splits ===");
    var splits = digits_regex.Split(subject);
    foreach (var split in splits) {
        Console.WriteLine(split);
    }

    Console.WriteLine("Press Any Key");
    Console.ReadKey();
}
}

Using Regular Expressions with PHP

With the preg family of functions, PHP has a great interface to regex! Let's explore how it works and what it has to offer.

Pattern Delimiters

The first and most important thing to know about the preg functions is that they expect you to frame the regex patterns you feed them with one delimiter character on each side. For instance, if you choose "~" as a delimiter, for the regex pattern \b\w+\b, this is the string you would feed to a preg function: '~\b\w+\b~'

For the delimiter, you can choose any character that is not a letter, a digit, a backslash or whitespace. But choose wisely, because if your delimiter appears in the pattern, it needs to be escaped. The forward slash is a popular delimiter, and strangely so since it needs to be escaped in all sorts of strings having to do with file paths. For instance, to match http://, do you really want your regex string to look like '/http:\/\//'? Doesn't '~http://~' look better?

Rare characters such as "~", "%", "#" or "@" are more sensible and fairly popular choices. I don't like the "#" because it clashes with the # you use in comment mode. Esthetically, my favorite is the tilde ("~") because it meets three criteria. First, it is discreet, which allows the actual regex to stand out. Many delimiters look like they belong to the expression, and that is confusing. Second, tildes rarely occur in my patterns, so I almost never have to escape them. Third, it is my favorite, which allows me to introduce some circular logic in this paragraph.

Pattern Modifiers: either Inline or as Flags

The second thing to know about PHP regex is that you can change their meaning by using modifiers, either as flags or inline. For instance, to look for "blob\d+" in case-insensitive fashion, you can add the "i" modifier in these two ways:

- As a flag at the end of the pattern: ~blob\d+~i
- Inline at the start of the pattern: ~(?i)blob\d+~

I tend to prefer inline modifier syntax, first because it jumps out at you when you start reading the regex, second because it is more portable across other regex flavors, and third because you can turn it off further down the string (for instance, (?-i) turns off the case-insensitive modifier).

The modifiers page explains all the flags and shows how to set them. It also presents PCRE's Special Start-of-Pattern Modifiers, which include little-known modifiers such as (*LIMIT_MATCH=x).

Whatever you do, never use the cursed U flag or the (?U) modifier because they will draw a gang of raptorexes to your cubicle, which is not a good look! The u flag and (?u) modifier, on the other hand, are fine: they make the engine treat the input as a utf-8 string.

The Preg functions

There are five major functions in the preg family:

Matching Once with Preg_Match()

This function is the most commonly seen in the world of PHP regex. It returns 1 if the pattern matches and 0 if it does not (or false if an error occurs), so you can test the result like a boolean. If you include a variable's name as a third parameter, such as $match in the example below, when there is a match, the variable will be filled with an array: element 0 for the entire match, element 1 for Group 1, element 2 for Group 2, and so on.

But a code box is worth a thousand words, so consider the following example.

$subject='Give me 10 eggs';
$pattern='~\b(\d+)\s*(\w+)$~';
$success = preg_match($pattern, $subject, $match);
if ($success) {
    echo "Match: ".$match[0]."<br />";
    echo "Group 1: ".$match[1]."<br />";
    echo "Group 2: ".$match[2]."<br />";
}

Output:

Match: 10 eggs
Group 1: 10
Group 2: eggs

Notice how $match[0] contains the overall match? Considering that $match[1] contains Group 1, this is equivalent to saying that the whole match is "Group 0", which is in tune with an idea presented in the section about capturing vs. matching: "The Match is Just Another Capture Group".

Finding All Matches with Preg_Match_All()

This terrific function gives you access to all of the pattern's matches. The matches (and the captured groups if any) are returned to an array. Depending on your needs, you can ask the function to organize the array of results in two distinct ways.

Consider this string and a regex pattern to match its lines:

$airports= 'San Francisco (SFO) USA
Sydney (SYD) Australia
Auckland (AKL) New Zealand';
$regex = '%(?m)^\s*+([^(]+?)\s\(([^)]+)\)\s+(.*)$%';

You want to isolate the airport's city, the airport code and the country. Here are the two ways to organize the array.

First Presentation: in the Order of the Pattern's Groups

In both presentations, $hits will contain the number of matches found (including 0 if none are found).

$hits = preg_match_all($regex,$airports,$matches,PREG_PATTERN_ORDER);

The output is below. Element 0 contains an array with the whole matches; element 1 contains an array with the Group 1 matches; element 2 contains an array with the Group 2 matches; and so on. This order (whole match, Group 1, Group 2, Group 3) can be said to be "the order of the regex pattern". The flag for this presentation is PREG_PATTERN_ORDER (think of it as "the order of the regex pattern"). This is actually the function's default behavior, so you can freely drop the PREG_PATTERN_ORDER flag when you call the function.

Array
(
    [0] => Array // The Whole Matches
        (
            [0] => San Francisco (SFO) USA
            [1] => Sydney (SYD) Australia
            [2] => Auckland (AKL) New Zealand
        )
    [1] => Array // The Group 1 Matches
        (
            [0] => San Francisco
            [1] => Sydney
            [2] => Auckland
        )
    [2] => Array // The Group 2 Matches
        (
            [0] => SFO
            [1] => SYD
            [2] => AKL
        )
    [3] => Array // The Group 3 Matches
        (
            [0] => USA
            [1] => Australia
            [2] => New Zealand
        )
)

Second Presentation: ordered by SET (one set for each match)

Again, $hits contains the number of matches found (including 0 if none are found).

$hits = preg_match_all($regex,$airports,$matches,PREG_SET_ORDER);

The output is below. Note that the outer array is organized "one SET for each match at a time".

- Element 0 contains an array with the first match (that array's element 0 is the whole match, element 1 is Group 1, element 2 is Group 2…)
- Element 1 contains an array with the second match (that array's element 0 is the whole match, element 1 is Group 1, element 2 is Group 2…)

Sometimes, this structure is exactly what you want. The flag for this presentation is PREG_SET_ORDER (think of it as "ordered by set").

Array
(
    [0] => Array // The First Match
        (
            [0] => San Francisco (SFO) USA
            [1] => San Francisco
            [2] => SFO
            [3] => USA
        )
    [1] => Array // The Second Match
        (
            [0] => Sydney (SYD) Australia
            [1] => Sydney
            [2] => SYD
            [3] => Australia
        )
    [2] => Array // The Third Match
        (
            [0] => Auckland (AKL) New Zealand
            [1] => Auckland
            [2] => AKL
            [3] => New Zealand
        )
)

To remember the flags, try to understand them as "in the order of the regex pattern" (PREG_PATTERN_ORDER), or "ordered by set" (PREG_SET_ORDER).

Replacing with Preg_Replace()

For straight replacements (for instance, replacing '10' with '20'), you don't really need regex. In such cases, str_replace can be faster than the preg_replace regex function:

$string=str_replace('10','20',$string);

The preg_replace function comes in when you need a regex pattern to match the string to be replaced, for instance if you only wanted to replace '10' when it stands alone but not when it is part of "101" or "File10".

By default, the function replaces all of the matches in the original string, so make sure this is what you want. If you want to replace only 1 or 5 instances, specify this limit as a fourth argument. Here is an example.

$subject='Give me 12 eggs then 12 more.';
$pattern='~\d+~';
$newstring = preg_replace($pattern, "6", $subject);
echo $newstring;

The Output: Give me 6 eggs then 6 more.

This code replaces the two instances of "12" with "6". If you wanted to only replace the first instance, you would set the limit (1) as a fourth argument:

$newstring = preg_replace($pattern, "6", $subject,1);

This would output "Give me 6 eggs then 12 more."

If you want to know how many replacements are made, add a variable as a fifth parameter. This forces you to set the fourth parameter (the limit on the number of replacements). To set no limit, use -1. For instance, with

$newstring = preg_replace($pattern, "6", $subject,-1, $count);

the value of $count would be 2.

Using Captured Groups in the Replacement

In the replacement string, you can refer to capture groups. Group 1 is \1 or $1, Group 2 is \2 or $2, and so on. This means that the replacement string "\2###\1" will replace the matched text with the content of Group 2 followed by three hashes and the content of Group 1.

This technique is often used when you want to rearrange the sequence of a string. You might match a whole big string full of unwanted fluff, capture the portions you are interested in, and rearrange them how you like.
Note that as it makes one replacement after another, the regex engine keeps working on the original string—rather than switching to the latest version of the string. For instance, using the string abcde, let's use the regex (?<=a)\w, which matches one word character preceded by an a: $string = preg_replace('~(?<=a)\w~','a','abcde'); This produces aacde: only the "b" was replaced, because in the original string it is the only character that is preceded by an "a". If, on the other hand, the regex engine switched to the latest version of the string after making each substitution, when it came to "c", that character would also be preceded by an "a", and we would end with aaaaa. Replacing an Invisible Delimiter This is a trick that regex lovers are sure to enjoy. It is closely related to the technique of Splitting with an Invisible Delimiter, so I explain it in that section.
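For readers following along in Python rather than PHP, the same ideas can be sketched with the re module. This is only a rough analogue, not PHP code: count plays the role of the limit argument, re.subn reports the number of replacements, and the sample string "left@right" is made up for the illustration.

```python
import re

subject = "Give me 12 eggs then 12 more."

# Replace every run of digits (no limit, like preg_replace's default).
all_replaced = re.sub(r"\d+", "6", subject)

# count=1 plays the role of preg_replace's fourth argument (the limit).
first_only = re.sub(r"\d+", "6", subject, count=1)

# re.subn returns the new string together with the number of replacements.
new_string, n_replaced = re.subn(r"\d+", "6", subject)

# Group references in the replacement: \2 and \1 swap the two captures.
# "left@right" is a hypothetical sample, not taken from the article.
swapped = re.sub(r"(\w+)@(\w+)", r"\2###\1", "left@right")

# The scan runs over the original string, so only "b" is preceded by "a".
lookbehind = re.sub(r"(?<=a)\w", "a", "abcde")

print(all_replaced)   # Give me 6 eggs then 6 more.
print(first_only)     # Give me 6 eggs then 12 more.
print(n_replaced)     # 2
print(swapped)        # right###left
print(lookbehind)     # aacde
```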

Sophisticated Replacements with Preg_Replace_Callback()

It's neat that preg_replace allows you to manipulate the replacement string by referring to captured groups. But let's face it, often you want to perform far more complex substitutions on the text you match. This is when preg_replace_callback comes to the rescue. Instead of specifying a literal replacement (or a replacement composed of literals and capture groups), preg_replace_callback lets you specify a replacement function. That function does its magic on the matched pattern and returns the replacement, which preg_replace_callback then plugs into place in the original string.

For instance, suppose you have a string where you need the last letter of each word to be converted to uppercase. First we'll look at the basic syntax, then we'll see an "inline syntax" that is more economical. In both cases, we'll use this regex:

\b(\w+)(\w)\b

This pattern simply matches each word separately (thanks to the \b word boundaries). As it does so, it captures all of a word's letters except its last into Group 1, and it captures the final letter into Group 2. (For this task, we're assuming that each word has at least two letters, so we're okay.)

Here's the basic way of doing the replacement.

$string = ("cool kids capitalize final letters");
$regex = "~\b(\w+)(\w)\b~";
$newstring = preg_replace_callback($regex, "LastToUpper", $string);
function LastToUpper($m) { return $m[1].strtoupper($m[2]); }
echo $newstring;

The Output: cooL kidS capitalizE finaL letterS

In the example above, you can see how preg_replace_callback specifies the name of the function that produces the replacement strings: "LastToUpper". The function LastToUpper is then defined. We know that preg_replace_callback sends one parameter to the substitution function, so we specify it and call it (arbitrarily) $m. This $m that preg_replace_callback sends to the substitution function is the current match array, in the same form as the match array of preg_match.
This means that $m[0] is the overall match, while $m[1] is Group 1, $m[2] is Group 2, and so on. This makes it easy for LastToUpper to return the word with the last letter capitalized: it is Group 1 (the initial letters) concatenated with the uppercase version of Group 2 (the last letter). Here we did something simple, but you can appreciate how easy it would be to infuse our substitution with more logic. Suppose, for instance, that we want to capitalize the last letter of each word, but that when that letter is an "s", we want to substitute a "Z". Easy done: we just burn that logic into the callback function. function LastToUpper($m) { $last = $m[2]=="s" ? "Z" : strtoupper($m[2]); return $m[1].$last; } The Output: cooL kidZ capitalizE finaL letterZ Lighter Version: Use an Anonymous Function Usually, we have no use for the substitution function except for the particular regex we're working on. The second method is the same, except that instead of passing a function name in the second argument, we define the function "inline" in the call to preg_replace_callback. $string = ("cool kids capitalize final letters"); $regex = "~\b(\w+)(\w)\b~"; $newstring = preg_replace_callback($regex, function($m) {return $m[1].strtoupper($m[2]);} ,$string); echo $newstring; Same Output: cooL kidS capitalizE finaL letterS As you can see, our callback function has no name: it's an anonymous function, so we don't pollute the name space. With this, you're equipped to make some powerful substitutions.
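The callback idea carries over to other engines. As a hedged sketch in Python (not the article's PHP), re.sub accepts a function in place of the replacement string; the function receives a match object rather than an array, so Group 1 is m.group(1). The function names below are invented for the illustration.

```python
import re

# Same pattern as in the PHP example: Group 1 is the word minus its last
# letter, Group 2 is the last letter.
WORD = re.compile(r"\b(\w+)(\w)\b")

def last_to_upper(m):
    """Return the word with its final letter uppercased."""
    return m.group(1) + m.group(2).upper()

def last_to_upper_z(m):
    """Same, except that a final 's' becomes 'Z'."""
    last = "Z" if m.group(2) == "s" else m.group(2).upper()
    return m.group(1) + last

text = "cool kids capitalize final letters"

plain = WORD.sub(last_to_upper, text)
with_z = WORD.sub(last_to_upper_z, text)
# The "inline" variant: an anonymous function passed directly.
inline = WORD.sub(lambda m: m.group(1) + m.group(2).upper(), text)

print(plain)   # cooL kidS capitalizE finaL letterS
print(with_z)  # cooL kidZ capitalizE finaL letterZ
```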

Splitting with Preg_Split()

You are probably familiar with the explode() function, which takes some text with elements delimited by a string (such as a comma, or three stars: ***) and splits the text along the delimiter, fanning the elements into an array. For instance, the following would print an array with "break", "my" and "string".

$string = ("break***my***string");
print_r(explode("***", $string));

Well, preg_split is the "adult" version of explode(). It too will split a string, but it will allow you to use variable delimiters, making it easy to extract interesting bits of text with unwanted (but specifiable) gunk in the middle.

For instance, let's assume that this time, the delimiter (or unwanted part) is a C-style comment (with optional spaces on the side for good measure), such as "/* This part is useless to us */". For the purpose of this example, we assume that we know that the delimiters are single C-style comments, meaning that there are no nested comments (that's a different exercise related to matching balanced parentheses). No worries. The following will output "better", "regex", "today".

$string = ("better /* I want to improve */ regex/***COOL***/today");
$regex = "~\s*/\*.*?\*/\s*~";
print_r(preg_split($regex, $string));

The Output: Array ( [0] => better [1] => regex [2] => today )

Like preg_replace, preg_split has an optional parameter (in third place) that allows you to set a limit on the number of elements you want to fan to the array. There are also some flags that you can read about on the preg_split manual page.

And now, here's a way of looking at things that's sure to interest the algorithm lovers among you: often, you can use preg_split instead of preg_match_all. In a way, both return matches. While preg_match_all specifies what you want, preg_split specifies what you want to remove. (Or, as we'll see below, what we want to set apart.)

Splitting without Losing

Sometimes you want to split a string without removing anything from it.
Or we might only want to remove a certain section. Imagine a long ribbon with consecutive colors: red, blue, red, blue, red… So far, the splitting we have seen would remove all the reds to produce an array with all the blues. But another use of preg_split is to split the string into an array with the correct "bands of red and blue". For this, we use a flag: PREG_SPLIT_DELIM_CAPTURE. Here's how it works. In the example below, our delimiter is a series of digits, for instance "123". Instead of throwing them away, we want to keep them. $str = "We123Like456Delimiters"; $regex = "~(\d+)~"; print_r(preg_split($regex,$str,-1,PREG_SPLIT_DELIM_CAPTURE)); The Output: Array: [0]=>We [1]=>123 [2]=>Like [3]=>456 [4]=>Delimiters In our preg_split call, the third parameter -1 just states we don't want to limit the number of matches. What PREG_SPLIT_DELIM_CAPTURE actually does is to insert any captured groups into the array. This is why the (\d+) was in parentheses: we include the whole delimiter into the array. But we don't have to keep the entire delimiter. Imagine for instance that your delimiter is of the form @@ABC123, where ABC are three capital letters and 123 are three digits. If you want to fan "ABC" and "123" into the array but lose the "@@", you would do this: $str = "token1@@ABC123token2@@DEF456token3"; $regex = "~@@([A-Z]{3})(\d{3})~"; print_r(preg_split($regex,$str,-1,PREG_SPLIT_DELIM_CAPTURE)); The Output: Array: [0]=>token1, [1]=>ABC, [2]=>123, [3]=>token2, [4]=>DEF, [5]=>456, [6]=> token3 Splitting with an Invisible Delimiter Here is a lovely feature of splitting string with regex. The preg_split function allows you to split a string with an invisible delimiter. For instance, consider a movie title written in camel case (perhaps because it was in a file name): TheDayMyVoiceBroke. You're interested in retrieving each word. But what's the delimiter? There is an "invisible" delimiter: any space where the next character is a capital letter. 
This can be expressed as a simple lookahead: (?=[A-Z]). You could call that a "zero-width delimiter". Let's see it at work:

$string = ("TheDayMyVoiceBroke");
$regex = "~(?=[A-Z])~";
$words = preg_split($regex, $string);
print_r($words);

The Output: Array ( [0] => [1] => The [2] => Day [3] => My [4] => Voice [5] => Broke )

Magical! (Element 0 is empty because the lookahead also matches at the very start of the string, before the "T".) But maybe we want to concatenate the words of the movie into a string, with spaces between the words? Before you reach for implode(" ", $words), consider that what we just did with preg_split, we can do with preg_replace. Here is the code and the output.

Replacing an Invisible Delimiter

$string = ("TheDayMyVoiceBroke");
$regex = "~(?=[A-Z])~";
echo preg_replace($regex, " ", $string);

The Output: The Day My Voice Broke (preceded by one leading space, again because of the match at the start of the string)
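The splitting techniques above can also be tried in Python. This is a sketch under two assumptions: Python 3.7 or later (earlier versions of re do not split on zero-width matches), and the extra lookbehind (?<!^), which is my addition to skip the match at the very start of the string and so avoid the empty first element and the leading space. In re.split, capture groups in the pattern are kept in the result, which is roughly what PREG_SPLIT_DELIM_CAPTURE does in PHP.

```python
import re

# Variable delimiter: a C-style comment with optional surrounding spaces.
comment_split = re.split(
    r"\s*/\*.*?\*/\s*",
    "better /* I want to improve */ regex/***COOL***/today",
)

# Capturing parentheses keep the delimiter in the resulting list.
kept = re.split(r"(\d+)", "We123Like456Delimiters")

# Only the captured parts of the delimiter are kept; the "@@" is lost.
parts = re.split(r"@@([A-Z]{3})(\d{3})", "token1@@ABC123token2@@DEF456token3")

# Zero-width delimiter (Python 3.7+). (?<!^) skips the position before "T".
title = "TheDayMyVoiceBroke"
words = re.split(r"(?<!^)(?=[A-Z])", title)
spaced = re.sub(r"(?<!^)(?=[A-Z])", " ", title)

print(comment_split)  # ['better', 'regex', 'today']
print(kept)           # ['We', '123', 'Like', '456', 'Delimiters']
print(parts)          # ['token1', 'ABC', '123', 'token2', 'DEF', '456', 'token3']
print(words)          # ['The', 'Day', 'My', 'Voice', 'Broke']
print(spaced)         # The Day My Voice Broke
```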

More About preg Functions

The above functions have a few settings I haven't shown. PHP also has a few other preg functions, but they are of minor interest compared with the ones presented here. You can read about them in the preg function section of the PHP manual. In Chapter 10.4 ("Missing" preg Functions) of Mastering Regular Expressions, Jeffrey Friedl also presents three functions he has programmed to "round off" the preg functions. I recommend you read the book, but if you're in a hurry you can find the functions in the code section of regex.info, Jeffrey's website. Hit Ctrl + F to search for "preg_regex_to_pattern", "preg_pattern_error" and "preg_regex_error".

A Powerful Lookbehind Alternative: \K

If your version of PHP is 5.2.4 or later (phpinfo is your friend), you can use a wonderful PCRE escape sequence: \K. In the middle of a pattern, \K says "reset the beginning of the reported match to this point". Anything that was matched before the \K goes unreported, a bit like in a lookbehind. For example, on the string "Marlon Brando", the pattern (?i)marlon \Kbrando will return "Brando".

Well, you could get "Brando" with a capture group or a lookbehind, so what's the big deal? The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow variable-length quantifiers such as * or +: the length of what you look for must be fixed. On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before the \K.

For instance, let's say you want to match "Brando xx" in "Marlon Brando xx" (where xx are digits) but only if the string sits somewhere between a <tag> and a </tag>. You can't look behind for the start of the tag because you don't know how many characters are before "Marlon Brando", and variable-length lookbehinds are forbidden in PCRE. One option is to match everything and capture "Brando xx" in a Group. Option 2 is to use \K, saving us the overhead of a capture group:

(?i)<tag>(?:(?!</tag).)*marlon \Kbrando \d+
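Python's re module has no \K, so it makes a convenient place to illustrate the first option mentioned above (match everything, capture the part you want). The sketch below is an assumption-laden illustration, not the article's code: the sample text is invented, and the second half simply checks that re, like PCRE, rejects a variable-length lookbehind.

```python
import re

# Hypothetical sample text for the illustration.
text = "<tag>starring Marlon Brando 42</tag>"

# Capture-group version of (?i)<tag>(?:(?!</tag).)*marlon \Kbrando \d+
# The part that \K would have kept is wrapped in Group 1 instead.
pattern = re.compile(r"(?i)<tag>(?:(?!</tag).)*marlon (brando \d+)")
m = pattern.search(text)
wanted = m.group(1) if m else None

# A variable-length lookbehind is refused by re at compile time.
try:
    re.compile(r"(?i)(?<=<tag>.*marlon )brando \d+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

print(wanted)         # Brando 42
print(lookbehind_ok)  # False
```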

A Full "Advanced" PHP regex program

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks—and some not-so-common ones as well. This is what I have for you in the following complete PHP regex program. The program is featured on my page about the best regex trick ever. This program performs the six most common regex tasks. The tweak is that it has no interest in the overall matches: the data we're seeking is in capture Group 1, if it is set. As a side-benefit, the program and the article happen to provide an excellent overview of the (*SKIP)(*FAIL) syntax available in Perl and PHP. Just search throughout the article. Here is the article's Table of Contents Here is the explanation for the code Here is the PHP code

More about PHP Regex

For more details on PHP's PCRE regex flavor, I recommend a stroll through three pages of the PHP manual:

- Pattern syntax
- Modifiers (e.g. case insensitive)
- Functions (e.g. preg_match)

If you are serious about learning all there is to know about PHP's PCRE regex flavor, then sooner or later you will want to head over to my PCRE documentation repository. With the permission of Philip Hazel, the creator of PCRE, this page contains the documentation for the latest PCRE release as well as other historical releases. It also contains a table showing in which versions of PCRE new syntax features were introduced, as well as links to other PCRE-related material on the site.

Lately, I have been working hard on beefing up the site. There are exciting new pages, and old ones have shiny new sections. The Python regex tutorial is not fully ready for prime-time, but it's one of four at the top of my priority list. I'm working on it! In the meantime, I don't want to leave you Python coders high and dry, so below there are two programs that show everything you need to get started with Python regex. But first, I feel that a word is in order about the feature set in the re module.

Missing from the re module

So what's missing from the re module? Here is a list I've cobbled together. It's incomplete but will give you an idea:

- Atomic groups and possessive quantifiers (added in Python 3.11)
- Unicode properties
- Variable-width lookbehind
- \G
- \K
- \z
- Splitting on zero-width matches (fixed in Python 3.7)
- Subroutine calls and recursion
- Character class operations
- Branch reset
- (*SKIP)(*FAIL)
- Advanced features for inline modifiers such as (?i): setting them in the middle of a pattern, turning them off as in (?-i), applying them to a subexpression as in (?i:foo) (the scoped forms (?i:foo) and (?-i:foo) were added in Python 3.6)

Why I use the regex package

In my view, the alternate regex package by Matthew Barnett may possibly be the very best regex engine available in a mainstream programming language.
Before you Perl fans send me flame letters, I'll explain why: the regex package combines some of the advanced features of .NET (infinite lookbehind, capture collections, character class operations, right-to-left matching) with some of the advanced features of Perl, PCRE and Ruby (subroutines and recursion). It even has a fuzzy matching feature. The recent addition of \K, (?(DEFINE)…) and (*SKIP)(*FAIL) make it a delight to translate advanced patterns from Perl or PCRE. If I could add anything to my perfect regex engine dreamlist to round up this amazing engine, it would be balancing groups and some kind of ground-breaking quantifier capture feature. An iPython Notebook presentation about the regex package Around the time I was thinking of putting together a presentation about Python regex for our local Python meetup, I received a message from Rex Dwyer, who kindly shared a presentation he had made for his local Python users' group. Synchronicity! You can download the presentation here. It is an iPython notebook. I have confirmed that all the cells run in Jupyter for Python 3, but I haven't yet had the time to read the presentation.

Curated Changelog to the re module

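- Python 3.6: scoped inline modifiers such as (?i:foo) and (?-i:foo)
- Python 3.7: splitting on zero-width matches
- Python 3.8: \N was added to match specific characters by name, e.g. \N{YEN SIGN} instead of \u00A5 to match the ¥ character
- Python 3.11: atomic groups and possessive quantifiers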
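The 3.7 and 3.8 entries can be checked with a few lines. This is a minimal sketch that assumes Python 3.8 or later; the sample strings are made up, except "a-beautiful-day", which is the example used elsewhere on this page.

```python
import re

# Python 3.7+: re.split accepts a zero-width delimiter such as a lookahead.
pieces = re.split(r"(?=-)", "a-beautiful-day")

# Python 3.8+: \N{...} inside a pattern matches a character by its Unicode name.
# The subject is written with \u00A5 escapes to keep the source ASCII-only.
subject = "\u00A5100 and \u00A5250"
yen_amounts = re.findall(r"\N{YEN SIGN}\d+", subject)

print(pieces)            # ['a', '-beautiful', '-day']
print(len(yen_amounts))  # 2
```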

Python Regex Program #1: Simple

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

import re
# import regex # if you like good times
# intended to replace `re`, the regex module has many advanced
# features for regex lovers. http://pypi.python.org/pypi/regex

pattern = r'(\w+):(\w+):(\d+)'
subject = 'apple:green:3 banana:yellow:5'
regex = re.compile(pattern)

######## The six main tasks we're likely to have ########

# Task 1: Is there a match?
print("*** Is there a Match? ***")
if regex.search(subject):
    print("Yes")
else:
    print("No")

# Task 2: How many matches are there?
print("\n" + "*** Number of Matches ***")
matches = regex.findall(subject)
print(len(matches))

# Task 3: What is the first match?
print("\n" + "*** First Match ***")
match = regex.search(subject)
if match:
    print("Overall match: ", match.group(0))
    print("Group 1 : ", match.group(1))
    print("Group 2 : ", match.group(2))
    print("Group 3 : ", match.group(3))

# Task 4: What are all the matches?
print("\n" + "*** All Matches ***\n")
print("------ Method 1: finditer ------\n")
for match in regex.finditer(subject):
    print("--- Start of Match ---")
    print("Overall match: ", match.group(0))
    print("Group 1 : ", match.group(1))
    print("Group 2 : ", match.group(2))
    print("Group 3 : ", match.group(3))
    print("--- End of Match---\n")

print("\n------ Method 2: findall ------\n")
# if there are capture groups, findall doesn't return the overall match
# therefore, in that case, wrap the pattern in capturing parentheses
# the overall match becomes group 1, so other group numbers are bumped up!
wrappedpattern = "(" + pattern + ")"
wrappedregex = re.compile(wrappedpattern)
matches = wrappedregex.findall(subject)
if len(matches) > 0:
    for match in matches:
        print("--- Start of Match ---")
        print("Overall Match: ", match[0])
        print("Group 1: ", match[1])
        print("Group 2: ", match[2])
        print("Group 3: ", match[3])
        print("--- End of Match---\n")

# Task 5: Replace the matches
# simple replacement: reverse group
print("\n" + "*** Replacements ***")
print("Let's reverse the groups")
def reversegroups(m):
    return m.group(3) + ":" + m.group(2) + ":" + m.group(1)
replaced = regex.sub(reversegroups, subject)
print(replaced)

# Task 6: Split
print("\n" + "*** Splits ***")
# Let's split at colons or spaces
splits = re.split(r":|\s", subject)
for split in splits:
    print(split)
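Two small variations on the program may be worth knowing; they are offered as a sketch rather than as part of the original code. For a simple rearrangement such as Task 5, a replacement string with backreferences does the same job as the callback, and for Task 4, calling groups() on each finditer match gives all the captures at once.

```python
import re

regex = re.compile(r"(\w+):(\w+):(\d+)")
subject = "apple:green:3 banana:yellow:5"

# Task 5 without a callback: \3, \2 and \1 refer to the capture groups.
reversed_groups = regex.sub(r"\3:\2:\1", subject)

# Task 4, compactly: one tuple of captures per match.
rows = [m.groups() for m in regex.finditer(subject)]

print(reversed_groups)  # 3:green:apple 5:yellow:banana
print(rows)             # [('apple', 'green', '3'), ('banana', 'yellow', '5')]
```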

Python Regex Program #2: Advanced

The second full Python regex program is featured on my page about the best regex trick ever.

- Here is the article's Table of Contents
- Here is the explanation for the code
- Here is the Python code

For versions prior to 3.7, re does not split on zero-width matches

EDIT: the following text is obsolete as of Python 3.7

In most regex engines, you can use lookarounds to split a string on a position, i.e. a zero-width match, obtained for instance by using boundaries or lookarounds. For instance, you would use (?=-) to split when the next character is a dash. However, for historical reasons (a bug that is now too old to fix) Python's re.split does not split on zero-width matches. For instance, re.split("(?=-)", "a-beautiful-day") returns ['a-beautiful-day']. To split on zero-width matches in Python, we need to use the regex module in V1 mode. For instance, regex.split("(?V1)(?=-)", "a-beautiful-day") will return ['a', '-beautiful', '-day'], which is what we want.

Java regex is an interesting beast. On the one hand, it has a number of "premium" features, such as:

- Lookbehind that allows a variable width within a specified range
- Methods that return the starting and ending point of a match in a string
- Support for \R to match any kind of line break, including CRLF pairs
- Support for the \G anchor (which asserts that the current position is the beginning of the string or the position immediately following the previous match)
- Support for the \Q … \E (block escape)
- Possessive quantifiers
- The (?d) modifier (also accessible via the UNIX_LINES option). When this is on, the line feed character \n is the only one that affects the dot . (which doesn't match line breaks unless DOTALL is on) and the anchors ^ and $ (which match line beginnings and endings in multiline mode)

On the other hand, Java regex has several unpleasant aspects, such as:

- Absence of other important premium features found in .NET, Perl or PCRE, such as \K, (*SKIP)(*F), subroutines and recursion
- The absence of raw strings, forcing us to double escape backslashes in regex patterns
- A buggy lookbehind which has a number of undocumented effects

A Java program

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks, and some not-so-common ones as well. This is what I have for you in the following complete Java regex program. It's taken from my page about the best regex trick ever, and it performs the six most common regex tasks.

The first four tasks answer the most common questions we use regex for:

- Does the string match?
- How many matches are there?
- What is the first match?
- What are all the matches?

The last two tasks perform two other common regex tasks:

- Replace all matches
- Split the string

If you study this code, you'll have a terrific starting point to start tweaking and testing with your own expressions with Java. Bear in mind that the code inspects values captured in Group 1, so you'll have to tweak… but you'll have a solid base to understand how to do basic things, and fairly advanced ones as well.

As you can imagine, I am not fluent in all of the ten or so languages showcased on the site. This means that although the sample code works, a Java pro might look at the code and see a more idiomatic way of testing an empty value or iterating a structure. If some idiomatic improvements jump out at you, please shoot me a comment.

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

import java.util.*;
import java.util.regex.*;

class Program {
    public static void main (String[] args) throws java.lang.Exception {
        String subject = "Jane\" \"Tarzan12\" Tarzan11@Tarzan22 {4 Tarzan34}";
        Pattern regex = Pattern.compile("\\{[^}]+\\}|\"Tarzan\\d+\"|(Tarzan\\d+)");
        Matcher regexMatcher = regex.matcher(subject);
        List<String> group1Caps = new ArrayList<String>();

        // put Group 1 captures in a list
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                group1Caps.add(regexMatcher.group(1));
            }
        } // end of building the list

        ///////// The six main tasks we're likely to have ////////

        // Task 1: Is there a match?
        System.out.println("*** Is there a Match? ***");
        if (group1Caps.size() > 0) System.out.println("Yes");
        else System.out.println("No");

        // Task 2: How many matches are there?
        System.out.println("\n" + "*** Number of Matches ***");
        System.out.println(group1Caps.size());

        // Task 3: What is the first match?
        System.out.println("\n" + "*** First Match ***");
        if (group1Caps.size() > 0) System.out.println(group1Caps.get(0));

        // Task 4: What are all the matches?
        System.out.println("\n" + "*** Matches ***");
        if (group1Caps.size() > 0) {
            for (String match : group1Caps) System.out.println(match);
        }

        // Task 5: Replace the matches
        // if only replacing, delete the line with the first matcher
        // also delete the section that creates the list of captures
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) m.appendReplacement(b, "Superman");
            // quoteReplacement keeps any $ or \ in the match literal
            else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
        }
        m.appendTail(b);
        String replaced = b.toString();
        System.out.println("\n" + "*** Replacements ***");
        System.out.println(replaced);

        // Task 6: Split
        // Start by replacing by something distinctive,
        // as in Step 5. Then split.
        String[] splits = replaced.split("Superman");
        System.out.println("\n" + "*** Splits ***");
        for (String split : splits) System.out.println(split);
    } // end main
} // end Program

Read the explanation or jump to the article's Table of Contents

Character Class Intersection, Subtraction and Union

The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class. For details, on the page about character class operations, see character class intersection and character subtraction union in Java and Ruby. Similarly, the syntax […[…]] allows you to use a logical OR on several character classes to ensure that a character is present in either of them. For details, see character class union in Java and Ruby on the page about character class operations.

This page focuses on regular expressions in JavaScript. Before we start, I feel that a word is in order about what makes JavaScript regex special. But the main issue that makes JavaScript regex so obnoxious is its lack of features. For instance, all major regex flavors support these features, except JavaScript:

- Dot-matches-line-breaks mode (a.k.a. DOTALL or single-line mode)
- Lookbehind
- Inline modifiers such as (?i)
- Named capture groups
- Free-spacing mode
- \A and \Z anchors
- Ability of $ to match before any line breaks at the end of the string

Unicode properties, atomic groups and \G are also absent. This "distinction" is shared with Python. (This list describes JavaScript before ECMAScript 2018, which added the s flag for dot-matches-line-breaks mode, lookbehind, named capture groups and Unicode property escapes.) Needless to say, other advanced features that regex heads frequently use (such as subroutines, named subroutines, recursion, conditionals, and so on) are nowhere in sight.

In short, JavaScript regex is a horrible little engine. The lack of lookbehind means that you'll need to work a lot more with capture groups. On the other hand, scarcity can be the mother of invention, so the lack of features will sometimes inspire you to find alternate ways to reach your goals. One such example is the well-known hack to mimic an atomic group.

Better JavaScript regex: the XRegExp library

If you are stuck working in JavaScript and really cannot stand the default engine, consider using XRegExp, an alternate library written by Steven Levithan, a co-author of the Regular Expressions Cookbook. Here are some features found in the XRegExp library but not in standard JavaScript implementations:

- Dot-matches-line-breaks mode, either inline with (?s) or with the "s" option
- Inline modifiers such as (?ism)
- Free-spacing mode, either inline with (?x) or with the "x" option
- Named capture with (?<foo>…), backreference \k<foo> and replacement insertion ${foo}
- Unicode properties such as \p{L}

Amazingly, XRegExp does not support lookbehind. Steven Levithan has provided a code workaround. Apart from that, you're back to using capture groups.

Even Better JavaScript regex: PCRE port

You can also port PCRE to JavaScript using Emscripten, as Firas seems to have done on regex 101. But getting it to work just how you like it will be a lot of work.

A JavaScript program

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks, and some not-so-common ones as well. This is what I have for you in the following complete JavaScript regex program. It's taken from my page about the best regex trick ever, and it performs the six most common regex tasks.

The first four tasks answer the most common questions we use regex for:

- Does the string match?
- How many matches are there?
- What is the first match?
- What are all the matches?

The last two tasks perform two other common regex tasks:

- Replace all matches
- Split the string

If you study this code, you'll have a terrific starting point to start tweaking and testing with your own expressions with JavaScript. Bear in mind that the code inspects values captured in Group 1, so you'll have to tweak… but you'll have a solid base to understand how to do basic things, and fairly advanced ones as well.

As you can imagine, I am not fluent in all of the ten or so languages showcased on the site. This means that although the sample code works, a JavaScript pro might look at the code and see a more idiomatic way of testing an empty value or iterating a structure. If some idiomatic improvements jump out at you, please shoot me a comment.

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

<script>
var subject = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34} ';
var regex = /{[^}]+}|"Tarzan\d+"|(Tarzan\d+)/g;
var group1Caps = [];
var match = regex.exec(subject);

// put Group 1 captures in an array
while (match != null) {
    if (match[1] != null) group1Caps.push(match[1]);
    match = regex.exec(subject);
}

///////// The six main tasks we're likely to have ////////

// Task 1: Is there a match?
document.write("*** Is there a Match? ***<br>");
if (group1Caps.length > 0) document.write("Yes<br>");
else document.write("No<br>");

// Task 2: How many matches are there?
document.write("<br>*** Number of Matches ***<br>");
document.write(group1Caps.length);

// Task 3: What is the first match?
document.write("<br><br>*** First Match ***<br>");
if (group1Caps.length > 0) document.write(group1Caps[0], "<br>");

// Task 4: What are all the matches?
document.write("<br>*** Matches ***<br>");
if (group1Caps.length > 0) {
    for (var key in group1Caps) document.write(group1Caps[key], "<br>");
}

// Task 5: Replace the matches
// see callback parameters http://tinyurl.com/ocddsuk
var replaced = subject.replace(regex, function(m, group1) {
    // group1 is undefined (or "" in some old engines) when Group 1 did not take part
    if (!group1) return m;
    else return "Superman";
});
document.write("<br>*** Replacements ***<br>");
document.write(replaced);

// Task 6: Split
// Start by replacing by something distinctive,
// as in Step 5. Then split.
var splits = replaced.split("Superman");
document.write("<br><br>*** Splits ***<br>");
for (var key in splits) document.write(splits[key], "<br>");
</script>

Read the explanation or jump to the article's Table of Contents

Differences in Regex Features across Ruby Versions

Before we start, you should know that there were important breaks in regex support between Ruby versions 1.8, 1.9 and 2.0. I won't say anything about version 1.8 except that it's the dark ages of Ruby regex. In version 1.9, the Oniguruma engine became integrated with Ruby. Version 2.0 started using the Onigmo engine, a fork of Oniguruma. This added some interesting features:

- Conditionals
- Recursion
- \K to drop what was matched so far from the match to be returned
- \R to match all line break characters including CRLF
- \X to match a single Unicode grapheme

In all engines that support it, except for Ruby, the "dot matches line breaks" mode (a.k.a. single-line or DOTALL mode) is turned on by the (?s) inline modifier or the s flag. In Ruby, you turn it on with the (?m) inline modifier or the m flag. This is confusing because in other flavors, the m stands for multi-line, which is the mode where the beginning- and end-of-string anchors ^ and $ are allowed to match on every line. In Ruby, ^ and $ always match on every line. If you want to specify the beginning of the string, use \A. For the very end of the string, use \z (or \Z to match at the end of the string or before the final line break, if any).
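For contrast, here is how the conventional meaning of the s and m modifiers looks in another flavor. The sketch uses Python's re module, not Ruby, and the sample text is invented. One caveat stated as fact about Python rather than about Ruby: in re, \Z matches only at the very end of the string, so it behaves like \z in Ruby.

```python
import re

text = "first line\nsecond line"

# Without (?m), ^ matches only at the start of the string.
start_only = re.findall(r"^\w+", text)
# With (?m) (multi-line mode), ^ matches at the start of every line.
every_line = re.findall(r"(?m)^\w+", text)

# Without (?s), the dot refuses to match the line break.
dot_plain = re.search(r"line.second", text)
# With (?s) (DOTALL), the dot matches it.
dot_all = re.search(r"(?s)line.second", text)

# \A is the start of the string even in multi-line mode.
anchored = re.findall(r"(?m)\A\w+", text)

# In Python, $ can match before a final line break but \Z cannot.
dollar = re.search(r"line$", text + "\n")
big_z = re.search(r"line\Z", text + "\n")

print(start_only)  # ['first']
print(every_line)  # ['first', 'second']
print(anchored)    # ['first']
```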

Other Ruby Quirks

I've been meaning to compile a list. I'll start with one item: unlike other engines, Ruby does not allow a lookahead or a negative lookbehind inside a lookbehind, such as (?<=(?<!A)A)

Character Class Intersection, Subtraction and Union

The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class. For details, on the page about character class operations, see character class intersection and character subtraction union in Java and Ruby. Similarly, the syntax […[…]] allows you to use a logical OR on several character classes to ensure that a character is present in either of them. For details, see character class union in Java and Ruby on the page about character class operations.
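In an engine without character class operations, such as Python's re module, a lookahead can approximate them. The following is only a sketch of that workaround, written in Python with an invented sample string; it is not the Java or Ruby syntax described above.

```python
import re

sample = "regex rocks"

# Subtraction, like [a-z&&[^aeiou]]: a lowercase letter that is not a vowel.
consonants = re.findall(r"(?![aeiou])[a-z]", sample)

# Intersection, like [g-z&&[a-m]]: the character must belong to both classes.
g_to_m = re.findall(r"(?=[a-m])[g-z]", sample)

# Union, like [a-c[x-z]]: simply list both ranges in one class.
either = re.findall(r"[a-cx-z]", sample)

print(consonants)  # ['r', 'g', 'x', 'r', 'c', 'k', 's']
print(g_to_m)      # ['g', 'k']
print(either)      # ['x', 'c']
```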

A Ruby program

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks, and some not-so-common ones as well. This is what I have for you in the following complete Ruby regex program. It's taken from my page about the best regex trick ever, and it performs the six most common regex tasks.

The first four tasks answer the most common questions we use regex for:
Does the string match?
How many matches are there?
What is the first match?
What are all the matches?

The last two tasks perform two other common regex tasks:
Replace all matches
Split the string

If you study this code, you'll have a terrific starting point for tweaking and testing your own expressions in Ruby. Bear in mind that the code inspects values captured in Group 1, so you'll have to tweak it for your own patterns, but you'll have a solid base for understanding how to do basic things, and fairly advanced ones as well.

As you can imagine, I am not fluent in all of the ten or so languages showcased on the site. This means that although the sample code works, a Ruby pro might look at the code and see a more idiomatic way of testing an empty value or iterating a structure. If some idiomatic improvements jump out at you, please shoot me a comment.

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

    subject = 'Jane"" ""Tarzan12"" Tarzan11@Tarzan22 {4 Tarzan34}'
    regex = /{[^}]+}|"Tarzan\d+"|(Tarzan\d+)/

    # put Group 1 captures in an array
    group1Caps = []
    subject.scan(regex) {|m|
        group1Caps << $1 if !$1.nil?
    }

    ######## The six main tasks we're likely to have ########

    # Task 1: Is there a match?
    puts("*** Is there a Match? ***")
    if group1Caps.length > 0
        puts "Yes"
    else
        puts "No"
    end

    # Task 2: How many matches are there?
    puts "\n*** Number of Matches ***"
    puts group1Caps.length

    # Task 3: What is the first match?
    puts "\n*** First Match ***"
    if group1Caps.length > 0
        puts group1Caps[0]
    end

    # Task 4: What are all the matches?
    puts "\n*** Matches ***"
    if group1Caps.length > 0
        group1Caps.each { |x| puts x }
    end

    # Task 5: Replace the matches
    replaced = subject.gsub(regex) {|m|
        if $1.nil?
            m
        else
            "Superman"
        end
    }
    puts "\n*** Replacements ***"
    puts replaced

    # Task 6: Split
    # Start by replacing by something distinctive,
    # as in Step 5. Then split.
    splits = replaced.split(/Superman/)
    puts "\n*** Splits ***"
    splits.each { |x| puts x }
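Since the text above invites idiomatic improvements, here is one possible more compact variant of the same six tasks. Treat it as a sketch rather than a replacement for the program above: it assumes the same subject and pattern (with the literal braces escaped), and the helper names `group1_captures` and `replace_group1` are made up for illustration.

```ruby
# Hypothetical helpers: the names are illustrative, not part of the original program.
SUBJECT = 'Jane"" ""Tarzan12"" Tarzan11@Tarzan22 {4 Tarzan34}'
# Same pattern as the program above, with the literal braces escaped.
REGEX = /\{[^}]+\}|"Tarzan\d+"|(Tarzan\d+)/

# With a single capture group, scan returns an array of one-element arrays.
# flatten.compact keeps only the matches in which Group 1 was set.
def group1_captures(subject, regex)
  subject.scan(regex).flatten.compact
end

# Replace only the matches in which Group 1 was set;
# every other match is put back unchanged ($& is the whole match).
def replace_group1(subject, regex, replacement)
  subject.gsub(regex) { $1 ? replacement : $& }
end

caps = group1_captures(SUBJECT, REGEX)
puts caps.any? ? "Yes" : "No"        # Task 1: is there a match?
puts caps.length                     # Task 2: how many matches?
puts caps.first unless caps.empty?   # Task 3: first match
puts caps                            # Task 4: all matches
replaced = replace_group1(SUBJECT, REGEX, "Superman")
puts replaced                        # Task 5: replace
puts replaced.split("Superman")      # Task 6: split
```

The design choice is the same as in the longer program: the unwanted contexts are matched by the first two alternatives and ignored, and only Group 1 is inspected.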

A VB.NET program

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks, and some not-so-common ones as well. This is what I have for you in the following complete VB.NET regex program. It's taken from my page about the best regex trick ever, and it performs the six most common regex tasks.

The first four tasks answer the most common questions we use regex for:
Does the string match?
How many matches are there?
What is the first match?
What are all the matches?

The last two tasks perform two other common regex tasks:
Replace all matches
Split the string

If you study this code, you'll have a terrific starting point for tweaking and testing your own expressions in VB.NET. Bear in mind that the code inspects values captured in Group 1, so you'll have to tweak it for your own patterns, but you'll have a solid base for understanding how to do basic things, and fairly advanced ones as well.

As you can imagine, I am not fluent in all of the ten or so languages showcased on the site. This means that although the sample code works, a VB.NET pro might look at the code and see a more idiomatic way of testing an empty value or iterating a structure. If some idiomatic improvements jump out at you, please shoot me a comment.

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.

(The code compiles perfectly in VS2013, but no online demo is supplied because the VB.NET in ideone chokes on anonymous functions.)
    Imports System
    Imports System.Text.RegularExpressions
    Imports System.Collections.Specialized

    Module Module1
        Sub Main()
            Dim MyRegex As New Regex("{[^}]+}|""Tarzan\d+""|(Tarzan\d+)")
            Dim Subject As String = "Jane"" ""Tarzan12"" Tarzan11@Tarzan22 {4 Tarzan34} "
            Dim Group1Caps As StringCollection = New StringCollection()
            Dim MatchResult As Match = MyRegex.Match(Subject)

            ' put Group 1 captures in a list
            While MatchResult.Success
                If MatchResult.Groups(1).Value <> "" Then
                    Group1Caps.Add(MatchResult.Groups(1).Value)
                End If
                MatchResult = MatchResult.NextMatch()
            End While

            '///////// The six main tasks we're likely to have ////////

            '// Task 1: Is there a match?
            Console.WriteLine("*** Is there a Match? ***")
            If (Group1Caps.Count > 0) Then
                Console.WriteLine("Yes")
            Else
                Console.WriteLine("No")
            End If

            '// Task 2: How many matches are there?
            Console.WriteLine(vbCrLf & "*** Number of Matches ***")
            Console.WriteLine(Group1Caps.Count)

            '// Task 3: What is the first match?
            Console.WriteLine(vbCrLf & "*** First Match ***")
            If (Group1Caps.Count > 0) Then Console.WriteLine(Group1Caps(0))

            '// Task 4: What are all the matches?
            Console.WriteLine(vbCrLf & "*** Matches ***")
            If (Group1Caps.Count > 0) Then
                For Each match As String In Group1Caps
                    Console.WriteLine(match)
                Next
            End If

            '// Task 5: Replace the matches
            Dim Replaced As String = MyRegex.Replace(Subject,
                Function(m As Match)
                    If (m.Groups(1).Value = "") Then
                        Return m.Groups(0).Value
                    Else
                        Return "Superman"
                    End If
                End Function)
            Console.WriteLine(vbCrLf & "*** Replacements ***")
            Console.WriteLine(Replaced)

            ' Task 6: Split
            ' Start by replacing by something distinctive,
            ' as in Step 5. Then split.
            Dim Splits As Array = Regex.Split(Replaced, "Superman")
            Console.WriteLine(vbCrLf & "*** Splits ***")
            For Each Split As String In Splits
                Console.WriteLine(Split)
            Next

            Console.WriteLine(vbCrLf & "Press Any Key to Exit.")
            Console.ReadKey()
        End Sub
    End Module