A regular expression is a description of a pattern of characters. The most basic pattern we can describe is an exact string (or sequence) of characters. So, for example, I may want to search for the characters th (or, in more specific terms, for the character t followed directly by the character h).

You may be wondering why the th in there was not picked up as a match. The reason is that it contains a capital T as opposed to lowercase which is what the regular expression was searching for. We know that they are the same character, just in a different form. Regular expressions do not however. Regular expressions do not interpret any meaning from the search pattern. All they do is look for exact matches to specifically what the pattern describes.
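Here is a minimal sketch of this exact matching in Python, using the standard re module. The sample sentence is made up for illustration; only the pattern th comes from the text above.

```python
import re

# Hypothetical sample text; "th" is the pattern discussed above.
text = "The other path"

# findall returns every non-overlapping match of the pattern.
matches = re.findall(r"th", text)

# Record where each match starts. The capital "Th" at index 0 is not matched.
starts = [m.start() for m in re.finditer(r"th", text)]

print(matches)  # ['th', 'th']
print(starts)   # [5, 12]
```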

It is possible to make a regular expression look for matches in a case insensitive way but you'll learn about that later on.

A very basic expression like this is really no different to a search you may do in a search engine or in your favourite word processor. It's not really that exciting. From here on, however, things do start to get more interesting.

Regular expressions can sometimes be a bit hard to get your head around at first so if the material below seems a little confusing don't worry too much. With practice it will start to make more sense. If you find yourself getting stuck, it may be worth revisiting the previous page on Learning Regular Expressions.

The dot - any character

The dot ( . ) (or full stop) character is what we refer to as a metacharacter. Metacharacters are characters which have a special meaning. They help us to create more interesting patterns than just a string of specific characters. Pretty much everything we look at from here on will be metacharacters.

The dot ( . ) represents any character. So with the regular expression below, what we are looking for is the character b followed by any character, followed by the character g.

It is important to note that the . matches only a single character. We may get it to match more than a single character using multipliers which we'll look at further below. Alternatively, you could also use multiple .'s like so:

In the above example we are matching an l, followed by two characters, followed by an e.
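The two ideas above can be tried out with a short Python sketch. The patterns b.g and l..e are written from the descriptions in the text, and the sample strings are made up for illustration.

```python
import re

# b, followed by any single character, followed by g.
single_dot = re.findall(r"b.g", "The big bug bit the bag")

# l, followed by any two characters, followed by e.
double_dot = re.findall(r"l..e", "like love lie lane")

print(single_dot)  # ['big', 'bug', 'bag']
print(double_dot)  # ['like', 'love', 'lane']
```

Note that "bit" and "lie" are skipped: each dot must be filled by exactly one character and the final literal character must still line up.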

Ranges of Characters

The . allows us to match any character. Sometimes we would like to be a bit more specific than that. This is where ranges come in useful. We specify a range of characters by enclosing them within square brackets ( [ ] ).

In the regular expression above we are looking for the character t followed by either the character e or o, followed by the character d.

There is no limit to how many characters you may place inside the square brackets. You could place a single character, eg. [y] (which would be a bit silly but nevertheless it is legal), or you could have many, eg. [grf4s2#lknx].
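As a rough illustration in Python, here is the pattern described above (t, then e or o, then d) written as t[eo]d and run over a made-up sample string.

```python
import re

# t, followed by either e or o, followed by d.
found = re.findall(r"t[eo]d", "ted tod tad tud")

# A single-character set is legal too; [y] behaves just like a plain y.
single = re.findall(r"[y]", "yes and yay")

print(found)   # ['ted', 'tod']
print(single)  # ['y', 'y', 'y']
```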

Shortcut for characters in a row

Let's say we wanted to find the presence of a digit between 1 and 8. We could use a range like so [12345678] but there is a shortcut we may use to make things a bit easier.

In the above regular expression we are searching for the digits 1, 2, 3, 4, 5, 6, 7 or 8.

We can also combine multiple sets. In the regular expression below we are looking for 1, 2, 3, 4, 5, a, b, c, d, e, f, x.
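A small Python sketch of both shortcuts follows. It assumes the shortcut for the digits 1 to 8 is written [1-8] and that the combined set described above is written [1-5a-fx]; the test strings are made up.

```python
import re

# A hyphen between two characters stands for everything from the first to the second.
digits = re.findall(r"[1-8]", "0123456789")

# Several shortcuts and single characters can sit side by side in one set.
combined = re.findall(r"[1-5a-fx]", "0123456789 abcdefgxyz")

print(digits)    # ['1', '2', '3', '4', '5', '6', '7', '8']
print(combined)  # ['1', '2', '3', '4', '5', 'a', 'b', 'c', 'd', 'e', 'f', 'x']
```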

Using sets of characters can sometimes lead to odd behaviour. For example, you may use the range [a-f] and find that it matches D. This has to do with the character ordering the system is using. In plain ASCII ordering all the uppercase letters come first, then the lowercase letters (eg. ABCD...XYZ...abcd...xyz), so [a-f] covers lowercase letters only. Some systems and locale settings, however, alternate the lowercase and uppercase letters (eg. aAbBcCdD...yYzZ), in which case [a-f] also takes in uppercase letters such as D. If you encounter some strange behaviour and you're using ranges, this is the first place to check.

Negating - Find characters that aren't

Sometimes we may want to find the presence of a character which is not one of a given set of characters. We can do this by placing a caret ( ^ ) at the beginning of the range.

The following regular expression searches for the character t followed by a character which is not either e or o, followed by the character d.

Any characters which would normally have a special meaning (metacharacters) lose their special meaning and are treated as literal characters when inside a range. The main exception to this is the caret ( ^ ), which gains a new meaning ('not') when it is the first character in the range.
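The following Python sketch illustrates both points, using the pattern t[^eo]d described above and a made-up sample string. The second pattern, t[.]d, is an extra illustration of a metacharacter losing its special meaning inside the brackets.

```python
import re

text = "ted tod tad t.d"

# t, then any character that is not e or o, then d.
negated = re.findall(r"t[^eo]d", text)

# Inside the brackets the dot is just a dot, so only the literal "t.d" is found.
literal_dot = re.findall(r"t[.]d", text)

print(negated)      # ['tad', 't.d']
print(literal_dot)  # ['t.d']
```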

Multipliers

Multipliers allow us to increase the number of times an item may occur in our regular expression. Here is the basic set of multipliers:

Their effect will be applied to whatever is directly in front of them. It could be a normal character, eg:

In the above example we are looking for the character 'l' followed by the character 'o' zero or more times. That is why the 'l' in silk is also matched (it is an 'l' followed by zero 'o's).

Now this one may seem a bit odd to you at first. The '.*' matches zero or more of any character. It is normal to think that it will come across the first 'k' and then say 'yep, I've found a match', but what it actually does is say 'k is also any character however so let's see how far we can take this' and it keeps going until it finds the final 'k' in the string. This is what's referred to as greedy matching. Its normal behaviour is to try and find the largest string it can which matches the pattern. We may reverse this behaviour and make it not greedy (or lazy) by placing a question mark ( ? ) after the multiplier (which can seem a little confusing as the question mark is a multiplier itself, but you'll get the hang of it).
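To see the difference between greedy and lazy matching, here is a short Python sketch. The patterns lo*, k.*k and k.*?k are written from the descriptions above, and the sample strings are made up for illustration.

```python
import re

# l followed by zero or more o's; the l in "silk" matches with zero o's.
zero_or_more = re.findall(r"lo*", "silk look loooop")

text = "a kite, a kayak, a kiosk"

# Greedy: runs from the first k all the way to the final k in the string.
greedy = re.search(r"k.*k", text).group()

# Lazy: the ? after the multiplier makes it stop at the nearest k instead.
lazy = re.search(r"k.*?k", text).group()

print(zero_or_more)  # ['l', 'loo', 'loooo']
print(greedy)        # kite, a kayak, a kiosk
print(lazy)          # kite, a k
```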

Escaping Metacharacters

Sometimes we may actually want to search for one of the characters which is a metacharacter. To do this we use a feature called escaping. By placing the backslash ( \ ) in front of a metacharacter we can remove its special meaning. (In some instances of regular expressions we may also use escaping to introduce a special meaning to characters which normally don't have a special meaning, but more on that in the intermediate section of this tutorial.) Let's say we wanted to find instances of the word 'this' which are the last word of a sentence. If we did the following:

It would match the 'this.' at the end of the sentence but it also matches 'this ' in the middle of the sentence because the full stop in the regular expression normally matches any character. If we want to make sure it is limited to only 'this.' we may escape the full stop like so:

It is easy to forget to escape metacharacters when they are part of your search string. If you're getting weird behaviour in your regular expressions keep an eye out for any metacharacters you may have forgotten to escape.
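A quick Python sketch of the difference, assuming the unescaped pattern is this. and the escaped one is this\. (the sample sentence is made up):

```python
import re

text = "I like this and I want this."

# Unescaped: the dot matches any character, including the space after "this".
unescaped = re.findall(r"this.", text)

# Escaped: the dot must be a literal full stop.
escaped = re.findall(r"this\.", text)

print(unescaped)  # ['this ', 'this.']
print(escaped)    # ['this.']
```

In Python it is usually worth writing patterns as raw strings (the r prefix) so that the backslash reaches the regular expression engine untouched.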

The Mechanism

Now that you have a reasonable idea what regular expressions are, the next step in taking your regular expression skills to the next level is a good understanding of the underlying mechanism that is used to apply regular expressions over text. When you understand the mechanism, it makes it easier to troubleshoot when things start going wrong.

The way it works is that we have a pointer which is moved progressively through the search string. Once it comes across a character which matches the beginning of the regular expression it stops. Now a second pointer is started which moves forward from the first pointer, character by character, checking with each step if the pattern still holds or if it fails. If we get to the end of the pattern and it still holds then we have found a match. If it fails at any point then the second pointer is discarded and the main pointer continues through the string.

Let's say that we are looking for the letter p followed by any character followed by the letter t. The example below illustrates how the mechanism works :

The reason that the main pointer continues from its location, as opposed to where the second pointer either failed or completed a match, is illustrated in the example above. It is possible that another match may be found within the set of characters we just checked.
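The two-pointer idea can be sketched as a toy matcher in Python. This is only an illustration of the description above, not how any real regular expression engine is implemented: the function find_matches is hypothetical, it understands nothing but literal characters and the dot, and it tries every starting position in turn.

```python
def find_matches(pattern, text):
    """Toy matcher: the pattern may contain literal characters and '.' only."""
    matches = []
    for main in range(len(text)):      # the main pointer walks through the text
        second = main                  # the second pointer starts at the main pointer
        for wanted in pattern:
            if second >= len(text):
                break                  # ran out of text: the pattern fails here
            if wanted != "." and text[second] != wanted:
                break                  # mismatch: discard the second pointer
            second += 1
        else:
            # Reached the end of the pattern without failing: a match.
            matches.append((main, text[main:second]))
    return matches


found = find_matches("p.t", "pit pppt")
print(found)  # [(0, 'pit'), (5, 'ppt')]
```

The attempt starting at position 4 fails (p, p, then p where a t is needed), but because the main pointer only moves on by one, the attempt starting at position 5 still finds ppt.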

Summary

dot (.): Match any character.
[ ]: Match a range of characters contained within the square brackets.
[^ ]: Match a character which is not one of those contained within the square brackets.
*: Match zero or more of the preceding item.
+: Match one or more of the preceding item.
?: Match zero or one of the preceding item.
{n}: Match exactly n of the preceding item.
{n,m}: Match between n and m of the preceding item.
{n,}: Match n or more of the preceding item.
\: Escape, or remove the special meaning of the next character.
String: A sequence of characters.

Regular Expressions Intermediate!

Now it starts to get interesting.

Now that you've got a feel for regular expressions, we'll add a bit more complexity. The features you'll find below have to do with identifying particular types of characters and locations within a string.

Shorthand Character Classes

In the previous section of this tutorial we looked at the range operator ( [] ). That allowed us to specify a set of characters, any of which could be matched. There are some ranges that are used frequently so a set of shortcuts has been created to refer to them. We access these by using the escape character ' \ ' followed by a letter. (In this case the escape character introduces a special meaning rather than taking it away.)

\s - matches anything which is considered whitespace. This could be a space, tab, line break etc.
\S - matches the opposite of \s, that is anything which is not considered whitespace.
\d - matches anything which is considered a digit. ie 0 - 9 (It is effectively a shortcut for [0-9]).
\D - matches the opposite of \d, that is anything which is not considered a digit.
\w - matches anything which is considered a word character. That is [A-Za-z0-9_]. Note the inclusion of the underscore character '_'. This is because in programming and other areas we regularly use the underscore as part of, say, a variable or function name.
\W - matches the opposite of \w, that is anything which is not considered a word character.

The last pair of shortcuts above, where we are dealing with words, starts to get interesting. See the section on word boundaries below for more information on this.

In the above list you'll notice that the same letter but in uppercase always matches the opposite of what the letter in lowercase does.

As with the elements we saw in the previous section, these will match a single character but may have a multiplier placed after them to increase that. For instance, if we wanted to find any dollar amounts which had four digits in them we could create a regular expression as follows:

\$\d{4}

Today I earned $58327 but lost $3826.

You'll notice that in the above example I have escaped the '$' sign. The '$' has a special meaning which you'll learn about below. You'll also notice that we found a match in the first number even though it had more than 4 digits. This can sometimes be a little confusing but you have to remember that regular expressions don't consider the meaning of the content, only if a string of characters match the given pattern.

It can be easy for us as humans to overlook potential matches as we tend to look at things and perceive their meaning. We have to get into the habit of remembering that regular expressions don't do this.
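The dollar amount example can be reproduced with a few lines of Python. The pattern and sentence are the ones shown above; the second search, using \d+, is an extra illustration of a shorthand class combined with a multiplier.

```python
import re

text = "Today I earned $58327 but lost $3826."

# A literal dollar sign followed by exactly four digits.
four_digits = re.findall(r"\$\d{4}", text)

# One or more digits, to show the complete numbers for comparison.
all_digits = re.findall(r"\d+", text)

print(four_digits)  # ['$5832', '$3826']
print(all_digits)   # ['58327', '3826']
```

The first result shows the partial match inside $58327 that is easy for a human reader to overlook.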

Non Printable Characters

As well as our normal characters, there are a few other characters which we don't actually see but which help in formatting our text. These are the:

Tab - represented in regular expressions as \t
Carriage return - represented in regular expressions as \r
Line feed (or newline) - represented in regular expressions as \n

The tab character you should be familiar with (it prints a larger gap than a normal space) but the other two are a bit more interesting.

The concepts of carriage return and line feed came about with mechanical typewriters. The carriage return function moved the cursor from the end of the line to the beginning of the line. The line feed function moved down a line.


Depending on the OS you are using, one or a combination of these can be used to signify a new line.

Windows - uses the sequence \r\n (in that order)
Mac OS (version 9 and below) - uses the sequence \r
Unix/Linux and OSX - uses the sequence \n

Thankfully, with the power of regular expressions, we can create a pattern that will identify a newline regardless of which OS the data has come from. To do this however we need to use some of the features that will be introduced in the advanced section of this tutorial. If you know for certain which characters the data you are searching is using however then you can just use that.
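As a hedged preview, one possible pattern is \r\n|\r|\n, which relies on alternation (the | symbol covered in the advanced section). The Python sketch below uses it to split made-up data containing all three newline styles; it is one way of doing this, not the only one.

```python
import re

# Made-up data mixing Windows, old Mac OS and Unix line endings.
data = "first\r\nsecond\rthird\nfourth"

# \r\n is listed first so a Windows newline is treated as one break, not two.
lines = re.split(r"\r\n|\r|\n", data)

print(lines)  # ['first', 'second', 'third', 'fourth']
```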

\n and \r are present in some regular expression implementations but not in others. If you are getting some weird behaviour it could be that they are not implemented in the particular tool you are using.

Anchors - ^ and $

Building upon the idea of new lines we introduce two particular locations on a line which are the beginning and the end of the line. We can refer to these locations in our regular expressions using the following special characters:

^ (caret) - represents the beginning of the line.
$ (dollar) - represents the end of the line.

It's important to understand exactly what these represent.

In the line above it is usual for people to say that the I in It's and the full stop at the end represent the beginning and end of the line. This is in fact incorrect (with respect to regular expressions). The beginning of the line is actually a zero width character just before the I and the end of the line is another zero width character just after the full stop. By zero width we mean that they are effectively invisible. They are there but we cannot see them.

These two positions on the line are referred to as anchors as they allow us to anchor our pattern to a particular point on the line.

Let's say we want to identify a number but only if it is the very first thing on the line.

^\d+

13 cats escaped from the 5 cages at the vet's clinic.

Using the two together can be a useful way to identify something which is the only thing on the line. Maybe we want to identify any lines which contain only a single word which is either bat or bit or but.

^b[aiu]t$

This line does not match but the next line does.
bat
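Both anchor examples can be tried in Python as sketched below. One Python-specific detail is an assumption worth flagging: by default Python's re module treats ^ and $ as the start and end of the whole string, so the re.MULTILINE flag is used here to make them apply to each line.

```python
import re

line = "13 cats escaped from the 5 cages at the vet's clinic."

# Only the number at the very start of the line is found; the 5 is ignored.
leading_number = re.findall(r"^\d+", line)

text = "This line does not match but the next line does.\nbat"

# With re.MULTILINE, ^ and $ anchor to the start and end of each line.
whole_line = re.findall(r"^b[aiu]t$", text, flags=re.MULTILINE)

print(leading_number)  # ['13']
print(whole_line)      # ['bat']
```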

Word Boundaries

Word boundaries are an example of another zero width character used often within regular expressions. A word boundary is the very beginning or end of a word. They may be identified using the following:

\< - represents the beginning of a word.
\> - represents the end of a word.
\b - represents either the beginning or end of a word.

The first two items listed above aren't available in all regular expression tools but \b generally is so it is the safer one to use.

A word is generally considered to be a string of characters that would be matched by the \w character class (that is, A-Z, a-z, 0-9 and _). Note that this doesn't include punctuation such as the apostrophe ( ' ) as may be seen in the example below.

\bt\w+\b

Now that's the truth and you know it.
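The word boundary example can be checked with the Python sketch below. Python's re module provides \b but does not offer \< and \> as word anchors, so only \b is used here.

```python
import re

text = "Now that's the truth and you know it."

# A word boundary, a t, one or more word characters, then another word boundary.
t_words = re.findall(r"\bt\w+\b", text)

print(t_words)  # ['that', 'the', 'truth']
```

Only that is matched out of that's because the apostrophe is not a word character, so a word boundary falls just before it.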

Summary

\s: Match any character which is considered whitespace (space, tab etc).
\S: Match any character which is not whitespace.
\d: Match any character which is a digit ( 0 - 9 ).
\D: Match any character which is not a digit.
\w: Match any character which is a word character ( A-Z, a-z, 0-9 and _ ).
\W: Match any character which is not a word character.
\t: Match a tab.
\r: Match a carriage return.
\n: Match a line feed (or newline).
^ (caret): An anchor which matches the beginning of the line.
$ (dollar): An anchor which matches the end of the line.
\b: Matches the beginning or end of a word.
\<: Matches the beginning of a word.
\>: Matches the end of a word.

Regular Expressions Advanced!

Now there is no looking back.

Now that you've got a feel for regular expressions, we'll add a bit more complexity. In demonstrating the features on this page we will also be using features introduced in the Basic and Intermediate sections of this tutorial. If some of this stuff seems a bit confusing it may be worth reviewing those sections first. Once you complete this section (and understand it) you won't be a complete Regular Expressions guru but you will be well on your way and you should be armed with enough Regular Expressions ammo to tackle the majority of problems you encounter.

Grouping

We may group several characters together in our regular expression using brackets '( )' (also referred to as parentheses). There are then various things which can be done with that group. Some of these we'll look at further down this page. They also allow us to add a multiplier to that group of characters (as a whole).

So, for instance, we may want to find out if a particular person is mentioned. Their name is John Reginald Smith but the middle name may or may not be present.

John (Reginald )?Smith

John Reginald Smith is sometimes just called John Smith.

Notice where the spaces are and aren't in the regular expression above. It's important to remember that they are part of your regular expression and you need to make sure they are and aren't in the right places.

The above tip is very important and a common source of problems when people first start playing with regular expressions. Below is a common mistake that people make.

John (Reginald)? Smith

The problem with this regular expression is that it will match John Reginald Smith perfectly fine and John  Smith (two spaces between John and Smith) but not John Smith (a single space). Can you see why?
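The difference between the two patterns can be checked with a short Python sketch. The variable names are made up for illustration, and re.fullmatch is used so that each name has to match the pattern in its entirety.

```python
import re

intended = r"John (Reginald )?Smith"   # the space lives inside the optional group
mistaken = r"John (Reginald)? Smith"   # the space after the group is always required

# The last name has two spaces between John and Smith.
names = ["John Reginald Smith", "John Smith", "John  Smith"]

intended_results = [bool(re.fullmatch(intended, name)) for name in names]
mistaken_results = [bool(re.fullmatch(mistaken, name)) for name in names]

print(intended_results)  # [True, True, False]
print(mistaken_results)  # [True, False, True]
```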

We aren't limited to just normal characters in the brackets. You may include special characters in there (including multipliers) as well.

For instance, maybe we would like to find instances of IP addresses. An IP address is a set of 4 numbers (between 0 and 255) separated by full stops (eg. 192.168.0.5).

\b(\d{1,3}\.){3}\d{1,3}\b

The server has an address of 10.18.0.20 and the printer has an address of 10.18.0.116.

Let's break it down as this is starting to get a little complex:

\b indicates a word boundary so we can be sure the IP address is not part of something else.
We have broken the IP address into 3 chunks, each consisting of a number between 0 and 255 and a full stop, plus a final number between 0 and 255.
In the brackets we handle the first 3 chunks so \d{1,3} indicates we are looking for between 1 and 3 digits and we remember to escape the full stop to remove its special meaning. We are looking for exactly 3 of these so we place the multiplier {3} just outside the brackets. (Note that \d{1,3} also accepts values such as 999. The pattern checks the shape of the address, not the range of each number.)
Finally we include the fourth number with \d{1,3} and end with another word boundary.
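The breakdown above can be tried out with the following Python sketch, using the pattern and sample sentence from the text. finditer is used rather than findall because, in Python, findall would return only the contents of the bracketed group rather than the whole address.

```python
import re

ip_pattern = re.compile(r"\b(\d{1,3}\.){3}\d{1,3}\b")

text = ("The server has an address of 10.18.0.20 and "
        "the printer has an address of 10.18.0.116.")

# group(0) is the whole match, regardless of any brackets in the pattern.
addresses = [m.group(0) for m in ip_pattern.finditer(text)]

# The pattern checks the shape only, so an out-of-range address is accepted.
accepts_out_of_range = ip_pattern.search("999.999.999.999") is not None

print(addresses)             # ['10.18.0.20', '10.18.0.116']
print(accepts_out_of_range)  # True
```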

The above expression uses elements that have been covered in the previous sections of this tutorial. Be sure to review these sections if need be.

As you can see, regular expressions can soon get hard to read once you get various brackets and backslashes in there. This makes it easy to make silly mistakes by missing or misplacing one of these characters and the mistakes can be hard to spot. Remember the strategies for handling this.

Back references

Whenever we match something within brackets, that value is actually stored in a variable which we may refer to later on in the regular expression. To access these variables we use the escape character ( \ ) followed by a digit. The first set of brackets is referred to with \1, the second set of brackets with \2 and so on.

Let's say we want to find lines with two mentions of a person whose last name is Smith. We don't know what their first name may be, however. We could do the following:

(\b[A-Z]\w+\b) Smith.*\1 Smith

Harold Smith went to meet John Smith but John Smith was not there.

In the above example you'll notice that we matched the text between the two instances of John Smith as well but in this case that is ok as we are not too concerned in what was matched, only that there was a match.
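Here is a small Python sketch of the back reference example, using the pattern and sentence from above. group(1) is what Python calls the value stored by the first set of brackets.

```python
import re

pattern = r"(\b[A-Z]\w+\b) Smith.*\1 Smith"
line = "Harold Smith went to meet John Smith but John Smith was not there."

match = re.search(pattern, line)

whole_match = match.group(0)  # everything the pattern matched
first_name = match.group(1)   # what the first set of brackets captured

print(whole_match)  # John Smith but John Smith
print(first_name)   # John
```

Harold is tried first but rejected, because no second Harold Smith appears later in the line for \1 to match.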

Alternation

With alternation we are looking for something or something else. We have seen a very basic example of alternation with the range operator. This allows us to perform alternation with a single character, but sometimes we would like to perform the operation with a larger set of characters. We can achieve this with the pipe symbol ( | ) which means or.

So for instance, if we wanted to find all instances of either 'dog' or 'cat' we could do the following:

dog|cat

Harold Smith has two dogs and one cat.

We can also use more than one | to include more options.

dog|cat|bird

Harold Smith has two dogs, one cat and three birds.

Maybe we only want alternation to happen on a part of the regular expression instead of the whole regular expression. To achieve this we use brackets.

Maybe we want to match Harold Smith or John Smith but not any other Smith.

(John|Harold) Smith

Harold Smith went to meet John Smith but instead bumped into Jane Smith.
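The three alternation examples can be run in Python as sketched below, using the patterns and sentences from the text. For the bracketed pattern, finditer with group(0) is used so that the whole match is returned rather than just the bracketed part.

```python
import re

two_options = re.findall(r"dog|cat", "Harold Smith has two dogs and one cat.")

three_options = re.findall(
    r"dog|cat|bird", "Harold Smith has two dogs, one cat and three birds."
)

text = "Harold Smith went to meet John Smith but instead bumped into Jane Smith."
# The brackets limit the alternation to the first name only.
smiths = [m.group(0) for m in re.finditer(r"(John|Harold) Smith", text)]

print(two_options)    # ['dog', 'cat']
print(three_options)  # ['dog', 'cat', 'bird']
print(smiths)         # ['Harold Smith', 'John Smith']
```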

Lookahead and Lookbehind

Lookaheads and Lookbehinds are the final thing we are going to introduce in this tutorial and they can be one of the trickiest things you will encounter in regular expressions. Both of them operate in one of two modes:

Positive - in which we are seeking to find something which matches.
Negative - in which we are seeking to find something which doesn't match.

The main idea of both the lookahead and lookbehind is to see if something matches (or doesn't) and then to throw away what was actually matched.

Lookaheads

With a lookahead we want to look ahead (hence the name) in our string and see if it matches the given pattern, but then disregard it and move on. The concept is best illustrated with an example.

Let's say we wish to identify numbers greater than 4000 but less than 5000. This is a problem which seems simple but is in fact a little trickier than you suspect. A common first attempt is to try the following:

\b4\d\d\d\b

This looks promising with 4021 but unfortunately also matches 4000.

Then you realise that the way we can tackle this is to say we are looking for a '4' followed by 3 digits and at least one of those digits is not a '0'. For us as humans that seems like a simple thing to look for but with what we have learnt so far in regular expressions, it is not so easy. We could try something like:

\b4([1-9]\d\d|\d[1-9]\d|\d\d[1-9])\b

Now we will match 4010 but not 4000.

That is, use alternation to check three different scenarios, each with a different one of the three digits not being '0'.

I reckon you're probably looking at the above and thinking that's a lot of regular expression to match just 4 characters. Worse still, think about how that would increase if instead of between 4000 and 5000 we wanted between 40000 and 50000. It soon becomes clear that the above regular expression works but it is not elegant and it doesn't scale.

It turns out that a negative lookahead can solve problems like this quite well. A negative lookahead is set up as follows:

(?!x)

Our negative lookahead is contained within brackets and the first two characters inside the brackets are ?!. Replace x with what it is you don't want to match.

Now we can set up our regular expression as follows:

\b4(?!000)\d\d\d\b

Now we still match 4010 but not 4000.

That might seem a little confusing so let's break it down

First we look for the character '4'.
When we find a '4' the negative lookahead returns true if the next 3 characters are not '000'.
If this returns true we go back to just after the '4' and continue with our regular expression.

In plain English we could say: "We are looking for a '4' which is not followed by 3 '0's but is followed by 3 digits".
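The effect of the negative lookahead can be seen in the Python sketch below, which runs the first attempt and the lookahead version from above over a made-up list of numbers.

```python
import re

numbers = "3999 4000 4001 4010 4999 5000"

# First attempt: any four digit number starting with 4, including 4000.
first_attempt = re.findall(r"\b4\d\d\d\b", numbers)

# With the negative lookahead, a 4 followed by 000 is rejected.
with_lookahead = re.findall(r"\b4(?!000)\d\d\d\b", numbers)

print(first_attempt)   # ['4000', '4001', '4010', '4999']
print(with_lookahead)  # ['4001', '4010', '4999']
```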

A positive lookahead works in the same way but the characters inside the lookahead have to match rather than not match. The syntax for a positive lookahead is as follows:

(?=x)

All we need to do is replace the '!' with an '='.
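As a small illustration of a positive lookahead, the Python sketch below finds words which are directly followed by ' Smith'. The pattern and the sentence are made up for this example and are not from the tutorial.

```python
import re

text = "Harold Smith went to meet John Smith but instead bumped into Jane Doe."

# A word, but only if it is followed by a space and Smith.
# The lookahead checks for " Smith" without making it part of the match.
first_names = re.findall(r"\b\w+(?= Smith)", text)

print(first_names)  # ['Harold', 'John']
```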

Lookbehinds

Lookbehinds work similarly to lookaheads but instead of looking forwards then throwing it away, we look backwards and then throw it away. Similar to lookaheads, they are available in both positive and negative. They follow a similar syntax but include a '<' after the '?' (Think of it as an arrow pointing backwards).

(?<=x) and (?<!x)

These are the syntax for a positive lookbehind and a negative lookbehind respectively.

Let's say we would like to find instances of the name 'Smith' but only if they are a surname. To achieve this we have said that we want to look at the word before it and if that word begins with a capital letter we'll assume it is a surname (the more astute of you will have already seen the flaw in this, ie what if Smith is the second word in a sentence, but we'll ignore that for now.)

(?<=[A-Z]\w* )Smith

Now we won't identify Smith Francis but we will identify Harold Smith.

Lookaheads and lookbehinds can be a bit tough to get your head around at first. I would suggest you experiment with a few different searches yourself to get the hang of it.

Applications and programming languages differ in how they implement lookaheads and lookbehinds. Some will allow you to use other regular expression features within a lookahead and lookbehind, some will not. Some will allow some features but not all of them. If you are getting unexpected behaviour you may need to find out which features are and aren't implemented for your particular application or programming language.
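Python's standard re module is one example of such a difference: it only accepts lookbehinds of a fixed width, so the pattern (?<=[A-Z]\w* )Smith shown above is rejected there because \w* can vary in length. The sketch below demonstrates this and shows one possible workaround, which is to match the preceding word normally and place brackets around the part we care about. The sample sentence is made up.

```python
import re

text = "Smith Francis met Harold Smith."

# Python's re module refuses a lookbehind whose length can vary.
try:
    re.compile(r"(?<=[A-Z]\w* )Smith")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Workaround: match the capitalised word and the space, capture only Smith.
hits = list(re.finditer(r"\b[A-Z]\w* (Smith)", text))
surname_starts = [m.start(1) for m in hits]

print(variable_width_ok)  # False
print(surname_starts)     # [25]
```

Only the Smith that follows Harold is reported; the Smith at the start of the sentence has no capitalised word before it.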

Where to from here?

You've now learnt enough about regular expressions to get you through the majority of problems you will probably face. You've really only been introduced to the building blocks though. Learning how to put the building blocks together into effective patterns is something which will take time and practice. Don't worry if some of this stuff is still a little confusing at this point in time. With practice it will all become clearer and you will become very powerful in terms of the things you can achieve.

Summary

( ): Group part of the regular expression.
\1 \2 etc: Refer to something matched by a previous grouping.
|: Match what is on either the left or right of the pipe symbol.
(?=x): Positive lookahead.
(?!x): Negative lookahead.
(?<=x): Positive lookbehind.
(?<!x): Negative lookbehind.

Regular Expressions Examples!

Regular expressions can be a bit of an abstract concept to get your head around. Let's take a look at some real world examples to give you a better idea of how they are actually used and what types of problems you can solve with them. Remember, the examples below are just a taste of what you can do with regular expressions. They have many uses and you are really only limited by your imagination and creativity.


Hovers on this page.

A small amount of Javascript is used to create the functionality on this page where you can hover over descriptions and examples and see which parts of the regular expression they refer to. If you want to see the code you can right-click and 'view source'. I set IDs on the relevant sections, then the Javascript uses a regular expression to extract the relevant data and from that identifies the corresponding content to highlight. So, for instance, I have an item identified as eg1trigger3. When you hover over this, a regular expression is used to extract the example number (1) and trigger number (3) as follows:

eg(\d+)trigger(\d+)

Broken down that is:

The beginning of the intended identifier - eg.
The example number. By placing it within brackets we can use it in the Javascript later on.
The trigger section of the identifier.
The trigger number. By placing it within brackets (similar to the example number) we may use it in the Javascript later on.
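The page itself uses Javascript, but the same extraction can be sketched in Python for consistency with a standard regular expression library. The variable names below are made up for illustration.

```python
import re

identifier = "eg1trigger3"

match = re.fullmatch(r"eg(\d+)trigger(\d+)", identifier)

# groups() returns what each set of brackets captured, in order, as strings.
example_number, trigger_number = match.groups()

print(example_number)  # 1
print(trigger_number)  # 3
```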

Credit card numbers

Form validation is an area where regular expressions are really useful, and this example and the next few apply there.

So we know that a credit card number is typically 16 digits, usually divided into 4 groups of 4 digits. Don't you hate it when you are given just a single field and not told whether it accepts the number as one big number or four 4 digit groups? What if it could elegantly handle both? While we're at it, let's make it also allow for the separator to be a space, dash or comma.

\d{4}[-, ]?\d{4}[-, ]?\d{4}[-, ]?\d{4}

Broken down that is:

4 digits, followed by
either a dash, comma or space, zero or one times, followed by
4 digits, followed by
either a dash, comma or space, zero or one times, followed by
4 digits, followed by
either a dash, comma or space, zero or one times, followed by
4 digits

If you want different separators it's a simple matter of changing what is inside the range operators. It's also easy to adapt this pattern to other digit based items such as membership numbers or IP addresses etc.
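A rough Python sketch of this kind of form validation follows. It assumes all three separators are optional, as the breakdown describes, and uses re.fullmatch so that the whole field has to fit the pattern. The card numbers are made up and are not real.

```python
import re

card_pattern = re.compile(r"\d{4}[-, ]?\d{4}[-, ]?\d{4}[-, ]?\d{4}")

# Made-up inputs covering each separator style, plus one that is too short.
samples = [
    "1234567812345678",
    "1234 5678 1234 5678",
    "1234-5678-1234-5678",
    "1234,5678,1234,5678",
    "1234 5678 1234",
]

results = [bool(card_pattern.fullmatch(s)) for s in samples]

print(results)  # [True, True, True, True, False]
```

This only checks the layout of the digits. It says nothing about whether the number belongs to a real card.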

It's possible to expand on this to accommodate different brands of credit cards. This is because each brand has unique characteristics that we can identify.

Email addresses

Email addresses are an interesting case. There are a variety of approaches depending on how pedantic you want to be. What constitutes valid syntax can be quite complex. Here is a less pedantic approach.

[a-zA-Z0-9.+_-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,63}

Broken down that is:

The recipient name, followed by
The @ symbol, followed by
The main part of the domain, followed by
The top level domain part of the domain

You may be wondering why the top level domain section is between 2 and 63 characters and not something like 2 and 4 characters. Things used to be simple with nearly all domains being something like .com or .net, with maybe a .au to signify country. Now things have gotten silly though and domains such as .sandvikcoromant exist (with more likely to follow). The official specification allows for up to 63 characters for the top level domain so it is probably safest to check for that.
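Here is a hedged Python sketch of this less pedantic check. The hyphen is placed last in the first set of square brackets so that it is read as a literal hyphen rather than as the middle of a range, and the sample addresses are made up.

```python
import re

email_pattern = re.compile(r"[a-zA-Z0-9.+_-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,63}")

# Made-up inputs: two plausible addresses and two that should be rejected.
samples = [
    "jane.doe+news@example.com",
    "user@mail.example.co.uk",
    "not-an-email",
    "user@example",
]

results = [bool(email_pattern.fullmatch(s)) for s in samples]

print(results)  # [True, True, False, False]
```

As the text says, this is deliberately loose. It will accept some strings that are not valid addresses and is only meant as a first sanity check.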

HTML Mangling

Things like HTML, where our content follows a certain syntax, are great for utilising regular expressions when we want to identify certain elements. I regularly use regular expressions with a search and replace to update certain parts of a page. Let's say for instance that we wish to identify img tags which don't contain an alt attribute and which are contained within an opening and closing tag of the same type, on a single line. (If you want to learn a bit more about HTML you can check out our HTML tutorial.)

<([a-zA-Z][a-zA-Z0-9]*).*>.*<[iI][mM][gG] (?![^>]*alt=).*>.*</\1>

Broken down that is:

The name of the opening tag. We place this in brackets so we can match it at the end as well.
Other potential stuff within the opening tag, attributes etc.
Content which is not an img tag.
An img tag. Using range operators to identify it in either lowercase or uppercase or a mixture.
Here we use a negative lookahead to make sure that alt= is not somewhere after the img tag before the closing >.
Potential content after the img tag and before the closing tag.
The closing tag. \1 becomes the same as what was matched in the brackets around the opening tag name.

Let's look at an example to illustrate it:

<p class='important'>This is a picture of my dog <img src='./mydog.jpg'>. He is awesome.</p>
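To try this out, here is a Python sketch. It writes the tag name part as [a-zA-Z][a-zA-Z0-9]* so that names longer than two characters (such as div) are captured whole. The first line is the example above; the other two lines are made up for comparison. Regular expressions are a fairly blunt tool for HTML, so treat this as a quick search aid rather than a parser.

```python
import re

pattern = re.compile(
    r"<([a-zA-Z][a-zA-Z0-9]*).*>.*<[iI][mM][gG] (?![^>]*alt=).*>.*</\1>"
)

no_alt = ("<p class='important'>This is a picture of my dog "
          "<img src='./mydog.jpg'>. He is awesome.</p>")
with_alt = ("<p class='important'>This is a picture of my dog "
            "<img src='./mydog.jpg' alt='My dog'>. He is awesome.</p>")
in_div = "<div><img src='a.png'></div>"

# True means the line contains an img tag with no alt attribute.
missing_alt = [bool(pattern.search(s)) for s in (no_alt, with_alt, in_div)]

# The bracketed tag name is available as group 1.
outer_tag = pattern.search(in_div).group(1)

print(missing_alt)  # [True, False, True]
print(outer_tag)    # div
```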
