https://ryanstutorials.net/regular-expressions-tutorial/regular-expressions-basics.php
A regular expression is a description of a pattern of characters. The most basic pattern we can describe is an exact string (or sequence) of characters. So, for example, I may want to search for the characters th (or, in more specific terms, for the character t followed directly by the character h).
You may be wondering why the th in there was not picked up as a match. The reason is that it contains a capital T as opposed to lowercase which is what the regular expression was searching for. We know that they are the same character, just in a different form. Regular expressions do not however. Regular expressions do not interpret any meaning from the search pattern. All they do is look for exact matches to specifically what the pattern describes.
It is possible to make a regular expression look for matches in a case insensitive way but you'll learn about that later on.
A very basic expression like this is really no different to a search you may do in a search engine or in your favourite word processor. It's not really that exciting. From here on, however, things do start to get more interesting.
Regular expressions can sometimes be a bit hard to get your head around at first so if the material below seems a little confusing don't worry too much. With practice it will start to make more sense. If you find yourself getting stuck, it may be worth revising our bit on the previous page on Learning Regular Expressions.
The dot ( . ) (or full stop) character is what we refer to as a metacharacter. Metacharacters are characters which have a special meaning. They help us to create more interesting patterns than just a string of specific characters. Pretty much everything we look at from here on will be metacharacters.
The dot ( . ) represents any single character (in most tools, any character except a newline). So with the regular expression b.g, what we are looking for is the character b followed by any character, followed by the character g.
It is important to note that the . matches only a single character. We may get it to match more than a single character using multipliers which we'll look at further below. Alternatively, you could also use multiple .'s like so: l..e
In the above example we are matching an l, followed by any two characters, followed by an e.
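As a rough illustration, here is how the two dot patterns described above behave in Python's re module. The sample text and variable names are made up for this sketch and are not part of the tutorial.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "The big bug begged for a bag; a little line of lace"

# b.g : a 'b', then any single character, then a 'g'
one_dot = re.findall(r"b.g", text)

# l..e : an 'l', then any two characters, then an 'e'
two_dots = re.findall(r"l..e", text)

print(one_dot)   # ['big', 'bug', 'beg', 'bag']
print(two_dots)  # ['line', 'lace']
```

Notice that 'beg' is found inside 'begged'. The pattern has no idea about words, it only cares that the three characters fit.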
The . allows us to match any character. Sometimes we would like to be a bit more specific than that. This is where ranges come in useful. We specify a range of characters by enclosing them within square brackets ( [ ] ).
In the regular expression t[eo]d we are looking for the character t followed by either the character e or o, followed by the character d.
There is no limit to how many characters you may place inside the square brackets. You could place a single character, eg. [y] (which would be a bit silly but nevertheless it is legal), or you could have many, eg. [grf4s2#lknx].
Let's say we wanted to find the presence of a digit between 1 and 8. We could use a range like so [12345678] but there is a shortcut we may use to make things a bit easier. A hyphen between two characters stands for every character from the first to the last, so [1-8] means the same thing.
You can combine a set of characters along with other characters.
In the regular expression [1-49] we are searching for the digits 1, 2, 3, 4 or 9.
We can also combine multiple sets. In the regular expression [1-5a-fx] we are looking for 1, 2, 3, 4, 5, a, b, c, d, e, f, x.
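The set and range patterns described above can be tried out with a short Python sketch. The sample strings are invented for the illustration.

```python
import re

# Hypothetical sample strings, made up for this illustration.
words = "tad ted tid tod tud"
codes = "a1 b4 c5 d9 x0"

# t[eo]d : 't', then either 'e' or 'o', then 'd'
set_match = re.findall(r"t[eo]d", words)

# [1-8] : any single digit from 1 to 8
one_to_eight = re.findall(r"[1-8]", codes)

# [1-49] : the range 1-4 plus the single character 9 (not "1 to 49")
range_plus_char = re.findall(r"[1-49]", codes)

# [1-5a-fx] : two ranges and one extra character in the same set
multi = re.findall(r"[1-5a-fx]", codes)

print(set_match)        # ['ted', 'tod']
print(one_to_eight)     # ['1', '4', '5']
print(range_plus_char)  # ['1', '4', '9']
print(multi)            # ['a', '1', 'b', '4', 'c', '5', 'd', 'x']
```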
Using ranges can sometimes lead to odd behaviour. For example, you may use the range [a-f] and find that it matches D. This has to do with the character ordering (collation) the system is using. In plain ASCII ordering all the uppercase letters come first, then the lowercase letters, eg. ABCD...XYZ...abcd...xyz, so [a-f] only covers lowercase letters. Some systems and locales however, interleave the lowercase and uppercase letters, eg. aAbBcCdD...yYzZ, in which case [a-f] can also pick up uppercase letters. If you encounter some strange behaviour and you're using ranges, this is the first place to check.
Sometimes we may want to find the presence of a character which is not in a given set of characters. We can do this by placing a caret ( ^ ) at the beginning of the range.
The regular expression t[^eo]d searches for the character t followed by a character which is neither e nor o, followed by the character d.
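Here is a small Python sketch of a negated range, using an invented sample string. It also shows that a negated range still has to match a character, it does not mean "nothing here".

```python
import re

# Hypothetical sample string, made up for this illustration.
words = "tad ted tid tod tud"

# t[^eo]d : 't', then one character that is neither 'e' nor 'o', then 'd'
negated = re.findall(r"t[^eo]d", words)

# There must still be a character between the 't' and the 'd'.
no_middle = re.findall(r"t[^eo]d", "td")

print(negated)    # ['tad', 'tid', 'tud']
print(no_middle)  # []
```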
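Most characters which would normally have a special meaning (metacharacters) lose their special meaning and become literally their character when inside a range. The main exceptions are the caret ( ^ ), which gains the new meaning not when it is the first character, the hyphen ( - ), which marks a range when placed between two characters, and the closing square bracket and backslash, which usually need to be escaped.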
Multipliers allow us to increase the number of times an item may occur in our regular expression. Here is the basic set of multipliers: * (the item occurs zero or more times), + (one or more times), ? (zero or one times), {5} (exactly 5 times), {3,6} (between 3 and 6 times) and {3,} (3 or more times).
Their effect will be applied to whatever is directly in front of them. It could be a normal character, eg: lo*
In the above example we are looking for the character 'l' followed by the character 'o' zero or more times. That is why the 'l' in silk is also matched (it is an 'l' followed by zero 'o's).
Or it could be a metacharacter, eg:
Now this one may seem a bit odd to you at first. The '.*' matches zero or more of any character. It is normal to think that it will come across the first 'k' and then say 'yep, I've found a match', but what it actually does is say 'k is also any character however so let's see how far we can take this' and it keeps going until it finds the final 'k' in the string. This is what's referred to as greedy matching. Its normal behaviour is to try and find the largest string it can which matches the pattern. We may reverse this behaviour and make it not greedy, or lazy, by placing a question mark ( ? ) after the multiplier (which can seem a little confusing as the question mark is a multiplier itself but you'll get the hang of it).
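The sketch below tries out a couple of multipliers and then compares greedy and lazy matching in Python. The pattern k.*k and both sample strings are assumptions made for the illustration, chosen to fit the description above.

```python
import re

# Hypothetical sample strings, made up for this illustration.
words = "lot look silk"
sentence = "pick a kite, kick back"

# lo* : 'l' followed by zero or more 'o' characters
zero_or_more = re.findall(r"lo*", words)

# lo+ : 'l' followed by one or more 'o' characters
one_or_more = re.findall(r"lo+", words)

# Greedy: .* grabs as much as it can, so we run to the final 'k'
greedy = re.search(r"k.*k", sentence).group(0)

# Lazy: .*? grabs as little as it can, so each match stops at the next 'k'
lazy = re.findall(r"k.*?k", sentence)

print(zero_or_more)  # ['lo', 'loo', 'l']
print(one_or_more)   # ['lo', 'loo']
print(greedy)        # k a kite, kick back
print(lazy)          # ['k a k', 'kick']
```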
Sometimes we may actually want to search for one of the characters which is a metacharacter. To do this we use a feature called escaping. By placing the backslash ( \ ) in front of a metacharacter we can remove its special meaning. (In some instances of regular expressions we may also use escaping to introduce a special meaning to characters which normally don't have a special meaning but more on that in the intermediate section of this tutorial). Let's say we wanted to find instances of the word 'this' which are the last word of a sentence. If we did the following:
It would match the 'this.' at the end of the sentence but it also matches 'this ' in the middle of the sentence because the full stop in the regular expression normally matches any character. If we want to make sure it is limited to only 'this.' we may escape the full stop like so:
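The difference between the unescaped and escaped full stop can be seen in this small Python sketch. The sample sentence is invented. Python also offers re.escape, which adds the backslashes for you when the search string comes from somewhere else.

```python
import re

# Hypothetical sample sentence, made up for this illustration.
text = "Is this it? Yes, it is this."

# Unescaped: the full stop matches any character, including the space
unescaped = re.findall(r"this.", text)

# Escaped: the full stop only matches a literal full stop
escaped = re.findall(r"this\.", text)

# re.escape builds the escaped pattern from a plain string
auto_escaped = re.escape("this.")

print(unescaped)     # ['this ', 'this.']
print(escaped)       # ['this.']
print(auto_escaped)  # this\.
```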
It is easy to forget to escape metacharacters when they are part of your search string. If you're getting weird behaviour in your regular expressions keep an eye out for any metacharacters you may have forgotten to escape.
Now that you have a reasonable idea what regular expressions are, the next step in taking your regular expression skills to the next level is a good understanding of the underlying mechanism that is used to apply regular expressions over text. When you understand the mechanism, it makes it easier to troubleshoot when things start going wrong.
The way it works is that we have a pointer which is moved progressively through the search string. Once it comes across a character which matches the beginning of the regular expression it stops. Now a second pointer is started which moves forward from the first pointer, character by character, checking with each step if the pattern still holds or if it fails. If we get to the end of the pattern and it still holds then we have found a match. If it fails at any point then the second pointer is discarded and the main pointer continues through the string.
Let's say that we are looking for the letter p followed by any character followed by the letter t. The example below illustrates how the mechanism works:
The reason that the main pointer continues from its location, as opposed to where the second pointer either failed or completed a match, is illustrated in the example above. It is possible that another match may be found within the set of characters we just checked.
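To make the two pointer idea concrete, here is a deliberately tiny matcher written in Python. It is only a sketch of the mechanism as described above and only understands normal characters and the dot, so it is nothing like a real regular expression engine. The function name find_all and the sample strings are made up for this illustration.

```python
import re

def find_all(pattern, text):
    """Return (start, matched_text) pairs for a pattern of literals and dots."""
    matches = []
    main = 0  # the main pointer, moved progressively through the text
    while main + len(pattern) <= len(text):
        second = main  # the second pointer starts out from the main pointer
        for p in pattern:
            if p != "." and p != text[second]:
                break  # the pattern failed, so the second pointer is discarded
            second += 1
        else:
            # we reached the end of the pattern and it still holds
            matches.append((main, text[main:second]))
        main += 1  # carry on from the main pointer, not from the second one
    return matches

# The second pointer fails at index 2, but a match still begins at index 1.
late_start = find_all("p.t", "pppt")

# Two matches that overlap each other.
overlapping = find_all("p.t", "pptt")

# Python's own findall resumes after the end of each match instead.
builtin = re.findall(r"p.t", "pptt")

print(late_start)   # [(1, 'ppt')]
print(overlapping)  # [(0, 'ppt'), (1, 'ptt')]
print(builtin)      # ['ppt']
```

Be aware that many real tools, Python's re.findall included, continue searching from the end of a successful match, so they report non-overlapping matches only. After a failed attempt they do move the main pointer on by just one character, which is what lets the first example above succeed.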
Now it starts to get interesting.
Now that you've got a feel for regular expressions, we'll add a bit more complexity. The features you'll find below have to do with identifying particular types of characters and locations within a string.
In the previous section of this tutorial we looked at the range operator ( [] ). That allowed us to specify a set of characters, any of which could be matched. There are some ranges that are used frequently so a set of shortcuts has been created to refer to them. We access these by using the escape character ' \ ' followed by a letter. (In this case the escape character introduces a special meaning rather than taking it away.) The common ones are \s (any whitespace character), \S (anything that is not whitespace), \d (any digit), \D (anything that is not a digit), \w (any word character, that is letters, digits and the underscore) and \W (anything that is not a word character).
The last pair of shortcuts above, where we are dealing with words, starts to get interesting. See the section on word boundaries below for more information on this.
In the above list you'll notice that the same letter but in uppercase always matches the opposite of what the letter in lowercase does.
As with the elements we saw in the previous section, these will match a single character but may have a multiplier placed after them to increase that. For instance, if we wanted to find any dollar amounts which had four digits in them we could create a regular expression as follows:
You'll notice that in the above example I have escaped the '$' sign. The '$' has a special meaning which you'll learn about below. You'll also notice that we found a match in the first number even though it had more than 4 digits. This can sometimes be a little confusing but you have to remember that regular expressions don't consider the meaning of the content, only if a string of characters match the given pattern.
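Here is a quick Python sketch of the dollar amount search. The pattern \$\d{4} is my reading of the description above and the sample text is invented.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "Prices: $12345, $4999 and $250"

# \$ is a literal dollar sign, \d{4} is exactly four digits
amounts = re.findall(r"\$\d{4}", text)

print(amounts)  # ['$1234', '$4999']
```

The first amount has five digits but '$1234' still fits the pattern, so it is reported as a match. The '$250' has too few digits and is skipped.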
It can be easy for us as humans to overlook potential matches as we tend to look at things and perceive their meaning. We have to get into the habit of remembering that regular expressions don't do this.
As well as our normal characters, there are a few other characters which we don't actually see but which help in formatting our text. These are the tab ( \t ), the carriage return ( \r ) and the line feed, or newline ( \n ).
The tab character you should be familiar with (it prints a larger gap than a normal space) but the other two are a bit more interesting.
The concepts of carriage return and line feed came about with mechanical typewriters. The carriage return function moved the cursor from the end of the line to the beginning of the line. The line feed function moved down a line.
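Depending on the OS you are using, one or a combination of these can be used to signify a new line: Linux and other Unix-like systems (including modern macOS) use a line feed ( \n ), Windows uses a carriage return followed by a line feed ( \r\n ), and classic Mac OS (before OS X) used a carriage return on its own ( \r ).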
Thankfully, with the power of regular expressions, we can create a pattern that will identify a newline regardless of which OS the data has come from. To do this however we need to use some of the features that will be introduced in the advanced section of this tutorial. If you know for certain which characters the data you are searching is using however then you can just use that.
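As a taste of what such a pattern can look like, here is one possible version in Python. It relies on alternation (the | symbol, meaning or), which is covered in the advanced section, and it is only a sketch rather than the one true way of doing it. The constant name NEWLINE and the sample data are made up.

```python
import re

# Try the two character Windows ending first, then each single character.
NEWLINE = r"\r\n|\r|\n"

# Hypothetical data with three different styles of line ending mixed together.
mixed = "one\r\ntwo\nthree\rfour"

lines = re.split(NEWLINE, mixed)

print(lines)  # ['one', 'two', 'three', 'four']
```

The order matters here. If \r came before \r\n in the pattern, a Windows line ending would be treated as two separate new lines.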
\n and \r are present in some regular expression implementations but not in others. If you are getting some weird behaviour it could be that they are not implemented in the particular tool you are using.
Building upon the idea of new lines we introduce two particular locations on a line which are the beginning and the end of the line. We can refer to these locations in our regular expressions using the following special characters: the caret ( ^ ) for the beginning of the line and the dollar sign ( $ ) for the end of the line.
It's important to understand exactly what these represent.
In the line above it is usual for people to say that the I in It's and the full stop at the end represent the beginning and end of the line. This is in fact incorrect (with respect to regular expressions). The beginning of the line is actually a zero width character just before the I and the end of the line is another zero width character just after the full stop. By zero width we mean that they are effectively invisible. They are there but we may not see them.
These two positions on the line are referred to as anchors as they allow us to anchor our pattern to a particular point on the line.
Let's say we want to identify a number but only if it is the very first thing on the line.
Using the two together can be a useful way to identify something which is the only thing on the line. Maybe we want to identify any lines which contain only a single word which is either bat or bit or but.
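The two anchor examples above might look like this in Python. The patterns ^\d+ and ^b[aiu]t$ are my reading of the descriptions and the sample lines are invented. One Python specific detail to be aware of: by default ^ and $ only anchor to the start and end of the whole string, so the re.MULTILINE flag is needed to make them work line by line.

```python
import re

# Hypothetical multi-line sample text, made up for this illustration.
text = "42 is first\nnot 42 here\nbat\nbit by bit\nbut"

# ^\d+ : one or more digits, but only at the very start of a line
leading_numbers = re.findall(r"^\d+", text, re.MULTILINE)

# ^b[aiu]t$ : a line containing only bat, bit or but
whole_lines = re.findall(r"^b[aiu]t$", text, re.MULTILINE)

# Without the flag, ^ and $ refer to the start and end of the whole string.
without_flag = re.findall(r"^b[aiu]t$", text)

print(leading_numbers)  # ['42']
print(whole_lines)      # ['bat', 'but']
print(without_flag)     # []
```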
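Word boundaries are an example of another zero width character used often within regular expressions. A word boundary is the very beginning or end of a word. They may be identified using the following: \< (the beginning of a word), \> (the end of a word) and \b (either the beginning or the end of a word).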
The first two items listed above aren't available in all regular expression tools but \b generally is so it is the safer one to use.
A word is generally considered to be a string of characters that would be matched by the \w character class (that is, A-Z, a-z, 0-9 and _). Note that this doesn't include punctuation such as the apostrophe ( ' ) as may be seen in the example below.
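Here is a short Python sketch of word boundaries at work, with made up sample text. Python's re module is one of the tools that supports \b but not \< or \>, so only \b is used.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "cat concatenate cat's bobcat"

# \bcat\b : 'cat' with a word boundary on both sides
starts = [m.start() for m in re.finditer(r"\bcat\b", text)]

# The apostrophe is not a word character, so it splits "it's" in two.
words = re.findall(r"\w+", "it's here")

print(starts)  # [0, 16]
print(words)   # ['it', 's', 'here']
```

The match at position 16 is the cat in cat's. The apostrophe counts as the end of the word, whereas the cat inside concatenate and bobcat has word characters next to it and is skipped.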
Now there is looking back.
Now that you've got a feel for regular expressions, we'll add a bit more complexity. In demonstrating the features on this page we will also be using features introduced in the Basic and Intermediate sections of this tutorial. If some of this stuff seems a bit confusing it may be worth reviewing those sections first. Once you complete this section (and understand it) you won't be a complete Regular Expressions guru but you will be well on your way and you should be armed with enough Regular Expressions ammo to tackle the majority of problems you encounter.
We may group several characters together in our regular expression using brackets '( )' (also referred to as parentheses). There are then various things which can be done with that group. Some of these we'll look at further down this page. They also allow us to add a multiplier to that group of characters (as a whole).
So, for instance, we may want to find out if a particular person is mentioned. Their name is John Reginald Smith but the middle name may or may not be present.
Notice where the spaces are and aren't in the regular expression above. It's important to remember that spaces are part of your regular expression, so you need to make sure they appear exactly where they should and nowhere else.
The above tip is very important and a common source of problems when people first start playing with regular expressions. Below is a common mistake that people make.
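The following Python sketch shows the optional middle name and one way the spaces can go wrong. Both patterns are assumptions based on the description above, and the second one is my guess at the kind of mistake being referred to.

```python
import re

# The space lives inside the brackets, so it is optional along with the name.
good = r"John (Reginald )?Smith"

# A guess at the common mistake: the space is left outside the brackets,
# so the pattern now demands two spaces when the middle name is missing.
bad = r"John (Reginald)? Smith"

short_name = "John Smith"
full_name = "John Reginald Smith"

good_results = [bool(re.search(good, name)) for name in (short_name, full_name)]
bad_results = [bool(re.search(bad, name)) for name in (short_name, full_name)]

print(good_results)  # [True, True]
print(bad_results)   # [False, True]
```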
We aren't limited to just normal characters in the brackets. You may include special characters in there (including multipliers) as well.
For instance, maybe we would like to find instances of IP addresses. An IP address is a set of 4 numbers (between 0 and 255) separated by full stops (eg. 192.168.0.5).
Let's break it down as this is starting to get a little complex:
The above expression uses elements that have been covered in the previous sections of this tutorial. Be sure to review these sections if need be.
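A pattern along these lines can be tried out in Python as below. The exact expression \b(\d{1,3}\.){3}\d{1,3}\b is an assumption on my part, built from the description above, and the helper function in_range is made up for the sketch.

```python
import re

# \b            word boundary, so we don't start part way through a number
# (\d{1,3}\.)   one to three digits followed by a literal full stop ...
# {3}           ... three times over
# \d{1,3}\b     a final one to three digits, then another word boundary
IP_PATTERN = r"\b(\d{1,3}\.){3}\d{1,3}\b"

# Hypothetical sample text, made up for this illustration.
text = "Server 192.168.0.5 replied; 999.1.1.1 also fits, 1.2.3 does not"

# finditer is used because findall would return only the bracketed group
candidates = [m.group(0) for m in re.finditer(IP_PATTERN, text)]

def in_range(address):
    """Check the part the pattern cannot: each number is between 0 and 255."""
    return all(0 <= int(part) <= 255 for part in address.split("."))

valid = [c for c in candidates if in_range(c)]

print(candidates)  # ['192.168.0.5', '999.1.1.1']
print(valid)       # ['192.168.0.5']
```

The pattern checks the shape of the text, not the meaning of the numbers, which is why 999.1.1.1 gets through until the extra check is applied.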
As you can see, regular expressions can soon get hard to read once you get various brackets and backslashes in there. This makes it easy to make silly mistakes by missing or misplacing one of these characters and the mistakes can be hard to spot. Remember the strategies for handling this.
Whenever we match something within brackets, that value is actually stored in a variable which we may refer to later on in the regular expression. To access these variables we use the escape character ( \ ) followed by a digit. The first set of brackets is referred to with \1, the second set of brackets with \2 and so on.
Let's say we want to find lines with two mentions of a person whose last name is Smith. We don't know what their first name may be, however. We could do the following:
In the above example you'll notice that we matched the text between the two instances of John Smith as well but in this case that is ok as we are not too concerned in what was matched, only that there was a match.
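Here is a Python sketch of a back reference. The pattern is an assumption built from the description above (a capitalised first name followed by Smith, then anything, then the very same text again) and the sample lines are invented.

```python
import re

# (\b[A-Z]\w+ Smith\b)  a capitalised word then ' Smith', stored as group 1
# .*                    anything in between
# \1                    exactly the text that group 1 matched
PATTERN = r"(\b[A-Z]\w+ Smith\b).*\1"

# Hypothetical sample lines, made up for this illustration.
same_person = "John Smith arrived and later John Smith left"
two_people = "John Smith arrived and later Harold Smith left"

hit = re.search(PATTERN, same_person)
miss = re.search(PATTERN, two_people)

print(hit.group(1))  # John Smith
print(hit.group(0))  # John Smith arrived and later John Smith
print(miss)          # None
```

The second line has two Smiths but they are different people, so \1 (which must repeat 'John Smith' exactly) never finds its partner.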
With alternation we are looking for something or something else. We have seen a very basic example of alternation with the range operator. This allows us to perform alternation with a single character, but sometimes we would like to perform the operation with a larger set of characters. We can achieve this with the pipe symbol ( | ) which means or.
So for instance, if we wanted to find all instances of either 'dog' or 'cat' we could do the following:
We can also use more than one | to include more options.
Maybe we only want alternation to happen on a part of the regular expression instead of the whole regular expression. To achieve this we use brackets.
Maybe we want to match Harold Smith or John Smith but not any other Smith.
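The three alternation examples can be sketched in Python as follows. The patterns follow the descriptions above and the sample text is made up.

```python
import re

# Hypothetical sample text, made up for this illustration.
pets = "My dog chased the cat and the bird"
people = "John Smith, Harold Smith and Jane Smith"

# dog|cat : either 'dog' or 'cat'
two_options = re.findall(r"dog|cat", pets)

# dog|cat|bird : more than one | gives more options
three_options = re.findall(r"dog|cat|bird", pets)

# (John|Harold) Smith : the brackets limit the alternation to the first name
smiths = [m.group(0) for m in re.finditer(r"(John|Harold) Smith", people)]

print(two_options)    # ['dog', 'cat']
print(three_options)  # ['dog', 'cat', 'bird']
print(smiths)         # ['John Smith', 'Harold Smith']
```

Without the brackets, John|Harold Smith would mean either 'John' on its own or 'Harold Smith', which is probably not what was intended.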
Lookaheads and Lookbehinds are the final thing we are going to introduce in this tutorial and they can be one of the trickiest things you will encounter in regular expressions. Both of them operate in one of two modes: positive, where the given pattern must match, and negative, where the given pattern must not match.
The main idea of both the lookahead and lookbehind is to see if something matches (or doesn't) and then to throw away what was actually matched.
With a lookahead we want to look ahead (hence the name) in our string and see if it matches the given pattern, but then disregard it and move on. The concept is best illustrated with an example.
Let's say we wish to identify numbers greater than 4000 but less than 5000. This is a problem which seems simple but is in fact a little trickier than you suspect. A common first attempt is to try the following:
Then you realise that the way we can tackle this is to say we are looking for a '4' followed by 3 digits where at least one of those digits is not a '0'. For us as humans that seems like a simple thing to look for but with what we have learnt so far in regular expressions, it is not so easy. We could try something like:
That is, use alternation to check three different scenarios, each with a different one of the three digits not being '0'.
I reckon you're probably looking at the above and thinking that's a lot of regular expression to match just 4 characters. Worse still, think about how that would increase if instead of between 4000 and 5000 we wanted between 40000 and 50000. It soon becomes clear that the above regular expression works but it is not elegant and it doesn't scale.
It turns out that a negative lookahead can solve problems like this quite well. A negative lookahead is set up as follows:
(?!x)
Our negative lookahead is contained within brackets and the first two characters inside the brackets are ?!. Replace x with what it is you don't want to match.
Now we can set up our regular expression as follows:
That might seem a little confusing so let's break it down.
In plain English we could say: "We are looking for a '4' which is not followed by 3 '0's but is followed by 3 digits".
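The three approaches discussed above can be compared side by side in Python. All three patterns are my own reconstructions from the descriptions, and the word boundaries ( \b ) are an added assumption so that a longer number such as 40000 is not picked up part way through.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "3999 4000 4001 4500 4999 5000 40000"

# First attempt: a 4 followed by any three digits (wrongly includes 4000)
first_attempt = re.findall(r"\b4\d\d\d\b", text)

# Alternation: three scenarios, each insisting a different digit is not 0
ALTERNATION = r"\b4([1-9]\d\d|\d[1-9]\d|\d\d[1-9])\b"
with_alternation = [m.group(0) for m in re.finditer(ALTERNATION, text)]

# Negative lookahead: a 4 not followed by 000, then any three digits
with_lookahead = re.findall(r"\b4(?!000)\d\d\d\b", text)

print(first_attempt)     # ['4000', '4001', '4500', '4999']
print(with_alternation)  # ['4001', '4500', '4999']
print(with_lookahead)    # ['4001', '4500', '4999']
```

Moving to numbers between 40000 and 50000 only needs one more 0 and one more \d in the lookahead version, whereas the alternation version would need five scenarios.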
A positive lookahead works in the same way but the characters inside the lookahead have to match rather than not match. The syntax for a positive lookahead is as follows:
(?=x)
All we need to do is replace the '!' with an '='.
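A small Python sketch of a positive lookahead is below. The pattern and sample text are invented for the illustration. The thing to watch is that the lookahead checks for ' Smith' but does not include it in the match.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "John Smith and John Jones"

# John(?= Smith) : 'John', but only when ' Smith' comes straight after
match = re.search(r"John(?= Smith)", text)
count = len(re.findall(r"John(?= Smith)", text))

print(match.group(0))  # John
print(match.span())    # (0, 4)
print(count)           # 1
```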
Lookbehinds work similarly to lookaheads but instead of looking forwards then throwing it away, we look backwards and then throw it away. Similar to lookaheads, they are available in both positive and negative. They follow a similar syntax but include a '<' after the '?' (Think of it as an arrow pointing backwards).
(?<=x) and (?<!x)
These are the syntax for a positive lookbehind and a negative lookbehind respectively.
Let's say we would like to find instances of the name 'Smith' but only if they are a surname. To achieve this we have said that we want to look at the word before it and if that word begins with a capital letter we'll assume it is a surname (the more astute of you will have already seen the flaw in this, ie what if Smith is the second word in a sentence, but we'll ignore that for now.)
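Lookbehinds are one of the areas where tools differ the most, so the Python sketch below uses simpler patterns than the surname idea described above. Python's re module only accepts lookbehinds of a fixed width, which means "any capitalised word" (which could be any length) is rejected. The patterns, sample text and the helper function compiles are all made up for this illustration.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "John Smith met the smith, and Mr Smith waved"

# Positive lookbehind: 'Smith' only when directly preceded by 'John '
after_john = [m.start() for m in re.finditer(r"(?<=John )Smith", text)]

# Negative lookbehind: 'Smith' only when NOT directly preceded by 'Mr '
not_after_mr = [m.start() for m in re.finditer(r"(?<!Mr )Smith", text)]

def compiles(pattern):
    """Report whether Python's re module accepts the pattern at all."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A lookbehind for a capitalised word of any length is variable width.
variable_width_ok = compiles(r"(?<=[A-Z]\w* )Smith")

print(after_john)         # [5]
print(not_after_mr)       # [5]
print(variable_width_ok)  # False
```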
Lookaheads and lookbehinds can be a bit tough to get your head around at first. I would suggest you experiment with a few different searches yourself to get the hang of it.
Applications and programming languages differ in how they implement lookaheads and lookbehinds. Some will allow you to use other regular expression features within a lookahead and lookbehind, some will not. Some will allow some features but not all of them. If you are getting unexpected behaviour you may need to find out which features are and aren't implemented for your particular application or programming language.
You've now learnt enough about regular expressions to get you through the majority of problems you will probably face. You've really only been introduced to the building blocks though. Learning how to put the building blocks together into effective patterns is something which will take time and practice. Don't worry if some of this stuff is still a little confusing at this point in time. With practice it will all become clearer and you will become very powerful in terms of the things you can achieve.
Regular expressions can be a bit of an abstract concept to get your head around. Let's take a look at some real world examples to give you a better idea of how they are actually used and what types of problems you can solve with them. Remember, the examples below are just a taste of what you can do with regular expressions. They have many uses and you are really only limited by your imagination and creativity.
On the original web page you can hover over descriptions and examples to see which parts of the regular expression they refer to. A small amount of JavaScript is used to create this functionality. I set IDs on relevant sections, then the JavaScript uses a regular expression to extract relevant data and from that identify the corresponding content to also highlight. So for instance, I have an item identified as eg1trigger3. When you hover over this a regular expression is used to extract the example number (1) and trigger number (3) as follows:
eg(\d+)trigger(\d+)
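Broken down, that is: the literal characters eg, then (\d+) which captures one or more digits (the example number), then the literal characters trigger, then a second (\d+) which captures one or more digits (the trigger number).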
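The page itself does this in JavaScript, but the same extraction can be sketched in Python (used here only to keep the examples in one language). The variable names are made up.

```python
import re

# The ID from the example above
element_id = "eg1trigger3"

# Each (\d+) captures a run of digits into its own numbered group
match = re.fullmatch(r"eg(\d+)trigger(\d+)", element_id)

example_number = int(match.group(1))
trigger_number = int(match.group(2))

print(example_number, trigger_number)  # 1 3
```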
Form validation is an area where regular expressions are really useful. This example and the next few show how.
So we know that a credit card number is 16 digits, and is typically divided into 4 groups of 4 digits. Don't you hate it when you are given just a single field and not told whether it accepts the number as one big number or four 4 digit groups? What if it could elegantly handle both? While we're at it, let's make it also allow for the separator to be a space, dash or comma.
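\d{4}[-, ]?\d{4}[-, ]?\d{4}[-, ]?\d{4}
Broken down, that is: four digits ( \d{4} ), followed by an optional separator which may be a dash, comma or space ( [-, ]? ), repeated for the first three groups, and then a final group of four digits. Every separator needs the ? after it, otherwise a number typed as one big block of 16 digits would be rejected.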
If you want different separators it's a simple matter of changing what is inside the range operators. It's also easy to adapt this pattern to other digit based items such as membership numbers or IP addresses etc.
It's possible to expand on this to accommodate different brands of credit cards. This is because each brand has unique characteristics that we can identify.
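As a sketch of how the credit card pattern might be used for validation, here it is in Python with every separator made optional, so that one big number and four separated groups are both accepted. The function name looks_like_card_number is made up, and re.fullmatch is used so that the whole field has to fit the pattern.

```python
import re

# Four groups of four digits, each separator (dash, comma or space) optional
CARD_PATTERN = r"\d{4}[-, ]?\d{4}[-, ]?\d{4}[-, ]?\d{4}"

def looks_like_card_number(value):
    """True if the whole value has the shape of a 16 digit card number."""
    return re.fullmatch(CARD_PATTERN, value) is not None

# Hypothetical inputs, made up for this illustration.
print(looks_like_card_number("1234567812345678"))     # True
print(looks_like_card_number("1234 5678 1234 5678"))  # True
print(looks_like_card_number("1234-5678-1234-5678"))  # True
print(looks_like_card_number("1234 5678 1234 567"))   # False
```

This only checks the shape of what was typed. It says nothing about whether the number belongs to a real card.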
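Email addresses are an interesting case. There are a variety of approaches depending on how pedantic you want to be. What constitutes valid syntax can be quite complex. Here is a less pedantic approach.
[a-zA-Z0-9.+_-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,63}
Broken down, that is: one or more letters, digits, full stops, plus signs, underscores or hyphens for the name ( [a-zA-Z0-9.+_-]+ ), then an @, then one or more letters, digits, full stops or hyphens for the domain ( [a-zA-Z0-9.-]+ ), then a literal full stop ( \. ) and finally between 2 and 63 letters for the top level domain ( [a-zA-Z]{2,63} ). Note that the hyphen is placed last inside the square brackets so that it is treated as a literal hyphen rather than as part of a range.
You may be wondering why the top level domain section is between 2 and 63 characters and not something like 2 and 4 characters. Things used to be simple with nearly all domains being something like .com or .net, with maybe a .au to signify country. Now things have gotten silly though and domains such as .sandvikcoromant exist (with more likely to follow). The official specification allows for up to 63 characters for the top level domain so it is probably safest to check for that.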
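Here is a Python sketch of the email pattern, written with the hyphen placed last inside the first set of square brackets. That placement matters: a hyphen sitting between + and _ would be read as a range running from + to _, which quietly lets in characters such as =, < and @. The function name looks_like_email and the sample addresses are made up.

```python
import re

# The hyphen is last in each set so it is a literal hyphen, not a range
EMAIL_PATTERN = r"[a-zA-Z0-9.+_-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,63}"

def looks_like_email(value):
    """True if the whole value has the rough shape of an email address."""
    return re.fullmatch(EMAIL_PATTERN, value) is not None

# Hypothetical inputs, made up for this illustration.
print(looks_like_email("jane.doe+news@example.co.uk"))  # True
print(looks_like_email("not an email"))                 # False
print(looks_like_email("a@b"))                          # False

# The pitfall: [+-_] is a range from '+' to '_', and '=' falls inside it.
range_pitfall = re.fullmatch(r"[+-_]", "=") is not None
print(range_pitfall)  # True
```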
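Things like HTML, where our content follows a certain syntax, are great for utilising regular expressions when we want to identify certain elements. I regularly use regular expressions with a search and replace to update certain parts of a page. Let's say for instance that we wish to identify img tags which don't contain an alt attribute and which are contained within an opening and closing tag of the same type, on a single line. (If you want to learn a bit more about HTML you can check out our HTML tutorial.)
<([a-zA-Z][a-zA-Z0-9]*).*>.*<[iI][mM][gG] (?![^>]*alt=).*>.*</\1>
Broken down, that is: an opening < followed by the tag name, which is a letter then any number of letters or digits, captured as the first group ( ([a-zA-Z][a-zA-Z0-9]*) ); the rest of the opening tag ( .*> ); any content ( .* ); the start of an img tag in any mix of upper and lowercase ( <[iI][mM][gG] ); a negative lookahead which fails if alt= appears before the img tag's closing > ( (?![^>]*alt=) ); the rest of the img tag ( .*> ); any further content ( .* ); and finally a closing tag whose name is whatever the first group captured ( </\1> ). The * after the second set of square brackets is what allows tag names of any length, such as div, to be matched by the closing tag.
Let's look at an example to illustrate it: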
<p class='important'>This is a picture of my dog <img src='./mydog.jpg'>. He is awesome.</p>
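The pattern can be tried against the example line in Python as sketched below. The version used here has a * after the second set of square brackets in the tag name so that names longer than two characters (such as div) are captured whole, which is an assumption on my part about what is intended. The extra sample lines are made up.

```python
import re

PATTERN = r"<([a-zA-Z][a-zA-Z0-9]*).*>.*<[iI][mM][gG] (?![^>]*alt=).*>.*</\1>"

# The example line from above
no_alt = ("<p class='important'>This is a picture of my dog "
          "<img src='./mydog.jpg'>. He is awesome.</p>")

# Hypothetical extra lines, made up for this illustration.
has_alt = "<p>My dog <img src='./mydog.jpg' alt='My dog'>.</p>"
in_div = "<div><img src='x.png'></div>"

found = re.search(PATTERN, no_alt)
skipped = re.search(PATTERN, has_alt)
longer_tag = re.search(PATTERN, in_div)

print(found.group(1))       # p
print(skipped)              # None
print(longer_tag.group(1))  # div
```

Regular expressions are fine for quick searches like this on tidy, single line markup. For anything more involved a proper HTML parser is generally the safer choice.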
This cheat sheet is intended to be a quick reminder for the main concepts involved in using regular expressions and assumes you already understand their usage. If you are new to regular expressions we strongly suggest you work through the Regular Expressions tutorial from the beginning.