PHP Perl Compatible Regular Expression Basics Tutorial No.1
Hello, and welcome to the power of regexes ("regex" is short for "regular expression")! This tutorial aims to teach you how to use Perl Compatible Regular Expressions (PCRE) in your PHP applications. It is a basic tutorial for learners who are new to regular expressions and to using them with PHP. Regular expressions are very powerful. I hope you enjoy it and learn from it.
A regular expression is a pattern that is matched against a string. Basically, the match either succeeds or fails. For example:
<?php
/* First we start with a string that we want to find a matching pattern for. */
$sentence = "Once upon a time on a planet known as krypton.";

/* Next we decide what we want to search for. In this case we are searching for the word planet. */
$pattern = "/planet/";

/* Now we use the PHP preg_match function to find the pattern within the string. */
$find_kryponite = preg_match($pattern, $sentence);

/* $find_kryponite is now 1, which PHP treats as true: the pattern was found. */
?>
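It helps to know exactly what preg_match hands back. The short sketch below reuses the sentence from the example; the other variable names and the "moon" and "Krypton" searches are made up for illustration.

```php
<?php
$sentence = "Once upon a time on a planet known as krypton.";

// preg_match() returns 1 when the pattern matches, 0 when it does not,
// and false if an error occurs (for example, a malformed pattern).
$found_planet = preg_match("/planet/", $sentence); // 1
$found_moon   = preg_match("/moon/", $sentence);   // 0

// Patterns are case-sensitive by default. The "i" modifier after the
// closing delimiter switches on case-insensitive matching.
$found_exact    = preg_match("/Krypton/", $sentence);  // 0
$found_any_case = preg_match("/Krypton/i", $sentence); // 1

// Comparing with === 1 keeps "no match" (0) apart from "error" (false).
if ($found_planet === 1) {
    echo "Found the word planet.\n";
}
```

Because 0 and false both count as "falsy" in PHP, the strict comparison is the safer habit once your patterns get more complicated.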
In regular expressions we have what are called meta-characters and character classes. A meta-character does not stand for itself; it defines and controls the search criteria within our expression. For example, '.' matches any character that is not a newline, '*' matches zero or more of the preceding element, and '+' matches one or more of the preceding element. A character class, written in square brackets, matches any one character from the set it lists.
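Here is a small sketch of those three meta-characters at work. The test strings ("kypton", "krrrypton") are invented just to show where each pattern succeeds or fails.

```php
<?php
// '.' matches any single character except a newline.
$dot = preg_match('/k.ypton/', 'krypton');          // 1: '.' stands in for 'r'

// '*' means "zero or more of the preceding element".
$star_zero = preg_match('/kr*ypton/', 'kypton');    // 1: zero 'r's is allowed
$star_many = preg_match('/kr*ypton/', 'krrrypton'); // 1: three 'r's is allowed

// '+' means "one or more of the preceding element".
$plus_zero = preg_match('/kr+ypton/', 'kypton');    // 0: at least one 'r' is needed
$plus_one  = preg_match('/kr+ypton/', 'krypton');   // 1
```

Notice that '*' and '+' never match anything on their own; they only say how many times the thing in front of them may repeat.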
Regular expressions are a language in themselves. Using them is more resource-intensive than using plain string functions such as strpos(), so for a simple fixed-string search the plain function is the better choice, but for anything that is a real pattern the trade-off is often worth it.
View the whole list of meta-character and character class syntax below, taken straight from php.net, and let it sink into your brain.
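To make the trade-off concrete, here is a rough sketch that finds the same word both ways. The variable names are only for illustration, and the "planets?" pattern is an assumed example of a search that a fixed-string function cannot express.

```php
<?php
$sentence = "Once upon a time on a planet known as krypton.";

// For a fixed word, strpos() is enough. It returns the position of the
// first occurrence, or false if the word is absent. Compare with
// !== false, because a hit at position 0 would otherwise look like
// "not found".
$has_planet = strpos($sentence, "planet") !== false; // true

// A regex earns its cost when the search is a pattern rather than a
// fixed string, such as "planet" or "planets" as a whole word.
$has_planet_word = preg_match('/\bplanets?\b/', $sentence) === 1; // true
```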
Meta-characters and Character Classes
From PHP.net :: PCRE syntax
The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way.
There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized in square brackets. Outside square brackets, the meta-characters are as follows:
\ general escape character with several uses
^ assert start of subject (or line, in multiline mode)
$ assert end of subject (or line, in multiline mode)
. match any character except newline (by default)
[ start character class definition
] end character class definition
| start of alternative branch
( start subpattern
) end subpattern
? extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer
* 0 or more quantifier
+ 1 or more quantifier
{ start min/max quantifier
} end min/max quantifier
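A few of the meta-characters in the list above are easier to grasp once you see them run. This is only a sketch; the subject strings and variable names are invented for the demonstration.

```php
<?php
// ^ and $ anchor the match to the start and end of the subject.
$starts = preg_match('/^Once/', 'Once upon a time'); // 1
$ends   = preg_match('/time$/', 'Once upon a time'); // 1
$middle = preg_match('/^upon/', 'Once upon a time'); // 0: "upon" is not at the start

// | separates alternatives; ( ) groups them into a subpattern.
$hero = preg_match('/^(Clark|Jimmy) /', 'Jimmy Olsen'); // 1

// ? makes the preceding element optional (0 or 1 times).
$optional = preg_match('/^villains?$/', 'villain'); // 1

// {min,max} bounds how many times the preceding element may repeat.
$three_to_four = preg_match('/^[a-z]{3,4}$/', 'com');    // 1
$too_long      = preg_match('/^[a-z]{3,4}$/', 'museum'); // 0: six letters
```

The patterns are written in single quotes so that PHP does not try to interpret the '$' or the backslashes before the regex engine sees them.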
Part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are:
\ general escape character
^ negate the class, but only if the first character
- indicates character range
] terminates the character class
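The rules inside square brackets can be sketched in the same way. Again, the strings and variable names below are made up purely for illustration.

```php
<?php
// A range inside [ ] matches any one character in that range.
$digit = preg_match('/^[0-9]$/', '7');      // 1

// A leading ^ negates the class: any one character that is NOT a digit.
$not_digit = preg_match('/^[^0-9]$/', '7'); // 0
$letter_ok = preg_match('/^[^0-9]$/', 'x'); // 1

// '.' and '+' are not meta-characters inside a class,
// so here they are plain literal characters.
$literal_dot = preg_match('/^[.+]$/', '.'); // 1
$not_any     = preg_match('/^[.+]$/', 'a'); // 0

// A '-' placed last in the class cannot form a range,
// so it is taken literally.
$hyphen = preg_match('/^[a-z-]+$/', 'o-man'); // 1
```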
Recognizing E-mail Addresses
<?php
/* In our next example we are searching for an e-mail address in a paragraph of text. */
$paragraph = "
Dear Clark,
I hope your trip to Niagara Falls is going well. Things are very hectic here in Metropolis, that evil villain Lex Luthor is up to no good again and the tension here is as thick as grandma's oatmeal. I'm wishing Superman would show up.
Your pal,
Jimmy Olsen
TheO-man@gmail.com
";

/* Now we define our pattern again, this time searching for e-mail addresses. */
$email_pattern = "/[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{3,4}/";

/* The same idea using the \w character type (letters, digits and underscore): /\w+@\w+(\.\w+)+/ */
?>
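If you run this pattern against Jimmy's signature, you may be surprised by what comes back. The sketch below uses a shortened copy of the text; $wider_pattern is an assumed variant written for this illustration, not part of the original example and not a complete e-mail validator.

```php
<?php
$text = "Your pal, Jimmy Olsen TheO-man@gmail.com";

// The tutorial's pattern: letters and digits only before the '@'.
$email_pattern = '/[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{3,4}/';
preg_match($email_pattern, $text, $matches);
$strict_match = $matches[0]; // "man@gmail.com": '-' is not in the class

// Assumed variant: also allow '.', '_' and '-' before the '@'.
$wider_pattern = '/[a-zA-Z0-9._-]+@[a-zA-Z0-9]+\.[a-zA-Z]{3,4}/';
preg_match($wider_pattern, $text, $matches);
$wider_match = $matches[0]; // "TheO-man@gmail.com"

echo $strict_match, "\n", $wider_match, "\n";
```

The third argument to preg_match receives the matched text, with the whole match in element 0. The first result shows why it pays to test a pattern against real data: a character class matches only the characters you list in it.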
Before we go further, let's break down this common regex for recognizing e-mail addresses.
The [a-zA-Z0-9] is our first example of a character class, starting with the meta-character '[' and ending with ']'.
In this case 'a-z' means any lower-case letter from a through z, and 'A-Z' means any upper-case letter from A through Z. The last part of this class, '0-9', simply means that a matching character can also be a digit from 0 through 9.
The element that follows right after our character class is very important to our pattern.
The '+' right after [a-zA-Z0-9] (remember, '+' means one or more of the preceding element) specifies that we accept any run of letters and digits, as long as there is at least one. This is followed by the literal '@' character, which is found in every e-mail address.
Next we repeat our first pattern, [a-zA-Z0-9]+, and then we have '\.'. Here the backslash '\' is escaping the '.', which to our regex would normally mean "any character that's not a newline". We don't want that in this case; we want the literal '.' as in '.com', '.net', 'Addicted@worldofwarcraft.com', 'Rich_and_retired@myspace.com', etc. The pattern ends with [a-zA-Z]{3,4}, which requires three or four letters after the dot.
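To see why the backslash matters, compare an escaped and an unescaped dot. This is a sketch with invented test strings; preg_quote is a standard PHP function, but using it here is my own suggestion rather than part of the example above.

```php
<?php
// Unescaped, '.' accepts any character in that position.
$loose  = preg_match('/^gmail.com$/', 'gmailXcom');  // 1
// Escaped, '\.' accepts only a literal dot.
$strict = preg_match('/^gmail\.com$/', 'gmailXcom'); // 0
$dot_ok = preg_match('/^gmail\.com$/', 'gmail.com'); // 1

// preg_quote() escapes the regex meta-characters in a string, which
// helps when a literal such as a domain name is inserted into a pattern.
// The second argument names the delimiter so that it is escaped too.
$quoted  = preg_quote('worldofwarcraft.com', '/');
$pattern = '/@' . $quoted . '$/';
$is_wow  = preg_match($pattern, 'Addicted@worldofwarcraft.com'); // 1
```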
I hope you have enjoyed this so far. We will continue with Regular Expression Basics Tutorial No.2.
Any comments and suggestions are welcome. You may email me at: shawn@web-hero.net
© 2006 web-hero.net
