PHP regular expressions for newbies

October 17, 2007

Regular expressions are a powerful and flexible way to define string patterns. A string matches a pattern when it has a certain structure. For example, if we want all the strings that consist of two capital letters followed by six digits, we can write the pattern as a regex like the one below:

^[A-Z]{2}[0-9]{6}$

These few symbols are all we need to know for now. It may look difficult, but as soon as you understand how it works it becomes quite easy. The “^” character marks the beginning of the string and the “$” symbol marks its end. Between these two characters we write our expression.
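To try the pattern out, here is a minimal sketch using preg_match(). The helper name isCode() is made up for this example; the pattern is the one shown above, wrapped in “/” delimiters as PHP requires.

```php
<?php
// Hypothetical helper: true when the whole string is two capital
// letters followed by six digits.
function isCode(string $s): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match('/^[A-Z]{2}[0-9]{6}$/', $s) === 1;
}

var_dump(isCode('AB123456'));  // bool(true)
var_dump(isCode('ab123456'));  // bool(false): lowercase letters
var_dump(isCode('AB12345'));   // bool(false): only five digits
```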

The “[]” brackets enclose a set of allowed characters, and the set matches exactly one character from it. If we want our expression to contain only digits we can put all the digits between the brackets, or simply write [0-9]. For lowercase letters we write [a-z] and for capital letters [A-Z]. We can combine lowercase and capital letters, digits, “_” and “ ” (space) this way: [a-zA-Z0-9_ ]. To include a literal “-” in the set, put it first or last, as in [a-z-].

The “{}” brackets give the number of times the previous element is repeated. [A-Z]{2} means a sequence of exactly 2 capital letters. We can also set a range for the length this way:

  • {2,5} - the sequence has at least 2 and at most 5 characters
  • {2,} - the sequence has at least 2 characters
  • {0,2} - the sequence has at most 2 characters (the short form {,2} is treated as literal text by older PCRE versions, so {0,2} is the safe spelling)
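The length limits can be checked with a small sketch like the one below. The helper fits() is hypothetical and only glues a quantifier onto [a-z] so the variants can be compared side by side.

```php
<?php
// Hypothetical helper: does the whole string consist of lowercase
// letters repeated as many times as the quantifier allows?
function fits(string $quantifier, string $s): bool
{
    return preg_match('/^[a-z]' . $quantifier . '$/', $s) === 1;
}

var_dump(fits('{2}', 'ab'));         // bool(true): exactly 2
var_dump(fits('{2}', 'abc'));        // bool(false): 3 is too many
var_dump(fits('{2,5}', 'abcde'));    // bool(true): 5 is still allowed
var_dump(fits('{2,5}', 'a'));        // bool(false): fewer than 2
var_dump(fits('{2,}', 'abcdefgh'));  // bool(true): no upper limit
var_dump(fits('{0,2}', 'abc'));      // bool(false): more than 2
```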

Let's look at another expression: ^([a-zA-Z]+)[/]*$

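Here we have new operators: “+”, which is equivalent to {1,}, and “*”, which is equivalent to {0,}. There is also “?”, which is equivalent to {0,1}, so the previous element is optional. Like “{}”, all three apply to the element right before them. The [/] part matches a literal slash; the escape character is the backslash “\”, which makes the special character after it literal.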
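A short sketch of this expression in PHP, with made-up sample strings. One assumption to note: “#” is used as the pattern delimiter instead of “/”, so the slash inside the expression does not have to be escaped.

```php
<?php
// "#" is the delimiter here so that the "/" inside the
// expression does not need escaping.
$pattern = '#^([a-zA-Z]+)[/]*$#';

$a = preg_match($pattern, 'blog');     // 1: letters, no slash
$b = preg_match($pattern, 'blog///');  // 1: "*" allows any number of slashes
$c = preg_match($pattern, '///');      // 0: "+" needs at least one letter

// Same expression with "?" in place of "*": at most one slash.
$optional = '#^([a-zA-Z]+)[/]?$#';
$d = preg_match($optional, 'blog/');   // 1
$e = preg_match($optional, 'blog//');  // 0: two slashes are one too many
```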

We can easily find words of the form f[digit or x] using f[0-9x] (the {1} in f[0-9x]{1} is allowed but redundant).

If we want strings of the form f[two digits][any character] we can use the regex f[0-9]{2}. where the final “.” is part of the pattern. The “.” character matches any character except a line break.
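Both forms can be tried with a few sample strings. This is only a sketch; the “^” and “$” anchors are added here so that the whole string has to match.

```php
<?php
// f followed by one digit or by x
$r1 = preg_match('/^f[0-9x]$/', 'f7');        // 1
$r2 = preg_match('/^f[0-9x]$/', 'fx');        // 1
$r3 = preg_match('/^f[0-9x]$/', 'fy');        // 0: "y" is not in the set

// f, two digits, then any single character
$r4 = preg_match('/^f[0-9]{2}.$/', 'f42!');   // 1
$r5 = preg_match('/^f[0-9]{2}.$/', 'f42');    // 0: the "." needs one more character
$r6 = preg_match('/^f[0-9]{2}.$/', "f42\n");  // 0: "." does not match a line break
```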

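The “()” brackets group several elements into one unit, so a quantifier can apply to the whole group, and they capture the text matched by the group so it can be read back later.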
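Here is a small sketch of both uses of the brackets, reusing the two-letters-six-digits pattern from the beginning. The variable names are only illustrative.

```php
<?php
// Capturing: each "()" group gets its own slot in the result array.
$matches = array();
preg_match('/^([A-Z]{2})([0-9]{6})$/', 'AB123456', $matches);
// $matches[0] is 'AB123456' (the full match)
// $matches[1] is 'AB'       (first group)
// $matches[2] is '123456'   (second group)

// Grouping: the quantifier applies to the whole "ab" unit.
$twice = preg_match('/^(ab){2}$/', 'abab');  // 1
$once  = preg_match('/^(ab){2}$/', 'ab');    // 0: "ab" appears only once
```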

Below are special characters:

  • * zero or more repetitions of the previous element
  • + one or more repetitions of the previous element
  • \d one digit
  • \w one alphanumeric character or underscore
  • \s one whitespace character (including line breaks and tabs)
  • \t tab
  • \n new line (\r\n for Windows)
  • . any character except line breaks
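A few of these shortcuts in action, as a sketch with made-up sample strings. The patterns are written in single quotes so that PHP passes the backslashes to the regex engine unchanged.

```php
<?php
// \d+ : runs of digits
preg_match_all('/\d+/', 'Room 12, floor 3', $numbers);
// $numbers[0] is array('12', '3')

// \s+ : split on any run of spaces, tabs or line breaks
$words = preg_split('/\s+/', "one  two\tthree\nfour");
// $words is array('one', 'two', 'three', 'four')

// \w+ : letters, digits and underscore only
$ok  = preg_match('/^\w+$/', 'user_name1');  // 1
$bad = preg_match('/^\w+$/', 'user-name');   // 0: "-" is not matched by \w
```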

As an exercise, let's find all the email addresses in a text. Here is a simple regex for an email address: [a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,4}

The first [] is for the part before the “@” character. I used “+” because the sequence must have at least one character (letter, digit, -, . or _). Next comes the “@” character and the same sequence for the domain name. Then we have a “.” which is escaped because outside a “[]” set it would otherwise mean any character. Last is the domain type, which here may have only 2 to 4 letters. This is a simplification, so longer endings such as .museum are not matched in full.

Let's see how to do this in PHP.

$string = 'simone@d-d.com';

$emails = array();

We have the string to search in and the array that will receive the addresses found. Now we call the function:

preg_match_all('/[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,4}/', $string, $emails, PREG_SET_ORDER);
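Put together, the steps could look like the sketch below. The sample text and the second address (info@example.org) are made up for the illustration.

```php
<?php
// Sample text; the second address is invented for this example.
$string = 'Write to simone@d-d.com or to info@example.org for details.';
$emails = array();

$count = preg_match_all(
    '/[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,4}/',
    $string,
    $emails,
    PREG_SET_ORDER
);

// With PREG_SET_ORDER every element of $emails is one match,
// and index 0 of that element is the full matched text.
foreach ($emails as $match) {
    echo $match[0], "\n";
}
// prints:
// simone@d-d.com
// info@example.org
```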

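The fourth argument is a flag. The main ones are:

  • PREG_SET_ORDER groups the results by match: $emails[0] holds the first full match followed by its captured “()” groups, $emails[1] the second match, and so on
  • PREG_OFFSET_CAPTURE returns, together with every matched string, the offset (in bytes) at which it begins in the searched string
  • PREG_PATTERN_ORDER (the default) groups the results by bracket: $emails[0] holds all the full matches, $emails[1] all the texts captured by the first “()” group, and so on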
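The difference between the flags is easiest to see on a pattern that has brackets. The sketch below uses a deliberately tiny pattern and made-up input, not the full email regex.

```php
<?php
$text    = 'a@x.com b@y.org';
$pattern = '/([a-z]+)@([a-z.]+)/';

// Grouped by bracket: one sub-array per group.
preg_match_all($pattern, $text, $byPattern, PREG_PATTERN_ORDER);
// $byPattern[0] is array('a@x.com', 'b@y.org')  (full matches)
// $byPattern[1] is array('a', 'b')              (first group)
// $byPattern[2] is array('x.com', 'y.org')      (second group)

// Grouped by match: one sub-array per match.
preg_match_all($pattern, $text, $bySet, PREG_SET_ORDER);
// $bySet[0] is array('a@x.com', 'a', 'x.com')
// $bySet[1] is array('b@y.org', 'b', 'y.org')

// Every string is replaced by array(string, offset).
preg_match_all($pattern, $text, $withOffsets, PREG_OFFSET_CAPTURE);
// $withOffsets[0][0] is array('a@x.com', 0)
// $withOffsets[0][1] is array('b@y.org', 8)
```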

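preg_match_all() returns the number of full matches found (possibly 0), or FALSE in case of error. preg_match() stops at the first match, so it returns 1 or 0, or FALSE in case of error.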
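Because 0 (no match) and FALSE (error) both count as false in a plain if, it is safer to compare the result with ===. A small sketch with made-up input:

```php
<?php
$all   = preg_match_all('/\d/', 'a1b2c3', $m);  // 3: three digits found
$first = preg_match('/\d/', 'a1b2c3');          // 1: stops at the first digit
$none  = preg_match('/\d/', 'abc');             // 0: no digit, but no error either

// Strict comparison tells "no match" apart from "error".
if ($none === false) {
    echo "regex error\n";
} elseif ($none === 0) {
    echo "no match\n";
}
```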

Note: when you only look for a fixed substring, strpos() and strstr() are faster than preg_match().
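For instance, checking whether a string contains “@” needs no regex at all. A minimal sketch:

```php
<?php
$address = 'simone@d-d.com';

// strpos() returns the position of the first occurrence, or false.
// Position 0 is a valid result, hence the strict !== false check.
$position = strpos($address, '@');  // 6
$hasAt    = $position !== false;    // true

// strstr() returns the rest of the string starting at the match.
$fromAt = strstr($address, '@');    // '@d-d.com'
```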

Success ;)

 
