Saturday, 2 July 2011

Regular Expressions and Pattern Matching

A regular expression is a sequence of characters that describes a pattern of text to match. Regular expressions are widely used in validation. For example, if you have a form with three text boxes asking for a name, an email address, and a cellphone number, you can use regular expressions to prevent users from submitting invalid entries. The key to writing a good regular expression is studying the actual pattern of valid input. For example, suppose each part of a person's name starts with a capital letter followed by lowercase letters, and a name must not contain numbers or special characters. Once you have these facts, you can write the regular expression pattern.

A pattern for a simple English name of this form (not including a middle initial) would be:

^([A-Z][a-z]+)(\s[A-Z][a-z]+)*$

Don't panic if you don't understand the pattern yet. We will discuss it in the following sections. Regular expression operators are used to add meaning to a pattern. For example, the * operator (called the Kleene star) indicates that the preceding pattern will match 0 or more occurrences. So a* will match any number of a's, including the empty string. A list of regular expression operators is shown below.

Operator Description

. Represents any one character.

[] Encloses a list of characters that will be allowed.

[^ ] Encloses a list of characters that will not be allowed.

? Will match 0 or 1 occurrence of the preceding character or group.

* Will match 0 or more occurrences of the preceding character or group.

+ Will match 1 or more occurrences of the preceding character or group.

{n} Will match the preceding character or group exactly n times.

{n,} Will match the preceding character or group at least n times.

{n,N} Will match the preceding character or group at least n times, but not more than N times.

^ Represents the beginning of a string.

$ Represents the end of the string.

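\< Represents the beginning of a word (supported by some engines only, such as GNU grep).

\> Represents the end of a word (supported by some engines only, such as GNU grep).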

\b Represents the beginning or end of a word.

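\B Matches a position that is not the beginning or end of a word, such as the middle of a word.

\d Represents any digit (0-9).

\w Represents a word character (a letter, a digit, or an underscore).

\s Represents a whitespace character (such as a space, tab, or newline).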
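As a rough illustration of the shorthand operators in the list above, here is a minimal sketch using Python's built-in re module. The sample strings and variable names are made up for this example, and other regex engines may behave slightly differently.

```python
import re

text = "Call 555 1234 now"

# \d matches one digit, so \d+ matches each run of digits.
digit_runs = re.findall(r"\d+", text)

# \w matches one word character, so \w+ matches each word or number.
words = re.findall(r"\w+", text)

# \s matches one whitespace character; here it is used to split the text.
pieces = re.split(r"\s", text)

# \b marks a word boundary, so "now" is matched only as a whole word.
found_in_text = re.search(r"\bnow\b", text) is not None
found_inside_know = re.search(r"\bnow\b", "I know nothing") is not None

print(digit_runs, words, pieces, found_in_text, found_inside_know)
```

The last check is False because the "now" inside "know" is not preceded by a word boundary.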

The . operator represents any one character, which includes letters, numbers, and special symbols (in most engines, anything except a newline). So the expression ... matches any sequence of exactly 3 characters.
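A small sketch of the three-dot pattern, assuming Python's re module; the test strings are invented for illustration. fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# Three dots: exactly three characters of any kind (except a newline).
three_any = re.compile(r"...")

samples = ["cat", "a1!", "ab", "abcd"]

# fullmatch succeeds only if the entire string fits the pattern.
results = {s: three_any.fullmatch(s) is not None for s in samples}

print(results)
```

Only the strings with exactly three characters are accepted, whether they contain letters, digits, or symbols.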

The [ ] operator is used to enclose the characters that are allowed at one position in the pattern. For example, [A-Za-z] will match any single capital or lowercase letter. Notice we used the - sign to indicate a range of letters. You can also do that with digits by using [0-9], and you can even combine all of them as [A-Za-z0-9].

The [^ ] operator is the opposite of the [ ] operator. It matches any character except those listed inside it. For example, [^0-9] matches any character that is not a digit, which includes letters but also spaces and punctuation.
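The difference between an allowed list and a banned list can be sketched as follows, assuming Python's re module. The input strings are made up for the example.

```python
import re

# [A-Za-z0-9] matches one letter or digit at a time.
letters_and_digits = re.findall(r"[A-Za-z0-9]", "R2-D2!")

# [^0-9] matches one character at a time that is NOT a digit.
non_digits = re.findall(r"[^0-9]", "a1 b2!")

print(letters_and_digits)
print(non_digits)
```

Note that the second list contains the space and the exclamation mark as well as the letters, since the negated class excludes only digits.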

The ? operator will match the preceding element 0 or 1 times. For example, a? will match a single a or nothing.

The * symbol matches 0 or more occurrences of the preceding element. So a* will match a single a, any number of a's, or nothing.

The + character matches 1 or more occurrences. So a+ matches 1 or more a's.
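To compare ?, *, and + side by side, here is a minimal sketch assuming Python's re module. The helper name matches is hypothetical and exists only for this example.

```python
import re

def matches(pattern, text):
    """Return True if the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["", "a", "aaa"]

# a? allows zero or one a.
optional = [matches(r"a?", s) for s in samples]

# a* allows zero or more a's.
star = [matches(r"a*", s) for s in samples]

# a+ requires at least one a.
plus = [matches(r"a+", s) for s in samples]

print(optional, star, plus)
```

The empty string is accepted by a? and a* but rejected by a+, and the string of three a's is rejected only by a?.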

The {n} operator is useful when you want to match the preceding pattern an exact number of times. So a{3} will match exactly 3 a's and z{7} will match exactly 7 z's.

The {n,} operator allows you to match the preceding pattern at least n times. Therefore, a{3,} will match 3 or more a's.

The {n,N} operator matches the preceding pattern at least n times, but not more than N times. For example, a{3,5} will match 3, 4, or 5 a's but not 2 or 6 a's. Write these operators without spaces inside the braces, since many engines treat a{3, 5} as literal text.
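The three counting operators can be checked with a short sketch like the one below, assuming Python's re module. The helper count_ok is a hypothetical name introduced for illustration; it tests a string made of n copies of the letter a.

```python
import re

def count_ok(pattern, n):
    """Return True if a string of n a's fits the whole pattern."""
    return re.fullmatch(pattern, "a" * n) is not None

lengths = range(8)  # try strings of 0 to 7 a's

exactly_three = [n for n in lengths if count_ok(r"a{3}", n)]
at_least_three = [n for n in lengths if count_ok(r"a{3,}", n)]
three_to_five = [n for n in lengths if count_ok(r"a{3,5}", n)]

print(exactly_three, at_least_three, three_to_five)
```

Each list holds the lengths that were accepted, which makes the lower and upper limits of each operator easy to see.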

We use ^ and $ to anchor a pattern to a whole string. The ^ signifies the beginning of the string and the $ signifies the end of the string. Suppose that we have a pattern like a* and a word like baaaad. The word will still be accepted because the pattern was found somewhere in it. Using the ^ and $ operators, you can match the whole word and not just a part of it. For example, ^a*$ only matches strings made up entirely of a's (including the empty string).
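The effect of the anchors on the word baaaad can be sketched as follows, assuming Python's re module, where re.search looks for the pattern anywhere in the string. The variable names are made up for this example.

```python
import re

word = "baaaad"

# Without anchors, the pattern only has to be found somewhere in the word.
unanchored = re.search(r"a*", word) is not None

# With ^ and $, the pattern has to cover the word from start to end.
anchored = re.search(r"^a*$", word) is not None

# A string made up entirely of a's passes the anchored pattern.
only_as = re.search(r"^a*$", "aaaa") is not None

print(unanchored, anchored, only_as)
```

The b and d at the ends of the word are what cause the anchored pattern to reject it.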

We can use parentheses to enclose patterns. The whole pattern enclosed in parentheses is then treated as a group, and by combining groups with other operators you can build more complex regular expressions.

([A-Z][a-z]+)(\s[A-Z][a-z]+)*

Let's go back to our earlier example and dissect it. First, you will see the group ([A-Z][a-z]+). [A-Z] means the first letter must be a capital letter. [a-z]+ means one or more lowercase letters should follow the capital letter. We then group these patterns for readability using parentheses. The next pattern is (\s[A-Z][a-z]+)*. It starts with \s, which means a whitespace character should exist here. Then the first pattern is repeated, saying that the next part of the name must also start with a capital letter followed by lowercase letters. The Kleene star (*) is placed right after the group to indicate that the whole group can occur 0 or more times.
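To see why the parentheses matter when a star follows them, here is a minimal sketch assuming Python's re module. It uses the simpler patterns (ab)* and ab*, which are not from the name example, and a hypothetical helper called matches.

```python
import re

def matches(pattern, text):
    """Return True if the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# With a group, the star repeats the whole "ab" unit.
grouped_repeats_ab = matches(r"(ab)*", "ababab")

# Without a group, the star repeats only the "b".
ungrouped_on_ababab = matches(r"ab*", "ababab")
ungrouped_on_abbb = matches(r"ab*", "abbb")

print(grouped_repeats_ab, ungrouped_on_ababab, ungrouped_on_abbb)
```

In the same way, the star in the name pattern repeats the whole "whitespace plus capitalized word" group, not just its last character.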

The following are examples of names that the pattern matches.

Randy Joseph Vincent John Mark Robert Smith
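A name check built on this pattern might look like the sketch below, assuming Python's re module and the pattern written with \s for the whitespace. The names NAME_PATTERN and is_valid_name are hypothetical and were chosen for this example.

```python
import re

# One capitalized word, then zero or more groups of
# "one whitespace character + capitalized word".
NAME_PATTERN = re.compile(r"^([A-Z][a-z]+)(\s[A-Z][a-z]+)*$")

def is_valid_name(name):
    """Return True if the whole name fits the pattern."""
    return NAME_PATTERN.fullmatch(name) is not None

accepted = [is_valid_name(n) for n in ["Randy", "Robert Smith", "John Mark"]]
rejected = [is_valid_name(n) for n in ["randy", "Randy2", "Robert  Smith", "R"]]

print(accepted)
print(rejected)
```

The rejected samples fail for different reasons: no capital letter, a digit, two spaces in a row, and a capital letter with no lowercase letters after it. Real names such as O'Brien or McDonald would also be rejected, so treat this pattern as a teaching example rather than a complete validator.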

I hope you enjoyed my article about regular expressions. Visit my site, which contains free C# tutorials.
