asp:Feature
Introduction to Regular Expressions: Part I
Creating Expressions
By Mitchel W. Sellers
This two-part article series provides a quick and practical introduction to using regular expressions. Regular expressions can be used for many things; however, they are typically used for input validation or to perform advanced searches on text in supporting applications. This first article will explain how to create a regular expression pattern; the expression defines what is considered a match. The second article will provide details on how to implement regular expressions in .NET applications.
Before starting I'd like to point out I have a free regular expression tester available on my Web site (http://www.mitchelsellers.com); you can use this to test the behavior of your regular expressions. During the second article I'll discuss the specific options available on this test page, as well as how the page was created.
Regular expressions use three basic types of symbols: meta characters, escape characters, and character classes. The following table lists the important meta characters, with a short description and an example of each.
| Character | Description | Example | Matches |
| ^ | Indicates the start of a string; used to match a specific beginning sequence. | ^abc | abc, abc123, abcdefg |
| $ | Indicates the end of a string; used to match a specific ending sequence. | abc$ | 123456789abc, 987abc |
| . | Any character excluding \n (new line). | a.c | abc, aac, a9c |
| | | Or operator used to specify one criterion or another. | john|jane | jane, john |
| * | Zero or more of previous expression. | 12c* | 12, 12c, 12cc |
| + | One or more of previous expression. | 1a+c | 1ac, 1aac |
| ? | Zero or one of previous expression. | 12?c | 1c, 12c |
| \ | Escape character, used to make any of the special characters (^, $, ., |, *, +, ?, (, [, {, etc...) literal for matching. See next chart for other escape characters. | 1\*a | 1*a |
| {....} | Explicit quantifier notation; used to indicate the exact number of occurrences of a character or character class. A comma can be added to provide min/max occurrences (for example, {2,4}). | 12a{2} | 12aa, 12aa3 |
| [....] | Matches a range of characters; you can provide collections of characters (abcdefg), as well as hyphenated ranges of characters for matching (A-Z). | 123[abc] | 123a, 123b, 123c |
| (....) | Groups a portion of the expression, so that a quantifier can apply to the whole group and the matched section can be captured for display. | (123){2} | 123123 |
Meta Characters
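To make the table concrete, here is a minimal sketch that runs one example per meta character. It uses Python's `re` module purely for illustration (the second article in this series covers .NET); the sample strings are taken from, or modeled on, the table, and the `check` helper and `samples` list are names invented for this sketch.

```python
import re

# Each entry: (pattern, text that should match, text that should not).
samples = [
    (r"^abc", "abc123", "123abc"),      # ^ anchors the match at the start
    (r"abc$", "987abc", "abc987"),      # $ anchors the match at the end
    (r"a.c", "a9c", "ac"),              # . needs exactly one character
    (r"john|jane", "jane", "joan"),     # | accepts either alternative
    (r"12c*", "12", "1c"),              # * allows zero or more "c"
    (r"1a+c", "1aac", "1c"),            # + requires at least one "a"
    (r"12?c", "1c", "122c"),            # ? allows zero or one "2"
    (r"1\*a", "1*a", "11a"),            # \* is a literal asterisk
    (r"12a{2}", "12aa3", "12a3"),       # {2} requires exactly two "a"
    (r"(123){2}", "123123", "12323"),   # the group repeats as a unit
]

def check(pattern, text):
    """Hypothetical helper: True when pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

results = [(p, check(p, good), check(p, bad)) for p, good, bad in samples]
for pattern, good_ok, bad_ok in results:
    print(pattern, good_ok, bad_ok)
```

Because none of these patterns (other than the first two) are anchored, a match anywhere in the text counts, which is why 12aa3 matches 12a{2}.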
The characters in the table below are used to match special characters in regular expressions; we will use some of these later in this article. NOTE: This is a list of commonly used escape characters, not a complete list of escape characters.
| Character | Matches |
| \b | Word boundary; matches the position between a word character and a space or other non-word character (or the start or end of the string), signifying the beginning or end of a word. |
| \t | Tab character. |
| \n | New line character (great for multi-line textboxes). |
| \(any metacharacter) | Matches the entered meta character. (\* matches *, \$ matches $). |
Escape Characters
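The escape characters in the table can be tried out with a short sketch like the one below. It is written in Python only as an illustration; the variable names and sample strings are assumptions made for this example.

```python
import re

# \b matches at a word boundary, so "cat" is found only as a whole word.
whole_word = re.compile(r"\bcat\b")
found_word = whole_word.search("the cat sat") is not None
found_inside = whole_word.search("concatenate") is not None

# \t and \n match a literal tab and a literal new line.
has_tab = re.search(r"a\tb", "a\tb") is not None
line_count = len(re.split(r"\n", "line one\nline two\nline three"))

# A backslash makes a meta character literal: \$ matches "$", \. matches ".".
price = re.search(r"\$\d+\.\d{2}", "Total: $19.99")

print(found_word, found_inside, has_tab, line_count, price.group())
```

Note that "concatenate" contains the letters "cat", but they sit between other word characters, so no word boundary surrounds them.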
Below are character classes that represent different groups of characters to make it easier to match common groups of characters.
| Character Class | Description | Example | Matches |
| . | Matches any character except \n. If Single Line option is enabled, it matches ANY character. | a.c | aac, abc, a1c |
| [rstlne] | Matches any single character in the provided list. | a[rstlne] | ar, as, al |
| [^aeiou] | Matches any single character NOT in the provided list. | a[^aeiou] | ab, ad, ah |
| [0-9a-zA-Z] | Matches any single character in the following ranges (0 through 9, A through Z, and a through z). The hyphen indicates a range element. | 123[0-9A-F] | 123A, 1234 |
| \w | Matches any word character; in ECMAScript mode this matches [0-9A-Za-z_]. | 123\w | 123a, 1234 |
| \W | Matches any NON-word character; in ECMAScript mode this is the same as [^0-9A-Za-z_]. | 123\W | 123$, 123- |
| \s | Matches any whitespace character; in ECMAScript mode this matches spaces, tabs, and new lines. | 123\sa | 123 a |
| \S | Matches any NON-whitespace character. | 1\Sa | 14a, 1ba |
| \d | Matches any digit character; in ECMAScript mode this matches 0-9. | \d2 | 12, 32 |
| \D | Matches any NON-digit character; in ECMAScript mode this matches anything that is not 0-9. | \D2 | a2, b2 |
Character Classes
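The following sketch exercises each character class from the table against one input that should match and one that should not. It is a Python illustration, not the .NET implementation covered in Part II; `whole_match` and `checks` are hypothetical names, and the `re.ASCII` flag is used as an assumption to approximate the ECMAScript-mode behavior the table describes.

```python
import re

# Hypothetical helper: True when the pattern matches the entire text.
# re.ASCII restricts \w, \d, and \s to ASCII characters.
def whole_match(pattern, text):
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None

# Each value: (result for a matching input, result for a non-matching input).
checks = {
    "list":           (whole_match(r"a[rstlne]", "ar"), whole_match(r"a[rstlne]", "ab")),
    "negated list":   (whole_match(r"a[^aeiou]", "ab"), whole_match(r"a[^aeiou]", "ae")),
    "ranges":         (whole_match(r"123[0-9A-F]", "123A"), whole_match(r"123[0-9A-F]", "123G")),
    "word":           (whole_match(r"123\w", "1234"), whole_match(r"123\w", "123$")),
    "non-word":       (whole_match(r"123\W", "123$"), whole_match(r"123\W", "123a")),
    "whitespace":     (whole_match(r"123\sa", "123 a"), whole_match(r"123\sa", "123-a")),
    "non-whitespace": (whole_match(r"1\Sa", "14a"), whole_match(r"1\Sa", "1 a")),
    "digit":          (whole_match(r"\d2", "12"), whole_match(r"\d2", "a2")),
    "non-digit":      (whole_match(r"\D2", "a2"), whole_match(r"\D2", "12")),
}

# The underscore counts as a word character.
underscore_is_word = whole_match(r"123\w", "123_")

for name, (good, bad) in checks.items():
    print(name, good, bad)
```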
How to Apply this Information
Now that we've explained the various characters included in matching regular expressions, let's walk through some practical examples to illustrate how all these items are pulled together. In the following subsections I'll walk you through a series of real-world validations and provide examples with detailed information.
Before beginning the examples I want to point out that in ALL of my examples the regular expressions created start with the ^ character and end with the $ character. This is done to ensure that the expression matches the entire string, and ONLY the entire string. Otherwise, you can receive matches for strings that contain more than the intended characters. You may play around with this using my expression tester to see the effects of omitting the ^ and $ characters.
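The effect of the ^ and $ characters can be seen in a small sketch. This is a Python illustration with made-up sample input; the pattern `\d{5}` (five digits) is chosen only because it is short.

```python
import re

unanchored = r"\d{5}"     # five digits anywhere in the input
anchored = r"^\d{5}$"     # five digits and nothing else

text = "abc12345xyz"

# Without anchors, the five digits buried in the middle are enough to match.
loose = re.search(unanchored, text)

# With anchors, the extra characters before and after cause the match to fail.
strict = re.search(anchored, text)

# The anchored pattern still accepts input that is exactly five digits.
exact = re.search(anchored, "12345")

print(loose.group(), strict, exact.group())
```

For validation, the unanchored result is the dangerous one: the input as a whole is not five digits, yet the pattern reports a match.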
Postal Code Validation
Postal code validation is a very common user input validation; typically, a U.S. postal code will be either five digits or nine digits with a hyphen after the fifth digit. We can validate this input with the following expression:
^\d{5}(-\d{4})?$
First we have the "\d{5}" portion of the expression, which indicates that the input must start with five digit characters (0-9). Next, the portion of the expression inside the parentheses, "-\d{4}", indicates a hyphen (-) followed by four digit characters. This is grouped within parentheses and has a question mark appended to the end. The question mark indicates that the input should have zero or one of the preceding item, which here is the entire expression in the parentheses. Therefore, in the case of zero, the input would simply be five digit characters; in the case of one, the input would be five digits, a hyphen, and four more digits.
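As a quick check of this expression, here is a minimal Python sketch. The function name `is_valid_postal_code` and the sample values are assumptions made for illustration; only the pattern itself comes from the article.

```python
import re

# The postal code expression from the article.
POSTAL_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def is_valid_postal_code(value):
    """Hypothetical helper: True for five digits, optionally followed by -dddd."""
    return POSTAL_PATTERN.match(value) is not None

accepted = [v for v in ["50309", "50309-1234"] if is_valid_postal_code(v)]
rejected = [
    v for v in ["5030", "50309-12", "503091234", "50309-"]
    if not is_valid_postal_code(v)
]

print("accepted:", accepted)
print("rejected:", rejected)
```

The "50309-" case shows why the hyphen lives inside the optional group: a hyphen without the four digits that follow it is not allowed.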
Simple Date Validation
Validation of date input is another very common occurrence. Full regular expression date validation is very involved; however, it is very easy to restrict users to a MM/DD/YYYY format with basic checking for incorrect input. Below is a regular expression to validate a date in the MM/DD/YYYY format; I've added parentheses for readability:
^([01]\d)/([0-3]\d)/(\d{4})$
The first section of this expression, "([01]\d)", represents the month portion of our date. Because there are only 12 months in the year, we restrict the first digit to either a zero or a one, and the second character can be any number 0-9. Note that this still accepts values such as 00 and 13 through 19. This is one portion of this example that can be improved upon; you can modify and create regular expressions that are capable of validating that the input is between 1 and 12 (however, this is outside the scope of this article).
The second section of this expression, "([0-3]\d)", represents the day portion of our date. This is separated from our first part by a / character, which is a literal requirement that the month be separated from the day by a forward slash. The day check requires that the first digit of the day is a 0, 1, 2, or 3; the second digit can be any number 0-9. Just as with the month portion, this can be expanded to ensure that the day value is appropriate for the month provided; however, it is outside the scope of this article.
The final section of this expression is again separated by a / character, then it allows for four digit characters to be entered. This forms the year, the final portion of the date.
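The sketch below runs the date expression against a few inputs and shows how the three parenthesized groups expose the month, day, and year. It is a Python illustration; `parse_date_parts` is a hypothetical helper and the sample dates are made up. It also demonstrates the limitation described above: an out-of-range month or day still passes the format check.

```python
import re

# The MM/DD/YYYY expression from the article.
DATE_PATTERN = re.compile(r"^([01]\d)/([0-3]\d)/(\d{4})$")

def parse_date_parts(value):
    """Hypothetical helper: (month, day, year) strings, or None if the format is wrong."""
    match = DATE_PATTERN.match(value)
    return match.groups() if match else None

well_formed = parse_date_parts("12/25/2007")     # correct format
short_parts = parse_date_parts("1/5/2007")       # month and day need two digits
wrong_separator = parse_date_parts("12-25-2007") # hyphens instead of slashes
short_year = parse_date_parts("12/25/07")        # year needs four digits
out_of_range = parse_date_parts("19/39/2007")    # passes: only the format is checked

print(well_formed, short_parts, wrong_separator, short_year, out_of_range)
```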
Phone Number Validation
Another common input item to validate is a phone number, including the area code and an extension. Below is a sample regular expression that validates a phone number that meets one of the following formats: (555) 555-1212, 555 555-1212, (555) 555-1212 x1111, or 555 555-1212 x1111. The expression is made up of three parenthesized groups, which are explained below:
^(\(\d{3}\)\s|\d{3}\s)(\d{3}[\s-]\d{4})(\sx\d+)?$
The first group, "(\(\d{3}\)\s|\d{3}\s)", validates the area code input. Notice that we have two alternatives separated by the or operator (|). This indicates that one of the two expressions must match. The first one validates on a left parenthesis, three digits, a right parenthesis, and a space; the second option validates on three digits and a space. Therefore, the phone number must begin with either "(515) " or "515 "; this validates the area code portion of our phone number.
The second group, "(\d{3}[\s-]\d{4})", validates the remaining portion of the standard phone number. The first part, "\d{3}", requires three digits, then the "[\s-]" allows for either a space or a hyphen. This is then followed by the "\d{4}" portion, which indicates that an additional four digits are required. We now have validation for a standard 10-digit phone number with support for multiple formats.
The third group, "(\sx\d+)?", validates the optional telephone extension. The expression "\sx\d+" indicates that the input string should have a space, the letter x, and then one or more digits. This is enclosed in parentheses and followed by a question mark to indicate that it is optional. This provides for validation of numbers such as (555) 555-1212 x102.
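To see the three parts of the phone expression at work, here is a minimal Python sketch. It assumes the expression is written with its three groups back to back, with no whitespace between them; `split_phone` is a hypothetical helper and the sample numbers are illustrative.

```python
import re

# Area code group, main number group, and optional extension group.
PHONE_PATTERN = re.compile(r"^(\(\d{3}\)\s|\d{3}\s)(\d{3}[\s-]\d{4})(\sx\d+)?$")

def split_phone(value):
    """Hypothetical helper: (area code, number, extension) or None if invalid."""
    match = PHONE_PATTERN.match(value)
    return match.groups() if match else None

with_extension = split_phone("(555) 555-1212 x102")
plain_area_code = split_phone("555 555-1212")
space_separated = split_phone("555 555 1212")
no_separators = split_phone("5555551212")          # rejected: area code needs a space
empty_extension = split_phone("(555) 555-1212 x")  # rejected: x needs digits

print(with_extension, plain_area_code, space_separated)
print(no_separators, empty_extension)
```

When the extension is absent, the third group is reported as None, which makes it easy for calling code to tell the two cases apart.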
This should provide a helpful overview of regular expressions. Stay tuned for Part II.
Mitchel W. Sellers is a Microsoft Certified Professional Developer with multiple specializations. He's been developing in .NET since shortly after the release of .NET 1.1. He is the Co-Founder of a startup software consulting firm, IowaComputerGurus L.L.P. He is also very active in multiple online communities, including GotDotNet and DotNetNuke. Find out more about him at http://www.mitchelsellers.com or e-mail him at mitchel.sellers@gmail.com.