For example, if you need to search an entire web site to remove some outdated material and replace some HTML formatting tags, you can use a regular expression to test each file to see whether the material or the HTML formatting tags you are looking for exist in that file. That way, you can narrow down the affected files to only those that contain the material that has to be removed or changed. You can then use a regular expression to remove the outdated material, and finally, you can use regular expressions to search for and replace the tags that need replacing.
Another example of where a regular expression is useful occurs in a language that is not known for its string-handling ability. Regular expressions provide a significant improvement in string handling for JScript. Regular expressions may also be more efficient to use in VBScript, allowing you to perform multiple string manipulations in a single expression.
Early Beginnings
Regular expressions trace their ancestry back to early research on how the human nervous system works. Warren McCulloch and Walter Pitts, a pair of neurophysiologists, developed a mathematical way of describing these neural networks. In 1956, a mathematician named Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper entitled "Representation of Events in Nerve Nets" that introduced the concept of regular expressions. Regular expressions were expressions used to describe what he called "the algebra of regular sets," hence the term "regular expression." Subsequently, his work found its way into some early efforts with computational search algorithms done by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was in the Unix editor called qed. And the rest, as they say, is history. Regular expressions have been an important part of text-based editors and search tools ever since.
Regular Expressions
Unless you have worked with regular expressions before, the term and the concept may be unfamiliar to you. However, they may not be as unfamiliar as you think. Think about how you search for files on your hard disk. You most likely use the ? and * characters to help find the files you're looking for. The ? character matches a single character in a file name, while the * matches zero or more characters. A pattern such as 'data?.dat' would find the following files:
data1.dat
data2.dat
datax.dat
dataN.dat
Using the * character instead of the ? character expands the number of files found. 'data*.dat' matches all of the following:
data.dat
data1.dat
data2.dat
data12.dat
datax.dat
dataXYZ.dat
While this method of searching for files can certainly be useful, it is also very limited. The limited ability of the ? and * wildcard characters gives you an idea of what regular expressions can do, but regular expressions are much more powerful and flexible.
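
To make the comparison concrete, the two wildcard patterns can be rewritten as regular expressions. The following is a minimal sketch in modern JavaScript syntax (an assumption; the patterns themselves carry over to JScript). The `files` array simply repeats the sample names listed above, and the variable names are illustrative only.

```javascript
// Sample file names taken from the lists above.
const files = [
  "data.dat", "data1.dat", "data2.dat", "data12.dat",
  "datax.dat", "dataN.dat", "dataXYZ.dat",
];

// 'data?.dat': the wildcard '?' (exactly one character) becomes '.'.
const singleChar = /^data.\.dat$/;

// 'data*.dat': the wildcard '*' (zero or more characters) becomes '.*'.
const anyChars = /^data.*\.dat$/;

const singleMatches = files.filter((name) => singleChar.test(name));
const anyMatches = files.filter((name) => anyChars.test(name));

console.log(singleMatches); // the four names with one character before ".dat"
console.log(anyMatches);    // all seven names
```

Note that the period before 'dat' has to be escaped in the regular expression, because an unescaped period is itself a metacharacter.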
Regular Expression Syntax
A regular expression is a pattern of text that consists of ordinary characters (for example, letters a through z) and special characters, known as metacharacters. The pattern describes one or more strings to match when searching a body of text. The regular expression serves as a template for matching a character pattern to the string being searched. Here are some examples of regular expressions you might encounter:
Pattern | Matches |
---|---|
/^[ \t]*$/ | Match a blank line. |
/\d{2}-\d{5}/ | Validate an ID number consisting of 2 digits, a hyphen, and another 5 digits. |
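
The two kinds of pattern in the table can be tried out as follows. This is a small sketch in modern JavaScript syntax, with two assumptions: a blank line is taken to mean a line that is empty or holds only spaces and tabs, written /^[ \t]*$/, and the ID pattern is given ^ and $ anchors so that it validates a whole string rather than finding an ID somewhere inside it.

```javascript
// A line that is empty or holds only spaces and tabs.
const blankLine = /^[ \t]*$/;

// Two digits, a hyphen, five digits. The ^ and $ anchors are an addition
// for validation; without them the pattern also succeeds inside longer text.
const idNumber = /^\d{2}-\d{5}$/;
const idAnywhere = /\d{2}-\d{5}/;

console.log(blankLine.test(""));            // true
console.log(blankLine.test(" \t "));        // true
console.log(blankLine.test("  text"));      // false
console.log(idNumber.test("12-34567"));     // true
console.log(idNumber.test("123-456789"));   // false
console.log(idAnywhere.test("123-456789")); // true: "23-45678" is found inside
```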
The following table contains the complete list of metacharacters and their behavior in the context of regular expressions:
Character | Description |
---|---|
\ | Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(". |
^ | Matches the position at the beginning of the input string. If the RegExp object's Multiline property is set, ^ also matches the position following '\n' or '\r'. |
$ | Matches the position at the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position preceding '\n' or '\r'. |
* | Matches the preceding character or subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}. |
+ | Matches the preceding character or subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}. |
? | Matches the preceding character or subexpression zero or one time. For example, 'do(es)?' matches the "do" in "do" or "does". ? is equivalent to {0,1}. |
{n} | n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob," but matches the two o's in "food". |
{n,} | n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob" and matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'. |
{n,m} | m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers. |
? | When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of the searched string as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's. |
. | Matches any single character except "\n". To match any character including the '\n', use a pattern such as '[\s\S]'. |
(pattern) | Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0-$9 properties in JScript. To match parentheses characters ( ), use '\(' or '\)'. |
(?:pattern) | Matches pattern but does not capture the match, that is, it is a non-capturing match that is not stored for possible later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'. |
(?=pattern) | Positive lookahead matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead. |
(?!pattern) | Negative lookahead matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but does not match "Windows" in "Windows 2000". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead. |
x|y | Matches either x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food". |
[xyz] | A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain". |
[^xyz] | A negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain". |
[a-z] | A range of characters. Matches any character in the specified range. For example, '[a-z]' matches any lowercase alphabetic character in the range 'a' through 'z'. |
[^a-z] | A negative range of characters. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range 'a' through 'z'. |
\b | Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb". |
\B | Matches a nonword boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never". |
\cx | Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range of A-Z or a-z. If not, c is assumed to be a literal 'c' character. |
\d | Matches a digit character. Equivalent to [0-9]. |
\D | Matches a nondigit character. Equivalent to [^0-9]. |
\f | Matches a form-feed character. Equivalent to \x0c and \cL. |
\n | Matches a newline character. Equivalent to \x0a and \cJ. |
\r | Matches a carriage return character. Equivalent to \x0d and \cM. |
\s | Matches any white space character including space, tab, form-feed, and so on. Equivalent to [ \f\n\r\t\v]. |
\S | Matches any non-white space character. Equivalent to [^ \f\n\r\t\v]. |
\t | Matches a tab character. Equivalent to \x09 and \cI. |
\v | Matches a vertical tab character. Equivalent to \x0b and \cK. |
\w | Matches any word character including underscore. Equivalent to '[A-Za-z0-9_]'. |
\W | Matches any nonword character. Equivalent to '[^A-Za-z0-9_]'. |
\xn | Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' & "1". Allows ASCII codes to be used in regular expressions. |
\num | Matches num, where num is a positive integer. A reference back to captured matches. For example, '(.)\1' matches two consecutive identical characters. |
\n | Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, n is an octal escape value if n is an octal digit (0-7). |
\nm | Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by literal m. If neither of the preceding conditions exists, \nm matches octal escape value nm when n and m are octal digits (0-7). |
\nml | Matches octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7). |
\un | Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©). |
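
A few of the table entries are easier to follow when run. The sketch below uses modern JavaScript syntax (an assumption; the table itself describes JScript and VBScript) and reuses the table's own examples for backreferences, non-capturing groups, and word boundaries. All variable names are illustrative.

```javascript
// '(.)\1': any character followed by a second copy of itself.
const doubled = /(.)\1/.exec("food");
const doubledText = doubled[0];   // the two o's
const capturedChar = doubled[1];  // the single captured "o"

// A capturing group stores its submatch; a non-capturing group does not.
const capturing = "industries".match(/industr(y|ies)/);
const nonCapturing = "industries".match(/industr(?:y|ies)/);

// \b needs a word boundary after 'er'; \B needs the absence of one.
const boundaryChecks = {
  neverEndsWord: /er\b/.test("never"),
  verbEndsWord: /er\b/.test("verb"),
  verbInsideWord: /er\B/.test("verb"),
  neverInsideWord: /er\B/.test("never"),
};

console.log(doubledText, capturedChar);             // oo o
console.log(capturing.length, nonCapturing.length); // 2 1
console.log(boundaryChecks);
```

The match arrays differ in length because the first element is always the whole match and each capturing group adds one more element.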
Alternation and Grouping
Alternation uses the '|' character to allow a choice between two or more alternatives. You can expand the chapter heading regular expression to cover more than just chapter headings. However, it is not as straightforward as you might think. When alternation is used, the largest possible expression on either side of the '|' character is matched. You might think that the following expression matches either 'Chapter' or 'Section' followed by one or two digits, spanning a line from its beginning to its end:
/^Chapter|Section [1-9][0-9]{0,1}$/
Unfortunately, the regular expression shown above matches either the word 'Chapter' at the beginning of a line, or 'Section' and whatever numbers follow that, at the end of the line. If the input string is 'Chapter 22', the expression shown above only matches the word 'Chapter'. If the input string is 'Section 22', the expression matches 'Section 22'. That is not the intent here, so there must be a way to make the regular expression do what you are trying to do, and there is.
You can use parentheses to limit the scope of the alternation, that is, make sure that it applies only to the two words, 'Chapter' and 'Section'. However, parentheses are also used to create subexpressions and possibly capture them for later use, something that is covered in the section on backreferences. By taking the regular expression shown above and adding parentheses in the appropriate places, you can make the regular expression match either 'Chapter 1' or 'Section 3'. The following regular expression uses parentheses to group 'Chapter' and 'Section' so the expression works properly:
/^(Chapter|Section) [1-9][0-9]{0,1}$/
Although this expression works properly, the parentheses around 'Chapter|Section' also cause either of the two matching words to be captured for future use. Since there is only one set of parentheses in the expression shown above, there is only one captured submatch. This submatch can be referred to using the SubMatches collection in VBScript or the $1-$9 properties of the RegExp object in JScript.
In the above example, you merely want to use the parentheses to group a choice between the words 'Chapter' and 'Section'. To prevent the match from being saved for possible later use, place '?:' before the regular expression pattern inside the parentheses. The following modification provides the same capability without saving the submatch:
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/
In addition to the '?:' metacharacters, there are two other non-capturing metacharacters used for something called lookahead matches. A positive lookahead, specified using '?=', matches the search string at any point where a matching regular expression pattern in parentheses begins. A negative lookahead, specified using '?!', matches the search string at any point where a string not matching the regular expression pattern begins.
For example, suppose you have a document containing references to Windows 3.1, Windows 95, Windows 98, and Windows NT. Suppose further that you need to update the document by finding all the references to Windows 95, Windows 98, and Windows NT and changing those references to Windows 2000. You can use the following JScript regular expression, which is an example of a positive lookahead, to match Windows 95, Windows 98, and Windows NT:
/Windows (?=95|98|NT)/
Once the match is found, the search for the next match begins immediately following the matched text, not including the characters included in the lookahead. For example, if the expression shown above matched 'Windows 98', the search resumes after 'Windows' and the space that follows it, not after '98'.
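
The effect of grouping on alternation, and the way a lookahead leaves the version number out of the match, can be checked with a short sketch. It is written in modern JavaScript syntax (an assumption), and the lookahead is written with the space before the version number, as in 'Windows (?=95|98|NT)', which is an assumption about the intended pattern. The sample text and variable names are illustrative.

```javascript
const ungrouped = /^Chapter|Section [1-9][0-9]{0,1}$/;
const grouped = /^(Chapter|Section) [1-9][0-9]{0,1}$/;
const groupedNoCapture = /^(?:Chapter|Section) [1-9][0-9]{0,1}$/;

// Without parentheses, '|' splits the whole expression in two.
const chapterUngrouped = "Chapter 22".match(ungrouped); // only "Chapter"
const sectionUngrouped = "Section 22".match(ungrouped); // "Section 22"

// With parentheses, the alternation covers only the two words.
const chapterGrouped = "Chapter 22".match(grouped);             // match plus one submatch
const chapterNoCapture = "Chapter 22".match(groupedNoCapture);  // match, no submatch

// Positive lookahead: the version must follow, but is not part of the match.
const lookahead = /Windows (?=95|98|NT)/g;
const sample = "Windows 3.1, Windows 95, Windows 98 and Windows NT";
const found = Array.from(sample.matchAll(lookahead), (m) => m[0]);

console.log(chapterUngrouped[0], "|", sectionUngrouped[0]);
console.log(chapterGrouped.length, chapterNoCapture.length);
console.log(found.length); // 3: the reference to Windows 3.1 is skipped
```

Each entry in `found` is just "Windows " with its trailing space, which shows that the characters inspected by the lookahead are not consumed.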
So far, the examples you've seen have been concerned only with finding chapter headings wherever they occur. Any occurrence of the string 'Chapter' followed by a space, followed by a number, could be an actual chapter heading, or it could also be a cross-reference to another chapter. Since true chapter headings always appear at the beginning of a line, you'll need to devise a way to find only the headings and not find the cross-references. Anchors provide that capability. Anchors allow you to fix a regular expression to either the beginning or end of a line. They also allow you to create regular expressions that occur either within a word or at the beginning or end of a word. The following table contains the list of regular expression anchors and their meanings:
Character | Description |
---|---|
^ | Matches the position at the beginning of the input string. If the RegExp object's Multiline property is set, ^ also matches the position following '\n' or '\r'. |
$ | Matches the position at the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position preceding '\n' or '\r'. |
\b | Matches a word boundary, that is, the position between a word and a space. |
\B | Matches a nonword boundary. |
You cannot use a quantifier with an anchor. Since you cannot have more than one position immediately before or after a newline or word boundary, expressions such as '^*' are not permitted. To match text at the beginning of a line of text, use the '^' character at the beginning of the regular expression. Do not confuse this use of the '^' with the use within a bracket expression. To match text at the end of a line of text, use the '$' character at the end of the regular expression.
To use anchors when searching for chapter headings, the following JScript regular expression matches a chapter heading with up to two following digits that occurs at the beginning of a line:
/^Chapter [1-9][0-9]{0,1}/
Not only does a true chapter heading occur at the beginning of a line, it is also the only text on the line, so it must also be at the end of the line. The following expression ensures that the match specified only matches chapters and not cross-references. It does so by creating a regular expression that matches only at the beginning and end of a line of text.
/^Chapter [1-9][0-9]{0,1}$/
Matching word boundaries is a little different but adds a very important capability to regular expressions. A word boundary is the position between a word and a space. A nonword boundary is any other position. The following JScript expression matches the first three characters of the word 'Chapter' because they appear following a word boundary:
/\bCha/
The position of the '\b' operator is critical. If it is positioned at the beginning of a string to be matched, it looks for the match at the beginning of the word; if it is positioned at the end of the string, it looks for the match at the end of the word. For example, the following expression matches 'ter' in the word 'Chapter' because it appears before a word boundary:
/ter\b/
The following expression matches 'apt' as it occurs in 'Chapter', but not as it occurs in 'aptitude':
/\Bapt/
The string 'apt' occurs on a nonword boundary in the word 'Chapter' but on a word boundary in the word 'aptitude'. For the \B nonword boundary operator, position is not important because the match is not relative to the beginning or end of a word.
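
The anchored heading pattern and the word-boundary examples can be exercised as follows. This is a sketch in modern JavaScript syntax (an assumption), where the `m` flag plays the role of the Multiline property and `g` collects every match. The sample text is invented for illustration.

```javascript
// Three lines: two true headings and one cross-reference in running text.
const documentText = "Chapter 1\nSee Chapter 2 for details.\nChapter 12";

// ^ and $ match at line starts and ends because of the m (multiline) flag.
const headingPattern = /^Chapter [1-9][0-9]{0,1}$/gm;
const headings = documentText.match(headingPattern);

// Without anchors, the cross-reference is found as well.
const anywhere = documentText.match(/Chapter [1-9][0-9]{0,1}/g);

// Word-boundary examples from the text.
const startsWord = /\bCha/.test("Chapter");     // 'Cha' follows a boundary
const endsWord = /ter\b/.test("Chapter");       // 'ter' precedes a boundary
const insideChapter = /\Bapt/.test("Chapter");  // 'apt' is inside the word
const insideAptitude = /\Bapt/.test("aptitude"); // 'apt' starts the word

console.log(headings); // the two headings only
console.log(anywhere.length); // 3
console.log(startsWord, endsWord, insideChapter, insideAptitude);
```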
The period (.) matches any single printing or non-printing character in a string, except a newline character (\n). The following JScript regular expression matches 'aac', 'abc', 'acc', 'adc', and so on, as well as 'a1c', 'a2c', 'a-c', and 'a#c':
/a.c/
If you are trying to match a string containing a file name where a period (.) is part of the input string, you do so by preceding the period in the regular expression with a backslash (\) character. To illustrate, the following JScript regular expression matches 'filename.ext':
/filename\.ext/
These expressions are still pretty limited. They only let you match any single character. Many times, it is useful to match specified characters from a list. For example, if you have an input text that contains chapter headings that are expressed numerically as Chapter 1, Chapter 2, and so on, you might want to find those chapter headings.
Bracket Expressions
You can create a list of matching characters by placing one or more individual characters within square brackets ([ and ]). When characters are enclosed in brackets, the list is called a bracket expression. Within brackets, as anywhere else, ordinary characters represent themselves, that is, they match an occurrence of themselves in the input text. Most special characters lose their meaning when they occur inside a bracket expression. Here are some exceptions:
- The ']' character ends a list if it is not the first item. To match the ']' character in a list, place it first, immediately following the opening '['.
- The '\' character continues to be the escape character. To match the '\' character, use '\\'.
Characters enclosed in a bracket expression match only a single character for the position in the regular expression where the bracket expression appears. The following JScript regular expression matches 'Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', and 'Chapter 5':
/Chapter [12345]/
Notice that the word 'Chapter' and the space that follows are fixed in position relative to the characters within brackets. The bracket expression, then, is used to specify only the set of characters that matches the single character position immediately following the word 'Chapter' and a space. That is the ninth character position.
If you want to express the matching characters using a range instead of the characters themselves, you can separate the beginning and ending characters in the range using the hyphen (-) character. The character value of the individual characters determines their relative order within a range. The following JScript regular expression contains a range expression that is equivalent to the bracketed list shown above.
/Chapter [1-5]/
When a range is specified in this manner, both the starting and ending values are included in the range. It is important to note that the starting value must precede the ending value in Unicode sort order. If you want to include the hyphen character in your bracket expression, you must do one of the following:
- Escape it with a backslash: [\-]
- Put the hyphen character at the beginning or the end of the bracketed list. Both of the following expressions match all lowercase letters and the hyphen: [-a-z] and [a-z-]
- Create a range where the beginning character value is lower than the hyphen character and the ending character value is equal to or greater than the hyphen. Both of the following regular expressions satisfy this requirement: [!--] and [!-~]
You can also find all the characters not in the list or range by placing the caret (^) character at the beginning of the list. If the caret character appears in any other position within the list, it matches itself, that is, it has no special meaning. The following JScript regular expression matches chapter headings whose number does not begin with 1 through 5:
/Chapter [^12345]/
In the example shown above, the expression matches any character in the ninth position except 1, 2, 3, 4, or 5. So, for example, 'Chapter 7' is a match and so is 'Chapter 9'. Because the negated list is not limited to digits, 'Chapter 0' and 'Chapter X' match as well. The same expression can be represented using the hyphen character (-).
/Chapter [^1-5]/
A typical use of a bracket expression is to specify matches of any upper- or lowercase alphabetic characters or any digits. The following JScript expression specifies such a match:
/[A-Za-z0-9]/
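
The list, range, and negated forms can be compared side by side. The following sketch uses modern JavaScript syntax (an assumption) and a handful of invented sample strings. It also shows that a negated list accepts any character outside the list, not only digits.

```javascript
const samples = ["Chapter 1", "Chapter 5", "Chapter 7", "Chapter 0", "Chapter X"];

const list = /Chapter [12345]/;   // explicit list
const range = /Chapter [1-5]/;    // equivalent range
const negated = /Chapter [^1-5]/; // anything except 1 through 5

const listResults = samples.map((s) => list.test(s));
const rangeResults = samples.map((s) => range.test(s));
const negatedResults = samples.map((s) => negated.test(s));

// Three ways to include a literal hyphen in a bracket expression.
const hyphenEscaped = /[\-]/.test("-");
const hyphenFirst = /[-a-z]/.test("-");
const hyphenLast = /[a-z-]/.test("-");

// A period matches any one character; an escaped period matches only '.'.
const dotAny = /a.c/.test("a#c");
const dotLiteral = /filename\.ext/.test("filenameXext");

console.log(listResults, rangeResults, negatedResults);
console.log(hyphenEscaped, hyphenFirst, hyphenLast, dotAny, dotLiteral);
```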
There are a number of useful non-printing characters that must be used occasionally. The following table shows the escape sequences used to represent those non-printing characters:
Character | Meaning |
---|---|
\cx | Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range of A-Z or a-z. If not, c is assumed to be a literal 'c' character. |
\f | Matches a form-feed character. Equivalent to \x0c and \cL. |
\n | Matches a newline character. Equivalent to \x0a and \cJ. |
\r | Matches a carriage return character. Equivalent to \x0d and \cM. |
\s | Matches any white space character including space, tab, form-feed, and so on. Equivalent to [ \f\n\r\t\v]. |
\S | Matches any non-white space character. Equivalent to [^ \f\n\r\t\v]. |
\t | Matches a tab character. Equivalent to \x09 and \cI. |
\v | Matches a vertical tab character. Equivalent to \x0b and \cK. |
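
The equivalences listed in the table can be confirmed directly, and the same escapes are handy for everyday chores such as splitting text into lines. This is a brief sketch in modern JavaScript syntax (an assumption); the sample strings are invented.

```javascript
// Each pair names the same character in two of the ways the table lists.
const equivalents = [
  [/\cM/, "\r"], [/\x0d/, "\r"],
  [/\cJ/, "\n"], [/\x0a/, "\n"],
  [/\cI/, "\t"], [/\x09/, "\t"],
  [/\cL/, "\f"], [/\x0c/, "\f"],
  [/\cK/, "\v"], [/\x0b/, "\v"],
];
const allAgree = equivalents.every(([pattern, ch]) => pattern.test(ch));

// Split on any line ending: carriage return plus newline, or either alone.
const lines = "one\r\ntwo\nthree\rfour".split(/\r\n|\r|\n/);

// Collapse any run of white space into a single space.
const collapsed = "a \t\n b".replace(/\s+/g, " ");

console.log(allAgree);  // true
console.log(lines);     // four entries
console.log(collapsed); // a b
```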
Order of Precedence
Once you have constructed a regular expression, it is evaluated much like an arithmetic expression, that is, it is evaluated from left to right and follows an order of precedence. The following table illustrates, from highest to lowest, the order of precedence of the various regular expression operators:
Operator(s) | Description |
---|---|
\ | Escape |
(), (?:), (?=), [] | Parentheses and Brackets |
*, +, ?, {n}, {n,}, {n,m} | Quantifiers |
^, $, \anymetacharacter | Anchors and Sequences |
| | Alternation |
Characters have higher precedence than the alternation operator, which allows 'm|food' to match "m" or "food". To match "mood" or "food", use parentheses to create a subexpression, which results in '(m|f)ood'.
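
Precedence is easiest to see by comparing what each form actually matches. The sketch below, in modern JavaScript syntax (an assumption), contrasts alternation with and without parentheses and shows that a quantifier binds to the single item before it unless parentheses say otherwise.

```javascript
// Alternation has the lowest precedence: 'm|food' means "m" or "food".
const moodUngrouped = "mood".match(/m|food/)[0]; // just the "m"
const foodUngrouped = "food".match(/m|food/)[0]; // the whole word

// Parentheses confine the alternation to the first letter.
const moodGrouped = "mood".match(/(m|f)ood/)[0];

// Quantifiers bind tighter than sequences: 'ab+' repeats only the 'b'.
const repeatB = "abab".match(/ab+/)[0];
const repeatAB = "abab".match(/(ab)+/)[0];

console.log(moodUngrouped, foodUngrouped, moodGrouped); // m food mood
console.log(repeatB, repeatAB);                         // ab abab
```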
Ordinary Characters
Ordinary characters consist of all printable and non-printable characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase alphabetic characters, all digits, all punctuation marks, and some symbols. The simplest form of a regular expression is a single, ordinary character that matches itself in a searched string. For example, the single-character pattern 'A' matches the letter 'A' wherever it appears in the searched string. Here are some examples of single-character regular expression patterns:
/a/
/7/
/M/
You can combine a number of single characters to form a larger expression. For example, the following JScript regular expression is nothing more than an expression created by combining the single-character expressions 'a', '7', and 'M'.
/a7M/
Notice that there is no concatenation operator. All that is required is that you just put one character after another.
Quantifiers
Sometimes, you do not know how many characters there are to match. In order to accommodate that kind of uncertainty, regular expressions support the concept of quantifiers. These quantifiers let you specify how many times a given component of your regular expression must occur for your match to be true. The following table illustrates the various quantifiers and their meanings:
Character | Description |
---|---|
* | Matches the preceding character or subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}. |
+ | Matches the preceding character or subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}. |
? | Matches the preceding character or subexpression zero or one time. For example, 'do(es)?' matches the "do" in "do" or "does". ? is equivalent to {0,1}. |
{n} | n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob," but matches the two o's in "food". |
{n,} | n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob" and matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'. |
{n,m} | m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers. |
With a large input document, chapter numbers could easily exceed nine, so you need a way to handle two or three digit chapter numbers. Quantifiers give you that capability. The following JScript regular expression matches chapter headings with any number of digits:
/Chapter [1-9][0-9]*/
Notice that the quantifier appears after the range expression. Therefore, it applies to the entire range expression which, in this case, specifies only digits from 0 through 9, inclusive. The '+' quantifier is not used here because there does not necessarily need to be a digit in the second or subsequent position. The '?' character also is not used because it limits the chapter numbers to only two digits. You want to match at least one digit following 'Chapter' and a space character.
If you know that your chapter numbers are limited to only 99 chapters, you can use the following JScript expression to specify at least one, but not more than two, digits:
/Chapter [0-9]{1,2}/
The disadvantage of the expression shown above is that if there is a chapter number greater than 99, it still matches only the first two digits. Another disadvantage is that somebody could create a Chapter 0 and it would match. Better JScript expressions for matching only two digits are the following:
/Chapter [1-9][0-9]?/
-or-
/Chapter [1-9][0-9]{0,1}/
The '*', '+', and '?' quantifiers are all what are referred to as greedy, that is, they match as much text as possible. Sometimes that is not at all what you want to happen. Sometimes, you just want a minimal match. Say, for example, you are searching an HTML document for an occurrence of a chapter title enclosed in an H1 tag. That text appears in your document as:
<H1>Chapter 1 - Introduction to Regular Expressions</H1>
The following expression matches everything from the opening less than symbol (<) to the greater than symbol (>) at the end of the closing H1 tag.
/<.*>/
If all you really wanted to match was the opening H1 tag, the following non-greedy expression matches only <H1>:
/<.*?>/
-or-
"<.*?>"
By placing the '?' after a '*', '+', or '?' quantifier, the expression is transformed from a greedy to a non-greedy, or minimal, match.
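
The chapter-number quantifiers and the greedy versus non-greedy distinction can be verified with a short sketch. It uses modern JavaScript syntax (an assumption) and takes the heading to be wrapped in H1 tags as the text describes. Variable names are illustrative.

```javascript
// Any number of digits after a leading 1-9.
const anyDigits = "Chapter 123".match(/Chapter [1-9][0-9]*/)[0];

// {1,2} stops after two digits and also accepts a Chapter 0.
const twoDigitsMax = "Chapter 123".match(/Chapter [0-9]{1,2}/)[0];
const zeroAccepted = /Chapter [0-9]{1,2}/.test("Chapter 0");

// Requiring 1-9 first rejects Chapter 0.
const zeroRejected = /Chapter [1-9][0-9]?/.test("Chapter 0");

// Greedy '.*' runs to the last '>', non-greedy '.*?' stops at the first.
const html = "<H1>Chapter 1 - Introduction to Regular Expressions</H1>";
const greedy = html.match(/<.*>/)[0];
const minimal = html.match(/<.*?>/)[0];

console.log(anyDigits);                  // Chapter 123
console.log(twoDigitsMax);               // Chapter 12
console.log(zeroAccepted, zeroRejected); // true false
console.log(greedy === html, minimal);   // true <H1>
```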
Special Characters
There are a number of metacharacters that require special treatment when trying to match them. To match these special characters, you must first escape those characters, that is, precede them with a backslash character (\). The following table shows those special characters and their meanings:
Special Character | Comment |
---|---|
$ | Matches the position at the end of an input string. If the RegExp object's Multiline property is set, $ also matches the position preceding '\n' or '\r'. To match the $ character itself, use \$. |
( ) | Marks the beginning and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \). |
* | Matches the preceding character or subexpression zero or more times. To match the * character, use \*. |
+ | Matches the preceding character or subexpression one or more times. To match the + character, use \+. |
. | Matches any single character except the newline character \n. To match ., use \. |
[ ] | Marks the beginning and end of a bracket expression. To match these characters, use \[ and \]. |
? | Matches the preceding character or subexpression zero or one time, or indicates a non-greedy quantifier. To match the ? character, use \?. |
\ | Marks the next character as either a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(". |
/ | Denotes the start or end of a literal regular expression. To match the '/' character, use '\/'. |
^ | Matches the position at the beginning of an input string except when used in a bracket expression where it negates the character set. To match the ^ character itself, use \^. |
{ } | Marks the beginning and end of a quantifier expression. To match these characters, use \{ and \}. |
| | Indicates a choice between two items. To match |, use \|. |
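
When the text to search for comes from user input or a file, escaping each special character by hand is error-prone. The sketch below, in modern JavaScript syntax (an assumption), shows hand-written escapes and a small helper that applies the rule from the table automatically. `escapeRegExp` is a hypothetical helper written for this example, not a built-in function of JScript or VBScript.

```javascript
// Escaping by hand: '$' and '.' are matched literally.
const priceLiteral = /\$5\.00/;
const matchesPrice = priceLiteral.test("Total: $5.00");
const matchesOther = priceLiteral.test("Total: $5x00");

// Hypothetical helper: put a backslash in front of every special
// character listed in the table above.
function escapeRegExp(text) {
  return text.replace(/[$()*+.\[\]?\\\/^{}|]/g, "\\$&");
}

const phrase = "$5.00 (each)";
const escapedSource = escapeRegExp(phrase);
const safePattern = new RegExp(escapedSource);
const foundPhrase = safePattern.test("Price: $5.00 (each)");

// Unescaped, the same text is read as metacharacters and fails to match.
const unescapedFound = new RegExp("\\$5.00 (each)").test("Price: $5.00 (each)");

console.log(matchesPrice, matchesOther); // true false
console.log(escapedSource);              // \$5\.00 \(each\)
console.log(foundPhrase, unescapedFound); // true false
```

In the last line the parentheses are treated as a capturing group rather than literal characters, so the pattern looks for "each" without the surrounding parentheses and the overall match fails.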