Regular Expressions
Regular expressions are
one of those small technology areas that are incredibly useful in
a wide range of programs, yet rarely used by developers. You can
think of regular expressions as a mini-programming language with
one specific purpose: to locate substrings within a large string
expression. It is not a new technology; it originated in the Unix
environment and is commonly used with the Perl programming
language. Microsoft ported it onto Windows, where up until now it
has been used mostly with scripting languages. Regular expressions
are today, however, supported by a number of .NET classes in the
namespace System.Text.RegularExpressions. You can also find
the use of regular expressions in various parts of the .NET
Framework. For instance, you will find that they are used within
the ASP.NET Validation server controls.
If you are not familiar with the regular
expressions language, this section gives a very basic introduction
to both regular expressions and their related .NET classes. If you
are already familiar with regular expressions, you’ll probably want
to just skim through this section to pick out the references to the
.NET base classes. You might like to know that the .NET regular
expression engine is designed to be mostly compatible with Perl 5
regular expressions, although it has a few extra features.
Introduction to Regular Expressions
The regular expressions language is designed
specifically for string processing. It contains two features:
-
A set of escape codes
for identifying specific types of characters. You will be familiar
with the use of the * character to
represent any substring in DOS expressions. (For example, the DOS
command Dir Re* lists the files with
names beginning with Re.) Regular
expressions use many sequences like this to represent items such as
any one character, a
word break, one optional character, and
so on.
-
A system for grouping parts of substrings and
intermediate results during a search operation.
With regular expressions, you can perform quite
sophisticated and high-level operations on strings. For example,
you can:
-
Identify (and perhaps either flag or remove)
all repeated words in a string (for example, “The computer site
site” to “The computer site”)
-
Convert all words to title case (for example,
“this is a Title” to “This Is A Title”)
-
Convert all words longer than three
characters to title case (for example, “this is a Title” to “This
is a Title”)
-
Ensure that sentences are properly
capitalized
-
Separate the various elements of a URI (for
example, given http://www.wrox.com, extract the
protocol, computer name, file name, and so on)
Of course, all of these tasks can be performed in
C# using the various methods on System.String and System.Text.StringBuilder. However, in some cases,
this would involve writing a fair amount of C# code. If you use
regular expressions, this code can normally be compressed to just a
couple of lines. Essentially, you instantiate a System.Text.RegularExpressions.Regex object (or,
even simpler, invoke one of the static methods of the Regex
class), pass it the string to be processed, and pass in a regular
expression (a string containing the instructions in the regular
expressions language), and you’re done.
A regular expression string looks at first sight
rather like a regular string, but interspersed with escape
sequences and other characters that have a special meaning. For
example, the sequence \b indicates
the beginning or end of a
word (a word boundary), so if you wanted to indicate you were
looking for the characters th at the
beginning of a word, you would search for the regular expression
\bth (that is, the sequence word
boundary-t-h). If you wanted to search for all occurrences of
th at the end of a word, you would write
th\b (the sequence t-h-word boundary).
However, regular expressions are much more sophisticated than that
and include, for example, facilities to store portions of text that
are found in a search operation. This section merely scratches the
surface of the power of regular expressions.
Suppose your application needed to convert U.S.
phone numbers to an international format. In the United States, the
phone numbers have this format: 314-123-1234, which is often
written as (314) 123-1234. When converting this national format to
an international format you have to include +1 (the country code of
the United States) and add brackets around the area code: +1 (314)
123-1234. As find-and-replace operations go, that’s not too
complicated, but it would still require some coding effort if you were
going to use the String class for this
purpose (which would mean that you would have to write your code
using the methods available on System.String). The regular expressions language
allows you to construct a short string that achieves the same
result.
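As a rough illustration of that claim, the conversion might be written with Regex.Replace() as in the following sketch. The pattern, the class name PhoneNumberSketch, and the method name ToInternational are assumptions made for this example; the sketch handles only the hyphenated 314-123-1234 form, not numbers already written with brackets.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhoneNumberSketch
{
    // Hypothetical helper: rewrites each 314-123-1234 style number
    // as +1 (314) 123-1234. The pattern is an illustrative assumption.
    public static string ToInternational(string input)
    {
        // Three groups: area code, exchange, and line number.
        const string pattern = @"\b(\d{3})-(\d{3})-(\d{4})\b";

        // $1, $2, and $3 in the replacement refer to those groups.
        return Regex.Replace(input, pattern, "+1 ($1) $2-$3");
    }

    static void Main()
    {
        // prints "Call +1 (314) 123-1234 today."
        Console.WriteLine(ToInternational("Call 314-123-1234 today."));
    }
}
```

The parentheses in the pattern are groups, a feature that is described later in this section.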
This section is intended only as a very simple
example, so it concentrates on searching strings to identify
certain substrings, not on modifying them.
The RegularExpressionsPlayaround Example
For the rest of this section, you develop a
short example that illustrates some of the features of regular
expressions and how to use the .NET regular expressions engine in
C# by performing and displaying the results of some searches. The
text you are going to use as your sample document is an
introduction to a Wrox Press book on ASP.NET (Professional ASP.NET
2.0, ISBN 0-7645-7610-0):
Tip: This code is valid C# code, despite all the
line breaks. It nicely illustrates the utility of verbatim strings
that are prefixed by the @ symbol.
This text is referred to as the input string. To get your bearings and get used to
the regular expressions .NET classes, you start with a basic plain
text search that doesn’t feature any escape sequences or regular
expression commands. Suppose that you want to find all occurrences
of the string ion. This search string is
referred to as the pattern. Using regular
expressions and the Text variable
declared previously, you can write this:
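A minimal sketch of such a search might look like the following. The contents of Text are a placeholder invented for this sketch (the sample text from the book is not reproduced here), and the class name BasicSearchSketch and the method name FindIon are likewise assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

static class BasicSearchSketch
{
    // Placeholder input, not the sample text from the book.
    // A verbatim string (@"...") may span several source lines.
    public const string Text =
@"This introduction describes a sample application.
It covers validation and the information you need.";

    // Plain-text search for "ion", ignoring case.
    public static MatchCollection FindIon(string input)
    {
        const string pattern = "ion";
        return Regex.Matches(input, pattern,
            RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
    }

    static void Main()
    {
        // Print the position in the input text of every match.
        foreach (Match nextMatch in FindIon(Text))
        {
            Console.WriteLine(nextMatch.Index);
        }
    }
}
```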
This code uses the static method Matches() of the Regex
class in the System.Text.RegularExpressions namespace. This
method takes as parameters some input text, a pattern, and a set of
optional flags taken from the
RegexOptions enumeration. In this case,
you have specified that all searching should be case insensitive.
The other flag, ExplicitCapture,
modifies the way that the match is collected in a way that, for
your purposes, makes the search a bit more efficient - you will see why
later (although it does have other uses that we won’t
explore here). Matches() returns a
reference to a MatchCollection object. A
match is the technical term for the result
of finding an instance of the pattern in the input text. It is
represented by the class System.Text.RegularExpressions.Match. Therefore, you
return a MatchCollection that contains
all the matches, each represented by a Match object. In the preceding code, you simply
iterate over the collection and use the Index property of the Match class, which returns the index in the input
text of where the match was found. Running this code results in
three matches. The following table details some of the members of the RegexOptions enumeration.
So far, nothing is really new in the preceding
example apart from some .NET base classes. However, the power of
regular expressions really comes from that pattern string. The
reason is that the pattern string doesn’t have to only contain
plain text. As hinted at earlier, it can also contain what are
known as meta-characters, which are special
characters that give commands, as well as escape sequences, which
work in much the same way as C# escape sequences. They are
characters preceded by a backslash (\)
and have special meanings.
For example, suppose that you wanted to find words
beginning with n. You could use the
escape sequence \b, which indicates a
word boundary (a word boundary is just a point where an
alphanumeric character precedes or follows a whitespace character
or punctuation symbol). You would write this:
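A sketch of that search, using a hypothetical helper named FindWordsStartingWithN and an invented test sentence, could be:

```csharp
using System;
using System.Text.RegularExpressions;

static class WordStartSketch
{
    // \b followed by n: an n that sits at the start of a word.
    public static MatchCollection FindWordsStartingWithN(string input)
    {
        const string pattern = @"\bn";
        return Regex.Matches(input, pattern,
            RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
    }

    static void Main()
    {
        // Each match is just the leading letter, not the whole word.
        foreach (Match m in FindWordsStartingWithN("Nine nuns ran north"))
        {
            Console.WriteLine("{0} at index {1}", m.Value, m.Index);
        }
    }
}
```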
Notice the @ character
in front of the string. You want the \b
to be passed to the .NET regular expressions engine at runtime -
you don’t want the backslash intercepted by a well-meaning C#
compiler that thinks it’s an
escape sequence intended for itself! If you want to find words
ending with the sequence ion, you write
this:
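For the word-ending case, the boundary simply moves to the other side of the literal text. The helper name FindWordsEndingWithIon and the test sentence are assumptions for this sketch:

```csharp
using System;
using System.Text.RegularExpressions;

static class WordEndSketch
{
    // ion followed by \b: the letters ion at the end of a word.
    public static MatchCollection FindWordsEndingWithIon(string input)
    {
        const string pattern = @"ion\b";
        return Regex.Matches(input, pattern,
            RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
    }

    static void Main()
    {
        // "stationary" contains ion, but not at a word boundary.
        MatchCollection matches =
            FindWordsEndingWithIon("station stationary motion");
        Console.WriteLine(matches.Count);   // prints 2
    }
}
```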
If you want to find all words beginning with the
letter a and ending with the sequence
ion (which has as its only match the
word application in the example), you will
have to put a bit more thought into your code. You clearly need a
pattern that begins with \ba and ends
with ion\b, but what goes in the middle?
You need to somehow tell the regular expressions engine that between the
a and the ion
there can be any number of characters as long as none of them are
whitespace. In fact, the correct pattern looks like this:
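Based on the description that follows, the pattern is \ba\S*ion\b. The sketch below wraps it in a hypothetical helper (FindAWordsEndingInIon) and runs it over an invented sentence:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WholeWordSketch
{
    // Word boundary, a, any run of non-whitespace, ion, word boundary.
    public static List<string> FindAWordsEndingInIon(string input)
    {
        const string pattern = @"\ba\S*ion\b";
        var words = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern,
            RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
        {
            words.Add(m.Value);
        }
        return words;
    }

    static void Main()
    {
        // prints "application, action"
        Console.WriteLine(string.Join(", ",
            FindAWordsEndingInIon("an application in action and a nation")));
    }
}
```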
Eventually you will get used to seeing weird
sequences of characters like this when working with regular
expressions. It actually works quite logically. The escape sequence
\S indicates any character that is not a
whitespace character. The * is called a
quantifier. It means that the preceding
character can be repeated any number of times, including zero
times. The sequence \S* means any number of characters as long as they are not whitespace characters. The preceding
pattern will, therefore, match any single word that begins with
a and ends with ion.
The following table lists some of the main special
characters or escape sequences that you can use. It is not
comprehensive, but a fuller list is available in the MSDN
documentation.
If you want to search for one of the
meta-characters, you can do so by escaping the corresponding
character with a backslash. For example, . (a single period) means any single character other
than the newline character, whereas \.
means a dot.
You can request a match that contains
alternative characters by enclosing them in square brackets. For
example, [1c] means one character that
can be either 1 or c. If you wanted to search for any occurrence of the
words map or man, you would use the sequence ma[np]. (Do not separate the alternatives with |: inside square brackets the pipe is treated as a literal character, so ma[n|p] would also match ma|.) Within the square brackets, you can also
indicate a range, for example [a-z] to
indicate any single lowercase letter, [A-E] to indicate any uppercase letter between
A and E, or
[0-9] to represent a single digit. If
you want to search for an integer (that is, a sequence that
contains only the characters 0 through 9), you could write
[0-9]+ (note the use of the + character to indicate there must be at least one
such digit, but there may be more than one - so this would match 9,
83, 854, and so on).
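The following sketch exercises square brackets and the + quantifier on a few invented inputs. The helper MatchValues is an assumption introduced here to collect the matched strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CharacterClassSketch
{
    // Hypothetical helper: returns the text of every match.
    public static List<string> MatchValues(string input, string pattern)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            values.Add(m.Value);
        }
        return values;
    }

    static void Main()
    {
        // One character from the set {n, p} after "ma".
        Console.WriteLine(string.Join(", ", MatchValues("man map mat", "ma[np]")));

        // One or more digits: each run of digits is a single match.
        Console.WriteLine(string.Join(", ", MatchValues("9, 83, 854", "[0-9]+")));

        // Inside square brackets a pipe is an ordinary character.
        Console.WriteLine(Regex.IsMatch("ma|", "ma[n|p]"));   // prints True
    }
}
```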
Displaying Results
In this section, you code the RegularExpressionsPlayaround example, so you can get
a feel for how the regular expressions work.
The core of the example is a method called
WriteMatches(), which writes out all the
matches from a MatchCollection in a more
detailed format. For each match, it displays the index of where the
match was found in the input string, the string of the match, and a
slightly longer string, which consists of the match plus up to ten
surrounding characters from the input text - up to five characters
before the match and up to five afterward (it is fewer than five
characters if the match occurred within five characters of the
beginning or end of the input text). In other words, a match on the
word messaging that occurs near the end
of the input text quoted earlier would display and messaging of d (five characters before and after
the match), but a match on the final word data would display g of
data. (only one character after the match), because after
that you get to the end of the string. This longer string lets you
see more clearly where the regular expression locates the
match:
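One way to write such a method is sketched below. It follows the behavior just described, but the output format, the helper WithContext, and the short test string are assumptions for this sketch rather than the original listing.

```csharp
using System;
using System.Text.RegularExpressions;

static class WriteMatchesSketch
{
    // Returns the match plus up to five characters on either side,
    // fewer when the match lies near the start or end of the text.
    public static string WithContext(string text, Match match)
    {
        int charsBefore = Math.Min(match.Index, 5);
        int charsAfter =
            Math.Min(text.Length - match.Index - match.Length, 5);
        return text.Substring(match.Index - charsBefore,
            charsBefore + match.Length + charsAfter);
    }

    public static void WriteMatches(string text, MatchCollection matches)
    {
        Console.WriteLine("No. of matches: " + matches.Count);
        foreach (Match nextMatch in matches)
        {
            Console.WriteLine("Index: {0},\tString: {1},\t{2}",
                nextMatch.Index, nextMatch.Value,
                WithContext(text, nextMatch));
        }
    }

    static void Main()
    {
        // Invented input that ends the same way as the examples above.
        const string text = "queues and messaging of data.";
        WriteMatches(text, Regex.Matches(text, "messaging|data"));
    }
}
```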
The bulk of the processing in this method is
devoted to the logic of figuring out how many characters in the
longer substring it can display without overrunning the beginning
or end of the input text. Note that you use another property on the
Match object, Value, which contains the string identified for the
match. Other than that,
RegularExpressionsPlayaround simply
contains a number of methods with names like Find1, Find2, and so on,
which perform some of the searches based on the examples in this
section. For example, Find2 looks for
any string that contains a at the
beginning of a word:
Along with this comes a simple Main() method that you can edit to select one of the
Find<n>()
methods:
The code also needs to make use of the RegularExpressions namespace:
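Put together, the pieces just described (a Find method, a Main() method that selects it, and the using directive) might be arranged roughly as follows. This is a sketch under assumptions: the input text is a placeholder, Find2 returns its matches instead of printing them, and the output loop is a simplified stand-in for WriteMatches().

```csharp
using System;
using System.Text.RegularExpressions;   // Regex, Match, MatchCollection

static class RegularExpressionsPlayaroundSketch
{
    // Placeholder input, not the sample text from the book.
    public const string Text = "An apple a day keeps anyone at bay";

    // Finds every a that starts a word.
    public static MatchCollection Find2(string text)
    {
        const string pattern = @"\ba";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        // Edit this line to try a different Find<n>() method.
        MatchCollection matches = Find2(Text);

        Console.WriteLine("No. of matches: " + matches.Count);
        foreach (Match nextMatch in matches)
        {
            Console.WriteLine("Index: {0},\tString: {1}",
                nextMatch.Index, nextMatch.Value);
        }
    }
}
```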
Running the example with the Find1() method shown previously gives these
results:
Matches, Groups, and Captures
One nice feature of regular expressions is
that you can group characters. It works the same way as compound
statements in C#. In C# you can group any number of statements by
putting them in braces, and the result is treated as one compound
statement. In regular expression patterns, you can group any
characters (including meta-characters and escape sequences), and
the result is treated as a single character. The only difference is
that you use parentheses instead of braces. The resultant sequence
is known as a group.
For example, the pattern (an)+ locates any recurrences of the sequence
an. The +
quantifier applies only to the previous character, but because you
have grouped the characters together, it now applies to repeats of
an treated as a unit. This means that if
you apply (an)+ to the input text,
bananas came to Europe late in the annals of history, the anan from bananas is
identified (as is the single an at the start of annals). On the other hand, if you write an+, the program selects the ann from annals, as well
as two separate sequences of an from
bananas. The expression (an)+ identifies occurrences of an, anan, ananan, and so on, whereas the expression
an+ identifies occurrences of
an, ann,
annn, and so on.
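The difference between the two patterns can be checked directly on the sentence quoted above. In this sketch the helper MatchValues is an assumption added to collect the matched strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GroupQuantifierSketch
{
    // Hypothetical helper: returns the text of every match.
    public static List<string> MatchValues(string input, string pattern)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            values.Add(m.Value);
        }
        return values;
    }

    static void Main()
    {
        const string input =
            "bananas came to Europe late in the annals of history";

        // + applies to the whole group: prints "anan, an"
        Console.WriteLine(string.Join(", ", MatchValues(input, "(an)+")));

        // + applies to the n only: prints "an, an, ann"
        Console.WriteLine(string.Join(", ", MatchValues(input, "an+")));
    }
}
```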
Tip: You might wonder why with the preceding
example (an)+ picks out anan from the word bananas but doesn’t identify
either of the two occurrences of an from the same word. The rule is
that matches must not overlap. If there are a couple of
possibilities that would overlap, then by default the engine takes
the one that starts earliest and, because quantifiers are greedy,
extends it to the longest possible sequence.
However, groups are actually more powerful than
that. By default, when you form part of the pattern into a group,
you are also asking the regular expression engine to remember any
matches against just that group, as well as any matches against the
entire pattern. In other words, you are treating that group as a
pattern to be matched and returned in its own right. This can
actually be extremely useful if you want to break up strings into
component parts.
For example, URIs have the format: <protocol>://<address>:<port>, where the port is optional. An
example of this is http://www.wrox.com:4355. Suppose
that you want to extract the protocol, the address, and the port
from a URI, where you know that there may or may not be whitespace
(but no punctuation) immediately following the URI. You could do so
using this expression:
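A sketch of such an expression in use is shown below. The pattern is reconstructed from the description in this section with one deliberate change that is an assumption of this sketch: the address group is written as [^:\s]+ rather than \S+, so that it stops at a colon and leaves the port for the third group. The class name UriPartsSketch and the method Describe are also hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class UriPartsSketch
{
    // Group 1: protocol. Group 2: address (stops at a colon).
    // (?: ... )? is an optional, unsaved group holding the colon
    // and group 3, the port.
    public const string Pattern = @"\b(\S+)://([^:\s]+)(?::(\S+))?\b";

    // Returns "protocol | address | port" for the first URI found.
    public static string Describe(string input)
    {
        Match m = Regex.Match(input, Pattern);
        if (!m.Success)
        {
            return "no match";
        }
        string port = m.Groups[3].Success ? m.Groups[3].Value : "(none)";
        return m.Groups[1].Value + " | " + m.Groups[2].Value + " | " + port;
    }

    static void Main()
    {
        // prints "http | www.wrox.com | 4355"
        Console.WriteLine(Describe("http://www.wrox.com:4355"));

        // prints "http | www.wrox.com | (none)"
        Console.WriteLine(Describe("see http://www.wrox.com for details"));
    }
}
```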
Here is how this expression works: First, the
leading and trailing \b sequences ensure
that you only consider portions of text that are entire words.
Within that, the first group, (\S+)://,
identifies one or more characters that don’t count as whitespace,
and that are followed by :// - the
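http:// at the start of an HTTP URI. The
parentheses cause the http to be stored as
a group. The subsequent (\S+) identifies
the string www.wrox.com in the URI. This group will
end when it encounters the end of the word (the closing
\b). Be aware that \S+ is greedy and a colon is not whitespace, so when a port is present this group also swallows the :4355 and the optional port group is left empty; writing the address group as ([^:\s]+) makes it stop at the colon.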
The next group identifies the port (:4355). The following ?
indicates that this group is optional in the match - if there is no
:xxxx, this won’t prevent a match from
being marked. This is very important, because the port number is not
always specified in a URI - in fact, it is absent most of the time.
However, things are a bit more complicated than that. You want to
indicate that the colon might or might not appear too, but you
don’t want to store this colon in the group. You’ve achieved this
by having two nested groups. The inner (\S+) identifies anything that follows the colon
(for example, 4355). The outer group
contains the inner group preceded by the colon, and this group in
turn is preceded by the sequence ?:.
This sequence indicates that the group in question should not be
saved (you only want to save 4355; you
don’t need :4355 as well!). Don’t get
confused by the two colons following each other - the first colon
is part of the ?: sequence that says
“don’t save this group,” and the second is text to be searched
for.
If you run this pattern on the following string,
you’ll get one match: http://www.wrox.com.
Within this match, you will find the three groups
just mentioned as well as a fourth group, which represents the
match itself. Theoretically, it is possible that each group itself
might return no, one, or more than one match. Each of these
individual matches is known as a capture.
So, the first group, (\S+), has one
capture, http. The second group also has
one capture (www.wrox.com). The third group, however,
has no captures, because there is no port number on this URI.
Notice that the string contains a second
http://. Although this does match up to
the first group, it will not be captured by the search, because the
entire search expression does not match this part of the text.
This section does not walk through C# code
that uses groups and captures in detail, but you should know that the .NET
RegularExpressions classes support
groups and captures, through classes known as Group and Capture. Also,
the GroupCollection and CaptureCollection classes represent collections of
groups and captures. The Match class
exposes the Groups property, which
returns the corresponding GroupCollection object. The Group class correspondingly exposes the
Captures property, which returns a
CaptureCollection. The relationship
between the objects is shown in Figure 8-3.
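To make that relationship concrete, the following sketch walks from each Match to its Group objects and on to their Capture objects. It reuses the (an)+ pattern from earlier because a repeated group is a simple way to obtain more than one capture; the class name CaptureWalkSketch, the method Describe, and the output format are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CaptureWalkSketch
{
    // Lists every match, then every capture of every numbered group.
    public static List<string> Describe(string input, string pattern)
    {
        var lines = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            lines.Add("Match: " + match.Value);

            // Groups[0] is the whole match, so start at 1.
            for (int g = 1; g < match.Groups.Count; g++)
            {
                Group group = match.Groups[g];
                foreach (Capture capture in group.Captures)
                {
                    lines.Add("  Group " + g + " capture at "
                        + capture.Index + ": " + capture.Value);
                }
            }
        }
        return lines;
    }

    static void Main()
    {
        // The group (an) is matched twice inside the single match anan.
        foreach (string line in Describe("bananas", "(an)+"))
        {
            Console.WriteLine(line);
        }
    }
}
```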
You might not want to return a Group object every time you just want to group some
characters. A fair amount of overhead is involved in instantiating
the object, which is not necessary if all you want is to group some
characters as part of your search pattern. You can disable this by
starting the group with the character sequence ?: for an individual group, as was done for the URI
example, or for all unnamed groups by specifying the RegexOptions.ExplicitCapture flag on the
Regex.Matches() method, as was done in
the earlier examples.
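The effect of both approaches can be seen by counting the groups that a match reports. In this sketch the helper GroupCount and the fixed input bananas are assumptions; the count always includes group 0, which represents the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureOverheadSketch
{
    // Number of groups (including group 0) reported for the first match.
    public static int GroupCount(string pattern, RegexOptions options)
    {
        Match m = Regex.Match("bananas", pattern, options);
        return m.Groups.Count;
    }

    static void Main()
    {
        // Ordinary parentheses create a saved group: prints 2
        Console.WriteLine(GroupCount("(an)+", RegexOptions.None));

        // ?: switches saving off for this one group: prints 1
        Console.WriteLine(GroupCount("(?:an)+", RegexOptions.None));

        // ExplicitCapture switches it off for unnamed groups: prints 1
        Console.WriteLine(GroupCount("(an)+", RegexOptions.ExplicitCapture));
    }
}
```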