Regular Expressions

Regular expressions are one of those small technology areas that are incredibly useful in a wide range of programs, yet rarely used by developers. You can think of regular expressions as a mini programming language with one specific purpose: to locate substrings within a large string expression. It is not a new technology; it originated in the Unix environment and is commonly used with the Perl programming language. Microsoft ported it to Windows, where until recently it was used mostly with scripting languages. Today, however, regular expressions are supported by a number of .NET classes in the System.Text.RegularExpressions namespace. They also appear in various parts of the .NET Framework itself; for instance, the ASP.NET validation server controls use them.

If you are not familiar with the regular expressions language, this section gives a very basic introduction to both regular expressions and their related .NET classes. If you are already familiar with regular expressions, you’ll probably want to just skim through this section to pick out the references to the .NET base classes. You might like to know that the .NET regular expression engine is designed to be mostly compatible with Perl 5 regular expressions, although it has a few extra features.

Introduction to Regular Expressions

The regular expressions language is designed specifically for string processing. It contains two features:

  • A set of escape codes for identifying specific types of characters. You will be familiar with the use of the * character to represent any substring in DOS expressions. (For example, the DOS command Dir Re* lists the files with names beginning with Re.) Regular expressions use many sequences like this to represent items such as any one character, a word break, one optional character, and so on.

  • A system for grouping parts of substrings and intermediate results during a search operation.

With regular expressions, you can perform quite sophisticated and high-level operations on strings. For example, you can:

  • Identify (and perhaps either flag or remove) all repeated words in a string (for example, “The computer book book” to “The computer book”)

  • Convert all words to title case (for example, “this is a Title” to “This Is A Title”)

  • Convert all words longer than three characters to title case (for example, “this is a Title” to “This is a Title”)

  • Ensure that sentences are properly capitalized

  • Separate the various elements of a URI (for example, given http://www.wrox.com, extract the protocol, computer name, file name, and so on)

Of course, all of these tasks can be performed in C# using the various methods on System.String and System.Text.StringBuilder. However, in some cases, this would involve writing a fair amount of C# code. If you use regular expressions, this code can normally be compressed to just a couple of lines. Essentially, you instantiate a System.Text.RegularExpressions.Regex object (or, even simpler, invoke one of the static methods of the Regex class), pass it the string to be processed, and pass in a regular expression (a string containing the instructions in the regular expressions language), and you’re done.
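As a rough illustration of the two calling styles just described, the sketch below removes repeated words once through a Regex instance and once through the static Regex.Replace() method. The class name, the method names, and the pattern itself (which relies on \w and the backreference \1, neither of which is covered in this section) are assumptions made for the example, not code from the original sample.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedWordsSketch
{
    // Hypothetical pattern: a word, followed one or more times by
    // whitespace and the same word again (\1 refers back to group 1).
    private const string Pattern = @"\b(\w+)(\s+\1\b)+";

    // Instance style: construct a Regex object, then call a method on it.
    public static string RemoveWithInstance(string input)
    {
        Regex regex = new Regex(Pattern);
        return regex.Replace(input, "$1");
    }

    // Static style: pass the input and the pattern in a single call.
    public static string RemoveWithStatic(string input)
    {
        return Regex.Replace(input, Pattern, "$1");
    }

    public static void Main()
    {
        Console.WriteLine(RemoveWithInstance("The computer book book"));
        Console.WriteLine(RemoveWithStatic("It is is a test"));
    }
}
```

The instance form is usually the better fit when the same pattern is applied many times; the static form keeps one-off calls short.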

A regular expression string looks at first sight rather like a regular string, but interspersed with escape sequences and other characters that have a special meaning. For example, the sequence \b indicates the beginning or end of a word (a word boundary), so if you wanted to indicate you were looking for the characters th at the beginning of a word, you would search for the regular expression \bth (that is, the sequence word boundary-t-h). If you wanted to search for all occurrences of th at the end of a word, you would write th\b (the sequence t-h-word boundary). However, regular expressions are much more sophisticated than that and include, for example, facilities to store portions of text that are found in a search operation. This section merely scratches the surface of the power of regular expressions.

Suppose your application needed to convert U.S. phone numbers to an international format. In the United States, phone numbers have the format 314-123-1234, which is often written as (314) 123-1234. When converting this national format to an international format, you have to include +1 (the country code of the United States) and add parentheses around the area code: +1 (314) 123-1234. As find-and-replace operations go, that’s not too complicated, but it would still require some coding effort if you were restricted to the methods available on System.String. The regular expressions language allows you to construct a short string that achieves the same result.
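One possible shape for that conversion is sketched here with Regex.Replace(). The helper name ToInternational and the pattern are illustrative assumptions; the pattern uses \d (any digit) and {n} (exactly n repetitions), which are part of the regular expressions language but are not introduced in this section.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneFormatSketch
{
    // Hypothetical helper: rewrites 314-123-1234 as +1 (314) 123-1234.
    // Each parenthesized part of the pattern is stored as a group, and
    // $1, $2, $3 in the replacement string refer back to those groups.
    public static string ToInternational(string input)
    {
        return Regex.Replace(
            input,
            @"\b(\d{3})-(\d{3})-(\d{4})\b",
            "+1 ($1) $2-$3");
    }

    public static void Main()
    {
        Console.WriteLine(ToInternational("Call 314-123-1234 today."));
    }
}
```

Text that does not look like a national-format number is left untouched, so the helper can be run over a whole document.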

This section is intended only as a very simple example, so it concentrates on searching strings to identify certain substrings, not on modifying them.

The RegularExpressionsPlayaround Example

For the rest of this section, you develop a short example that illustrates some of the features of regular expressions and how to use the .NET regular expressions engine in C# by performing and displaying the results of some searches. The text you are going to use as your sample document is an introduction to a Wrox Press book on ASP.NET (Professional ASP.NET 2.0, ISBN 0-7645-7610-0):


string Text =
@"This comprehensive compendium provides a broad and thorough investigation of all
aspects of programming with ASP.NET. Entirely revised and updated for the 2.0
Release of .NET, this site will give you the information you need to master ASP.NET
and build a dynamic, successful, enterprise Web application.";
Tip 

This code is valid C# code, despite all the line breaks. It nicely illustrates the utility of verbatim strings that are prefixed by the @ symbol.

This text is referred to as the input string. To get your bearings and get used to the regular expressions .NET classes, you start with a basic plain text search that doesn’t feature any escape sequences or regular expression commands. Suppose that you want to find all occurrences of the string ion. This search string is referred to as the pattern. Using regular expressions and the Text variable declared previously, you can write this:


string Pattern = "ion";
MatchCollection Matches = Regex.Matches(Text, Pattern,
                                        RegexOptions.IgnoreCase |
                                        RegexOptions.ExplicitCapture);
foreach (Match NextMatch in Matches)
{
   Console.WriteLine(NextMatch.Index);
}

This code uses the static method Matches() of the Regex class in the System.Text.RegularExpressions namespace. This method takes as parameters some input text, a pattern, and a set of optional flags taken from the RegexOptions enumeration. In this case, you have specified that all searching should be case insensitive. The other flag, ExplicitCapture, modifies the way that the match is collected in a way that, for your purposes, makes the search a bit more efficient - you see why later (although it does have other uses that we won’t explore here). Matches() returns a reference to a MatchCollection object. A match is the technical term for the result of finding an instance of the pattern in the input string. It is represented by the class System.Text.RegularExpressions.Match. Therefore, you get back a MatchCollection that contains all the matches, each represented by a Match object. In the preceding code, you simply iterate over the collection and use the Index property of the Match class, which returns the index in the input text at which the match was found. Running this code results in three matches. The following table details some of the members of the RegexOptions enumeration.

Member Name              Description

CultureInvariant         Specifies that cultural differences in language are ignored
ExplicitCapture          Modifies the way the match is collected so that the only valid captures are groups that are explicitly named or numbered
IgnoreCase               Specifies case-insensitive matching
IgnorePatternWhitespace  Removes unescaped whitespace from the pattern and enables comments that are marked with the pound or hash sign (#)
Multiline                Changes the characters ^ and $ so that they apply to the beginning and end of each line and not just to the beginning and end of the entire string
RightToLeft              Causes the search to move through the input string from right to left instead of the default left to right
Singleline               Specifies a single-line mode where the meaning of the dot (.) is changed to match every character, including the newline character (\n)
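To make a few of these options concrete, the following sketch counts matches for the same input under different flag combinations. The class name, the Count helper, and the three-line input are assumptions chosen purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOptionsSketch
{
    // Hypothetical helper: number of matches for a pattern under given options.
    public static int Count(string input, string pattern, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options).Count;
    }

    public static void Main()
    {
        string input = "one\nOne\nONE";   // three lines separated by \n

        // ^ anchors to the start of the whole string unless Multiline is set.
        Console.WriteLine(Count(input, "^one", RegexOptions.None));
        Console.WriteLine(Count(input, "^one", RegexOptions.IgnoreCase));
        Console.WriteLine(Count(input, "^one",
            RegexOptions.IgnoreCase | RegexOptions.Multiline));

        // The dot only matches \n when Singleline is set.
        Console.WriteLine(Count(input, "e.O", RegexOptions.None));
        Console.WriteLine(Count(input, "e.O", RegexOptions.Singleline));
    }
}
```

Because RegexOptions is a flags enumeration, members are combined with the | operator, as in the Matches() call shown earlier.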

So far, nothing is really new in the preceding example apart from some .NET base classes. However, the power of regular expressions really comes from that pattern string. The reason is that the pattern string doesn’t have to contain only plain text. As hinted at earlier, it can also contain what are known as meta-characters, which are special characters that give commands, as well as escape sequences, which work in much the same way as C# escape sequences. They are characters preceded by a backslash (\) and have special meanings.

For example, suppose that you wanted to find words beginning with n. You could use the escape sequence \b, which indicates a word boundary (a word boundary is just a point where an alphanumeric character precedes or follows a whitespace character or punctuation symbol). You would write this:

string Pattern = @"\bn";
MatchCollection Matches = Regex.Matches(Text, Pattern,
                                        RegexOptions.IgnoreCase |
                                        RegexOptions.ExplicitCapture);

Notice the @ character in front of the string. You want the \b to be passed to the .NET regular expressions engine at runtime - you don’t want the backslash intercepted by a well-meaning C# compiler that thinks it’s an escape sequence intended for itself! If you want to find words ending with the sequence ion, you write this:


string Pattern = @"ion\b";
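The point about the @ prefix can be checked with a small sketch that passes the same three characters to the engine in three different ways. The class name, the CountMatches helper, and the input sentence are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimPatternSketch
{
    // Hypothetical helper: counts case-insensitive matches of a pattern.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        string input = "no one knows nine names";

        // Verbatim string: the engine receives backslash, b, n.
        Console.WriteLine(CountMatches(input, @"\bn"));

        // Regular string with a doubled backslash: same pattern as above.
        Console.WriteLine(CountMatches(input, "\\bn"));

        // Regular string with a single backslash: the C# compiler turns \b
        // into a backspace character, so the engine looks for backspace + n.
        Console.WriteLine(CountMatches(input, "\bn"));
    }
}
```

The first two calls find the words no, nine, and names; the third finds nothing, because the input contains no backspace character.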

If you want to find all words beginning with the letter a and ending with the sequence ion (which has as its only match the word application in the example), you will have to put a bit more thought into your code. You clearly need a pattern that begins with \ba and ends with ion\b, but what goes in the middle? You need to somehow tell the regular expression engine that between the a and the ion there can be any number of characters as long as none of them are whitespace. In fact, the correct pattern looks like this:


string Pattern = @"\ba\S*ion\b";

Eventually you will get used to seeing weird sequences of characters like this when working with regular expressions. It actually works quite logically. The escape sequence \S indicates any character that is not a whitespace character. The * is called a quantifier. It means that the preceding character can be repeated any number of times, including zero times. The sequence \S* means any number of characters as long as they are not whitespace characters. The preceding pattern will, therefore, match any single word that begins with a and ends with ion.
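A short sketch shows the pattern at work on a sentence with several candidate words. The class name, the FindWords helper, and the input sentence are assumptions for illustration; the sample text used elsewhere in this section is deliberately not reused so that the matches are easy to check by eye.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordPatternSketch
{
    // Hypothetical helper: returns every word that begins with a
    // and ends with ion, joined by commas.
    public static string FindWords(string input)
    {
        MatchCollection matches = Regex.Matches(input, @"\ba\S*ion\b");
        string[] words = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            words[i] = matches[i].Value;
        }
        return string.Join(",", words);
    }

    public static void Main()
    {
        Console.WriteLine(FindWords(
            "An application needs attention and action, not aviation fuel."));
    }
}
```

The words An and and are rejected because \S* cannot cross the following space to reach an ion, and the comma after action is left out of the match because the closing \b must come directly after ion.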

The following table lists some of the main special characters or escape sequences that you can use. It is not comprehensive, but a fuller list is available in the MSDN documentation.

Symbol  Meaning                                                 Example  Matches

^       Beginning of input text                                 ^B       B, but only if first character in text
$       End of input text                                       X$       X, but only if last character in text
.       Any single character except the newline character (\n)  i.ation  isation, ization
*       Preceding character may be repeated 0 or more times     ra*t     rt, rat, raat, raaat, and so on
+       Preceding character may be repeated 1 or more times     ra+t     rat, raat, raaat, and so on (but not rt)
?       Preceding character may be repeated 0 or 1 times        ra?t     rt and rat only
\s      Any whitespace character                                \sa      [space]a, \ta, \na (\t and \n have the same meanings as in C#)
\S      Any character that isn’t whitespace                     \SF      aF, rF, cF, but not \tF
\b      Word boundary                                           ion\b    Any word ending in ion
\B      Any position that isn’t a word boundary                 \BX\B    Any X in the middle of a word

If you want to search for one of the meta-characters, you can do so by escaping the corresponding character with a backslash. For example, . (a single period) means any single character other than the newline character, whereas \. means a dot.

You can request a match that contains alternative characters by enclosing them in square brackets. For example, [1c] means one character that can be either 1 or c. If you wanted to search for any occurrence of the words map or man, you would use the sequence ma[np]. Note that no separator is needed between the alternatives; a | inside square brackets is just another literal character, so ma[n|p] would also match ma|. Within the square brackets, you can also indicate a range, for example [a-z] to indicate any single lowercase letter, [A-E] to indicate any uppercase letter between A and E, or [0-9] to represent a single digit. If you want to search for an integer (that is, a sequence that contains only the characters 0 through 9), you could write [0-9]+ (note the use of the + character to indicate there must be at least one such digit, but there may be more than one - so this would match 9, 83, 854, and so on).
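The bracket syntax can be tried out with a sketch like the one below. The class name, the helper, and the input sentence are hypothetical; the last two lines of Main() show why a | does not belong inside the brackets.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterClassSketch
{
    // Hypothetical helper: returns all matches joined by commas.
    public static string FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] found = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            found[i] = matches[i].Value;
        }
        return string.Join(",", found);
    }

    public static void Main()
    {
        string input = "The man read map 9, then maps 83 and 854.";

        // One character from a set: n or p.
        Console.WriteLine(FindAll(input, "ma[np]"));

        // A range with a quantifier: one or more digits.
        Console.WriteLine(FindAll(input, "[0-9]+"));

        // Inside brackets, | is a literal character, not "or".
        Console.WriteLine(Regex.IsMatch("ma|", "ma[n|p]"));
        Console.WriteLine(Regex.IsMatch("ma|", "ma[np]"));
    }
}
```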

Displaying Results

In this section, you code the RegularExpressionsPlayaround example, so you can get a feel for how the regular expressions work.

The core of the example is a method called WriteMatches(), which writes out all the matches from a MatchCollection in a more detailed format. For each match, it displays the index at which the match was found in the input string, the matched string itself, and a slightly longer string consisting of the match plus up to ten surrounding characters from the input text: up to five characters before the match and up to five afterward (fewer than five if the match occurred within five characters of the beginning or end of the input text). In other words, a match on the word successful near the end of the input text quoted earlier would display mic, successful, ent (five characters before and after the match), but a match on the final word application would display Web application. (only one character after the match), because after that you get to the end of the string. This longer string lets you see more clearly where the regular expression locates the match:


static void WriteMatches(string text, MatchCollection matches)
{
   Console.WriteLine("Original text was: \n\n" + text + "\n");
   Console.WriteLine("No. of matches: " + matches.Count);
   foreach (Match nextMatch in matches)
   {
      int index = nextMatch.Index;
      string result = nextMatch.Value;
      // Show up to five characters of context on either side of the match,
      // clamped so the substring never runs past the start or end of the text.
      int charsBefore = (index < 5) ? index : 5;
      int fromEnd = text.Length - index - result.Length;
      int charsAfter = (fromEnd < 5) ? fromEnd : 5;
      int charsToDisplay = charsBefore + charsAfter + result.Length;

      Console.WriteLine("Index: {0}, \tString: {1}, \t{2}",
          index, result,
          text.Substring(index - charsBefore, charsToDisplay));
   }
}

The bulk of the processing in this method is devoted to the logic of figuring out how many characters in the longer substring it can display without overrunning the beginning or end of the input text. Note that you use another property on the Match object, Value, which contains the string identified for the match. Other than that, RegularExpressionsPlayaround simply contains a number of methods with names like Find1, Find2, and so on, which perform some of the searches based on the examples in this section. For example, Find2 looks for any string that contains a at the beginning of a word:


static void Find2()
{
   string text = @"This comprehensive compendium provides a broad and thorough
     investigation of all aspects of programming with ASP.NET. Entirely revised and
     updated for the 2.0 Release of .NET, this site will give you the information
     you need to master ASP.NET and build a dynamic, successful, enterprise Web
     application.";
   string pattern = @"\ba";
   MatchCollection matches = Regex.Matches(text, pattern,
     RegexOptions.IgnoreCase);
   WriteMatches(text, matches);
}

Along with this comes a simple Main() method that you can edit to select one of the Find<n>() methods:


static void Main()
{
   Find1();
   Console.ReadLine();
}

The code also needs to import the System.Text.RegularExpressions namespace:

using System;
using System.Text.RegularExpressions;

Running the example with the Find1() method gives these results (Find1() is not listed here; the single match shown is consistent with the \ba\S*ion\b pattern discussed earlier):

RegularExpressionsPlayaround
Original text was:

This comprehensive compendium provides a broad and thorough investigation of all
aspects of programming with ASP.NET. Entirely revised and updated for the 2.0
Release of .NET, this site will give you the information you need to master ASP.NET
and build a dynamic, successful, enterprise Web application.

No. of matches: 1
Index: 291,     String: application,     Web application.

Matches, Groups, and Captures

One nice feature of regular expressions is that you can group characters. It works the same way as compound statements in C#. In C# you can group any number of statements by putting them in braces, and the result is treated as one compound statement. In regular expression patterns, you can group any characters (including meta-characters and escape sequences), and the result is treated as a single character. The only difference is that you use parentheses instead of braces. The resultant sequence is known as a group.

For example, the pattern (an)+ locates any recurrences of the sequence an. The + quantifier applies only to the previous character, but because you have grouped the characters together, it now applies to repeats of an treated as a unit. This means that if you apply (an)+ to the input text bananas came to Europe late in the annals of history, the anan from bananas is identified. On the other hand, if you write an+, the program selects the ann from annals, as well as two separate sequences of an from bananas. The expression (an)+ identifies occurrences of an, anan, ananan, and so on, whereas the expression an+ identifies occurrences of an, ann, annn, and so on.

Tip 

You might wonder why with the preceding example (an)+ picks out anan from the word bananas but doesn’t identify either of the two occurrences of an from the same word. The rule is that matches must not overlap. The engine scans the input from left to right and, because quantifiers are greedy by default, extends each match as far as it can before looking for the next one.
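The difference between the two patterns is easy to see side by side. In this sketch the class and helper names are hypothetical; the input is the sentence used above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupQuantifierSketch
{
    // Hypothetical helper: returns all matches joined by commas.
    public static string FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] found = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            found[i] = matches[i].Value;
        }
        return string.Join(",", found);
    }

    public static void Main()
    {
        string input = "bananas came to Europe late in the annals of history";

        // + applies to the whole group: an, anan, ananan, ...
        Console.WriteLine(FindAll(input, "(an)+"));

        // + applies only to the n: an, ann, annn, ...
        Console.WriteLine(FindAll(input, "an+"));
    }
}
```

With (an)+ the engine reports anan from bananas and a single an from annals; with an+ it reports two separate an matches from bananas and ann from annals.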

However, groups are actually more powerful than that. By default, when you form part of the pattern into a group, you are also asking the regular expression engine to remember any matches against just that group, as well as any matches against the entire pattern. In other words, you are treating that group as a pattern to be matched and returned in its own right. This can actually be extremely useful if you want to break up strings into component parts.

For example, URIs have the format: <protocol>://<address>:<port>, where the port is optional. An example of this is http://www.wrox.com:4355. Suppose that you want to extract the protocol, the address, and the port from a URI, where you know that there may or may not be whitespace (but no punctuation) immediately following the URI. You could do so using this expression:


\b(\S+)://([^:\s]+)(?::(\S+))?\b

Here is how this expression works: First, the leading and trailing \b sequences ensure that the match begins and ends on a word boundary. Within that, the first group, (\S+)://, identifies one or more characters that don’t count as whitespace, and that are followed by ://, such as the http:// at the start of an HTTP URI. The parentheses cause the http to be stored as a group. The subsequent ([^:\s]+) identifies the string www.wrox.com in the URI. Because this group accepts neither colons nor whitespace, it ends either at the end of the word (the closing \b) or at a colon (:), which is matched by the next group. (A plain \S+ would not work for the address: + is greedy and \S also matches a colon, so the group would swallow :4355 as well and the port would never be captured separately.)

The next group identifies the port (:4355). The following ? indicates that this group is optional in the match: if there is no :xxxx, this won’t prevent a match from being found. This is very important, because the port number is not always specified in a URI; in fact, it is absent most of the time. However, things are a bit more complicated than that. You want to indicate that the colon might or might not appear too, but you don’t want to store this colon in the group. You achieve this by having two nested groups. The inner (\S+) identifies anything that follows the colon (for example, 4355). The outer group contains the inner group preceded by the colon, and the outer group itself begins with the sequence ?:. This sequence indicates that the group in question should not be saved (you only want to save 4355; you don’t need :4355 as well!). Don’t get confused by the two colons following each other: the first colon is part of the ?: sequence that says “don’t save this group,” and the second is text to be searched for.
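
As a rough illustration of the nested groups, the following sketch runs the expression against the example URI with a port and reads the saved groups by number. The class name UriGroupsDemo is hypothetical. The address group is written here as [^:\s]+, an assumption of this sketch that keeps the greedy match from running past the colon.

```csharp
using System;
using System.Text.RegularExpressions;

static class UriGroupsDemo
{
    // Assumption of this sketch: the address group is [^:\s]+ so that it
    // stops at a colon or at whitespace.
    public const string Pattern = @"\b(\S+)://([^:\s]+)(?::(\S+))?\b";

    static void Main()
    {
        Match m = Regex.Match("http://www.wrox.com:4355", Pattern);
        if (m.Success)
        {
            Console.WriteLine("Protocol: " + m.Groups[1].Value); // http
            Console.WriteLine("Address:  " + m.Groups[2].Value); // www.wrox.com
            Console.WriteLine("Port:     " + m.Groups[3].Value); // 4355

            // The (?: ... ) group is not saved, so no group holds ":4355".
            // Group 0 is the whole match, followed by the three saved groups.
            Console.WriteLine("Groups:   " + m.Groups.Count);    // 4
        }
    }
}
```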

If you run this pattern on the following string, you’ll get one match: http://www.wrox.com.

Hey I've just found this amazing URI at http:// what was it -- oh yes
http://www.wrox.com

Within this match, you will find the three groups just mentioned as well as a fourth group, which represents the match itself. Theoretically, it is possible that each group itself might return no, one, or more than one match. Each of these individual matches is known as a capture. So, the first group, (\S+), has one capture, http. The second group also has one capture (www.wrox.com). The third group, however, has no captures, because there is no port number on this URI.

Notice that the string contains a second http://. Although this does match up to the first group, it will not be captured by the search, because the entire search expression does not match this part of the text.
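
The behavior described for the sample string can be observed with a small program along these lines. It is a sketch rather than a definitive implementation: SampleTextDemo is an illustrative name, and the address group is written as [^:\s]+ (an assumption of this sketch) so that it cannot run past a colon.

```csharp
using System;
using System.Text.RegularExpressions;

static class SampleTextDemo
{
    // Assumption of this sketch: [^:\s]+ for the address group.
    public const string Pattern = @"\b(\S+)://([^:\s]+)(?::(\S+))?\b";

    public const string Text =
        "Hey I've just found this amazing URI at http:// what was it -- oh yes\n" +
        "http://www.wrox.com";

    static void Main()
    {
        MatchCollection matches = Regex.Matches(Text, Pattern);
        Console.WriteLine("Matches found: " + matches.Count); // 1

        foreach (Match m in matches)
        {
            // Index 0 is the match itself; 1 to 3 are protocol, address, port.
            for (int i = 0; i < m.Groups.Count; i++)
            {
                Group g = m.Groups[i];
                Console.WriteLine("Group {0}: success={1}, captures={2}, value='{3}'",
                    i, g.Success, g.Captures.Count, g.Value);
            }
        }
    }
}
```

For the port group, Success is false and the capture count is zero, because the sample URI has no port. The first http:// produces no match at all, since no address follows it.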

The .NET regular expression classes in the System.Text.RegularExpressions namespace support groups and captures through the Group and Capture classes. The GroupCollection and CaptureCollection classes represent collections of groups and captures. The Match class exposes the Groups property, which returns the corresponding GroupCollection object. The Group class correspondingly exposes the Captures property, which returns a CaptureCollection. The relationship between the objects is shown in Figure 8-3.
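
To make the relationship concrete, here is a minimal sketch that walks from a Match to a Group to its Capture objects. It reuses the earlier (an)+ pattern, where the group matches more than once within a single match. The names CaptureWalkDemo and DescribeCaptures are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CaptureWalkDemo
{
    // Returns "value@index" for every capture of group 1 in the first match.
    public static List<string> DescribeCaptures(string input, string pattern)
    {
        List<string> result = new List<string>();
        Match m = Regex.Match(input, pattern);
        if (!m.Success)
        {
            return result;
        }

        Group g = m.Groups[1];             // Match.Groups returns a GroupCollection
        foreach (Capture c in g.Captures)  // Group.Captures returns a CaptureCollection
        {
            result.Add(c.Value + "@" + c.Index);
        }
        return result;
    }

    static void Main()
    {
        // (an)+ matches "anan" once; group 1 is captured twice within that match.
        foreach (string s in DescribeCaptures("bananas", "(an)+"))
        {
            Console.WriteLine(s); // an@1, then an@3
        }
    }
}
```

Note that Group.Value on its own returns only the last capture, so the Captures collection is the place to look when a group sits under a quantifier.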

Figure 8-3

You might not want to return a Group object every time you just want to group some characters. A fair amount of overhead is involved in instantiating the object, which is not necessary if all you want is to group some characters as part of your search pattern. You can disable capturing for an individual group by starting the group with the character sequence ?:, as was done for the URI example, or for all unnamed groups by specifying the RegexOptions.ExplicitCapture flag on the Regex.Matches() method, as was done in the earlier examples.
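
The effect of both techniques can be seen by counting the groups on a match. The following is a small sketch under stated assumptions: ExplicitCaptureDemo and GroupCount are hypothetical names, and the named group (?<unit>an) is added only to show that named groups are still saved when the flag is set.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExplicitCaptureDemo
{
    // Number of groups on the first match, including group 0 (the match itself).
    public static int GroupCount(string input, string pattern, RegexOptions options)
    {
        Match m = Regex.Match(input, pattern, options);
        return m.Groups.Count;
    }

    static void Main()
    {
        string input = "bananas";

        // Ordinary parentheses: the group is saved.
        Console.WriteLine(GroupCount(input, "(an)+", RegexOptions.None));                   // 2

        // ?: switches off saving for this one group.
        Console.WriteLine(GroupCount(input, "(?:an)+", RegexOptions.None));                 // 1

        // ExplicitCapture switches off saving for every unnamed group.
        Console.WriteLine(GroupCount(input, "(an)+", RegexOptions.ExplicitCapture));        // 1

        // Named groups are still saved under ExplicitCapture.
        Console.WriteLine(GroupCount(input, "(?<unit>an)+", RegexOptions.ExplicitCapture)); // 2
    }
}
```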

