System.String
Before examining the other string classes, this
section quickly reviews some of the available methods on the
String class.
System.String is a class
specifically designed to store a string and allow a large number of
operations on the string. Also, because of the importance of this
data type, C# has its own keyword and associated syntax to make it
particularly easy to manipulate strings using this class.
You can concatenate strings using operator
overloads:
C# also allows extraction of a particular character
using an indexer-like syntax:
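The original listings for these two features are not reproduced here. The following is a minimal sketch of both; the class name, method names, and literals are illustrative assumptions rather than the original code.

```csharp
using System;

static class StringBasics
{
    // Joins two strings with the overloaded + and += operators.
    public static string Join(string first, string second)
    {
        string message = first + ", ";   // + produces a new string
        message += second;               // += also produces a new string
        return message;
    }

    // Reads a single character through the read-only indexer.
    public static char CharAt(string text, int index)
    {
        return text[index];
    }

    static void Main()
    {
        string message = Join("Hello", "World");
        Console.WriteLine(message);            // Hello, World
        Console.WriteLine(CharAt(message, 4)); // o
    }
}
```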
The String class also enables you to perform such common tasks as
replacing characters, removing whitespace, and changing capitalization. The
following table introduces the key methods.
Tip: Please note that this table is not
comprehensive but is intended to give you an idea of the features
offered by strings.
Building Strings
As you have seen, String is an extremely powerful class that
implements a large number of very useful methods. However, the
String class has a shortcoming that
makes it very inefficient for making repeated modifications to a
given string - it is actually an immutable
data type, which means that once you initialize a string object,
that string object can never change. The methods and operators that
appear to modify the contents of a string actually create new
strings, copying across the contents of the old string if
necessary. For example, look at the following code:
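The listing itself is missing from this copy. A sketch reconstructed from the description that follows (a variable named greetingText, an initial text of 39 characters, and a combined text of 103 characters) might look like this. The helper names Greeting, Extra, and Append are assumptions introduced only so the behavior can be checked.

```csharp
using System;

static class ImmutabilityDemo
{
    public const string Greeting = "Hello from all the guys at Wrox Press. ";
    public const string Extra = "We do hope you enjoy this book as much as we enjoyed writing it.";

    // Looks like it extends the string, but += builds a new string object.
    public static string Append(string text, string extra)
    {
        text += extra;
        return text;
    }

    static void Main()
    {
        string greetingText = Greeting;
        string original = greetingText;          // second reference to the first string
        greetingText = Append(greetingText, Extra);

        Console.WriteLine(original.Length);      // 39: the first string is unchanged
        Console.WriteLine(greetingText.Length);  // 103: a separate, longer string
        Console.WriteLine(ReferenceEquals(original, greetingText)); // False
    }
}
```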
When this code executes, an object of type System.String
is first created and initialized to hold the text Hello from all the guys at Wrox Press. Note the
space after the period. When this happens,
the .NET runtime allocates just enough memory in the string to hold
this text (39 chars), and the variable greetingText is set to refer to this string
instance.
In the next line, syntactically it looks like some
more text is being added onto the string - though it is not.
Instead, what happens is that a new string instance is created with
just enough memory allocated to store the combined text - that’s
103 characters in total. The original text, Hello from all the guys at
Wrox Press., is copied into this new string instance along
with the extra text, We do hope you
enjoy this book as much as we enjoyed
writing it. Then, the address stored in the variable
greetingText is updated, so the variable
correctly points to the new String
object. The old String object is now
unreferenced - there are
no variables that refer to it - and so will be removed the next
time the garbage collector comes along to clean out any unused
objects in your application.
By itself, that doesn’t look too bad, but suppose
that you wanted to encode that string by replacing each letter (not
the punctuation) with the character whose ASCII code is one higher,
that is, the next letter in the alphabet, as part of some extremely simple encryption
scheme. This would change the string to Ifmmp
gspn bmm uif hvzt bu Xspy Qsftt. Xf ep
ipqf zpv fokpz uijt cppl bt nvdi bt xf fokpzfe xsjujoh ju.
Several ways of doing this exist, but the simplest and (if you are
restricting yourself to using the String
class) almost certainly the most efficient way is to use the
String.Replace() method, which replaces
all occurrences of a given substring in a string with another
substring. Using Replace(), the code to
encode the text looks like this:
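The original listing is not included in this copy. As a hedged sketch of the approach described, the following shifts every letter by one using the String.Replace(char, char) overload. The class and method names are assumptions. Like the approach described, it does not wrap z to a.

```csharp
using System;

static class StringEncoder
{
    public static string Encode(string text)
    {
        // Work downward from 'z' so that a letter that has already been
        // shifted is never shifted a second time.
        for (int i = 'z'; i >= 'a'; i--)
        {
            text = text.Replace((char)i, (char)(i + 1));
        }
        for (int i = 'Z'; i >= 'A'; i--)
        {
            text = text.Replace((char)i, (char)(i + 1));
        }
        return text;
    }

    static void Main()
    {
        string greetingText = "Hello from all the guys at Wrox Press. ";
        greetingText += "We do hope you enjoy this book as much as we enjoyed writing it.";
        Console.WriteLine(Encode(greetingText));
    }
}
```

Each call to Replace() that changes anything returns a new string, which is why the loop has to reassign text on every pass.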
Tip: For simplicity, this code doesn’t wrap Z to A
or z to a. These letters get encoded to [ and {, respectively.
Here, the Replace()
method works in a fairly intelligent way, to the extent that it
won’t create a new string unless it actually makes some
changes to the old string. The original string contained 23
different lowercase characters and 3 different uppercase ones. The
Replace method will therefore have
allocated a new string 26 times in total, each new string storing
103 characters. That means that as a result of the encryption
process there will be string objects capable of storing a combined
total of 26 x 103 = 2,678 characters now sitting on the heap waiting to be
garbage collected! Clearly, if you use strings to do text
processing extensively, your applications will run into severe
performance problems.
To address this kind of issue, Microsoft has
supplied the System.Text.StringBuilder
class. StringBuilder isn’t as powerful
as String in terms of the number of
methods it supports. The processing you can do on a StringBuilder is limited to substitutions and
appending or removing text from strings. However, it works in a
much more efficient way.
When you construct a string using the String class, just enough memory is allocated to
hold the string. The StringBuilder,
however, does better than this and normally allocates more memory
than is actually needed. You, as a developer, have the option to
indicate how much memory the StringBuilder should allocate, but if you don’t, the
amount will default to some value that depends on the size of the
string that the StringBuilder instance is initialized with. The
StringBuilder class has two main
properties:
- Length, which indicates the length of the string that it actually contains
- Capacity, which indicates the maximum length of the string in the memory allocation
Any modifications to the string take place within
the block of memory assigned to the StringBuilder instance, which makes appending
substrings and replacing individual characters within strings very
efficient. Removing or inserting substrings is inevitably still
inefficient, because it means that the following part of the string
has to be moved. Only if you perform some operation that exceeds
the capacity of the string is it necessary to allocate new memory
and possibly move the entire contained string. Based on our
experiments, when its capacity has been exceeded and no new value for the
capacity has been set, the StringBuilder appears to double its
capacity.
For example, if you use a StringBuilder object to construct the original
greeting string, you might write this code:
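The listing is missing here. A sketch consistent with the surrounding description (an initial capacity of 150 and a call to AppendFormat()) could look like this. The variable name greetingBuilder and the wrapper method are assumptions.

```csharp
using System;
using System.Text;

static class GreetingBuilderDemo
{
    public static StringBuilder BuildGreeting()
    {
        // Initial text plus room for 150 characters in total.
        StringBuilder greetingBuilder =
            new StringBuilder("Hello from all the guys at Wrox Press. ", 150);

        // The extra text fits in the spare capacity, so no reallocation is needed.
        greetingBuilder.AppendFormat(
            "We do hope you enjoy this book as much as we enjoyed writing it.");
        return greetingBuilder;
    }

    static void Main()
    {
        StringBuilder greetingBuilder = BuildGreeting();
        Console.WriteLine(greetingBuilder.Length);   // 103
        Console.WriteLine(greetingBuilder.Capacity);
    }
}
```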
Tip: To use the StringBuilder class, you need a using directive for the System.Text namespace in your code.
This code sets an initial capacity of 150 for the StringBuilder. It is always a good idea to set some
capacity that covers the likely maximum length of a string, to
ensure the StringBuilder doesn’t need to
relocate because its capacity was exceeded. Theoretically, you can
set as large a number as you can pass in an int, although the system will probably complain that
it doesn’t have enough memory if you actually try to allocate the
maximum of 2 billion characters (this is the theoretical maximum
that a StringBuilder instance is in
principle allowed to contain).
When the preceding code is executed, it first
creates a StringBuilder object that
looks like Figure
8-1.
Then, on calling the AppendFormat() method, the remaining text is placed
in the empty space, without the need for more memory allocation.
However, the real efficiency gain from using a StringBuilder comes when you are making repeated
text substitutions. For example, if you try to encrypt the text in
the same way as before, you can perform the entire encryption
without allocating any more memory whatsoever:
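The listing is not reproduced in this copy. As a sketch under the same assumptions as before (hypothetical class and method names, a capacity of 150), the encoding can be done in place with StringBuilder.Replace(char, char):

```csharp
using System;
using System.Text;

static class BuilderEncoder
{
    public static string Encode(string text)
    {
        StringBuilder greetingBuilder = new StringBuilder(text, 150);

        // Replace() modifies the builder's own buffer instead of
        // returning a new object on every pass.
        for (int i = 'z'; i >= 'a'; i--)
        {
            greetingBuilder.Replace((char)i, (char)(i + 1));
        }
        for (int i = 'Z'; i >= 'A'; i--)
        {
            greetingBuilder.Replace((char)i, (char)(i + 1));
        }
        return greetingBuilder.ToString();
    }

    static void Main()
    {
        string greeting = "Hello from all the guys at Wrox Press. " +
            "We do hope you enjoy this book as much as we enjoyed writing it.";
        Console.WriteLine("Encoded:\n" + Encode(greeting));
    }
}
```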
This code uses the StringBuilder.Replace() method, which does the same
thing as String.Replace(), but without
copying the string in the process. The total memory allocated to
hold strings in the preceding code is 150 characters for the
StringBuilder instance, as well as the
memory allocated during the string operations performed internally
in the final Console.WriteLine()
statement.
Normally, you will want to use StringBuilder to perform any manipulation of strings
and String to store or display the final
result.
StringBuilder Members
You have seen a demonstration of one
constructor of StringBuilder, which
takes an initial string and capacity as its parameters. There are
others. For example, you can supply only a string:
Or you can create an empty StringBuilder with a given capacity:
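The two listings are missing here. A minimal sketch of both constructors follows; the wrapper names and the values are illustrative assumptions.

```csharp
using System;
using System.Text;

static class BuilderConstructors
{
    // Supplies only the initial string; the capacity is chosen for you.
    public static StringBuilder FromString(string text)
    {
        return new StringBuilder(text);
    }

    // Creates an empty builder with the requested capacity.
    public static StringBuilder WithCapacity(int capacity)
    {
        return new StringBuilder(capacity);
    }

    static void Main()
    {
        Console.WriteLine(FromString("Hello").Length);   // 5
        Console.WriteLine(WithCapacity(20).Length);      // 0
        Console.WriteLine(WithCapacity(20).Capacity);
    }
}
```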
Apart from the Length
and Capacity properties, there is a
read-only MaxCapacity property that
indicates the limit to which a given StringBuilder instance is allowed to grow. By
default, this is given by int.MaxValue
(roughly 2 billion, as noted earlier), but you can set this value
to something lower when you construct the StringBuilder object:
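The listing is missing here. One way to sketch this uses the constructor that takes a capacity and a maximum capacity; the values 100 and 500 and the wrapper name are assumptions.

```csharp
using System;
using System.Text;

static class MaxCapacityDemo
{
    // Creates an empty builder with an initial capacity and an upper limit.
    public static StringBuilder Limited(int capacity, int maxCapacity)
    {
        return new StringBuilder(capacity, maxCapacity);
    }

    static void Main()
    {
        Console.WriteLine(new StringBuilder().MaxCapacity == int.MaxValue); // True
        Console.WriteLine(Limited(100, 500).MaxCapacity);                   // 500
    }
}
```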
You can also explicitly set the capacity at any
time, though an exception will be raised if you set it to a value
less than the current length of the string or a value that exceeds
the maximum capacity:
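The listing is missing here. The following sketch sets Capacity and catches the exception raised for an invalid value, which is ArgumentOutOfRangeException. The helper TrySetCapacity is hypothetical and exists only to make both outcomes visible.

```csharp
using System;
using System.Text;

static class CapacityDemo
{
    // Returns true if the capacity was accepted, false if it was rejected.
    public static bool TrySetCapacity(StringBuilder builder, int capacity)
    {
        try
        {
            builder.Capacity = capacity;
            return true;
        }
        catch (ArgumentOutOfRangeException)
        {
            // Raised when capacity < builder.Length or capacity > builder.MaxCapacity.
            return false;
        }
    }

    static void Main()
    {
        StringBuilder sb = new StringBuilder("Hello");
        Console.WriteLine(TrySetCapacity(sb, 100)); // True
        Console.WriteLine(TrySetCapacity(sb, 2));   // False: shorter than the contents
    }
}
```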
The following table lists the main StringBuilder methods.
Several overloads of many of these methods
exist.
Tip: AppendFormat() is
the method that is ultimately called when you call
Console.WriteLine(); it has
responsibility for working out what all the format expressions like
{0:D} should be replaced with. This
method is examined in the next section.
There is no cast (either implicit or explicit) from
StringBuilder to String. If you want to output the contents of a
StringBuilder as a String, you must use the ToString() method.
Now that you have been introduced to the
StringBuilder class and seen some of
the ways in which you can use it to increase performance, you
should be aware that this class will not always give you the
increased performance that you are looking for. Basically, the
StringBuilder class should be used when
you are manipulating multiple strings. However, if you are just
doing something as simple as concatenating two strings, you will
find that System.String performs
better.
Format Strings
So far, a large number of classes and structs
have been written for the code samples presented in this book, and
they have normally implemented a ToString() method in order to be able to display the
contents of a given variable. However, quite often users might want
the contents of a variable to be displayed in different, often
culture- and locale-dependent, ways. The .NET type
System.DateTime provides the most
obvious example of this. For example, you might want to display the
same date as 10 June 2007, 10 Jun 2007, 6/10/07 (USA), 10/6/07
(UK), or 10.06.2007 (Germany).
Similarly, the Vector
struct in Chapter 3, “Objects and Types,”
implements the Vector.ToString() method
to display the vector in the format (4, 56,
8). There is, however, another very common way of writing
vectors, in which this vector would appear as 4i + 56j + 8k. If you want the classes that you
write to be user-friendly, they need to support the facility to
display their string representations in any of the formats that
users are likely to want to use. The .NET runtime defines a
standard way that this
should be done: the IFormattable
interface. Showing how to add this important feature to your
classes and structs is the subject of this section.
As you probably know, you need to specify the
format in which you want a variable displayed when you call
Console.WriteLine(). Therefore, this
section uses this method as an example, although most of the
discussion applies to any situation in which you want to format a
string. For example, if you want to display the value of a variable
in a list box or text box, you will normally use the String.Format() method to obtain the appropriate
string representation of the variable. However, the actual format
specifiers you use to request a particular format are identical to
those passed to Console.WriteLine().
You start by
examining what actually happens when you supply a format string for
a primitive type, and from this you will see how you can plug
format specifiers for your own classes and structs into the
process.
Chapter 2, “C# Basics,” uses format
strings in Console.Write() and
Console.WriteLine() like this:
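The Chapter 2 listing is not reproduced in this copy. Later paragraphs describe a format string that begins with "The double is " and contains the specifiers {0,10:E} and {1}, applied to a double d and an int. The sketch below follows that description. The text between the two placeholders, the values of d and i, and the use of String.Format() with the invariant culture (so the result is reproducible) are assumptions.

```csharp
using System;
using System.Globalization;

static class FormatStringDemo
{
    public static string Build(double d, int i)
    {
        // {0,10:E}: argument 0, right-justified in 10 characters, scientific notation.
        // {1}: argument 1 with its default formatting.
        return string.Format(CultureInfo.InvariantCulture,
            "The double is {0,10:E} and the int contains {1}", d, i);
    }

    static void Main()
    {
        double d = 13.45; // illustrative value
        int i = 45;       // illustrative value
        Console.WriteLine(Build(d, i));
    }
}
```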
The format string itself consists mostly of the
text to be displayed, but wherever there is a variable to be
formatted, its index in the parameter list appears in braces. You
might also include other information inside the braces concerning
the format of that item. For example, you can include:
- The number of characters to be occupied by
the representation of the item, prefixed by a comma. A negative
number indicates that the item should be left-justified, whereas a
positive number indicates that it should be right-justified. If the
item actually occupies more characters than have been requested, it
will still appear in full.
- A format specifier, preceded by a colon. This
indicates how you want the item to be formatted. For example, you
can indicate whether you want a number to be formatted as a
currency or displayed in scientific notation.
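A short sketch of the width and format-specifier components described in the list above. The square brackets are added only to make the padding visible, and the class and method names are assumptions.

```csharp
using System;
using System.Globalization;

static class AlignmentDemo
{
    // Positive width: right-justified within 6 characters.
    public static string Right(int value)
    {
        return string.Format("[{0,6}]", value);
    }

    // Negative width: left-justified within 6 characters.
    public static string Left(int value)
    {
        return string.Format("[{0,-6}]", value);
    }

    // Format specifier after the colon: scientific notation, two decimals.
    public static string Scientific(double value)
    {
        return string.Format(CultureInfo.InvariantCulture, "{0:E2}", value);
    }

    static void Main()
    {
        Console.WriteLine(Right(42));       // [    42]
        Console.WriteLine(Left(42));        // [42    ]
        Console.WriteLine(Right(1234567));  // [1234567], wider than requested but shown in full
        Console.WriteLine(Scientific(1234.5));
    }
}
```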
The following table lists the common format
specifiers for the numeric types, which were briefly discussed in
Chapter 2.
If you want an integer to be padded with zeros, you
can use the format specifier 0 (zero)
repeated as many times as the number of digits required. For
example, the format specifier 0000 will
cause 3 to be displayed as 0003, 99 to be
displayed as 0099, and so on.
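The zero-padding specifier can be tried directly, as in this small sketch (the class and method names are assumptions):

```csharp
using System;

static class ZeroPaddingDemo
{
    // Pads the integer with leading zeros to at least four digits.
    public static string Pad4(int value)
    {
        return string.Format("{0:0000}", value);
    }

    static void Main()
    {
        Console.WriteLine(Pad4(3));     // 0003
        Console.WriteLine(Pad4(99));    // 0099
        Console.WriteLine(Pad4(12345)); // 12345, longer values are not truncated
    }
}
```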
It is not possible to give a complete list, because
other data types can add their own specifiers. Showing how to
define your own specifiers for your own classes is the aim of this
section.
How the String Is Formatted
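As an example of how strings are formatted,
suppose you execute a Console.WriteLine() statement that passes a format
string beginning with the text "The double is " and containing the
specifiers {0,10:E} and {1}, followed by a double and an int. In that case,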
Console.WriteLine() just
passes the entire set of parameters to the static method,
String.Format(). This is the same method
that you would call if you wanted to format these values for use in
a string to be displayed in a text box, for example. The
implementation of the three-parameter overload of WriteLine() basically does this:
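The listing is missing here, and the real framework source is not quoted. The following is only a sketch of the behavior described: format first with String.Format(), then hand the finished string to the one-parameter WriteLine(). The class WriteLineSketch and its method are hypothetical stand-ins, not the framework's implementation.

```csharp
using System;
using System.IO;

static class WriteLineSketch
{
    // Hypothetical stand-in for the three-parameter overload.
    public static void WriteLine(TextWriter writer, string format, object arg0, object arg1)
    {
        // All formatting work is delegated to String.Format().
        string text = string.Format(writer.FormatProvider, format, arg0, arg1);

        // The one-parameter overload writes the string without further formatting.
        writer.WriteLine(text);
    }

    static void Main()
    {
        WriteLine(Console.Out, "The double is {0,10:E} and the int contains {1}", 13.45, 45);
    }
}
```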
The one-parameter overload of this method, which is
in turn called in the preceding code sample, simply writes out the
contents of the string it has been passed, without doing any
further formatting on it.
String.Format() now
needs to construct the final string by replacing each format
specifier with a suitable string representation of the
corresponding object. However, as you saw earlier, for this process
of building up a string, you need a StringBuilder instance rather than a
string instance. In this example, a StringBuilder instance is created and
initialized with the first known portion of the string, the text
"The double is ". Next, the StringBuilder.AppendFormat() method is called,
passing in the first format specifier, {0,10:E}, as well as the associated object,
double, in order to add the string
representation of this object to the string object being
constructed. This process continues with StringBuilder.Append() and
StringBuilder.AppendFormat() being called repeatedly
until the entire formatted string has been obtained.
Now comes the interesting part; StringBuilder.AppendFormat() has to figure out how
to format the object. First, it probes the object to find out
whether it implements an interface in the System namespace called IFormattable. You can find this out quite simply by
trying to cast an object to this interface and seeing whether the
cast succeeds, or by using the C# is
keyword. If this test fails, AppendFormat() calls the object’s ToString() method, which all objects either inherit
from System.Object or override. This is
exactly what happens here, because none of the classes written so
far has implemented this interface. That is why the overrides of
Object.ToString() have been sufficient
to allow the structs and classes from earlier chapters such as
Vector to get displayed in Console.WriteLine() statements.
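The probing step described above can be sketched as follows. This is an illustration of the decision only, not the framework's code; FormatItem and PlainThing are hypothetical names.

```csharp
using System;
using System.Globalization;

// A type that only overrides Object.ToString() and does not implement IFormattable.
class PlainThing
{
    public override string ToString()
    {
        return "plain object";
    }
}

static class FormatProbe
{
    public static string FormatItem(object item, string format, IFormatProvider provider)
    {
        // Use the two-parameter overload when the type supports it...
        if (item is IFormattable)
        {
            return ((IFormattable)item).ToString(format, provider);
        }
        // ...otherwise fall back to the ToString() every object has.
        return item.ToString();
    }

    static void Main()
    {
        Console.WriteLine(FormatItem(255, "X", CultureInfo.InvariantCulture));              // FF
        Console.WriteLine(FormatItem(new PlainThing(), "X", CultureInfo.InvariantCulture)); // plain object
    }
}
```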
However, all of the predefined primitive numeric
types do implement this interface, which means that for those
types, and in particular for double and
int in the example, the basic
ToString() method inherited from
System.Object will not be called. To
understand what happens instead, you need to examine the
IFormattable interface.
IFormattable defines
just one method, which is also called ToString(). However, this method takes two
parameters as opposed to the System.Object version, which doesn’t take any
parameters. The following code shows the definition of IFormattable:
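The listing is missing from this copy. The shape of the interface is shown below as a comment, so that the sketch compiles alongside the real System.IFormattable, followed by a call made through the interface. The helper FormatViaInterface is an assumption added for illustration.

```csharp
using System;
using System.Globalization;

// Shape of the interface declared in the System namespace:
//
//   public interface IFormattable
//   {
//       string ToString(string format, IFormatProvider formatProvider);
//   }

static class FormattableDemo
{
    // Calls the two-parameter overload through the interface.
    public static string FormatViaInterface(IFormattable value, string format)
    {
        return value.ToString(format, CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        double d = 13.45; // illustrative value
        Console.WriteLine(FormatViaInterface(d, "E"));   // scientific notation
        Console.WriteLine(FormatViaInterface(42, "D5")); // 00042
    }
}
```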
The first parameter that this overload of
ToString() expects is a string that
specifies the requested format. In other words, it is the specifier
portion of the string that appears inside the braces ({}) in the string
originally passed to Console.WriteLine() or String.Format(). In the example,
when evaluating the first specifier,
{0,10:E}, this overload is called
against the double variable,
d, and the first parameter passed to it
is E. StringBuilder.AppendFormat() passes in the
text that appears after the colon in the appropriate format
specifier from the original string.
We won’t worry about the second ToString() parameter in this book. It is a reference
to an object that implements the IFormatProvider interface. This interface gives
further information that ToString()
might need to consider when formatting the object such as
culture-specific details (a .NET culture is similar to a Windows
locale; if you are formatting currencies or dates, you need this
information). If you are calling this ToString() overload directly from your source code,
you might want to supply such an object. However, StringBuilder.AppendFormat() passes in null for this parameter. If formatProvider is null,
then ToString() is expected to use the
culture specified in the system settings.
Getting back to the example, the first item you
want to format is a double, for which
you are requesting exponential notation, with the format specifier
E. The StringBuilder.AppendFormat() method establishes that
the double does implement IFormattable, and will therefore call the
two-parameter ToString() overload,
passing it the string E for the first
parameter and null for the second
parameter. It is now up to the double’s implementation of this
method to return the string representation of the double in the
appropriate format, taking into account the requested format and
the current culture. StringBuilder.AppendFormat() will then sort out
padding the returned string with spaces, if necessary, to fill the
10 characters the format string specified.
The next object to be formatted is an int, for which you are not requesting any particular
format (the format specifier was simply {1}). With no format requested, StringBuilder.AppendFormat() passes in a null
reference for the format string. The two-parameter overload of
int.ToString() is expected to respond
appropriately. No format has been specifically requested;
therefore, it will call the no-parameter ToString() method.
This entire string formatting process is summarized
in Figure 8-2.
The FormattableVector Example
Now that you know how format strings are
constructed, in this section you extend the Vector example from earlier in the book, so that you
can format vectors in a variety of ways. You can download the code
for this example from www.wrox.com. With the
principles understood, the actual coding is quite
simple. All you need to do is implement IFormattable and supply an implementation of the
ToString() overload defined by that
interface.
The format specifiers you are going to support
are:
- N: Should be
interpreted as a request to supply a quantity known as the
Norm of the Vector. This is just the sum of squares of its
components, which for mathematics buffs happens to be equal to the
square of the length of the Vector, and
is usually displayed between double vertical bars, like this:
||34.5||.
- VE: Should be
interpreted as a request to display each component in scientific
format, just as the specifier E applied
to a double indicates (2.3E+01,
4.5E+02, 1.0E+00).
- IJK: Should be
interpreted as a request to display the vector in the form
23i + 450j + 1k.
- Anything else should simply return the
default representation of the Vector (23, 450,
1.0).
To keep things simple, you are not going to
implement any option to display the vector in combined IJK and scientific format. You will, however, make
sure you test the specifier in a case-insensitive way, so that you
allow ijk instead of IJK. Note that it is entirely up to you which
strings you use to indicate the format specifiers.
To achieve this, you first modify the declaration
of Vector so it implements IFormattable:
Now you add your implementation of the
two-parameter ToString() overload:
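The original listings for this example are not reproduced in this copy. The following self-contained sketch follows the description in this section: the declaration implementing IFormattable, the two-parameter ToString(), the no-parameter ToString(), a Norm() method, and a little test code. The exact output layout (spacing, brackets), the constructor, the test values, and the passing of formatProvider through to the components are assumptions, not the downloadable code.

```csharp
using System;
using System.Globalization;
using System.Text;

struct Vector : IFormattable
{
    public double x, y, z;

    public Vector(double x, double y, double z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    // Sum of the squares of the components; reported by the N specifier.
    public double Norm()
    {
        return x * x + y * y + z * z;
    }

    // Default representation, also used for unknown or missing specifiers.
    public override string ToString()
    {
        return "(" + x + ", " + y + ", " + z + ")";
    }

    public string ToString(string format, IFormatProvider formatProvider)
    {
        // Check for null before calling any method on format.
        if (format == null)
        {
            return ToString();
        }

        // Compare in upper case so that ijk and IJK are treated alike.
        switch (format.ToUpperInvariant())
        {
            case "N":
                return "|| " + Norm().ToString(formatProvider) + " ||";
            case "VE":
                return string.Format(formatProvider, "( {0:E}, {1:E}, {2:E} )", x, y, z);
            case "IJK":
                StringBuilder sb = new StringBuilder(30);
                sb.Append(x.ToString(formatProvider)).Append(" i + ");
                sb.Append(y.ToString(formatProvider)).Append(" j + ");
                sb.Append(z.ToString(formatProvider)).Append(" k");
                return sb.ToString();
            default:
                return ToString();
        }
    }
}

static class FormattableVectorDemo
{
    static void Main()
    {
        Vector v = new Vector(1, 32, 5); // illustrative values
        Console.WriteLine("In IJK format: {0,30:IJK}", v);
        Console.WriteLine("In VE format:  {0:VE}", v);
        Console.WriteLine("Norm:          {0:N}", v);
        Console.WriteLine("Default:       {0}", v);
    }
}
```

Because Vector implements IFormattable, the specifier after the colon in each placeholder reaches the two-parameter ToString() overload, while the alignment value (30 in the first call) is applied afterward by the formatting machinery.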
That is all you have to do! Notice how you take the
precaution of checking whether format is null before you call any methods against this
parameter - you want this method to be as robust as reasonably
possible. The format specifiers for all the primitive types are
case insensitive, so that’s the behavior that other developers are going to expect from your
class, too. For the format specifier VE,
you need each component to be formatted in scientific notation, so
you just use String.Format() again to
achieve this. The fields x, y, and z are all doubles.
For the case of the IJK format
specifier, there are quite a few substrings to be added to the
string, so you use a StringBuilder
object to improve performance.
For completeness, you also reproduce the
no-parameter ToString() overload
developed earlier:
Finally, you need to add a Norm() method that computes the square (norm) of the
vector, because you didn’t actually supply this method when you
developed the Vector struct:
Now you can try out your formattable vector with
some suitable test code:
The result of running this sample is this:
This shows that your custom specifiers are
being picked up correctly.