Yet another String class

This library defines a String class that stores a non-null-terminated UTF-8 character code array. Inventing yet another string class may look like a waste of time, and you may not want to learn a new one, but we did it because we found several security concerns in the design of the ISO C++ std::string class. The String class in this library zero-clears the memory used in string operations to reduce the risk of sensitive data being sniffed from memory (for example, by dumping memory blocks).
If you need to keep sensitive information (such as passwords) in memory for a relatively long time, use SecureString instead of String. SecureString encrypts the string to keep it secure.
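
The zero-clearing idea can be sketched in standard C++. This is only an illustration under stated assumptions, not the library's implementation: secureZero and wipeAndClear are hypothetical names, and writing through a volatile pointer is a commonly used way to discourage the compiler from removing the stores.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: overwrite `size` bytes at `ptr` with zeros.
// The stores go through a volatile pointer so that the compiler is
// discouraged from discarding them as dead writes.
inline void secureZero(void* ptr, std::size_t size)
{
    volatile unsigned char* p = static_cast<volatile unsigned char*>(ptr);
    for (std::size_t i = 0; i < size; ++i) {
        p[i] = 0;
    }
}

// Hypothetical helper: wipe a character buffer, then drop its contents.
inline void wipeAndClear(std::vector<char>& buffer)
{
    if (!buffer.empty()) {
        secureZero(buffer.data(), buffer.size());
    }
    buffer.clear();
}
```

The point of the sketch is the ordering: the bytes are overwritten while the buffer is still owned, and only then is the buffer released.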

Initializing String instances

The String class has several constructors and can be initialized with a char or wchar_t string. The actual charset and encoding of a char string usually depend on the locale and the platform, so when a String instance is initialized with a char string, the string is automatically converted into UTF-8 (of UCS-4) using the system function.

String from_char = "This is a sample string."; // from char

Similarly, the charset and encoding of wchar_t are also environment dependent. On Windows, a wchar_t string uses UTF-16, and you can convert a wchar_t string into a String instance as in the following code:

String from_wchart = L"This is a sample string."; // from wchar_t

Since wchar_t is not identical to UTF-16 in general, you should not use wchar_t in non-Windows environments. Use UChar2 for UTF-16 (of UCS-4) strings and UChar4 for UTF-32 (of UCS-4) strings. You can assume that UChar2 is always 16 bits and UChar4 is always 32 bits (regardless of the endianness). The code below illustrates how to use them:

const UChar2 sample_1[] = {
    0xfeff, 0x30bb, 0x30e9, 0x30fc, 0x30c6, 0x30e0, 0x306f, 0x65e5,
    0x672c, 0x306e, 0x4f1a, 0x793e, 0x3067, 0x3059, 0x3002, 0x000d,
    0x000a, 0x0000}; // 0x0000 terminator added, assuming the constructor expects a null-terminated array
String str(sample_1); // from UTF-16

By using UChar2 and UChar4, you can embed Unicode strings without depending on the platform-dependent encoding and charset.
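
To make the UTF-16 to UTF-8 conversion concrete, here is a minimal sketch in standard C++. It is not the library's converter: utf16ToUtf8 and appendUtf8 are hypothetical names, std::uint16_t stands in for UChar2, and the input is assumed to be zero-terminated.

```cpp
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
inline void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a zero-terminated UTF-16 sequence to UTF-8.
// A leading byte order mark (0xFEFF) is skipped, and an unpaired
// surrogate is replaced with U+FFFD.
inline std::string utf16ToUtf8(const std::uint16_t* src)
{
    std::string out;
    if (*src == 0xFEFF) {
        ++src;
    }
    while (*src != 0) {
        std::uint32_t cp = *src++;
        if (cp >= 0xD800 && cp <= 0xDBFF && *src >= 0xDC00 && *src <= 0xDFFF) {
            // Combine a high and a low surrogate into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (*src++ - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

Characters in the Basic Multilingual Plane occupy one 16-bit unit and up to three UTF-8 bytes; characters beyond it occupy a surrogate pair and four UTF-8 bytes.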

There is also a String constructor that receives a UTF-8 string. Since a UTF-8 string is also expressed as a char string, the UTF-8 constructor must be called explicitly through the utf8s proxy object:

String from_UTF8 = utf8s("This is a sample UTF-8 string.");

We define the NULL_STRING enumeration type and its NullString value to select a String constructor that initializes the instance with an empty string. The following code shows how to use them:

String str(NullString);
assert(str.isEmpty()); // str is empty.
String str2 = NullString; // This is also valid.
str2 = "Test test test";
str2 = NullString; // the result is identical to String::clear().

You can also use NullString as default value for function parameters:

void someFunction(const String& inString = NullString)
{
    ...
}
someFunction(); // call the function without any explicit parameter.

You can also initialize a String from another string that is not terminated by '\0' or a portion of a terminated string. The following code illustrates this:

String str1("This is test!", 4); // The string is initialized with "This".

To initialize a String from a portion of another String, use the String::substring or String::substringByChar function:

String str1 = "This is test string";

String str2 = str1.substring(8, 4); // str2 is "test".

The difference between String::substring and String::substringByChar is discussed in Character Count vs. String Length.

Concatenation of Strings

Concatenating several String instances is very easy. All you have to do is use the + operator:

String str1 = "This";
String str2 = "sample code";
String str3 = str1 + " is a " + str2 + "."; // "This is a sample code."

You can also use traditional printf syntax with String through the format or format_utf8 function. The format function treats its string parameters as normal char strings, and format_utf8 treats them as UTF-8 strings.

String str = format("A is %u\n", a);

String Length/Size

The byte size of the string is obtained with the String::getLength function and the character count with the String::getNumOfChars function. Neither the length nor the character count includes the terminating '\0' character.

String str = "This is test string!";

size_t length = str.getLength();

The difference between the length and the character count is discussed in Character Count vs. String Length.

Comparison of two Strings

String has several comparison operators and you can easily compare two String instances:

String str1 = "test", str2 = "TEST";

if(str1 < str2) ...

There are also the String::compare (case-sensitive) and String::compareI (case-insensitive) functions.
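
As an illustration of what a case-insensitive three-way comparison involves, here is a minimal sketch over std::string. It only folds the ASCII letters A to Z and makes no claim about how String::compareI treats non-ASCII characters; compareIgnoreCaseAscii and asciiLower are hypothetical names.

```cpp
#include <cstddef>
#include <string>

// Fold an ASCII upper-case letter to lower case; other bytes are unchanged.
inline unsigned char asciiLower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c + ('a' - 'A')) : c;
}

// Three-way comparison that ignores ASCII case.
// Returns a negative value, zero, or a positive value, like std::strcmp.
inline int compareIgnoreCaseAscii(const std::string& a, const std::string& b)
{
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const int ca = asciiLower(static_cast<unsigned char>(a[i]));
        const int cb = asciiLower(static_cast<unsigned char>(b[i]));
        if (ca != cb) {
            return ca - cb;
        }
    }
    if (a.size() == b.size()) {
        return 0;
    }
    return a.size() < b.size() ? -1 : 1;
}
```

With a plain byte-wise comparison, "TEST" sorts before "test" because upper-case ASCII letters have smaller codes; the case-insensitive version treats the two as equal.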

Referencing and Modification of a Character on a String

You can read and modify individual characters of a String instance with the [] operator:

String str = "help me!";
UChar1 chr = str[1]; // returns 'e'. (value)
str[0] = 'H'; // str becomes "Help me!". (reference)

The [] operator returns a reference to, or the value of, the character at the specified offset address in the string. The drawback of writing a character through this operator is that it may duplicate the whole string just to modify that portion if the string is referenced by several instances. If you want to modify the whole string efficiently, see Allocate Character Array.
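
The duplication cost is typical of copy-on-write sharing. The sketch below shows that general technique with std::shared_ptr; it is an assumption about the kind of mechanism involved, not the library's actual implementation, and CowString is a hypothetical class.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Minimal copy-on-write string: copies share one buffer until a write occurs.
class CowString {
public:
    explicit CowString(const std::string& s)
        : data_(std::make_shared<std::string>(s)) {}

    // Read access never copies the buffer.
    char at(std::size_t i) const { return (*data_)[i]; }

    // Write access duplicates the whole buffer first if it is shared.
    void set(std::size_t i, char c)
    {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::string>(*data_);
        }
        (*data_)[i] = c;
    }

    const std::string& str() const { return *data_; }

    bool sharesBufferWith(const CowString& other) const
    {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<std::string> data_;
};
```

Copying a CowString is cheap, but the first single-character write to a shared instance pays for a full copy of the buffer, which is why bulk modification is better done on a privately owned buffer.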

There is also a way to get the character code at a specified character position. The () operator returns the UCS-4 character code at the specified UCS-4 character index:

String str = "help me!";

UChar4 chr = str(5); // get character code of 'm'.

The difference between an offset address and a UCS-4 character position is discussed in Character Count vs. String Length.
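
Looking up a character by index in UTF-8 means walking the string from the start, because characters have variable byte lengths. The following sketch shows one way to do this over std::string; codePointAt and utf8SequenceLength are hypothetical names, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
inline std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1; // invalid lead byte: step over it as a single unit
}

// Decode the code point at character index `charIndex` (not a byte offset).
// Returns 0xFFFFFFFF if the index is out of range or the sequence is cut off.
inline std::uint32_t codePointAt(const std::string& s, std::size_t charIndex)
{
    // Skip `charIndex` whole characters to find the byte offset.
    std::size_t pos = 0;
    for (std::size_t i = 0; i < charIndex && pos < s.size(); ++i) {
        pos += utf8SequenceLength(static_cast<unsigned char>(s[pos]));
    }
    if (pos >= s.size()) return 0xFFFFFFFFu;

    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    const std::size_t len = utf8SequenceLength(lead);
    if (pos + len > s.size()) return 0xFFFFFFFFu;

    // Payload bits of the lead byte, indexed by sequence length.
    static const unsigned char leadMask[] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    std::uint32_t cp = lead & leadMask[len];
    for (std::size_t k = 1; k < len; ++k) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + k]) & 0x3F);
    }
    return cp;
}
```

For a pure ASCII string the character index and the byte offset coincide, so index 5 of "help me!" is 'm' either way; they diverge as soon as a multi-byte character precedes the position.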

Getting Pointer to the String

You can get a pointer to the raw UTF-8 string using the String::c_str function:

String str = "This is sample.";

const UChar1 *p = str.c_str();

The pointer is valid until you call any function that accesses the String instance.

Eliminating Unnecessary Spaces from Strings

There are several functions that eliminate unnecessary white space characters (spaces, tabs, and line feeds) from String instances.
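
The library's own function names are not listed here, so the following is only a generic sketch of the operation over std::string. trimWhitespace is a hypothetical name, and treating carriage returns as white space is an assumption of the sketch.

```cpp
#include <cstddef>
#include <string>

// Remove leading and trailing spaces, tabs, carriage returns and line feeds.
inline std::string trimWhitespace(const std::string& s)
{
    const char* const ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) {
        return std::string(); // the string is empty or all white space
    }
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}
```

Because all of these characters are single-byte ASCII, trimming can safely work on bytes even when the rest of the string contains multi-byte UTF-8 characters.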

Allocate Character Array

If your String manipulation code works directly on a traditional character code array, you can use the allocate function. The following code illustrates how to use it:

String str;
// allocate the buffer; 255 means the actual size allocated is 256 bytes.
// The function automatically makes room for the null terminator and
// initializes it with '\0'.
UChar1 *pstr = str.allocate(255);
// copy a string to the buffer.
std::strcpy(pstr, "This code works!");

In this case, you must handle the UTF-8 encoding yourself.

String Functions

There are also several String functions.

Regular Expression

You can also use regular expressions for searching and matching strings. For more information, see Regular Expression.

Charset Conversion

Since the String class adopts UTF-8 as its intermediate charset, strings often need to be converted into the platform-native charsets. This library provides the UtfConverter class and convenient String methods (String::toMbs, String::toWcs, String::toUcs2 and String::toUcs4) that wrap UtfConverter.
String::toMbs converts the String instance into a multibyte string pointed to by const char*.
String::toWcs converts the String instance into a wide character string pointed to by const wchar_t*.
String::toUcs2 converts the String instance into a UCS-2 character string pointed to by const UChar2*.
String::toUcs4 converts the String instance into a UCS-4 character string pointed to by const UChar4*.

Charset Conversion on Windows

On Windows, most programs can work with String::toMbs and/or String::toWcs. If you use TCHAR from tchar.h, however, you can use the TO_TCS macro to call String::toMbs or String::toWcs indirectly. It behaves as String::toWcs if the UNICODE macro is defined, and as String::toMbs otherwise.
The following code illustrates how to use them, and how a converted pointer can become invalid:

void some_function(const char *string);
String str = "This is sample!";
// Good: the result of toMbs is used immediately.
some_function(str.toMbs());
// With the Win32 API, TO_TCS works correctly with generic character mapping.
LPCTSTR pstr = TO_TCS(str);
// Bad: temp points to an internal buffer of str at this moment.
const char *temp = str.toMbs();
// This replaces the contents of str, so temp now points to a released (invalid) buffer.
str = "Modify the original string";
// Something goes wrong here...
some_function(temp);

Character Count vs. String Length

In this library, character count and string length are different things.
The character count is the number of UCS-4 characters in the string. The string length is the number of UChar1 entries in the string. Neither includes the terminating null character ('\0').
In a 7-bit ASCII string the two values are equal, because a single UChar1 entry can hold the UCS-4 character code of any 7-bit ASCII character. For other characters, UTF-8 uses two or more UChar1 entries per character, so the string length is larger than the character count.
The String::substring function extracts a substring based on a position and length counted in UChar1 entries, and the String::substringByChar function does the same based on a position and count in UCS-4 characters. Likewise, the String::getLength function returns the number of UChar1 entries and the String::getNumOfChars function returns the number of UCS-4 characters in the string (neither includes the terminating null).
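
The distinction can be demonstrated with plain std::string holding UTF-8. The helpers below are hypothetical and only mirror the byte-based versus character-based behavior described above; they assume well-formed UTF-8 and are not the library's implementation.

```cpp
#include <cstddef>
#include <string>

// True if `c` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool isContinuationByte(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Character count: every byte that is not a continuation byte starts
// exactly one character in well-formed UTF-8.
inline std::size_t countChars(const std::string& s)
{
    std::size_t n = 0;
    for (char c : s) {
        if (!isContinuationByte(c)) {
            ++n;
        }
    }
    return n;
}

// Byte offset at which character number `charIndex` starts
// (s.size() if the index is past the end).
inline std::size_t byteOffsetOfChar(const std::string& s, std::size_t charIndex)
{
    std::size_t pos = 0;
    while (pos < s.size() && charIndex > 0) {
        ++pos; // step over the lead byte
        while (pos < s.size() && isContinuationByte(s[pos])) {
            ++pos; // step over its continuation bytes
        }
        --charIndex;
    }
    return pos;
}

// Substring addressed in characters rather than bytes.
inline std::string substringByChar(const std::string& s,
                                   std::size_t charPos, std::size_t charCount)
{
    const std::size_t begin = byteOffsetOfChar(s, charPos);
    const std::size_t end = byteOffsetOfChar(s, charPos + charCount);
    return s.substr(begin, end - begin);
}
```

For a string made of three Japanese characters followed by "abc", s.size() reports 12 bytes while countChars reports 6 characters, and the second and third characters occupy bytes 3 to 8.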