World Wide Characters

It is a big old world, full of many varied characters.

There are familiar friends like A, B and C. There are also Chinese characters and Cyrillic characters. We might even want to use Klingon characters!

There are over 65536 different characters that a computer might have to handle. Every possible character has been assigned its own number, called a ‘Unicode code point’: for example, A is U+0041 and the Chinese character 中 is U+4E2D. You can see them all in the code charts at unicode.org.

A byte is a collection of 8 ones or zeros, the basic unit of computer memory. A byte can hold 256 different values. In the old days, when the world was smaller and simpler, about 20 or 30 years ago, this was enough for A, B and C and their familiar friends, so each character was stored in one byte.

Nowadays we need to include the whole world, and even the occasional Klingon. We need more space for our characters and their new friends.

Two bytes are called a word, and words can be handled conveniently by most computers. A word can hold 65536 different values, which is almost enough for every character. So, in the system called UTF-16 (because a word contains 16 ones or zeros), almost every character is stored in one word, with the overflow stored in two words.
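
To make ‘the overflow stored in two words’ concrete, here is a small sketch. It assumes a Windows build, where wchar_t is a 16-bit word: an ordinary letter fits in one word, while a character outside the 16-bit range, such as U+10400, is stored as a pair of words known as a surrogate pair.

#include <stdio.h>
#include <wchar.h>

int main()
{
    // one UTF-16 word is enough for A ( U+0041 )
    const wchar_t * letter = L"A";
    // U+10400 does not fit in 16 ones or zeros, so UTF-16 stores it
    // as the two words 0xD801 0xDC00, called a surrogate pair
    const wchar_t * deseret = L"\U00010400";

    printf( "A       takes %d word(s)\n", (int)wcslen( letter ) );   // prints 1
    printf( "U+10400 takes %d word(s)\n", (int)wcslen( deseret ) );  // prints 2
    return 0;
}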

The Microsoft Windows operating system uses UTF-16 to handle Unicode characters. Here is how a UTF-16 encoded Unicode string is created in the C programming language:


wchar_t * ws = L"Hello World";

There is a snag. Although computers use UTF-16 internally, they cannot communicate easily with each other using UTF-16. This is because, although every computer agrees on the order in which the ones and zeros of a byte should be arranged, they do not all agree on the order in which the bytes of a word should be arranged. In a reference to Jonathan Swift’s novel ‘Gulliver’s Travels’, in which people fought over which end an egg should be opened from, the two ways of arranging the bytes in a word are called ‘Big Endian’ and ‘Little Endian’. When communicating with each other, computers use another standard called UTF-8, in which each Unicode character is encoded as a sequence of bytes in a specified order, the same whether the computer is ‘Big Endian’ or ‘Little Endian’.
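
Here is a small sketch of the difference, using U+4E2D (the Chinese character 中): the single UTF-16 word can sit in memory in either byte order, while the UTF-8 encoding is defined byte by byte and so reads the same on every machine.

#include <stdio.h>

int main()
{
    // U+4E2D as one UTF-16 word: its two bytes land in memory in
    // whatever order this machine prefers
    unsigned short utf16 = 0x4E2D;
    unsigned char * p = (unsigned char *)&utf16;
    printf( "UTF-16 bytes in memory : %02X %02X\n", p[0], p[1] );   // 2D 4E or 4E 2D

    // U+4E2D in UTF-8: a fixed sequence of three bytes, the same everywhere
    unsigned char utf8[] = { 0xE4, 0xB8, 0xAD };
    printf( "UTF-8 bytes            : %02X %02X %02X\n", utf8[0], utf8[1], utf8[2] );
    return 0;
}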

When a computer program needs to communicate with another computer, perhaps by reading or writing a web page, it must constantly convert back and forth between UTF-8 and UTF-16 encoded character strings. The Windows API provides C programming language routines for doing this: WideCharToMultiByte() and MultiByteToWideChar(). However, they are a pain to use. Each conversion requires two calls to the routines, and you have to look after allocating and freeing memory and making sure the strings are correctly terminated. We need a wrapper!

Here is the interface to my wrapper:

/**
Conversion between UTF-8 and UTF-16 strings.

UTF-8 is used by web pages.  It is a variable byte length encoding
of Unicode characters which is independent of the byte order in a computer word.

UTF-16 is the native Windows Unicode encoding.

The class stores two copies of the string, one in each encoding,
so it should only exist briefly while the conversion is done.

This is a wrapper for the WideCharToMultiByte and MultiByteToWideChar
Windows API functions.
*/
class cUTF
{
    wchar_t * myString16;   ///< string in UTF-16
    char * myString8;       ///< string in UTF-8
public:
    /// Construct from UTF-16
    cUTF( const wchar_t * ws );
    /// Construct from UTF-8
    cUTF( const char * s );
    /// get UTF-16 version
    const wchar_t * get16() { return myString16; }
    /// get UTF-8 version
    const char * get8() { return myString8; }
    /// free buffers
    ~cUTF() { free(myString8); free(myString16); }
};

Here is the code to implement this interface:

/// Construct from UTF-16
cUTF::cUTF( const wchar_t * ws )
{
    // store a copy of the UTF-16 string
    myString16 = (wchar_t *) malloc( wcslen( ws ) * 2 + 2 );
    wcscpy( myString16, ws );
    // how long will the UTF-8 string be?
    int len = WideCharToMultiByte(CP_UTF8, 0,
        ws, (int)wcslen( ws ),
        NULL, 0, NULL, NULL );
    // allocate a buffer
    myString8 = (char *) malloc( len + 1 );
    // convert to UTF-8
    WideCharToMultiByte(CP_UTF8, 0,
        ws, (int)wcslen( ws ),
        myString8, len, NULL, NULL);
    // null terminate
    *(myString8+len) = '\0';
}
/// Construct from UTF-8
cUTF::cUTF( const char * s )
{
    // store a copy of the UTF-8 string
    myString8 = (char *) malloc( strlen( s ) + 1 );
    strcpy( myString8, s );
    // how long will the UTF-16 string be?
    int len = MultiByteToWideChar(CP_UTF8, 0,
        s, (int)strlen( s ),
        NULL, 0 );
    // allocate a buffer
    myString16 = (wchar_t *) malloc( len * 2 + 2 );
    // convert to UTF-16
    MultiByteToWideChar(CP_UTF8, 0,
        s, (int)strlen( s ),
        myString16, len);
    // null terminate
    *(myString16+len) = L'\0';
}

And here is some code to test the wrapper:

// create a native Unicode string with some Chinese characters
const wchar_t * unicode_string =
    L"String with some Chinese characters \x751f\x4ea7\x8bbe\x7f6e ";

// convert to UTF8
cUTF utf( unicode_string );

// create a web page
FILE * fp = fopen("test_unicode.html","w");

// let browser know we are using UTF-8
fprintf(fp,"<head><meta http-equiv=\"Content-Type\" "
    "content=\"text/html;charset=UTF-8\"></head>\n");

// output the converted string
fprintf(fp, "After conversion using cUTF - %s<p>\n",
    utf.get8() );

fclose(fp);
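
The wrapper also works in the other direction. Here is a minimal sketch (it assumes the test_unicode.html file written above exists and fits in the buffer, and it skips error checking): the UTF-8 bytes read back from disk are turned into a native UTF-16 string by the second constructor.

// read the UTF-8 page back in ( no error checking, to keep the sketch short )
char buffer[1024];
FILE * fin = fopen("test_unicode.html","r");
size_t count = fread( buffer, 1, sizeof(buffer) - 1, fin );
buffer[count] = '\0';
fclose(fin);

// convert UTF-8 back to UTF-16 using the other constructor
cUTF back( buffer );
const wchar_t * ws_back = back.get16();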

4 Responses to World Wide Characters

  1. Mirco says:

    Instead of malloc (len *2 + 2) you should use malloc( (len+1) * sizeof(wchar_t) ). Not every platform has wchar_t defined as 2 bytes.

  2. Chris says:

    This line:
    *(myString16+len) = '\0';
    should be
    *(myString16+len) = L'\0';
    otherwise strange things may happen.
