It is a big old world, full of many varied characters.
There are familiar friends like A, B and C. There are also Chinese characters and Cyrillic characters. We might even want to use Klingon characters!
There are well over 65536 different characters that a computer might have to handle. Every possible character has been assigned its own number, called a ‘Unicode code point’. You can see them all here.
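For example, here is a tiny sketch of the idea (the values are taken from the Unicode charts; it assumes <stdio.h> and a Windows build where wchar_t holds one code point):

// a character is just a number, its Unicode code point
wchar_t a = L'A';           // Latin capital A, code point U+0041
wchar_t sheng = L'\x751f';  // a Chinese character, code point U+751F, used again later in this post
printf( "A is U+%04X, \\x751f is U+%04X\n", (unsigned) a, (unsigned) sheng );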
A byte is a collection of 8 ones or zeros, the basic unit of computer memory. A byte can hold 256 different values (the numbers 0 to 255). In the old days when the world was smaller and simpler, about 20 or 30 years ago, this was sufficient to store A, B and C and their familiar friends. So each character was stored in one byte.
Nowadays we need to include the whole world and even the occasional Klingon. We need more space for our characters and their new friends.
Two bytes are called a word, and can be handled conveniently by most computers. A word can hold 65536 different values. That is almost enough for every character. So, in the system called UTF-16 (because a word contains 16 ones or zeros), most characters are stored in a single word, and the overflow characters are stored in a pair of words.
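To make that concrete, here is a small sketch (assuming a Windows build where wchar_t is a 16-bit word; the smiley face is just an illustrative example of an ‘overflow’ character):

// characters with code points below 0x10000 fit in a single word
wchar_t latin[]   = L"A";          // U+0041 - one word
wchar_t chinese[] = L"\x751f";     // U+751F - still one word

// characters above 0xFFFF overflow into a pair of words
// U+1F600 ( a smiley face ) is stored as the pair 0xD83D 0xDE00
wchar_t emoji[]   = L"\xd83d\xde00";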
The Microsoft Windows operating system uses UTF-16 to handle Unicode characters. Here is how a UTF-16 encoded Unicode string is created in the C programming language on Windows:
wchar_t * ws = L"Hello World";
There is a snag. Although Windows computers use UTF-16 internally, they cannot communicate easily with each other using UTF-16. This is because, although every computer agrees on the order in which the ones and zeros of a byte should be arranged, they do not all agree on the order in which the bytes in a word should be arranged. In a reference to Jonathan Swift’s novel ‘Gulliver’s Travels’, where characters fought over which end an egg should be opened, the two ways of arranging the bytes in a word are called ‘Big Endian’ and ‘Little Endian’. When communicating with each other, computers use another standard called UTF-8, where each Unicode character is encoded as a series of bytes in a specified order which is the same whether the computer is ‘Big Endian’ or ‘Little Endian’.
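As an illustration (the byte values below are the standard UTF-8 encoding of this code point, one of the characters used in the test program later in the post):

// the Chinese character U+751F always travels as these same three bytes,
// in this order, on every machine
const char utf8[] = "\xe7\x94\x9f";

// held in UTF-16 it is a single 16-bit word, and the order of that word's two
// bytes in memory depends on whether the machine is Big Endian or Little Endian
const wchar_t utf16 = L'\x751f';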
When a computer program needs to communicate with another computer, perhaps by reading or writing a web page, it must constantly convert back and forth between UTF-8 and UTF-16 encoded character strings. The Windows API provides C programming language routines for doing this: WideCharToMultiByte() and MultiByteToWideChar(). However, they are a pain to use. Each conversion requires two calls to the routines and you have to look after allocating/freeing memory and making sure the strings are correctly terminated. We need a wrapper!
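To see why a wrapper helps, here is roughly what one direction of the conversion looks like with the raw API (a sketch only; the function name is made up for illustration, it assumes <windows.h> and <stdlib.h>, and it passes -1 as the length so the API counts and copies the terminating null for us):

void show_raw_conversion()
{
    const wchar_t * ws = L"Hello World";

    // first call: ask how many bytes the UTF-8 version will need
    int len = WideCharToMultiByte( CP_UTF8, 0, ws, -1, NULL, 0, NULL, NULL );

    // allocate a buffer of that size
    char * s = (char *) malloc( len );

    // second call: do the conversion
    WideCharToMultiByte( CP_UTF8, 0, ws, -1, s, len, NULL, NULL );

    // ... use the UTF-8 string ...

    free( s );
}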
Here is the interface to my wrapper:
/** Conversion between UTF-8 and UTF-16 strings.

UTF-8 is used by web pages. It is a variable byte length encoding of UNICODE
characters which is independent of the byte order in a computer word.

UTF-16 is the native Windows UNICODE encoding.

The class stores two copies of the string, one in each encoding,
so should only exist briefly while conversion is done.

This is a wrapper for the WideCharToMultiByte and MultiByteToWideChar
Windows API functions.
*/
class cUTF
{
    wchar_t * myString16;     ///< string in UTF-16
    char * myString8;         ///< string in UTF-8
public:
    /// Construct from UTF-16
    cUTF( const wchar_t * ws );
    /// Construct from UTF-8
    cUTF( const char * s );
    /// get UTF-16 version
    const wchar_t * get16() { return myString16; }
    /// get UTF-8 version
    const char * get8() { return myString8; }
    /// free buffers
    ~cUTF() { free( myString8 ); free( myString16 ); }
};
Here is the code that implements this interface:
/// Construct from UTF-16
cUTF::cUTF( const wchar_t * ws )
{
    // store copy of UTF16
    myString16 = (wchar_t * ) malloc( wcslen( ws ) * 2 + 2 );
    wcscpy( myString16, ws );

    // How long will the UTF-8 string be
    int len = WideCharToMultiByte(CP_UTF8, 0, ws, wcslen( ws ),
                                  NULL, NULL, NULL, NULL );
    // allocate a buffer
    myString8 = (char * ) malloc( len + 1 );
    // convert to UTF-8
    WideCharToMultiByte(CP_UTF8, 0, ws, wcslen( ws ),
                        myString8, len, NULL, NULL);
    // null terminate
    *(myString8+len) = '\0';
}

/// Construct from UTF8
cUTF::cUTF( const char * s )
{
    myString8 = (char * ) malloc( strlen( s ) + 1 );
    strcpy( myString8, s );

    // How long will the UTF-16 string be
    int len = MultiByteToWideChar(CP_UTF8, 0, s, strlen( s ),
                                  NULL, NULL );
    // allocate a buffer
    myString16 = (wchar_t * ) malloc( len * 2 + 2 );
    // convert to UTF-16
    MultiByteToWideChar(CP_UTF8, 0, s, strlen( s ),
                        myString16, len);
    // null terminate
    *(myString16+len) = L'\0';
}
And here is some code to test the wrapper:
// create a native unicode string with some chinese characters
wchar_t * unicode_string = L"String with some chinese characters \x751f\x4ea7\x8bbe\x7f6e ";

// convert to UTF8
cUTF utf( unicode_string );

// create a web page
FILE * fp = fopen("test_unicode.html","w");

// let browser know we are using UTF-8
fprintf(fp,"<head><meta http-equiv=\"Content-Type\" "
           "content=\"text/html;charset=UTF-8\"></head>\n");

// output the converted string
fprintf(fp, "After conversion using cUTF8 - %s<p>\n", utf.get8() );

fclose(fp);
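For completeness, conversion in the other direction might look something like this (the literal below is the UTF-8 encoding of the same four Chinese characters; MessageBoxW is just one example of a Windows API call that expects a UTF-16 string):

// suppose we have read a UTF-8 string, perhaps from a web page
const char * utf8_string = "\xe7\x94\x9f\xe4\xba\xa7\xe8\xae\xbe\xe7\xbd\xae";

// convert to UTF-16
cUTF utf16( utf8_string );

// pass the UTF-16 version to the Windows API
MessageBoxW( NULL, utf16.get16(), L"Converted from UTF-8", MB_OK );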
Instead of malloc (len *2 + 2) you should use malloc( (len+1) * sizeof(wchar_t) ). Not every platform has wchar_t defined as 2 bytes.
Thank you for your comment. The code uses the Windows API. It does not apply to any other platform.
This line:
*(myString16+len) = '\0';
should be
*(myString16+len) = L'\0';
otherwise strange things may happen.
Thanks. Well spotted! I have edited the code in the blog and in the repository.