Work with Unicode
XMLBlueprint allows you to create and open documents in several different encoding standards. An encoding standard is a set of rules that assigns a numeric value to each character in the file.
Many encoding standards exist to represent the character sets used in different languages, and some encoding standards support the characters used only in a particular language. For example, a text written in Simplified Chinese might use the GB2312-80 encoding standard, while a text written in Traditional Chinese might use Big5.
XMLBlueprint supports ANSI and Unicode encoding standards.
When you open a file, the encoding standard is detected automatically and displayed in the status bar. When you save a file for the first time, you must choose the encoding standard you want to use.
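How XMLBlueprint detects the encoding is not described here. One common technique, shown purely as an illustration, is to look for a byte order mark (BOM) at the start of the file. The function name sniff_bom and its fallback value are assumptions made for this sketch, not part of XMLBlueprint.

```python
import codecs

# Hypothetical helper: guess an encoding from a leading byte order mark (BOM).
# This is a common technique, not necessarily what XMLBlueprint does.
def sniff_bom(data: bytes, default: str = "utf-8") -> str:
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return default  # no BOM found; the caller has to decide what to assume

print(sniff_bom(b"\xfe\xff\x83\x36"))  # utf-16-be
print(sniff_bom(b"\xff\xfe\x36\x83"))  # utf-16-le
print(sniff_bom(b"<root/>"))           # utf-8 (the assumed default)
```

A file without a BOM cannot be identified this way, which is why the sketch falls back to a default.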
As an example, the sections below show how the Chinese character for tea is encoded in each of these encoding forms.
茶 (tea)
ANSI
ANSI uses a single byte for each character. Only 256 characters are supported (2^8). There's no support for Chinese characters.
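The limitation can be sketched as follows, assuming the Windows-1252 code page (Python codec cp1252) as a representative ANSI encoding. The document does not say which code page XMLBlueprint uses, and the helper name fits_in_ansi is hypothetical.

```python
# Hypothetical helper: can every character in `text` be stored in one byte
# under the given single-byte code page? cp1252 is an assumed example.
def fits_in_ansi(text: str, codepage: str = "cp1252") -> bool:
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

tea = "\u8336"  # the Chinese character for tea

print("\u00e9".encode("cp1252"))  # b'\xe9' (e with acute accent, one byte)
print(fits_in_ansi("\u00e9"))     # True
print(fits_in_ansi(tea))          # False
```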
Unicode
Unicode is a character encoding standard developed by the Unicode Consortium that represents almost all of the written languages of the world.
The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32 bits per code unit):
• UTF-8
• UTF-16 (Big Endian / Little Endian)
• UTF-32
UTF-32 is not supported by XMLBlueprint.
UTF-8
UTF-8 is popular for HTML and similar formats. UTF-8 transforms every Unicode character into a variable-length sequence of one to four bytes. It has two advantages: the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and Unicode characters transformed into UTF-8 can be used with much existing software without extensive rewrites.
The Chinese character for tea is encoded as [E8 8C B6].
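The byte sequence above can be reproduced with a short sketch that uses Python's built-in utf-8 codec. The helper name hex_bytes is introduced only for this illustration.

```python
# Illustrative helper: format bytes as space-separated uppercase hex.
def hex_bytes(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)

tea = "\u8336"  # the Chinese character for tea, code point U+8336

print(hex_bytes(tea.encode("utf-8")))  # E8 8C B6

# ASCII characters keep their single-byte ASCII values in UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```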
UTF-16
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
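The split between single code units and pairs can be sketched as follows. The function name utf16_code_units is hypothetical, and the teacup symbol U+1F375 is chosen here only as an example of a character that does not fit in 16 bits.

```python
# Hypothetical helper: return the UTF-16 code units for one character.
def utf16_code_units(ch: str) -> list:
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                 # fits in a single 16-bit code unit
    cp -= 0x10000                   # 20 bits remain
    high = 0xD800 + (cp >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (cp & 0x3FF)     # bottom 10 bits -> low surrogate
    return [high, low]

print([f"{u:04X}" for u in utf16_code_units("\u8336")])      # ['8336']
print([f"{u:04X}" for u in utf16_code_units("\U0001F375")])  # ['D83C', 'DF75']
```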
UTF-16 has two variants: UTF-16BE (Big Endian) and UTF-16LE (Little Endian).
UTF-16BE
Documents created on a Macintosh or Unix platform have traditionally been encoded as UTF-16BE (Big Endian), in which the most significant byte of each 16-bit code unit comes first.
The Chinese character for tea is encoded as [83 36].
UTF-16LE
Documents created on a Windows platform are normally encoded as UTF-16LE (Little Endian), in which the least significant byte of each 16-bit code unit comes first.
The Chinese character for tea is encoded as [36 83].
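The two byte orders can be compared side by side with Python's built-in utf-16-be and utf-16-le codecs. This is a minimal sketch, and the helper name hex_bytes is introduced only for this illustration.

```python
# Illustrative helper: format bytes as space-separated uppercase hex.
def hex_bytes(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)

tea = "\u8336"  # the Chinese character for tea, code point U+8336

print(hex_bytes(tea.encode("utf-16-be")))  # 83 36
print(hex_bytes(tea.encode("utf-16-le")))  # 36 83

# Reading the bytes with the wrong byte order yields a different character.
wrong = tea.encode("utf-16-be").decode("utf-16-le")
print(f"U+{ord(wrong):04X}")               # U+3683
```

The same two bytes are present in both forms; only their order differs, which is why a reader has to know the byte order before decoding.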
Fonts
Some fonts cannot display all of the Unicode characters. If you see any characters missing in your text file, you can change the font to one that includes the character. Generally, Microsoft Sans Serif is a good choice for Unicode characters.