Work with Unicode

XMLBlueprint lets you create and open documents in several encoding standards. An encoding standard is a set of rules that assigns a numeric value to each character in the file.

Many encoding standards exist to represent the character sets used in different languages, and some encoding standards support the characters used only in a particular language. For example, a text written in Simplified Chinese might use the GB2312-80 encoding standard, while a text written in Traditional Chinese might use Big5.

XMLBlueprint supports ANSI and Unicode encoding standards.

When you open a file, the encoding standard is detected automatically and displayed in the status bar. When you save a file for the first time, you must choose the encoding standard you want to use.
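One common way for an editor to detect the encoding is to look for a byte order mark (BOM) at the start of the file. The sketch below illustrates that general technique only. It is an assumption, not a description of how XMLBlueprint itself detects encodings, and the function name `sniff_bom` is hypothetical.

```python
import codecs
from typing import Optional

# BOM byte sequences for the Unicode encodings discussed in this topic.
# UTF-32 is left out on purpose, so a UTF-32LE file would be misreported here.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbf<root/>"))
print(sniff_bom(b"<root/>"))
```

A file without a BOM returns None here. A real tool would then need another signal, such as the encoding named in the XML declaration.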

As an example, we'll see how the Chinese character for tea is encoded in each of these encoding standards.

茶 (tea, U+8336)

ANSI

ANSI uses a single byte for each character, so only 256 (2^8) characters are supported. Chinese characters cannot be represented.
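The limitation can be seen in a short Python sketch. It assumes that "ANSI" means the Windows-1252 (Western European) code page, which Python calls `cp1252`. The helper name `fits_in_code_page` is hypothetical.

```python
TEA = "\u8336"  # the Chinese character for tea

def fits_in_code_page(text: str, code_page: str = "cp1252") -> bool:
    """Return True if every character in text can be encoded in the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        # At least one character has no slot among the 256 byte values.
        return False

print(fits_in_code_page("tea"))  # True: plain Latin letters are available
print(fits_in_code_page(TEA))    # False: the Chinese character is not
```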

Unicode

Unicode is a character encoding standard developed by the Unicode Consortium that represents almost all of the written languages of the world.

See http://www.unicode.org for more information.

The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte-, word- or double-word-oriented format (that is, in 8, 16 or 32 bits per code unit):


•	UTF-8
•	UTF-16 (Big Endian / Little Endian)
•	UTF-32

UTF-32 is not supported by XMLBlueprint.

UTF-8

UTF-8 is popular for HTML and similar formats. It transforms every Unicode character into a variable-length sequence of one to four bytes. It has two advantages: the Unicode characters that correspond to the familiar ASCII set have the same byte values as in ASCII, and Unicode text transformed into UTF-8 can be used with much existing software without extensive rewrites.

The Chinese character for tea is encoded as [E8 8C B6].
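As a quick check, the following Python sketch encodes the character with the standard `utf-8` codec and also shows that an ASCII letter keeps its single-byte value.

```python
TEA = "\u8336"  # the Chinese character for tea

utf8_bytes = TEA.encode("utf-8")
print(utf8_bytes.hex(" ").upper())  # E8 8C B6
print(len(utf8_bytes))              # 3 bytes for this character

# ASCII characters keep the same byte value in UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```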

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
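The difference between a single code unit and a pair can be sketched in Python. The helper `utf16_code_units` is a hypothetical name. The second test character, U+1F375 (a teacup emoji), is chosen here only because it lies outside the range that fits into one 16-bit unit.

```python
def utf16_code_units(ch: str) -> int:
    """Count the 16-bit code units needed to store ch in UTF-16."""
    # Encoding without a BOM yields exactly two bytes per code unit.
    return len(ch.encode("utf-16-be")) // 2

print(utf16_code_units("\u8336"))      # 1: the tea character fits in one unit
print(utf16_code_units("\U0001F375"))  # 2: stored as a pair of code units
```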

UTF-16 has two variants: UTF-16BE (Big Endian) and UTF-16LE (Little Endian).

UTF-16BE

Documents created on a Macintosh or Unix platform are normally encoded as UTF-16BE (Big Endian).

The Chinese character for tea is encoded as [83 36].

UTF-16LE

Documents created on a Windows platform are normally encoded as UTF-16LE (Little Endian).

The Chinese character for tea is encoded as [36 83].
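The two byte orders can be compared side by side with Python's standard `utf-16-be` and `utf-16-le` codecs. This is a minimal sketch; neither codec writes a byte order mark.

```python
TEA = "\u8336"  # the Chinese character for tea

big_endian = TEA.encode("utf-16-be")
little_endian = TEA.encode("utf-16-le")

print(big_endian.hex(" "))     # 83 36
print(little_endian.hex(" "))  # 36 83

# For a single 16-bit code unit, the two forms are byte-reversed copies.
print(big_endian[::-1] == little_endian)  # True
```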

Fonts

Some fonts cannot display all of the Unicode characters. If you see any characters missing in your text file, you can change the font to one that includes the character. Generally, Microsoft Sans Serif is a good choice for Unicode characters.
