Unicode is a standard that precisely defines a character set as well as a small number of encodings for it. It enables you to handle text in any language efficiently. It allows a single application executable to work for a global audience. ICU, like Java™, Microsoft® Windows NT™, Windows™ 2000 and other modern systems, provides Internationalization solutions based on Unicode.
This chapter is intended as an introduction to codepages in general and Unicode in particular. For further information, see:
Go to the online ICU demos to see how a Unicode-based server application can handle text in many languages and many encodings.
Representing text-format data in computers is a matter of defining a set of characters and assigning each of them a number and a bit representation. Underlying this basic idea are three related concepts: the set (repertoire) of characters itself, the coded character set that assigns each character a number, and the character encoding scheme that maps those numbers to bits and bytes.
For simple encodings such as ASCII, the last two concepts are basically the same: ASCII assigns 128 characters and control codes to consecutive numbers from 0 to 127. These characters and control codes are encoded as simple, unsigned, binary integers. Therefore, ASCII is both a coded character set and a character encoding scheme.
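For example, ASCII assigns the letter A the number 65, and that number is stored directly as the single byte value 41₁₆.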
ASCII only encodes 128 characters, 33 of which are control codes rather than graphic, displayable characters. It was designed to represent English-language text for an American user base, and is therefore insufficient for representing text in almost any language other than American English. In fact, most traditional encodings were limited to one or a few languages and scripts.
ASCII offered a natural way to extend it: it was designed in the 1960s for systems with 7-bit bytes, while most computers and Internet protocols since the 1970s use 8-bit bytes, so the extra bit made another 128 byte values available for more characters. Various encodings were developed that supported different languages; some of these were based on ASCII, others were not.
Languages such as Japanese need to encode considerably more than 256 characters. Various encoding schemes enable large character sets with thousands or tens of thousands of characters to be represented. Most of those encodings are still byte-based, which means that many characters require two or more bytes of storage space, and special processing is needed to determine whether a given byte stands on its own or is part of a multi-byte sequence.
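For example, a byte-based scanner for such an encoding has to classify every byte before it can find character boundaries. The following sketch uses the lead-byte ranges of standard Shift-JIS purely as an illustration (it ignores vendor extensions and is not an ICU API):

```c
#include <stdbool.h>

/* In standard Shift-JIS, bytes in these ranges start a two-byte character;
   other byte values encode a character by themselves. */
static bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF);
}

/* Count the characters in a Shift-JIS byte string by skipping trail bytes. */
static int count_sjis_chars(const unsigned char *s, int length) {
    int i = 0, count = 0;
    while (i < length) {
        i += is_sjis_lead_byte(s[i]) ? 2 : 1;
        ++count;
    }
    return count;
}
```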
Various character sets and encoding schemes have been developed independently, each covering only one or a few languages, and they are incompatible with one another. This makes it very difficult for a single system to handle text in more than one language at a time, and especially difficult to do so in a way that is interoperable across different systems.
Generally, the minimum requirement for the interoperable exchange of text data is that the encoding (character set & encoding scheme) must be properly specified in the document and in the protocol. For example, email/SMTP and HTML/HTTP provide the means to specify the “charset”, as it is called in Internet standards. However, very often the encoding is not specified, specified incorrectly, or the sender and receiver disagree on its implementation.
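For example, a MIME header such as “Content-Type: text/plain; charset=ISO-8859-1” tells the receiver which encoding to use when interpreting the message body; without it, the receiver can only guess.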
The ISO 2022 encoding scheme was created to store text in many different languages. It allows other encodings to be embedded by first announcing them and then switching between them. Full support for all of its features and possible embedded encodings requires complicated processing and support for many encodings. For East Asian languages, subsets were developed that cover only one language or a few at a time, and they are much more manageable. ISO 2022 is designed for data exchange and is not well suited for internal processing.
Programmers often need to distinguish between characters and glyphs. A character is the smallest semantic unit in a writing system. It is an abstract concept such as the letter A or the exclamation point. A glyph is the visual presentation of one or more characters, and is often dependent on adjacent characters.
There is not always a one-to-one mapping between characters and glyphs. In many languages (Arabic is a prime example), the way a character looks depends heavily on the surrounding characters. Standard printed Arabic has as many as four different printed representations (glyphs) for every letter of the alphabet. In many languages, two or more letters may combine together into a single glyph (called a ligature), or a single character might be displayed with more than one glyph.
Despite the different visual variants of a particular letter, it still retains its identity. For example, the Arabic letter heh has four different visual representations in common use. Whichever one is used, it still keeps its identity as the letter heh. It is this identity that Unicode encodes, not the visual representation. This also cuts down on the number of independent character values required.
Unicode was developed as a single-coded character set that contains support for all languages in the world. The first version of Unicode used 16-bit numbers, which allowed for encoding 65,536 characters without complicated multibyte schemes. With the inclusion of more characters, and following implementation needs of many different platforms, Unicode was extended to allow more than one million characters. Several other encoding schemes were added. This introduced more complexity into the Unicode standard, but far less than managing a large number of different encodings.
Starting with Unicode 2.0 (published in 1996), the Unicode standard began assigning numbers from 0 to 10FFFF₁₆, which requires 21 bits but does not use them completely. This gives more than enough room for all written languages in the world. The original repertoire covered all major languages commonly used in computing, and Unicode continues to grow as more scripts are included.
The design of Unicode differs in several ways from traditional character sets and encoding schemes, as described in the following paragraphs.
The early inclusion of all characters from commonly used character sets makes Unicode a useful “pivot” point for converting between traditional character sets, and makes it feasible to process non-Unicode text by first converting it into Unicode, processing the text, and converting it back to the original encoding without loss of data.
The first 128 Unicode code point values are assigned to the same characters as in US-ASCII; that is, the same number is assigned to the same character. The same is true for the first 256 code point values of Unicode compared to ISO 8859-1 (Latin-1), which is itself a direct superset of US-ASCII. This makes it easy to adapt many applications to Unicode because the numbers for many syntactically important characters are the same.
Unicode assigns characters a number from 0 to 10FFFF₁₆, giving enough elbow room to allow for unambiguous encoding of every character in common use. Such a character number is called a “code point”.
Unicode code points are just non-negative integer numbers in a certain range. They do not have an implicit binary representation or a width of 21 or 32 bits. Binary representation and unit widths are defined for encoding forms.
For internal processing, the standard defines three encoding forms, and for file storage and protocols, some of these encoding forms have encoding schemes that differ in their byte ordering. The difference between an encoding form and an encoding scheme is that an encoding form maps the character set codes to values that fit into internal data types (like a short in C), while an encoding scheme maps to bits and bytes. For traditional encodings, they are the same since the encoding forms already map to bytes.
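For example, in the UTF-16 encoding form the code point U+0041 is represented by the single 16-bit code unit 0041₁₆; the UTF-16BE encoding scheme serializes that unit as the bytes 00 41, while UTF-16LE serializes it as 41 00.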
The different Unicode encoding forms are optimized for a variety of different uses: UTF-16, the default encoding form, maps each code point to one or two 16-bit code units; UTF-8 is a byte-based encoding that preserves ASCII transparency; and UTF-32 uses one 32-bit code unit per code point. Each of these is described in more detail below.
ICU uses UTF-16 internally. ICU 2.0 fully supports supplementary characters (with code points 10000₁₆..10FFFF₁₆). Older versions of ICU provided only partial support for supplementary characters.
For input/output, character encoding schemes define a byte serialization of text. UTF-8 is itself both an encoding form and an encoding scheme because it is byte-based. For each of UTF-16 and UTF-32, there are two variants defined: one that serializes the code units in big-endian byte order (most significant byte first), and one that serializes the code units in little-endian byte order (least significant byte first). The corresponding encoding schemes are called UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.
The names “UTF-16” and “UTF-32” are ambiguous. Depending on context, they refer either to character encoding forms where 16/32-bit words are processed and are naturally stored in the platform endianness, or they refer to the IANA-registered charset names, i.e., to character encoding schemes or byte serializations. In addition to simple byte serialization, the charsets with these names also use optional Byte Order Marks (see Serialized Formats below).
The default encoding form of the Unicode Standard uses 16-bit code units. Code point values for the most common characters are in the range of 0 to FFFF₁₆ and are encoded with just one 16-bit unit of the same value. Code points from 10000₁₆ to 10FFFF₁₆ are encoded with two code units that are often called “surrogates”, and they are called a “surrogate pair” when, together, they correctly encode one Unicode character. The first surrogate in a pair must be in the range D800₁₆ to DBFF₁₆, and the second one must be in the range DC00₁₆ to DFFF₁₆. Every Unicode code point has only one possible UTF-16 encoding with either one code unit that is not a surrogate or with a correct pair of surrogates. The code point values D800₁₆ to DFFF₁₆ are set aside just for this mechanism and will never, by themselves, be assigned any characters.
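The surrogate values for a supplementary code point follow directly from its value. The following minimal sketch shows the arithmetic with an illustrative helper function (ICU provides the U16_LEAD() and U16_TRAIL() macros for the same computation):

```c
#include <stdint.h>

/* Illustrative helper (not an ICU API): split a supplementary code point
   (10000..10FFFF hex) into its UTF-16 surrogate pair. */
static void to_surrogate_pair(uint32_t codePoint, uint16_t *lead, uint16_t *trail) {
    uint32_t offset = codePoint - 0x10000;           /* 20 significant bits */
    *lead  = (uint16_t)(0xD800 + (offset >> 10));    /* high 10 bits -> D800..DBFF */
    *trail = (uint16_t)(0xDC00 + (offset & 0x3FF));  /* low 10 bits  -> DC00..DFFF */
}
/* Example: U+10400 yields the pair D801 DC00. */
```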
Most commonly used characters have code points below FFFF₁₆, but Unicode 3.1 assigns more than 40,000 supplementary characters that make use of surrogate pairs in UTF-16.
Note that comparing UTF-16 strings lexically based on their 16-bit code units does not result in the same order as comparing the code points. This is not usually an issue since only rarely-used characters are affected. Most processes do not rely on the same results in such comparisons. Where necessary, a simple modification to a string comparison can be performed that still allows efficient code unit-based comparisons and makes them compatible with code point comparisons. ICU has C and C++ API functions for this.
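One straightforward way to obtain code point order, sketched below, is to iterate over both strings by code points with ICU's U16_NEXT() macro from unicode/utf16.h; ICU's optimized comparison functions (such as u_strCompare() with its code point order flag) instead adjust code unit values only at the first difference, with the same result:

```c
#include <unicode/utypes.h> /* UChar, UChar32 */
#include <unicode/utf16.h>  /* U16_NEXT */

/* Compare two UTF-16 strings in code point order (sketch, not ICU's implementation). */
static int32_t compareCodePointOrder(const UChar *s1, int32_t len1,
                                     const UChar *s2, int32_t len2) {
    int32_t i1 = 0, i2 = 0;
    while (i1 < len1 && i2 < len2) {
        UChar32 c1, c2;
        U16_NEXT(s1, i1, len1, c1);  /* advances the index by 1 or 2 code units */
        U16_NEXT(s2, i2, len2, c2);
        if (c1 != c2) {
            return c1 - c2;          /* compare whole code points, not code units */
        }
    }
    return (len1 - i1) - (len2 - i2); /* the shorter string sorts first */
}
```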
To meet the requirements of byte-oriented, ASCII-based systems, the Unicode Standard defines UTF-8. UTF-8 is a variable-length, byte-based encoding that preserves ASCII transparency.
UTF-8 maintains transparency for all the ASCII code values (0..127). These values do not appear in any byte of a transformed result except as the direct representation of the ASCII values. Thus, ASCII text is also UTF-8 text.
Characteristics of UTF-8 include: code points 0 to 7F₁₆ are encoded with one byte, 80₁₆ to 7FF₁₆ with two bytes, 800₁₆ to FFFF₁₆ with three bytes, and 10000₁₆ to 10FFFF₁₆ with four bytes; lead bytes and trail bytes use disjoint value ranges, so a program can always find the start of a character from an arbitrary position; and the binary order of UTF-8 byte sequences is the same as the code point order.
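A minimal sketch of these encoding rules for a single code point (illustrative only; in practice ICU's converters and its U8_APPEND() macro provide this functionality):

```c
#include <stdint.h>

/* Illustrative helper (not an ICU API): encode one code point
   (0..10FFFF hex, not a surrogate) as UTF-8. Returns the byte count (1..4). */
static int utf8_encode(uint32_t c, uint8_t out[4]) {
    if (c <= 0x7F) {                        /* 1 byte: 0xxxxxxx */
        out[0] = (uint8_t)c;
        return 1;
    } else if (c <= 0x7FF) {                /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (c >> 6));
        out[1] = (uint8_t)(0x80 | (c & 0x3F));
        return 2;
    } else if (c <= 0xFFFF) {               /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xE0 | (c >> 12));
        out[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (c & 0x3F));
        return 3;
    } else {                                /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (c >> 18));
        out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (c & 0x3F));
        return 4;
    }
}
/* Example: U+FEFF encodes as EF BB BF, the UTF-8 signature bytes. */
```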
The UTF-32 encoding form always uses one single 32-bit integer per Unicode code point. This results in a very simple encoding.
The drawback is its memory consumption: since code point values use only 21 bits, one-third of the memory is always unused, and since most commonly used characters have code point values of up to FFFF₁₆, they take up only one 16-bit unit in UTF-16 (50% less) and up to three bytes in UTF-8 (25% less).
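For example, 1,000 characters of English text occupy about 4,000 bytes in UTF-32, but only about 2,000 bytes in UTF-16 and about 1,000 bytes in UTF-8.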
UTF-32 is mainly used in APIs that are defined with the same data type for both code points and code units. Modern versions of the C standard library that support Unicode use a 32-bit wchar_t with UTF-32 semantics.
SCSU (Standard Compression Scheme for Unicode) is designed to reduce the size of Unicode text for both input and output. It is a simple compression that transforms the text into a byte stream. It typically uses one byte per character in small scripts, and two bytes per character in large, East Asian scripts.
It is usually shorter than any of the UTFs. However, SCSU is stateful, which makes it unsuitable for internal processing. It also uses all possible byte values, which might require additional processing for protocols such as SMTP (email).
Other Unicode encodings have been developed over time for various purposes. Most of them are implemented in ICU; see source/data/mappings/convrtrs.txt.
Programming using any of the UTFs is much more straightforward than with traditional multi-byte character encodings, even though UTF-8 and UTF-16 are also variable-width encodings.
Within each Unicode encoding form, the code unit values for singletons (code units that alone encode characters), lead units, and trailing units are all disjoint. This has crucial implications for implementations: the boundaries of a character can be determined from any position in a string, so programs can iterate forward or backward and truncate text at character boundaries; a search for one string inside another never matches part of one character together with part of a neighboring character; and a corrupted code unit affects only a single character.
Conversion between different UTFs is very fast. Unlike converting to and from legacy encodings like Latin-2, conversion between UTFs does not require table look-ups.
ICU provides two basic data type definitions for Unicode. UChar32 is a 32-bit type for code points and is used for single Unicode characters. It may be signed or unsigned, and it is the same as wchar_t if wchar_t is 32 bits wide. UChar is an unsigned 16-bit integer type for UTF-16 code units. It is the base type for strings (UChar *), and it is the same as wchar_t if wchar_t is 16 bits wide.
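A brief sketch of how the two types are typically used together; the supplementary code point U+10400 is chosen here only as an example:

```c
#include <unicode/utypes.h>   /* UChar, UChar32 */
#include <unicode/ustring.h>  /* u_strlen, u_countChar32 */

int main(void) {
    UChar32 cp = 0x10400;                 /* one supplementary code point */
    UChar s[] = { 0xD801, 0xDC00, 0 };    /* the same character as a surrogate pair, NUL-terminated */

    int32_t units = u_strlen(s);          /* 2: length in UTF-16 code units */
    int32_t chars = u_countChar32(s, -1); /* 1: length in code points */
    return (units == 2 && chars == 1 && cp == 0x10400) ? 0 : 1;
}
```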
Some higher-level APIs, used especially for formatting, use characters closer to a representation for a glyph. Such “user characters” are also called “graphemes” or “grapheme clusters” and require strings so that combining sequences can be included.
In files, input, output, and network protocols, text must be accompanied by the specification of its character encoding scheme for a client to be able to interpret it correctly. (This is called a “charset” in Internet protocols.) However, an encoding scheme specification is not necessary if the text is only used within a single platform, protocol, or application where it is otherwise clear what the encoding is. (The language and text directionality should usually be specified to enable spell checking, text-to-speech transformation, etc.)
The discussion of encoding specifications in this section applies to standard Internet protocols where charset name strings are used. Other protocols may use numeric encoding identifiers and assign different semantics to those identifiers than Internet protocols.
Typically, the encoding specification is done in a protocol- and document format-dependent way. However, the Unicode standard offers a mechanism for tagging text files with a “signature” for cases where protocols do not identify character encoding schemes.
The character ZERO WIDTH NO-BREAK SPACE (FEFF₁₆) can be used as a signature by prepending it to a file or stream. The alternative function of U+FEFF as a format control character has been copied to U+2060 WORD JOINER, and U+FEFF should only be used for Unicode signatures.
The different character encoding schemes generate different, distinct byte sequences for U+FEFF:
UTF-8: EF BB BF
UTF-16BE: FE FF
UTF-16LE: FF FE
UTF-32BE: 00 00 FE FF
UTF-32LE: FF FE 00 00
SCSU: 0E FE FF
BOCU-1: FB EE 28
UTF-7: 2B 2F 76 ( 38 | 39 | 2B | 2F )
ICU provides the function ucnv_detectUnicodeSignature() for Unicode signature detection.
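A minimal usage sketch of this function; the sample buffer is chosen for illustration and begins with the UTF-8 signature bytes EF BB BF:

```c
#include <stdio.h>
#include <unicode/ucnv.h>

int main(void) {
    /* Sample input that begins with the UTF-8 signature (EF BB BF). */
    static const char data[] = "\xEF\xBB\xBFSome text";
    int32_t signatureLength = 0;
    UErrorCode errorCode = U_ZERO_ERROR;

    const char *charset = ucnv_detectUnicodeSignature(
        data, (int32_t)sizeof(data) - 1, &signatureLength, &errorCode);

    if (U_SUCCESS(errorCode) && charset != NULL) {
        /* For this input, charset is "UTF-8" and signatureLength is 3. */
        printf("%s %d\n", charset, (int)signatureLength);
    }
    return 0;
}
```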
There is no signature for CESU-8 separate from the one for UTF-8. UTF-8 and CESU-8 encode U+FEFF and in fact all BMP code points with the same bytes. The opportunity for misidentification of one as the other is one of the reasons why CESU-8 should only be used in limited, closed, specific environments.
In UTF-16 and UTF-32, where the signature also distinguishes between big-endian and little-endian byte orders, it is also called a byte order mark (BOM). The signature works for UTF-16 since the code point that has the byte-swapped encoding, FFFE₁₆, will never be a valid Unicode character. (It is a “non-character” code point.) In Internet protocols, if an encoding specification of “UTF-16” or “UTF-32” is used, it is expected that there is a signature byte sequence (BOM) that identifies the byte ordering, which is not the case for the encoding scheme/charset names with “BE” or “LE”.
If text is specified to be encoded in the UTF-16 or UTF-32 charset and does not begin with a BOM, then it must be interpreted as UTF-16BE or UTF-32BE, respectively.
A signature is not part of the content and must be stripped during processing. For example, blindly concatenating two files gives an incorrect result because the second file’s signature ends up in the middle of the text.
If a signature was detected, then the signature “character” U+FEFF should be removed from the Unicode stream after conversion. Removing the signature bytes before conversion could cause the conversion to fail for stateful encodings like BOCU-1 and UTF-7.
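A minimal sketch of that step, assuming the converted text is available as a UTF-16 buffer (the helper function is illustrative, not an ICU API):

```c
#include <unicode/utypes.h>  /* UChar */

/* After converting the input bytes to UTF-16, drop a leading U+FEFF, if any.
   Returns the number of code units that remain. */
static int32_t stripSignature(const UChar **text, int32_t length) {
    if (length > 0 && (*text)[0] == 0xFEFF) {
        ++*text;    /* simply advance past the signature code unit */
        --length;
    }
    return length;
}
```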
Whether a signature is to be recognized or not depends on the protocol or application.
The Unicode standard is an industry standard and parallels ISO 10646-1. Around 1993, these two standards were effectively merged into the same character set standard. Both standards have the same character repertoire and the same encoding forms and schemes.
One difference used to be that the ISO standard defined code point values to be from 0 to 7FFFFFFF₁₆, not just up to 10FFFF₁₆. The ISO working group decided to add an amendment to the standard that removes this difference by declaring that no characters will ever be assigned code points above 10FFFF₁₆. The main reason for the decision is interoperability between the UTFs: UTF-16 cannot encode any code points above this limit.
This means that the code point space for both Unicode and ISO 10646 is now the same! These changes to ISO 10646 have been made recently and should be complete in the edition ISO 10646:2003 which also combines all parts of the standard into one.
The former, larger code space is the reason why the ISO definition of UTF-8 specifies sequences of five and six bytes to cover that whole range.
Another difference is that the ISO standard defines encoding forms “UCS-4” and “UCS-2”. UCS-4 is essentially UTF-32 with a theoretical upper limit of 7FFFFFFF₁₆, using 31 out of the 32 bits. However, in practice, the ISO committee has accepted that characters above 10FFFF₁₆ will not be encoded, so there is essentially no difference between the forms. The “4” stands for “four-byte form”.
UCS-2 is a subset of UTF-16 that is limited to code points from 0 to FFFF, excluding the surrogate code points. Thus, it cannot represent the characters with code points above FFFF (called supplementary characters).
There is no conversion necessary between UCS-2 and UTF-16. The difference is only in the interpretation of surrogates.
The standards differ in what kind of information they provide: the Unicode standard provides more character properties and describes algorithms, etc., while the ISO standard defines collections, subsets, and the like.
The standards are synchronized, and the respective committees work together to add new characters and assign code point values.