Adding surrogate support to a Unicode library

 


Source: http://unicode.org/iuc/iuc17/b2/slides.ppt



 

17th International Unicode Conference

San Jose, California, September 2000

 

From UCS-2 to UTF-16

Discussion and practical example for the transition of a Unicode library from UCS-2 to UTF-16

This talk discusses the need to support "surrogate characters", analyzes many of the implementation choices with their pros and cons, and presents a practical example.

As the preparation of the second part of ISO 10646 and the next version of Unicode draws to an end, Unicode applications need to prepare to support assigned characters outside the BMP. Although the Unicode encoding range was formally extended via the "surrogate" mechanism with Unicode 2.0 in 1996, many implementations still assume that a code point fits into 16 bits. At the end of this year, the drafts for the new standard versions are expected to be stable, and the assignment of surrogate characters for use in the East Asian markets will soon require implementations that support the full Unicode range.
For example, the International Components for Unicode (ICU), an open-source project, provides low-level Unicode support in C and C++ similar to the Java JDK 1.1. In order to support the full encoding range, some of the APIs and implementations had to be changed. Several alternatives were discussed for this project and are presented in this talk.
The ICU APIs and implementation are now being adapted for UTF-16, with 32-bit code point values for single characters, and the lookup of character properties is extended to work with surrogate characters. This approach is compared with what other companies and organizations are doing, especially for Java, Linux and other Unixes, and Windows.

Why is this an issue?

The concept of the Unicode standard changed during its first few years

Unicode 2.0 (1996) expanded the code point range from 64k to 1.1M

APIs and libraries need to follow this change and support the full range

Upcoming character assignments (Unicode 3.1, 2001) fall into the added range

The Unicode standard was designed to encode fewer than 65000 characters; in “The Unicode Standard, Version 1.0” on page 2 it says:

Completeness. The coded character set would be large enough to encompass all characters that were likely to be used in general text interchange.

It was thought possible to have fewer than 65000 characters because rarely used, ancient, obsolete, and precomposed characters were not to be encoded – they were not assumed to be “likely to be used in general text interchange”. Unicode included a private use area of several thousand code points to accommodate such needs.
The only original encoding form was a fixed-width 16-bit encoding. With the expansion of the coding space that became necessary later, the 16-bit encoding became variable-width. Byte-based and 32-bit fixed-width encodings were also added over time. These changes went hand in hand with the maturation and growing acceptance and use of Unicode.

16-bit APIs

APIs developed for Unicode 1.1 used 16-bit characters and strings: UCS-2

Assuming 1:1 character:code unit

Examples: Win32, Java, COM, ICU, Qt/KDE

Byte-based UTF-8 (1993) mostly for MBCS compatibility and transfer protocols

Programming libraries that were designed several years ago assumed for their APIs and implementations that Unicode text was stored with the original 16-bit fixed-width form. One 16-bit code unit in what the parallel ISO 10646 standard calls UCS-2 always encoded one Unicode character. This was true with Unicode 1.0 and 1.1.
Examples of libraries that worked with this assumption include the Microsoft Windows Win32 API set and related APIs such as COM; Java, with its char type and String class; the International Components for Unicode (ICU); and the Qt library from Trolltech, which underlies user interface programming in the KDE desktop environment.

Libraries, APIs, and protocols that were defined in terms of byte-based text processing use the byte-based UTF-8 encoding of Unicode. It encodes almost all characters in more than one byte – those systems always dealt with variable-width encodings when they were internationalized at all.

Parallel with ISO-10646

ISO-10646 uses 31-bit codes: UCS-4

UCS-2: 16-bit codes for subset 0..ffff₁₆

UTF-16: transformation of subset 0..10ffff₁₆

UTF-8 covers all 31 bits

Private Use areas above 10ffff₁₆ slated for removal from ISO-10646 for UTF interoperability and synchronization with Unicode

In 1993, the Unicode standard and ISO 10646-1 were merged so that they encode the same characters with the same numeric code point.

ISO 10646-1 defines a 31-bit code space with values of 0 to 7fffffff₁₆ for 2G characters. The canonical encoding, called UCS-4, uses 32-bit integer values (4 bytes). The alternative encoding UCS-2 covers only the subset of values up to ffff₁₆ with 16-bit (2-byte) values. No character was assigned a higher value than ffff₁₆, but several ranges above that value were set aside for private use.
The Unicode standard originally also used single 16-bit code units, and the two standards assigned the same numeric values beginning with the merger that was completed in 1993.
UTF-8, the byte-based and first variable-width encoding for the two standards, was created even before then (in 1992) to help transition byte-oriented systems. It allows up to 6 bytes per character for all of UCS-4.
The definition (in 1994) of UTF-16 as the variable-width 16-bit encoding for both standards allowed the extension of the Unicode code point range. UTF-32 was defined in 1999 to clarify the use of Unicode characters with the more limited range compared to UCS-4 but in an otherwise fully compatible way.
In 2000, the workgroup for the ISO standard (JTC1/SC2/WG2) agreed to remove any allocations above the UTF-16-accessible range, i.e., above 10ffff₁₆, in order to eliminate interoperability problems between UCS-4, UTF-8, and UTF-16.
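The UTF-16 transformation itself is compact arithmetic. Here is a minimal sketch in C (U+1D11E is just an example value from the range above ffff₁₆):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t c = 0x1D11E;  /* example code point above 0xFFFF */

        /* encode: subtract 0x10000, split the remaining 20 bits 10/10 */
        uint16_t lead  = (uint16_t)(0xD800 + ((c - 0x10000) >> 10));
        uint16_t trail = (uint16_t)(0xDC00 + ((c - 0x10000) & 0x3FF));

        /* decode: recombine the two 10-bit halves, add 0x10000 back */
        uint32_t back = (((uint32_t)(lead - 0xD800)) << 10)
                      + (trail - 0xDC00) + 0x10000;

        printf("U+%04lX -> %04X %04X -> U+%04lX\n",
               (unsigned long)c, (unsigned)lead, (unsigned)trail,
               (unsigned long)back);
        return 0;
    }

For U+1D11E this prints the surrogate pair D834 DD1E and recovers the original code point.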

The following slides explore the main options for this transition.

UCS-2 to UTF-32

Fixed-width, single base type for strings and code points

UCS-2 programming assumptions mostly intact

Wastes at least 33% space, typically 50%

Performance bottleneck CPU – memory

Option:

Changing the string base type from a 16-bit integer to a 32-bit integer.

Advantage:

Assumptions made in programming for UCS-2 stay intact:

Each character is stored in one single code unit and the same type can be used for both strings and code points.

Disadvantage:

Memory usage, and potentially a reduction in performance:

Since Unicode code points only use 21 bits, 11 out of 32 bits – 33% – would never be used.

In fact, since the most common characters are assigned small values that fit into 16 bits, typical applications would leave 50% of the memory that strings occupy unused.

In text processing, more memory needs to be moved from main and virtual memory into and out of the CPU cache, which may cost more performance than the reduction in operations per character from the simpler fixed-width encoding.
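As an illustration of the trade-off (hypothetical code, not an ICU API), a UTF-32 string preserves the one-unit-per-character model at the cost of permanently unused bits:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t UChar32;  /* 32-bit unit: one code point per unit */

    int main(void) {
        /* U+0041, U+20AC, and an example code point above 0xFFFF */
        const UChar32 s[] = { 0x0041, 0x20AC, 0x1D11E, 0 };

        /* fixed width: no pair detection, trivial indexing and iteration */
        for (int i = 0; s[i] != 0; ++i) {
            printf("U+%04lX\n", (unsigned long)s[i]);
        }

        /* but each unit occupies 32 bits although code points need at
           most 21, and the common BMP characters would fit into 16 */
        return 0;
    }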

UCS-2 to UTF-8

UCS-2 programming assumes many characters in single code units

Breaks a lot of code

Same question of type for code points; follow C model, 32-bit wchar_t? – More difficult transition than other choices

Option:

Changing the string type to UTF-8. This alone does not affect the choice of a data type for code points.

Advantage:

UTF-8 is a popular variable-width encoding for Unicode. The memory consumption is higher or lower than with UTF-16 depending on the text.

Disadvantage:

Changing a UCS-2 library to use UTF-8 would break a lot of code even for code points below ffff₁₆ because much of the implementation relies on special characters (digits, modifiers, controls for bidirectional text, etc.) being encoded in a single unit each.

Note:

Existing UTF-8 systems need to make sure that 4-byte sequences and 21-bit code points are handled; some may assume that UTF-8 would never use more than 3 bytes per character and that the scalar values would fit into 16 bits, which was the case for Unicode 1.1.
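As a sketch of what such systems must now accept, the following C fragment (an illustrative encoder, not a library API) produces the 4-byte sequence for a code point above ffff₁₆:

    #include <stdint.h>
    #include <stdio.h>

    /* Encode one code point as UTF-8; supplementary code points
       (0x10000..0x10FFFF) need four bytes. Returns the byte count. */
    static int utf8_encode(uint32_t c, uint8_t out[4]) {
        if (c < 0x80) { out[0] = (uint8_t)c; return 1; }
        if (c < 0x800) {
            out[0] = (uint8_t)(0xC0 | (c >> 6));
            out[1] = (uint8_t)(0x80 | (c & 0x3F));
            return 2;
        }
        if (c < 0x10000) {
            out[0] = (uint8_t)(0xE0 | (c >> 12));
            out[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
            out[2] = (uint8_t)(0x80 | (c & 0x3F));
            return 3;
        }
        out[0] = (uint8_t)(0xF0 | (c >> 18));
        out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (c & 0x3F));
        return 4;
    }

    int main(void) {
        uint8_t b[4];
        int n = utf8_encode(0x1D11E, b);  /* code point above 0xFFFF */
        for (int i = 0; i < n; ++i) printf("%02X ", (unsigned)b[i]);
        printf("(%d bytes)\n", n);
        return 0;
    }

This prints F0 9D 84 9E (4 bytes); an implementation hard-coded for a 3-byte maximum would mishandle exactly this kind of input.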

Surrogate pairs for single chars

Caller avoids code point calculation

But: caller and callee need to detect and handle pairs: caller choosing argument values, callee checking for errors

Harder to use with code point constants because they are published as scalar values

Significant change for caller from using scalars

Option:

Duplicating code point APIs by adding surrogate-pair variants. Strings are in UTF-16.

A caller would check for surrogate pairs, call the appropriate function variant, and advance through the text by one or two units when iterating.

It is also possible to replace existing functions by the pair variant; the caller could always pass in the current and the following code unit. In this case, the function needs to return the number of units that it used to allow forward iteration. For backward iteration, there may be additional provisions.

Advantage:

The API would still only work with 16-bit integers.

Disadvantage:

The usage model becomes significantly more complicated. Some of the work for detecting and dealing with surrogates would be done twice for robust interfaces: once by the caller and a second time by the API implementation. The API itself becomes more convoluted and harder to use. Also, character code points are typically published, discussed, and accessed as scalar values. Forcing a programmer to calculate the surrogate values would be clumsy and error-prone.
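To make the duplicated work concrete, here is a sketch in C with hypothetical function names (charType and charTypeFromPair are illustrations, not actual ICU APIs); note how surrogate detection happens twice, once in the caller and once in the robust callee:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical property-lookup stub; a real library would consult
       its character-property tables for the 21-bit code point. */
    static int propertyLookup(uint32_t c) { return (int)(c >> 16); }

    /* Original single-unit variant, kept for code points up to 0xFFFF. */
    static int charType(uint16_t c) { return propertyLookup(c); }

    /* Pair variant added for surrogate pairs; it must re-check its
       arguments to stay robust, duplicating the caller's detection. */
    static int charTypeFromPair(uint16_t lead, uint16_t trail) {
        if (lead < 0xD800 || lead > 0xDBFF ||
            trail < 0xDC00 || trail > 0xDFFF) {
            return -1;  /* error: not a valid surrogate pair */
        }
        uint32_t c = (((uint32_t)(lead - 0xD800)) << 10)
                   + (trail - 0xDC00) + 0x10000;
        return propertyLookup(c);
    }

    int main(void) {
        const uint16_t s[] = { 0x0041, 0xD834, 0xDD1E, 0 };
        for (int i = 0; s[i] != 0; ) {
            uint16_t c = s[i];
            if (c >= 0xD800 && c <= 0xDBFF && s[i + 1] != 0) {
                printf("type %d\n", charTypeFromPair(c, s[i + 1]));
                i += 2;  /* caller advances by two units for a pair */
            } else {
                printf("type %d\n", charType(c));
                i += 1;  /* single unit */
            }
        }
        return 0;
    }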

Interoperability

Break existing API users no more than necessary

Interoperability with other APIs: Win32, Java, COM, now also XML DOM

UTF-16 is Unicode default: good compromise (speed/ease/space)

String units should stay 16 bits wide

Further considerations for the support of “surrogate characters” include the question of migration and interoperability.

It is desirable to not change the programming model more than necessary.

It is also desirable to use the same string type that other popular and important systems use: the leading Unicode implementations in Windows Win32 and COM as well as in Java use 16-bit Unicode strings, as do the XML DOM specification and many other systems. Windows is an important platform, Java an important programming language, and specifically for ICU, the open-source IBM XML parser is one of the most important applications using ICU.

UTF-16 is also the default encoding form of Unicode, and it provides a good compromise between memory usage, performance, and ease of use.

This led to the decision to continue to use 16-bit strings in ICU.
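A minimal sketch of this model in C: strings keep 16-bit units while single-character values widen to 32 bits (the UChar/UChar32 type names follow ICU's convention; the iteration helper is an illustration, not the actual ICU API):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint16_t UChar;    /* 16-bit string unit, as in ICU */
    typedef uint32_t UChar32;  /* 32-bit code point, as in ICU */

    /* Read one code point from a UTF-16 string, advancing *i by one
       or two units. Unpaired surrogates are returned as-is here; a
       real implementation would define its error handling. */
    static UChar32 nextCodePoint(const UChar *s, int32_t length,
                                 int32_t *i) {
        UChar32 c = s[(*i)++];
        if (c >= 0xD800 && c <= 0xDBFF && *i < length) {
            UChar trail = s[*i];
            if (trail >= 0xDC00 && trail <= 0xDFFF) {
                ++*i;
                c = ((c - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
            }
        }
        return c;
    }

    int main(void) {
        const UChar s[] = { 0x0041, 0xD834, 0xDD1E, 0x20AC };
        int32_t i = 0, length = 4;
        while (i < length) {
            printf("U+%04lX\n",
                   (unsigned long)nextCodePoint(s, length, &i));
        }
        return 0;
    }

String APIs are unaffected by this change; only the single-character entry points and iteration need the wider type.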

 
Summary

Transition from UCS-2 to UTF-16 gains importance four years after standardization

APIs for single characters need change or new versions

String APIs: no change

Implementations need to handle 21-bit code points

Range of options

The transition of the default Unicode encoding form from UCS-2 to UTF-16, which started in 1994 with the definition of UTF-16 and was published as part of Unicode 2.0 in 1996, gains importance with “real” character assignments above ffff₁₆ expected in 2001. The new CJKV characters in particular are expected to be important for East Asian text processing and are likely to accelerate the acceptance of Unicode in East Asia.
Software that uses 16-bit Unicode needs to be modified to handle the extended encoding range. On the API level, 16-bit strings are syntactically compatible. Single-character APIs need to be modified, or a new version made available, if they used 16-bit integers for Unicode code points. The actual code point values take up 21 bits.

There are several options for transition APIs for surrogate support, and some of them are discussed in this presentation.

Resources

Unicode FAQ: http://www.unicode.org/unicode/faq/

Unicode on IBM developerWorks: http://www.ibm.com/developer/unicode/

ICU: http://oss.software.ibm.com/icu/

 

 

 
