Adding surrogate support to a Unicode library
Source: http://unicode.org/iuc/iuc17/b2/slides.ppt
17th International Unicode Conference
From UCS-2 to UTF-16
Discussion and practical example for the transition of a Unicode library from UCS-2 to UTF-16.

This talk discusses the need to support "surrogate characters", analyzes many of the implementation choices with their pros and cons, and presents a practical example.
As the preparation of the second part of ISO 10646 and the next version of Unicode draws to an end, Unicode applications need to prepare to support assigned characters outside the BMP. Although the Unicode encoding range was formally extended via the "surrogate" mechanism with Unicode 2.0 in 1996, many APIs and implementations still assume the original 16-bit fixed-width form.
For example, the International Components for Unicode (ICU), an open-source project, provides low-level Unicode support in C and C++ similar to the Java JDK 1.1. In order to support the full encoding range, some of the APIs and implementations had to be changed. Several alternatives were discussed for this project and are presented in this talk.
The ICU APIs and implementation are now being adapted for UTF-16, with 32-bit code point values for single characters, and the lookup of character properties is extended to work with surrogate characters. This approach is compared with what other companies and organizations are doing, especially for Java, Linux and other Unixes, and Windows.
Why is this an issue?

•The concept of the Unicode standard changed during its first few years
•Unicode 2.0 (1996) expanded the code point range from 64k to 1.1M
•APIs and libraries need to follow this change and support the full range
•Upcoming character assignments (Unicode 3.1, 2001) fall into the added range
The Unicode standard was designed to encode fewer than 65000 characters; in "The Unicode Standard, Version 1.0", the design principles included:

Completeness. The coded character set would be large enough to encompass all characters that were likely to be used in general text interchange.
It was thought possible to have fewer than 65000 characters because rarely used, ancient, obsolete, and precomposed characters were not to be encoded – they were not assumed to be "likely to be used in general text interchange". Unicode included a private use area of several thousand code points to accommodate such needs.
The only original encoding form was a fixed-width 16-bit encoding. With the expansion of the coding space that became necessary later, the 16-bit encoding became variable-width. Byte-based and 32-bit fixed-width encodings were also added over time. These changes went hand in hand with the maturation and growing acceptance and use of Unicode.
16-bit APIs

•APIs developed for Unicode 1.1 used 16-bit characters and strings: UCS-2
•Assuming 1:1 character:code unit
•Examples: Win32, Java, COM, ICU, Qt/KDE
•Byte-based UTF-8 (1993) mostly for MBCS compatibility and transfer protocols
Programming libraries that were designed several years ago assumed for their APIs and implementations that Unicode text was stored with the original 16-bit fixed-width form. One 16-bit code unit in what the parallel ISO 10646 standard calls UCS-2 always encoded one Unicode character. This was true with Unicode 1.0 and 1.1.
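To make that assumption concrete, here is a minimal C++ sketch (illustrative, not from the original talk) of the UCS-2 model and its UTF-16-aware counterpart; counting 16-bit units over-counts as soon as surrogate pairs appear:

    #include <cstddef>

    // UCS-2 era assumption: one 16-bit unit == one character.
    // Correct for Unicode 1.x text, wrong once surrogate pairs appear.
    std::size_t countUnits(const char16_t* /*s*/, std::size_t length) {
        return length;  // every unit counted as a character
    }

    // UTF-16-aware count: a lead surrogate (d800..dbff) followed by
    // a trail surrogate (dc00..dfff) is one code point, not two.
    std::size_t countCodePoints(const char16_t* s, std::size_t length) {
        std::size_t n = 0;
        for (std::size_t i = 0; i < length; ++i, ++n) {
            if (s[i] >= 0xd800 && s[i] <= 0xdbff &&
                i + 1 < length &&
                s[i + 1] >= 0xdc00 && s[i + 1] <= 0xdfff) {
                ++i;  // skip the trail surrogate of the pair
            }
        }
        return n;
    }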
Examples for libraries that worked with this assumption include the Microsoft Windows Win32 API set and related APIs like COM; Java with its char type and String class; the International Components for Unicode (ICU); and the Qt library from Troll Technologies for user interface programming with the KDE desktop environment that is built on Qt.
Libraries, APIs, and protocols that were defined in terms of byte-based text processing use the byte-based UTF-8 encoding of Unicode. It encodes almost all characters in more than one byte – those systems always dealt with variable-width encodings when they were internationalized at all.
Parallel with ISO-10646

•ISO-10646 uses 31-bit codes: UCS-4
•UCS-2: 16-bit codes for subset 0..ffff₁₆
•UTF-16: transformation of subset 0..10ffff₁₆
•UTF-8 covers all 31 bits
•Private Use areas above 10ffff₁₆ slated for removal from ISO-10646 for UTF interoperability and synchronization with Unicode
In 1993, the Unicode standard and ISO 10646-1 were merged so that they encode the same characters with the same numeric code point.
ISO 10646-1 defines a 31-bit code space with values of 0 to 7fffffff₁₆ for 2G characters. The canonical encoding, called UCS-4, uses 32-bit integer values (4 bytes). The alternative encoding UCS-2 covers only the subset of values up to ffff₁₆ with 16-bit (2-byte) values. No character was assigned a higher value than ffff₁₆, but several ranges above that value were set aside for private use. The Unicode standard originally also used single 16-bit code units, and the two standards assigned the same numeric values beginning with the merger that was completed in 1993.
UTF-8, the byte-based and first variable-width encoding for the two standards, was created even before then (in 1992) to help transition byte-oriented systems. It allows up to 6 bytes per character for all of UCS-4.
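As a sketch of that original definition (assuming the pre-restriction, 31-bit UTF-8; the function name is illustrative), the byte count per code point falls out of a few range checks:

    #include <cstdint>

    // Bytes needed to encode a 31-bit UCS-4 value in the original
    // UTF-8 definition (up to 6 bytes; later restricted to 4).
    int utf8Length(uint32_t c) {
        if (c < 0x80) return 1;         // 7 bits of payload
        if (c < 0x800) return 2;        // 11 bits
        if (c < 0x10000) return 3;      // 16 bits
        if (c < 0x200000) return 4;     // 21 bits
        if (c < 0x4000000) return 5;    // 26 bits
        return 6;                       // 31 bits
    }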
The definition (in 1994) of UTF-16 as the variable-width 16-bit encoding for both standards allowed the extension of the Unicode code point range. UTF-32 was defined in 1999 to clarify the use of Unicode characters with the more limited range compared to UCS-4 but in an otherwise fully compatible way.
In 2000, the workgroup for the ISO standard (JTC1/SC2/WG2) agreed to remove any allocations above the UTF-16-accessible range, i.e., above 10ffff₁₆, in order to remove any interoperability problems between UCS-4, UTF-8, and UTF-16.
The following slides explore each of these ideas.

UCS-2 to UTF-32
•Fixed-width, single base type for strings and code points
•UCS-2 programming assumptions mostly intact
•Wastes at least 33% space, typically 50%
•Performance bottleneck CPU – memory
Option: Changing the string base type from a 16-bit integer to a 32-bit integer.

Advantage: Assumptions made in programming for UCS-2 stay intact: Each character is stored in one single code unit and the same type can be used for both strings and code points.

Disadvantage: Memory usage, and potentially a reduction in performance: Since Unicode code points only use 21 bits, 11 out of 32 bits – 33% – would never be used. In fact, since the most common characters were assigned smaller values, typical applications would leave 50% of the memory that strings take up unused. In text processing, more memory needs to be moved from main and virtual memory into and out of the CPU cache, which may cost more performance than the reduction in operations per character from the simpler fixed-width encoding.
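A minimal sketch of this option (type and function names are illustrative): with a 32-bit base type, plain indexing again yields whole code points, at the cost of the unused high bits described above.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using UChar32 = uint32_t;  // one code unit == one code point

    // With fixed-width 32-bit units, the UCS-2 programming model
    // survives intact: s[i] is always a complete character.
    UChar32 charAt(const std::vector<UChar32>& s, std::size_t i) {
        return s[i];
    }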
UCS-2 to UTF-8

•UCS-2 programming assumes many characters in single code units
•Breaks a lot of code
•Same question of type for code points; follow C model, 32-bit wchar_t?
–More difficult transition than other choices
Option: Changing the string type to UTF-8. This alone does not affect the choice of a data type for code points.

Advantage: UTF-8 is a popular variable-width encoding for Unicode. The memory consumption is higher or lower than with UTF-16 depending on the text.

Disadvantage: Changing a UCS-2 library to use UTF-8 would break a lot of code even for code points below ffff₁₆ because much of the implementation relies on special characters (digits, modifiers, controls for bidirectional text, etc.) being encoded in a single unit each.
Note: Existing UTF-8 systems need to make sure that 4-byte sequences and 21-bit code points are handled; some may assume that UTF-8 would never use more than 3 bytes per character and that the scalar values would fit into 16 bits, which was the case for Unicode 1.1.
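For concreteness, a sketch of the 4-byte case that such systems must handle (a hypothetical helper, restricted to code points above ffff₁₆):

    #include <cstdint>

    // Encode a supplementary code point (10000..10ffff hex) in UTF-8:
    // always 4 bytes, so 3-byte/16-bit assumptions break exactly here.
    int encodeSupplementaryUTF8(uint32_t c, unsigned char out[4]) {
        if (c < 0x10000 || c > 0x10ffff) return 0;  // out of scope here
        out[0] = (unsigned char)(0xf0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (c & 0x3f));
        return 4;
    }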
Surrogate pairs for single chars

•Caller avoids code point calculation
•But: caller and callee need to detect and handle pairs: caller choosing argument values, callee checking for errors
•Harder to use with code point constants because they are published as scalar values
•Significant change for caller from using scalars
Option: Duplicating code point APIs by adding surrogate-pair variants. Strings are in UTF-16. A caller would check for surrogate pairs, call either function variant, and advance in the text by one or two units in case of an iteration.

It is also possible to replace existing functions by the pair variant; the caller could always pass in the current and the following code unit. In this case, the function needs to return the number of units that it used to allow forward iteration. For backward iteration, there may be additional provisions.
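For concreteness, the pair arithmetic that caller and callee would share looks roughly like this (illustrative names; the constants come from the UTF-16 definition):

    #include <cstdint>

    // Combine a surrogate pair into a code point:
    // lead in d800..dbff, trail in dc00..dfff.
    uint32_t combinePair(uint16_t lead, uint16_t trail) {
        return 0x10000 + (((uint32_t)(lead - 0xd800)) << 10)
                       + (trail - 0xdc00);
    }

    // Split a supplementary code point (10000..10ffff) into a pair.
    void splitPair(uint32_t c, uint16_t* lead, uint16_t* trail) {
        *lead  = (uint16_t)(0xd800 + ((c - 0x10000) >> 10));
        *trail = (uint16_t)(0xdc00 + ((c - 0x10000) & 0x3ff));
    }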
Advantage: The API would still only work with 16-bit integers.
Disadvantage: The usage model becomes significantly more complicated. Some of the work for detecting and dealing with surrogates would be done twice for robust interfaces, once by the caller and a second time by the API implementation. The API itself becomes more convoluted and harder to use. Also, character code points are typically published, discussed, and accessed as scalar values. Forcing a programmer to calculate the surrogate values would be clumsy and error-prone.
Interoperability

•Break existing API users no more than necessary
•Interoperability with other APIs: Win32, Java, COM, now also XML DOM
•UTF-16 is Unicode default: good compromise (speed/ease/space)
•String units should stay 16 bits wide
Further considerations for the support of "surrogate characters" include the question of migration and interoperability. It is desirable to not change the programming model more than necessary.
It is also desirable to use the same string type that other popular and important systems use: The leading Unicode implementations in Windows Win32 and COM as well as in Java use 16-bit Unicode strings, as do the XML DOM specification and many other systems. Windows is an important platform, Java an important programming language, and specifically for ICU, the open-source IBM XML parser is one of the most important applications using ICU.
UTF-16 is also the default encoding form of Unicode, and it provides a good compromise between memory usage, performance, and ease of use. This led to the decision to continue to use 16-bit strings in ICU.
Summary

•Transition from UCS-2 to UTF-16 gains importance four years after its standardization (Unicode 2.0, 1996)
•APIs for single characters need change or new versions
•String APIs: no change
•Implementations need to handle 21-bit code points
•Range of options
The transition of the default Unicode encoding form from UCS-2 to UTF-16, which started in 1994 with the definition of UTF-16 and was published as part of Unicode 2.0 in 1996, now becomes urgent as upcoming character assignments (Unicode 3.1, 2001) fall outside the BMP.
Software that uses 16-bit Unicode needs to be modified to handle the extended encoding range. On the API level, 16-bit strings are syntactically compatible. Single-character APIs need to be modified, or a new version made available, if they used 16-bit integers for Unicode code points. The actual code point values take up 21 bits.
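The pattern of that change, sketched in C++ (the type names follow ICU's UChar/UChar32 convention, but the declarations here are illustrative, not the exact ICU API):

    #include <cstdint>

    typedef uint16_t UChar;    // UTF-16 code unit: string APIs keep this type
    typedef int32_t  UChar32;  // single code point: widened to hold 21 bits

    // Pattern of the change for single-character APIs:
    //   old: UBool u_isdigit(UChar c);    // can only name 0..ffff
    //   new: UBool u_isdigit(UChar32 c);  // full range 0..10ffff

    // Property lookup must accept any value in the full range:
    bool isValidCodePoint(UChar32 c) {
        return c >= 0 && c <= 0x10ffff;
    }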
There are several options for transition APIs for surrogate support, and some of them are discussed in this presentation.
Resources

•Unicode FAQ: http://www.unicode.org/unicode/faq/
•Unicode on IBM developerWorks: http://www.ibm.com/developer/unicode/
•ICU: http://oss.software.ibm.com/icu/