Access to the Unicode™ Standard
Format: E-mail questionnaire
Questions: Open ended with some
Target: Two large e-mail lists devoted
to internationalization issues
Duration: 72 hours
Number of responses: 99
The Unicode Standard has been in publication since 1991. It is now in
version 4.1. Originally, the only documentation was the book, later
augmented by a CD-ROM. The CD-ROM contains a copy of the Unicode Character
Database, which is also available online. Since Unicode 3.0 there have been online
PDF files covering the text of the standard, as well as the code charts.
Since 1996 there have been online editions of Unicode Technical Reports,
and later Unicode Technical Standards as well as Unicode Standard
Annexes. In parallel the Unicode web-site provides an ever-growing set
of additional information about the standard.
Web-statistics can be used to discern usage patterns for online
access to the standard, and the book sales reported by the publisher
allow some conclusions about the scale and trends in the use of the
hardcopy book. However, the information provided by these two sources is
at a different level of detail and, more importantly, allows no
conclusions about the motivations of the user. Therefore, we
conducted an informal poll to obtain additional information about how
people access the Unicode Standard.
The poll is an extension of our informal practice of
asking participants in our Unicode Tutorials where they get their
information about the Unicode Standard. While such information
can be helpful in designing tutorials, we realized that the same
information collected from a wider circle of respondents could be used
to improve the authoritative sources of
information as well.
The sample reported here consists of 99 self-selected respondents from two
large mailing lists focused on Unicode and internationalization issues.
The sample is clearly not representative in that it tends to
underrepresent the more occasional users and tends to overrepresent
people knowledgeable about the standard. However, reading the detailed
answers, it becomes clear that the sample successfully represents a wide
range of strategies for accessing the standard and that makes the
results useful for the intended purpose.
Modes of Accessing the Unicode Standard
The first part of the poll
addressed the users' primary choice of mode of access to information about the
standard. The tabulated results (at right) suggest a fairly balanced
overall picture.
- Respondents are divided fairly symmetrically as to whether they 'primarily'
use the book or the online files, with a minority indicating equal
use of both modes.
- A considerable number of replies were accompanied by detailed
comments on the situations for which users prefer the book
compared to tasks for which they prefer online access.
- Respondents use the online
files primarily for 'hard data', such as the Unicode Character
Database, when they need to use a search engine to locate the
information, and when traveling or otherwise
away from their books. People explained that they use the online
text and charts when they are more recent than the book.
- Respondents prefer the book to read up on "difficult topics"
such as conformance or algorithms; to
use as a quick reference; to allow browsing of text and code charts
without bumping into artificial chunks (PDF files). People find
large amounts of text more readable in print; they report that
having the book open next to their screen when working is quicker
than switching files on screen.
- Finally, the book is enjoyed by some for its
mere physicality: "beautiful"; "makes the standard visible"; "great
to show the boss".
If you need to look up something in the Unicode Standard,
do you primarily use the book or the online files?
Access to the Unicode Standard as a Book
Do you have access to a hard copy (book edition)
The Unicode Standard and "the book" are not identical. The
book represents a coherent snapshot of a particular major version of the
standard — with the Unicode Annexes and Unicode Character Database as
softcopy on the CD-ROM. As soon as errata or update versions are available
online, the information contained in the book is no longer guaranteed to
be the most recent. However, comments indicated that rather than looking
(only) for the latest information, people conceive of the book as a
reference that can serve to get a grasp on the whole. The poll
asked whether respondents had access to a book:
- A minority of users (14%) work exclusively online. A few people
consciously avoid the use of "dead trees". However, a significant
proportion of online-only users commented that their use of the standard was
limited in some way.
- 86% of respondents have access to the book; for the most part
they are the proud owners of several (personally owned) versions of
the standard. We did not ask for information about multiple versions,
but nearly everybody listed all the versions that they have.
Where the book is not a personal copy, it's a job perk to have a
Versions of the Unicode Standard in Book Form
The intent of this question was simply to find out which versions of the
Unicode Standard were accessible to respondents in book form. Some
people responded as intended, citing the latest version in their
possession. Many others listed several or even all the versions that they
have access to. In many cases, detailed comments explained their usage
- Over 70% of those having access to the standard in book form
have access to Unicode 4.0. One person keeps a copy at each of
multiple work locations.
- 13% of those owning a book do not own 4.0. The extreme case
is one person who uses a pre-publication 1.0 for Cyrillic characters
and double-checks the online files only for characters
that might have changed.
- Many people report using an out-of-date
copy at a secondary location, like their home; surprisingly many
reported that while they have the latest book, their
colleagues have an older version.
Which version of the Unicode Standard do you have
Reason to get the Latest Book
The Unicode Standard continues to be extended to cover new scripts, and
to complete the repertoire of existing scripts and symbol collections.
Each version adds additional information to the Unicode Character
Database, whether in corrections or improvements of character property
assignments, introduction of new properties, or extension of existing
properties to cover newly added characters. The text of the standard,
including the UAXs, is continually revised, both for clarity and to cover
additional topics. Approximately every three years, a new edition of the
book is released.
It is in the nature of things that the most widely used characters
have been encoded early on and more recent additions have focused on
more obscure usage. Equally, the most fundamental aspects of character
behavior are already described in the earliest versions. Later
improvements often had the character of detail fixes, even though
sometimes a more thorough restatement has introduced additional clarity.
In this context we asked the open-ended question:
What would be a compelling reason to
The comments about what would make respondents choose to purchase a
new edition of a book they already own were very much in line with what one would expect from people
making the decision of whether to buy a revised version of a popular
- Price factors significantly in the upgrade decision. While
almost 25% would upgrade as a matter of course, many others
mentioned the need to justify the expense; others gave explicit
indications of improvements they would like to see.
- Suggested improvements ranged from significant additions
to the standard ("lots of character") to "I will buy the next
version when the vagaries of usage for Hebrew, Arabic, and
the Indic codes have been straightened out, including sorting algorithms."
Many suggested they would upgrade when they see a "major overhaul" or
other types of not-so-incremental changes that make the book more
usable or more complete, such as having the UAXs and similar
specifications bound in or providing more tutorial material.
- Weight and size. Somewhere around the transition between Unicode
2.0 and 3.0 we lost those people who want to be able to transport
the standard. One correspondent resorted to creating his own book of
excerpts; some suggested being able to print from the online
chapters; others noted that they can live without the CJK charts (and
were unhappy about the price and weight penalties they represent).
- In terms of purchasing the book, few people outside the
US/Europe/Israel reported ownership of a book. It seems impossible
to get the books in China. Our own research indicates that the cost to
European users is inflated by 25% and some vendors cannot ship without
substantial delays. In many instances the older versions are cheaper
and easier or faster to obtain.
Access to Online Information about the Unicode Standard
The questions in the next part of the poll were designed to gather
information about which of the online resources users typically access,
and to what degree. Also included was a question about the CD-ROM that
has been included in every copy of the book since Unicode 2.0.
- The online files fail to give their users a complete perspective
of the standard. Many of the 'online only' users, the ones that will
never use a book, because you "can't search dead trees" as one of
them put it, reported at the same time that they never used the
PDF files for the text. Some of them did not even report any usage of
the UAXs, UTSs, or UTRs.
- Several users reported that they rely on third-party tools to
browse the charts; Unibook was the most frequently mentioned. Some
users reported that they use third-party web-sites to get
information on Unicode. One person volunteered that he reads the
- A significant number of comments mentioned slow access speed as
driving people's choice of access mode. Some users make local copies of the UCD,
usually selectively; a few people make local
copies of charts or book chapters in PDF. Slow access is one of the few
reasons people use the CD-ROM. Those that do, often make a hard-disk
The CD-ROM essentially provides a snapshot of certain parts of the
Unicode website at the time of the publication of the book. The drawback
of this is that the information is stale the moment the standard is
updated, which tends to occur two or three times between major
versions. Nearly 75% of all users never use the CD-ROM or used it only
once to check its content. However, the 10-15% that seriously use the CD-ROM do so
primarily in order to deal with limited connectivity issues that make
ongoing use of the online information impractical, expensive or
impossible. 2% suggested that the CD-ROM serves a useful archival
Unicode Character Database and Online Charts
These are among the most consistently used parts of the online
information about the Unicode Standard, with the charts being used more
often online while many users reported that they use a local copy of the
UCD. Note that the questionnaire allowed both yes/no answers and
more specific answers on degree of usage. From the way the answers were
formulated and from the comments, we get the impression that users
that used a resource "often" were highly motivated to disclose that
fact. Therefore, we show the "yes" answer between the "often" and
"sometimes" answer in the bar chart on the right.
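As an illustration of the kind of "hard data" lookup that respondents describe doing against a local copy of the UCD, the following is a minimal sketch using Python's standard `unicodedata` module. It is not taken from the poll; the helper name `describe` and the sample characters are chosen here purely for illustration, and the module reflects whichever UCD version ships with the interpreter, which will generally differ from the version discussed in this report.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return a few UCD properties for a single character (illustrative helper)."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        # The second argument is a fallback for characters without a name.
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
    }


if __name__ == "__main__":
    # The UCD version bundled with this interpreter, not necessarily the latest.
    print("Bundled UCD version:", unicodedata.unidata_version)
    # Latin capital A, Hebrew alef, Arabic-Indic digit three.
    for ch in ("A", "\u05D0", "\u0663"):
        print(describe(ch))
```

A local lookup of this kind avoids the slow-access problem mentioned above, at the cost of the same staleness that affects any local snapshot.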
Unicode Standard Annexes as well as
Unicode Technical Standards and Reports
Of these, only the Unicode Standard Annexes are formally part of the
Unicode Standard. Unicode Technical Standards are formally independent
specifications and Unicode Technical Reports contain loosely related
additional information. In answer to the question on UAXs, UTSs and UTRs,
some people replied in terms of their overall frequency of use, others
in terms of which specific titles they tend to use, and many in a
combination of both types of answers.
- Given that the book does not contain the text of these
specifications, the reported rates of use of UAXs seem low,
particularly so for the 'online-only' set of users.
- About 10% of users report that they use "all" or "most" of these
specifications, compared to 25% that report they do not use any.
- The most popular topics were: the Bidirectional Algorithm (UAX
#9), followed by Normalization (UAX #15), Collation (UTS #10), and
Linebreaking (UAX #14).
- The following topics also received frequent mention: Security,
Unicode and XML, Text Boundaries, Compression, followed by LDML (the
data format definition for the CLDR).
Online Files Containing the Text of the Standard
These were the least used of all the online resources, presumably
because they get updated only as often as the book text which they
reflect. A consistent comment was that users found the best use of the
online files as being able to point someone to a particular chapter by
giving a URL.
- The rate of use among online-only users was not significantly
higher than among those who have access to the book, almost the
Parts of the Unicode Standard
(Source: Unicode 4.1
If you have access to the book, did you ever
use the CD-ROM?
Do you use the online...
- ... Unicode Character Database?
- ... code charts?
- ... UAXs, UTSs, or UTRs?
- ... PDF files for the text?
Poll conducted and interpreted by Asmus Freytag
Asmus Freytag, Ph.D. is president of ASMUS,
Inc., a Seattle-based company specializing in consulting services
and seminars on topics ranging from software
internationalization to implementing Unicode.
He has been a contributor to the Unicode
Standard since before the inception of the Unicode Consortium
and a co-author of the Unicode Standard for many years. He has
written or contributed to several Unicode Technical Reports and
Standards. He is a vice-president of the Unicode Consortium and
represents the Consortium in several standards groups such as
NCITS/L2 and ISO/IEC JTC1/SC2/WG2.
Note: all comments have been summarized or edited. Unicode is a
trademark of the Unicode Consortium.
Copyright © 2005 ASMUS, Inc. All rights reserved.