Format: E-mail questionnaire

Questions: Open-ended with some
suggested categorization

Target: Two large e-mail lists devoted
 to internationalization issues

Date: 2005-10-07

Duration: 72 hours

Number of responses: 99


The Unicode Standard has been in publication since 1991. It is now in version 4.1. Originally, the only documentation was the book, later augmented by a CD-ROM. The CD-ROM contains a copy of the Unicode Character Database, which is also available online. Since Unicode 3.0 there have been online PDF files covering the text of the standard, as well as the code charts. Since 1996 there have been online editions of Unicode Technical Reports, and later Unicode Technical Standards as well as Unicode Standard Annexes. In parallel, the Unicode web-site provides an ever-growing set of additional information about the standard.

The Poll

Web-statistics can be used to discern usage patterns for online access to the standard, and the book sales reported by the publisher allow some conclusions about the scale and trends in the use of the hardcopy book. However, the information provided by these two sources is at a different level of detail and, more importantly, allows no conclusions about the motivations of the user. Therefore, we conducted an informal poll to obtain additional information about how people access the Unicode Standard.

The poll is an extension of our informal practice of asking participants in our Unicode Tutorials where they get their information about the Unicode Standard. While such information can be helpful in designing tutorials, we realized that the same information collected from a wider circle of respondents could be used to improve the authoritative sources of information as well.

The Sample

The sample reported here consists of 99 self-selected respondents from two large mailing lists focused on Unicode and internationalization issues. The sample is clearly not representative in that it tends to underrepresent the more occasional users and tends to overrepresent people knowledgeable about the standard. However, reading the detailed answers, it becomes clear that the sample successfully represents a wide range of strategies for accessing the standard, and that makes the results useful for the intended purpose.


Modes of Accessing the Unicode Standard

  The first part of the poll addressed the users' primary choice of mode of access to information about the standard. The tabulated results (below) suggest a fairly balanced overall picture.
  • Respondents are divided fairly symmetrically as to whether they 'primarily' use the book or the online files, with a minority indicating equal use of both modes.
  • A considerable number of replies were accompanied by detailed comments on the situations for which users prefer the book compared to tasks for which they prefer online access.
  • Respondents use the online files primarily for 'hard data', such as the Unicode Character Database, when they need to use a search engine to locate the information, and when traveling or otherwise away from their books. People explained that they use the online text and charts when they are more recent than the book.
  • Respondents prefer the book to read up on "difficult topics" such as conformance or algorithms; to use as a quick reference; to allow browsing of text and code charts without bumping into artificial chunks (PDF files). People find large amounts of text more readable in print; they report that having the book open next to their screen when working is quicker than switching files on screen.
  • Finally, the book is enjoyed by some for its mere physicality: "beautiful"; "makes the standard visible"; "great to show the boss".

36% book - 39% online - 22% both - 2% other

If you need to look up something in the Unicode Standard
 do you primarily use the book or the online files?





Access to the Unicode Standard as a Book


81% book - 14% not - 4% other access

Do you have access to a hard copy (book edition)
of the Unicode Standard?


The Unicode Standard and "the book" are not identical. The book represents a coherent snapshot of a particular major version of the standard, with the Unicode Standard Annexes and Unicode Character Database as softcopy on the CD-ROM. As soon as errata or update versions are available online, the information contained in the book is no longer guaranteed to be the most recent. However, comments indicated that rather than looking (only) for the latest information, people conceive of the book as a reference that can serve to get a grasp on the whole. The poll asked whether respondents had access to a book:
  • A minority of users (14%) work exclusively online. A few people consciously avoid the use of "dead trees". However, a significant proportion of online-only users commented that their use of the standard was limited in some way.
  • 86% of respondents have access to the book; for the most part they are the proud owners of several (personally owned) versions of the standard. We did not ask for information about multiple versions, but nearly everybody listed all the versions that they have. Where the book is not a personal copy, it's a job perk to have a dedicated copy.

Versions of the Unicode Standard in Book Form

  The intent of this question was simply to find out which versions of the Unicode Standard were accessible to respondents in book form. Some people responded as intended, citing the latest version in their possession. Many others listed several or even all the versions that they have access to. In many cases, detailed comments explained their usage patterns.
  • Over 70% of those having access to the standard in book form have access to Unicode 4.0. One person keeps a copy at each of multiple work locations.
  • 13% of those owning a book do not own 4.0. The extreme case is one person who uses a pre-publication 1.0 for Cyrillic characters and double checks the online files only for characters that might have changed.
  • Many people report using an out-of-date copy at a secondary location, like their home; surprisingly many reported that while they have the latest book, their colleagues have an older version.

19% 1.0 - 25% 2.0 - 48% 3.0 - 71% 4.0

Which version of the Unicode Standard do you have access to?



Reason to get the Latest Book

The Unicode Standard continues to be extended to cover new scripts, and to complete the repertoire of existing scripts and symbol collections. Each version adds additional information to the Unicode Character Database, whether in corrections or improvements of character property assignments, introduction of new properties, or extension of existing properties to cover newly added characters. The text of the standard, including the UAXs, is continually revised, both for clarity and to cover additional topics. Approximately every three years, a new edition of the book is released.

It is in the nature of things that the most widely used characters have been encoded early on and more recent additions have focused on more obscure usage. Equally, the most fundamental aspects of character behavior are already described in the earliest versions. Later improvements often had the character of detail fixes, even though sometimes a more thorough restatement has introduced additional clarity.

In this context we asked the open-ended question:

What would be a compelling reason to upgrade?

The comments about what would make respondents choose to purchase a new edition of a book they already own were very much in line with what one would expect from people making the decision of whether to buy a revised version of a popular programming book.
  • Price factors significantly in the upgrade decision. While almost 25% would upgrade as a matter of course, many others mentioned the need to justify the expense; others gave explicit indications of improvements they would like to see.
  • Suggested improvements ranged from significant additions to the standard, such as "lots of character", to "I will buy the next version when the vagaries of usage for Hebrew, Arabic, and the Indic codes have been straightened out, including sorting algorithms." Many suggested they would upgrade when they see a "major overhaul" or other types of not-so-incremental changes that make the book more usable or more complete, for example by having the UAXs and similar specifications bound in or by providing more tutorial material.
  • Weight and size. Somewhere around the transition between Unicode 2.0 and 3.0 we lost those people who want to be able to transport the standard. One correspondent resorted to creating his own book of excerpts; some suggested being able to print from the online chapters; others noted that they can live without the CJK charts (and were unhappy about the price and weight penalties they represent).
  • In terms of purchasing the book, few people outside the US/Europe/Israel reported ownership of a book. It seems impossible to get the books in China. Our own research indicates that the cost to European users is inflated by 25% and some vendors cannot ship without substantial delays. In many instances the older versions are cheaper and easier or faster to obtain.

Access to Online Information about the Unicode Standard

  The questions in the next part of the poll were designed to gather information about which of the online resources users typically access, and to what degree. Also included was a question about the CD-ROM that has been included in every copy of the book since Unicode 2.0.
  • The online files fail to give their users a complete perspective of the standard. Many of the 'online only' users, the ones that will never use a book, because you "can't search dead trees" as one of them put it, reported at the same time that they never used the PDF files for the text. Some of them did not even report any usage of the UAXs, UTSs, or UTRs.
  • Several users reported that they rely on third-party tools to browse the charts; Unibook was the most frequently mentioned. Some users reported that they use third-party web-sites to get information on Unicode. One person volunteered that he reads the FAQ.
  • A significant number of comments mentioned slow access speed as driving people's choice of access mode. Some users make local copies of the UCD, usually selectively; a few people make local copies of charts or book chapters in PDF. Slow access is one of the few reasons people use the CD-ROM. Those who do often make a hard-disk copy.


The CD-ROM essentially provides a snapshot of certain parts of the Unicode website at the time of the publication of the book. The drawback of this is that the information is stale the moment the standard is updated, which tends to occur two or three times between major versions. Nearly 75% of all users never use the CD-ROM or used it only once to check its content. However, the 10-15% that seriously use the CD-ROM do so primarily in order to deal with limited connectivity issues that make ongoing use of the online information impractical, expensive or impossible. 2% suggested that the CD-ROM serves a useful archival function.

Unicode Character Database and Online Charts

These are among the most consistently used parts of the online information about the Unicode Standard, with the charts being used more often online while many users reported that they use a local copy of the UCD. Note that the questionnaire allowed both yes/no answers and more specific answers as to degree of usage. From the way the answers were formulated and from the comments, we get the impression that users who used a resource "often" were highly motivated to disclose that fact. Therefore, we show the "yes" answer between the "often" and "sometimes" answers in the bar chart on the right.

Unicode Standard Annexes as well as
Unicode Technical Standards and Reports

Of these, only the Unicode Standard Annexes are formally part of the Unicode Standard. Unicode Technical Standards are formally independent specifications and Unicode Technical Reports contain loosely related additional information. In answer to the question on UAXs, UTSs and UTRs, some people replied in terms of their overall frequency of use, others in terms of which specific titles they tend to use, and many in a combination of both types of answers.

  • Given that the book does not contain the text of these specifications, the reported rates of use of UAXs seem low, particularly so for the 'online-only' set of users.
  • About 10% of users report that they use "all" or "most" of these specifications, compared to 25% that report they do not use any.
  • The most popular topics were: the Bidirectional Algorithm (UAX #9), followed by Normalization (UAX #15), Collation (UTS #10), and Linebreaking (UAX #14).
  • The following topics also received frequent mention: Security, Unicode and XML, Text Boundaries, Compression, followed by LDML (the data format definition for the CLDR).

Online Files Containing the Text of the Standard

These were the least used of all the online resources, presumably because they get updated only as often as the book text which they reflect. A consistent comment was that users found the best use of the online files to be the ability to point someone to a particular chapter by giving a URL.

  • The rate of use among online-only users was not significantly higher than among those who have access to the book; if anything, the reverse was true.


Parts of the Standard: book, UAXs, UCD, etc.

Parts of the Unicode Standard

(Source: Unicode 4.1 Tutorial)





Chart of online modes of access

If you have access to the book, did you ever use the CD-ROM?

Do you use the online...
... Unicode Character Database?
... code charts?
... UAXs, UTSs, or UTRs?
... PDF files for the text?



Poll conducted and interpreted by Asmus Freytag


Asmus Freytag, Ph.D., is president of ASMUS, Inc., a Seattle-based company specializing in consulting services and seminars on topics ranging from software internationalization to implementing Unicode.

He has been a contributor to the Unicode Standard since before the inception of the Unicode Consortium and a co-author of the Unicode Standard for many years. He has written or contributed to several Unicode Technical Reports and Standards. He is a vice-president of the Unicode Consortium and represents the Consortium in several standards groups such as NCITS/L2 and ISO/IEC JTC1/SC2/WG2.

Note: all comments have been summarized or edited. Unicode is a trademark of the Unicode Consortium.

