
FYI: Unicode 4.0.0 has been released.

Posted: Mon Aug 04, 2003 10:18 pm
by Faraz.Fazil
Unicode 4.0.0 is a major version of the Unicode Standard. The text of the standard has been extensively rewritten to improve its structure and clarity.

The forthcoming book publication, The Unicode Standard, Version 4.0, together with the online Unicode Standard Annexes and the Unicode Character Database, defines Version 4.0 of the Unicode Standard. The book gives the general principles, requirements for conformance, and guidelines for implementers, followed by character code charts and names. This book can be pre-ordered online.

A complete specification of the contributory files for Unicode 4.0.0 is found on Enumerated Versions. Version 4.0.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 4.0.0, defined by: The Unicode Standard, Version 4.0 (Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1)
Prepublication Version

Some preliminary chapters of the book, as well as the final character code charts, are available online via the navigation links on this page. Both the text and pagination may change slightly prior to final publication. These files may not be printed except for the Table of Contents. A full online version of the book will be posted as soon as it is available.

Major additions to Version 4.0 since Version 3.0 include:

major changes to the introductory and conformance chapters, and extensive revisions to the discussion of punctuation, symbols, and format characters
extensive additions of CJK characters to cover dictionaries and historic usage
many new symbols for mathematical and technical publication
many individual characters, such as currency symbols, added to other scripts, including Indic, Khmer, Latin, Greek, Arabic, and Syriac
substantially improved specification of conformance requirements, incorporating the character encoding model
encoding of supplementary characters
formalized policies for stability of the standard
clarification of semantics of special characters, including the byte order mark
major expansion of Unicode Character Database properties and of specifications for text boundaries and casing
more minority scripts, including Limbu, Tai Le, Osmanya, and Philippine scripts
more historic scripts, including Linear B, Cypriot, and Ugaritic
tightened definition of encoding terms, including UTF-32
substantial improvements to the script descriptions, particularly for Indic scripts and Khmer.
New Characters

1,226 new character assignments were made to the Unicode Standard, Version 4.0 (over and above what was in Unicode 3.2). These additions include currency symbols, additional Latin and Cyrillic characters, the Limbu and Tai Le scripts, Yijing Hexagram symbols, Khmer symbols, Linear B syllables and ideograms, Cypriot, Ugaritic, and a new block of variation selectors (especially for future CJK variants). Double diacritic characters were added for dictionary use.

These new characters extend the set of modern currency symbols, and represent a greater coverage of minority and historical scripts. The following table shows the allocation of code points in Unicode 4.0.0. For more information on the specific characters, see the file DerivedAge.txt in the Unicode Character Database.

Graphic          96,248
Format              134
Control              65
Private Use     137,468
Surrogate         2,048
Noncharacter         66
Reserved        878,083
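
As a quick sanity check, the seven categories in the table above should together account for every code point in the Unicode codespace (U+0000..U+10FFFF). A minimal Python sketch follows; the names `allocation` and `CODESPACE_SIZE` are chosen here for illustration and do not come from the standard.

```python
# Code point allocation in Unicode 4.0.0, copied from the table above.
allocation = {
    "Graphic": 96_248,
    "Format": 134,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 878_083,
}

# The codespace runs from U+0000 to U+10FFFF inclusive.
CODESPACE_SIZE = 0x10FFFF + 1

total = sum(allocation.values())
print(total, total == CODESPACE_SIZE)
```

If the counts are transcribed correctly, the total matches the size of the codespace, which is a handy way to catch a typo in any one row.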

The character repertoire corresponds to ISO/IEC 10646:2003.

Note: The code charts will not be available until near the end of the month. In the meantime, the Beta 4.0.0 charts are available.
Unicode Character Database

Unicode Version 4.0.0 introduced the concept of provisional properties, clarified the relationships between properties, and provided precisely defined fallback properties for characters not explicitly defined in the data files. The documentation was coalesced into UCD.html, with a combined list of Properties.

Other property changes include:
Prefix Format Control. U+06DD arabic end of ayah and U+070F syriac abbreviation mark were reclassified and have significantly different behavior as prefix format control characters. The new characters U+0600..U+0603 were given this behavior as well.

New Properties. The Hangul Syllable Type and identifier Other_ID_Start properties were added. The Unicode Radical Stroke property was classified as informative; all other Unihan properties were classified as provisional. PropertyValueAliases also adds block names.

Numeric Properties. CJK numeric values were added; the properties Decimal Number (Nd) and the Numeric Type decimal digit were aligned in value.
Default Ignorables. Added Hangul Filler characters, U+00AD soft hyphen, CGJ, and ZWS

Soft Hyphen. U+00AD soft hyphen was also changed to General Category Cf. Its semantics were clarified: it marks a position for hyphenation, rather than being itself a hyphen character. (The Hyphen property itself was stabilized, and thus not changed to reflect this.)

Modifier Letters. The General Category of U+02B9..U+02BA, U+02C6..U+02CF changed to General Category Lm.

Grapheme_Extend. The halfwidth katakana marks, and most combining marks (except as needed for canonical equivalence) were removed.
Mongolian Vowel Separator. U+180E mongolian vowel separator was changed to General Category Zs.

Deprecated Characters. Two Khmer characters, U+17A3 khmer independent vowel qaq and U+17D3 khmer sign bathamasat, were deprecated. Four others are strongly discouraged.
Enclosing combining marks. The scope has been defined more clearly.
ZWJ. The semantics with cursive scripts has been revised.

Normalization Corrections. There were corrections for characters U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF.
Note: these corrections are in accord with the Unicode Stability Policy.
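
To see what the normalization of these five characters looks like in practice, here is a rough Python sketch. It assumes the standard `unicodedata` module, which reflects whatever Unicode version ships with the interpreter (not 4.0.0 itself), so it shows the mappings as they stand after the corrections. The helper `is_nfc` is a name made up here to mirror the isNFC(x) notation.

```python
import unicodedata

# The five CJK compatibility ideographs named in the Normalization Corrections.
corrected = [0x2F868, 0x2F874, 0x2F91F, 0x2F95F, 0x2F9BF]

def is_nfc(s: str) -> bool:
    """isNFC(x): true when x is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s

for cp in corrected:
    ch = chr(cp)
    nfc = unicodedata.normalize("NFC", ch)
    # Each of these decomposes to a single unified ideograph.
    print(f"U+{cp:04X} -> U+{ord(nfc):04X}  isNFC={is_nfc(ch)}")
```

Since each of these characters normalizes to a different code point, none of them is in NFC on its own.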

For more information, see the file UCD.html in the Unicode Character Database.
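
Some of the General Category changes listed above can be spot-checked from Python. This is only a sketch: `unicodedata` uses the Unicode version bundled with the interpreter, so it confirms the current values rather than the 4.0.0 data files. U+180E is left out on purpose because later Unicode versions reclassified it again. The dictionary name `expected` is illustrative.

```python
import unicodedata

# Code points whose General Category is stated explicitly in the list above.
expected = {
    0x00AD: "Cf",  # soft hyphen
    0x02B9: "Lm",  # start of the first modifier letter range
    0x02BA: "Lm",  # end of the first modifier letter range
    0x02C6: "Lm",  # start of the second modifier letter range
    0x02CF: "Lm",  # end of the second modifier letter range
}

for cp, cat in expected.items():
    actual = unicodedata.category(chr(cp))
    print(f"U+{cp:04X} expected={cat} actual={actual}")
```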

Conformance

Chapter 3 was substantially improved by incorporating the Unicode Character Encoding Model, resulting in fully specified definitions and conformance requirements of UTF-8, UTF-16, and UTF-32. As a part of this, the related concept of Unicode String is defined, which is a sequence of code units for internal processing; a sequence that is not necessarily a valid Unicode Encoding Form.
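
The difference between the three encoding forms is easiest to see with a supplementary character. The sketch below uses Python's built-in codecs and U+10000, picked here as an example. Note that a Python `str` is a sequence of code points rather than code units, so the lone-surrogate part is only a loose analogue of a Unicode String that is not well-formed.

```python
# U+10000 is the first supplementary code point (LINEAR B SYLLABLE B008 A).
ch = "\U00010000"

utf8 = ch.encode("utf-8")       # four 8-bit code units
utf16 = ch.encode("utf-16-be")  # a surrogate pair: two 16-bit code units
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit

print(utf8.hex(), utf16.hex(), utf32.hex())

# A lone surrogate can sit in a Python str, but it is not well-formed
# in any encoding form, so strict encoding rejects it.
lone = "\ud800"
try:
    lone.encode("utf-8")
    well_formed = True
except UnicodeEncodeError:
    well_formed = False
print(well_formed)
```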

Clearer terminology was introduced for code point assignments, including the seven main categories given in the above table. The conformance status of UAXes, UTSes and UTRs was also clarified. In addition:
Identifiers. A structure for ensuring backwards-compatible programming language identifiers was introduced using the new property Other_ID_Start. There is also an alternate definition for complete stability of identifiers.

Bidi. The bidi algorithm was updated and moved to UAX #9 (see below).
Line Breaking and Boundaries. U+00AD soft hyphen was reclassified. Text boundaries were clarified.

Case Folding. The text from UAX #21, "Case Mappings," was incorporated and updated for case folding and other new properties. The definition of titlecase uses word boundaries, and there is a clearer definition of string functions:
isUpper(), isLower(), isTitle(), isFold()
toUpper(), toLower(), toTitle(), toFold()
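
For a feel of what these string functions do, here is a small Python sketch using the closest built-in `str` methods. The mapping of names is an assumption made for illustration: `str.title()` uses Python's own word splitting rather than the word boundaries the standard refers to, and `is_fold` is a helper invented here to mirror isFold().

```python
def is_fold(s: str) -> bool:
    """isFold(x): true when case folding leaves x unchanged."""
    return s.casefold() == s

word = "Straße"
print(word.upper())     # rough equivalent of toUpper
print(word.lower())     # rough equivalent of toLower
print(word.title())     # rough equivalent of toTitle
print(word.casefold())  # rough equivalent of toFold
print(is_fold(word), is_fold(word.casefold()))
```

The example word was picked because the sharp s shows that case mapping can change the length of a string, which is one reason these operations are defined on strings rather than on single characters.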
Unicode Standard Annexes

The following Unicode Standard Annex was added:
UAX #29: Text Boundaries
Now contains information on text boundary conditions formerly published in Chapter 5 of The Unicode Standard, Version 3.0.
Provides default definitions for grapheme cluster ('user character'), word, and sentence boundaries
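
To illustrate why a 'user character' is not the same thing as a code point, here is a deliberately simplified Python sketch. It is not the UAX #29 algorithm: it only attaches combining marks (General Category Mn or Me) to the preceding character and ignores Hangul syllables, CRLF, and every other rule. The function name `rough_clusters` is made up for this example.

```python
import unicodedata

def rough_clusters(text: str) -> list[str]:
    """Group each character with the combining marks (Mn, Me) that follow it.

    This is only a rough approximation of grapheme clusters; it ignores
    Hangul syllables, CRLF, and the other rules in UAX #29.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch   # extend the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

sample = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(sample), len(rough_clusters(sample)))
```

The sample has three code points, but a reader would count two characters, which is the distinction the default grapheme cluster boundaries are meant to capture.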

The following Unicode Standard Annexes were updated:

UAX #9, The Bidirectional Algorithm
Now contains information on the bidirectional algorithm formerly published in Chapter 3 of The Unicode Standard, Version 3.0.
Canonical equivalence is now preserved (a data change, not an algorithm change)
Shaping is done after reordering, but not across directional boundaries
There were clarifications of: ZWJ, ZWNJ, and intermediate level processing

UAX #14, Line Breaking Properties
Negative numbers and dates with hyphens will not break across lines
Word-Joiner will link any characters (except hard line breaks)
The behavior of soft hyphen is clarified (it marks an opportunity for breaking, not specific graphic appearance)
The rules for GL are relaxed: SP and ZW override GL
There are new property values: NL, WJ

UAX #15: Unicode Normalization Forms
There is a description of Stable Code Points, and the notation NFC(x) and isNFC(x)
Annex 12: Corrigenda was rewritten for clarity, and to describe the use of Normalization Corrections.
Annex 13: Canonical Equivalence was added
UAX #11: East Asian Width
Extended the range for the default property value to 30000..3FFFD.

The following Unicode Technical Report was upgraded in status to a Unicode Standard Annex:

UAX #24: Script Names
Added notes on the stability of Q names, the usage of Mn, Me characters, and scripts with regard to spoofing.
Added Braille.

The following Standard Annexes were superseded as a result of their incorporation into the text of this book:

UAX #13: Unicode Newline Guidelines
UAX #19: UTF-32
UAX #21: Case Mappings
UAX #27: Unicode 3.1
UAX #28: Unicode 3.2

Posted: Tue Aug 05, 2003 11:27 pm
by Faraz.Fazil
Please note that the above info has been taken from unicode.com

For more info please visit: www.unicode.com