diff options
Diffstat (limited to 'libjava/classpath/doc/unicode/ReadMe-2.1.1.txt')
-rw-r--r-- | libjava/classpath/doc/unicode/ReadMe-2.1.1.txt | 344 |
1 files changed, 344 insertions, 0 deletions
diff --git a/libjava/classpath/doc/unicode/ReadMe-2.1.1.txt b/libjava/classpath/doc/unicode/ReadMe-2.1.1.txt new file mode 100644 index 00000000000..506f155a762 --- /dev/null +++ b/libjava/classpath/doc/unicode/ReadMe-2.1.1.txt @@ -0,0 +1,344 @@ +
+UNICODE 2.1 CHARACTER DATABASE
+
+Copyright (c) 1991-1998 Unicode, Inc.
+All Rights reserved.
+
+DISCLAIMER
+
+The Unicode Character Database "UNIDAT21.TXT" is provided as-is by
+Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any
+particular purpose. No warranties of any kind are expressed or implied. The
+recipient agrees to determine applicability of information provided. If this
+file has been purchased on magnetic or optical media from Unicode, Inc.,
+the sole remedy for any claim will be exchange of defective media within
+90 days of receipt.
+
+This disclaimer is applicable for all other data files accompanying the
+Unicode Character Database, some of which have been compiled by the
+Unicode Consortium, and some of which have been supplied by other vendors.
+
+LIMITATIONS ON RIGHTS TO REDISTRIBUTE THIS DATA
+
+Recipient is granted the right to make copies in any form for internal
+distribution and to freely use the information supplied in the creation of
+products supporting the Unicode (TM) Standard. This file can be redistributed
+to third parties or other organizations (whether for profit or not) as long
+as this notice and the disclaimer notice are retained.
+
+EXPLANATORY INFORMATION
+
+The Unicode Character Database defines the default Unicode character
+properties, and internal mappings. Particular implementations may choose to
+override the properties and mappings that are not normative. If that is done,
+it is up to the implementer to establish a protocol to convey that
+information. For more information about character properties and mappings,
+see "The Unicode Standard, Worldwide Character Encoding, Version 2.0",
+published by Addison-Wesley. For information about other data files
+accompanying the Unicode Character Database, see the section of the
+Unicode Standard they were extracted from, or the explanatory readme
+files and/or header sections with those files.
+
+The Unicode Character Database has been updated to reflect Version 2.1
+of the Unicode Standard, with two additional characters added to those
+published in Version 2.0:
+
+ U+20AC EURO SIGN
+ U+FFFC OBJECT REPLACEMENT CHARACTER
+
+A number of corrections have also been made to case mappings or other
+errors in the database noted since the publication of Version 2.0. And
+a few normative bidirectional properties have been modified to reflect
+decisions of the Unicode Technical Committee.
+
+The Unicode Character Database is a plain ASCII text file consisting of lines
+containing fields terminated by semicolons. Each line represents the data for
+one encoded character in the Unicode Standard, Version 2.1. Every encoded
+character has a data entry, with the exception of certain special ranges, as
+detailed below.
+
+There are five special ranges of characters that are represented only by
+their start and end characters, since the properties in the file are uniform,
+except for code values (which are all sequential and assigned). The names of CJK
+ideograph characters and Hangul syllable characters are algorithmically
+derivable. (See the Unicode Standard for more information). Surrogate
+characters and private use characters have no names.
+
+The exact ranges represented by start and end characters are:
+
+ The CJK Ideographs Area (U+4E00 - U+9FFF)
+ The Hangul Syllables Area (U+AC00 - U+D7A3)
+ The Surrogates Area (U+D800 - U+DFFF)
+ The Private Use Area (U+E000 - U+F8FF)
+ CJK Compatibility Ideographs (U+F900 - U+FAFF)
+
+The following table describes the format and meaning of each field in a
+data entry in the Unicode Character Database. Fields which contain
+normative information are so indicated.
+
+Field Explanation
+----- -----------
+
+ 0 Code value in 4-digit hexadecimal format.
+ This field is normative.
+
+ 1 Unicode 2.1 Character Name. These names match exactly the
+ names published in Chapter 7 of the Unicode Standard, Version
+ 2.0, except for the two additional characters.
+ This field is normative.
+
+ 2 General Category. This is a useful breakdown into various "character
+ types" which can be used as a default categorization in implementations.
+ Some of the values are normative, and some are informative.
+ See below for a brief explanation.
+
+ 3 Canonical Combining Classes. The classes used for the
+ Canonical Ordering Algorithm in the Unicode Standard. These
+ classes are also printed in Chapter 4 of the Unicode Standard.
+ This field is normative. See below for a brief explanation.
+
+ 4 Bidirectional Category. See the list below for an explanation of the
+ abbreviations used in this field. These are the categories required
+ by the Bidirectional Behavior Algorithm in the Unicode Standard.
+ These categories are summarized in Chapter 4 of the Unicode Standard.
+ This field is normative.
+
+ 5 Character Decomposition. In the Unicode Standard, not all of
+ the decompositions are full decompositions. Recursive
+ application of look-up for decompositions will, in all cases, lead to
+ a maximal decomposition. The decompositions match exactly the
+ decompositions published with the character names in Chapter 7
+ of the Unicode Standard. This field is normative.
+
+ 6 Decimal digit value. This is a numeric field. If the character
+ has the decimal digit property, as specified in Chapter 4 of
+ the Unicode Standard, the value of that digit is represented
+ with an integer value in this field. This field is normative.
+
+ 7 Digit value. This is a numeric field. If the character represents a
+ digit, not necessarily a decimal digit, the value is here. This
+ covers digits which do not form decimal radix forms, such as the
+ compatibility superscript digits. This field is informative.
+
+ 8 Numeric value. This is a numeric field. If the character has the
+ numeric property, as specified in Chapter 4 of the Unicode
+ Standard, the value of that character is represented with an
+ integer or rational number in this field. This includes fractions as,
+ e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.
+ Also included are numerical values for compatibility characters
+ such as circled numbers. This field is normative.
+
+ 9 If the characters has been identified as a "mirrored" character in
+ bidirectional text, this field has the value "Y"; otherwise "N".
+ The list of mirrored characters is also printed in Chapter 4 of
+ the Unicode Standard. This field is normative.
+
+ 10 Unicode 1.0 Name. This is the old name as published in Unicode 1.0.
+ This name is only provided when it is significantly different from
+ the Unicode 2.1 name for the character. This field is informative.
+
+ 11 10646 Comment field. This field is informative.
+
+ 12 Upper case equivalent mapping. If a character is part of an
+ alphabet with case distinctions, and has an upper case equivalent,
+ then the upper case equivalent is in this field. See the explanation
+ below on case distinctions. These mappings are always one-to-one,
+ not one-to-many or many-to-one. This field is informative.
+
+ 13 Lower case equivalent mapping. Similar to 12. This field is informative.
+
+ 14 Title case equivalent mapping. Similar to 12. This field is informative.
+
+GENERAL CATEGORY
+
+The values in this field are abbreviations for the following. Some of the
+values are normative, and some are informative. For more information, see
+the Unicode Standard. Note: the standard does not assign information to
+control characters (except for TAB in the Bidirectonal Algorithm).
+Implementations will generally also assign categories to certain control
+characters, notably CR and LF, according to platform conventions.
+
+
+Normative
+ Mn = Mark, Non-Spacing
+ Mc = Mark, Spacing Combining
+ Me = Mark, Enclosing
+
+ Nd = Number, Decimal Digit
+ Nl = Number, Letter
+ No = Number, Other
+
+ Zs = Separator, Space
+ Zl = Separator, Line
+ Zp = Separator, Paragraph
+
+ Cc = Other, Control
+ Cf = Other, Format
+ Cs = Other, Surrogate
+ Co = Other, Private Use
+ Cn = Other, Not Assigned
+
+Informative
+ Lu = Letter, Uppercase
+ Ll = Letter, Lowercase
+ Lt = Letter, Titlecase
+ Lm = Letter, Modifier
+ Lo = Letter, Other
+
+ Pc = Punctuation, Connector
+ Pd = Punctuation, Dash
+ Ps = Punctuation, Open
+ Pe = Punctuation, Close
+ Po = Punctuation, Other
+
+ Sm = Symbol, Math
+ Sc = Symbol, Currency
+ Sk = Symbol, Modifier
+ So = Symbol, Other
+
+BIDIRECTIONAL PROPERTIES
+
+Please refer to the Unicode Standard for an explanation of the algorithm for
+Bidirectional Behavior and an explanation of the sigificance of these categories.
+These values are normative.
+
+Strong types:
+ L Left-Right; Most alphabetic, syllabic, and logographic
+ characters (e.g., CJK ideographs)
+ R Right-Left; Arabic, Hebrew, and
+ punctuation specific to those scripts
+Weak types:
+ EN European Number
+ ES European Number Separator
+ ET European Number Terminator
+ AN Arabic Number
+ CS Common Number Separator
+
+Separators:
+ B Block Separator
+ S Segment Separator
+
+Neutrals:
+ WS Whitespace
+ ON Other Neutrals ; All other characters: punctuation, symbols
+
+CHARACTER DECOMPOSITION TAGS
+
+The decomposition is a normative property of a character. The tags supplied
+with certain decompositions generally indicate formatting information.
+Where no such tag is given, the decomposition is designated as canonical.
+Conversely, the presence of a formatting tag also indicates
+that the decomposition is a compatibility decomposition and not a canonical
+decomposition. In the absence of other formatting information in a
+compatibility decomposition, the tag <compat> is used to distinguish it from
+canonical decompositions.
+
+In some instances a canonical decomposition or a compatibility decomposition
+may consist of a single character. For a canonical decomposition, this
+indicates that the character is a canonical equivalent of another single
+character. For a compatibility decomposition, this indicates that the
+character is a compatibility equivalent of another single character.
+
+The compatibility formatting tags used are:
+
+ <font> A font variant (e.g. a blackletter form).
+ <noBreak> A no-break version of a space or hyphen.
+ <initial> An initial presentation form (Arabic).
+ <medial> A medial presentation form (Arabic).
+ <final> A final presentation form (Arabic).
+ <isolated> An isolated presentation form (Arabic).
+ <circle> An encircled form.
+ <super> A superscript form.
+ <sub> A subscript form.
+ <vertical> A vertical layout presentation form.
+ <wide> A wide (or zenkaku) compatibility character.
+ <narrow> A narrow (or hankaku) compatibility character.
+ <small> A small variant form (CNS compatibility).
+ <square> A CJK squared font variant.
+ <fraction> A vulgar fraction form.
+ <compat> Otherwise unspecified compatibility character.
+
+CANONICAL COMBINING CLASSES
+
+ 0: Spacing, enclosing, reordrant, and surrounding
+ 1: Overlays and interior
+ 6: Tibetan subjoined Letters
+ 7: Nuktas
+ 8: Hiragana/Katakana voiced marks
+ 9: Viramas
+ 10: Start of fixed position classes
+199: End of fixed position classes
+200: Below left attached
+202: Below attached
+204: Below right attached
+208: Left attached (reordrant around single base character)
+210: Right attached
+212: Above left attached
+214: Above attached
+216: Above right attached
+218: Below left
+220: Below
+222: Below right
+224: Left (reordrant around single base character)
+226: Right
+228: Above left
+230: Above
+232: Above right
+234: Double above
+
+Note: some of the combining classes in this list do not currently have
+members but are specified here for completeness.
+
+CASE MAPPINGS
+
+In addition to uppercase and lowercase, because of the inclusion of certain
+composite characters for compatibility, such as "01F1;LATIN CAPITAL LETTER
+DZ", there is a third case, called titlecase, which is used where the first
+character of a word is to be capitalized (e.g. UPPERCASE, Titlecase,
+lowercase). An example of such a character is "01F2;LATIN CAPITAL LETTER D
+WITH SMALL LETTER Z".
+
+The uppercase, titlecase and lowercase fields are only included for characters
+that have a single corresponding character of that type. Composite characters
+(such as "339D;SQUARE CM") that do not have a single corresponding character
+of that type can be cased by decomposition.
+
+The case mapping is an informative, default mapping. Certain languages, such
+as Turkish, German, French, or Greek may have small deviations from the
+default mappings listed in the Unicode Character Database.
+
+MODIFICATION HISTORY
+
+Modifications made in updating the Unicode Character Database for
+the Unicode Standard, Version 2.1 (from Version 2.0) are:
+* Added two characters (U+20AC and U+FFFC).
+* Amended bidi properties for U+0026, U+002E, U+0040, U+2007.
+* Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275,
+ U+03C2, U+1E9B.
+* Changed combining order class for U+0F71.
+* Corrected canonical decompositions for U+0F73, U+1FBE.
+* Changed decomposition for U+FB1F from compatibility to canonical.
+* Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB.
+* Corrected compatibility decompositions for U+2469, U+246A, U+3358.
+
+
+Some of the modifications made in updating the Unicode Character Database
+for the Unicode Standard, Version 2.0 are:
+* Fixed decompositions with TONOS to use correct NSM: 030D.
+* Removed old Hangul Syllables; mapping to new characters are
+ in a separate table.
+* Marked compability decompositions with additional tags.
+* Changed old tag names for clarity.
+* Revision of decompositions to use first-level decomposition, instead
+ of maximal decomposition.
+* Correction of all known errors in decompositions from earlier versions.
+* Added control code names (as old Unicode names).
+* Added Hangul Jamo decompositions.
+* Added Number category to match properties list in book.
+* Fixed categories of Koranic Arabic marks.
+* Fixed categories of precomposed characters to match decomposition where possible.
+* Added Hebrew cantillation marks and the Tibetan script.
+* Added place holders for ranges such as CJK Ideographic Area and the
+ Private Use Area.
+* Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of mistakes in the
+ database.
|