1 files changed, 344 insertions, 0 deletions
diff --git a/libjava/classpath/doc/unicode/ReadMe-2.1.1.txt b/libjava/classpath/doc/unicode/ReadMe-2.1.1.txt
new file mode 100644
index 00000000000..506f155a762
--- /dev/null
+++ b/libjava/classpath/doc/unicode/ReadMe-2.1.1.txt
@@ -0,0 +1,344 @@
+
+UNICODE 2.1 CHARACTER DATABASE
+
+Copyright (c) 1991-1998 Unicode, Inc.
+All Rights reserved.
+
+DISCLAIMER
+
+The Unicode Character Database "UNIDAT21.TXT" is provided as-is by
+Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any
+particular purpose. No warranties of any kind are expressed or implied. The
+recipient agrees to determine applicability of information provided. If this
+file has been purchased on magnetic or optical media from Unicode, Inc.,
+the sole remedy for any claim will be exchange of defective media within
+90 days of receipt.
+
+This disclaimer is applicable for all other data files accompanying the
+Unicode Character Database, some of which have been compiled by the
+Unicode Consortium, and some of which have been supplied by other vendors.
+
+LIMITATIONS ON RIGHTS TO REDISTRIBUTE THIS DATA
+
+Recipient is granted the right to make copies in any form for internal
+distribution and to freely use the information supplied in the creation of
+products supporting the Unicode (TM) Standard. This file can be redistributed
+to third parties or other organizations (whether for profit or not) as long
+as this notice and the disclaimer notice are retained.
+
+EXPLANATORY INFORMATION
+
+The Unicode Character Database defines the default Unicode character
+properties, and internal mappings. Particular implementations may choose to
+override the properties and mappings that are not normative. If that is done,
+it is up to the implementer to establish a protocol to convey that
+information. For more information about character properties and mappings,
+see "The Unicode Standard, Worldwide Character Encoding, Version 2.0",
+published by Addison-Wesley. For information about other data files
+accompanying the Unicode Character Database, see the section of the
+Unicode Standard they were extracted from, or the explanatory readme
+files and/or header sections with those files.
+
+The Unicode Character Database has been updated to reflect Version 2.1
+of the Unicode Standard, with two additional characters added to those
+published in Version 2.0:
+
+   U+20AC EURO SIGN
+   U+FFFC OBJECT REPLACEMENT CHARACTER
+
+A number of corrections have also been made to case mappings and to other
+errors in the database noted since the publication of Version 2.0. In
+addition, a few normative bidirectional properties have been modified to
+reflect decisions of the Unicode Technical Committee.
+
+The Unicode Character Database is a plain ASCII text file consisting of lines
+containing fields terminated by semicolons. Each line represents the data for
+one encoded character in the Unicode Standard, Version 2.1. Every encoded
+character has a data entry, with the exception of certain special ranges, as
+detailed below.
+
+There are five special ranges of characters that are represented only by
+their start and end characters, since the properties in the file are uniform,
+except for code values (which are all sequential and assigned). The names of CJK
+ideograph characters and Hangul syllable characters are algorithmically
+derivable. (See the Unicode Standard for more information.) Surrogate
+characters and private use characters have no names.
+
+The exact ranges represented by start and end characters are:
+
+   The CJK Ideographs Area (U+4E00 - U+9FFF)
+   The Hangul Syllables Area (U+AC00 - U+D7A3)
+   The Surrogates Area (U+D800 - U+DFFF)
+   The Private Use Area (U+E000 - U+F8FF)
+   CJK Compatibility Ideographs (U+F900 - U+FAFF)
+
+The following table describes the format and meaning of each field in a
+data entry in the Unicode Character Database. Fields which contain
+normative information are so indicated.
+
+Field	Explanation
+-----	-----------
+
+  0	Code value in 4-digit hexadecimal format.
+  	This field is normative.
+
+  1	Unicode 2.1 Character Name. These names match exactly the
+	names published in Chapter 7 of the Unicode Standard, Version
+	2.0, except for the two additional characters.
+  	This field is normative.
+
+  2	General Category. This is a useful breakdown into various "character
+	types" which can be used as a default categorization in implementations.
+ 	Some of the values are normative, and some are informative.
+ 	See below for a brief explanation.
+
+  3	Canonical Combining Classes. The classes used for the
+	Canonical Ordering Algorithm in the Unicode Standard. These
+	classes are also printed in Chapter 4 of the Unicode Standard.
+        This field is normative. See below for a brief explanation.
+
+  4	Bidirectional Category. See the list below for an explanation of the
+	abbreviations used in this field. These are the categories required
+	by the Bidirectional Behavior Algorithm in the Unicode Standard.
+	These categories are summarized in Chapter 4 of the Unicode Standard.
+	This field is normative.
+
+  5	Character Decomposition. In the Unicode Standard, not all of
+	the decompositions are full decompositions. Recursive
+	application of look-up for decompositions will, in all cases, lead to
+	a maximal decomposition. The decompositions match exactly the
+	decompositions published with the character names in Chapter 7
+	of the Unicode Standard. This field is normative.
+
+  6	Decimal digit value. This is a numeric field. If the character
+	has the decimal digit property, as specified in Chapter 4 of
+	the Unicode Standard, the value of that digit is represented
+	with an integer value in this field. This field is normative.
+
+  7	Digit value. This is a numeric field. If the character represents a
+	digit, not necessarily a decimal digit, the value is here. This
+	covers digits which do not form decimal radix forms, such as the
+	compatibility superscript digits. This field is informative.
+
+  8	Numeric value. This is a numeric field. If the character has the
+	numeric property, as specified in Chapter 4 of the Unicode
+	Standard, the value of that character is represented with an
+	integer or rational number in this field. This includes fractions,
+	such as "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.
+	Also included are numerical values for compatibility characters
+	such as circled numbers. This field is normative.
+
+  9	If the character has been identified as a "mirrored" character in
+        bidirectional text, this field has the value "Y"; otherwise "N".
+	The list of mirrored characters is also printed in Chapter 4 of
+	the Unicode Standard. This field is normative.
+
+ 10	Unicode 1.0 Name. This is the old name as published in Unicode 1.0.
+	This name is only provided when it is significantly different from
+	the Unicode 2.1 name for the character. This field is informative.
+
+ 11	10646 Comment field. This field is informative.
+
+ 12	Upper case equivalent mapping. If a character is part of an
+	alphabet with case distinctions, and has an upper case equivalent,
+	then the upper case equivalent is in this field. See the explanation
+	below on case distinctions. These mappings are always one-to-one,
+	not one-to-many or many-to-one. This field is informative.
+
+ 13	Lower case equivalent mapping. Similar to 12. This field is informative.
+
+ 14	Title case equivalent mapping. Similar to 12. This field is informative.
+
+GENERAL CATEGORY
+
+The values in this field are abbreviations for the following. Some of the
+values are normative, and some are informative. For more information, see
+the Unicode Standard. Note: the standard does not assign information to
+control characters (except for TAB in the Bidirectional Algorithm).
+Implementations will generally also assign categories to certain control
+characters, notably CR and LF, according to platform conventions.
+
+
+Normative
+    Mn = Mark, Non-Spacing
+    Mc = Mark, Spacing Combining
+    Me = Mark, Enclosing
+
+    Nd = Number, Decimal Digit
+    Nl = Number, Letter
+    No = Number, Other
+
+    Zs = Separator, Space
+    Zl = Separator, Line
+    Zp = Separator, Paragraph
+
+    Cc = Other, Control
+    Cf = Other, Format
+    Cs = Other, Surrogate
+    Co = Other, Private Use
+    Cn = Other, Not Assigned
+
+Informative
+    Lu = Letter, Uppercase
+    Ll = Letter, Lowercase
+    Lt = Letter, Titlecase
+    Lm = Letter, Modifier
+    Lo = Letter, Other
+
+    Pc = Punctuation, Connector
+    Pd = Punctuation, Dash
+    Ps = Punctuation, Open
+    Pe = Punctuation, Close
+    Po = Punctuation, Other
+
+    Sm = Symbol, Math
+    Sc = Symbol, Currency
+    Sk = Symbol, Modifier
+    So = Symbol, Other
+
+BIDIRECTIONAL PROPERTIES
+
+Please refer to the Unicode Standard for an explanation of the algorithm for
+Bidirectional Behavior and an explanation of the significance of these categories.
+These values are normative.
+
+Strong types:
+	L	Left-Right; Most alphabetic, syllabic, and logographic
+			characters (e.g., CJK ideographs)
+	R	Right-Left; Arabic, Hebrew, and
+			punctuation specific to those scripts
+Weak types:
+	EN	European Number
+	ES	European Number Separator
+	ET	European Number Terminator
+	AN	Arabic Number
+	CS	Common Number Separator
+
+Separators:
+	B	Block Separator
+	S	Segment Separator
+
+Neutrals:
+	WS	Whitespace
+	ON	Other Neutrals; All other characters: punctuation, symbols
+
+CHARACTER DECOMPOSITION TAGS
+
+The decomposition is a normative property of a character. The tags supplied
+with certain decompositions generally indicate formatting information.
+Where no such tag is given, the decomposition is designated as canonical.
+Conversely, the presence of a formatting tag also indicates
+that the decomposition is a compatibility decomposition and not a canonical
+decomposition. In the absence of other formatting information in a
+compatibility decomposition, the tag <compat> is used to distinguish it from
+canonical decompositions.
+
+In some instances a canonical decomposition or a compatibility decomposition
+may consist of a single character. For a canonical decomposition, this
+indicates that the character is a canonical equivalent of another single
+character. For a compatibility decomposition, this indicates that the
+character is a compatibility equivalent of another single character.
+
+The compatibility formatting tags used are:
+
+	<font>		A font variant (e.g. a blackletter form).
+	<noBreak>	A no-break version of a space or hyphen.
+	<initial>	An initial presentation form (Arabic).
+	<medial>	A medial presentation form (Arabic).
+	<final>		A final presentation form (Arabic).
+	<isolated>	An isolated presentation form (Arabic).
+	<circle>	An encircled form.
+	<super>		A superscript form.
+	<sub>		A subscript form.
+	<vertical>	A vertical layout presentation form.
+	<wide>		A wide (or zenkaku) compatibility character.
+	<narrow>	A narrow (or hankaku) compatibility character.
+	<small>		A small variant form (CNS compatibility).
+	<square>	A CJK squared font variant.
+	<fraction>	A vulgar fraction form.
+	<compat>	Otherwise unspecified compatibility character.
+
+CANONICAL COMBINING CLASSES
+
+  0: Spacing, enclosing, reordrant, and surrounding
+  1: Overlays and interior
+  6: Tibetan subjoined Letters
+  7: Nuktas
+  8: Hiragana/Katakana voiced marks
+  9: Viramas
+ 10: Start of fixed position classes
+199: End of fixed position classes
+200: Below left attached
+202: Below attached
+204: Below right attached
+208: Left attached (reordrant around single base character)
+210: Right attached
+212: Above left attached
+214: Above attached
+216: Above right attached
+218: Below left
+220: Below
+222: Below right
+224: Left (reordrant around single base character)
+226: Right
+228: Above left
+230: Above
+232: Above right
+234: Double above
+
+Note: some of the combining classes in this list do not currently have
+members but are specified here for completeness.
+
+CASE MAPPINGS
+
+In addition to uppercase and lowercase, because of the inclusion of certain
+composite characters for compatibility, such as "01F1;LATIN CAPITAL LETTER
+DZ", there is a third case, called titlecase, which is used where the first
+character of a word is to be capitalized (e.g. UPPERCASE, Titlecase,
+lowercase). An example of such a character is "01F2;LATIN CAPITAL LETTER D
+WITH SMALL LETTER Z".
+
+The uppercase, titlecase and lowercase fields are only included for characters
+that have a single corresponding character of that type. Composite characters
+(such as "339D;SQUARE CM") that do not have a single corresponding character
+of that type can be cased by decomposition.
+
+The case mapping is an informative, default mapping. Certain languages, such
+as Turkish, German, French, or Greek, may have small deviations from the
+default mappings listed in the Unicode Character Database.
+
+MODIFICATION HISTORY
+
+Modifications made in updating the Unicode Character Database for
+the Unicode Standard, Version 2.1 (from Version 2.0) are:
+* Added two characters (U+20AC and U+FFFC).
+* Amended bidi properties for U+0026, U+002E, U+0040, U+2007.
+* Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275,
+	U+03C2, U+1E9B.
+* Changed combining order class for U+0F71.
+* Corrected canonical decompositions for U+0F73, U+1FBE.
+* Changed decomposition for U+FB1F from compatibility to canonical.
+* Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB.
+* Corrected compatibility decompositions for U+2469, U+246A, U+3358.
+
+
+Some of the modifications made in updating the Unicode Character Database
+for the Unicode Standard, Version 2.0 are:
+* Fixed decompositions with TONOS to use correct NSM: 030D.
+* Removed old Hangul Syllables; mappings to new characters are
+	in a separate table.
+* Marked compatibility decompositions with additional tags.
+* Changed old tag names for clarity.
+* Revision of decompositions to use first-level decomposition, instead
+	of maximal decomposition.
+* Correction of all known errors in decompositions from earlier versions.
+* Added control code names (as old Unicode names).
+* Added Hangul Jamo decompositions.
+* Added Number category to match properties list in book.
+* Fixed categories of Koranic Arabic marks.
+* Fixed categories of precomposed characters to match decomposition where possible.
+* Added Hebrew cantillation marks and the Tibetan script.
+* Added place holders for ranges such as CJK Ideographic Area and the
+	Private Use Area.
+* Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of mistakes in the
+	database.
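
Note on the format described in the ReadMe above: a minimal sketch, in
Python, of how one data entry might be split into its fields 0 through 14
and how a decomposition tag might be separated from its code values. The
field labels, the function names, and the two sample lines are assumptions
made for illustration; the sample lines are modeled on the documented
format and are not quoted from the data file.

```python
from fractions import Fraction

# Labels for fields 0-14; these names are this sketch's own, not the ReadMe's.
FIELD_NAMES = [
    "code", "name", "general_category", "combining_class", "bidi_category",
    "decomposition", "decimal_digit", "digit", "numeric", "mirrored",
    "unicode_1_name", "comment_10646", "uppercase", "lowercase", "titlecase",
]

# Illustrative entries (assumed, modeled on the documented layout).
SAMPLE_A = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
SAMPLE_FRACTION = (
    "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;"
    "<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;"
)


def parse_entry(line):
    """Split one semicolon-delimited entry into a dict keyed by field label."""
    parts = line.rstrip("\r\n").split(";")
    if len(parts) < len(FIELD_NAMES):
        raise ValueError("expected at least 15 fields, got %d" % len(parts))
    # zip() stops after the 15 labels, so any extra trailing piece is ignored.
    return dict(zip(FIELD_NAMES, parts))


def split_decomposition(field):
    """Return (tag, code_values); tag is None when no <tag> is present."""
    tokens = field.split()
    tag = None
    if tokens and tokens[0].startswith("<") and tokens[0].endswith(">"):
        tag = tokens[0][1:-1]
        tokens = tokens[1:]
    return tag, [int(token, 16) for token in tokens]


def is_canonical(field):
    """A non-empty decomposition without a tag is designated canonical."""
    tag, code_values = split_decomposition(field)
    return tag is None and len(code_values) > 0


def numeric_value(field):
    """Interpret field 8 as an integer or rational; None when it is empty."""
    return Fraction(field) if field else None


if __name__ == "__main__":
    entry = parse_entry(SAMPLE_FRACTION)
    print(entry["code"], entry["name"])
    print(split_decomposition(entry["decomposition"]))
    print(numeric_value(entry["numeric"]))
```

The sketch keeps every field as a string and converts only on request,
since most fields are empty for most characters. It does not expand the
five special ranges that the file represents only by their start and end
characters; a caller would need to handle those separately.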