:tbl_format_thb.txt

=== SECTION ONE ===

This file describes in detail the format of the file 'thb' (Table Html and
Binding binary), and highlights the 'tht' information that also feeds this
file. 'tbb' is contained in 'thb' itself, starting at field 'Start tbb'.

"FIELD:" indicates the field in C++ struct 'sUnCodeBin'.

Value types used in the tables:
  unsigned char value, equal to ASCII in that octet
  unsigned decimal, 2 octets, big-endian (most significant bits in first octet)
  unsigned decimal, 4 octets, big-endian (most significant bits in first octet)

=== SECTION TWO ===

Offset|Octets|Description
(dec) |(size)|
------+------+-----------
     0      5 Letters "THB1A" (Major version 1, minor 'A')
              FIELD: verMajor
     5      1 Null (reserved: 0x00)
     6      1 Html bound, version char (usually '3'), '0' marks none.
              FIELD: verHtml
     7      1 UniCode 'level':
              level '0' means 4 octet coding up to 0xFFFE.
              (UniCode 4.0 code itself can have 5 nibbles (5*4 bits))
              level '1' designates up to 0x7FFFE.
              This field is merely indicative of the extent of the
              table itself.
              FIELD: cUniLevel
     8      1 UniCode description storage: usually 'c'
              Field known as DESCSTG.
              'a' means no storage, 'b' 40-byte storage, 'c' 60-byte,
              'd' 80-byte, 'e' 255-byte
              FIELD: descStg
     9      2 ASCII indicator: whitespace and null (reserved: 0x20 0x00)
    11      4 Year/month/day reference,
              e.g. 991231 means 2099, December 31
              FIELD: revisionDate
    15      1 Revision, 0 to 9 means draft
              FIELD: revisionRef
    16     16 Revision reference string (the author)
              FIELD: revReferenceStr
    32      1 Null (reserved: 0x00)
    33      1 ASCII 'S' marks sequential order, fixed (0x53)
    34      1 Start tbb, ASCII 'i' marks initial tabs, fixed (0x69)
    35      1 No gaps indicator, ASCII '.' (dot), usually set (0x2E)
    36      1 Null (reserved: 0x00)
    37      1 Flat indicator. Semicolon (;) indicates flat table, fixed.
              Any other character is reserved for future use, e.g.
              allowing extended 'strNote' size (or no strNote at all),
              or no "String symbol name Locale" 60 octet field
    38      2 Number of extended tables.
              Value 2: table of symbols and compatibility notes, but no
              HTML-TRANS; value 3: also HTML-TRANS; values 4 and above
              are reserved for future use
              FIELD: cExtTbl
    40      4 Number of symbols (usually 256)
              FIELD: n
    44        TABLE OF SYMBOLS (section three)
   ...      4 Number of compatibility notes
   ...        TABLE OF COMPATIBILITY NOTES (section four)
   ...      4 Number of HTML-TRANS
              FIELD: nHtmlTrans
   ...        TABLE OF HTML-TRANS (section five)

Notes:
  N1. Table of symbols is always present.
  N2. Table of compatibility notes always exists; its size can be zero
      (if all notes are empty).
  N3. Table of HTML-TRANS always exists; its size can be zero.
      See section five.

=== SECTION THREE ===

TABLE OF SYMBOLS

SYMBOLS are fixed width lines.
The description below is written for DESCSTG='c', but the same layout
applies to any other kind of storage.

     0      4 UniCode code
              FIELD: uniCode
     4     60 String symbol name
              This is the case for description storage 'c'.
              Note: 60th octet must be null (0x00)
              FIELD: *symbolName
    64      2 Two letter category
              Note: see etc/unicode256.txt
              FIELD: twoLetterCat
    66      1 ASCII '0', or 1-240.
              'Canonical Combining Class Values', typically '0'.
              FIELD: statZero
    67      3 Three letter style
              Note: see unicode_doc/PropertyValueAliases.txt
              Example: EN means European_Number
              FIELD: thrLetterStyle
    70      1 Null, fixed (0x00)
              'strCompat' field is specified in the next table
              (COMPATIBILITY NOTES)
    71      4 UniCode code for equivalent character, always <0x7F.
              0x00 denotes not applicable.
              Example: 00C0;LATIN CAPITAL LETTER A WITH GRAVE
              is equivalent to capital A (0041).
              In tbt this appears in the sixth column (first column is
              the code) as "0041 0300" (in the underlying example).
              Any nnnn in a string like " nnnn" is not considered
              an equivalent character.
              FIELD: ucCompat
    75     10 String equivalent number (10 characters).
              Example: FRACTION ONE QUARTER has "1/4".
              Note: 10th character must be 0x00.
              FIELD: strNumber
    85      1 Has open and close, either 'Y' or 'N'
              FIELD: cIsOpenClose
    86      4 Extended Equivalent UniCode code for closing character.
              0x00 means not applicable.
              This is an extension of UniCode, usually calculated
              from a text table (unicode256opc.txt)
              FIELD: extEqClose
    90     16 String note for symbol
              This represents the 12th column of tbt.
              Example: LATIN SMALL LETTER SHARP S has a note "German".
              Note: 16th character must be 0x00.
              FIELD: strNote
   106      2 Pad nulls, fixed (reserved: 0x00)
   108      4 UniCode code for uppercase letter
              FIELD: ucUpper
   112      4 UniCode code for downcase letter
              FIELD: ucDown
   116      4 Pad nulls, fixed (reserved: 0x00)
   120      2 Pad nulls, fixed (reserved: 0x00)
   122      4 Reserved: Extended code (reserved: 0xFFFF,FFFF)
   126     60 String symbol name Locale
              This field holds the local description of the symbol.
              FIELD: none
   186      1 Pad null, fixed (reserved: 0x00)
   187      1 Reference version-letter, usually 'D'.
              Letter D means the symbol is present in version 4 of
              UniCode. Future revisions of UniCode will have different
              letters, to be defined
   188      2 Utilization mask, informative.
              Each bit of this field indicates:
                lsbit.0  isalpha in 8859-1 (Lu or Lt)
                lsbit.1  isdigit ('0' to '9')
                lsbit.2  isspace (space, form-feed, newline,
                         carriage return, horiz./vert. tab)
                lsbit.3  isprint (isspace or printable, check APP-I)
              FIELD: extUtMask

Total octets per symbol=190.

=== SECTION FOUR ===

TABLE OF COMPATIBILITY NOTES

Each compatibility note (referred to as strCompat) is a fixed size entry
that contains:

     0      4 UniCode code of character
     4      2 Applicability string: <> denotes tagging, e.g. for HTML
     6     36 String with the note itself
              Examples:
                MASCULINE ORDINAL INDICATOR => 006F
                FRACTION ONE HALF => 0031 2044 0032
              FIELD: strCompat

=== SECTION FIVE ===

TABLE OF HTML-TRANS

A detailed, useful page can be found at
  http://www.uni-passau.de/~ramsch/iso8859-1.html
Kevin J. Brewer (http://www.bbsinc.com/iso8859.html) also wrote a lot of
accurate material.
Briefly:
  a) Some characters should be translated to HTML as follows:
       quotation mark     "  -->  &quot;     &#34;  -->  &quot;
       ampersand          &  -->  &amp;      &#38;  -->  &amp;
       less-than sign     <  -->  &lt;       &#60;  -->  &lt;
       greater-than sign  >  -->  &gt;       &#62;  -->  &gt;
  b) Characters at or above ASCII 160 should be translated as well.
     There are two possibilities for HTML 3.0 representation:
       i)  &#ASCII; where ASCII is the decimal ASCII code
       ii) &STR; where STR is a more or less comprehensible sequence
           of characters, e.g. &amp;

The table HTML-TRANS contains characters (b) only, i.e. at or above
ASCII 160 (0xA0). Other characters should not be translated.
Preferred translation is (ii).

Note that ASCII codes between 127 and 256 refer to ISO8859-1 (equal to
UniCode codes), not to code page (cp)850.
Example: to translate a text written in cp850 it is better to convert it
into UniCode first. Note that not all symbols in cp850 map into a UniCode
code<256.

The generic behaviour for symbols that cannot be translated is stored
before the table itself, as follows:

     0      4 Letters "HTML" (0x48,0x54,0x4D,0x4C)
     4      4 Unmapped symbols behaviour:
                .I (0x2E,0x49) Ignore, do not dump
                I= (0x49,0x3D) Ignore, dump as is
                'd (0x27,0x64) Dump decimal ASCII
                'h (0x27,0x68) Dump hex ASCII
                (d (0x28,0x64) Parenthesis dec.ASCII
                (h (0x28,0x68) Parenthesis hex.ASCII
                [d (0x5B,0x64) Brackets dec.ASCII
                [h (0x5B,0x68) Brackets hex.ASCII
              Plus indication of style:
                .. (0x2E,0x2E) No style
                I.  Italic
                B.  Bold
                IB  Italic bold
                S.  Strong
                u.  Superscript
                s.  Subscript
              Default mode is "[hs.".
              FIELD: sHtmlUnmap

The following fields are fixed:

     8      1 Semicolon field, fixed (0x3B)
     9      1 Regular field, fixed (0x20)
    10      1 Pad null, fixed (reserved: 0x00)
    11      1 Start ASCII field, fixed (0xA0)
              FIELD: cHtmAsciiStt
    12      4 Start of HTML-TRANS (0x25,0x25,0x25,0x7E)

Then the table starts with n entries, specified as follows:

     0      9 HTML translation without & and ;
              All nulls (0x00) indicate no translation.
              Example: COPYRIGHT SIGN => &copy;
              is stored as copy (0x63,0x6F,0x70,0x79,0x00...)
              Note: 9th symbol is not necessarily 0x00.
              FIELD: sTransHtm
     9      1 Compatibility note
              FIELD: cCompNote

Hence, each entry has exactly 10 octets.

=== SECTION SIX ===

'tht' text files description:

tht files contain the full HTML translation string (e.g. &copy;), a free
string for describing, and an optional compatibility note.
Only the first field (before the semicolon) and the second field are
parsed into thb files.

At the beginning of this text file, comment lines starting with hash (#)
will appear. Afterwards the following line is mandatory:
  FIRST ASCII: 0xA0
After that line no # is allowed. Empty lines are ignored.

Compatibility note (cCompNote):
  `: (0x60) not used, reserved
  a: default
  d: Some distinct notations do not work
  e: Some browsers are not compatible.
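
The fixed part of the header in section two (offsets 0 to 43) and one
HTML-TRANS entry from section five can be read with a few lines of code.
The following is only a minimal sketch in Python, not the reference
implementation: the names HEADER, DESC_STG_OCTETS, parse_thb_header and
html_entity are hypothetical, and reading revisionDate as a 4-octet
big-endian number is an assumption, since this file does not state its type.

```python
import struct
from typing import Optional

# Section two, offsets 0..43. ">" selects big-endian with no padding;
# "x" skips the reserved null octets at offsets 5, 32 and 36.
HEADER = struct.Struct(">5s x c c c 2s I c 16s x c c c x c H I")

# DESCSTG letter -> octets used for the symbol name (field descStg).
DESC_STG_OCTETS = {"a": 0, "b": 40, "c": 60, "d": 80, "e": 255}


def parse_thb_header(data: bytes) -> dict:
    """Decode the first 44 octets of a 'thb' file into a dictionary."""
    if len(data) < HEADER.size:
        raise ValueError("need at least 44 octets")
    (magic, ver_html, uni_level, desc_stg, _indicator, rev_date, rev_ref,
     rev_str, seq, start_tbb, no_gaps, flat, n_ext, n_sym) = HEADER.unpack_from(data)
    if magic != b"THB1A":
        raise ValueError("not a THB1A file")
    # Offsets 33, 34 and 37 are documented as fixed values.
    if seq != b"S" or start_tbb != b"i" or flat != b";":
        raise ValueError("unexpected fixed marker")
    desc = desc_stg.decode("ascii")
    return {
        "verHtml": ver_html.decode("ascii"),
        "cUniLevel": uni_level.decode("ascii"),
        "descStg": desc,
        "nameOctets": DESC_STG_OCTETS.get(desc),  # None if letter unknown
        "revisionDate": rev_date,                 # assumption: numeric
        "revisionRef": rev_ref.decode("ascii"),
        "revReferenceStr": rev_str.split(b"\0", 1)[0].decode("ascii"),
        "noGaps": no_gaps == b".",
        "cExtTbl": n_ext,
        "n": n_sym,
    }


def html_entity(entry: bytes) -> Optional[str]:
    """Turn one 10-octet HTML-TRANS entry into '&name;', or None if empty."""
    if len(entry) != 10:
        raise ValueError("HTML-TRANS entries have exactly 10 octets")
    # The name occupies octets 0..8 and may fill all nine without a null.
    name = entry[:9].split(b"\0", 1)[0]
    return "&" + name.decode("ascii") + ";" if name else None


if __name__ == "__main__":
    # Hypothetical header built in memory, for demonstration only.
    demo = HEADER.pack(b"THB1A", b"3", b"0", b"c", b" \0", 991231, b"1",
                       b"author", b"S", b"i", b".", b";", 3, 256)
    print(parse_thb_header(demo))
    print(html_entity(b"copy" + b"\0" * 5 + b"a"))
```

The table of symbols would start right after these 44 octets; decoding it
is left out here because the record size for storage kinds other than 'c'
is not spelled out in this file.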