dictionary-format-v6.txt

   1 This is a quick write-up of the old dictionary file format, v6.
   2 v6 is troublesome as it relies on Java serialization and thus
   3 there will be references to Java types.
   4 This hasn't been checked for correctness and likely has some bugs.
   5 Also, I really should have used some standard format for writing this...
   6
   7 ===========================================
   8
   9 Some basic types:
  10
  11 [Short]
  12   2 bytes: big-endian, signed value (note: negative values generally not used here)
  13
  14 [Int]
  15   4 bytes: big-endian, signed value (note: negative values generally not used here)
  16
  17 [Long]
  18   8 bytes: big-endian, signed value (note: negative values generally not used here)
  19
  20
  21 [String]
  22   [Short]: string length
  23   n bytes: string, modified UTF-8, n is value from previous element
  24            note: no zero termination
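
For illustration, a minimal Python sketch of readers for these basic types.
The function names are made up here and are not taken from QuickDic. It
assumes the [String] length is read as unsigned, the way Java's
DataInput.readUTF does, and it handles the two modified UTF-8 special cases
(NUL stored as C0 80, characters outside the BMP stored as two 3-byte
surrogates).

```python
import io
import struct


def read_short(f):
    """[Short]: 2 bytes, big-endian, signed."""
    return struct.unpack(">h", f.read(2))[0]


def read_int(f):
    """[Int]: 4 bytes, big-endian, signed."""
    return struct.unpack(">i", f.read(4))[0]


def read_long(f):
    """[Long]: 8 bytes, big-endian, signed."""
    return struct.unpack(">q", f.read(8))[0]


def decode_modified_utf8(data):
    """Decode Java's modified UTF-8 into a Python str."""
    # NUL is written as the two bytes C0 80.
    data = data.replace(b"\xc0\x80", b"\x00")
    # 'surrogatepass' accepts the 3-byte surrogate halves; the UTF-16
    # round trip then joins each pair into a single code point.
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


def read_string(f):
    """[String]: length prefix, then that many bytes, no zero termination."""
    # Assumption: length treated as unsigned, as in DataInput.readUTF.
    (n,) = struct.unpack(">H", f.read(2))
    return decode_modified_utf8(f.read(n))


# Usage: an [Int] followed by a [String].
stream = io.BytesIO(b"\x00\x00\x00\x06" + b"\x00\x03abc")
version = read_int(stream)
text = read_string(stream)
```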
  25
  26 ======================================================
  27
  28 [Dictionary]
  29
  30 [Int]: version, fixed value 6
  31 [Long]: file creation time (in milliseconds since Jan. 1st 1970)
  32 [String]: dictionary information (human-readable)
  33
  34 list_of([source])
  35 list_of([pair_entry])
  36 list_of([text_entry])
  37 list_of([html_entry]) (since v5)
  38 list_of([index])
  39
  40 [String]: string "END OF DICTIONARY" (length value 17)
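
A rough sketch of how the fixed fields at the start of the file and the end
marker could be checked in Python. parse_header and has_end_marker are
hypothetical names. The sketch assumes the information string contains no
modified UTF-8 special cases, and that the end marker is the very last thing
in the file, as the layout above suggests.

```python
import struct
from datetime import datetime, timezone


def parse_header(data):
    """Return (version, creation_time, info, offset_of_first_list)."""
    version, millis, info_len = struct.unpack_from(">iqH", data, 0)
    if version != 6:
        raise ValueError(f"expected version 6, got {version}")
    pos = struct.calcsize(">iqH")  # 4 + 8 + 2 bytes, no padding with ">"
    # Assumption: plain UTF-8 is good enough for the info text.
    info = data[pos:pos + info_len].decode("utf-8")
    created = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return version, created, info, pos + info_len


def has_end_marker(data):
    """Check for the trailing [String] "END OF DICTIONARY" (length 17)."""
    return data.endswith(struct.pack(">H", 17) + b"END OF DICTIONARY")
```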
  41
  42 ===========================
  43
  44 All list_of entries describe a list of elements.
  45 These elements can have variable size, thus an index (table-of-contents, TOC)
  46 is needed.
  47 To reduce the cost of this table and enable more efficient compression,
  48 multiple entries can be stored in a block that gets one single index entry.
  49 I.e. random access is only possible to the start of a block;
  50 elements further inside the block can only be reached by reading through it.
  51 Caching should be used to reduce the performance impact of this (so
  52 that when entries 5, 4, 3 etc. of a block are read one after another,
  53 parsing and decompression are done only once).
  54
  55 These lists have the following base format:
  56
  57 [Int]: number of entries in the list (must be >= 0) (<size>)
  58
  59 <toc size>=<size>*8 + 8 bytes:
  60   table-of-contents.
  61   [Long] offset value for each block of entries.
  62   Followed by a final [Long] offset value to the end of the list data (<end offset>).
  63   Each offset is an absolute file position.
  64
  65 <end offset>-<toc size>-<start of toc> bytes:
  66   entry data
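
The base format above could be read roughly like this in Python.
read_list_toc is a made-up name. The sketch takes the <size>*8 + 8 formula
literally, i.e. it reads <size> offsets plus the final <end offset>, and
treats each offset as an absolute position in the file.

```python
import struct


def read_list_toc(data, pos):
    """Return (ranges, end_offset) for the list_of that starts at pos.

    ranges holds one (start, end) pair of absolute file positions per
    table-of-contents slot.
    """
    (size,) = struct.unpack_from(">i", data, pos)
    if size < 0:
        raise ValueError("list size must be >= 0")
    toc_start = pos + 4
    # <size> offsets followed by the final <end offset>.
    offsets = struct.unpack_from(f">{size + 1}q", data, toc_start)
    ranges = list(zip(offsets[:-1], offsets[1:]))
    return ranges, offsets[-1]
```

Parsing of whatever follows the list can continue at the returned end offset
without looking at the entry data at all.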
  67
  68 ==========================================================
  69
  70 [source]
  71
  72 [String]: name of source, e.g. "enwiktionary"
  73 [Int]: number of entries from that source (since v3) (I kind of wouldn't rely on that one
  74 being useful/correct...)
  75
  76 ========================================================
  77
  78 [pair_entry]
  79
  80 [Short]: source index (see list_of([source])) (since v1)
  81 [Int]: number of pairs in this entry (<num_pairs>)
  82 <num_pairs> times:
  83   [String]: in first language
  84   [String]: in second language (possibly empty)
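
As a sketch, a pair entry could be decoded like this in Python. The names are
invented for the example, and strings are decoded as plain UTF-8, ignoring
the modified UTF-8 special cases.

```python
import struct


def read_string(data, pos):
    """Read a [String] at pos; return (text, position after it)."""
    (n,) = struct.unpack_from(">H", data, pos)
    start = pos + 2
    return data[start:start + n].decode("utf-8"), start + n


def parse_pair_entry(data, pos=0):
    """Return (source_index, [(first, second), ...], position after entry)."""
    source_index, num_pairs = struct.unpack_from(">hi", data, pos)
    pos += 6  # 2-byte source index + 4-byte pair count
    pairs = []
    for _ in range(num_pairs):
        first, pos = read_string(data, pos)
        second, pos = read_string(data, pos)  # may be the empty string
        pairs.append((first, second))
    return source_index, pairs, pos
```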
  85
  86 =================================================
  87
  88 [text_entry]
  89
  90 [Short]: source index (see list_of([source])) (since v1)
  91 [String]: text
  92
  93 ===========================================
  94
  95 [html_entry]
  96
  97 [Short]: source index (see list_of([source])) (since v1)
  98 [String]: title for HTML entry
  99 [Int]: length of decompressed data in bytes (<declen>)
 100 [Int]: length of compressed data in bytes (<len>)
 101 <len> bytes: HTML page data, UTF-8 encoded, gzip compressed
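
A possible way to decode an HTML entry in Python, as a sketch only.
parse_html_entry is a hypothetical name. It assumes "gzip compressed" means a
complete gzip stream (header included), which is what gzip.decompress
expects, and it uses <declen> purely as a sanity check.

```python
import gzip
import struct


def parse_html_entry(data, pos=0):
    """Return (source_index, title, html, position after entry)."""
    source_index, title_len = struct.unpack_from(">hH", data, pos)
    pos += 4
    # Assumption: plain UTF-8 is good enough for the title.
    title = data[pos:pos + title_len].decode("utf-8")
    pos += title_len
    declen, complen = struct.unpack_from(">ii", data, pos)
    pos += 8
    raw = gzip.decompress(data[pos:pos + complen])
    if len(raw) != declen:
        raise ValueError("decompressed length does not match <declen>")
    return source_index, title, raw.decode("utf-8"), pos + complen
```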
 102
 103 =====================================
 104
 105 [index]
 106
 107 Note: this structure is used for binary search.
 108 It is thus critical that all entries are correctly
 109 sorted.
 110 The sorting is according to libicu. However, as the Java
 111 and Android versions do not match, special hacks
 112 have been added, like ignoring "-" for the comparison
 113 (unless that makes them equal, in which case they are
 114 compared including the dash).
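
To make the dash rule concrete, here is a small Python sketch. It uses plain
Python string ordering as a stand-in for the ICU collation and normalizer
rules, so it only illustrates the "ignore dashes, then break ties with the
dash included" part, not the real sort order. sort_key and find are made-up
names.

```python
import bisect


def sort_key(token):
    # Primary key: the token without dashes.
    # Tie-breaker: the token exactly as written.
    return (token.replace("-", ""), token)


tokens = sorted(["emu", "email", "e-mail", "em"], key=sort_key)
keys = [sort_key(t) for t in tokens]


def find(token):
    """Binary search; return the position of token, or -1 if absent."""
    key = sort_key(token)
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1
```

With plain string ordering "e-mail" would sort before "em"; with the dash
ignored it lands next to "email" instead.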
 115
 116 [String]: index short name
 117 [String]: index long name
 118 [String]: language ISO code (sort order depends on this)
 119 [String]: ICU normalizer rules to apply for sorting/searching
 120 1 byte: swap pair entries (if != 0, this index is for the second language entries in [pair_entry])
 121 [Int]: number of main tokens (?) (since v2)
 122 list_of([index_entry])
 123 [Int]: size in bytes of the following stop list set (since v4)
 124 Set<String>: stop list words (since v4)
 125 uniform_list_of([row])
 126
 127
 128 with uniform_list_of:
 129 [Int]: number of entries in list <num_entries>
 130 [Int]: size of entry <entry_size>
 131 <num_entries>*<entry_size> bytes: data
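
Since every entry has the same size, no table-of-contents is needed here. A
minimal Python sketch (read_uniform_list is an invented name):

```python
import struct


def read_uniform_list(data, pos=0):
    """Return (entries, position after the list); entries are raw bytes."""
    num_entries, entry_size = struct.unpack_from(">ii", data, pos)
    if num_entries < 0 or entry_size < 0:
        raise ValueError("negative count or entry size")
    start = pos + 8
    end = start + num_entries * entry_size
    if end > len(data):
        raise ValueError("list data is truncated")
    # Entry k always starts at start + k * entry_size.
    entries = [data[start + k * entry_size:start + (k + 1) * entry_size]
               for k in range(num_entries)]
    return entries, end
```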
 132
 133
 134 ================================================
 135
 136 [index_entry]
 137
 138 [String]: token
 139 [Int]: start index into uniform_list_of([row])
 140 [Int]: number of rows covered
 141 1 byte: <has_normalized>
 142 if <has_normalized> != 0:
 143   [String]: normalized token
 144 list_of([Int]): indices into list_of([html_entry]) (since v6)
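
Putting the pieces together, an index entry might be parsed like this in
Python. This is a sketch under assumptions: the names are invented, strings
are decoded as plain UTF-8, and the trailing list of [Int] is assumed to use
the list_of base format with one absolute offset per value plus the end
offset.

```python
import struct


def read_string(data, pos):
    """Read a [String] at pos; return (text, position after it)."""
    (n,) = struct.unpack_from(">H", data, pos)
    start = pos + 2
    return data[start:start + n].decode("utf-8"), start + n


def parse_index_entry(data, pos):
    """Return (entry_dict, position after the entry)."""
    token, pos = read_string(data, pos)
    start_row, num_rows, has_normalized = struct.unpack_from(">iib", data, pos)
    pos += 9
    normalized = None
    if has_normalized != 0:
        normalized, pos = read_string(data, pos)
    # list_of([Int]): count, count + 1 absolute offsets, then the values.
    (count,) = struct.unpack_from(">i", data, pos)
    offsets = struct.unpack_from(f">{count + 1}q", data, pos + 4)
    html_indices = [struct.unpack_from(">i", data, off)[0]
                    for off in offsets[:-1]]
    entry = {
        "token": token,
        "start_row": start_row,
        "num_rows": num_rows,
        "normalized": normalized,
        "html_indices": html_indices,
    }
    return entry, offsets[-1]
```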
 145
 146 =======================================
 147
 148 [row]
 149
 150 1 byte: <type>
 151 [Int]: index
 152
 153 <type> means:
 154 0: index into list_of([pair_entry])
 155 1: index into list_of([index_entry]) (mark as "main word header" entry)
 156 2: index into list_of([text_entry])
 157 3: index into list_of([index_entry]) (mark as "extra info/translation" entry)
 158 4: index into list_of([html_entry])
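
Each row is 5 bytes, so the data part of uniform_list_of([row]) can be split
with a fixed struct layout. A short Python sketch (ROW and parse_rows are
made-up names):

```python
import struct

# 1 byte <type> followed by a big-endian [Int]; ">" means no padding.
ROW = struct.Struct(">bi")


def parse_rows(data):
    """Split raw row data into a list of (type, index) tuples."""
    if len(data) % ROW.size != 0:
        raise ValueError("row data is not a multiple of the row size")
    return list(ROW.iter_unpack(data))
```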
 159
 160 =======================================
 161
 162 Set<String>
 163
 164 Java serialization of java.util.HashSet.
 165 The first part always consists of the same 40 bytes:
 166     0xac, 0xed, // magic
 167     0x00, 0x05, // version
 168     0x73, // object
 169     0x72, // class
 170     // Java String "java.util.HashSet"
 171     0x00, 0x11, 0x6a, 0x61, 0x76, 0x61, 0x2e, 0x75, 0x74, 0x69,
 172     0x6c, 0x2e, 0x48, 0x61, 0x73, 0x68, 0x53, 0x65, 0x74,
 173     // serialization ID
 174     0xba, 0x44, 0x85, 0x95, 0x96, 0xb8, 0xb7, 0x34,
 175     0x03, // flags: serializable (0x02) + custom write method (0x01)
 176     0x00, 0x00, // fields count
 177     0x78, // blockdata end
 178     0x70, // null (superclass)
 179     0x77, 0x0c // blockdata short, 0xc bytes
 180
 181 [Int]: capacity. Not used for anything, but set to >= <num_entries>
 182 [Float]: load factor (4 bytes, big-endian IEEE 754). May affect performance of old QuickDic versions, set to 0.75f
 183 [Int]: <num_entries>
 184 <num_entries> times:
 185     1 byte 0x74: String type
 186     [String]: stop word
 187 1 byte 0x78: blockdata end
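
Because the prefix is fixed, the stop list can be read without a real Java
deserializer. A Python sketch of that idea follows. parse_stop_list is an
invented name, words are decoded as plain UTF-8, and only the plain HashSet
layout shown above is accepted, so anything with a different prefix is
rejected.

```python
import struct

# The fixed 40-byte prefix listed above.
HASHSET_PREFIX = (
    bytes.fromhex("aced0005" "73" "72" "0011")
    + b"java.util.HashSet"
    + bytes.fromhex("ba44859596b8b734" "03" "0000" "78" "70" "770c")
)


def parse_stop_list(blob):
    """Return the stop words of a serialized java.util.HashSet as a set."""
    if not blob.startswith(HASHSET_PREFIX):
        raise ValueError("not the expected HashSet prefix")
    pos = len(HASHSET_PREFIX)
    # capacity and load factor are read but not needed for anything.
    _capacity, _load_factor, num_entries = struct.unpack_from(">ifi", blob, pos)
    pos += 12
    words = set()
    for _ in range(num_entries):
        if blob[pos] != 0x74:
            raise ValueError("expected String type marker 0x74")
        (n,) = struct.unpack_from(">H", blob, pos + 1)
        pos += 3
        words.add(blob[pos:pos + n].decode("utf-8"))
        pos += n
    if blob[pos] != 0x78:
        raise ValueError("expected blockdata end 0x78")
    return words
```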
 188
 189 Note: Some even older dictionaries wrote out a LinkedHashSet instead of a
 190 HashSet.
 191 That adds the following bytes describing LinkedHashSet before the 0x72 above:
 192     0x72, // class
 193     // Java String "java.util.LinkedHashSet"
 194     0x00, 0x17, 0x6a, 0x61, 0x76, 0x61, 0x2e, 0x75, 0x74, 0x69,
 195     0x6c, 0x2e, 0x4c, 0x69, 0x6e, 0x6b, 0x65, 0x64, 0x48, 0x61,
 196     0x73, 0x68, 0x53, 0x65, 0x74,
 197     // serialization ID
 198     0xd8, 0x6c, 0xd7, 0x5a, 0x95, 0xdd, 0x2a, 0x1e,
 199     0x02, // flags
 200     0x00, 0x00, // fields count
 201     0x78 // blockdata end