dictionary-format.txt

   1 This is a quick write-up of the dictionary file format, v7.
   2 v6 is troublesome as it relies on Java serialization and thus
   3 I won't even attempt to document it.
   4 This is hasn't been checked for correctness and likely has some bugs.
   5 Also, I really should have used some standard format for writing this...
   6
   7 ===========================================
   8
   9 Some basic types:
  10
  11 [String]
  12 [Short]: string length
  13 n bytes: string, modified UTF-8, n is value from previous element
  14          note: no zero termination
  15
  16 [Short]
  17 2 bytes: big-endian, signed value (note: negative values generally not used here)
  18
  19 [Int]
  20 4 bytes: big-endian, signed value (note: negative values generally not used here)
  21
  22 [Long]
  23 8 bytes: big-endian, signed value (note: negative values generally not used here)
  24
  25 ======================================================
  26
  27 [Dictionary]
  28
  29 [Int]: version, fixed value 7
  30 [Long]: file creation time (in milliseconds since Jan. 1st 1970)
  31 [String]: dictionary information (human-readable)
  32
  33 list_of([source])
  34 list_of([pair_entry], block size 64, compressed)
  35 list_of([text_entry])
  36 list_of([html_entry], block size 64, compressed)
  37 list_of([html_entry_data], block size 128, compressed)
  38 list_of([index])
  39
  40 [String]: string "END OF DICTIONARY" (length value 17)
  41
  42
  43 Note: block size and compression flags is only what
  44 the dictionary generation currently uses and seems
  45 a good compromise between performance and size for
  46 wiktionary content. It is not a requirement
  47 of the file format.
  48
  49 ==============================
  50
  51 [varInt]
  52
  53 Much of the data inside the dictionary uses a varInt format
  54 (variable length integer), always stored in big-endian,
  55 which can take on of these forms:
  56 1 byte: value in 0x0 - 0x7F range.
  57 2 bytes: value in 0x80 - 0x3FFF range, offset by 0x8000
  58 3 bytes: value in 0x4000 - 0x1FFFFF range, offset by 0xC00000
  59 4 bytes: value in 0x200000 - 0xFFFFFFF range, offset by 0xE0000000
  60 5 bytes: all other values, including negative, written as is with the value
  61 0xF0 prepended in the first byte.
  62
  63 For decoding, the number of leading 1s in the first byte is the overall
  64 length - 1.
  65 Note that this scheme would allow storing an even larger range values
  66 in the 5-byte variant and can be extended to arbitrary length, however
  67 that is not currently implemented.
  68
  69 ===========================
  70
  71 All list_of entries describe a list of elements.
  72 These elements can have variable size, thus an index (table-of-contents, TOC)
  73 is needed.
  74 To reduce the cost of this table and enable more efficient compression,
  75 multiple entries can be stored in a block that gets one single index entry.
  76 I.e. it is only possible to do random-access to the start of a block,
  77 seeking to elements further inside the block must be done via reading.
  78 Caching should be used to reduce the impact of this.
  79
  80 These lists have the following base format:
  81
  82 [varInt]: number of entries in the list (must be >= 1) (<size>)
  83 [varInt]: compression block size (in entries) (must be >= 1) (<blockSize>)
  84 [varInt]: flags. Currently only bit 0 used, indicating compression is used
  85
  86 <toc size>=<size>/<blockSize>*4 + 4 bytes:
  87 (note division with rounding up if not divisible)
  88 table-of-contents. [Int] offset value for each block of entries.
  89 Followed by a final [Int] offset value to the end of the list data (<end offset>).
  90 Each offset is relative to the start of this block.
  91 Note that currently for simplicity Java int type is used
  92 to process these values, even though negative values make no sense.
  93 This limits the maximum amount of data to around 2GB.
  94
  95 <end offset>-<toc size> bytes: data
  96
  97 If compression is enabled, the data for each block is
  98 deflate compressed.
  99 Note that this is really raw deflate, not e.g. gzip,
 100 so neither header nor footer.
 101 There is also no decompressed size, so to decompress
 102 the start of the next block must be used to find the end
 103 of the compressed data and signal the end of input to the
 104 inflater, and whatever data the inflater produces is the
 105 decompressed input (note: this might be an implementation
 106 detail, you might need the decompressed size to decompress
 107 and parse the contained data, but it prevents things like
 108 the inflater overreading and producing spurious errors).
 109
 110 ==========================================================
 111
 112 [source]
 113
 114 [String]: name of source, e.g. "enwiktionary"
 115 [Int]: number of entries from that source (I kind of wouldn't rely on that one
 116 being useful/correct...)
 117
 118 ========================================================
 119
 120 [pair entry]
 121
 122 [varInt]: source index (see list_of([source]))
 123 [varInt]: number of pairs in this entry (<num_pairs>)
 124 <num_pairs> times:
 125   [String]: in first language
 126   [String]: in second language (possibly empty)
 127
 128 =================================================
 129
 130 [text_entry]
 131
 132 [varInt]: source index (see list_of([source]))
 133 [String]: text
 134
 135 ===========================================
 136
 137 [html_entry]
 138
 139 [varInt]: source index (see list_of([source]))
 140 [String]: title for HTML entry
 141
 142 ====================================
 143
 144 [html_entry_data]
 145
 146 [varInt]: length of data in bytes (<len>)
 147 <len> bytes: HTML page data, UTF-8 encoded
 148
 149 =====================================
 150
 151 [index]
 152
 153 Note: this structure is used for binary search.
 154 It is thus critical that all entries are correctly
 155 sorted.
 156 The sorting is according to libicu, however as Java
 157 and Android versions do not match special hacks
 158 have been added, like ignoring "-" for the comparison
 159 (unless that makes them equal, then they are
 160 compared including the dash).
 161
 162 [String]: index short name
 163 [String]: index long name
 164 [String]: language ISO code (sort order depends on this)
 165 [String]: ICU normalizer rules to apply for sorting/searching
 166 1 byte: swap pair entries (if != 0, this index is for the second language entries in [pair_entry])
 167 [Int]: number of main tokens (?)
 168 list_of([index_entry], block size 32, compressed)
 169 [varInt]: stop list size <stoplist_size>
 170 <stoplist_size> times:
 171     [String]: stop list words
 172 uniform_list_of([row])
 173
 174
 175 with uniform_list_of:
 176 [Int]: number of entries in list <num_entries>
 177 [Int]: size of entry <entry_size>
 178 <num_entries>*<entry_size> bytes: data
 179
 180
 181 ================================================
 182
 183 [index_entry]
 184
 185 [String]: token
 186 [varInt]: start index into uniform_list_of([row])
 187 [varInt]: number of rows covered
 188 1 byte: <has_normalized>
 189 if <has_normalized> != 0:
 190   [String]: normalized token
 191 [varInt]: <num_html_entries>
 192 <num_html_entries> times:
 193   [varInt]: index into list_of([html_entries])/list_of([html_entries_data])
 194
 195 =======================================
 196
 197 [row]
 198
 199 1 byte: highest 3 bits <type>, lowest 5 bits additional high bits for next value (bits 16 - 20)
 200 [Short]: lowest 16 bit of index
 201
 202 <type> means:
 203 1: index into list_of([pair_entry])
 204 2: index into list_of([index_entry]) (mark as "main word header" entry)
 205 3: index into list_of([text_entry])
 206 4: index into list_of([index_entry]) (mark as "extra info/translation" entry)
 207 5: index into list_of([html_entry])/list_of([html_entry_data])