New Transliteration Test Files

The Test_*.html files show the transliteration of characters for given languages. The sample for each language consists of "What Is Unicode" in that language, followed by other available text. The text is broken into sentences for ease of viewing (note: we know of some problems with the sentence rules for Japanese and Chinese). The left column is the original, and the right column is the romanization. The program also converts the romanization back to the original script. If the source and the reverse transformation differ, that is indicated by making the background red from that point on.
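The round-trip check described above can be sketched roughly as follows. This is only an illustration, not the program that generated the test files: the function names, the choice of which text receives the highlight, and the exact HTML used for the red background are all assumptions.

```python
import html
from typing import Optional


def first_discrepancy(source: str, round_tripped: str) -> Optional[int]:
    """Return the index of the first differing character, or None if equal."""
    for i, (a, b) in enumerate(zip(source, round_tripped)):
        if a != b:
            return i
    # One string is a prefix of the other: they differ where the shorter ends.
    if len(source) != len(round_tripped):
        return min(len(source), len(round_tripped))
    return None


def mark_from_discrepancy(source: str, round_tripped: str) -> str:
    """Wrap the round-tripped text in a red span from the first mismatch on.

    The span markup is hypothetical; the real test files may mark the
    discrepancy differently.
    """
    i = first_discrepancy(source, round_tripped)
    if i is None:
        return html.escape(round_tripped)
    return (
        html.escape(round_tripped[:i])
        + '<span style="background:red">'
        + html.escape(round_tripped[i:])
        + "</span>"
    )
```

For example, `mark_from_discrepancy("abc", "aXc")` leaves the leading `a` unmarked and wraps `Xc` in the red span, which matches the "from that point on" behavior.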

Note: If you have more text that you would like added to the samples, just let me know. I am particularly interested in name lists, since names are the typical source text for transliteration.

Standards

The goal is to follow a given standard, such as ISO* or UNGEGN, wherever possible. We also need to round-trip, which in some cases means adding accent marks to disambiguate characters. The source standards are also often missing some characters, such as characters with combining hamzas in Arabic. Remember that the goal here is transliteration (unambiguously representing all the letters in the original), not transcription (representing the best pronunciation).
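To make the difference concrete, here is a minimal sketch with two toy tables for four Greek letters. Both tables are hypothetical and chosen only for illustration; they are not taken from ISO, UNGEGN, or the actual rule files. The transcription table maps two letters to the same Latin letter and so cannot be reversed, while the transliteration table adds macrons to keep every letter distinct.

```python
# Hypothetical toy tables, for illustration only.
TRANSCRIBE = {"η": "i", "ι": "i", "ο": "o", "ω": "o"}  # lossy
TRANSLITERATE = {
    "η": "\u012b",  # i with macron, disambiguates eta from iota
    "ι": "i",
    "ο": "o",
    "ω": "\u014d",  # o with macron, disambiguates omega from omicron
}


def apply(table: dict, text: str) -> str:
    """Map each character through the table, leaving unknown ones alone."""
    return "".join(table.get(ch, ch) for ch in text)


def is_reversible(table: dict) -> bool:
    """A table round-trips only if no two sources share a target."""
    return len(set(table.values())) == len(table)


def invert(table: dict) -> dict:
    """Build the reverse table, refusing tables that would lose data."""
    if not is_reversible(table):
        raise ValueError("table is ambiguous and cannot be inverted")
    return {target: source for source, target in table.items()}


word = "ηιοω"
romanized = apply(TRANSLITERATE, word)
restored = apply(invert(TRANSLITERATE), romanized)
```

Here `restored` equals `word`, whereas `invert(TRANSCRIBE)` raises an error because the transcription has discarded the distinction between the letters. Real rules also have to handle context and multi-character sequences, which this character-by-character sketch ignores.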

Notes

  1. For readability, the files have a few other things besides just the transliteration:
  2. I don't think that ISO 11940 is a particularly good way to romanize Thai, but it is at least complete and a standard. So all I am interested in for now is whether the samples in the file follow it (with the exceptions described above).
  3. Some of the files also have a set of characters at the end, one character per row, each followed by a row listing its hex code and name.
  4. The source rules for all of these are at the following URL. So if you want to know the details of how the characters are handled, that is the place to look.
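The per-character listing found at the end of some files (one character per row, followed by a row with its hex code and name) could be produced along these lines. This is a sketch using Python's standard `unicodedata` module; the helper names and the exact `U+XXXX NAME` layout are assumptions, not the format the test files necessarily use.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the hex code point and Unicode name of one character."""
    name = unicodedata.name(ch, "<unnamed>")  # fallback for unnamed code points
    return f"U+{ord(ch):04X} {name}"


def character_rows(chars: str) -> list:
    """Alternate rows: the character itself, then its hex and name."""
    rows = []
    for ch in chars:
        rows.append(ch)
        rows.append(describe(ch))
    return rows
```

For instance, `describe("ก")` gives `U+0E01 THAI CHARACTER KO KAI`.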