18 January 2006 @ 06:48 pm
Random idea time  
While typing my life away at work, I've been thinking of the balanced-ternary computer. As is usual for me, I've been putting the cart before the horse: in this case, I've been thinking up a balanced-ternary character code.

First off, some terminology. A ternary digit (a unit with value -1, 0, or 1) is the smallest unit of information in the computer and is called a trit (by analogy with bit, a portmanteau of binary digit). The chunks of data that the processor generally deals with are 9 trits in size and, because they are analogous to binary bytes, are called trytes. A sub-tryte unit of data consisting of 3 trits, analogous to a binary nibble, is, of course, a tribble. ;)
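To make the terminology concrete, here is a minimal Python sketch of converting between ordinary integers and 9-trit trytes. The function names `to_trits` and `from_trits` are my own inventions for illustration; nothing here is tied to real hardware.

```python
def to_trits(n, width=9):
    """Return the balanced-ternary digits of n, most significant first."""
    limit = (3 ** width - 1) // 2          # 9841 for a 9-trit tryte
    if not -limit <= n <= limit:
        raise ValueError(f"{n} does not fit in {width} trits")
    trits = []
    for _ in range(width):
        r = n % 3                          # 0, 1, or 2
        if r == 2:                         # a digit of 2 becomes -1 with a carry
            r = -1
        trits.append(r)
        n = (n - r) // 3
    return trits[::-1]


def from_trits(trits):
    """Inverse of to_trits: fold digits (most significant first) into an int."""
    value = 0
    for t in trits:
        value = value * 3 + t
    return value


print(to_trits(5, width=3))   # [1, -1, -1], i.e. 9 - 3 - 1
```

A tribble is just the `width=3` case, so it covers -13 to 13, while a full tryte covers -9841 to 9841.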

My character code is one tryte per character, but doesn't use all 19683 (3^9) possible values for individual specific characters. Instead, only the two lower-order tribbles are used to specify characters (which still leaves 729, or 3^6, which is plenty for Western scripts). The remaining high-order tribble, with its 27 possible values, is used for control codes, which attach extra semantics to characters. They have several functions, including some that ASCII handles with control codes of its own (which in ASCII are full-fledged characters), such as end-of-line and end-of-file. Other control codes include record, field, page, and column separators, explicit word breaks and word break prohibitions, hyphenation hints, additional sorting instructions, and character combining.
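As a rough sketch of that layout, here is how a tryte could be split into its control tribble and its character code in Python. The names `pack` and `unpack` are hypothetical, and I'm assuming both fields are themselves balanced (control codes from -13 to 13, character codes from -364 to 364); no actual code assignments are implied.

```python
CHAR_SPAN = 3 ** 6               # 729 values in the two low-order tribbles
CHAR_MAX = (CHAR_SPAN - 1) // 2  # character codes run from -364 to 364
CTL_MAX = (3 ** 3 - 1) // 2      # control codes run from -13 to 13


def pack(control, char):
    """Combine a control code and a character code into one tryte value."""
    if not -CTL_MAX <= control <= CTL_MAX:
        raise ValueError("control code out of range")
    if not -CHAR_MAX <= char <= CHAR_MAX:
        raise ValueError("character code out of range")
    return control * CHAR_SPAN + char


def unpack(tryte):
    """Split a tryte value back into (control, char)."""
    # Shift into 0..728, take the remainder, then shift back to -364..364.
    char = (tryte + CHAR_MAX) % CHAR_SPAN - CHAR_MAX
    control = (tryte - char) // CHAR_SPAN
    return control, char


print(unpack(pack(2, -5)))   # (2, -5)
```

A character with no control code applied is simply its own character code, since `pack(0, c) == c`.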

Most separators (end of line, end of paragraph, end of record, end of field, end of page/form feed, end of column, and end of file) cause a break after the modified character. For example, if an "A" character is given the end-of-line control code, it is the last character in its line and the following character is the first character of the next line. The word break and hyphenation hint, however, allow line breaks before the modified character. Both of those allow lines to break in the middle of a word if a line is too long to display; the main difference is that the display mode must add a hyphen if it breaks a line at a hyphenation hint, but must not add one at a word break. The word break prohibition code may be applied to characters that normally allow a soft line break before them (such as a space), to prevent that break.
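Here is a small Python sketch of how a display routine might honor those break rules. Everything in it is an assumption made for illustration: the `Ctl` names, the `wrap` function, and the representation of text as (character, control) pairs. It also simplifies things by not counting an added hyphen against the line width and by leaving out the word break prohibition.

```python
from enum import Enum


class Ctl(Enum):
    NONE = 0
    END_OF_LINE = 1    # hard break AFTER the marked character
    WORD_BREAK = 2     # soft break allowed BEFORE it, no hyphen
    HYPHEN_HINT = 3    # soft break allowed BEFORE it, hyphen added


def wrap(text, width):
    """text is a list of (char, Ctl) pairs; return the display lines."""
    lines = []
    line = ""
    last_break = None  # (position in line, needs_hyphen) of latest soft break
    for ch, ctl in text:
        if ctl in (Ctl.WORD_BREAK, Ctl.HYPHEN_HINT) and line:
            last_break = (len(line), ctl is Ctl.HYPHEN_HINT)
        line += ch
        if len(line) > width and last_break is not None:
            pos, hyphen = last_break
            lines.append(line[:pos] + ("-" if hyphen else ""))
            line = line[pos:]          # the marked character starts the new line
            last_break = None
        if ctl is Ctl.END_OF_LINE:
            lines.append(line)
            line = ""
            last_break = None
    if line:
        lines.append(line)
    return lines


word = [("a", Ctl.NONE), ("b", Ctl.NONE), ("c", Ctl.HYPHEN_HINT), ("d", Ctl.NONE)]
print(wrap(word, 3))    # ['ab-', 'cd']
print(wrap(word, 10))   # ['abcd']
```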

Sorting instructions include the nonsorting character, secondary sort, and invisible sorting character codes. A character with the nonsorting code applied is simply skipped during collation. The secondary sort code marks a character as being less important: a character with the secondary sort code applied is skipped if compared against one without the code applied during collation. The invisible sorting character control code keeps the affected character out of the display, but not out of collation.
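A quick Python sketch of the nonsorting and invisible sorting codes, with text modeled as (character, control) pairs. The constant names and the two functions are hypothetical, and the secondary sort code is left out because it depends on how two strings are compared against each other.

```python
NONSORTING = "nonsorting"   # shown on screen, skipped during collation
INVISIBLE = "invisible"     # used during collation, never shown


def sort_key(text):
    """Characters that take part in collation."""
    return "".join(ch for ch, ctl in text if ctl != NONSORTING)


def display(text):
    """Characters that actually appear on screen."""
    return "".join(ch for ch, ctl in text if ctl != INVISIBLE)


def plain(s, ctl=None):
    """Helper: apply the same control code to every character of s."""
    return [(ch, ctl) for ch in s]


title = plain("The ", NONSORTING) + plain("Beatles")
name = plain("M") + plain("a", INVISIBLE) + plain("cDonald")

print(display(title), "->", sort_key(title))   # The Beatles -> Beatles
print(display(name), "->", sort_key(name))     # McDonald -> MacDonald
```

The two examples are the classic filing cases: a leading article that should be ignored, and a "Mc" name that should file as "Mac".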

Character combining is used to provide accented characters and ligatures. In Unicode, combining is a property of the following diacritic, and in old printer-ASCII, diacritics were (theoretically) formed by a letter BACKSPACE symbol sequence. Here, by contrast, the combining control code is applied to the base character (such as the O in Ö), which comes first. The instruction basically means "don't print this character yet, something is going to be added to it". The following character can have its own control code as well: for example, the "non-sorting" code could be applied to the acute accent, so that "á" sorts as equivalent to "a". That control code can even be the combining code itself, which in theory allows stacked diacritics as in Vietnamese.

Besides diacritics, the combining code can be used with another letter (rather than a symbol) following, to create ligatures such as æ and œ (both with the following letter "e"), and the German ß (s+s). This allows these ligatures to sort as their unligated equivalents by default, and allows a hyphenation hint code to be applied to the second letter, so they can be hyphenated between elements of the ligature (particularly important in the case of ß, which in German should be hyphenated "s-s"). If the current typeface (or display mode) does not provide a ligature for a given combining pair, the combining code may be ignored (so a font without æ would show "ae").
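The fallback behavior is easy to sketch in Python. Here text is a list of (character, combining flag) pairs and the "font" is just a dictionary of the pairs it can draw as a single glyph. The `render` function and the font tables are made up for illustration, and stacked diacritics (a combining code on the second character as well) are not handled.

```python
def render(text, font):
    """text: list of (char, combining) pairs; font: {(base, next): glyph}."""
    out = []
    i = 0
    while i < len(text):
        ch, combining = text[i]
        if combining and i + 1 < len(text):
            pair = (ch, text[i + 1][0])
            if pair in font:
                out.append(font[pair])   # draw the pair as one glyph
                i += 2
                continue
        # No combining code, or the font lacks this pair: the combining
        # code is ignored and the character is drawn on its own.
        out.append(ch)
        i += 1
    return "".join(out)


rich_font = {("a", "e"): "æ", ("o", "e"): "œ", ("s", "s"): "ß", ("O", "¨"): "Ö"}
poor_font = {}

word = [("C", False), ("a", True), ("e", False), ("s", False), ("a", False), ("r", False)]
print(render(word, rich_font))   # Cæsar
print(render(word, poor_font))   # Caesar
```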

As for the character codes proper, I haven't filled out the chart much. However, I've been playing with the idea of uppercase letters having the same code as their lowercase equivalents, but with the signs swapped: a case-insensitive sort would involve taking the absolute value. The problem with this is that a sort that doesn't take the absolute value would put one case or the other in the opposite direction, since whichever case gets the negative codes would run from Z to A!
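Here is that problem in a few lines of Python. The code assignment is purely hypothetical (lowercase a to z as 1 to 26, uppercase as the negatives), since the chart isn't filled out, and which case gets the positive codes is an arbitrary choice.

```python
def code(ch):
    """Hypothetical code: lowercase a..z -> 1..26, uppercase -> -1..-26."""
    value = ord(ch.lower()) - ord("a") + 1
    return value if ch.islower() else -value


letters = ["b", "A", "c", "B", "a", "C"]

# Sorting on the raw codes: the uppercase letters come out backwards.
print(sorted(letters, key=code))
# ['C', 'B', 'A', 'a', 'b', 'c']

# Case-insensitive sort: take the absolute value of each code.
print(sorted(letters, key=lambda ch: abs(code(ch))))
# ['A', 'a', 'b', 'B', 'c', 'C']
```

In the second result, letters that differ only in case keep their original relative order, because Python's sort is stable.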
Current Mood: geeky