’Trilaas Outside Manila!—An Anthropological Linguistic Followup on Multi-Trill Counting —Claude Searsplainpockets & Helga von Helganschtein y Searsplainpockets SpecGram Vol CLXI, No 2 Contents SpecGram, the Religion—Margo T. Cip, A. M. Grössten, & Strčprst Kskrzkrk

Huffmenglish
A Proposal for Lossless Entropy-Encoded Spelling Reform

Twvx Tthm, Trnbg Txfk of Aluidd Ygtir
ad
Tthh usn trryv Aluuaj, Txfk-aat-Ttug fo Aluidd Ygtir

Some say time is money, but you can always get more money.1 You can’t make, buy, beg, borrow, or steal more time. More of our time is spent using language than any other single activity.2 Heeding the wise counsel of H. Sanderson Chambers III and his “21st Century Proposal for English Spelling Reform”, we realize that the time is ripe for truly radical spelling reform. Our proposal will make English spelling significantly more difficult, cementing its place as the world language as Chambers suggests, while saving potentially billions of person-years of time spent communicating.3

The rise of texting as a communicative medium has conclusively shown that users of English are ready to embrace some form of textual compression in their everyday lives.4 Rather than the inconsistent, ad hoc shorthand used inconsistently by inconsistent texters, we propose a scientific text compression based on sound computational analysis and optimized for efficiency.

Huffman coding is a well-established, lossless, entropy-encoding algorithm for lossless data compression.5 The result of Huffman coding is an optimally compressed encoding of the input symbols, with the advantageous feature that it is prefix-free, meaning that no encoding for a given symbol is a prefix (or initial substring) of the encoding of another symbol.
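For readers who have not met the algorithm, here is a minimal sketch of textbook binary Huffman coding, treating whole words as symbols. It is not the variant used in this article; the sample sentence and the names `huffman_codes` and `is_prefix_free` are illustrative assumptions.

```python
import heapq
from collections import Counter


def huffman_codes(freqs):
    """Build a binary Huffman code for a {symbol: count} mapping."""
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A lone symbol still needs a non-empty code.
        return {sym: "0" for sym in freqs}
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, extending their codes by one bit.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def is_prefix_free(codes):
    """True if no code is a prefix of another (checked on sorted neighbours)."""
    values = sorted(codes.values())
    return all(not b.startswith(a) for a, b in zip(values, values[1:]))


# Hypothetical sample text; word counts stand in for corpus frequencies.
words = "the cat and the dog and the bird saw the cat".split()
codes = huffman_codes(Counter(words))
print(codes, is_prefix_free(codes))
```

The most frequent word ends up with one of the shortest codes, and the prefix-free check passes for any code this function builds.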

We have developed a proprietary6 variant of the algorithm which performs a 26-way split encoding while preserving some or all of the letters found in the highest frequency symbols (words) to be encoded. The use of a 26-way split allows the use of the standard, well-known Latin alphabet, and preserves the meaning of capitalization. The optimal nature of the encoding means that text need not be further compressed to speed transmission; the compression is already built into the encoding.
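The proprietary variant is not published, so the following is only a sketch of generic 26-ary Huffman coding over the lowercase Latin alphabet. It makes no attempt to preserve letters of the original words, and the vocabulary (`w0` to `w59`) and its Zipf-like weights are invented for illustration.

```python
import heapq
import string


def huffman_26(freqs, alphabet=string.ascii_lowercase):
    """Build a k-ary Huffman code (k = len(alphabet)) for {symbol: count}."""
    k = len(alphabet)
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    tie = len(heap)
    # Pad with zero-weight dummies so every merge can take exactly k nodes.
    while len(heap) > 1 and (len(heap) - 1) % (k - 1) != 0:
        heap.append((0, tie, {}))
        tie += 1
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: alphabet[0] for sym in freqs}
    while len(heap) > 1:
        # Pop the k lightest subtrees and hang them under one new node.
        group = [heapq.heappop(heap) for _ in range(k)]
        merged = {}
        # The heaviest subtree in the group gets the first letter.
        for letter, (_, _, codes) in zip(alphabet, reversed(group)):
            merged.update({s: letter + c for s, c in codes.items()})
        heapq.heappush(heap, (sum(w for w, _, _ in group), tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical vocabulary of 60 "words" with Zipf-like counts.
freqs = {f"w{rank}": 1000 // (rank + 1) for rank in range(60)}
codes = huffman_26(freqs)
print(codes["w0"], codes["w59"])
```

With only 60 symbols the most frequent ones receive single letters and the rest receive two; a vocabulary of tens of thousands of words would push the rarest codes out to several letters.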

We have applied7 this algorithm8 to a 30-million-word corpus of spoken and written conversational English, with over 41 thousand distinct symbols for encoding.9 The results for the most common English words are given below, in frequency order.

you      ao
i        z
to       to
the      te
a        az
and      ad
that     th
it       it
of       of
me       me
what     wh
is       is
in       in
this     ts
know     kn
i’m      im
for      fo
no       no
have     ha
my       my
don’t    dn
just     ju
not      nt
do       do
be       be
on       on
your     yu
was      ws
we       we
it’s     iz
with     wi
so       so
but      bu
all      aq
well     wl
are      ax
he       he
oh       oh
about    ab
right    ri
you’re   yo
get      ge
here     hz
out      ou
going    gi
like     li
yeah     ye
if       if
her      hr
she      sh
can      ca
up       tz
want     wa
think    ti
that’s   ta
now      nw
go       go
him      alm
at       aat
how      alw

As you can see,10 the algorithm has preserved some characteristic features of many of the most common words. The relative length of the words and their new encoding (which we have dubbed Huffmenglish) is a good indicator of how well speakers of English have optimized their language over time. The first person singular pronoun should indeed be only one letter long. But allowing the indefinite article to be a single letter gave it too high of a priority!

Of course the compression scheme eventually has to make some difficult choices concerning priority, compression, and similarity to uncompressed English; only the final m of him and the final w of how are preserved. For less frequent terms, the Huffmenglish encoded versions are quite similar in appearance to line noise, which is what one would expect from a properly compressed data stream.11

arab          tymrb
arabic        ukciz
ass           yss
asses         trkv
der           trryv
duh           txrh
editor        txfk
english       yjs
fairness      treyj
german        tnnp
germans       tnnf
grammarian    ygtir
hindi         alugnz
jonathan      tthh
jones         tthm
large         ttug
larger        ttue
managing      trnbg
mandarin      ukdiz
meer          aluuaj
portuguese    tyebd
speculate     trtnf
speculative   aluidd
tight         uig
tights        trout
trey          twvx
trout         trnvt
understand    ass
understands   twdq
van           usn
word          unb
worded        ukmde
words         unp
word’s        tymgv
yes           arc
zucchini      trofhi

Of note, the length of the encodings of the names of various languages indicates their relative importance to speakers of English. Not surprisingly, English is the most compressed, while Hindi is among the least; in fact, its encoded version is longer than its unencoded version, indicating that we have collectively been wasting the short form “Hindi” on an infrequently discussed language.

Conversely, longer words, especially the frequent ones, but also those rarely used, were all significantly compressed by the Huffmenglish encoding. In fact, no Huffmenglish word, under the present encoding scheme, is longer than 6 letters.

aaaaarrrrrrggghhh      uknqa
counterintelligence    ukmvg
misunderstandings      trtba
overcompensating       uklmo
uncharacteristically   aluhyw
underappreciated       alumup

One disclaimer we must make: our corpus was not exhaustive (as no corpus can be), and for reasons of computing power we further restricted our initial encoding to words that appeared more than five times in the corpus. Version 2.0 of the Huffmenglish encoding will include more infrequent words and other algorithmic improvements. In the meantime, we will extend Huffmenglish by borrowing a natural extension from XML, the CDATA section, which allows us to include words for which there is as yet no proper Huffmenglish encoding. The proper Huffmenglish 1.0 representation of Huffmenglish would thus be the natural form <![CDATA[Huffmenglish]]>. Of course, we expect Huffmenglish to rapidly overtake English as the dominant form of communication, and so we use the two interchangeably in Huffmenglish, encoding both as Yjs.
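A word-level encoder with the CDATA fallback might look roughly like the sketch below. The `CODES` dictionary holds only a few entries copied from the tables in this article, the function names are illustrative, and “zyzzyva” is simply a stand-in for any word missing from that small dictionary.

```python
import re

# A few entries copied from the tables in this article (lowercase word -> code).
CODES = {"i": "z", "the": "te", "understand": "ass", "words": "unp", "english": "yjs"}


def encode_word(word):
    """Encode one word, falling back to a CDATA wrapper if it has no code."""
    code = CODES.get(word.lower())
    if code is None:
        return f"<![CDATA[{word}]]>"
    # Carry the capitalization of the original word over to its code.
    return code.capitalize() if word[0].isupper() else code


def encode(text):
    """Encode every run of letters; spaces and punctuation pass through."""
    return re.sub(r"[A-Za-z]+", lambda m: encode_word(m.group()), text)


print(encode("I understand the words"))  # Z ass te unp
print(encode("English zyzzyva"))         # Yjs <![CDATA[zyzzyva]]>
```

Spaces are kept here for readability; nothing in this sketch depends on them.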

As noted before, Huffman coding is prefix-free, meaning that no encoding is a prefix of any other encoding. While this may obscure morphological and etymological relationships among words (as seen in the tables above), we often must clear away traditional ideological “clutter” to pave the way for important social progress.12 In practical terms, the prefix-free quality means that there is really no need for spaces between words, leading to further communicative efficiency. Punctuation, of course, is still meaningful in Huffmenglish, and so is preserved, as are standard typographic conventions such as italics, fonts, etc.
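To see why spaces become unnecessary, consider this sketch of a decoder that reads letters until they spell a known code, emits the word, and starts over. Because no code is a prefix of another, the first match is the only possible one. The dictionary is a small subset of the tables above, and the function name and token-list output are illustrative choices.

```python
# Code -> word, for a few entries copied from the tables in this article.
CODES = {
    "twvx": "trey", "tthm": "jones", "trnbg": "managing", "txfk": "editor",
    "of": "of", "aluidd": "speculative", "ygtir": "grammarian", "ad": "and",
}


def decode(stream):
    """Split a space-free Huffmenglish string into words and punctuation."""
    tokens, buf = [], ""
    for ch in stream:
        if not ch.isalpha():
            if buf:
                raise ValueError(f"unknown code fragment: {buf!r}")
            tokens.append(ch)  # punctuation is passed through unchanged
            continue
        buf += ch
        word = CODES.get(buf.lower())
        if word is not None:
            # A capitalized code stands for a capitalized word.
            tokens.append(word.capitalize() if buf[0].isupper() else word)
            buf = ""
    if buf:
        raise ValueError(f"unknown code fragment: {buf!r}")
    return tokens


print(decode("TwvxTthm,TrnbgTxfkofAluiddYgtir"))
# ['Trey', 'Jones', ',', 'Managing', 'Editor', 'of', 'Speculative', 'Grammarian']
```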

As a simple example, consider the following hypothetical article title with authorship information:

Huffmenglish, A Proposal for Lossless Entropy-Encoded Spelling Reform; Trey Jones, Managing Editor of Speculative Grammarian and Jonathan van der Meer, Editor-At-Large for Speculative Grammarian

It weighs in at a hefty 194 characters. The Huffmenglish version is a svelte 116 characters, a 40% reduction, and short enough to tweet!

Yjs,AzTcpxfoAlugzzAludycTriwqTmvmTcfm;TwvxTthm,TrnbgTxfkofAluiddYgtiradTthhusntrryvAluuaj,Txfk-aat-TtugfoAluiddYgtir
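The character counts and the reduction can be checked directly. The snippet below copies both strings from this article; the variable names are arbitrary.

```python
original = ("Huffmenglish, A Proposal for Lossless Entropy-Encoded Spelling Reform; "
            "Trey Jones, Managing Editor of Speculative Grammarian and "
            "Jonathan van der Meer, Editor-At-Large for Speculative Grammarian")
encoded = ("Yjs,AzTcpxfoAlugzzAludycTriwqTmvmTcfm;TwvxTthm,TrnbgTxfkofAluiddYgtir"
           "adTthhusntrryvAluuaj,Txfk-aat-TtugfoAluiddYgtir")

# Fraction of characters saved by the encoded form.
reduction = 1 - len(encoded) / len(original)
print(len(original), len(encoded), f"{reduction:.0%}")  # 194 116 40%
```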

Clearly, Huffmenglish represents a great advance in the theory and practice of English spelling reform!13


1 Not “you”, as in linguists, but financially savvy people in general can. See Gekko, Gordon (1987), “Greed is Good.” Journal of Wall Street.
2 See Sense, Common (2011), “Duh.” Obviousness Quarterly.
3 See Math, Complicated (2011), “Numbers Don’t Lie.” Annals of the Abstruse.
4 See Knowledge, General (2011), “RUFKM? WTF?” W@† U n33d 2 n0 ∂4iLy.
5 See Science, Computer (2011a), “Algorithms for Dummies.” Geek Gazette.
6 Meaning, “we made most of it up”. See Bubble, Tech (1995), “Buy e-Whatever-dot-com Right Now!” Internet Investment Digest.
7 “We ran”. See Speak, Leet (2011), “Engendering the Sense That That Which You Undertake is Consequential Through the Appropriate Utilization of Brobdingnagian Locutions.” Periodical of Presentation Processes.
8 “our program”. Ibid.
9 “on some words”. Ibid.
10 See Apparent, Readily (2011), “As Plain as the Nose on Your Face.” Journal of Just Look at the Data.
11 See Science, Computer (2011b), “More Algorithms, Still for Dummies.” The National Nerd Newsletter.
12 See Jet-Pack, Flying (1902), “The Future is Now!” The Chronicle of Futurism.
13 Issues of pronunciation are left as an exercise for the reader.
