Huffmenglish
A Proposal for Lossless Entropy-Encoded Spelling Reform
Twvx Tthm, Trnbg Txfk of Aluidd Ygtir
ad
Tthh usn trryv Aluuaj, Txfk-aat-Ttug fo Aluidd Ygtir
Some say time is money, but you can always get more money.1 You can’t make, buy, beg, borrow, or steal more time. More of our time is spent using language than on any other single activity.2 Heeding the wise counsel of H. Sanderson Chambers III and his “21st Century Proposal for English Spelling Reform”, we realize that the time is ripe for truly radical spelling reform. Our proposal will make English spelling significantly more difficult, cementing its place as the world language as Chambers suggests, while saving potentially billions of person-years of time spent communicating.3
The rise of texting as a communicative medium has conclusively shown that users of English are ready to embrace some form of textual compression in their everyday lives.4 Rather than the inconsistent, ad hoc shorthand used inconsistently by inconsistent texters, we propose a scientific text-compression scheme based on sound computational analysis and optimized for efficiency.
Huffman coding is a well-established entropy-encoding algorithm for lossless data compression.5 The result of Huffman coding is an optimally compressed encoding of the input symbols, with the advantageous feature that it is prefix-free, meaning that no encoding for a given symbol is a prefix (or initial substring) of the encoding of another symbol.
We have developed a proprietary6 variant of the algorithm which performs a 26-way split encoding while preserving some or all of the letters found in the highest frequency symbols (words) to be encoded. The use of a 26-way split allows the use of the standard, well-known Latin alphabet, and preserves the meaning of capitalization. The optimal nature of the encoding means that text need not be further compressed to speed transmission—the compression is already built into the encoding.
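Our 26-way variant is, as noted above, proprietary, but the underlying construction is easy to illustrate. The sketch below is a minimal binary Huffman coder in Python, run on a handful of made-up word frequencies; it is not the Huffmenglish encoder itself, merely the textbook algorithm on which our variant is loosely based.

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build a binary Huffman code (one bit-string per symbol) from a frequency map."""
    # Each heap entry: (total weight, tiebreaker, {symbol: partial codeword}).
    heap = [(weight, i, {symbol: ""}) for i, (symbol, weight) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # lightest subtree
        w2, _, right = heapq.heappop(heap)  # second-lightest subtree
        # Merging two subtrees prepends one more bit to every codeword below them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Toy frequencies, purely illustrative -- not the Huffmenglish corpus counts.
toy_counts = Counter({"you": 10, "i": 7, "the": 5, "that": 2, "zucchini": 1})
print(huffman_codes(toy_counts))
# Frequent words get short codewords: 'you' ends up 1 bit long, 'zucchini' 4 bits.
```

An n-ary variant merges the n lightest subtrees at each step (padding with dummy symbols so the counts come out even); the letter-preserving refinements we leave, naturally, to Version 2.0.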
We have applied7 this algorithm8 to a 30-million-word corpus of spoken and written conversational English, with over 41 thousand distinct symbols for encoding.9 The results for the most common English words are given below, in frequency order.
English | Huffmenglish
you | ao
i | z
to | to
the | te
a | az
and | ad
that | th
it | it
of | of
me | me
what | wh
is | is
in | in
this | ts
know | kn
i’m | im
for | fo
no | no
have | ha
my | my
don’t | dn
just | ju
not | nt
do | do
be | be
on | on
your | yu
was | ws
we | we
it’s | iz
with | wi
so | so
but | bu
all | aq
well | wl
are | ax
he | he
oh | oh
about | ab
right | ri
you’re | yo
get | ge
here | hz
out | ou
going | gi
like | li
yeah | ye
if | if
her | hr
she | sh
can | ca
up | tz
want | wa
think | ti
that’s | ta
now | nw
go | go
him | alm
at | aat
how | alw
As you can see,10 the algorithm has preserved some characteristic features of many of the most common words. The length of each word relative to its new encoding (the scheme we have dubbed Huffmenglish) is a good indicator of how well speakers of English have optimized their language over time. The first person singular pronoun should indeed be only one letter long. But allowing the indefinite article to be a single letter gave it too high a priority!
Of course, the compression scheme eventually has to make some difficult choices concerning priority, compression, and similarity to uncompressed English: only the final m of him and the final w of how are preserved. For less frequent terms, the Huffmenglish-encoded versions are quite similar in appearance to line noise, which is what one would expect from a properly compressed data stream.11
English | Huffmenglish
arab | tymrb
arabic | ukciz
ass | yss
asses | trkv
der | trryv
duh | txrh
editor | txfk
english | yjs
fairness | treyj
german | tnnp
germans | tnnf
grammarian | ygtir
hindi | alugnz
jonathan | tthh
jones | tthm
large | ttug
larger | ttue
managing | trnbg
mandarin | ukdiz
meer | aluuaj
portuguese | tyebd
speculate | trtnf
speculative | aluidd
tight | uig
tights | trout
trey | twvx
trout | trnvt
understand | ass
understands | twdq
van | usn
word | unb
worded | ukmde
words | unp
word’s | tymgv
yes | arc
zucchini | trofhi
Of note, the length of the encodings of the names of various languages indicates their relative importance to speakers of English. Not surprisingly, English is the most compressed, while Hindi is among the least—in fact, its encoded version is longer than its unencoded version, indicating that we have collectively been wasting the short form “Hindi” on an infrequently discussed language.
Conversely, longer words, whether frequent or rarely used, were all significantly compressed by the Huffmenglish encoding. In fact, no Huffmenglish word under the present encoding scheme is longer than 6 letters.
English | Huffmenglish
aaaaarrrrrrggghhh | uknqa
counterintelligence | ukmvg
misunderstandings | trtba
overcompensating | uklmo
uncharacteristically | aluhyw
underappreciated | alumup
One disclaimer we must make: our corpus was not exhaustive—as no corpus can be—and for reasons of computing power we further restricted our initial encoding to words that appeared more than five times in the corpus. Version 2.0 of the Huffmenglish encoding will include more infrequent words, along with other algorithmic improvements. In the meantime, we extend Huffmenglish by borrowing a natural device from XML encoding, the CDATA tag, which allows us to include words for which there is as yet no proper Huffmenglish encoding. So, the proper Huffmenglish 1.0 representation of Huffmenglish would thus be the natural form <![CDATA[Huffmenglish]]>. Of course, we expect Huffmenglish to rapidly overtake English as the dominant form of communication, and so we use the two interchangeably in Huffmenglish, encoding both as Yjs.
As noted before, Huffman coding is prefix-free, meaning that no encoding is a prefix of any other encoding. While this may obscure morphological and etymological relationships among words—as seen in the tables above—we often must clear away traditional ideological “clutter” to pave the way for important social progress.12 In practical terms, the prefix-free quality means that there is really no need for spaces between words, leading to further communicative efficiency. Punctuation, of course, is still meaningful in Huffmenglish, and so is preserved, as are standard typographic conventions such as italics, fonts, etc.
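To see why the spaces can go, consider decoding a run-together Huffmenglish string by greedy prefix matching: because no codeword is a prefix of another, at most one codeword can match at any given position. The sketch below is our own illustration and uses only five entries from the first table; a real decoder would need the full codebook of over 41 thousand symbols (plus CDATA handling for words not yet encoded).

```python
# Decoding a space-free Huffmenglish stream by greedy prefix matching.
# Only a handful of entries from the first table are included here.
CODEBOOK = {"z": "i", "wa": "want", "to": "to", "go": "go", "nw": "now"}
MAX_LEN = max(len(code) for code in CODEBOOK)

def decode(stream):
    """Scan left to right; since no codeword is a prefix of another,
    at most one codeword can match at each position."""
    words, i = [], 0
    while i < len(stream):
        for length in range(1, MAX_LEN + 1):
            chunk = stream[i:i + length]
            if chunk in CODEBOOK:
                words.append(CODEBOOK[chunk])
                i += length
                break
        else:
            raise ValueError(f"no codeword matches at position {i}")
    return " ".join(words)

print(decode("zwatogonw"))  # -> i want to go now
```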
As a simple example, consider the following hypothetical article title with authorship information:
Huffmenglish, A Proposal for Lossless Entropy-Encoded Spelling Reform; Trey Jones, Managing Editor of Speculative Grammarian and Jonathan van der Meer, Editor-At-Large for Speculative Grammarian
It weighs in at a hefty 194 characters. The Huffmenglish version is a svelte 116 characters—a 40% reduction, and short enough to tweet!
Yjs,AzTcpxfoAlugzzAludycTriwqTmvmTcfm;TwvxTthm,TrnbgTxfkofAluiddYgtiradTthhusntrryvAluuaj,Txfk-aat-TtugfoAluiddYgtir
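The arithmetic is easy to verify; the snippet below (a checking aid, nothing more) simply measures the two strings.

```python
plain = ("Huffmenglish, A Proposal for Lossless Entropy-Encoded Spelling Reform; "
         "Trey Jones, Managing Editor of Speculative Grammarian and Jonathan van "
         "der Meer, Editor-At-Large for Speculative Grammarian")
huff = ("Yjs,AzTcpxfoAlugzzAludycTriwqTmvmTcfm;TwvxTthm,TrnbgTxfk"
        "ofAluiddYgtiradTthhusntrryvAluuaj,Txfk-aat-TtugfoAluiddYgtir")
print(len(plain), len(huff), round(100 * (1 - len(huff) / len(plain))))  # 194 116 40
```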
Clearly, Huffmenglish represents a great advance in the theory and practice of English spelling reform!13
1 Not “you”, as in linguists, but financially savvy people in general can. See Gekko, Gordon (1987), “Greed is Good.” Journal of Wall Street.
2 See Sense, Common (2011), “Duh.” Obviousness Quarterly.
3 See Math, Complicated (2011), “Numbers Don’t Lie.” Annals of the Abstruse.
4 See Knowledge, General (2011), “RUFKM? WTF?” W@† U n33d 2 n0 ∂4iLy.
5 See Science, Computer (2011a), “Algorithms for Dummies.” Geek Gazette.
6 Meaning, “we made most of it up”. See Bubble, Tech (1995), “Buy e-Whatever-dot-com Right Now!” Internet Investment Digest.
7 “We ran”. See Speak, Leet (2011), “Engendering the Sense That That Which You Undertake is Consequential Through the Appropriate Utilization of Brobdingnagian Locutions.” Periodical of Presentation Processes.
8 “our program”. Ibid.
9 “on some words”. Ibid.
10 See Apparent, Readily (2011), “As Plain as the Nose on Your Face.” Journal of Just Look at the Data.
11 See Science, Computer (2011b), “More Algorithms, Still for Dummies.” The National Nerd Newsletter.
12 See Jet-Pack, Flying (1902), “The Future is Now!” The Chronicle of Futurism.
13 Issues of pronunciation are left as an exercise for the reader.