The Devil’s Dictionary of Linguistics and Phonetics—David Krystal—Compiled by Adam Baker SpecGram Vol CLXXV, No 1 Contents Cryptolinguistic Puzzle—Mary Shapiro

French Love, Poodles and Google Translate:
A New Methodology to Build Language Families

Isabelle Tellier

There are many legends about Machine Translation. One of the most famous holds that in its early days in the fifties, when for obvious Cold War reasons the field focused on English-Russian translation, a machine given the (biblical) sentence “The spirit is willing but the flesh is weak” was asked to translate it into Russian and then back into English, and produced “The vodka is strong but the meat is rotten”. A similar legend concerns the sentence “Out of sight, out of mind” which, translated into Chinese (or Japanese) and then back into English, supposedly became “Invisible idiot”...

Of course, these stories are urban legends¹ and machine translation systems perform much better now. They no longer rely on manually written (i.e. full of mistakes) sets of rules but on statistics (i.e. hard science) computed on large aligned corpora. To confirm this obvious scientific progress, we conducted some very serious experiments with Google Translate (in May 2015, at least with Google Translate’s French version, as the experimenter lives in France). The website was given a French sentence and asked to translate it into each of the other 89 available languages. Each translation was then copied and pasted by hand into the website and translated back into French. In the following, we carefully analyze the results and show that they suggest the creation of radically new language families.
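For readers who would rather not copy and paste 89 times, the round-trip procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the experiment itself was done by hand on the website, and the names `round_trip`, `Translator` and `toy_translate` are hypothetical. In particular, `toy_translate` is a stand-in that merely tags the text; it does not call Google Translate or any real translation service.

```python
from typing import Callable, Dict, Iterable

# Hypothetical signature: translate(text, source_lang, target_lang) -> str.
Translator = Callable[[str, str, str], str]


def round_trip(sentence: str, pivots: Iterable[str], translate: Translator,
               source: str = "fr") -> Dict[str, str]:
    """Translate `sentence` into each pivot language and back into `source`."""
    results = {}
    for lang in pivots:
        forward = translate(sentence, source, lang)       # French -> pivot
        results[lang] = translate(forward, lang, source)  # pivot -> French
    return results


def toy_translate(text: str, src: str, dst: str) -> str:
    """Stand-in for demonstration only: tags the text instead of translating it."""
    return f"[{src}->{dst}] {text}"


if __name__ == "__main__":
    sentence = "l'amour, c'est l'infini mis à la portée des caniches"
    for lang, back in round_trip(sentence, ["en", "pl"], toy_translate).items():
        print(lang, back)
```

Passing the translator in as a function keeps the sketch independent of any particular service; a real client would have to be supplied by the reader.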

Translation Results

The sentence initially chosen was: “l’amour, c’est l’infini mis à la portée des caniches” which, in English, means “love is the infinite within the reach of poodles”. This (rather cynical) definition of love is a famous quote from Voyage au bout de la nuit (Journey to the End of the Night, 1932) by the French writer Louis-Ferdinand Céline.² The 89 round-trip translations (French → language → French) gave the following results (results are merged when no more than 5% of edit operations distinguish one sentence from another, and the order in which the languages are presented is explained below):

To fully understand these sentences if you haven’t mastered French, you can of course use Google Translate, but it comes without any guarantee. Otherwise, you mainly have to know that, while “à la portée de” in French means “within the reach of”, “une portée de” means “a litter/brood of”, which can unfortunately also be quite appropriate for poodles. So, in fact, many of these sentences mean something like “love is an infinite litter of poodles”. This is not the case, however, for the Yoruba translation, which states, on the contrary, “love isn’t limited to the litter of poodles”! Other shades of meaning are expressed by some of these sentences. For example, the translation from Kazakh claims that “infinite love is within the reach of poodles” whereas, in the Irish version, it is “the limit of love” that has this property. For the Persian one, “Friend is infinitely available poodles” while Polish people seem to believe that “love is infinite in the box”. At least, Google Translate most of the time remarkably captured the cynical flavor of the original sentence. We note some other interesting facts:

Language Families

The previous experiment was just a preliminary. It suggested unexpected similarities between languages that had to be investigated further.

Let us first note that the 89 languages other than French available in Google Translate are not representative of all human languages. Figure 1 shows how they are distributed among the “classical” language families. The Indo-European family is over-represented among them.


Figure 1: Language Families Available in Google Translate

The question that naturally arose from the initial results was whether languages belonging to the same family tend to lead to similar final translations of our sentence when used as an intermediary. To test this hypothesis, we applied hierarchical clustering based on edit distance to our 89 final sentences and compared the results with the reference clustering based on “classical” language families. The confusion matrix obtained when 15 clusters (the number of represented families) were required is given in Figure 2. In this matrix, “real” family assignments are in rows while discovered clusters are in columns. The entries of this matrix sum to 89; all languages are represented in it.
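The clustering step can be sketched as follows, under loudly stated assumptions. The article does not say which linkage criterion or which unit of edit distance (characters or words) was used; the sketch below assumes character-level Levenshtein distance and average linkage, and the function names are hypothetical. The demonstration input consists of English glosses quoted in this article, not the actual French back-translations.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def cluster(sentences: dict, k: int) -> list:
    """Agglomerative clustering down to k >= 1 clusters.

    Average linkage is an assumption; the article does not name the criterion.
    """
    clusters = [[name] for name in sentences]
    dist = {frozenset(pair): edit_distance(sentences[pair[0]], sentences[pair[1]])
            for pair in combinations(sentences, 2)}

    def linkage(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > k:
        # Merge the two closest clusters (i < j, so deleting j is safe).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    glosses = {  # English glosses from the article, used only as toy input
        "Original": "love is the infinite within the reach of poodles",
        "Kazakh": "infinite love is within the reach of poodles",
        "Polish": "love is infinite in the box",
        "Persian": "Friend is infinitely available poodles",
    }
    print(cluster(glosses, 2))
```

Recording the order of the merges performed in the `while` loop would also give the raw material for a family tree.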

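Families ↓ vs. Clusters → | Indo-European | Afro-Asiatic | No Class | No Class | Altaic | Austroasiatic | Austronesian
Afro-Asiatic | 4 | 1 | 0 | 0 | 0 | 0 | 0
Altaic | 4 | 0 | 1 | 1 | 1 | 0 | 0
Austroasiatic | 1 | 0 | 0 | 0 | 0 | 1 | 0
Austronesian | 7 | 0 | 0 | 0 | 0 | 0 | 1
Caucasian | 1 | 0 | 0 | 0 | 0 | 0 | 0
Constructed | 1 | 0 | 0 | 0 | 0 | 0 | 0
Creole / Pidgin | 1 | 0 | 0 | 0 | 0 | 0 | 0
Dravidian | 3 | 0 | 0 | 0 | 0 | 0 | 0
Hmong-Mien | 1 | 0 | 0 | 0 | 0 | 0 | 0
Indo-European | 41 | 0 | 1 | 0 | 0 | 0 | 0
Isolate | 1 | 0 | 0 | 0 | 0 | 0 | 0
Niger-Congo | 4 | 0 | 0 | 0 | 0 | 0 | 0
Sino-Tibetan | 1 | 0 | 0 | 0 | 0 | 0 | 0
Tai-Kadai | 2 | 0 | 0 | 0 | 0 | 0 | 0
Uralic | 2 | 0 | 0 | 0 | 0 | 0 | 0

Figure 2: Confusion Matrix for a Hierarchical Clustering Based on Edit Distance, Part 1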


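Families ↓ vs. Clusters → | Dravidian | No Class | No Class | No Class | No Class | Niger-Congo | Sino-Tibetan | Uralic
Afro-Asiatic | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Altaic | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Austroasiatic | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Austronesian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Caucasian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Constructed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Creole / Pidgin | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Dravidian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Hmong-Mien | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Indo-European | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0
Isolate | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Niger-Congo | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0
Sino-Tibetan | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
Tai-Kadai | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Uralic | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1

Figure 2: Confusion Matrix for a Hierarchical Clustering Based on Edit Distance, Part 2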

This result is not very convincing. Among the 89 sentences, only 49 are correctly “clustered”, i.e. grouped together as expected. In fact, 74 have been assigned to the same cluster (the first cluster), among which 41 come from Indo-European languages. All other clusters contain only 1 language (except one with 2), and should thus be considered “isolates”.
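These figures can be checked directly against the two parts of Figure 2. The short sketch below transcribes the matrix in sparse form (family, then cluster column index and count, with the seven columns of Part 1 numbered 0 to 6 and the eight columns of Part 2 numbered 7 to 14) and recomputes the totals. The variable names are ours, and counting as “correct” the cell where a family meets the cluster bearing its name is our reading of the text.

```python
# Cluster labels in column order: Part 1 (columns 0-6), then Part 2 (columns 7-14).
CLUSTERS = ["Indo-European", "Afro-Asiatic", "No Class", "No Class", "Altaic",
            "Austroasiatic", "Austronesian", "Dravidian", "No Class", "No Class",
            "No Class", "No Class", "Niger-Congo", "Sino-Tibetan", "Uralic"]

# Non-zero cells of Figure 2: family -> {cluster column index: number of languages}.
MATRIX = {
    "Afro-Asiatic": {0: 4, 1: 1},
    "Altaic": {0: 4, 2: 1, 3: 1, 4: 1},
    "Austroasiatic": {0: 1, 5: 1},
    "Austronesian": {0: 7, 6: 1},
    "Caucasian": {0: 1},
    "Constructed": {0: 1},
    "Creole / Pidgin": {0: 1},
    "Dravidian": {0: 3, 7: 1},
    "Hmong-Mien": {0: 1},
    "Indo-European": {0: 41, 2: 1, 8: 1, 9: 1, 10: 1},
    "Isolate": {0: 1},
    "Niger-Congo": {0: 4, 11: 1, 12: 1},
    "Sino-Tibetan": {0: 1, 13: 1},
    "Tai-Kadai": {0: 2},
    "Uralic": {0: 2, 14: 1},
}

# Total number of languages in the matrix.
total = sum(sum(cells.values()) for cells in MATRIX.values())

# "Correct" languages: those in the cluster labelled with their own family.
correct = sum(MATRIX[label].get(j, 0)
              for j, label in enumerate(CLUSTERS) if label in MATRIX)

# Size of each discovered cluster (column sums).
sizes = [sum(cells.get(j, 0) for cells in MATRIX.values())
         for j in range(len(CLUSTERS))]

print(total, correct, sizes[0])  # 89 49 74
print(sorted(sizes[1:]))         # thirteen clusters of size 1 and one of size 2
```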

To get a better view of the situation, we also produced a “dendrogram” (or family tree) based on the edit distance between every pair of sentences. It is shown in Figure 3. In this figure:

The order in which the languages are displayed in this family tree is the one that was used to introduce our initial translations. It is now possible to better analyze the language families it suggests.


Figure 3: Family Tree of Languages Based on Edit Distance

So, in the end, we have defined eight new and quite well-balanced language families which, together with two isolates, cover 90 languages. Our methodology used only strictly scientific techniques: statistical machine translation and hierarchical clustering. Our results question the validity of all previous work in this domain and open a new area in linguistic studies.



1 Hutchins, John (1995): “‘The whisky was invisible’, or Persistent myths of MT,” MT News International, pp. 17-18.

2 Céline, Louis-Ferdinand (1932): Voyage au bout de la nuit, Gallimard, Paris.
