Normalized Independent Basis-Vector–Based Languages
and the Dream of Just Universal Enough Languages
A Reply to Nsiwander-Sic

Jonathan van der Meer
Center for Computational Bioinformatics and Linguistics
I. Juana Pelota-Grande
Centre den Geometrik Linguistiken

In a recent article, Nsiwander-Sic proposed a fresh approach to the problem of endangered languages, which focuses on Small Subset Diversity Conservation (SSDC) and the notion of the Necessary Linguistic Subset (NLS) (Nsiwander-Sic 2013).

Nsiwander-Sic proposes two methods for achieving this aim:

At least two feasible, reasonable approaches to selecting an appropriate subset of the world’s languages are evident.

Method 1: We might simply survey linguists directly, and additionally perform a meta-analysis of their publications, to see which languages are commonly used to illustrate important linguistic properties and differences. The remaining languages may be judged effectively redundant.

Method 2: Somewhat more formally, we might instead perform a study constructing a ‘DRE differences matrix’ in which all the important noted parameter values of the world’s languages are encoded. This matrix could then be analyzed to determine the minimum subset of languages necessary to conserve important linguistic diversity.

We support the second, formal and data-driven approach over the more touchy-feely survey of the first method. (Cf. Phlogiston 2007.)

In fact, we feel that Nsiwander-Sic doesn’t really go far enough with this approach. Such a matrix could be used to find a minimal basis set of language vectors, in terms of which all other languages could be represented. For example, no one would be surprised to discover Romanian was best represented as “0.72 It + 0.06 Fr - 0.01 Sp + 0.07 Ru + 0.04 Cz - 0.02 Pol + 0.001 Basq”.
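To make the arithmetic concrete, here is a minimal sketch of how such a weighted representation could be evaluated. The weights are the ones quoted above for Romanian; the three-feature toy vectors (and the names `TOY_FEATURES` and `combine`) are purely hypothetical, invented for illustration and not drawn from any real differences matrix.

```python
# Weights quoted in the text for Romanian.
ROMANIAN_WEIGHTS = {
    "It": 0.72, "Fr": 0.06, "Sp": -0.01, "Ru": 0.07,
    "Cz": 0.04, "Pol": -0.02, "Basq": 0.001,
}

# Hypothetical toy feature vectors, invented for illustration only:
# [has_articles, has_case_marking, is_ergative]
TOY_FEATURES = {
    "It":   [1.0, 0.0, 0.0],
    "Fr":   [1.0, 0.0, 0.0],
    "Sp":   [1.0, 0.0, 0.0],
    "Ru":   [0.0, 1.0, 0.0],
    "Cz":   [0.0, 1.0, 0.0],
    "Pol":  [0.0, 1.0, 0.0],
    "Basq": [1.0, 1.0, 1.0],
}


def combine(weights, features):
    """Return the weighted sum of the feature vectors named in `weights`."""
    dim = len(next(iter(features.values())))
    out = [0.0] * dim
    for lang, w in weights.items():
        for i, x in enumerate(features[lang]):
            out[i] += w * x
    return out


romanian = combine(ROMANIAN_WEIGHTS, TOY_FEATURES)
print(romanian)
```

Under these toy assumptions, the resulting vector is dominated by the Italian contribution, with only a trace of ergativity courtesy of Basque.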

Of course, the most perspicacious representation would be one step further abstracted, and derived from a factor analysis of the language vectors, allowing researchers to find a set of normalized independent basis vectors of unit length, spanning the sum total of language space. Cf. earlier work on OBV-IUS (Pelota-Grande 2005) and VIS-UAL (van der Meer 2005) linguistic-topologic typologies.
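For readers who prefer to see the normalization spelled out, the following is a rough sketch of extracting unit-length, mutually independent basis vectors from a set of language vectors. It uses Gram-Schmidt orthonormalization as a simpler stand-in for a proper factor analysis, and the toy language vectors are hypothetical, chosen only so that one of them is visibly redundant.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, tol=1e-9):
    """Return an orthonormal basis spanning `vectors`.

    Vectors already spanned by the basis built so far are dropped,
    which is the sense in which a language counts as redundant here.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = dot(w, b)  # component of w along b
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


# Hypothetical toy language vectors, for illustration only.
toy_languages = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 1.0],  # sum of the first two, hence redundant
]

basis = gram_schmidt(toy_languages)
print(len(basis), "basis vectors from", len(toy_languages), "languages")
```

In this toy case the third language contributes nothing new, so two basis vectors suffice for three languages; a real analysis would of course depend on the matrix actually constructed.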

Conlangers could finally be put to good use (cf. CA 2009) in service of real linguistics by constructing composite languages, built from the relevant component languages, with the right mix of relevant linguistic features.

For the small number of readers with insufficient linear algebra to follow the argument, the idea would be to construct a small number of languages (Nsiwander-Sic’s NLS) that not only provide all the necessary linguistic diversity for linguistic examples (Nsiwander-Sic’s SSDC), but do so in proper proportion, and in a completely non-overlapping way (Nsiwander-Sic’s Non-Redundant Linguistic Diversity, or NRLD, done right).

One of the new basis vector languages, for example, might be composed of “0.43 Basque + 0.22 Dyirbal - 0.21 English - 0.13 Pirahã + 0.02 Hindi + 0.01 Mandarin”. An obvious naming scheme (using weighted sorting and polarity-appropriate orthographic inversion) suggests we should call this language something euphonious like Bas-Dyir-eNG-pIR-Hin-Man. We would expect the language to be just ergative enough, to have just enough noun classes, to be not too SVO, to have not too much allophonic variation, to have a just interesting enough writing system, and to have just enough tone, to be useful to linguists.
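The naming scheme can be sketched in a few lines. We assume here that "weighted sorting" means ordering components by the absolute value of their weights, and that "orthographic inversion" means swapping the letter case of negatively weighted components; the abbreviations and the function name `basis_language_name` are our own hypothetical choices.

```python
def basis_language_name(components):
    """Build a name from (abbreviation, weight) pairs.

    Assumed scheme: sort by absolute weight, largest first, and
    swap the letter case of any component with a negative weight.
    """
    ordered = sorted(components, key=lambda c: abs(c[1]), reverse=True)
    parts = [abbr.swapcase() if weight < 0 else abbr
             for abbr, weight in ordered]
    return "-".join(parts)


# Weights from the example in the text, deliberately given out of order.
name = basis_language_name([
    ("Hin", 0.02),
    ("Eng", -0.21),
    ("Bas", 0.43),
    ("Man", 0.01),
    ("Pir", -0.13),
    ("Dyir", 0.22),
])
print(name)  # Bas-Dyir-eNG-pIR-Hin-Man
```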

Why should the Basque learn Finnish, as Nsiwander-Sic suggests, when they could learn a much more useful (to linguists) language like Bas-Dyir-eNG-pIR-Hin-Man? Our approach also obviates the need for any discussion of whether the Basque should learn Finnish or the Finnish should learn Basque, a discussion which inevitably leads to hegemonic oppression based on mere numerical superiority, a rookie mistake Nsiwander-Sic should have sidestepped.

Nsiwander-Sic’s proposal is a good first approximation but, as we noted earlier, it does not go far enough. Our extension to a basis-vector–based approach goes just far enough.


