Regular Isomorphisms of Categorization in the Apathetic Informant

by Angus Æ. Balderdash, Esq.

Unfortunately, it is often the case that when working with data sets containing particularly uncommon kinds of data, the number of qualified consultants available to provide native-speaker interpretations of the data is quite low. In such cases, it is often necessary to work with consultants who have one or more sub-optimal characteristics: poor work ethic, lack of attention to detail, weak fashion sense, surly attitude, inclination toward insubordination, poor personal hygiene, difficulty following instructions--the list is all but endless.

In some such cases, characteristics such as lack of attention to detail, difficulty following instructions, and refusal to floss can lead to fairly low-quality native-speaker intuitions. Of course, many of the tasks involved in such interpretation and analysis can be quite repetitive and tedious, and so a certain number of mistakes in judgment tasks are inevitable.

However, when your informant simply doesn't care, has no financial or professional stake in the project, and is providing their services only as a favor to a friend of a friend of a friend, the quality can reach a surprising nadir.

But not to despair! Native-speaker intuitions, that ill-defined but highly valued commodity of linguists everywhere, can be surprisingly flexible, and may provide insight even when they fail. Competence that comes from only a vague familiarity with certain subject matter may still be valuable, though quirky.

As a case study in this art of meta-linguistic interpretation, consider the story of Roscoe P., an informant brought in on a large project involving the ethnolinguistic categorization of surnames from the world over as training data for a complex computational linguistics machine learning task.

Roscoe's area of expertise was South Asian names, and he was given a large data set to review and categorize. The data set had been crudely pre-sorted by a prototype statistical categorization tool, and names were assigned to reviewers based on the ethnolinguistic groups most likely to be found in each of the pre-sorted buckets. Roscoe was instructed, like all reviewers, to give a finely detailed categorization of names that fell into his area of expertise, but also to give rough categorizations, when possible, for names that fell outside his sphere of expert knowledge, in order to facilitate re-assignment to other subject matter experts to fully classify the data.

Roscoe was one of the most difficult consultants to work with on this project. He completely lacked attention to detail, once providing a spreadsheet in which the category for each surname was assigned to the preceding name in the list. Of course, once the error was discovered it was fairly easy to repair the corrupted spreadsheet, and the data was quite usable. Fortunately, the data had not been imported into our master database before the problem was detected and corrected. Indeed, Roscoe's work was routinely given four to five times the level of meta-review that the other experts received.
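For readers facing a similarly shifted spreadsheet, the repair can be sketched roughly as follows. This is a minimal illustration and not the project's actual tooling: the function name `realign` and the sample rows are hypothetical, and it assumes the only damage is a uniform one-row shift, so that each row holds the category meant for the name below it.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, Optional[str]]

def realign(rows: List[Row]) -> List[Row]:
    """Move every category down one row so it sits beside its own surname.

    Assumption: each category was entered one row too high, so row i holds
    the category intended for row i + 1.
    """
    names = [name for name, _ in rows]
    labels = [label for _, label in rows]
    # The first surname's category was written above the first data row and
    # cannot be recovered from the data itself, so it is left as None.
    shifted = [None] + labels[:-1]
    return list(zip(names, shifted))

# Hypothetical misaligned sample: each label belongs to the next name down.
misaligned = [("Müller", "Aztec"), ("Ivanova", "Arawak"), ("Spot", None)]
repaired = realign(misaligned)
```

Under that assumption the first surname in the list has to be re-judged by hand, since its category never landed on any data row.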

Other undesirable traits, including glacial review speed and failure to follow precise instructions, led to many difficulties. However, as Roscoe was the project leader's husband's boss's boss's sister's son, he clearly could not be replaced without adversely impacting the project leader's ability to make timely mortgage payments.

One of Roscoe's most frustrating and potentially damaging traits was his inability to judge the level of his own expertise. For example, in addition to being moderately expert in South Asian names, Roscoe was sure that he was almost equally expert in Baltic names and languages, and also names and languages indigenous to Latin America (though oddly not Hispanic names or the Spanish language--a fact that tipped us off to his true lack of expertise).

Before we were fully aware of Roscoe's inaccurate self-assessment, he had unfortunately been given explicit permission to assign moderately detailed categories to Baltic and Native American name groups, as well as the default instructions to assign rough categories for other less familiar ethnolinguistic groups. After the full realization of his self-deception sank in, the team largely felt compelled to simply dispose of his category judgments, and begin anew with a new consultant.

The misaligned spreadsheet provided a kind of foreshadowing of Roscoe's ability to provide completely incorrect but morbidly regular judgments. By a stroke of good luck, a linguist on the project noticed not only Roscoe's errors, but also some surprising regularities in his mistakes.

It turns out that the ethnicities Roscoe assigned the surnames had a fairly predictable though bizarre semantics:

Roscoe's Category    True Category
-----------------    -------------
Jewish               German
Ukrainian            Hawai'ian
Latvian              English
Lithuanian           Polish
Estonian             Can't be bothered to look it up
Hispanic             Ends in -o
Aztec                Russian
Mayan                Slovenian
Tupi                 Reviewed after lunch during food coma
Guarani              Incan
Incan                Kzinti
Yanomami             Anagram of Anagram
Carib                Either Greek or Nigerian
Tarahumara           Both Estonian and Cantonese
Zapotec              Contains an x
Arawak               Makes a good name for a dog
Nahuatl              Gothic
Quechua              Is pronounceable when spelled backwards
French               Sino-Kiowan
Elbonian             Name of a comic strip character
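
The correspondence above lends itself to a simple lookup table. The sketch below is only an illustration: the names `ROSCOE_TO_TRUE` and `relabel` are hypothetical, and the dictionary covers just a subset of the rows, namely some of those whose true category is itself a plain group label.

```python
# Hypothetical lookup built from a subset of the table's rows.
ROSCOE_TO_TRUE = {
    "Jewish": "German",
    "Ukrainian": "Hawai'ian",
    "Latvian": "English",
    "Lithuanian": "Polish",
    "Aztec": "Russian",
    "Mayan": "Slovenian",
    "Guarani": "Incan",
    "Nahuatl": "Gothic",
}

def relabel(roscoe_label):
    """Return the true category for one of Roscoe's labels.

    Returns None when the label is not in the lookup. The mapping is applied
    exactly once: "Guarani" becomes "Incan", and that result must not be fed
    back through the table, where "Incan" is itself one of Roscoe's labels.
    """
    return ROSCOE_TO_TRUE.get(roscoe_label)

# Hypothetical judgments as (surname, Roscoe's label) pairs.
judgments = [("Müller", "Jewish"), ("Ivanova", "Aztec"), ("Spot", "Arawak")]
relabelled = [(name, relabel(label)) for name, label in judgments]
```

Names that come back with `None` are the ones whose labels need a human decision rather than a mechanical substitution.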

Obviously, some of these categories are more useful to the project than others. For example, we sent "Jewish" names (Müller, Weiß, Schröder) to our German expert, "Aztec" names (Gorbechev, Ivanova, Bogochevsky) to our Soviet Union expert, and used "Arawak" names (Spike, Spot, Lucky) to choose a name for the project mascot.

Other groups like "Hispanic" (Apollo, Tokyo, Ho), "Zapotec" (Alexander, Xochitl, Apotaxis), and "Yanomami" (Ramanag, Garaman, Arangam) were essentially useless, but identifying them helped sort the wheat from the chaff.
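
The surface-level regularities behind these less useful groups are easy to check mechanically. The following is a rough sketch using the example names quoted above; the function and variable names are hypothetical, and matching is assumed to be case-insensitive.

```python
def ends_in_o(name):
    """True if the name ends in the letter o."""
    return name.lower().endswith("o")

def contains_x(name):
    """True if the name contains the letter x anywhere."""
    return "x" in name.lower()

def is_anagram_of_anagram(name):
    """True if the name uses exactly the letters of the word 'anagram'."""
    return sorted(name.lower()) == sorted("anagram")

# Each of Roscoe's labels paired with the surface test it appears to encode.
SURFACE_TESTS = {
    "Hispanic": ends_in_o,
    "Zapotec": contains_x,
    "Yanomami": is_anagram_of_anagram,
}

# The example names quoted in the text for each group.
EXAMPLES = {
    "Hispanic": ["Apollo", "Tokyo", "Ho"],
    "Zapotec": ["Alexander", "Xochitl", "Apotaxis"],
    "Yanomami": ["Ramanag", "Garaman", "Arangam"],
}

def group_is_consistent(label):
    """Check that every example name for a label passes that label's test."""
    test = SURFACE_TESTS[label]
    return all(test(name) for name in EXAMPLES[label])

consistent = {label: group_is_consistent(label) for label in SURFACE_TESTS}
```

A check of this kind confirms only that a group is internally regular; it says nothing about whether the group is of any ethnolinguistic use.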

In the end, a careful meta-analysis of Roscoe's mediocre linguistic judgments has borne considerable fruit for this project. Not only were we able to turn crappy "native" judgments into vaguely valuable data, we were able to turn a crappy "native" speaker into a valuable research subject.

Roscoe's unique talent for self-deception in the area of his own competence, coupled with his incredible arrogance, makes him an informant like few others. While complex questions, fuzzy concepts, and intricate data cause other, more reflective consultants to throw up their hands and admit that not all data can be analyzed, categorized, or judged for grammaticality, Roscoe is always certain of his answer, despite the fact that he is so often incorrect.

Roscoe's responses, unfiltered by self-doubt or contemplation, provide a special view into his mental processes. Because of our meta-analysis of Roscoe's categorization errors, we have determined that his innate pattern-matching ability is quite well developed: the groups he defines are quite meaningful, just very poorly labelled. Since Roscoe is unremarkable in any other mental attributes, we naturally assume this ability is common to all humans.

Our analysis to date shows that most individuals probably have the ability to build complex, accurate on-the-fly categorization procedures, based on incidental data they have encountered. In most cases, self-doubt censors the use of such categorization procedures, though they can be quite good at creating meaningful groupings. Rather than self-censor, or in Roscoe's case, completely fail to self-censor, a more useful, balanced approach may be to identify intuitive groupings and formally research their constituency, thereby increasing one's "native" competence.

Finally, we hypothesize that such intuitive, informal, haphazardly constructed processes extend beyond categorization procedures, and include all aspects of linguistic processing. In order to test this supposition, we have begun teaching Roscoe Swahili, more or less against his will. His arrogant, incorrect self-assessment is that he has a real "knack for languages" and that Swahili offers little challenge.

His progress is quite slow, though he thinks he is rapidly improving. His semi-coherent phonology and stunted grasp of syntax are shedding valuable light onto the L2 acquisition process, revealing semi-complete internal models that more self-aware language learners would never dare to use to produce utterances that others might hear.

The moral of the story of the Apathetic Informant is this: If life gives you a lemon-headed informant, put his head in a meta-analytic vise and squeeze it until it gives useful juice, which you may use to secure several large lemonade-flavored research grants.

