Generative Speech Recognition:
A competence model of ASR
by Stanislaus Gorky
10 PRINT "Hello, world!"
20 END
— BASIC
Speech recognition has long posed a problem for speech scientists,
phoneticians, and commercial speech researchers. In this paper I offer
observations on new developments in the generative program, which has
much to contribute to speech recognition. My investigation of this
matter is based on the foundational remarks on ASR made by Chomsky &
Ladefoged (1968), to wit:
The development of a speech recognition system SRS has fundamentally
to do with the study of any language L spoken by a human H at time T.
(1968: 16)
Here Chomsky clearly anticipates the link between the generative
program and speech recognition technology that has been recently
confirmed by speech researchers. Fundamental to Chomsky’s thought on
this matter, however, is the distinction between competence and
performance in speech recognition.
“Recognizing a word W involves any number of complex factors C1...Ci:
interspeaker variability, speech rate, regional differences in accent,
memory capacity, the speaker’s state of mind, and many other factors
related to performance. Crucially, none of these factors has anything
to do with the speaker’s intention IS, i.e. the expression of the
speaker’s thought TS. As such, all performance factors should be
excluded from consideration until such time as competence has been
studied thoroughly.” (1968: 345, fn. 2)
Following recent developments in the minimalist program, we recognize
that every successful word recognition involves the pairing of a sound
image I with a lexical meaning L. Well-formed <I, L> pairs are said to
converge in the derivation, while ill-formed sound-meaning pairs crash.
Strong acoustic features must be checked before Spell-Out, while weak
acoustic features may be checked after Spell-Out, before the lexical
form (LF), of course.
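To make the derivational mechanics concrete, the following is a minimal
sketch of convergence checking under the assumptions just stated; the
names AcousticFeature, Strength, and converges are my own illustration
and appear nowhere in the generative literature.

from dataclasses import dataclass
from enum import Enum

class Strength(Enum):
    STRONG = "strong"   # must be checked before Spell-Out
    WEAK = "weak"       # may be checked after Spell-Out, before LF

@dataclass
class AcousticFeature:
    name: str
    strength: Strength
    checked_before_spell_out: bool

def converges(sound_image, lexical_meaning):
    """A <I, L> pair converges iff every strong acoustic feature in the
    sound image I was checked before Spell-Out; otherwise it crashes."""
    return all(f.checked_before_spell_out
               for f in sound_image
               if f.strength is Strength.STRONG)

# A strong feature left unchecked at Spell-Out crashes the derivation.
I = [AcousticFeature("onset burst", Strength.STRONG, False),
     AcousticFeature("V2 quality", Strength.WEAK, False)]
print("converges" if converges(I, "hello") else "crashes")  # crashes
— Python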
The distinction between strong and weak acoustic features accounts
parametrically for typological variation in speech recognition. The weak
vowel features of English are not checked early in the derivation,
leading to the perception of schwa in weak positions. Contrast this with
Spanish, where vowel features must be checked before Spell-Out,
regardless of their prosodic status. Given this typological result,
it is not unreasonable to expect the minimalist program to provide
definitive answers to all cases of vowel reduction, at all times,
and in all places (with some idealization of the data). These minimalist
insights into vowel reduction can be applied directly to work in ASR,
as in the sketch below.
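By way of illustration only, here is a hypothetical rendering of the
vowel-reduction parameter; the table CHECK_WEAK_EARLY and the function
perceive_vowel are assumptions of mine, not part of any published
analysis.

# Per-language parameter: are weak vowel features checked before
# Spell-Out?
CHECK_WEAK_EARLY = {"English": False, "Spanish": True}

def perceive_vowel(vowel, prosodically_weak, language):
    """If weak vowel features go unchecked before Spell-Out, a vowel in
    a weak position is perceived as schwa; otherwise the full vowel
    survives regardless of prosodic status."""
    if prosodically_weak and not CHECK_WEAK_EARLY[language]:
        return "ə"   # schwa in weak positions (English-type setting)
    return vowel     # full vowel (Spanish-type setting)

print(perceive_vowel("o", True, "English"))  # ə
print(perceive_vowel("o", True, "Spanish"))  # o
— Python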
The model was tested against the benchmark TIMIT corpus. Several
colleagues and I first observed the well-formed <I, L> pairs in the
training data, and then intuited the response of the recognition
grammar to the test data. Preliminary results indicate a robust model.
My intuitions indicate that competence-based ASR can achieve in excess
of 99% success in recognizing words spoken in the benchmark TIMIT
corpus. Several other colleagues have reported similar intuitions.
(Idiolectal variation accounts for variable estimates of competence-ASR
efficacy; intuitions range from 98% to 100%, σ = 1.3%.) Even when we
imagined
the presence of noise in the data, the drop-off in performance was
modest (97-99%, σ = 1.2%); similar results were obtained when the
recordings were imagined over a noisy communication system. A sketch of
this evaluation protocol follows.
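The protocol can be sketched as follows, with assumptions flagged
loudly: elicit_intuition and the roster of colleagues are hypothetical,
and the accuracy ranges are simply those reported above.

import random
import statistics

def elicit_intuition(colleague, imagined_noise):
    """Each colleague intuits a recognition accuracy for the test data;
    the audio itself, being a performance matter, is disregarded."""
    low, high = (97.0, 99.0) if imagined_noise else (98.0, 100.0)
    return random.uniform(low, high)

colleagues = ["A", "B", "C", "D", "E"]
for noise in (False, True):
    scores = [elicit_intuition(c, noise) for c in colleagues]
    print("imagined noise=%s: mean %.1f%%, sigma %.1f%%"
          % (noise, statistics.mean(scores), statistics.stdev(scores)))
— Python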
These results demonstrate the clear advantage, for ASR, of disregarding
interspeaker variability, speaker emotional state, speech rate, and
other performance factors. When freed from concerns that are
ultimately non-linguistic, our speech recognition system achieves
results unparalleled by any other research program.