Generative Speech Recognition:
A competence model of ASR
by Stanislaus Gorky
10 PRINT "Hello, world!"
20 END
— BASIC
Speech recognition has long posed a problem for speech scientists,
phoneticians, and commercial speech researchers. In this paper I offer
observations on new developments in the generative program, which has
much to contribute to speech recognition. My investigation of this
matter is based on the foundational remarks on ASR made by Chomsky &
Ladefoged (1968), to wit:
The development of a speech recognition system SRS has fundamentally
to do with the study of any language L spoken by a human H at time T.
(1968: 16)
Here Chomsky clearly anticipates the link between the generative
program and speech recognition technology that has been recently
confirmed by speech researchers. Fundamental to Chomsky’s thought on
this matter, however, is the distinction between competence and
performance in speech recognition.
“Recognizing a word W involves any number of complex factors C1...Ci:
interspeaker variability, speech rate, regional differences in accent,
memory capacity, the speaker’s state of mind, and many other factors
related to performance. Crucially, none of these factors has anything
to do with the speaker’s intention IS, i.e. the expression of the
speaker’s thought TS. As such, all performance factors should be
excluded from consideration until such time as competence has been
studied thoroughly.” (1968: 345, fn. 2)
Following recent developments in the minimalist program, we recognize
that every successful word recognition involves the pairing of a sound
image I with a lexical meaning L. Well-formed <I, L> pairs are said to
converge in the derivation, while ill-formed sound-meaning pairs crash.
Strong acoustic features must be checked before Spell-Out, while weak
acoustic features may be checked after Spell-Out, before the lexical
form (LF), of course.
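To make the derivational mechanics concrete, the following is a minimal
sketch of convergence checking under the assumptions just stated; the
names AcousticFeature, Strength, and converges are my own illustration
and appear nowhere in the generative literature.

from dataclasses import dataclass
from enum import Enum

class Strength(Enum):
    STRONG = "strong"   # must be checked before Spell-Out
    WEAK = "weak"       # may be checked after Spell-Out, before LF

@dataclass
class AcousticFeature:
    name: str
    strength: Strength
    checked_before_spell_out: bool

def converges(sound_image, lexical_meaning):
    """A <I, L> pair converges iff every strong acoustic feature in the
    sound image I was checked before Spell-Out; otherwise it crashes."""
    return all(f.checked_before_spell_out
               for f in sound_image
               if f.strength is Strength.STRONG)

# A strong feature left unchecked at Spell-Out crashes the derivation.
I = [AcousticFeature("onset burst", Strength.STRONG, False),
     AcousticFeature("V2 quality", Strength.WEAK, False)]
print("converges" if converges(I, "hello") else "crashes")  # crashes
— Python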
The distinction between strong and weak acoustic features accounts
parametrically for typological variation in speech recognition. The weak
vowel features of English are not checked early in the derivation,
leading to the perception of schwa in weak positions. Contrast this with
Spanish, where vowel features must be checked before Spell-Out,
regardless of their prosodic status. Given this typological result,
it is not unreasonable to expect the minimalist program to provide
definitive answers to all cases of vowel reduction, at all times,
and in all places (with some idealization of the data). These minimalist
insights into vowel reduction can be applied directly to work in ASR,
as in the sketch below.
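By way of illustration only, here is a hypothetical rendering of the
vowel-reduction parameter; the table CHECK_WEAK_EARLY and the function
perceive_vowel are assumptions of mine, not part of any published
analysis.

# Per-language parameter: are weak vowel features checked before
# Spell-Out?
CHECK_WEAK_EARLY = {"English": False, "Spanish": True}

def perceive_vowel(vowel, prosodically_weak, language):
    """If weak vowel features go unchecked before Spell-Out, a vowel in
    a weak position is perceived as schwa; otherwise the full vowel
    survives regardless of prosodic status."""
    if prosodically_weak and not CHECK_WEAK_EARLY[language]:
        return "ə"   # schwa in weak positions (English-type setting)
    return vowel     # full vowel (Spanish-type setting)

print(perceive_vowel("o", True, "English"))  # ə
print(perceive_vowel("o", True, "Spanish"))  # o
— Python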
The model was tested against the benchmark TIMIT corpus. Several
colleagues and I first observed the well-formed <I, L> pairs in the
training data, and then intuited the response of the recognition
grammar to the test data. Preliminary results indicate a robust model.
My intuitions indicate that competence-based ASR can achieve in excess
of 99% success in recognizing words spoken in the benchmark TIMIT
corpus. Several other colleagues have reported similar intuitions.
(Idiolectal variation accounts for variable estimates of competence-ASR
efficacy; intuitions range from 98% to 100%, σ = 1.3%.) Even when we
imagined
the presence of noise in the data, the drop-off in performance was
modest (97-99%, σ = 1.2%); similar results were obtained when the
recordings were imagined over a noisy communication system. A sketch of
this evaluation protocol follows.
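The protocol can be sketched as follows, with assumptions flagged
loudly: elicit_intuition and the roster of colleagues are hypothetical,
and the accuracy ranges are simply those reported above.

import random
import statistics

def elicit_intuition(colleague, imagined_noise):
    """Each colleague intuits a recognition accuracy for the test data;
    the audio itself, being a performance matter, is disregarded."""
    low, high = (97.0, 99.0) if imagined_noise else (98.0, 100.0)
    return random.uniform(low, high)

colleagues = ["A", "B", "C", "D", "E"]
for noise in (False, True):
    scores = [elicit_intuition(c, noise) for c in colleagues]
    print("imagined noise=%s: mean %.1f%%, sigma %.1f%%"
          % (noise, statistics.mean(scores), statistics.stdev(scores)))
— Python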
These results demonstrate the clear advantage, for ASR, of disregarding
interspeaker variability, speaker emotional state, speech rate, and
other performance factors. When freed from concerns that are
ultimately non-linguistic, our speech recognition system achieves
results unparalleled by any other research program.