Performance Through Consistency: 
MS-TDNNs for Large Vocabulary 
Continuous Speech Recognition 
Joe Tebelskis and Alex Waibel 
School of Computer Science 
Carnegie Mellon University 
Pittsburgh, PA 15213 
Abstract 
Connectionist speech recognition systems are often handicapped by 
an inconsistency between training and testing criteria. This prob- 
lem is addressed by the Multi-State Time Delay Neural Network 
(MS-TDNN), a hierarchical phoneme and word classifier which uses 
DTW to modulate its connectivity pattern, and which is directly 
trained on word-level targets. The consistent use of word accu- 
racy as a criterion during both training and testing leads to very 
high system performance, even wilh limited training data. Until 
now, the MS-TDNN has been applied primarily to small vocabu- 
lary recognition and word spotting tasks. In this paper we apply 
the architecture to large vocabulary continuous speech recognition, 
and demonstrate that our MS-TDNN outperforms all other sys- 
tems that have been tested on the CMU Conference Registration 
database. 
1 INTRODUCTION 
Neural networks hold great promise in the area of speech recognition. But in order 
to fulfill their promise, they must be used "properly". One obvious conditim} of 
"proper" use is that both training and testing should use a consistent error crite- 
rion. Unfortunately, in speech recognition, this ol>vious condition is often violatcd: 
networks are frequently trained using phoneme-level criteria (phoneme classifical ion 
696 
MS-TDNN's for Large Vocabulary Continuous Speech Recognition 697 
or acoustic prediction), while the testing criterion is word recognition accuracy. If 
phoneme recognition were perfect, then word recognition would also be perfect; but 
of course this is not the case, and the errors which are inevitably made are optimized 
for the wrong criterion, resulting in suboptimal word recognition accuracy. 
The Multi-State Time Delay Neural Network (MS-TDNN) has recently been pro- 
posed as a solution to this problem [1]. The MS-TDNN is a hierarchically structured 
classifier which consistently uses word accuracy as a criterion for both training and 
testing. It has so far yielded excellent results on small vocabulary recognition [1, 2, 3] 
and a word spotting task [4]. In the present paper, we review the MS-TDNN archi- 
tecture, discuss its application to large vocabulary continuous speech recognition, 
and present some favorable experimental results on this task. 
2 
RESOLVING INCONSISTENCIES: 
MS-TDNN ARCHITECTURE 
In this section we motivate the design of our MS-TDNN by showing how a series of 
intermediate designs resolve successive inconsistencies. 
The preliminary system architecture, shown in Figure l(a), simply consists of a 
phoneme classifier, in this case a TDNN, whose outputs are copied into a DTW 
matrix, in which continuous speech recognition is performed. Many existing systems 
are based on this type of approach. While this is fine for bootstrapping purposes, 
the design is ultimately suboptimal because the training criterion is inconsistcnt 
with the testing criterion: phoneme classification is not word classification. 
To address this inconsistency we must train the network explicitly to perform word 
classification. To this end, we define a word layer with one unit for each word in the 
vocabulary; the idea is illustrated in Figure 1 (b) for the word "cat". We correlate the 
activation of the word unit with the associated DTW score by establishing connec- 
tions from the DTW alignment path to the word unit. Also, we give the phonemes 
within a word independently trainable weights, to enhance word discrimination (for 
example, to discriminate "cat" from "mat" it may be useful to give special emphasis 
to the first phoneme); these weights are tied over all frames in which the phoneme 
occurs. Thus a word unit is an ordinary unit, except that its connectivity to the 
preceding layer is determined dynamically, and its net input should be normalized 
by the total duration of the word. The word unit is trained on a target of 1 or 
0, depending if the word is correct or incorrect for the current segment of speech, 
and the resulting error is backpropagated through the entire network. Thus, word 
discrimination is treated very much like phoneme discrimination. 
Although network (b) resolves the original inconsistency, it now suffers from a sec- 
ondary one - namely, that the weights leading to a word unit are used during 
training but ignored during testing, since DTW is still performed entirely in the 
DTW layer. We resolve this inconsistency by "pushing down" these weights one 
level, as shown in Figure 1(c). Now the phoneme activations are no longer directly 
copied into the DTW layer, but instead are modulated by a weight and bias before 
being stored there (DTW units are linear); and the word unit has constant weights, 
and no bias. During word-level training, error is still backpropagated from targets 
at the word level, but biases and weights are modified only at the DTW level and 
698 Tebelskis and Waibel 
 in 
<l teeee eee.) 
 ,, 
[Speech:CAT" I 
(a) 
-CAT  
Speech: "CAT" 
Test 
(b) 
Word 
DTW 
TDNN 
"CAT" 
.eeeee..) 
/ ,. 
Speech: CAT" 
"CAT" 
2 
Speech: "CAT' 
(c) (d) 
st 
Train 
Figure 1' Resolving inconsistencies: (a) TDNN+DTW. (b) Adding word layer. 
(c) Pushing down weights. (d) Linear word units, for continuous speech recognition. 
MS-TDNN's for Large Vocabulary Continuous Speech Recognition 699 
below. Note that this transformed network is not exactly equivalent to the pre- 
vious one, but it preserves the properties that there are separate learned weights 
associated with each phoneme, and there is an effective bias for each word. 
Network (c) is still flawed by a minor inconsistency, arising from its sigmoidal word 
unit. The problem does not exist for isolated word recognition, since any mono- 
tonic function (sigmoidal or otherwise) will correlate the highest word activation 
with the highest DTW score. However, for continuous speech recognition, which 
concatenates words into a sentence, the optimal sum of sigmoids may not corre- 
spond to the optimal sigmoid of a sum, leading to an inconsistency between word 
and sentence recognition. Linear word units, as shown in (d), resolve this prob- 
lem; in practice we have found that linear word units perform slightly better than 
sigmoidal word units. 
The resulting architecture is called a "Multi-State TDNN" because it integrates 
the DTW alignment of multiple states into a TDNN to perform word classification. 
While an MS-TDNN for small vocabulary recognition can be based on word models 
with non-shared states [4], a large vocabulary MS-TDNN must be based on shared 
units of speech, such as phonemes. In our system, the TDNN (first three layers) 
is shared by all words in the vocabulary, while each word requires only one non- 
shared weight and bias for each of its phonemes. Thus the number of parameters in 
the MS-TDNN remains moderate even for a large vocabulary, and it can make the 
most of limited training data. Moreover, new words can be added to the vocabulary 
without retraining, by simply defining a new DTW layer for each new word, with 
incoming weights and biases initialized to 1.0 and 0.0, respectively. 
Given constant weights under the word layer, word level training is really just an- 
other way of viewing DTW level training; but the former is conceptually simpler 
because there is a single binary target for each word, which makes word level dis- 
crimination very straightforward. For a large vocabulary, discriminating against all 
incorrect words would be very expensive, so we discriminate against only a small 
number of close matches (typically 1). 
Word level training yields better word classification than phoneme level training. 
In one experiment, for example, we bootstrapped our system with phoneme level 
training (as shown in Figure la), and found that word recognition accuracy asymp- 
toted at 71% on a test set. We then continued training from the word level (as 
shown in Figure ld), and found that word accuracy improved to 81% on the test 
set. It is worth noting that in an intermediate experiment, even when we held the 
DTW layer's incoming weights and biases constant (at 1.0 and 0.0 respectively), 
thus adding no new trainable parameters to the system, we found that word level 
training still improved the word accuracy from 71% to 75% on the test set, as a 
consequence of word-level discrimination. 
3 BALANCING THE TRAINING SET 
The MS-TDNN must be first bootstrapped with phoneme level training. In our 
early experiments we had difficulty bootstrapping the TDNN, not only because our 
training set was unbalanced, but also because the vast majority of phoneroes wcre 
being trained on a target of 0, so that the negative training was overwhelming and 
700 Tebelskis and Waibel 
A 
E 
O 
N 
T 
Y 
Z 
N O N A T Y E T 
0 0 0  0 0 0 0 
0 0 0 0 0 0  0 
0  0 0 0 0 0 0 
t 0  0 0 0 0 0 
0 0 0 0  0 0  
0 0 0 0 0  0 0 
0 0 0 0 0 0 0 0 
A 
E 
O 
N 
T 
Y 
Z 
Scale backprop error by: 
N O N A T Y E T 
-1/7 -1/7 -U7 1.0 -1/7 -1/7 -1/7 -1/7 
-1/7 -1/7 -1/7 -1/7 -1/7 -1/7 1.0 -U7 
1/2 -1/6 1/2-1/6-1/6-1/6-1/6 -1 
-1/6 -U6-U6-1/6 1/2-1-1/6 1/2 
-1/8 -1/8 -1/8 -1/8 -1/8 -1 -1/8 -1/8 
Figure 2: Balancing the training set: "No not yet". 
defeating the positive training. In order to address these problems, we normalized 
the amount of error backpropagated from each phoneme unit so that the relative 
influence of positive and negative training was balanced out over the entire training 
set. 
This apparently novel technique is illustrated in Figure 2. Given the utterance "No 
not yet", for example, we observe that there are two frames each of "N" and "T", 
one frame of several other phoneroes, and zero of others. Bascd on these counts, 
we compute a backpropagation scaling factor for each phoneme in each frame, as 
shown in the bottom half of the figure. 
We found that this technique was indispensible when bootstrapping with the 
squared error criterion, E = y',(T/ - y})2. In subsequent experiments, we found 
that it was still somewhat helpful but no longer necessary when training with the 
McClelland error function, E = -  log(1 - (5 r} - Y)2), or with the Cross Entropy 
error function, E = - y' (75 logY}) + (1 - 7})log(1 - Y). We attribute this difference 
to the fact that the sum squared error function is merely a quadratic error function, 
whereas the latter two functions tend towards infinite error as the difference between 
the target and actual activation approaches its maximum value, compensating more 
forcefully for the fiat behavior encouraged by all the negative training. 
4 EXPERIMENTAL RESULTS 
We have trained and tested our MS-TDNN on two recordings of 200 English sen- 
tences from the CMU Conference Registration database, recorded by one male 
speaker using a close-speaking microphone. Our speaker-dependent testing results 
MS-TDNN's for Large Vocabulary Continuous Speech Recognition 701 
Table 1: Comparison of speech recognition systems applied to the CMU Conference 
Registration Database. HMM-n = HMM with n mixture densities [5]. LPNN = 
Linked Predictive Neural Network [6]. HCNN = Hidden Control Neural Network 
[7]. LVQ = Learned Vector Quantization [8]. TDNN corresponds to MS-TDNN 
without word-level training. (Perplexity 402(a) used only 41 test sentences; 402(b) 
used 204 test sentences.) 
System 7 lll 402(a) 402(b) 
HMM-1 55% 
HMM-5 96% 71% 58% 
HMM-10 97% 75% 66% 
LPNN 97% 60% 41% 
HCNN 75% 
LVQ 98% 84% 74% 61% 
TDNN 98% 78% 72% 64% 
MS-TDNN 98% 82% 81% 70% 
are given in Table 1, along with comparative results from several other systems. 
It can be seen that the MS-TDNN has outperformed all other systems that have 
been compared on this database. This particular MS-TDNN used 16 reelscale spec- 
tral input units, 20 hidden units, 120 phoneme units (40 phoneroes with a 3-state 
model), 5,487 DTW units, and 402 word units, with 3 and 5 delays respectively in 
the first two layers of weights, giving a total of 24,074 weights; it used symmetric 
(-1..1) unit activations and inputs, and linear DTW units and word units. Word 
level training was performed using the Classification Figure of Merit (CFM) error 
function, E = (1 + (Y,- Ye))2, in which the correct word (with activation Yc) is 
explicitly discriminated from the best incorrect word (with activation Y,); CFM 
was somewhat better than MSE for word level training, although the opposite was 
true for phoneme level training. Negative word level training was performed only if 
the two words were sufficiently confusable, in order to avoid disrupting the network 
on behalf of words that had already been well-learned. 
More recently we have begun experiments on the speaker independent Resource 
Management database, containing nearly 4000 training sentences. To date we have 
primarily focused on bootstrapping the phoneme level TDNN on this database, 
without doing much word level training; but early experiments suggest we may rea- 
sonably expect another 4% improvement from word level training. Our preliminary 
results are shown in Table 2, compared against two other systems: an early version 
of Sphinx [9], and a simple but large MLP which has been trained as a phoneme 
classifier [10]. Each of these systems uses context independent phoneme models, 
with multiple states per phoneme, and includes differenced coefficients in its input 
representation. It can be seen that our TDNN outperforms this version of Sphinx 
while using a comparable number of parameters, but is outperformed by the MLP 
which has an order of magnitude more parameters. (We note that the MLP also uses 
phonological models to enhance its performance, and uses online training with ran- 
702 Tebelskis and Waibel 
Table 2: 
database. 
Context-independent systems applied to the Resource 
perplexity 
System parameters 60 1000 
Early Sphinx 35,000 76% 36% 
MLP (ICSI-SRI) 300,000 94% 75% 
TDNN 42,000 79% 43% 
MS-TDNN 7,5,000 (+4%) 
Management 
dom sampling rather than updating the weights after each sentence.) In any case, 
we suggest that the MLP might further improve its performance by incorporating 
word level training. 
5 REMAINING INCONSISTENCIES 
While the MS-TDNN was designed for consistency, it is not yet entirely consistent. 
For example, the MS-TDNN's training algorithm assumes that the network con- 
nectivity is fixed; but in fact the connectivity at the word level varies, depending 
on the DTW alignment path during the current iteration. We presume that this 
is a negligible factor, however, by the time the training has asymptoted and the 
segmentation has stabilized. 
A more serious inconsistency arises during discriminative training. In our MS- 
TDNN, negative training is performed at known word boundaries; this is incon- 
sistent because word boundaries are in fact unknown during testing. It would be 
better to discriminate against words found by a free alignment, as suggested by Hild 
[3]. Unfortunately this is an expensive operation, and it proved impractical for our 
system. 
6 CONCLUSION 
We have shown that the performance of a connectionist speech recognition sys- 
tem can be improved by resolving inconsistencies in its design. Specifically, by 
introducing word level training into a TDNN phoneme classifier (thus defining an 
MS-TDNN), the training and testing criteria become consistent, enhancing the sys- 
tem's word recognition accuracy. We applicd our MS-TDNN architecture to the 
task of large vocabulary continuous speech recognition, and found that it outper- 
forms all other systems that have been evaluated on the CMU Conference Reg- 
istration database. In addition, preliminary results suggest that the MS-TDNN 
may perform well on the large vocabulary Resource Management database, using 
a relatively small number of free parameters. Our future work will focus on this 
investigation. 
MS-TDNN's for Large Vocabulary Continuous Speech Recognition 703 
Acknowledgements 
The authors gratefully acknowledge the support of DARPA and the National Science 
Foundation. 
References 
[10] 
[1] P. Ilaffner, M. Franzini, and A. Waibel. Integrating Time Alignment and 
Connectionist Networks for Iligh Performance Continuous Speech Recognition. 
In Proc. International Conference on Acoustics, Speech, and Signal Processing 
(ICA SSP), 1991. 
[2] P. Ilaffner. Connectionist Word-Level Classification in Speech Recognition. In 
Proc. ICASSP, 1992. 
[3] H. Hild and A. Waibel. Connected Letter Recognition with a Multi-State Time 
Delay Neural Network. In Advances in Neural Information Processing Systems 
5, Morgan Kaufmann Publishers, 1993. 
[4] T. Zeppenfeld and A. Waibel. A Hybrid Neural Network, Dynamic Program- 
ming Word Spotter. In Proc. ICASSP, 1992. 
[5] O. Schmidbauer. An LVQ Based Reference Model for Speaker-Independent and 
-Adaptive Speech Recognition. Technical Report, Carnegie Mellon University, 
1991. 
[6] J. Tebelskis, A. Waibel, B. Petek, and O. Schmidbauer. Continuous Speech 
Recognition using Linked Predictive Neural Networks. In Proc. ICASSP, 1991. 
[7] B. Petek and J. Tebelskis. Context-Dependent Hidden Control Neural Network 
Architecture for Continuous Speech Recognition. In Proc. ICASSP, 1992. 
[8] O. Schmidbauer and J. Tebelskis. An LVQ Based Reference Model for Speaker 
Adaptive Speech Recognition. In Proc. ICASSP, 1992. 
[9] K. F. Lee. Large Vocabulary Speaker-Independent Continuous Speech Recog- 
nition: The SPIlINX System. PhD Thesis, Carnegie Mellon University, 1988. 
M. Cohen, Il. Franco, N. Morgan, D. Bumelhart, and V. Abrash. Context- 
Dependent Multiple Distribution Phonetic Modeling with MLP's. In Advances 
in Neural Information Processing Systems 5, Morgan Kaufmann Publishers, 
1993. 
