CONTENTS
LIST OF FIGURES
LIST OF TABLES
PREFACE
1 FUNDAMENTALS OF SPEECH RECOGNITION
1.1 Introduction
1.2 The Paradigm for Speech Recognition
1.3 Outline
1.4 A Brief History of Speech-Recognition Research
2 THE SPEECH SIGNAL: PRODUCTION, PERCEPTION, AND
ACOUSTIC-PHONETIC CHARACTERIZATION
2.1 Introduction
2.1.1 The Process of Speech Production and Perception in Human Beings
2.2 The Speech-Production Process
2.3 Representing Speech in the Time and Frequency Domains
2.4 Speech Sounds and Features
2.4.1 The Vowels
2.4.2 Diphthongs
2.4.3 Semivowels
2.4.4 Nasal Consonants
2.4.5 Unvoiced Fricatives
2.4.6 Voiced Fricatives
2.4.7 Voiced and Unvoiced Stops
2.4.8 Review Exercises
2.5 Approaches to Automatic Speech Recognition by Machine
2.5.1 Acoustic-Phonetic Approach to Speech Recognition
2.5.2 Statistical Pattern-Recognition Approach to Speech Recognition
2.5.3 Artificial Intelligence (AI) Approaches to Speech Recognition
2.5.4 Neural Networks and Their Application to Speech Recognition
2.6 Summary
3 SIGNAL PROCESSING AND ANALYSIS METHODS FOR SPEECH
RECOGNITION
3.1 Introduction
3.1.1 Spectral Analysis Models
3.2 The Bank-of-Filters Front-End Processor
3.2.1 Types of Filter Bank Used for Speech Recognition
3.2.2 Implementations of Filter Banks
3.2.3 Summary of Considerations for Speech-Recognition Filter
Banks
3.2.4 Practical Examples of Speech-Recognition Filter Banks
3.2.5 Generalizations of Filter-Bank Analyzer
3.3 Linear Predictive Coding Model for Speech Recognition
3.3.1 The LPC Model
3.3.2 LPC Analysis Equations
3.3.3 The Autocorrelation Method
3.3.4 The Covariance Method
3.3.5 Review Exercise
3.3.6 Examples of LPC Analysis
3.3.7 LPC Processor for Speech Recognition
3.3.8 Review Exercises
3.3.9 Typical LPC Analysis Parameters
3.4 Vector Quantization
3.4.1 Elements of a Vector Quantization Implementation
3.4.2 The VQ Training Set
3.4.3 The Similarity or Distance Measure
3.4.4 Clustering the Training Vectors
3.4.5 Vector Classification Procedure
3.4.6 Comparison of Vector and Scalar Quantizers
3.4.7 Extensions of Vector Quantization
3.4.8 Summary of the VQ Method
3.5 Auditory-Based Spectral Analysis Models
3.5.1 The EIH Model
3.6 Summary
4 PATTERN-COMPARISON TECHNIQUES
4.1 Introduction
4.2 Speech (Endpoint) Detection
4.3 Distortion Measures--Mathematical Considerations
4.4 Distortion Measures--Perceptual Considerations
4.5 Spectral-Distortion Measures
4.5.1 Log Spectral Distance
4.5.2 Cepstral Distances
4.5.3 Weighted Cepstral Distances and Liftering
4.5.4 Likelihood Distortions
4.5.5 Variations of Likelihood Distortions
4.5.6 Spectral Distortion Using a Warped Frequency Scale
4.5.7 Alternative Spectral Representations and Distortion Measures
4.5.8 Summary of Distortion Measures--Computational Considerations
4.6 Incorporation of Spectral Dynamic Features into the Distortion Measure
4.7 Time Alignment and Normalization
4.7.1 Dynamic Programming--Basic Considerations
4.7.2 Time-Normalization Constraints
4.7.3 Dynamic Time-Warping Solution
4.7.4 Other Considerations in Dynamic Time Warping
4.7.5 Multiple Time-Alignment Paths
4.8 Summary
5 SPEECH RECOGNITION SYSTEM DESIGN AND IMPLEMENTATION
ISSUES
5.1 Introduction
5.2 Application of Source-Coding Techniques to Recognition
5.2.1 Vector Quantization and Pattern Comparison Without Time Alignment
5.2.2 Centroid Computation for VQ Codebook Design
5.2.3 Vector Quantizers with Memory
5.2.4 Segmental Vector Quantization
5.2.5 Use of a Vector Quantizer as a Recognition Preprocessor
5.2.6 Vector Quantization for Efficient Pattern Matching
5.3 Template Training Methods
5.3.1 Casual Training
5.3.2 Robust Training
5.3.3 Clustering
5.4 Performance Analysis and Recognition Enhancements
5.4.1 Choice of Distortion Measures
5.4.2 Choice of Clustering Methods and kNN Decision Rule
5.4.3 Incorporation of Energy Information
5.4.4 Effects of Signal Analysis Parameters
5.4.5 Performance of Isolated Word-Recognition Systems
5.5 Template Adaptation to New Talkers
5.5.1 Spectral Transformation
5.5.2 Hierarchical Spectral Clustering
5.6 Discriminative Methods in Speech Recognition
5.6.1 Determination of Word Equivalence Classes
5.6.2 Discriminative Weighting Functions
5.6.3 Discriminative Training for Minimum Recognition Error
5.7 Speech Recognition in Adverse Environments
5.7.1 Adverse Conditions in Speech Recognition
5.7.2 Dealing with Adverse Conditions
5.8 Summary
6 THEORY AND IMPLEMENTATION OF HIDDEN MARKOV MODELS
6.1 Introduction
6.2 Discrete-Time Markov Processes
6.3 Extensions to Hidden Markov Models
6.3.1 Coin-Toss Models
6.3.2 The Urn-and-Ball Model
6.3.3 Elements of an HMM
6.3.4 HMM Generator of Observations
6.4 The Three Basic Problems for HMMs
6.4.1 Solution to Problem 1--Probability Evaluation
6.4.2 Solution to Problem 2--"Optimal" State Sequence
6.4.3 Solution to Problem 3--Parameter Estimation
6.4.4 Notes on the Reestimation Procedure
6.5 Types of HMMs
6.6 Continuous Observation Densities in HMMs
6.7 Autoregressive HMMs
6.8 Variants on HMM Structures--Null Transitions and Tied States
6.9 Inclusion of Explicit State Duration Density in HMMs
6.10 Optimization Criterion--ML, MMI, and MDI
6.11 Comparisons of HMMs
6.12 Implementation Issues for HMMs
6.12.1 Scaling
6.12.2 Multiple Observation Sequences
6.12.3 Initial Estimates of HMM Parameters
6.12.4 Effects of Insufficient Training Data
6.12.5 Choice of Model
6.13 Improving the Effectiveness of Model Estimates
6.13.1 Deleted Interpolation
6.13.2 Bayesian Adaptation
6.13.3 Corrective Training
6.14 Model Clustering and Splitting
6.15 HMM System for Isolated Word Recognition
6.15.1 Choice of Model Parameters
6.15.2 Segmental K-Means Segmentation into States
6.15.3 Incorporation of State Duration into the HMM
6.15.4 HMM Isolated-Digit Performance
6.16 Summary
7 SPEECH RECOGNITION BASED ON CONNECTED WORD MODELS
7.1 Introduction
7.2 General Notation for the Connected Word-Recognition
Problem
7.3 The Two-Level Dynamic Programming (Two-Level DP)
Algorithm
7.3.1 Computation of the Two-Level DP Algorithm
7.4 The Level Building (LB) Algorithm
7.4.1 Mathematics of the Level Building Algorithm
7.4.2 Multiple Level Considerations
7.4.3 Computation of the Level Building Algorithm
7.4.4 Implementation Aspects of Level Building
7.4.5 Integration of a Grammar Network
7.4.6 Examples of LB Computation of Digit Strings
7.5 The One-Pass (One-State) Algorithm
7.6 Multiple Candidate Strings
7.7 Summary of Connected Word Recognition Algorithms
7.8 Grammar Networks for Connected Digit Recognition
7.9 Segmental K-Means Training Procedure
7.10 Connected Digit Recognition Implementation
7.10.1 HMM-Based System for Connected Digit Recognition
7.10.2 Performance Evaluation on Connected Digit Strings
7.11 Summary
8 LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
8.1 Introduction
8.2 Subword Speech Units
8.3 Subword Unit Models Based on HMMs
8.4 Training of Subword Units
8.5 Language Models for Large Vocabulary Speech
Recognition
8.6 Statistical Language Modeling
8.7 Perplexity of the Language Model
8.8 Overall Recognition System Based on Subword Units
8.8.1 Control of Word Insertion/Word Deletion Rate
8.8.2 Task Semantics
8.8.3 System Performance on the Resource Management Task
8.9 Context-Dependent Subword Units
8.9.1 Creation of Context-Dependent Diphones and Triphones
8.9.2 Using Interword Training to Create CD Units
8.9.3 Smoothing and Interpolation of CD PLU Models
8.9.4 Smoothing and Interpolation of Continuous Densities
8.9.5 Implementation Issues Using CD Units
8.9.6 Recognition Results Using CD Units
8.9.7 Position Dependent Units
8.9.8 Unit Splitting and Clustering
8.9.9 Other Factors for Creating Additional Subword Units
8.9.10 Acoustic Segment Units
8.10 Creation of Vocabulary-Independent Units
8.11 Semantic Postprocessor for Recognition
8.12 Summary
9 TASK ORIENTED APPLICATIONS OF AUTOMATIC SPEECH
RECOGNITION
9.1 Introduction
9.2 Speech-Recognizer Performance Scores
9.3 Characteristics of Speech-Recognition Applications
9.3.1 Methods of Handling Recognition Errors
9.4 Broad Classes of Speech-Recognition Applications
9.5 Command-and-Control Applications
9.5.1 Voice Repertory Dialer
9.5.2 Automated Call-Type Recognition
9.5.3 Call Distribution by Voice Commands
9.5.4 Directory Listing Retrieval
9.5.5 Credit Card Sales Validation
9.6 Projections for Speech Recognition