Robust automatic speech recognition
Robust Automatic
Speech Recognition
A Bridge to Practical
Applications
Jinyu Li
Li Deng
Reinhold Haeb-Umbach
Yifan Gong
ELSEVIER
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Academic Press is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
© 2016 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, recording, or any information storage and retrieval system,
without permission in writing from the publisher. Details on how to seek permission, further information
about the Publisher's permissions policies and our arrangements with organizations such as the
Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website:
www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher
(other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience
broaden our understanding, changes in research methods, professional practices, or medical treatment
may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and
using any information, methods, compounds, or experiments described herein. In using such information
or methods they should be mindful of their own safety and the safety of others, including parties for
whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any
liability for any injury and/or damage to persons or property as a matter of products liability, negligence
or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in
the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-802398-3
For information on all Academic Press publications
visit our website at http://store.elsevier.com/
Typeset by SPi Global, India
www.spi-global.com
Printed in USA
Working together to grow libraries in developing countries
www.elsevier.com • www.bookaid.org
Contents
About the Authors
List of Figures
List of Tables
Acronyms
Notations
CHAPTER 1 Introduction
1.1 Automatic Speech Recognition
1.2 Robustness to Noisy Environments
1.3 Existing Surveys in the Area
1.4 Book Structure Overview
References
CHAPTER 2 Fundamentals of speech recognition
2.1 Introduction: Components of Speech Recognition
2.2 Gaussian Mixture Models
2.3 Hidden Markov Models and the Variants
2.3.1 How to Parameterize an HMM
2.3.2 Efficient Likelihood Evaluation for the HMM
2.3.3 EM Algorithm to Learn the HMM Parameters
2.3.4 How the HMM Represents Temporal Dynamics of Speech
2.3.5 GMM-HMMs for Speech Modeling and Recognition
2.3.6 Hidden Dynamic Models for Speech Modeling and Recognition
2.4 Deep Learning and Deep Neural Networks
2.4.1 Introduction
2.4.2 A Brief Historical Perspective
2.4.3 The Basics of Deep Neural Networks
2.4.4 Alternative Deep Learning Architectures
2.5 Summary
References
CHAPTER 3 Background of robust speech recognition
3.1 Standard Evaluation Databases
3.2 Modeling Distortions of Speech in Acoustic Environments
3.3 Impact of Acoustic Distortion on Gaussian Modeling
3.4 Impact of Acoustic Distortion on DNN Modeling
3.5 A General Framework for Robust Speech Recognition
3.6 Categorizing Robust ASR Techniques: An Overview
3.6.1 Compensation in Feature Domain vs. Model Domain
3.6.2 Compensation Using Prior Knowledge about Acoustic Distortion
3.6.3 Compensation with Explicit vs. Implicit Distortion Modeling
3.6.4 Compensation with Deterministic vs. Uncertainty Processing
3.6.5 Compensation with Disjoint vs. Joint Model Training
3.7 Summary
References
CHAPTER 4 Processing in the feature and model domains
4.1 Feature-Space Approaches
4.1.1 Noise-Resistant Features
4.1.2 Feature Moment Normalization
4.1.3 Feature Compensation
4.2 Model-Space Approaches
4.2.1 General Model Adaptation for GMM
4.2.2 General Model Adaptation for DNN
4.2.3 Robustness via Better Modeling
4.3 Summary
References
CHAPTER 5 Compensation with prior knowledge
5.1 Learning from Stereo Data
5.1.1 Empirical Cepstral Compensation
5.1.2 SPLICE
5.1.3 DNN for Noise Removal Using Stereo Data
5.2 Learning from Multi-Environment Data
5.2.1 Online Model Combination
5.2.2 Non-Negative Matrix Factorization
5.2.3 Variable-Parameter Modeling
5.3 Summary
References
CHAPTER 6 Explicit distortion modeling
6.1 Parallel Model Combination
6.2 Vector Taylor Series
6.2.1 VTS Model Adaptation
6.2.2 Distortion Estimation in VTS
6.2.3 VTS Feature Enhancement
6.2.4 Improvements over VTS
6.2.5 VTS for the DNN-Based Acoustic Model
6.3 Sampling-Based Methods
6.3.1 Data-Driven PMC
6.3.2 Unscented Transform
6.3.3 Methods Beyond the Gaussian Assumption
6.4 Acoustic Factorization
6.4.1 Acoustic Factorization Framework
6.4.2 Acoustic Factorization for GMM
6.4.3 Acoustic Factorization for DNN
6.5 Summary
References
CHAPTER 7 Uncertainty processing
7.1 Model-Domain Uncertainty
7.2 Feature-Domain Uncertainty
7.2.1 Observation Uncertainty
7.3 Joint Uncertainty Decoding
7.3.1 Front-End JUD
7.3.2 Model JUD
7.4 Missing-Feature Approaches
7.5 Summary
References
CHAPTER 8 Joint model training
8.1 Speaker Adaptive and Source Normalization Training
8.2 Model Space Noise Adaptive Training
8.3 Joint Training for DNN
8.3.1 Joint Front-End and DNN Model Training
8.3.2 Joint Adaptive Training
8.4 Summary
References
CHAPTER 9 Reverberant speech recognition
9.1 Introduction
9.2 Acoustic Impulse Response
9.3 A Model of Reverberated Speech in Different Domains
9.4 The Effect of Reverberation on ASR Performance
9.5 Linear Filtering Approaches
9.6 Magnitude or Power Spectrum Enhancement
9.7 Feature Domain Approaches
9.7.1 Reverberation Robust Features
9.7.2 Feature Normalization
9.7.3 Model-Based Feature Enhancement
9.7.4 Data-Driven Enhancement
9.8 Acoustic Model Domain Approaches
9.9 The REVERB Challenge
9.10 To Probe Further
9.11 Summary
References
CHAPTER 10 Multi-Channel processing
10.1 Introduction
10.2 The Acoustic Beamforming Problem
10.3 Fundamentals of Data-Dependent Beamforming
10.3.1 Signal Model and Objective Functions
10.3.2 Generalized Sidelobe Canceller
10.3.3 Relative Transfer Functions
10.4 Multi-Channel Speech Recognition
10.4.1 ASR on Beamformed Signals
10.4.2 Multi-Stream ASR
10.5 To Probe Further
10.6 Summary
References
CHAPTER 11 Summary and future directions
11.1 Robust Methods in the Era of GMM
11.2 Robust Methods in the Era of DNN
11.3 Multi-Channel Input and Robustness to Reverberation
11.4 Epilogue
References
Index
About the Authors
Jinyu Li received a Ph.D. degree from Georgia Institute of Technology, U.S.A.
From 2000 to 2003, he was a Researcher at Intel China Research Center and a
Research Manager at iFlytek, China. Currently, he is a Principal Applied Scientist
at Microsoft, working as a technical lead to design and improve speech modeling
algorithms and technologies that ensure industry state-of-the-art speech recognition
accuracy for Microsoft products. His major research interests cover several topics in
speech recognition and machine learning, including noise robustness, deep learning,
discriminative training, and feature extraction. He has authored over 60 papers and
has been awarded over 10 patents.
Li Deng received a Ph.D. degree from the University of Wisconsin-Madison, U.S.A.
He was a professor (1989-1999) at the University of Waterloo, Canada. In 1999, he
joined Microsoft Research, where he currently leads R&D of application-focused
deep learning as Partner Research Manager of its Deep Learning Technology Center.
He is also an Affiliate Professor at the University of Washington. He is a Fellow of the
Acoustical Society of America, Fellow of the IEEE, and Fellow of the International
Speech Communication Association. He served as Editor-in-Chief for the IEEE
Signal Processing Magazine and for the IEEE/ACM Transactions on Audio, Speech
and Language Processing (2009-2014). His technical work has been focused on deep
learning for speech, language, image, and multimodal processing, and for other areas
of machine intelligence involving big data. He received numerous awards including
the IEEE SPS Best Paper Awards, IEEE Outstanding Engineer Award, and APSIPA
Industrial Distinguished Leader Award.
Reinhold Haeb-Umbach is a professor with the University of Paderborn, Germany.
His main research interests are in the fields of statistical signal processing and pattern
recognition, with applications to speech enhancement, acoustic beamforming and
source separation, as well as automatic speech recognition. After having worked in
industrial research laboratories for more than 10 years, he joined academia as a full
professor of Communications Engineering in 2001. He has published more than 150
papers in peer reviewed journals and conferences. He is the co-editor of the book
Robust Speech Recognition of Uncertain or Missing Data: Theory and Applications
(Springer, 2011).
Yifan Gong received a Ph.D. (with highest honors) from the University of Henri
Poincare, France. He served the National Scientific Research Center (CNRS) and
INRIA, France, as Research Engineer and then joined CNRS as Senior Research
Scientist. He was a Visiting Research Fellow at the Communications Research
Center of Canada. As Senior Member of Technical Staff, he worked for Texas
Instruments at the Speech Technologies Lab, where he developed speech modeling
technologies robust against noisy environments, designed systems, algorithms,
and software for speech and speaker recognition, and delivered memory- and
CPU-efficient recognizers for mobile devices.
He joined Microsoft in 2004, and is currently a Principal Applied Science Manager
in the areas of speech modeling, computing infrastructure, and speech model
development for speech products. His research interests include automatic speech
recognition/interpretation, signal processing, algorithm development, and engineering
process/infrastructure and management. He has authored over 130 publications
and has been awarded over 30 patents. Specific contributions include stochastic trajectory
modeling, source normalization HMM training, joint compensation of additive and
convolutional noises, and variable parameter HMM. In these areas, he gave tutorials
and other invited presentations in international conferences. He has been serving as
member of technical committee and session chair for many international conferences,
and with IEEE Signal Processing Spoken Language Technical Committees from
1998 to 2002 and since 2013.
List of Figures
Fig. 1.1 From thoughts to speech.
Fig. 2.1 Illustration of the CD-DNN-HMM and its three core components.
Fig. 2.2 Illustration of the CNN in which the convolution is applied along frequency bands.
Fig. 3.1 A model of acoustic environment distortion in the discrete-time domain relating the clean speech sample x[m] to the distorted speech sample y[m].
Fig. 3.2 Cepstral distribution of word oh in Aurora 2.
Fig. 3.3 The impact of noise, with varying mean values from 5 in (a) to 25 in (d), in the log-Mel-filter-bank domain. The clean speech has a mean value of 25 and a standard deviation of 10. The noise has a standard deviation of 2.
Fig. 3.4 Impact of noise with different standard deviation values in the log-Mel-filter-bank domain. The clean speech has a mean value of 25 and a standard deviation of 10. The noise has a mean of 10.
Fig. 3.5 Percentage of saturated activations at each layer on a 6 × 2k DNN.
Fig. 3.6 Average and maximum of $\|\mathrm{diag}(v^{l+1} \mathbin{.*} (1 - v^{l+1}))\,(A^{l})^{T}\|_{2}$ across layers on a 6 × 2k DNN.
Fig. 3.7 t-SNE plot of a clean utterance and the corresponding noisy one with 10 dB SNR of restaurant noise from the training set of Aurora 4.
Fig. 3.8 t-SNE plot of a clean utterance and the corresponding noisy one with 11 dB SNR of restaurant noise from the test set of Aurora 4.
Fig. 3.9 Noise-robust methods in feature and model domain.
Fig. 4.1 Comparison of the MFCC, RASTA-PLP, and PNCC feature extraction.
Fig. 4.2 Computation of the modulation spectral of a speech signal.
Fig. 4.3 Frequency response of RASTA.
Fig. 4.4 Illustration of the temporal structure normalization framework.
Fig. 4.5 An example of frequency response of CMN when T = 200 at a frame rate of 10 Hz.
Fig. 4.6 An example of the Wiener filtering gain G with respect to the spectral density Sxx and Snn.
Fig. 4.7 Two-stage Wiener filter in advanced front-end.
Fig. 4.8 Complexity reduction for two stage Wiener filter.
Fig. 4.9 Illustration of network structures of different adaptation methods. Shaded nodes denote nonlinear units, unshaded nodes for linear units. Red dashed links (gray dashed links in print versions) indicate the transformations that are introduced during adaptation.
Fig. 4.10 The illustration of support vector machines.
Fig. 4.11 The framework to combine generative and discriminative classifiers.
Fig. 5.1 Generate clean feature from noisy feature with DNN.
Fig. 5.2 Speech separation with DNN.
Fig. 5.3 Linear model combination for DNN.
Fig. 5.4 Variable-parameter DNN.
Fig. 5.5 Variable-output DNN.
Fig. 5.6 Variable-activation DNN.
Fig. 6.1 Parallel model combination.
Fig. 6.2 VTS model adaptation.
Fig. 6.3 VTS feature enhancement.
Fig. 6.4 Cepstral distribution of word oh in Aurora 2 after VTS feature enhancement (fVTS).
Fig. 6.5 Acoustic factorization framework.
Fig. 6.6 The flow chart of factorized adaptation for a DNN at the output layer.
Fig. 6.7 The flow chart of factorized training or adaptation for a DNN at the input layer.
Fig. 8.1 Speaker adaptive training.
Fig. 8.2 Noise adaptive training.
Fig. 8.3 Joint training of front-end and DNN model.
Fig. 8.4 An example of joint training of front-end and DNN models.
Fig. 8.5 Adaptive training of DNN.
Fig. 9.1 Hands-free automatic speech recognition in a reverberant enclosure; the source signal travels via a direct path and via single or multiple reflections to the microphone.
Fig. 9.2 A typical acoustic impulse response for a small room with short distance between source and sensor (0.5 m). This impulse response has the parameters T60 = 250 ms and C50 = 31 dB. The impulse response is taken from the REVERB challenge data.
Fig. 9.3 A typical acoustic impulse response for a large room with large distance between source and sensor (2 m). This impulse response has the parameters T60 = 700 ms and C50 = 6.6 dB. The impulse response is taken from the REVERB challenge data.
Fig. 9.4 Spectrogram of a clean speech signal (top), a mildly reverberated signal (T60 = 250 ms, middle) and a severely reverberated signal (T60 = 700 ms, bottom). The dashed lines indicate the word boundaries.
Fig. 9.5 Principle structure of a denoising autoencoder.
Fig. 10.1 Uniform linear array with a source in the far field.
Fig. 10.2 Sample beam patterns of a Delay-Sum Beamformer steered toward θ0 = 0.
Fig. 10.3 Block diagram of a generalized sidelobe canceller with fixed beamformer (FBF) w0, blocking matrix B, and noise cancellation filters q.
List of Tables
Definitions of a Subset of Commonly Used Symbols and Notations, Grouped in Five Separate General Categories
Table 4.1 Feature- and Model-Domain Methods Originally Proposed for GMMs in Chapter 4, Arranged Chronologically
Table 4.2 Feature- and Model-Domain Methods Originally Proposed for DNNs in Chapter 4, Arranged Chronologically
Table 5.1 Difference Between VPDNN and Linear DNN Model Combination
Table 5.2 Compensation with Prior Knowledge Methods Originally Proposed for GMMs in Chapter 5, Arranged Chronologically
Table 5.3 Compensation with Prior Knowledge Methods Originally Proposed for DNNs in Chapter 5, Arranged Chronologically
Table 6.1 Distortion Modeling Methods in Chapter 6, Arranged Chronologically
Table 7.1 Uncertainty Processing Methods in Chapter 7, Arranged Chronologically
Table 8.1 Joint Model Training Methods in Chapter 8, Arranged Chronologically
Table 9.1 Approaches to the Recognition of Reverberated Speech, Arranged Chronologically
Table 10.1 Approaches to Speech Recognition in the Presence of Multi-Channel Recordings
Table 11.1 Representative Methods Originally Proposed for GMMs, Arranged Alphabetically in Terms of the Names of the Methods
Table 11.2 Representative Methods Originally Proposed for DNNs, Arranged Alphabetically
Table 11.3 The Counterparts of GMM-based Robustness Methods for DNN-based Robustness Methods