Robust Automatic Speech Recognition

A Bridge to Practical Applications

Jinyu Li

Li Deng

Reinhold Haeb-Umbach

Yifan Gong

ELSEVIER

AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Academic Press is an imprint of Elsevier

225 Wyman Street, Waltham, MA 02451, USA
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK

© 2016 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-802398-3

For information on all Academic Press publications visit our website at http://store.elsevier.com/

Typeset by SPi Global, India
www.spi-global.com

Printed in USA

Working together to grow libraries in developing countries

www.elsevier.com • www.bookaid.org

Contents

About the Authors
List of Figures
List of Tables
Acronyms
Notations

CHAPTER 1 Introduction
1.1 Automatic Speech Recognition
1.2 Robustness to Noisy Environments
1.3 Existing Surveys in the Area
1.4 Book Structure Overview
References

CHAPTER 2 Fundamentals of speech recognition
2.1 Introduction: Components of Speech Recognition
2.2 Gaussian Mixture Models
2.3 Hidden Markov Models and the Variants
2.3.1 How to Parameterize an HMM
2.3.2 Efficient Likelihood Evaluation for the HMM
2.3.3 EM Algorithm to Learn the HMM Parameters
2.3.4 How the HMM Represents Temporal Dynamics of Speech
2.3.5 GMM-HMMs for Speech Modeling and Recognition
2.3.6 Hidden Dynamic Models for Speech Modeling and Recognition
2.4 Deep Learning and Deep Neural Networks
2.4.1 Introduction
2.4.2 A Brief Historical Perspective
2.4.3 The Basics of Deep Neural Networks
2.4.4 Alternative Deep Learning Architectures
2.5 Summary
References

CHAPTER 3 Background of robust speech recognition
3.1 Standard Evaluation Databases
3.2 Modeling Distortions of Speech in Acoustic Environments
3.3 Impact of Acoustic Distortion on Gaussian Modeling
3.4 Impact of Acoustic Distortion on DNN Modeling
3.5 A General Framework for Robust Speech Recognition
3.6 Categorizing Robust ASR Techniques: An Overview
3.6.1 Compensation in Feature Domain vs. Model Domain
3.6.2 Compensation Using Prior Knowledge about Acoustic Distortion
3.6.3 Compensation with Explicit vs. Implicit Distortion Modeling
3.6.4 Compensation with Deterministic vs. Uncertainty Processing
3.6.5 Compensation with Disjoint vs. Joint Model Training
3.7 Summary
References

CHAPTER 4 Processing in the feature and model domains
4.1 Feature-Space Approaches
4.1.1 Noise-Resistant Features
4.1.2 Feature Moment Normalization
4.1.3 Feature Compensation
4.2 Model-Space Approaches
4.2.1 General Model Adaptation for GMM
4.2.2 General Model Adaptation for DNN
4.2.3 Robustness via Better Modeling
4.3 Summary
References

CHAPTER 5 Compensation with prior knowledge
5.1 Learning from Stereo Data
5.1.1 Empirical Cepstral Compensation
5.1.2 SPLICE
5.1.3 DNN for Noise Removal Using Stereo Data
5.2 Learning from Multi-Environment Data
5.2.1 Online Model Combination
5.2.2 Non-Negative Matrix Factorization
5.2.3 Variable-Parameter Modeling
5.3 Summary
References

CHAPTER 6 Explicit distortion modeling
6.1 Parallel Model Combination
6.2 Vector Taylor Series
6.2.1 VTS Model Adaptation
6.2.2 Distortion Estimation in VTS
6.2.3 VTS Feature Enhancement
6.2.4 Improvements over VTS
6.2.5 VTS for the DNN-Based Acoustic Model
6.3 Sampling-Based Methods
6.3.1 Data-Driven PMC
6.3.2 Unscented Transform
6.3.3 Methods Beyond the Gaussian Assumption
6.4 Acoustic Factorization
6.4.1 Acoustic Factorization Framework
6.4.2 Acoustic Factorization for GMM
6.4.3 Acoustic Factorization for DNN
6.5 Summary
References

CHAPTER 7 Uncertainty processing
7.1 Model-Domain Uncertainty
7.2 Feature-Domain Uncertainty
7.2.1 Observation Uncertainty
7.3 Joint Uncertainty Decoding
7.3.1 Front-End JUD
7.3.2 Model JUD
7.4 Missing-Feature Approaches
7.5 Summary
References

CHAPTER 8 Joint model training
8.1 Speaker Adaptive and Source Normalization Training
8.2 Model Space Noise Adaptive Training
8.3 Joint Training for DNN
8.3.1 Joint Front-End and DNN Model Training
8.3.2 Joint Adaptive Training
8.4 Summary
References

CHAPTER 9 Reverberant speech recognition
9.1 Introduction
9.2 Acoustic Impulse Response
9.3 A Model of Reverberated Speech in Different Domains
9.4 The Effect of Reverberation on ASR Performance
9.5 Linear Filtering Approaches
9.6 Magnitude or Power Spectrum Enhancement
9.7 Feature Domain Approaches
9.7.1 Reverberation Robust Features
9.7.2 Feature Normalization
9.7.3 Model-Based Feature Enhancement
9.7.4 Data-Driven Enhancement
9.8 Acoustic Model Domain Approaches
9.9 The REVERB Challenge
9.10 To Probe Further
9.11 Summary
References

CHAPTER 10 Multi-Channel processing
10.1 Introduction
10.2 The Acoustic Beamforming Problem
10.3 Fundamentals of Data-Dependent Beamforming
10.3.1 Signal Model and Objective Functions
10.3.2 Generalized Sidelobe Canceller
10.3.3 Relative Transfer Functions
10.4 Multi-Channel Speech Recognition
10.4.1 ASR on Beamformed Signals
10.4.2 Multi-Stream ASR
10.5 To Probe Further
10.6 Summary
References

CHAPTER 11 Summary and future directions
11.1 Robust Methods in the Era of GMM
11.2 Robust Methods in the Era of DNN
11.3 Multi-Channel Input and Robustness to Reverberation
11.4 Epilogue
References

Index

About the Authors

Jinyu Li received his Ph.D. degree from the Georgia Institute of Technology, U.S.A. From 2000 to 2003, he was a Researcher at Intel China Research Center and a Research Manager at iFlytek, China. Currently, he is a Principal Applied Scientist at Microsoft, working as a technical lead to design and improve speech modeling algorithms and technologies that ensure industry state-of-the-art speech recognition accuracy for Microsoft products. His major research interests cover several topics in speech recognition and machine learning, including noise robustness, deep learning, discriminative training, and feature extraction. He has authored over 60 papers and been awarded over 10 patents.

Li Deng received his Ph.D. degree from the University of Wisconsin-Madison, U.S.A. He was a professor (1989-1999) at the University of Waterloo, Canada. In 1999, he joined Microsoft Research, where he currently leads R&D of application-focused deep learning as Partner Research Manager of its Deep Learning Technology Center. He is also an Affiliate Professor at the University of Washington. He is a Fellow of the Acoustical Society of America, Fellow of the IEEE, and Fellow of the International Speech Communication Association. He served as Editor-in-Chief for the IEEE Signal Processing Magazine and for the IEEE/ACM Transactions on Audio, Speech and Language Processing (2009-2014). His technical work has been focused on deep learning for speech, language, image, and multimodal processing, and for other areas of machine intelligence involving big data. He has received numerous awards, including the IEEE SPS Best Paper Awards, IEEE Outstanding Engineer Award, and APSIPA Industrial Distinguished Leader Award.

Reinhold Haeb-Umbach is a professor with the University of Paderborn, Germany. His main research interests are in the fields of statistical signal processing and pattern recognition, with applications to speech enhancement, acoustic beamforming and source separation, as well as automatic speech recognition. After having worked in industrial research laboratories for more than 10 years, he joined academia as a full professor of Communications Engineering in 2001. He has published more than 150 papers in peer-reviewed journals and conferences. He is the co-editor of the book Robust Speech Recognition of Uncertain or Missing Data: Theory and Applications (Springer, 2011).

Yifan Gong received his Ph.D. (with highest honors) from the University of Henri Poincare, France. He served the National Scientific Research Center (CNRS) and INRIA, France, as Research Engineer and then joined CNRS as Senior Research Scientist. He was a Visiting Research Fellow at the Communications Research Center of Canada. As Senior Member of Technical Staff, he worked for Texas Instruments at the Speech Technologies Lab, where he developed speech modeling technologies robust against noisy environments, designed systems, algorithms, and software for speech and speaker recognition, and delivered memory- and CPU-efficient recognizers for mobile devices.

He joined Microsoft in 2004, and is currently a Principal Applied Science Manager in the areas of speech modeling, computing infrastructure, and speech model development for speech products. His research interests include automatic speech recognition/interpretation, signal processing, algorithm development, and engineering process/infrastructure and management. He has authored over 130 publications and been awarded over 30 patents. His specific contributions include stochastic trajectory modeling, source normalization HMM training, joint compensation of additive and convolutional noises, and variable parameter HMM. In these areas, he has given tutorials and other invited presentations at international conferences. He has served as a technical committee member and session chair for many international conferences, and with the IEEE Signal Processing Spoken Language Technical Committees from 1998 to 2002 and since 2013.

List of Figures

Fig. 1.1 From thoughts to speech.

Fig. 2.1 Illustration of the CD-DNN-HMM and its three core components.
Fig. 2.2 Illustration of the CNN in which the convolution is applied along frequency bands.

Fig. 3.1 A model of acoustic environment distortion in the discrete-time domain relating the clean speech sample $x[m]$ to the distorted speech sample $y[m]$.
Fig. 3.2 Cepstral distribution of word oh in Aurora 2.
Fig. 3.3 The impact of noise, with varying mean values from 5 in (a) to 25 in (d), in the log-Mel-filter-bank domain. The clean speech has a mean value of 25 and a standard deviation of 10. The noise has a standard deviation of 2.
Fig. 3.4 Impact of noise with different standard deviation values in the log-Mel-filter-bank domain. The clean speech has a mean value of 25 and a standard deviation of 10. The noise has a mean of 10.
Fig. 3.5 Percentage of saturated activations at each layer on a 6×2k DNN.
Fig. 3.6 Average and maximum of $\|\mathrm{diag}(v^{l+1} \,.\!*\, (1 - v^{l+1}))(A^{l})^{T}\|_{2}$ across layers on a 6×2k DNN.
Fig. 3.7 t-SNE plot of a clean utterance and the corresponding noisy one with 10 dB SNR of restaurant noise from the training set of Aurora 4.
Fig. 3.8 t-SNE plot of a clean utterance and the corresponding noisy one with 11 dB SNR of restaurant noise from the test set of Aurora 4.
Fig. 3.9 Noise-robust methods in feature and model domain.

Fig. 4.1 Comparison of the MFCC, RASTA-PLP, and PNCC feature extraction.
Fig. 4.2 Computation of the modulation spectral of a speech signal.
Fig. 4.3 Frequency response of RASTA.
Fig. 4.4 Illustration of the temporal structure normalization framework.
Fig. 4.5 An example of frequency response of CMN when T = 200 at a frame rate of 10 Hz.
Fig. 4.6 An example of the Wiener filtering gain G with respect to the spectral density $S_{xx}$ and $S_{nn}$.
Fig. 4.7 Two-stage Wiener filter in advanced front-end.
Fig. 4.8 Complexity reduction for two stage Wiener filter.
Fig. 4.9 Illustration of network structures of different adaptation methods. Shaded nodes denote nonlinear units, unshaded nodes for linear units. Red dashed links (gray dashed links in print versions) indicate the transformations that are introduced during adaptation.
Fig. 4.10 The illustration of support vector machines.
Fig. 4.11 The framework to combine generative and discriminative classifiers.

Fig. 5.1 Generate clean feature from noisy feature with DNN.
Fig. 5.2 Speech separation with DNN.
Fig. 5.3 Linear model combination for DNN.
Fig. 5.4 Variable-parameter DNN.
Fig. 5.5 Variable-output DNN.
Fig. 5.6 Variable-activation DNN.

Fig. 6.1 Parallel model combination.
Fig. 6.2 VTS model adaptation.
Fig. 6.3 VTS feature enhancement.
Fig. 6.4 Cepstral distribution of word oh in Aurora 2 after VTS feature enhancement (fVTS).
Fig. 6.5 Acoustic factorization framework.
Fig. 6.6 The flow chart of factorized adaptation for a DNN at the output layer.
Fig. 6.7 The flow chart of factorized training or adaptation for a DNN at the input layer.

Fig. 8.1 Speaker adaptive training.
Fig. 8.2 Noise adaptive training.
Fig. 8.3 Joint training of front-end and DNN model.
Fig. 8.4 An example of joint training of front-end and DNN models.
Fig. 8.5 Adaptive training of DNN.

Fig. 9.1 Hands-free automatic speech recognition in a reverberant enclosure; the source signal travels via a direct path and via single or multiple reflections to the microphone.
Fig. 9.2 A typical acoustic impulse response for a small room with short distance between source and sensor (0.5 m). This impulse response has the parameters $T_{60}$ = 250 ms and $C_{50}$ = 31 dB. The impulse response is taken from the REVERB challenge data.
Fig. 9.3 A typical acoustic impulse response for a large room with large distance between source and sensor (2 m). This impulse response has the parameters $T_{60}$ = 700 ms and $C_{50}$ = 6.6 dB. The impulse response is taken from the REVERB challenge data.
Fig. 9.4 Spectrogram of a clean speech signal (top), a mildly reverberated signal ($T_{60}$ = 250 ms, middle) and a severely reverberated signal ($T_{60}$ = 700 ms, bottom). The dashed lines indicate the word boundaries.
Fig. 9.5 Principle structure of a denoising autoencoder.

Fig. 10.1 Uniform linear array with a source in the far field.
Fig. 10.2 Sample beam patterns of a Delay-Sum Beamformer steered toward $\theta_0$ = 0.
Fig. 10.3 Block diagram of a generalized sidelobe canceller with fixed beamformer (FBF) $W_0$, blocking matrix B, and noise cancellation filters q.

List of Tables

Definitions of a Subset of Commonly Used Symbols and Notations, Grouped in Five Separate General Categories

Table 4.1 Feature- and Model-Domain Methods Originally Proposed for GMMs in Chapter 4, Arranged Chronologically
Table 4.2 Feature- and Model-Domain Methods Originally Proposed for DNNs in Chapter 4, Arranged Chronologically
Table 5.1 Difference Between VPDNN and Linear DNN Model Combination
Table 5.2 Compensation with Prior Knowledge Methods Originally Proposed for GMMs in Chapter 5, Arranged Chronologically
Table 5.3 Compensation with Prior Knowledge Methods Originally Proposed for DNNs in Chapter 5, Arranged Chronologically
Table 6.1 Distortion Modeling Methods in Chapter 6, Arranged Chronologically
Table 7.1 Uncertainty Processing Methods in Chapter 7, Arranged Chronologically
Table 8.1 Joint Model Training Methods in Chapter 8, Arranged Chronologically
Table 9.1 Approaches to the Recognition of Reverberated Speech, Arranged Chronologically
Table 10.1 Approaches to Speech Recognition in the Presence of Multi-Channel Recordings
Table 11.1 Representative Methods Originally Proposed for GMMs, Arranged Alphabetically in Terms of the Names of the Methods
Table 11.2 Representative Methods Originally Proposed for DNNs, Arranged Alphabetically
Table 11.3 The Counterparts of GMM-based Robustness Methods for DNN-based Robustness Methods
