two or more stimuli. In either case, the judgment will reflect some type of
implicit or explicit reference.
The question of reference is an important one for the quality assessment
and evaluation of synthesized speech. In contrast to references for speech
recognition or speech understanding, however, it relates to the perception of
the user. When no explicit references are given to the user, he/she will make
use of his/her internal references in the judgment. Explicit references can
be topline references, baseline references, or scalable references. Such
references can be chosen on a segmental level (e.g. high-quality or coded speech
as a topline, or concatenations of co-articulatorily neutral phones as a baseline),
on a prosodic level (natural prosody as a topline, and original durations with a
flat melody as a baseline), on the level of voice characteristics (the target
speaker as a topline for personalized speech output), or on an overall quality
level; see van Bezooijen and van Heuven (1997).
A scalable reference which is often used for the evaluation of transmitted
speech in telephony is calibrated signal-correlated noise generated with the
help of a modulated noise reference unit, MNRU (ITU-T Rec. P.810, 1996).
Because it is perceptually not similar to the degradations introduced by current
speech synthesizers, the use of an MNRU often leads to reference conditions
outside the range of the systems to be assessed (Salza et al., 1996; Klaus et al.,
1997). Time-and-frequency warping (TFW) has been developed as an alternative,
producing a controlled “wow and flutter” effect by speeding up and slowing down
the speech signal (Johnston, 1997). It is, however, still perceptually different
from the degradation produced by modern corpus-based synthesizers.
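As an illustration, the signal-correlated noise at the core of the MNRU can be
sketched in a few lines of Python. The sketch follows the multiplicative-noise
principle of ITU-T Rec. P.810, but omits the band-limiting filters which the
Recommendation additionally specifies:

    import numpy as np

    def mnru(speech, q_db, seed=0):
        # Multiply each sample by (1 + 10^(-Q/20) * n), where n is
        # unit-variance Gaussian noise and Q is the desired
        # signal-to-modulated-noise ratio in dB (P.810 principle).
        rng = np.random.default_rng(seed)
        noise = rng.standard_normal(len(speech))
        return speech * (1.0 + 10.0 ** (-q_db / 20.0) * noise)

    # A scalable reference set, e.g. Q = 5...35 dB in 7.5 dB steps:
    # references = [mnru(x, q) for q in (5, 12.5, 20, 27.5, 35)]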
The experimental design has to be chosen to balance test conditions, speech
material, and voices, e.g. using a Graeco-Latin Square or a Balanced Block
design (Cochran and Cox, 1992); a minimal construction is sketched below.
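One classical cyclic construction of a Graeco-Latin square (valid for odd orders
only; the designs in Cochran and Cox (1992) cover further cases) illustrates the
balancing idea:

    def graeco_latin_square(n):
        # Cell (i, j) holds the pair ((i + j) mod n, (2*i + j) mod n).
        # For odd n, each symbol of either alphabet appears once per
        # row and column, and every ordered pair occurs exactly once.
        if n % 2 == 0:
            raise ValueError("this cyclic construction needs odd n")
        return [[((i + j) % n, (2 * i + j) % n) for j in range(n)]
                for i in range(n)]

    # E.g. 5 subjects (rows) x 5 sessions (columns), pairing synthesis
    # system (first symbol) with text set (second symbol) so that
    # neither is confounded with subject or presentation order:
    for row in graeco_latin_square(5):
        print(row)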
The length of individual test sessions should be limited to a maximum which the
test subjects can tolerate without fatigue. Speech samples should be played back
using high-quality test management equipment in order not to introduce
additional degradations beyond the
ones under investigation (e.g. the ones stemming from the synthesized speech
samples, and potential transmission degradations, see Chapter 5). They should
be calibrated to a common level, e.g. -26 dB below the overload point of the
digital system, which is the recommended level for narrow-band telephony. On
the acoustic side, this level should correspond to a listening level of 79 dB SPL.
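A rough sketch of this level calibration, assuming floating-point samples with
the overload point at 1.0, is given below. Note that the ITU-T actually
prescribes the active speech level according to Rec. P.56, which gates out
speech pauses; plain RMS is only a crude stand-in:

    import numpy as np

    def normalize_to_dbov(x, target_dbov=-26.0):
        # Scale the signal so that its RMS level lies target_dbov dB
        # below the overload point (here 1.0). A faithful measurement
        # would use the active speech level of ITU-T Rec. P.56.
        rms = np.sqrt(np.mean(x ** 2))
        target_rms = 10.0 ** (target_dbov / 20.0)
        return x * (target_rms / rms)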
The listening set-up should reflect the situation which will be encountered in
the later real-life application. For a telephone-based dialogue service, handset
or hands-free terminals should be used as listening user interfaces. Because of
the variety of different telephone handsets available, an ‘ideal’ handset with a
frequency response calibrated to that of an intermediate reference system,
IRS (ITU-T Rec. P.48, 1988), is commonly used; a rough approximation is
sketched below.
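The true IRS receive characteristic is a tabulated frequency response in ITU-T
Rec. P.48; as a crude software stand-in, a 300-3400 Hz band-pass at least
reproduces the telephone bandwidth:

    import numpy as np
    from scipy import signal

    def telephone_band(x, fs=8000):
        # 4th-order Butterworth band-pass, 300-3400 Hz. This only
        # mimics the bandwidth of an IRS-filtered handset, not the
        # exact P.48 frequency response.
        sos = signal.butter(4, [300, 3400], btype="bandpass",
                            fs=fs, output="sos")
        return signal.sosfilt(sos, x)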
Test results are finally analyzed by means of an analysis of variance (ANOVA)
to test the significance
of the experiment factors, and to find confidence intervals for the individual
mean values. More general information on the test set-up and administration
can be found in ITU-T Rec. P.800 (1996) or in Arden (1997).
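As a minimal illustration of this analysis step (with invented ratings, and a
one-way design in place of the multi-way analyses a full experiment would
require), the significance test and confidence intervals could look as follows:

    import numpy as np
    from scipy import stats

    # Invented category ratings for three systems, one factor.
    ratings = {
        "system_A": np.array([4, 5, 4, 3, 4, 5]),
        "system_B": np.array([3, 3, 4, 2, 3, 3]),
        "system_C": np.array([2, 3, 2, 2, 1, 3]),
    }
    f_stat, p_value = stats.f_oneway(*ratings.values())
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

    # 95% confidence interval for one condition's mean rating:
    a = ratings["system_A"]
    ci = stats.t.interval(0.95, df=len(a) - 1,
                          loc=a.mean(), scale=stats.sem(a))
    print(f"mean = {a.mean():.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")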
When the speech output module as a whole is to be evaluated in its functional
context, black box test methods using judgment scales are commonly
applied. Different aspects of global quality such as intelligibility, naturalness,
comprehensibility, listening-effort, or cognitive load should nevertheless be
taken into account. The principle of functional testing will be discussed in
more detail in Section 5.1. The method which is currently recommended by the
ITU-T is a standard listening-only test, with stimuli representative of
SDS-based telephone services; see ITU-T Rec. P.85 (1994). In addition
to the judgment task, test subjects have to answer content-related questions so
that their focus of attention remains on a content level during the test. It is
recommended that the following set of five-point category scales (a brief
discussion of scaling is given in Section 3.8.6) is given to
the subjects in two separate questionnaires (type Q and I):
Acceptance: Do you think that this voice could be used for such an information service by telephone? Yes; no. (Q and I)
Overall impression: How do you rate the quality of the sound of what you
have just heard? Excellent; good; fair; poor; bad. (Q and I)
Listening effort: How would you describe the effort you were required to
make in order to understand the message? Complete relaxation possible, no
effort required; attention necessary, no appreciable effort required; moderate
effort required; considerable effort required; no meaning understood with any
feasible effort. (I)
Comprehension problems: Did you find certain words hard to understand?
Never; rarely; occasionally; often; all of the time. (I)
Articulation: Were the sounds distinguishable? Yes, very clear; yes, clear
enough; fairly clear; no, not very clear; no, not at all. (I)
Pronunciation: Did you notice any anomalies in pronunciation? No; yes,
but not annoying; yes, slightly annoying; yes, annoying; yes, very annoying.
(Q)
Speaking rate: The average speed of delivery was: Much faster than preferred;
faster than preferred; preferred; slower than preferred; much slower
than preferred. (Q)
Voice pleasantness: How would you describe the voice? Very pleasant;
pleasant; fair; unpleasant; very unpleasant. (Q)
An example of a functional test based on this principle is described in Chapter 5.
Other approaches include judgments on naturalness and intelligibility, e.g. the
SAM overall quality test (van Bezooijen and van Heuven, 1997).
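For analysis, such category answers are usually coded numerically. A small
sketch follows; the 5 = best ... 1 = worst coding is a common convention rather
than part of the Recommendation, only three of the scales above are spelled
out, and the labels are abbreviated:

    import statistics

    # Category labels, best first, for three of the P.85 scales above
    # (labels abbreviated); the remaining scales are encoded analogously.
    P85_SCALES = {
        "overall_impression": ["excellent", "good", "fair", "poor", "bad"],
        "listening_effort": ["complete relaxation", "attention necessary",
                             "moderate effort", "considerable effort",
                             "no meaning understood"],
        "articulation": ["very clear", "clear enough", "fairly clear",
                         "not very clear", "not at all clear"],
    }

    def score(scale, answer):
        # Map a category answer to a 5 (best) .. 1 (worst) rating.
        return 5 - P85_SCALES[scale].index(answer)

    # Mean opinion score for one scale over several subjects:
    answers = ["good", "fair", "good", "excellent"]
    mos = statistics.mean(score("overall_impression", a) for a in answers)
    print(f"overall impression MOS = {mos:.2f}")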
In order to obtain analytic information on the individual components of a
speech synthesizer, a number of specific glass box tests have been developed.
They refer to linguistic aspects like text pre-processing, grapheme-to-phoneme
conversion, word stress, morphological decomposition, syntactic parsing, and
sentence stress, as well as to acoustic aspects like segmental quality at the word
or sentence level, prosodic aspects, and voice characteristics. For a discussion
of the most important methods see van Bezooijen and van Heuven (1997) and
van Bezooijen and Pols (1990). On the segmental level, examples include the
diagnostic rhyme test (DRT) and the modified rhyme test (MRT), the SAM
Standard Segmental Test, the CLuster IDentification test (CLID), the Bellcore
test, and tests with semantically unpredictable sentences (SUS). Prosodic evaluation can be done either on a formal or on a functional level, and using different
presentation methods and scales (paired comparison or single stimulus, category judgment or magnitude estimation). Mariniak and Mersdorf (1994) and
Sonntag and Portele (1997) describe methods for assessing the prosody of synthetic speech without interference from the segmental level, using test stimuli
that convey only intensity, fundamental frequency, and temporal structure (e.g.
re-iterant intonation by Mersdorf (2001), or artificial voice signals, sinusoidal
waveforms, sawtooth signals, etc.). Other tests concentrate on the prosodic
function, e.g. in terms of illocutionary acts (SAM Prosodic Function Test), see
van Bezooijen and van Heuven (1997).
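By way of illustration, intelligibility with semantically unpredictable
sentences is typically scored as the proportion of correctly reproduced words.
The position-wise matching below is one simple convention; practical scoring
may first align reference and response, and the example sentence is invented:

    def sus_word_accuracy(reference, response):
        # Score one SUS item as the fraction of reference words
        # reproduced at the correct position.
        ref = reference.lower().split()
        hyp = response.lower().split()
        hits = sum(r == h for r, h in zip(ref, hyp))
        return hits / len(ref)

    # An (invented) SUS item and a listener's transcription:
    print(sus_word_accuracy("the table walked through the blue truth",
                            "the cable walked through a blue truth"))
    # -> 0.714... (5 of 7 words correct)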
A specific acoustic aspect is the voice of the machine agent. Voice
characteristics include the mean pitch level, mean loudness, mean tempo,
harshness, creak,
whisper, tongue body orientation, dialect, accent, etc. They help the listener
to form an idea of the speaker's mood, personality, physical size, gender, age,
regional background, socio-economic status, health, and identity. This information
is not consciously used by the listener, but helps him/her to infer information,
and may have practical consequences as to the listener’s attitude towards the
machine agent, and to his/her interpretation of the agent’s message. A general
aspect of the voice which is often assessed is voice pleasantness, e.g. using
the approach in ITU-T Rec. P.85 (1994). More diagnostic assessment of voice
characteristics is mainly restricted to the judgment of natural speech, see van
Bezooijen and van Heuven (1997). However, these authors state that the effect
of voice characteristics on the overall quality of services is still rather unclear.
Several comparative studies between different evaluation methods have been
reported in the literature. Kraft and Portele (1995) compared five German
synthesis systems using a cluster identification test for segmental intelligibility,
a paired-comparison test addressing general acceptance at the sentence level,
and a category rating test on the paragraph level. The authors conclude that
each test yielded results in its own right, and that a comprehensive assessment
of speech synthesis systems demands cross-tests in order to relate individual
quality aspects to each other. Salza et al. (1996) used a single stimulus rating
according to ITU-T Rec. P.85 (1994) (but without comprehension questions)
and a paired comparison technique. They found good agreement between the
two methods in terms of overall quality. The most important aspects used by
the subjects to differentiate between systems were global impression, voice,
articulation and pronunciation.
3.8 SDS Assessment and Evaluation
At the beginning of this chapter it was stated that the assessment of system
components, in the way described in the previous sections, is not
sufficient for addressing the overall quality of an SDS-based service. Analytical measures of system performance are a valuable source of information in
describing how the individual parts of the system fulfill their task. They may
however sometimes miss the relevant contributors to the overall performance
of the system, and to the quality perceived by the user. For example, erroneous
speech recognition or speech understanding may be compensated for by the
discourse processing component, without affecting the overall system quality.
For this reason, interaction experiments with real or test users are indispensable
when the quality of an SDS and of a telecommunication service relying on it
are to be determined.
In laboratory experiments, both types of information can be obtained in parallel: During the dialogue of a user with the system under test, interaction
parameters can be collected. These parameters can partly be measured instrumentally, from log files which are produced by the dialogue system. Other
parameters can only be determined with the help of experts who annotate a
completed dialogue with respect to certain characteristics (e.g. task fulfillment,
contextual appropriateness of system utterances, etc.). After each interaction,
test subjects are given a questionnaire, or they are interviewed in order to collect
judgments on the perceived quality features.
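A small sketch of the instrumental part, assuming an invented log-file format
(real systems each define their own), illustrates how such interaction
parameters can be derived:

    import re
    from datetime import datetime

    # Hypothetical log-file format, e.g.:
    #   2004-03-01T10:15:02 SYSTEM: Which city do you want to travel to?
    #   2004-03-01T10:15:07 USER: to Hamburg please
    LOG_LINE = re.compile(r"^(\S+) (SYSTEM|USER): (.*)$")

    def interaction_parameters(log_text):
        # Derive simple instrumentally measurable parameters (turn
        # counts, overall dialogue duration) from one dialogue log.
        turns = [LOG_LINE.match(line).groups()
                 for line in log_text.strip().splitlines()]
        times = [datetime.fromisoformat(t) for t, _, _ in turns]
        return {
            "system_turns": sum(1 for _, w, _ in turns if w == "SYSTEM"),
            "user_turns": sum(1 for _, w, _ in turns if w == "USER"),
            "duration_s": (times[-1] - times[0]).total_seconds(),
        }

    log = """2004-03-01T10:15:02 SYSTEM: Which city do you want to travel to?
    2004-03-01T10:15:07 USER: to Hamburg please
    2004-03-01T10:15:09 SYSTEM: When do you want to leave?"""
    print(interaction_parameters(log))
    # {'system_turns': 2, 'user_turns': 1, 'duration_s': 7.0}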
In a field test situation with real users, instrumentally logged interaction
parameters are often the unique source of information for the service provider
in order to monitor the quality of the system. The amount of data which can
be collected with an operating service may however become very large. In
this case, it is important to define a core set of metrics which describe system
performance, and to have tools at hand which automate a large part of the
data analysis process. The task of the human evaluator is then to interpret this
data, and to estimate the effect of the collected performance measures on the
quality perceived by the user.