Quality of Telephone-Based Spoken Dialogue Systems, part 3
Quality of Human-Machine Interaction over the Phone 75
has later been modified to better predict the effects of ambient noise, quantizing distortion, and time-variant impairments like lost frames or packets. The
current model version is described in detail in ITU-T Rec. G.107 (2003).
The idea underlying the E-model is to transform the effects of individual impairments (e.g. those caused by noise, echo, delay, etc.) first to an intermediate
‘transmission rating scale’. During this transformation, instrumentally measurable parameters of the transmission path are transformed into the respective
amount of degradation they provoke, called ‘impairment factors’. Three types
of impairment factors, reflecting three types of degradations, are calculated:
- All degradations which occur simultaneously with the speech signal, e.g. a connection that is too loud, quantizing noise, or a non-optimum sidetone, are expressed by the simultaneous impairment factor Is.
- All degradations which are delayed with respect to the speech signal, e.g. the effects of pure delay (in a conversation) or of listener and talker echo, are expressed by the delayed impairment factor Id.
- All degradations resulting from low bit-rate codecs, partly also under transmission error conditions, are expressed by the effective equipment impairment factor Ie,eff. Ie,eff takes the equipment impairment factors for the error-free case, Ie, into account.
These types of degradations do not necessarily reflect the quality dimensions
which can be obtained in a multidimensional auditory scaling experiment. In
fact, such dimensions have been identified as “intelligibility” or “overall clarity”, “naturalness” or “fidelity”, loudness, color of sound, or the distinction
between background and signal distortions (McGee, 1964; McDermott, 1969;
Bappert and Blauert, 1994). Instead, the impairment factors of the E-model have
been chosen for practical reasons, to distinguish between parameters which can
easily be measured and handled in the network planning process.
The different impairment factors are subtracted from the highest possible
transmission rating level Ro which is determined by the overall signal-to-noise
ratio of the connection. This ratio is calculated assuming a standard active
speech level of 26 dB below the overload point of the digital system (cf. the
definition of the active speech level in ITU-T Rec. P.56, 1993), and taking the
SLR and RLR loudness ratings, the circuit noise Nc, the noise floor Nfor, as well as the
ambient room noise into account. An allowance for the transmission rating level
is made to reflect the differences in user expectation towards networks differing
from the standard wireline one (e.g. cordless or mobile phones), expressed
by a so-called ‘advantage of access’ factor A. For a discussion of this factor
see Möller (2000). As a result, the overall transmission rating factor R of the
connection can be calculated as

\[ R = R_o - I_s - I_d - I_{e,\mathrm{eff}} + A \]

This transmission rating factor is the principal output of the E-model. It reflects
the overall quality level of the connection which is described by the input parameters discussed in the last section. For normal parameter settings
R can be transformed to an estimation of a mean user judgment on a 5-point
ACR quality scale defined in ITU-T Rec. P.800 (1996), using the fixed S-shaped
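relationship

\[
\mathrm{MOS} =
\begin{cases}
1 & \text{for } R \le 0 \\
1 + 0.035\,R + R\,(R - 60)\,(100 - R) \cdot 7 \cdot 10^{-6} & \text{for } 0 < R < 100 \\
4.5 & \text{for } R \ge 100
\end{cases}
\]
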
Both the transmission rating factor R and the estimated mean opinion score
MOS give an indication of the overall quality of the connection. They can be
related to network planning quality classes defined in ITU-T Rec. G.109 (1999),
see Table 2.5. For the network planner, not only the overall R value is important,
but also the single contributions (Ro, Is, Id and Ie,eff), because they provide
an indication on the sources of the quality degradations and potential reduction
solutions (e.g. by introducing an echo canceller). Other formulae exist for
relating R to the percentage of users rating a connection good or better (%GoB)
or poor or worse (%PoW).
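The calculation chain described above can be illustrated with a minimal Python sketch. It assumes that the four contributions Ro, Is, Id and Ie,eff have already been obtained (their formulae are not reproduced here) and only combines them into R, maps R to an estimated MOS with the S-shaped curve of ITU-T Rec. G.107, and derives %GoB and %PoW from the Gaussian-based relations given in that Recommendation. The function names, the class labels and the class boundaries (steps of 10 between R = 50 and R = 100, as recalled from ITU-T Rec. G.109) are assumptions of this sketch and should be checked against Table 2.5 and the Recommendations before use.

```python
import math


def transmission_rating(ro, i_s, i_d, ie_eff, a=0.0):
    """Combine the E-model contributions: R = Ro - Is - Id - Ie,eff + A."""
    return ro - i_s - i_d - ie_eff + a


def r_to_mos(r):
    """Map R to an estimated MOS on the 5-point ACR scale (G.107 curve)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


def _gaussian_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def percent_gob(r):
    """Estimated percentage of users rating the connection good or better."""
    return 100.0 * _gaussian_cdf((r - 60.0) / 16.0)


def percent_pow(r):
    """Estimated percentage of users rating the connection poor or worse."""
    return 100.0 * _gaussian_cdf((45.0 - r) / 16.0)


def quality_class(r):
    """Assumed G.109-style class for a given R (labels are illustrative)."""
    if r >= 90:
        return "best"
    if r >= 80:
        return "high"
    if r >= 70:
        return "medium"
    if r >= 60:
        return "low"
    if r >= 50:
        return "poor"
    return "not recommended"


if __name__ == "__main__":
    # Hypothetical contribution values, chosen only to exercise the functions.
    r = transmission_rating(ro=90.0, i_s=2.0, i_d=5.0, ie_eff=10.0, a=0.0)
    print(f"R = {r:.1f}, MOS = {r_to_mos(r):.2f}, class = {quality_class(r)}")
    print(f"%GoB = {percent_gob(r):.1f}, %PoW = {percent_pow(r):.1f}")
```

Keeping the contributions as separate arguments mirrors the point made for the network planner: each term can be inspected on its own to locate the source of a degradation before it is folded into R.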
The exact formulae for calculating Ro, Is, Id, and Ie,eff are given in ITU-T
Rec. G.107 (2003). For Ie and A, fixed values are defined in ITU-T Appendix
I to Rec. G.113 (2002) and ITU-T Rec. G.107 (2003). Another example of a
network planning model is the SUBMOD model developed by British Telecom
(ITU-T Suppl. 3 to P-Series Rec., 1993), which is based on ideas from Richards
(1973).
If the network has already been set up, it is possible to obtain realistic measurements of major parts of the network equipment. The measurements can be
performed either off-line (intrusively, when the equipment is put out of network
operation), or on-line in operating networks (non-intrusive measurement). In
operating networks, however, it might be difficult to access the user interfaces;
therefore, standard values are taken for this part of the transmission chain. The
measured input parameters or signals can be used as an input to the signal-based
or network planning models (so-called monitoring models). In this way, it becomes possible to monitor quality for the specific network under consideration.
Different models and model combinations can be envisaged, and details can
be found in the literature (Möller and Raake, 2002; ITU-T Rec. P.562, 2004;
Ludwig, 2003).
From the principles used by the models, the quality aspects which may be
predicted become obvious. Current signal-based measures predict only one-way voice transmission quality for specific parts of the transmission channel
that they have been optimized for. These predictions usually reach a high
accuracy because adequate input parameters are available. In contrast to this,
network planning models like the E-model base their predictions on simplified
and perhaps imprecisely estimated planning values. In addition to one-way
voice transmission quality, they cover conversational aspects and to a certain
extent the effects caused by the service and its context of use. All models which
have been described in this section address HHI over the phone. Investigations
on how they may be used in HMI for predicting ASR performance are described
in Chapter 4, and for synthesized speech in Chapter 5.
2.4.2 SDS Specification
The specification phase of an SDS may be of crucial importance for the
success of a service. An appropriate specification gives an indication of
the scale of the whole task, increases the modularity of the system, allows problems
to be spotted early, and is particularly suited to checking the functionality of the
system to be set up. The specification should begin with a survey of user
requirements: Who are the potential users, and where, why and how will they
use the service?
Before starting with an exact specification of a service and the underlying
system, the target functionality has to be clarified. Several authors point out that
system functionality may be a very critical issue for the success of a service.
For example, Lamel et al. (1998b) reported that the prototype users of their
French ARISE system for train information did not differentiate between the
service functionality (operative functions) and the system responses which may
be critically determined by the technical functions. In the case that the system
informs the user about its limitations, the system response may be appropriate
under the given constraints, but completely unsatisfactory for the user. Thus,
systems which are well-designed from a technological and from an interaction
point of view may be unusable because of a restricted functionality.
In order to design systems and services which are usable, human factor issues
should be taken into account early in the specification phase (Dybkjær and
Bernsen, 2000). The specification should cover all aspects which potentially
influence the system usability, including its ease of use, its capability to perform
a natural, flexible and robust dialogue with the user, a sufficient task domain
coverage, and contextual factors in the deployment of the SDS (e.g. service
improvement or economic benefit). The following information needs to be
specified:
- Application domain and task. Although developers are seeking application-independent systems, there are a number of principal design decisions which depend on the specific application under consideration. Within a domain, different tasks may require completely different solutions, e.g. an information task may be insensitive to security requirements, whereas the corresponding reservation may require the communication of a credit card number and thus may be inappropriate for the speech modality. The application will also determine the linguistic aspects of the interaction (vocabulary, syntax, etc.).
- User and task requirements. They may be determined from recordings of human services if the corresponding situation exists, or via interviews in the case of new tasks which have no prior history in HHI.
- Intended user group.
- Contextual factors. They may be amongst the most important factors influencing users' satisfaction with SDSs, and include service improvement (longer opening hours, introduction of new functionalities, avoidance of queues, etc.) and economic benefits (e.g. users pay less for an SDS service than for a human one), see Dybkjær and Bernsen (2000).
- Common knowledge which will have to be shared between the human user and the SDS. This knowledge will arise from the application domain and task, and will have to be specified in terms of an initial vocabulary and language model, the required speech understanding capability, and the speech output capability.
- Common knowledge which will have to be shared between the SDS and the underlying application, and the corresponding interface (e.g. SQL).
- Knowledge to be included in the user model, cf. the discussion of user models in Section 2.1.3.4.