Quality of Telephone-Based Spoken Dialogue Systems, part 7
all relate to the system’s output voice (dimensions intelligibility, friendliness
and voice naturalness). The friendliness of the system thus seems to be highly
related to its voice. The final dimension ‘clarity of information’ does not form
a cluster with any of the other questions.
These clusters can now be interpreted in the QoS taxonomy. The ‘personal
impression’ cluster is mainly related to comfort, the ‘pleasantness’ question
(B24) to user satisfaction as well. Cluster 2 (dialogue smoothness, B19 and
B21) forms one aspect of communication efficiency. The global quality aspects
covered by questions B0 and B23 (Cluster 3) mainly relate to user satisfaction.
The strong influence of the ‘perceived system understanding’ question (B5) on
this dimension has already been noted. This question is however located in the
speech input/output quality category of the QoS taxonomy. Cluster 4 is related
to system behavior (B9, B10 and B11) and can be attributed to dialogue cooperativity; question B10 also relates to dialogue symmetry. The questions addressing
interaction flexibility (B13 and B14) belong to the dialogue symmetry category.
‘Naturalness’ (B12 and B18) is once again related to both dialogue cooperativity and dialogue symmetry. These two categories cannot be clearly separated
with respect to the user questions. Questions B15, B17 and B20 all reflect communication efficiency. Cluster 8, related to informativeness (B1, B2 and B4),
is attributed to the dialogue cooperativity category. This is only partly true for Cluster
9 (B6 and B8): whereas B8 is part of dialogue cooperativity, B6 fits best into
the comfort category. Cluster 10 (B7, B16 and B22) is mainly related to the
speech output quality category. However, question B16 also reflects the agent
personality aspect, and thus the comfort category. The stand-alone question B3
is part of the dialogue cooperativity category.
A similar analysis can be used for the judgments on the part C questions
of experiment 6.3, namely questions C1 to C18 (the rest of the questions have
either free answer possibilities or are related to the user’s expectations about
what is important for the system). A hierarchical cluster analysis leads to the
dendrogram which is shown in Figure 6.3.
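The kind of analysis behind such a dendrogram can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the experiment: the distance between two questions is taken to be 1 minus the Pearson correlation of their ratings (the text does not name the distance measure), the ratings are invented, and the function names `pearson` and `average_linkage` are hypothetical. The merge history it returns is the information a dendrogram displays.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long rating lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(ratings):
    """Agglomerative clustering with average linkage between groups.

    ratings maps a question label to its ratings (one per test subject).
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    # Assumed distance between two questions: 1 - Pearson correlation.
    dist = {
        frozenset((a, b)): 1.0 - pearson(ratings[a], ratings[b])
        for a, b in combinations(ratings, 2)
    }

    def linkage(c1, c2):
        # Mean of all pairwise distances between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(q,) for q in ratings]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Invented ratings of five test subjects on four questions (illustration only).
example_ratings = {
    "C1": [5, 4, 2, 1, 3],
    "C9": [5, 5, 2, 1, 3],
    "C2": [1, 3, 5, 2, 4],
    "C3": [2, 3, 5, 1, 4],
}

for left, right, distance in average_linkage(example_ratings):
    print(left, "+", right, "at distance", round(distance, 3))
```

With these invented ratings, questions whose ratings rise and fall together across subjects are merged first, and the two resulting groups are joined last at a much larger distance.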
Most clusters are related to the higher levels of the QoS taxonomy. The first
cluster comprises C1, C9, C12, C13, C14 and C18: These questions are related
to user satisfaction (overall impression, C1 and C9), the system’s utility (C12,
C13), task efficiency (reliability of task results, C14) and acceptability (C18).
The second cluster (C8, C11) relates to the usability and the ease of using
the system. Question C8 also addresses the meta-communication handling
capability. Cluster 3 (C2, C3) reflects the system personality (politeness, clarity
of expression). Cluster 4 (C10, C16) is once again related to usability and user
satisfaction (ease of use, degree of enjoyment). The fifth cluster captures the
system’s interaction capabilities (initiative and guidance; C4 and C7). Cluster
6 describes the system’s task (task success, C5) and meta-communication (C6)
capabilities. The final two questions (C15, C17) reflect the added value provided
by the service, and are thus also related to the service efficiency category.
The part C questions have also been associated with the categories of the QoS
taxonomy; see Figure 6.1 and Tables 6.5 and 6.6.
Figure 6.3. Hierarchical cluster analysis of part C question ratings in experiment 6.3. Dendrogram using average linkage between groups.
Similar to the factor analysis, the cluster analysis shows that many questions
of part B and part C of the experiment 6.3 questionnaire group into categories
which have been previously postulated by the QoS taxonomy. Part B questions can mainly be associated with the lower levels of the taxonomy, up to
communication efficiency, comfort and, to some extent, task efficiency. On the
other hand, part C questions mostly reflect the higher levels of the taxonomy,
namely service efficiency, usability, utility and acceptability. User satisfaction
is covered by both part B and part C questions. The relationship shown in
Figure 6.1 will be used in Section 6.2.4 to identify subjective ratings which can
be associated with specific quality aspects.
The results of multidimensional analyses give some indication of the relevance of individual quality aspects for the user, in that they show which dimensions of the perceptual space can be distinguished. The relevance may
additionally be investigated by directly asking the users which characteristics
of a system they rate as important or not important. This was done in Question
4 (4.1-4.15) of experiment 6.2, and Questions A8 and C22 of experiment 6.3.
The data from experiment 6.2, which will be discussed here, have been ranked
with respect to the number of ratings in the most positive category and,
in case of equality, with respect to the accumulated positive answers to the statements (the two
categories close to the “agree” label) minus the accumulated number
of negative answers (the two categories close to the “disagree” label).
The resulting rank order is depicted in Table 6.7.
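The ranking rule just described can be sketched in a few lines. This is a minimal sketch under assumptions: the answer scale is taken to have five categories ordered from strongest agreement to strongest disagreement (the text only mentions two categories near each label), the statement labels and counts are invented, and `rank_statements` is a hypothetical helper name.

```python
def rank_statements(counts):
    """Rank statements by the scheme described above.

    counts maps a statement label to answer counts on an assumed five-point
    scale, ordered from strongest agreement to strongest disagreement.
    """
    def sort_key(item):
        _, c = item
        # Primary criterion: number of ratings in the most positive category.
        most_positive = c[0]
        # Tie-breaker: two "agree" categories minus two "disagree" categories.
        net_positive = (c[0] + c[1]) - (c[3] + c[4])
        # Negate both values so that larger ones sort first.
        return (-most_positive, -net_positive)

    return [label for label, _ in sorted(counts.items(), key=sort_key)]


# Invented answer counts (illustration only).
example_counts = {
    "statement A": (10, 3, 2, 2, 1),
    "statement B": (12, 0, 0, 3, 3),
    "statement C": (10, 5, 2, 1, 0),
}
print(rank_statements(example_counts))
```

In this invented example, statement B ranks first because it has the most ratings in the top category, while statements A and C tie on that criterion and are separated by the net number of positive answers.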
The rank order shows that manner, transparency and relevance, and partly
also meta-communication handling and interaction control seem to be of major
importance to the users. The result may be partly linked to the particularities
of the BoRIS system (repetition capability, modification capability), but the
three major aspects – manner, transparency and relevance – will be of general
importance for other applications as well. They are all related to the basic
communicative and functional capabilities of the system (service aspects have
not been addressed by questions 4.1 to 4.15). The highest ranking is observed
for the speech input and output capabilities, which are the basic requirement for
the interaction with an SDS. The overall system quality seems to be largely affected by a relatively low intelligibility of the TTS speech output. Transparency
subsumes the transparency of how to use the system, as well as its functional
capabilities. This quality aspect seems to reflect whether the user knows what
to say to the system at each step in the dialogue, in which format, as well as the
system’s navigation (modification, repetition and dialogue continuation) capabilities. A system that is not transparent
enough may cause discomfort and stress. Relevance can be defined on an utterance level (relevance of each
utterance in the immediate dialogue context) or on a global information (task)
level. In the qualitative interview, it turned out that the global information level
seems to pose problems with the current BoRIS version, due, in part, to database
problems, but also due to the low detail of information provided by the current
system version.
The user’s background knowledge and the level of experience play a role in
the judgment of overall quality. The qualitative interview of experiment 6.2
shows that test subjects who had no specific idea about such a system rated it
generally better than persons with a specific idea. In the questionnaire, high
expectations resulted mainly in more positive quality judgments after using the
system. This could clearly be observed for the judgments of the female test
subjects.
6.2.3 Multidimensional Analysis of Interaction Parameters
Apart from the users’ quality judgments, the interaction parameters will also
be related to each other. Such relations – if they are known – can be
used to define meaningful evaluation metrics, and to interpret the influences of
individual system components. This section will give a brief overview of
relationships which are reported in the literature and present the results of a
factor and cluster analysis of the data collected in experiment 6.3. A deeper
analysis with respect to the QoS taxonomy follows in the subsequent section.
A number of analyses report the obvious relationship between dialogue duration DD and turn-related parameters. For example, Polifroni et al. (1992)
found that the overall number of user queries correlates highly with DD.
The correlation between DD and the number of unanswered user
queries was considerably lower. The different problem-solving
strategies applied in the case of misunderstandings probably have a significant
impact on the duration of the interactions. Sikorski and Allen (1997) investigated the correlation between dialogue duration and recognition accuracy. The
correlation turned out to be unexpectedly low. The authors indicate
three potential reasons for this finding:
(1) A robust parsing strategy, which makes it more important which words are
correctly recognized than how many.
(2) Misunderstandings, i.e. the system taking an action based on erroneous
understanding, seem to be more detrimental to task success than non-understanding, where both the system and the user are aware of the situation.
A system which is robust in this respect (i.e. one that tries to form an interpretation even when there is low confidence in the input) can create a high
variance in the effectiveness of an interaction, and thus in the length of the
interaction.
(3) A certain amount of nondeterminism (random behavior) in the system implementation, which could not be compensated for by the small number of
test subjects.
Thus, the dialogue strategy may be a determining factor of dialogue duration,
although the number of turns remains an important predictor.
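How such relationships between interaction parameters might be checked on logged dialogues can be sketched as follows. The sketch assumes the Pearson coefficient (the cited studies' choice of coefficient is not stated here) and uses invented per-dialogue values, so the resulting numbers carry no empirical meaning; all variable names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Invented logs of five dialogues (illustration only).
dialogue_duration_s = [95, 120, 150, 180, 240]   # DD in seconds
user_turns = [6, 8, 9, 12, 15]                   # number of user turns
word_accuracy = [0.80, 0.65, 0.85, 0.70, 0.75]   # recognition accuracy

r_turns = pearson(dialogue_duration_s, user_turns)
r_accuracy = pearson(dialogue_duration_s, word_accuracy)

print(f"DD vs. user turns:    r = {r_turns:.2f}")
print(f"DD vs. word accuracy: r = {r_accuracy:.2f}")
```

The invented data are chosen to mirror the pattern reported above: duration tracks the number of turns closely, while its correlation with recognition accuracy is weak.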
Several parameters indicate speech input performance on different levels.
Gerbino et al. (1993) compared absolute figures for correctly understood sentences in a field test (30.4% correct, 21.3% failed, 39.7% incorrect) to the ones
in a laboratory situation (72.2% correct, 11.3% failed, 16.5% incorrect). Obviously, the field test situation was considerably more difficult for the recognizer
than a laboratory situation. For the field test situation, the figures can be compared to the recognition accuracy (SA = 14.0%, WA = 52.4%). It turns out
that the understanding error rate lies roughly midway between the word and
sentence error rates.
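The comparison can be retraced with the field-test figures quoted above. The short calculation below is only a worked check, and it rests on one assumption: every sentence that was not understood correctly (failed or incorrect) counts as an understanding error.

```python
# Field-test figures quoted above from Gerbino et al. (1993), in percent.
word_accuracy = 52.4          # WA
sentence_accuracy = 14.0      # SA
understood_correctly = 30.4   # correctly understood sentences

word_error = 100.0 - word_accuracy
sentence_error = 100.0 - sentence_accuracy
# Assumption: every sentence not understood correctly counts as an error.
understanding_error = 100.0 - understood_correctly
# Midpoint between word and sentence error rates, for comparison.
midpoint = (word_error + sentence_error) / 2.0

print(f"word error rate:          {word_error:.1f}%")
print(f"sentence error rate:      {sentence_error:.1f}%")
print(f"understanding error rate: {understanding_error:.1f}%")
print(f"midpoint of word/sentence error rates: {midpoint:.1f}%")
```

Under this assumption the understanding error rate (69.6%) falls between the word error rate (47.6%) and the sentence error rate (86.0%), within a few percentage points of their midpoint (66.8%).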
The relation between ASR performance (WA) and speech understanding
performance (CA) was also investigated by Boros et al. (1996). The two measures can differ considerably, because WA does not distinguish between
functional words and filler words. Thus, perfect CA can be reached without
perfect WA. On the other hand, CA may become lower than WA when words
which are relevant for understanding are missing in the system’s interpretation.
Results from a test corpus recorded over the public telephone network how-