Quality of Telephone-Based Spoken Dialogue Systems, part 7


all relate to the system’s output voice (dimensions intelligibility, friendliness

and voice naturalness). The friendliness of the system thus seems to be highly

related to its voice. The final dimension ‘clarity of information’ does not form

a cluster with any of the other questions.

These clusters can now be interpreted in terms of the QoS taxonomy. The ‘personal

impression’ cluster is mainly related to comfort, the ‘pleasantness’ question

(B24) to user satisfaction as well. Cluster 2 (dialogue smoothness, B19 and

B21) forms one aspect of communication efficiency. The global quality aspects

covered by questions B0 and B23 (Cluster 3) mainly relate to user satisfaction.

The strong influence of the ‘perceived system understanding’ question (B5) on

this dimension has already been noted. This question is, however, located in the

speech input/output quality category of the QoS taxonomy. Cluster 4 is related

to system behavior (B9, B10 and B11), and can be attributed to dialogue cooperativity; question B10 can also be attributed to dialogue symmetry. The questions addressing

interaction flexibility (B13 and B14) belong to the dialogue symmetry category.

‘Naturalness’ (B12 and B18) is once again related to both dialogue cooperativity and dialogue symmetry. These two categories cannot be clearly separated

with respect to the user questions. Questions B15, B17 and B20 all reflect communication efficiency. Cluster 8, related to informativeness (B1, B2 and B4),

is attributed to the dialogue cooperativity category. This is not true for Cluster

9 (B6 and B8): whereas B8 is part of dialogue cooperativity, B6 fits best into

the comfort category. Cluster 10 (B7, B16 and B22) is mainly related to the

speech output quality category. However, question B16 also reflects the agent

personality aspect, and thus the comfort category. The stand-alone question B3

is part of the dialogue cooperativity category.

A similar analysis can be used for the judgments on the part C questions

of experiment 6.3, namely questions C1 to C18 (the rest of the questions have

either free answer possibilities or are related to the user’s expectations about

what is important for the system). A hierarchical cluster analysis leads to the

dendrogram which is shown in Figure 6.3.
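
The grouping step behind such a dendrogram can be sketched as follows. This is a minimal illustration of agglomerative clustering with average linkage between groups, not the analysis actually run on the experiment 6.3 data. The function name `average_linkage`, the four question labels and all distance values are hypothetical, and the text does not state which distance measure was applied to the ratings.

```python
from itertools import combinations


def average_linkage(dist, labels):
    """Agglomerative clustering with average linkage between groups.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- one name per row of dist
    Returns the merge history as (members_a, members_b, distance) tuples.
    """

    def between(a, b):
        # Mean of all pairwise distances between members of clusters a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    # Each cluster is a tuple of original item indices.
    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((
            sorted(labels[i] for i in a),
            sorted(labels[i] for i in b),
            round(between(a, b), 3),
        ))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical distances between four questions (invented values).
labels = ["C2", "C3", "C4", "C7"]
dist = [
    [0.0, 0.2, 0.8, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
merges = average_linkage(dist, labels)
for members_a, members_b, d in merges:
    print(members_a, "+", members_b, "at distance", d)
```

Each entry of the merge history corresponds to one junction of the dendrogram; the distance at which two groups join is the height of that junction.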

Most clusters are related to the higher levels of the QoS taxonomy. The first

cluster comprises C1, C9, C12, C13, C14 and C18: These questions are related

to user satisfaction (overall impression, C1 and C9), the system’s utility (C12,

C13), task efficiency (reliability of task results, C14) and acceptability (C18).

The second cluster (C8, C11) relates to the usability and the ease of using

the system. Question C8 will also address the meta-communication handling

capability. Cluster 3 (C2, C3) reflects the system personality (politeness, clarity

of expression). Cluster 4 (C10, C16) is once again related to usability and user

satisfaction (ease of use, degree of enjoyment). The fifth cluster captures the

system’s interaction capabilities (initiative and guidance; C4 and C7). Cluster

6 describes the system’s task (task success, C5) and meta-communication (C6)

capabilities. The final two questions (C15, C17) reflect the added value provided

by the service, and are thus also related to the service efficiency category.

The part C questions have also been associated with the categories of the QoS

taxonomy, see Figure 6.1 and Tables 6.5 and 6.6.

Figure 6.3. Hierarchical cluster analysis of part C question ratings in experiment 6.3. Dendrogram using average linkage between groups.

Similar to the factor analysis, the cluster analysis shows that many questions

of part B and part C of the experiment 6.3 questionnaire group into categories

which have been previously postulated by the QoS taxonomy. Part B questions can mainly be associated with the lower levels of the taxonomy, up to

communication efficiency, comfort and, to some extent, task efficiency. On the

other hand, part C questions mostly reflect the higher levels of the taxonomy,

namely service efficiency, usability, utility and acceptability. User satisfaction

is covered by both part B and part C questions. The relationship shown in

Figure 6.1 will be used in Section 6.2.4 to identify subjective ratings which can

be associated with specific quality aspects.

The results of multidimensional analyses give some indication of the relevance of individual quality aspects for the user, in that they show which dimensions of the perceptual space can be distinguished. The relevance may

additionally be investigated by directly asking the users which characteristics

of a system they rate as important or not important. This was done in Question

4 (4.1-4.15) of experiment 6.2, and Questions A8 and C22 of experiment 6.3.

The data from experiment 6.2, which will be discussed here, have been ranked

with respect to the number of ratings in the most positive category and,

in case of equality, to the accumulated positive answers to the statements (the two

categories close to the “agree” label) minus the accumulated number

of negative answers (the two categories close to the “disagree” label).

The resulting rank order is depicted in Table 6.7.
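
The ranking rule described above can be sketched as a two-level sort key. This is an illustration only: the statement names and rating counts are invented, not data from experiment 6.2, and the sketch assumes that each statement has a count for each of the two "agree" and the two "disagree" categories.

```python
# Hypothetical rating counts per statement, in the order
# (most positive, positive, negative, most negative).
counts = {
    "statement A": (12, 5, 2, 1),
    "statement B": (12, 3, 4, 1),
    "statement C": (15, 1, 3, 1),
    "statement D": (9, 9, 1, 1),
}


def rank_key(c):
    most_positive, positive, negative, most_negative = c
    # Tie-breaker: accumulated positive minus accumulated negative answers.
    balance = (most_positive + positive) - (negative + most_negative)
    # Negate both values so that larger counts sort first.
    return (-most_positive, -balance)


ranked = sorted(counts, key=lambda name: rank_key(counts[name]))
for position, name in enumerate(ranked, start=1):
    print(position, name, counts[name])
```

In this invented data, statements A and B share the same number of ratings in the most positive category, so their order is decided by the balance of positive and negative answers.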

The rank order shows that manner, transparency and relevance, and partly

also meta-communication handling and interaction control seem to be of major

importance to the users. The result may be partly linked to the particularities

of the BoRIS system (repetition capability, modification capability), but the

three major aspects – manner, transparency and relevance – will be of general

importance for other applications as well. They are all related to the basic

communicative and functional capabilities of the system (service aspects have

not been addressed by questions 4.1 to 4.15). The highest ranking is observed

for the speech input and output capabilities, which is the basic requirement for

the interaction with an SDS. The overall system quality seems to be largely affected by a relatively low intelligibility of the TTS speech output. Transparency

subsumes the transparency of how to use the system, as well as its functional

capabilities. This quality aspect seems to reflect whether the user knows what

to say to the system at each step in the dialogue, in which format, as well as the

system’s navigation (modification, repetition and dialogue continuation) capabilities. It may result in discomfort and stress if the system is not transparent

enough. Relevance can be defined on an utterance level (relevance of each

utterance in the immediate dialogue context) or on a global information (task)

level. In the qualitative interview, it turned out that the global information level

seems to pose problems with the current BoRIS version, due, in part, to database

problems, but also due to the low detail of information provided by the current

system version.

The user’s background knowledge and the level of experience play a role in

the judgment of overall quality. The qualitative interview of experiment 6.2

shows that test subjects who had no specific idea about such a system rated it

generally better than persons with a specific idea. In the questionnaire, high

expectations resulted mainly in more positive quality judgments after using the

system. This could clearly be observed for the judgments of the female test

subjects.

6.2.3 Multidimensional Analysis of Interaction Parameters

Apart from the users’ quality judgments, the interaction parameters

will also be related to each other. Such relations – if they are known – can be

used to define meaningful evaluation metrics, and to interpret the influences of

individual system components. This section will give a brief overview of

relationships which are reported in the literature and present the results of a

factor and cluster analysis of the data collected in experiment 6.3. A deeper

analysis with respect to the QoS taxonomy follows in the subsequent section.


A number of analyses report the obvious relationship between dialogue duration DD and turn-related parameters. For example, Polifroni et al. (1992)

found that the overall number of user queries correlates highly with DD.

The correlation between DD and the number of unanswered user

queries was considerably lower. The different problem-solving

strategies applied in the case of misunderstandings probably have a significant

impact on the duration of the interactions. Sikorski and Allen (1997) investigated the correlation between dialogue duration and recognition accuracy. The

correlation turned out to be unexpectedly low. The authors indicate

three potential reasons for this finding:

- A robust parsing strategy, which makes it more important which words are correctly recognized than how many.

- Misunderstandings, i.e. the system taking an action based on erroneous understanding, seem to be more detrimental to task success than non-understanding, where both the system and the user are aware of the situation. A system which is robust in this respect (i.e. one that tries to form an interpretation even when there is low confidence in the input) can create a high variance in the effectiveness of an interaction, and thus in the length of the interaction.

- A certain amount of nondeterminism (random behavior) in the system implementation, which could not be compensated for by the small number of test subjects.

Thus, the dialogue strategy may be a determining factor of dialogue duration,

although the number of turns remains an important predictor.
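
How such a relationship between dialogue duration and a turn-related parameter might be quantified can be sketched with a Pearson correlation coefficient. The cited studies do not necessarily use this coefficient; it is assumed here for illustration. The function name `pearson` and all per-dialogue values are hypothetical and are not taken from any of the studies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-dialogue values (invented for illustration).
user_turns = [8, 12, 15, 20, 25]            # number of user turns
duration_s = [60, 95, 110, 160, 190]        # dialogue duration DD in seconds
word_accuracy = [0.70, 0.55, 0.80, 0.60, 0.75]

r_turns = pearson(user_turns, duration_s)
r_wa = pearson(word_accuracy, duration_s)
print("DD vs. user turns:    r =", round(r_turns, 2))
print("DD vs. word accuracy: r =", round(r_wa, 2))
```

With so few dialogues a coefficient is very unstable, which matches the authors' remark about the small number of test subjects.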

Several parameters indicate speech input performance on different levels.

Gerbino et al. (1993) compared absolute figures for correctly understood sentences in a field test (30.4% correct, 21.3% failed, 39.7% incorrect) to the ones

in a laboratory situation (72.2% correct, 11.3% failed, 16.5% incorrect). Obviously, the field test situation was considerably more difficult for the recognizer

than a laboratory situation. For the field test situation, the figures can be compared to the recognition accuracy (SA = 14.0%, WA = 52.4%). It turns out

that the understanding error rate is approximately in the middle of the word and

sentence error rates.
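
The comparison can be retraced from the field test figures quoted above. The sketch assumes that each error rate is simply 100% minus the corresponding accuracy, and that every sentence not counted as correctly understood is an understanding error; both are simplifying assumptions, not definitions taken from Gerbino et al. (1993).

```python
# Field test figures quoted in the text (percentages).
understood_correct = 30.4   # correctly understood sentences
sentence_accuracy = 14.0    # SA
word_accuracy = 52.4        # WA

# Assumption: error rate = 100% minus the corresponding accuracy.
understanding_error = round(100 - understood_correct, 1)
sentence_error = round(100 - sentence_accuracy, 1)
word_error = round(100 - word_accuracy, 1)

# Midpoint between the word and the sentence error rate.
midpoint = round((word_error + sentence_error) / 2, 1)

print("word error rate:         ", word_error)
print("understanding error rate:", understanding_error)
print("sentence error rate:     ", sentence_error)
print("midpoint of word/sentence:", midpoint)
```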

The relation between ASR performance (WA) and speech understanding

performance (CA) was also investigated by Boros et al. (1996). Both measures can differ considerably, because WA does not make a difference between

functional words and filler words. Thus, perfect CA can be reached without

perfect WA. On the other hand, CA may become lower than WA when words

which are relevant for understanding are missing in the system’s interpretation.

Results from a test corpus recorded over the public telephone network how-
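
The difference between the two measures can be illustrated with a small sketch. It assumes that WA is computed as 1 minus the number of substituted, deleted and inserted words divided by the number of reference words, and that CA is computed in the same way on attribute-value pairs. The helper names, the utterances and the concept slots are hypothetical, and the CA computation is a simplification of how concept accuracy is determined in practice.

```python
def edit_errors(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the token list ref into hyp (Levenshtein distance)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def word_accuracy(ref_text, hyp_text):
    ref, hyp = ref_text.split(), hyp_text.split()
    return 1 - edit_errors(ref, hyp) / len(ref)


def concept_accuracy(ref, hyp):
    """Simplified CA on attribute-value pairs (dicts)."""
    subs = sum(1 for k in ref if k in hyp and hyp[k] != ref[k])
    dels = sum(1 for k in ref if k not in hyp)
    ins = sum(1 for k in hyp if k not in ref)
    return 1 - (subs + dels + ins) / len(ref)


reference = "i would like an italian restaurant in the centre please"
ref_concepts = {"cuisine": "italian", "location": "centre"}

# Case 1: only filler words are misrecognized.
wa_1 = word_accuracy(reference, "i would like a italian restaurant in centre")
ca_1 = concept_accuracy(ref_concepts, {"cuisine": "italian", "location": "centre"})

# Case 2: a single word is wrong, but it carries a concept.
wa_2 = word_accuracy(reference, "i would like an indian restaurant in the centre please")
ca_2 = concept_accuracy(ref_concepts, {"cuisine": "indian", "location": "centre"})

print("case 1: WA =", round(wa_1, 2), "CA =", round(ca_1, 2))
print("case 2: WA =", round(wa_2, 2), "CA =", round(ca_2, 2))
```

In the first case CA is perfect although WA is not; in the second case one misrecognized content word leaves CA below WA.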
