[Paper Review] Evaluating Intention Detection Capability of Large Language Models in Persuasive Dialogues (ACL 2024)
Summary: The study explores intention detection in persuasive multi-turn dialogues using large language models, highlighting the importance of conversational history and incorporating face acts to analyze intentions effectively.
1 Introduction
- Identifying speaker intentions is essential for smooth conversations.
- Example: Alice asks Bob for a charity donation; Bob’s evasive response indicates hesitance without outright refusal.
- Speaker intentions can be conveyed indirectly and vary by context.
- People instinctively estimate these intentions, which is vital for natural communication.
- Recent advancements in large language models (LLMs) like ChatGPT and GPT-4 facilitate human-like dialogue.
- Ongoing research focuses on developing dialogue systems incorporating LLMs.
- LLMs are applied in real-world scenarios and may effectively detect speaker intentions.
- Existing datasets like GLUE evaluate LLMs’ understanding of natural language but lack focus on intention detection.
- This study introduces a new dataset for measuring LLMs’ ability to identify intentions in persuasive conversations.
- The dataset features multiple-choice questions that consider the context of past utterances.
- Persuasive conversations often require careful consideration of other parties’ feelings and perspectives.
- The dataset utilizes the concept of “face” (Goffman, 1967) to assess intentions linked to social relationships.
- Grouping intentions by face improves the clarity of the analysis and the insights drawn from it.
- The research includes an assessment of LLMs’ intention detection capabilities and identifies particularly challenging types of intentions.
- Contributions:
- Developed a dataset for evaluating intention detection from persuasive dialogues.
- Evaluated state-of-the-art LLMs like GPT-4 and ChatGPT on their ability to detect utterance intentions, highlighting their mistakes and difficulties.
2 Background
- Explanation of face and face acts.
- Overview of existing dialogue data used in the research.
- Discussion of previous studies on:
- Dialogue comprehension.
- Intention detection.
2.1 Face and Face Acts
- Definition of Face:
- Face is a primary human need related to social relationships.
- Introduced by Erving Goffman in 1967.
- Politeness Theory:
- Developed by Brown and Levinson to analyze verbal behaviors affecting face.
- Systematizes face-related behaviors as politeness strategies.
- Types of Face:
- Positive Face:
- The desire to be recognized, admired, and liked by others.
- Negative Face:
- The desire to maintain one’s freedom and autonomy.
- Face Acts:
- Utterances that affect the face of oneself or others.
- Face Threatening Act (FTA):
- Acts that threaten face.
- Face Saving Act (FSA):
- Acts that aim to preserve face, e.g., praising or alleviating burdens.
- Politeness Strategy:
- People typically avoid threatening face in order to maintain relationships.
- If they must threaten face, they use strategies to minimize the threat (e.g., implying needs, apologizing).
- Application in Dialogue:
- Dutt et al. (2020) applied face acts to analyze persuasive dialogues.
- Face acts are crucial for successful persuasion.
- Machine Learning Model:
- Developed to track conversation dynamics using face acts and history.
- Face acts are categorized based on three criteria (encoded in the sketch after this list):
- Direction: Speaker or hearer (s/h).
- Type: Positive or negative face (pos/neg).
- Action: Saved or attacked face (+/-).
- Example of Persuasive Situation:
- Persuader (ER): Person attempting to change the mind of the other.
- Persuadee (EE): Person whose mind is being changed.
- Requesting something from EE is a face act categorized as hneg- (threatens EE’s freedom).
- Validating an argument supports positive face, categorized as spos+.
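To make the three-criteria labeling above concrete, here is a minimal Python sketch (my own illustration, not code from the paper; the class and field names are assumptions) that decodes a face act label such as hneg- into its direction, face type, and polarity:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FaceAct:
    direction: str  # "s" = speaker's own face, "h" = hearer's face
    face: str       # "pos" = positive face, "neg" = negative face
    polarity: str   # "+" = face is saved/supported, "-" = face is attacked

    @classmethod
    def parse(cls, label: str) -> "FaceAct":
        # e.g. "hneg-" -> direction="h", face="neg", polarity="-"
        return cls(direction=label[0], face=label[1:-1], polarity=label[-1])

# Requesting a donation from EE threatens the hearer's negative face:
print(FaceAct.parse("hneg-"))  # FaceAct(direction='h', face='neg', polarity='-')
# Validating an argument supports the speaker's positive face:
print(FaceAct.parse("spos+"))  # FaceAct(direction='s', face='pos', polarity='+')
```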
2.2 Dataset Annotated with Face Acts
- The dataset is created by Dutt et al. (2020) and focuses on dialogues related to persuading donations to Save the Children (STC).
- There are two main participants in the dialogues:
- Persuader (ER): The individual seeking to convince another to donate.
- Persuadee (EE): The individual being persuaded.
- The dataset includes various utterances, with some categorized as “other,” which includes greetings, fillers, and unrelated remarks.
- The dialogues were sourced from Wang et al. (2019), and only one face act label is assigned to each utterance.
- It is acknowledged that some utterances could have multiple face acts, but these instances make up only 2% of the dataset.
- To simplify annotation, only one face act is randomly selected from the possible options and used as the gold label.
- Example dialogue from the dataset illustrates the interaction:
- ER asks about interest in donating.
- EE inquires about the charity.
- ER provides information about STC.
- EE reveals limited prior knowledge about the charity.
- Face act labels appearing in the example include:
- “hneg-” for the request that constrains EE’s freedom (e.g., asking about a donation).
- “hpos+” for remarks acknowledging or validating the other party.
- “spos+” for ER’s remarks presenting STC and its work favorably.
2.3 Intention Detection
- Importance in Dialogue Systems:
- Essential for task-oriented dialogue systems to understand user objectives.
- Systems must classify utterances to determine if they fall within their operational domain.
- Intention Detection Tasks:
- Typically involve classifying utterances into predefined intention labels.
- Examples of specific domains include travel and banking, while some datasets encompass multiple domains.
- Representative Datasets:
- SNIPS is highlighted as a significant dataset in the field.
- Pretrained language models such as GPT-2 achieve high performance on intention detection tasks.
- Model Limitations:
- Many existing studies focus on intent prediction solely from individual utterances, without incorporating conversational context.
- Contextual Studies:
- A few studies consider contextual information for intention detection.
- Cui et al. (2020) developed a dataset to evaluate dialogue understanding by focusing on next utterance predictions within conversational settings.
- Persuasive Conversations:
- Dutt et al. (2020) introduced a model that incorporates conversational context to predict intentions in persuasive dialogues.
- They focused on face acts as intention labels, showing the model’s capability in intention detection but did not use LLMs.
- Unexplored Areas:
- There is a lack of research on how well LLMs can understand intentions in multi-turn persuasive dialogues.
3 Data
- Previous studies on intention detection often did not utilize multi-turn dialogue data.
- The persuasive dialogue dataset by Dutt et al. (2020) frames intention detection as predicting face acts from utterances, but face acts are abstract labels that are not intuitive for humans.
- Face acts are likely insufficiently learned by LLMs due to their infrequency in pretraining data.
- To effectively evaluate LLMs’ intention detection capability, it’s necessary to modify the approach for zero-shot or few-shot scenarios.
- The study transforms face acts into understandable intention descriptions in natural language.
- Each dataset entry consists of conversational history and four intention descriptions for the last utterance.
- The task requires selecting one description from the four options, akin to a reading comprehension format (a hypothetical example entry is sketched after this list).
- This format is inspired by past dialogue reasoning studies and is commonly used to assess LLMs’ reasoning abilities.
- The persuasive dialogue dataset was partitioned into training, development, and test subsets in an 8:1:1 ratio, focusing solely on the test subset for evaluation.
- The section details the development process of the evaluation dataset, including:
- Definition of the intention descriptions annotated onto utterances.
- Annotation process through crowdsourcing for each utterance.
- Selection of distractors to create the four option choices.
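As a rough illustration of this multiple-choice format, the entry below shows what one item might look like; the field names and texts are hypothetical and do not reproduce the released data’s schema:

```python
# A hypothetical evaluation entry: conversational history plus four candidate
# intention descriptions for the last utterance, with one gold option.
example_entry = {
    "dialogue_id": "stc_0012",
    "history": [
        {"speaker": "ER", "text": "Would you be interested in donating to Save the Children?"},
        {"speaker": "EE", "text": "Could you tell me more about the charity first?"},
    ],
    "options": {
        "A": "The speaker asks for information before deciding whether to donate.",
        "B": "The speaker refuses to donate.",
        "C": "The speaker praises the charity's work.",
        "D": "The speaker greets the other party.",
    },
    "gold": "A",
}
```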
3.1 Preparation of Intention Description
- Dutt et al. (2020) provided various intention descriptions in persuasive contexts alongside associated face acts.
- Adaptation and expansion of these descriptions were conducted.
- The descriptions were annotated to align with specific utterances.
- New descriptions were created to cover all utterances in the development data.
- Broader intention descriptions were refined into more specific ones.
- A total of 42 intention descriptions were curated and presented in Table 2.
3.2 Intention Annotation
- Selected 30 dialogues from the persuasion dialogue dataset for test data.
- Annotated intention descriptions onto utterances, focusing on the face act labels by Dutt et al. (2020) because these acts affect the other party’s face.
- Used crowdworkers from the US via Amazon Mechanical Turk (AMT) for the annotation process.
- Provided fair compensation, with an average hourly wage of $12 for workers.
- Conducted three rounds of pilot tests to refine instructions and select high-quality annotators.
- Finalized instructions for annotation available in Appendix A.
- Workers read entire conversations and assigned intention descriptions from a provided set of candidate descriptions.
- Descriptions were categorized under the same face act as the utterance.
- Example: For an EE’s utterance with face act hpos-, possible intentions included doubt or lack of interest.
- Each utterance was annotated by three workers, resulting in three intention descriptions per utterance.
- Majority vote determined final annotation; gold labels were assigned if there was agreement from at least two workers.
- A total of 691 utterances were annotated, with 620 receiving agreement from at least two annotators.
- Developed an intention classification problem for the 620 agreed-upon utterances.
- Measured annotator agreement using Krippendorff’s alpha, yielding a value of 0.406, indicating moderate agreement (a sketch of the aggregation and agreement computation follows this list).
- Additional details on annotator agreement are found in Appendix B.
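A minimal sketch of the aggregation described above, assuming three nominal labels per utterance; the majority-vote helper and the nominal Krippendorff's alpha follow the standard coincidence-matrix formulation and are not the authors' code:

```python
from collections import Counter

def majority_label(ratings, min_agreement=2):
    """Return the majority label, or None if fewer than min_agreement workers agree."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= min_agreement else None

def krippendorff_alpha_nominal(ratings_per_item):
    """Krippendorff's alpha for nominal labels via the coincidence matrix.
    ratings_per_item: list of label lists, one inner list per annotated utterance."""
    coincidences = Counter()  # (label_a, label_b) -> pairable weight
    for ratings in ratings_per_item:
        m = len(ratings)
        if m < 2:
            continue  # items with a single rating are not pairable
        for i, a in enumerate(ratings):
            for j, b in enumerate(ratings):
                if i != j:
                    coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()  # marginal weight per label
    for (a, _b), w in coincidences.items():
        totals[a] += w
    n = sum(totals.values())
    # Observed vs. expected disagreement (nominal distance: 0 if equal, else 1).
    d_o = sum(w for (a, b), w in coincidences.items() if a != b) / n
    d_e = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e

# Toy usage: three workers per utterance.
ratings = [["doubt", "doubt", "no interest"], ["praise", "praise", "praise"]]
print([majority_label(r) for r in ratings])           # ['doubt', 'praise']
print(round(krippendorff_alpha_nominal(ratings), 3))  # agreement across both items
```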
3.3 Question Creation
- Data Collection: 620 utterances were initially annotated with intention descriptions.
- Utterance Concatenation:
- Consecutive utterances sharing the same intention descriptions were combined.
- This step is crucial as some intentions become clear only after listening to subsequent utterances.
- Outcome: The concatenation process resulted in 549 utterances annotated with intention descriptions.
- Question Formation:
- Multi-choice questions were created from these 549 utterances.
- For each utterance, three distractors were randomly selected from a predefined description pool (a simplified sketch follows this list).
- Appendices:
- Appendix C provides additional details on the utterance concatenation process.
- Appendix D outlines the rules for the distractor selection process.
- Data Statistics: Table 3 includes specific statistics about the data used.
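The snippet below is only a simplified sketch of the question-building step, under the assumption that distractors merely need to differ from the gold description; the paper's actual selection rules are given in its Appendix D:

```python
import random

def build_question(history, gold_description, description_pool, seed=0):
    """Build one four-option question: the gold intention description plus three
    randomly sampled distractors from the curated description pool."""
    rng = random.Random(seed)
    candidates = [d for d in description_pool if d != gold_description]
    distractors = rng.sample(candidates, 3)  # simplified stand-in for the Appendix D rules
    options = distractors + [gold_description]
    rng.shuffle(options)
    return {
        "history": history,
        "options": dict(zip("ABCD", options)),
        "gold": "ABCD"[options.index(gold_description)],
    }
```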
4 Experiment
- The goal is to evaluate how effectively large language models (LLMs) detect intentions in persuasive dialogues.
- Various sizes of LLMs were tested to observe the impact of model size on intention detection ability.
- Models used include:
- GPT-4 and ChatGPT from OpenAI
- Llama 2-Chat (Meta)
- Vicuna (LMSYS)
- Prompts provided to the LLMs contained:
- Information for detecting intentions (conversational context, task explanation, conversational script, and a four-option question).
- Designed in a zero-shot Chain-of-Thought style, dividing the answering process into two phases (a sketch of such a prompt appears after this list):
- Reason explanation: LLMs state whether intentions are explicit or implied.
- Option selection: LLMs choose the best option based on reasoning.
- Memory constraints limited history length to the past ten utterances for Llama 2-Chat and Vicuna models.
- Human performance benchmarking was conducted using workers from AMT (Amazon Mechanical Turk):
- Workers selected intention descriptions for the last utterance from four options, ensuring no prior knowledge of gold intention descriptions.
- Majority vote among three workers was used to determine the final answer.
- Key statistics:
- 549 questions across 30 dialogues
- Average of 18.3 questions and 30.8 turns per dialogue
- Average words per utterance: 11.99, per description: 10.61
- Model performance results:
- The smallest model achieved over 50% accuracy; GPT-4 exceeded 90%.
- Larger models consistently showed improved accuracy.
- Notably, LLMs struggled with identifying intentions categorized as hpos-.
- GPT-4 detected hpos- intentions correctly in only 1 out of 7 cases.
- The following subsections examine the issues faced by smaller LLMs and analyze the utterances on which LLMs, particularly GPT-4, struggled to detect intentions.
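A hedged sketch of how such a two-phase, zero-shot Chain-of-Thought query might be issued through the OpenAI chat API; the function and prompt wording are illustrative assumptions, not the paper's exact template:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_intention(dialogue_text: str, options: dict[str, str], model: str = "gpt-4") -> str:
    """Ask a chat model for the intention of the last utterance.
    Phase 1: explain whether the intention is explicit or implied; Phase 2: pick an option."""
    option_block = "\n".join(f"{key}. {text}" for key, text in options.items())
    prompt = (
        "The following is a persuasive conversation about a charity donation.\n"
        f"{dialogue_text}\n\n"
        "What is the intention of the last utterance?\n"
        f"{option_block}\n\n"
        "First, explain whether the intention is stated explicitly or only implied, and why. "
        "Then answer with a single letter (A, B, C, or D)."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```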
4.1 Behavior of Smaller LLMs
- Comparison with GPT-4:
- GPT-4 answered over 90% of questions correctly.
- Smaller models, including ChatGPT and Llama 2-Chat-70B, faced difficulties in inference.
- Problem Types:
- Intention-related Problems:
- Flawed interpretation of intention leads to incorrect answers.
- Smaller models occasionally overinterpret intentions:
- Example: GPT-4 accurately inferred intentions regarding donations, while smaller models incorrectly concluded that the speaker had no intention to donate, reading more into the conversation than it supports.
- Non-intention-related Problems:
- Issues like generation loops and misinterpreting utterances unrelated to the main task.
- Complexity of prompts poses comprehension challenges for smaller models.
- Logical inconsistencies in responses were prevalent:
- Examples show Llama 2-Chat-70B frequently selected the last answer option without proper evaluation.
- Although option D was the correct answer for 25.7% of questions, Llama 2-Chat-70B chose it 31.9% of the time, indicating a tendency to select the last option (see the sketch after this list).
- Inconsistencies and poor option selection significantly reduced the performance of smaller models.
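A small sketch of how the option-position check could be reproduced, assuming per-question records of the gold option and the model's chosen option (the record format is hypothetical):

```python
from collections import Counter

def option_position_bias(records):
    """Compare how often each letter is the gold answer vs. how often the model picks it.
    records: iterable of dicts like {"gold": "D", "predicted": "D"} (hypothetical format)."""
    records = list(records)
    gold = Counter(r["gold"] for r in records)
    pred = Counter(r["predicted"] for r in records)
    total = len(records)
    for letter in "ABCD":
        print(f"{letter}: gold in {100 * gold[letter] / total:.1f}% of questions, "
              f"chosen in {100 * pred[letter] / total:.1f}% of answers")
```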
4.2 About hpos-
- Weakness of LLMs:
- LLMs struggle particularly with interpreting “hpos-” utterances, in which:
- ER condemns EE’s hesitation to donate.
- EE expresses doubts about ER’s credibility.
- GPT-4’s Mistakes:
- Instances of misunderstanding intentions behind EE’s utterances are largely attributed to flawed questions highlighted in the limitations.
- Focus of examination is on how GPT-4 interprets ER’s criticisms of EE.
4.2.1 Patterns in Our Dataset
- Two Primary Patterns of Criticism by ER:
- ER questions EE’s spending habits, suggesting that the money could instead be redirected to charity (Save the Children).
- Example: ER asks about wasteful spending on junk food.
- ER brings up financially struggling individuals to elicit guilt in EE for inaction.
- Example: ER highlights children suffering due to lack of donations.
- GPT-4’s Recognition:
- GPT-4 incorrectly identified many of these utterances as non-critical or having different intentions, as detailed in subsequent tables.
4.2.2 Artificially Created Dataset
- Methodology:
- Created scenarios testing perception of critical utterances by generating persuasive dialogues where EE hesitates to donate.
- 90 utterances were judged for their critical nature by both GPT-4 and human annotators.
- Findings:
- Majority of human judgments categorized utterances as motivating rather than critical:
- 85 as ‘motivating’, 4 as ‘criticizing’, 1 as ‘confirming donation amount’.
- GPT-4’s interpretations closely aligned with human judgments for 87 utterances.
- Contrast Between Critical and Non-Critical:
- Utterances categorized as critical had a more sarcastic and overt tone, while non-critical utterances relied on emotional appeals.
- Emotional appeals were viewed as strategies to boost donations, whereas sarcastic remarks were recognized for their implicit critique.
- Further Inquiry:
- Why guilt-tripping strategies are perceived as motivating donations rather than as negative criticism is suggested as a key area for future exploration of human versus LLM judgments.
5 Conclusion
- The study assesses the capability of large language models (LLMs) to detect intentions in multi-turn persuasive dialogues.
- A dataset was created for evaluation, revealing limitations that highlight areas for improvement in intention detection methods.
- Key findings include:
- Inappropriate labeling may lead to incorrect intention representation due to limited label sets.
- Multiple interpretations of intentions can complicate the detection task, making singular answers insufficient.
- The specific dataset is not fully representative of various dialogue types, limiting the generalizability of findings.
- Future research should focus on improving dataset diversity and developing training data for fine-tuning LLMs.
- The study raises ethical concerns regarding the potential misuse of LLMs, particularly in persuading individuals or spreading misinformation.
- Acknowledgments were made for support received and the contributions of various individuals to the research.
6 Limitations
- Inappropriate Labeling:
- Some questions in the dataset could not be labeled appropriately due to the limited, predetermined label set.
- This leads to inaccurate intention descriptions based on misannotated face acts.
- Multiple Correct Answers:
- The dataset inevitably contains questions where utterances can express multiple correct intentions.
- As a result, models may select an intention description that is judged incorrect because no single option is uniquely correct.
- Insufficient Dataset for Evaluation:
- The dataset is sparse in face act distributions, lacking examples of less frequent intentions.
- The exclusive focus on persuasive dialogues limits generalizability; a diverse dataset is necessary for comprehensive analysis.
- Bias in Generated Data:
- The additional experiment used GPT-4 generated conversations, which may reflect biases inherent in the model.
- This could compromise the validity of the findings based on artificial conversation data.
- Potential Ethical Concerns:
- While the study investigates LLMs’ intention detection, insights may later be misused in applications with significant ethical implications.
- The risk of LLMs misleading individuals or spreading misinformation exists if intent detection capabilities are exploited.
- Impact of LLMs’ Knowledge and Biases:
- Results may be affected by the knowledge and various biases embedded in the LLMs employed in the study.
7 Ethical Considerations
- The study evaluates LLMs’ intention detection capabilities, and no immediate severe ethical implications are anticipated from its findings.
- If LLMs become precise in detecting intentions, they may be widely deployed as interactive agents across various fields.
- Potential misuse of LLMs:
- Malicious entities could use them to manipulate people’s intentions and deceive them, posing risks of fraud.
- Ability to disseminate misinformation, especially on social media, could result in widespread public confusion.
- This research involves models like ChatGPT and GPT-4, which inherently carry biases and embedded knowledge that may influence the results.
- Careful monitoring is necessary as the technology develops and integrates into dialogue systems.
Reader Feedback
- The study involves a significant psychological component, which makes the mathematical analysis seem somewhat lacking.
- It would be beneficial to incorporate mathematical analysis using Explainable AI (XAI) methodologies.