
This paper presents an offline reinforcement learning (RL) framework for improving persona consistency in dialogue systems. It combines the benefits of supervised learning and RL, introduces a novel variance-reducing importance sampling method (VaRMI), and demonstrates improved performance in both persona consistency and dialogue quality.


1 Introduction


  • Rapid advancements in large language models have enabled the development of dialogue agents that generate fluent and natural responses.
  • These dialogue systems are trained on vast amounts of unlabeled text data and fine-tuned for dialogue tasks but still face issues, such as a lack of consistency.
  • To improve consistency in social dialogue, previous research suggested conditioning dialogue generation on a persona describing the agent (e.g., personal interests and characteristics).
  • Efforts have been made to enhance persona consistency through supervised learning or online reinforcement learning (RL), but these methods still have limitations:
    • Supervised learning tends to focus on encouraging persona-related examples while neglecting contradictions, leading to insensitivity to conflicting statements.
    • Online RL involves expensive training, since it requires continuously generating new samples and accurate critics that enforce both persona consistency and dialogue fluency.
  • This paper proposes an offline RL framework to improve the persona consistency of open-domain dialogue systems, addressing several challenges:
    • Offline RL can explicitly penalize contradictory responses, enhancing sensitivity to contradictions.
    • Unlike online RL, offline RL does not demand the generation of new samples, allowing training on existing datasets with human-annotated rewards.
    • The method reduces training failures from policy divergence and employs a new importance sampling technique called VaRMI to minimize variance in importance weights.
  • Previous research has applied offline RL to task-oriented dialogue, but applying it to social dialogue is less straightforward due to the lack of clear reward definitions.
  • The work centers around the idea that persona consistency is crucial for effective open-domain dialogue as humans naturally communicate with a persona.
  • Key contributions of this study include:
    • An offline RL framework to create persona-consistent dialogue agents using human-annotated rewards.
    • The introduction of VaRMI for reduced variance in offline RL training.
    • Improvements in persona consistency and dialogue quality of BlenderBot3 (BB3) based on both automated and human evaluations.

2 Related Work

  • Persona Consistent Dialogue


  • Focus on persona-based dialogue generation, primarily using the PersonaChat dataset (Zhang et al., 2018).
  • Common method: Fine-tune models with supervised learning (Roller et al., 2020; Shuster et al., 2022; Yavuz et al., 2019).
  • Consistency issues remain despite fine-tuning.
  • Prior work aims to enhance persona consistency by:
    • Encouraging entailing utterances.
    • Discouraging contradictory utterances.
  • Online reinforcement learning (RL) methods have been applied:
    • Example: Song et al. (2019b) utilizes NLI classifier and naturalness module.
    • Liu et al. (2020) employs mutual persona perception.
  • Other consistency improvement methods without additional dialogue policy training:
    • Multistage re-writing (Song et al., 2020), but limited in handling multi-turn consistency.
    • Bayesian rational speech acts (Kim et al., 2020; Frank and Goodman, 2012) come with higher computational costs.
  • Unlikelihood training (Li et al., 2020; Welleck et al., 2019a) proposed to enhance persona consistency:
    • Issues include failure to explicitly reward entailing utterances and potential incoherency due to token-level punishment of contradictions (Shi et al., 2021).
  • Offline RL approach aims to maintain coherence by addressing utterance-level contradictions.

  • Offline RL
    • Applications in dialogue tasks are limited, mainly focusing on task-oriented dialogue.
    • Examples include price negotiation (Verma et al., 2022) and benchmarks like MultiWOZ (Jang et al., 2022; Budzianowski et al., 2018).
    • Previous studies often utilize Q-learning (Jaques et al., 2020; Snell et al., 2023), which requires additional models to guide dialogue policy, increasing complexity.
    • The proposed method uses a policy-gradient based offline RL framework with fixed rewards, simplifying training and deployment.
    • Policy-gradient offline RL has limited practical use due to high variance from importance sampling.
    • Variance reduction techniques for importance weights are notable in off-policy learning (Munos et al., 2016) and offline RL (Pang and He, 2021; Levine et al., 2020).
    • The introduction of VaRMI seeks to reduce variance in offline RL training.

3 Method

  • This section outlines the offline reinforcement learning (RL) framework designed to enhance persona consistency.
  • 3.1 Overview of Offline RL Training
    • Provides a general description of the offline RL training process.
  • 3.2 VaRMI Importance Sampling Method
    • Introduces a novel method called VaRMI for importance sampling.
  • 3.3 Framework Outline
    • Summarizes the proposed framework for implementing the offline RL.
  • 3.4 Implementation on Dialogue Model
    • Discusses the specific application of the framework within a dialogue model context.

3.1 Offline RL


  • Training Approach:

    • Utilizes a policy-gradient method to optimize the RL objective.
    • The objective is defined as \(J(\theta) = E_{\tau \sim p(\pi_{\theta}(\tau))} \left[ \sum_{t=0}^{T} \gamma^{t} r(s_{t}, a_{t}) \right]\).
  • Policy Gradient Calculation:

    • The gradient of the RL objective with respect to the policy is computed as:

      \[\nabla_{\theta} J(\theta) = E_{\tau \sim p(\pi_{\theta}(\tau))} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{t}|s_{t}) \hat{Q}(s_{t}, a_{t}) \right]\]
    • Here, \(\hat{Q}(s_{t}, a_{t})\) is the estimated return from taking action \(a_{t}\) in state \(s_{t}\).

  • Reward Function:

    • The reward is assigned at the utterance level: +1 if the response entails the given persona and -1 if it contradicts it.
    • Response fluency is not scored during offline training: candidates are fixed, human-written utterances, so the incoherent outputs that online RL must guard against with a fluency critic do not arise.
  • Sample Collection:

    • In offline RL, samples come from a behavioral policy \(\pi_b\) that differs from the policy being optimized, \(\pi_{\theta}\).
    • Importance sampling is used to derive an unbiased estimator of the policy gradient:

      \[\nabla_{\theta} J(\theta) = E_{\tau \sim p(\pi_b(\tau))} \left[ \sum_{t=0}^{T} w_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \vert s_{t}) \hat{Q}(s_{t}, a_{t}) \right]\]
    • Importance weights \(w_{t} = \prod_{t'=0}^{t} \frac{\pi_{\theta}(a_{t'} \vert s_{t'})}{\pi_{b}(a_{t'} \vert s_{t'})}\) are approximated for variance reduction.
  • Importance Sampling Methods:

    1. GOLD Method:
      • Assumes a constant likelihood for all training samples under \(\pi_b\), allowing the weights to simplify to \(w_{t} = \pi_{\theta}(a_{t} \vert s_{t})\) (a loss sketch using this simplification follows this list).
      • Effective in scenarios where \(\pi_b\) is unknown.
    2. VaRMI:
      • The second importance sampling method, VaRMI, is introduced in the next section.
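
Below is a minimal PyTorch sketch of the importance-weighted policy-gradient loss under the GOLD-style simplification, assuming per-token log-probabilities from the policy and an utterance-level reward of ±1. The function and variable names are illustrative, not the paper's implementation.

```python
import torch

def gold_policy_gradient_loss(token_logprobs, reward, pad_mask):
    """Offline policy-gradient loss with GOLD-style importance weights.

    token_logprobs: (T,) tensor of log pi_theta(a_t | s_t) for one utterance candidate
    reward:         utterance-level reward, -1 or +1
    pad_mask:       (T,) tensor, 1.0 for real tokens and 0.0 for padding
    """
    # GOLD simplification: w_t = pi_theta(a_t | s_t). The weight is detached so it
    # scales the gradient but is not differentiated through.
    weights = token_logprobs.detach().exp()
    # REINFORCE-style objective: maximize sum_t w_t * log pi_theta(a_t | s_t) * reward,
    # so the loss is its negation, normalized over non-pad tokens.
    per_token = -weights * token_logprobs * reward
    return (per_token * pad_mask).sum() / pad_mask.sum()
```

In practice this loss would be averaged over a batch of (persona + context, candidate, reward) items drawn from the offline dataset.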

3.2 VaRMI Importance Sampling

  • Policy-gradient based offline RL methods face high variance in gradient estimators due to importance sampling required for correcting distributional shift between πθ (policy) and πb (behavior policy).
  • VaRMI, a novel importance sampling method, is introduced to alleviate these variance issues during offline RL training.
  • Key aspects of VaRMI importance sampling:
    • Importance weights are reduced by initializing πθ to the Maximum Likelihood Estimation (MLE) solution before offline training.
    • This initialization means πθ has already been trained on many positive-reward examples, so the distributional shift for those candidates is minimal.
    • For positive-reward candidates, importance weights are therefore set to 1 (treating them as on-policy), whereas for negative-reward candidates, weights reflect the candidate's likelihood under the policy.
  • Effectively, this approach eliminates much of the variance in the importance weights, introducing some bias but improving stability (a weighting sketch follows this list).
  • VaRMI is specifically utilized for ensuring persona consistency but can be adapted to other tasks under certain conditions:
    1. There must be absolute positive and negative rewards present, rather than relative rewards derived from a baseline.
    2. The acting policy must begin with the MLE solution for the task.
  • These conditions are applicable to a wide range of dialogue tasks and possibly beyond.
  • Future research is needed to explore the generalizability of VaRMI for tasks with complex rewards, extended time intervals, and other contexts outside persona consistency.
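
A minimal sketch of the VaRMI weighting rule described above, assuming the policy has been initialized to the MLE solution: positive-reward candidates are treated as on-policy (weight 1), while negative-reward candidates keep a likelihood-based weight as in GOLD. Names are illustrative, not the paper's implementation.

```python
import torch

def varmi_loss(token_logprobs, reward, pad_mask):
    """Offline policy-gradient loss with VaRMI importance weights."""
    if reward > 0:
        # The MLE-initialized policy has already seen positive-reward examples,
        # so their distributional shift is assumed negligible: w_t = 1.
        weights = torch.ones_like(token_logprobs)
    else:
        # Negative-reward candidates keep likelihood-based weights, w_t = pi_theta(a_t | s_t).
        weights = token_logprobs.detach().exp()
    per_token = -weights * token_logprobs * reward
    return (per_token * pad_mask).sum() / pad_mask.sum()
```

Fixing the positive-candidate weights to a constant removes their contribution to the variance of the gradient estimator, at the cost of some bias.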

3.3 Framework

  • Overview of framework details including critic construction and offline dataset generation.
  • Critic Construction:
    • Utilizes a mapping between Dialogue Natural Language Inference (DNLI) and PersonaChat datasets.
    • PersonaChat:
      • Crowd-sourced dialogue dataset with 10,907 dialogues (1,000 for validation, 968 for testing).
      • Workers chat by adopting a given persona.
    • DNLI:
      • Contains 310,110 sentence pairs from PersonaChat.
      • Each pair labeled for entailment (entailment, neutrality, contradiction) based on human-annotated triples.
  • Mapping Process (a rough sketch of the mapping and filtering logic follows this list):
    • Each DNLI pair is matched to a dialogue utterance in PersonaChat.
    • The DNLI persona sentence is added to the dialogue's existing persona set.
    • The matched sentence becomes the next-utterance candidate.
  • Persona Filtering:
    • Avoids contradictions in persona sets during mapping.
    • Filters contradicting personas using human-annotated triples, removing any with entity overlap.
  • Data Integrity:
    • Each persona in training set is also present in DNLI, ensuring mapping applicability.
    • Additional filtering with an NLI classifier is applied for longer personas.
    • Sentences that are neutral with respect to the inserted persona are filtered out, since they would carry zero reward.
  • Results of Mapping and Filtering:
    • Approximately 42,000 utterance candidates available for training with offline RL.
  • Dataset Items:
    • Each item includes a persona, dialogue context, utterance candidate, and entailment label.
    • The persona and dialogue context are concatenated to form the state, the utterance candidate serves as the trajectory \(\tau\), and the entailment label determines the estimated return.
  • Example Dialogues:
    • Mapped dataset examples provided for illustration.
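
A rough sketch of the mapping and filtering logic, under the assumption that each DNLI pair exposes its matched sentence, persona sentence, label, and annotated triples, and that PersonaChat dialogues expose their utterances and persona sets. The field names and the helpers `entity_overlap` and `context_before` are hypothetical.

```python
def entity_overlap(triple_a, triple_b):
    """Hypothetical check for shared entities between two (e1, relation, e2) triples."""
    return bool({triple_a[0], triple_a[2]} & {triple_b[0], triple_b[2]})

def build_offline_dataset(dnli_pairs, personachat_dialogues):
    """Map DNLI pairs onto PersonaChat dialogues to build reward-labelled
    next-utterance candidates (entailment -> +1, contradiction -> -1)."""
    dataset = []
    for pair in dnli_pairs:                    # hypothetical fields: sentence, persona, label, triples
        for dialogue in personachat_dialogues:
            if pair.sentence not in dialogue.utterances:
                continue                       # only map onto dialogues containing the matched utterance
            # Drop existing personas whose annotated triple shares an entity with the
            # inserted DNLI persona, keeping the final persona set contradiction-free.
            persona_set = [p for p in dialogue.personas
                           if not entity_overlap(p.triple, pair.persona_triple)]
            persona_set.append(pair.persona)
            dataset.append({
                "persona": persona_set,
                "context": dialogue.context_before(pair.sentence),   # history up to the match
                "candidate": pair.sentence,                          # serves as the trajectory tau
                "reward": 1 if pair.label == "entailment" else -1,   # neutral pairs are filtered out beforehand
            })
    return dataset
```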

3.4 Implementation

  • Method Utilization:

    • Implemented on BlenderBot3 (BB3), an open-source dialogue system by Meta.
    • BB3 is tailored for open-domain dialogue and has been fine-tuned on various datasets including PersonaChat.
  • Performance Metrics:

    • BB3 has a perplexity of approximately 5.8 on the PersonaChat dataset.
    • Additional fine-tuning may lead to overfitting.
  • Consistency Issues:

    • Despite good performance, BB3 exhibits consistency problems, being less consistent than its predecessor.
  • Training Details:

    • Trained the 3 billion parameter version of BB3 for four epochs.
    • Employed two importance sampling methods with learning rates of 5e−7 for GOLD and 1e−6 for VaRMI.
  • Implementation Framework:

    • The method is implemented within the ParlAI framework.
  • Module Adjustments:

    • The dynamic memory, internet search, and memory decision modules are disabled because they are error-prone and degrade dialogue performance.
    • This adjustment helps isolate the effect of persona-consistency training on the model's responses (the reported setup is summarized in the sketch below).
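
For quick reference, the reported training setup is collected below into an illustrative Python dict; the keys are descriptive labels, not actual ParlAI options.

```python
# Reported setup gathered into an illustrative config
# (the keys are descriptive labels, not ParlAI command-line options).
offline_rl_setup = {
    "base_model": "BlenderBot3 (3B parameters)",
    "framework": "ParlAI",
    "epochs": 4,
    "learning_rate": {"GOLD": 5e-7, "VaRMI": 1e-6},
    # Modules disabled to isolate the effect of persona-consistency training:
    "disabled_modules": ["dynamic memory", "internet search", "memory decision"],
}
```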

4 Experiments


  • Tested the effectiveness of the offline RL framework for persona consistency.
  • Evaluated using both automatic and human evaluations.
  • Results indicate that both importance sampling methods enhance persona consistency of BB3.
  • Human evaluations confirm that the VaRMI importance sampling method improves overall dialogue quality of the model.

4.1 Evaluation Datasets

  • DNLI Evaluation Set

    • Designed to test persona consistency of dialogue models.
    • Consists of:
      • Sentence pairs with entailment labels from the base DNLI dataset.
      • Personas and dialogue histories from the PersonaChat evaluation set.
      • 31 utterance candidates including:
        • 10 contradictory candidates.
        • 10 entailment candidates.
        • 10 neutral candidates.
        • 1 actual next utterance.
    • Contains a total of 542 dialogues for testing.
  • Mapped DNLI-PersonaChat Dataset

    • Evaluation on 5,000 dialogues from the mapped dataset.
    • Dialogues held out from training and divided into positive and negative utterance candidates based on entailment.
    • The aim is to:
      • Encourage entailing candidates.
      • Discourage contradictions.
    • Performance monitoring on these two sets to assess success of training methods.
  • Model Evaluation Results

    • Comparison of various models (BB3, BB3+RL, BB3+GOLD, BB3+VaRMI) based on:
      • Hits@1
      • Entail@1
      • Rand@1
      • Contradict@1
    • Statistical significance tested via independent two-sample z-test (p < 0.05) marked with an asterisk.
  • Human Evaluation Results

    • Ratings for raw and calibrated model quality.
    • Comparison of importance sampling techniques against the BB3-3B baseline.
    • Results show:
      • Ratings with standard deviations included.
      • Statistically significant improvements are determined with an independent two-sample t-test (p < 0.05) and marked with an asterisk (a small sketch of these tests follows this list).
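
A small sketch of the significance tests mentioned above, using SciPy and statsmodels; the numbers below are placeholders, not the paper's data.

```python
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Independent two-sample t-test on per-user quality ratings (placeholder values).
baseline_ratings = [3, 4, 2, 4, 3, 3]
varmi_ratings = [4, 4, 3, 5, 4, 4]
t_stat, p_value = ttest_ind(varmi_ratings, baseline_ratings)
print(f"t-test p = {p_value:.3f}")  # significant if p < 0.05

# Independent two-sample z-test on a proportion metric such as Contradict@1
# (placeholder counts over the 542 evaluation dialogues).
top1_contradictions = [60, 110]   # model A vs. model B
n_dialogues = [542, 542]
z_stat, p_value = proportions_ztest(top1_contradictions, n_dialogues)
print(f"z-test p = {p_value:.3f}")
```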

4.2 Automatic Evaluation

  • Results on Mapped DNLI-PersonaChat Dataset

    • Loss trajectories analyzed during training phases.
    • The initial loss (epoch 0) shows only a small gap between the positive and negative utterance sets, indicating low sensitivity to contradictions.
    • Training with the GOLD method increases the loss on both sets, with the loss on negative candidates increasing more, reflecting heightened sensitivity to contradictions but weaker encouragement of entailing utterances.
    • VaRMI training aligns with expectations: the loss on positive candidates decreases significantly while the loss on negative candidates nearly doubles, indicating the model both favors entailing utterances and avoids contradictions.
    • The loss discrepancies suggest that prior supervised training on persona-entailing examples limits further improvement on those cases.
  • Results on DNLI Evaluation Dataset

    • Comparison of training methods against BB3 (imitation learning only) and a baseline with online RL.
    • Metrics assessed (a metric-computation sketch follows this list):
      • Hits@1: Top-1 candidate match percentage with gold next utterance.
      • Entail@1: Percent of top candidates sharing the same triple as gold next utterance.
      • Contradict@1: Percent of top candidates with contradicting triples.
      • Rand@1: Percent of candidates with triples that neither contradict nor entail.
    • Offline training methods (GOLD and VaRMI) outperform baselines:
      • GOLD method excels at reducing contradictions.
      • VaRMI method ranks entailing and gold utterances highly while minimizing neutral candidates.
    • Evaluation results show significant improvements through offline training methods compared to BB3 and BB3+RL, though no significant difference between GOLD and VaRMI.
    • Online RL shows no significant advantage over the BB3 baseline trained only with supervised learning.
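
A minimal sketch of how the candidate-ranking metrics above could be computed, assuming each evaluation dialogue supplies a model score and a label (gold, entail, contradict, or neutral) for each of its 31 candidates; the data layout, and counting the gold utterance as entailing its own triple, are assumptions.

```python
from collections import Counter

def ranking_metrics(eval_dialogues):
    """Compute Hits@1 / Entail@1 / Contradict@1 / Rand@1 over the DNLI evaluation set.

    eval_dialogues: list of dialogues, each a list of (model_score, label) pairs
    with label in {"gold", "entail", "contradict", "neutral"}.
    """
    top1 = Counter()
    for candidates in eval_dialogues:
        _, label = max(candidates, key=lambda c: c[0])     # highest-scoring candidate
        top1[label] += 1
    n = len(eval_dialogues)
    return {
        "Hits@1": top1["gold"] / n,                        # top-1 is the gold next utterance
        "Entail@1": (top1["gold"] + top1["entail"]) / n,   # top-1 shares the gold utterance's triple
        "Contradict@1": top1["contradict"] / n,            # top-1 contradicts the persona
        "Rand@1": top1["neutral"] / n,                     # top-1 neither entails nor contradicts
    }
```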

4.3 Human Evaluation

  • Setup:
    • 90 individuals recruited via email, social media, and in-person.
    • Participants randomly assigned to one of three systems, yielding 30 responses per model.
    • Users instructed to chat with the bot for a minimum of seven turns.
    • Post-chat survey rated conversation quality and bot’s persona representation on a scale of 1-5.
    • Option for users to provide additional feedback or suggestions in a text box.
    • Bot personas randomly selected from 967 options in the PersonaChat dataset.
    • Participants were shown the bot’s persona only after completing the chat, for reference during the survey.
  • Results:
    • Results displayed in Table 3.
    • Bayesian calibration applied to results to account for annotator bias and inter-annotator variability.
    • Findings indicate both offline RL methods enhance bot’s consistency with its assigned persona; GOLD method performs best.
    • VaRMI importance sampling method improves dialogue quality over the BB3 baseline.
    • GOLD importance sampling method underperforms in conversation quality compared to the other methods.

4.4 User Comments and Error Analysis

[Tables 4 and 5: example conversations with BB3 and BB3 + VaRMI]
  • Complaints Received:
    • Users reported awkward language usage and abrupt topic changes across all bots.
  • Persona Representation Issues:

    • Complaints arose particularly regarding the GOLD method bot, which was noted for over-representing its persona.
    • Users felt the bot ignored their inputs, focusing instead on its persona.
    • Some interactions felt scripted due to the fixation on persona.
  • Quality vs. Consistency Trade-off:

    • The GOLD bot successfully represented its persona but sacrificed dialogue quality for consistency.
    • This raises questions about how much a chatbot should represent its persona throughout a conversation.
  • Conversation Context Importance:

    • In short conversations (around seven turns), fully representing a persona may seem unnatural.
    • The optimal score for consistency and conversation quality may vary depending on conversation type.
  • Evaluation of Models:
    • The BB3 baseline model faced multiple complaints regarding inadequate persona representation.
    • In contrast, the VaRMI model improved consistency and quality, correcting contradictions found in the BB3 conversations.
  • Examples of Conversations:
    • Table 4 provides an example of inconsistency from the BB3 model.
    • Table 5 showcases a more consistent conversation with the VaRMI model, highlighting its improved performance.
  • Further Information:
    • Full conversations from the human evaluation are available in Appendix C.

5 Conclusion and Future Work

  • Demonstrated the effectiveness of offline reinforcement learning (RL) for enhancing open-domain dialogue systems.
  • Applied offline RL to a persona consistency task, showing improved persona consistency and dialogue quality compared to imitation learning-only systems.
  • Developed a persona consistency critic utilizing human-annotated labels and introduced a novel importance sampling method called VaRMI.
  • Evaluations (both automatic and human) confirmed the enhancement in persona consistency of BB3 and overall dialogue quality.
  • Suggested future work directions:
    • Extend framework to address other dialogue aspects, such as reducing hallucinations and offensive language.
    • Utilize LLMs for generating quality synthetic data, minimizing the need for human conversation collection.
    • Explore the generalization of VaRMI to other tasks.
    • Investigate offline policy gradient methods and assess if VaRMI can mitigate their high variance issues effectively.

6 Limitations

  • The framework’s training sample size is fixed.
  • Collecting more data from human users or synthesizing data using LLMs is necessary for increasing sample size.
  • Both methods of acquiring more samples are more costly compared to online RL methods, which can generate unlimited samples without extra cost.
  • Human experiments were constrained by the size of the language model used.
  • Only the 3B parameter version of BB3 was utilized due to resource limitations, which is significantly smaller than many state-of-the-art language models.
  • The next largest version of BB3 has 30B parameters, which exceeds the training resources available for this work.
  • Future improvements may benefit from increasing the language model size to address quality complaints.

7 Ethical Concerns

  • Giving a language model a persona encourages it to act human-like.
  • Language models may be incentivized to deny being a bot when asked.
  • Transparency with users is crucial; they should know they are interacting with a chatbot.
  • Participants were informed they were conversing with a bot in all experiments.
  • Users were advised not to share personally identifiable information with the bot.

Reader’s Opinion

  • This study used only a 3B-parameter language model due to resource constraints. To overcome this limitation, larger models could be used together with parameter-efficient training methods such as LoRA.
    • Specifically, the parameters of a larger language model (e.g., 30B or 175B) can be frozen while only the LoRA module parameters are updated during reinforcement learning (see the sketch below).
  • There are also limitations in reinforcement learning conducted with a restricted dataset. This issue could be addressed by augmenting the training dataset using an LLM.
    • However, in this case, additional research is needed on how LLMs can generate conversational datasets effectively without issues such as hallucination.
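
A minimal sketch of this suggestion using Hugging Face's peft library: load a frozen base model and train only LoRA adapter parameters during offline RL fine-tuning. The checkpoint name and target modules are placeholders and would need to match the chosen model.

```python
import torch
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint; a 30B- or 175B-class model would be wrapped the same way.
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/blenderbot-3B", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names depend on the architecture
)

# Base weights stay frozen; only the LoRA adapter parameters receive gradients,
# so an offline RL loss (GOLD or VaRMI) could be applied to a much larger model.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```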
