
This paper presents an offline reinforcement learning (RL) framework for improving persona consistency in dialogue systems. It combines the benefits of supervised learning and RL, introduces a novel variance-reducing importance sampling method (VaRMI), and demonstrates improved performance in both persona consistency and dialogue quality.


1 Introduction


  • Rapid advancements in large language models have enabled the development of dialogue agents that generate fluent and natural responses.
  • These dialogue systems are trained on vast amounts of unlabeled text data and fine-tuned for dialogue tasks but still face issues, such as a lack of consistency.
  • To improve consistency in social dialogue, previous research suggested conditioning dialogue generation on a persona describing the agent (e.g., personal interests and characteristics).
  • Efforts have been made to enhance persona consistency through supervised learning or online reinforcement learning (RL), but these methods still have limitations:
    • Supervised learning tends to focus on encouraging persona-related examples while neglecting contradictions, leading to insensitivity to conflicting statements.
    • Online RL involves expensive training, since it requires continuously generating new samples and accurate critics that enforce both persona consistency and dialogue fluency.
  • This paper proposes an offline RL framework to improve the persona consistency of open-domain dialogue systems, addressing several challenges:
    • Offline RL can explicitly penalize contradictory responses, enhancing sensitivity to contradictions.
    • Unlike online RL, offline RL does not demand the generation of new samples, allowing training on existing datasets with human-annotated rewards.
    • The method reduces training failures from policy divergence and employs a new importance sampling technique called VaRMI to minimize variance in importance weights.
  • Previous research has applied offline RL to task-oriented dialogue, but applying it to social dialogue is less straightforward due to the lack of clear reward definitions.
  • The work centers around the idea that persona consistency is crucial for effective open-domain dialogue as humans naturally communicate with a persona.
  • Key contributions of this study include:
    • An offline RL framework to create persona-consistent dialogue agents using human-annotated rewards.
    • The introduction of VaRMI for reduced variance in offline RL training.
    • Improvements in persona consistency and dialogue quality of BlenderBot3 (BB3) based on both automated and human evaluations.

2 Related Work

  • Persona Consistent Dialogue


  • Focus on persona-based dialogue generation, primarily using the PersonaChat dataset (Zhang et al., 2018).
  • Common method: Fine-tune models with supervised learning (Roller et al., 2020; Shuster et al., 2022; Yavuz et al., 2019).
  • Consistency issues remain despite fine-tuning.
  • Prior work aims to enhance persona consistency by:
    • Encouraging entailing utterances.
    • Discouraging contradictory utterances.
  • Online reinforcement learning (RL) methods have been applied:
    • Example: Song et al. (2019b) utilizes NLI classifier and naturalness module.
    • Liu et al. (2020) employs mutual persona perception.
  • Other consistency improvement methods without additional dialogue policy training:
    • Multistage re-writing (Song et al., 2020), but limited in handling multi-turn consistency.
    • Bayesian rational speech acts (Kim et al., 2020; Frank and Goodman, 2012) come with higher computational costs.
  • Unlikelihood training (Li et al., 2020; Welleck et al., 2019a) proposed to enhance persona consistency:
    • Issues include failure to explicitly reward entailing utterances and potential incoherency due to token-level punishment of contradictions (Shi et al., 2021).
  • Offline RL approach aims to maintain coherence by addressing utterance-level contradictions.

  • Offline RL
    • Applications in dialogue tasks are limited, mainly focusing on task-oriented dialogue.
    • Examples include price negotiation (Verma et al., 2022) and benchmarks like MultiWOZ (Jang et al., 2022; Budzianowski et al., 2018).
    • Previous studies often utilize Q-learning (Jaques et al., 2020; Snell et al., 2023), which requires additional models to guide dialogue policy, increasing complexity.
    • The proposed method uses a policy-gradient based offline RL framework with fixed rewards, simplifying training and deployment.
    • Policy-gradient offline RL has limited practical use due to high variance from importance sampling.
    • Variance reduction techniques for importance weights are notable in off-policy learning (Munos et al., 2016) and offline RL (Pang and He, 2021; Levine et al., 2020).
    • The introduction of VaRMI seeks to reduce variance in offline RL training.

3 Method

  • This section outlines the offline reinforcement learning (RL) framework designed to enhance persona consistency.
  • 3.1 Overview of Offline RL Training
    • Provides a general description of the offline RL training process.
  • 3.2 VaRMI Importance Sampling Method
    • Introduces a novel method called VaRMI for importance sampling.
  • 3.3 Framework Outline
    • Summarizes the proposed framework for implementing the offline RL.
  • 3.4 Implementation on Dialogue Model
    • Discusses the specific application of the framework within a dialogue model context.

3.1 Offline RL


  • Training Approach:

    • Utilizes a policy-gradient method to optimize the RL objective.
    • The objective is defined as \(J(\theta) = E_{\tau \sim p(\pi_{\theta}(\tau))} \left[ \sum_{t=0}^{T} \gamma^{t} r(s_{t}, a_{t}) \right]\).
  • Policy Gradient Calculation:

    • The gradient of the RL objective with respect to the policy is computed as:

      \[\nabla_{\theta} J(\theta) = E_{\tau \sim p(\pi_{\theta}(\tau))} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{t}|s_{t}) \hat{Q}(s_{t}, a_{t}) \right]\]
    • Here, \(\hat{Q}(s_{t}, a_{t})\) is the estimated return from taking action \(a_{t}\) in state \(s_{t}\).

  • Reward Function:

    • The reward is assigned at the utterance level: +1 if the response entails the given persona and -1 if it contradicts it.
    • Response fluency is not scored during offline training: candidates are fixed, human-written utterances, so the incoherent outputs that online RL must guard against with a fluency critic do not arise.
  • Sample Collection:

    • In offline RL, samples come from a behavioral policy \(\pi_b\) that differs from the policy being optimized, \(\pi_{\theta}\).
    • Importance sampling is used to derive an unbiased estimator of the policy gradient:

      \[\nabla_{\theta} J(\theta) = E_{\tau \sim p(\pi_b(\tau))} \left[ \sum_{t=0}^{T} w_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \vert s_{t}) \hat{Q}(s_{t}, a_{t}) \right]\]
    • Importance weights \(w_{t} = \prod_{t'=0}^{t} \frac{\pi_{\theta}(a_{t'} \vert s_{t'})}{\pi_{b}(a_{t'} \vert s_{t'})}\) are approximated for variance reduction.
  • Importance Sampling Methods:

    1. GOLD Method:
      • Assumes a constant likelihood for all training samples under \(\pi_b\), allowing the weights to simplify to \(w_{t} = \pi_{\theta}(a_{t} \vert s_{t})\) (a loss sketch using this simplification follows this list).
      • Effective in scenarios where \(\pi_b\) is unknown.
    2. VaRMI:
      • The second importance sampling method, VaRMI, is introduced in the next section.
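
Below is a minimal PyTorch sketch of the importance-weighted policy-gradient loss under the GOLD-style simplification, assuming per-token log-probabilities from the policy and an utterance-level reward of ±1. The function and variable names are illustrative, not the paper's implementation.

```python
import torch

def gold_policy_gradient_loss(token_logprobs, reward, pad_mask):
    """Offline policy-gradient loss with GOLD-style importance weights.

    token_logprobs: (T,) tensor of log pi_theta(a_t | s_t) for one utterance candidate
    reward:         utterance-level reward, -1 or +1
    pad_mask:       (T,) tensor, 1.0 for real tokens and 0.0 for padding
    """
    # GOLD simplification: w_t = pi_theta(a_t | s_t). The weight is detached so it
    # scales the gradient but is not differentiated through.
    weights = token_logprobs.detach().exp()
    # REINFORCE-style objective: maximize sum_t w_t * log pi_theta(a_t | s_t) * reward,
    # so the loss is its negation, normalized over non-pad tokens.
    per_token = -weights * token_logprobs * reward
    return (per_token * pad_mask).sum() / pad_mask.sum()
```

In practice this loss would be averaged over a batch of (persona + context, candidate, reward) items drawn from the offline dataset.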

3.2 VaRMI Importance Sampling

  • Policy-gradient based offline RL methods face high variance in gradient estimators due to importance sampling required for correcting distributional shift between πθ (policy) and πb (behavior policy).
  • VaRMI, a novel importance sampling method, is introduced to alleviate these variance issues during offline RL training.
  • Key aspects of VaRMI importance sampling:
    • Importance weights are reduced by initializing πθ to the Maximum Likelihood Estimation (MLE) solution before offline training.
    • This initialization means πθ has already been trained on many positive-reward examples, so the distributional shift for those candidates is minimal.
    • For positive-reward candidates, importance weights are therefore set to 1 (treating them as on-policy), whereas for negative-reward candidates, weights reflect the candidate's likelihood under the policy.
  • Effectively, this approach eliminates much of the variance in the importance weights, introducing some bias but improving stability (a weighting sketch follows this list).
  • VaRMI is specifically utilized for ensuring persona consistency but can be adapted to other tasks under certain conditions:
    1. There must be absolute positive and negative rewards present, rather than relative rewards derived from a baseline.
    2. The acting policy must begin with the MLE solution for the task.
  • These conditions are applicable to a wide range of dialogue tasks and possibly beyond.
  • Future research is needed to explore the generalizability of VaRMI for tasks with complex rewards, extended time intervals, and other contexts outside persona consistency.
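
A minimal sketch of the VaRMI weighting rule described above, assuming the policy has been initialized to the MLE solution: positive-reward candidates are treated as on-policy (weight 1), while negative-reward candidates keep a likelihood-based weight as in GOLD. Names are illustrative, not the paper's implementation.

```python
import torch

def varmi_loss(token_logprobs, reward, pad_mask):
    """Offline policy-gradient loss with VaRMI importance weights."""
    if reward > 0:
        # The MLE-initialized policy has already seen positive-reward examples,
        # so their distributional shift is assumed negligible: w_t = 1.
        weights = torch.ones_like(token_logprobs)
    else:
        # Negative-reward candidates keep likelihood-based weights, w_t = pi_theta(a_t | s_t).
        weights = token_logprobs.detach().exp()
    per_token = -weights * token_logprobs * reward
    return (per_token * pad_mask).sum() / pad_mask.sum()
```

Fixing the positive-candidate weights to a constant removes their contribution to the variance of the gradient estimator, at the cost of some bias.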

3.3 Framework

  • Overview of framework details including critic construction and offline dataset generation.
  • Critic Construction:
    • Utilizes a mapping between Dialogue Natural Language Inference (DNLI) and PersonaChat datasets.
    • PersonaChat:
      • Crowd-sourced dialogue dataset with 10,907 dialogues (1,000 for validation, 968 for testing).
      • Workers chat by adopting a given persona.
    • DNLI:
      • Contains 310,110 sentence pairs from PersonaChat.
      • Each pair labeled for entailment (entailment, neutrality, contradiction) based on human-annotated triples.
  • Mapping Process (a rough sketch of the mapping and filtering logic follows this list):
    • Each DNLI pair is matched to a dialogue utterance in PersonaChat.
    • The DNLI persona sentence is added to the dialogue's existing persona set.
    • The matched sentence becomes the next-utterance candidate.
  • Persona Filtering:
    • Avoids contradictions in persona sets during mapping.
    • Filters contradicting personas using human-annotated triples, removing any with entity overlap.
  • Data Integrity:
    • Each persona in training set is also present in DNLI, ensuring mapping applicability.
    • Additional filtering with an NLI classifier is applied for longer personas.
    • Sentences that are neutral with respect to the inserted persona are filtered out, since they would carry zero reward.
  • Results of Mapping and Filtering:
    • Approximately 42,000 utterance candidates available for training with offline RL.
  • Dataset Items:
    • Each item includes a persona, dialogue context, utterance candidate, and entailment label.
    • The persona and dialogue context are concatenated to form the state, the utterance candidate serves as the trajectory \(\tau\), and the entailment label determines the estimated return.
  • Example Dialogues:
    • Mapped dataset examples provided for illustration.
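
A rough sketch of the mapping and filtering logic, under the assumption that each DNLI pair exposes its matched sentence, persona sentence, label, and annotated triples, and that PersonaChat dialogues expose their utterances and persona sets. The field names and the helpers `entity_overlap` and `context_before` are hypothetical.

```python
def entity_overlap(triple_a, triple_b):
    """Hypothetical check for shared entities between two (e1, relation, e2) triples."""
    return bool({triple_a[0], triple_a[2]} & {triple_b[0], triple_b[2]})

def build_offline_dataset(dnli_pairs, personachat_dialogues):
    """Map DNLI pairs onto PersonaChat dialogues to build reward-labelled
    next-utterance candidates (entailment -> +1, contradiction -> -1)."""
    dataset = []
    for pair in dnli_pairs:                    # hypothetical fields: sentence, persona, label, triples
        for dialogue in personachat_dialogues:
            if pair.sentence not in dialogue.utterances:
                continue                       # only map onto dialogues containing the matched utterance
            # Drop existing personas whose annotated triple shares an entity with the
            # inserted DNLI persona, keeping the final persona set contradiction-free.
            persona_set = [p for p in dialogue.personas
                           if not entity_overlap(p.triple, pair.persona_triple)]
            persona_set.append(pair.persona)
            dataset.append({
                "persona": persona_set,
                "context": dialogue.context_before(pair.sentence),   # history up to the match
                "candidate": pair.sentence,                          # serves as the trajectory tau
                "reward": 1 if pair.label == "entailment" else -1,   # neutral pairs are filtered out beforehand
            })
    return dataset
```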

3.4 Implementation

  • Method Utilization:

    • Implemented on BlenderBot3 (BB3), an open-source dialogue system by Meta.
    • BB3 is tailored for open-domain dialogue and has been fine-tuned on various datasets including PersonaChat.
  • Performance Metrics:

    • BB3 has a perplexity of approximately 5.8 on the PersonaChat dataset.
    • Additional fine-tuning may lead to overfitting.
  • Consistency Issues:

    • Despite good performance, BB3 exhibits consistency problems, being less consistent than its predecessor.
  • Training Details:

    • Trained the 3 billion parameter version of BB3 for four epochs.
    • Employed two importance sampling methods with learning rates of 5e−7 for GOLD and 1e−6 for VaRMI.
  • Implementation Framework:

    • The method is implemented within the ParlAI framework.
  • Module Adjustments:

    • The dynamic memory, internet search, and memory decision modules are disabled because they are error-prone and degrade dialogue performance.
    • This adjustment helps isolate the effect of persona-consistency training on the model's responses (the reported setup is summarized in the sketch below).
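
For quick reference, the reported training setup is collected below into an illustrative Python dict; the keys are descriptive labels, not actual ParlAI options.

```python
# Reported setup gathered into an illustrative config
# (the keys are descriptive labels, not ParlAI command-line options).
offline_rl_setup = {
    "base_model": "BlenderBot3 (3B parameters)",
    "framework": "ParlAI",
    "epochs": 4,
    "learning_rate": {"GOLD": 5e-7, "VaRMI": 1e-6},
    # Modules disabled to isolate the effect of persona-consistency training:
    "disabled_modules": ["dynamic memory", "internet search", "memory decision"],
}
```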

4 Experiments


  • Tested the effectiveness of the offline RL framework for persona consistency.
  • Evaluated using both automatic and human evaluations.
  • Results indicate that both importance sampling methods enhance persona consistency of BB3.
  • Human evaluations confirm that the VaRMI importance sampling method improves overall dialogue quality of the model.

4.1 Evaluation Datasets

  • DNLI Evaluation Set

    • Designed to test persona consistency of dialogue models.
    • Consists of:
      • Sentence pairs with entailment labels from the base DNLI dataset.
      • Personas and dialogue histories from the PersonaChat evaluation set.
      • 31 utterance candidates including:
        • 10 contradictory candidates.
        • 10 entailment candidates.
        • 10 neutral candidates.
        • 1 actual next utterance.
    • Contains a total of 542 dialogues for testing.
  • Mapped DNLI-PersonaChat Dataset

    • Evaluation on 5,000 dialogues from the mapped dataset.
    • Dialogues held out from training and divided into positive and negative utterance candidates based on entailment.
    • The aim is to:
      • Encourage entailing candidates.
      • Discourage contradictions.
    • Performance monitoring on these two sets to assess success of training methods.
  • Model Evaluation Results

    • Comparison of various models (BB3, BB3+RL, BB3+GOLD, BB3+VaRMI) based on:
      • Hits@1
      • Entail@1
      • Rand@1
      • Contradict@1
    • Statistical significance tested via independent two-sample z-test (p < 0.05) marked with an asterisk.
  • Human Evaluation Results

    • Ratings for raw and calibrated model quality.
    • Comparison of importance sampling techniques against the BB3-3B baseline.
    • Results show:
      • Ratings with standard deviations included.
      • Statistically significant improvements are determined with an independent two-sample t-test (p < 0.05) and marked with an asterisk (a small sketch of these tests follows this list).
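
A small sketch of the significance tests mentioned above, using SciPy and statsmodels; the numbers below are placeholders, not the paper's data.

```python
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Independent two-sample t-test on per-user quality ratings (placeholder values).
baseline_ratings = [3, 4, 2, 4, 3, 3]
varmi_ratings = [4, 4, 3, 5, 4, 4]
t_stat, p_value = ttest_ind(varmi_ratings, baseline_ratings)
print(f"t-test p = {p_value:.3f}")  # significant if p < 0.05

# Independent two-sample z-test on a proportion metric such as Contradict@1
# (placeholder counts over the 542 evaluation dialogues).
top1_contradictions = [60, 110]   # model A vs. model B
n_dialogues = [542, 542]
z_stat, p_value = proportions_ztest(top1_contradictions, n_dialogues)
print(f"z-test p = {p_value:.3f}")
```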

4.2 Automatic Evaluation

  • Results on Mapped DNLI-PersonaChat Dataset

    • Loss trajectories analyzed during training phases.
    • The initial loss (epoch 0) shows only a small gap between the positive and negative utterance sets, indicating low sensitivity to contradictions.
    • Training with the GOLD method increases the loss on both sets, with the loss on negative candidates increasing more, reflecting heightened sensitivity to contradictions but weaker encouragement of entailing utterances.
    • VaRMI training aligns with expectations: the loss on positive candidates decreases significantly while the loss on negative candidates nearly doubles, indicating the model both favors entailing utterances and avoids contradictions.
    • The loss discrepancies suggest that prior supervised training on persona-entailing examples limits further improvement on those cases.
  • Results on DNLI Evaluation Dataset

    • Comparison of training methods against BB3 (imitation learning only) and a baseline with online RL.
    • Metrics assessed (a metric-computation sketch follows this list):
      • Hits@1: Top-1 candidate match percentage with gold next utterance.
      • Entail@1: Percent of top candidates sharing the same triple as gold next utterance.
      • Contradict@1: Percent of top candidates with contradicting triples.
      • Rand@1: Percent of candidates with triples that neither contradict nor entail.
    • Offline training methods (GOLD and VaRMI) outperform baselines:
      • GOLD method excels at reducing contradictions.
      • VaRMI method ranks entailing and gold utterances highly while minimizing neutral candidates.
    • Evaluation results show significant improvements through offline training methods compared to BB3 and BB3+RL, though no significant difference between GOLD and VaRMI.
    • Online RL shows no significant advantage over the BB3 baseline trained only with supervised learning.
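
A minimal sketch of how the candidate-ranking metrics above could be computed, assuming each evaluation dialogue supplies a model score and a label (gold, entail, contradict, or neutral) for each of its 31 candidates; the data layout, and counting the gold utterance as entailing its own triple, are assumptions.

```python
from collections import Counter

def ranking_metrics(eval_dialogues):
    """Compute Hits@1 / Entail@1 / Contradict@1 / Rand@1 over the DNLI evaluation set.

    eval_dialogues: list of dialogues, each a list of (model_score, label) pairs
    with label in {"gold", "entail", "contradict", "neutral"}.
    """
    top1 = Counter()
    for candidates in eval_dialogues:
        _, label = max(candidates, key=lambda c: c[0])     # highest-scoring candidate
        top1[label] += 1
    n = len(eval_dialogues)
    return {
        "Hits@1": top1["gold"] / n,                        # top-1 is the gold next utterance
        "Entail@1": (top1["gold"] + top1["entail"]) / n,   # top-1 shares the gold utterance's triple
        "Contradict@1": top1["contradict"] / n,            # top-1 contradicts the persona
        "Rand@1": top1["neutral"] / n,                     # top-1 neither entails nor contradicts
    }
```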

4.3 Human Evaluation

  • Setup:
    • 90 individuals recruited via email, social media, and in-person.
    • Participants randomly assigned to one of three systems, yielding 30 responses per model.
    • Users instructed to chat with the bot for a minimum of seven turns.
    • Post-chat survey rated conversation quality and bot’s persona representation on a scale of 1-5.
    • Option for users to provide additional feedback or suggestions in a text box.
    • Bot personas randomly selected from 967 options in the PersonaChat dataset.
    • Participants were shown the bot’s persona only after completing the chat, for reference during the survey.
  • Results:
    • Results displayed in Table 3.
    • Bayesian calibration applied to results to account for annotator bias and inter-annotator variability.
    • Findings indicate both offline RL methods enhance bot’s consistency with its assigned persona; GOLD method performs best.
    • VaRMI importance sampling method improves dialogue quality over the BB3 baseline.
    • GOLD importance sampling method underperforms in conversation quality compared to the other methods.

4.4 User Comments and Error Analysis

[Tables 4 and 5: example conversations with BB3 and BB3 + VaRMI]
  • Complaints Received:
    • Users reported awkward language usage and abrupt topic changes across all bots.
  • Persona Representation Issues:

    • Complaints arose particularly regarding the GOLD method bot, which was noted for over-representing its persona.
    • Users felt the bot ignored their inputs, focusing instead on its persona.
    • Some interactions felt scripted due to the fixation on persona.
  • Quality vs. Consistency Trade-off:

    • The GOLD bot successfully represented its persona but sacrificed dialogue quality for consistency.
    • This raises questions about how much a chatbot should represent its persona throughout a conversation.
  • Conversation Context Importance:

    • In short conversations (around seven turns), fully representing a persona may seem unnatural.
    • The optimal score for consistency and conversation quality may vary depending on conversation type.
  • Evaluation of Models:
    • The BB3 baseline model faced multiple complaints regarding inadequate persona representation.
    • In contrast, the VaRMI model improved consistency and quality, correcting contradictions found in the BB3 conversations.
  • Examples of Conversations:
    • Table 4 provides an example of inconsistency from the BB3 model.
    • Table 5 showcases a more consistent conversation with the VaRMI model, highlighting its improved performance.
  • Further Information:
    • Full conversations from the human evaluation are available in Appendix C.

5 Conclusion and Future Work

  • Demonstrated the effectiveness of offline reinforcement learning (RL) for enhancing open-domain dialogue systems.
  • Applied offline RL to a persona consistency task, showing improved persona consistency and dialogue quality compared to imitation learning-only systems.
  • Developed a persona consistency critic utilizing human-annotated labels and introduced a novel importance sampling method called VaRMI.
  • Evaluations (both automatic and human) confirmed the enhancement in persona consistency of BB3 and overall dialogue quality.
  • Suggested future work directions:
    • Extend framework to address other dialogue aspects, such as reducing hallucinations and offensive language.
    • Utilize LLMs for generating quality synthetic data, minimizing the need for human conversation collection.
    • Explore the generalization of VaRMI to other tasks.
    • Investigate offline policy gradient methods and assess if VaRMI can mitigate their high variance issues effectively.

6 Limitations

  • The framework’s training sample size is fixed.
  • Collecting more data from human users or synthesizing data using LLMs is necessary for increasing sample size.
  • Both methods of acquiring more samples are more costly compared to online RL methods, which can generate unlimited samples without extra cost.
  • Human experiments were constrained by the size of the language model used.
  • Only the 3B parameter version of BB3 was utilized due to resource limitations, which is significantly smaller than many state-of-the-art language models.
  • The next largest version of BB3 has 30B parameters, which exceeds the training resources available for this work.
  • Future improvements may benefit from increasing the language model size to address quality complaints.

7 Ethical Concerns

  • Giving a language model a persona encourages it to act human-like.
  • Language models may be incentivized to deny being a bot when asked.
  • Transparency with users is crucial; they should know they are interacting with a chatbot.
  • Participants were informed they were conversing with a bot in all experiments.
  • Users were advised not to share personally identifiable information with the bot.

Reader’s Opinion

  • This study used only a 3B-parameter language model due to resource constraints. To overcome this limitation, larger models could be used together with parameter-efficient training methods such as LoRA.
    • Specifically, the parameters of a larger language model (e.g., 30B or 175B) can be frozen while only the LoRA module parameters are updated during reinforcement learning (see the sketch below).
  • There are also limitations in reinforcement learning conducted with a restricted dataset. This issue could be addressed by augmenting the training dataset using an LLM.
    • However, in this case, additional research is needed on how LLMs can generate conversational datasets effectively without issues such as hallucination.
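
A minimal sketch of this suggestion using Hugging Face's peft library: load a frozen base model and train only LoRA adapter parameters during offline RL fine-tuning. The checkpoint name and target modules are placeholders and would need to match the chosen model.

```python
import torch
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint; a 30B- or 175B-class model would be wrapped the same way.
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/blenderbot-3B", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names depend on the architecture
)

# Base weights stay frozen; only the LoRA adapter parameters receive gradients,
# so an offline RL loss (GOLD or VaRMI) could be applied to a much larger model.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```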
