Model Behavior
Research on large language model (LLM) behavior focuses on understanding and mitigating undesirable outputs, such as toxicity or bias, while improving desirable traits like helpfulness and accuracy. Current efforts investigate post-hoc safety alignment methods, analyze how decoding strategies and persona assignment shape model responses, and develop techniques for interpreting and editing model behavior through targeted interventions or data manipulation. These studies are crucial for building safer and more reliable LLMs, informing both the development of trustworthy AI systems and our understanding of how these complex models work.
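To make "targeted interventions" concrete, the sketch below shows one generic form such an intervention can take: dampening a chosen set of hidden units in one transformer layer at inference time via a forward hook. This is an illustrative example only, not the method of any paper listed below; the model name, layer index, neuron indices, and scaling factor are hypothetical placeholders.

```python
# Minimal sketch of an activation-level intervention (illustrative, not any
# listed paper's method): scale down selected MLP units in one layer of GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

neuron_ids = [12, 305, 1780]   # hypothetical units to suppress
scale = 0.1                    # dampening factor

def dampen(module, inputs, output):
    # MLP output has shape (batch, seq_len, hidden); scale the chosen units.
    output[..., neuron_ids] = output[..., neuron_ids] * scale
    return output

# Attach the hook to one MLP sub-module; the layer choice is arbitrary here.
hook = model.transformer.h[6].mlp.register_forward_hook(dampen)

prompt = "The new neighbors are"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

hook.remove()  # restore the unmodified model
```

Comparing generations with and without the hook attached is the basic experimental pattern for measuring how such an intervention changes model behavior.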
Papers
Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, Pau Rodríguez
Helpful assistant or fruitful facilitator? Investigating how personas affect language model behavior
Pedro Henrique Luz de Araujo, Benjamin Roth