Model Behavior

Research on large language model (LLM) behavior focuses on understanding and mitigating undesirable outputs, such as toxicity and bias, while improving desirable traits like helpfulness and accuracy. Current efforts investigate post-hoc safety alignment methods, analyze how decoding strategies and persona assignment shape model responses, and develop techniques for interpreting and editing model behavior through targeted interventions or data manipulation. These studies are central to building safer and more reliable LLMs, informing both the development of trustworthy AI systems and our understanding of these complex models.
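
As a minimal sketch of two of the levers mentioned above, the snippet below contrasts greedy decoding with nucleus sampling and prepends a simple persona prefix to the same prompt, using the Hugging Face transformers generation API. The model name ("gpt2"), the persona text, and the decoding parameters are illustrative assumptions, not taken from any particular paper.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # illustrative choice; any causal LM works here
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Hypothetical persona prefix used to nudge the model's tone.
    persona = "You are a cautious assistant who avoids unsupported claims.\n"
    question = "Question: Is this new supplement safe to take daily?\nAnswer:"

    for label, prompt in [("no persona", question), ("with persona", persona + question)]:
        inputs = tokenizer(prompt, return_tensors="pt")
        # Greedy decoding: deterministic, picks the highest-probability token each step.
        greedy = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                                pad_token_id=tokenizer.eos_token_id)
        # Nucleus sampling: same prompt, different decoding strategy, different behavior.
        sampled = model.generate(**inputs, max_new_tokens=40, do_sample=True,
                                 top_p=0.9, temperature=0.8,
                                 pad_token_id=tokenizer.eos_token_id)
        print(f"[{label} | greedy]  {tokenizer.decode(greedy[0], skip_special_tokens=True)}")
        print(f"[{label} | sampled] {tokenizer.decode(sampled[0], skip_special_tokens=True)}")

Comparing the four outputs gives a concrete sense of how much response content can shift from decoding settings and prompt framing alone, before any weight-level intervention is applied.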

Papers