LLM Behavior

Research on Large Language Model (LLM) behavior focuses on understanding and controlling model outputs, particularly with respect to safety and reliability. Current efforts include methods for interpreting LLM decision-making, such as meta-models that analyze internal activations, alongside control mechanisms such as activation steering and prompt baking that mitigate harmful or undesirable behaviors. This work is central to building trustworthy and beneficial LLMs: it addresses concerns about the replicability of evaluation methodologies and the need for robust techniques that support responsible deployment across applications.
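The core arithmetic behind activation steering can be sketched in miniature. This is a hedged illustration, not any paper's method: real implementations hook into a transformer layer's residual stream, whereas here hidden activations are simulated as plain NumPy vectors. The dimension `d`, the scale `alpha`, and the helper `apply_steering` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden dimension

# Simulated mean activations over "harmless" vs. "harmful" prompts.
acts_harmless = rng.normal(0.0, 1.0, size=(16, d)) + 1.0
acts_harmful = rng.normal(0.0, 1.0, size=(16, d)) - 1.0

# Steering vector: difference of means, normalized, pointing toward
# the harmless cluster.
steer = acts_harmless.mean(axis=0) - acts_harmful.mean(axis=0)
steer /= np.linalg.norm(steer)

def apply_steering(hidden, vector, alpha=2.0):
    """Shift a hidden activation along the steering direction."""
    return hidden + alpha * vector

h = rng.normal(0.0, 1.0, size=d)
h_steered = apply_steering(h, steer)

# Because steer is unit-norm, the projection onto the steering
# direction increases by exactly alpha.
print(h_steered @ steer > h @ steer)  # True
```

In a full-scale setting the same vector would be added to one layer's activations during generation (e.g. via a forward hook), with `alpha` tuned to trade off behavioral shift against output fluency.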

Papers