LLM Behavior
Research on Large Language Model (LLM) behavior focuses on understanding and controlling model outputs, particularly with respect to safety and reliability. Current efforts include methods for interpreting LLM decision-making, such as meta-models that analyze a model's internal activations, and control mechanisms such as activation steering and prompt baking that mitigate harmful or undesirable behavior. This work is central to building trustworthy and beneficial LLMs; it also addresses concerns about the replicability of evaluation methodologies and the need for robust techniques that support responsible deployment across applications.
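To make the activation-steering idea concrete, here is a minimal sketch of how a fixed steering vector can be added to a layer's output via a PyTorch forward hook. The `ToyBlock`, `steering_vector`, and `alpha` below are illustrative placeholders, not from any of the listed papers; a real setup would hook a layer of an actual LLM and derive the vector from contrasting activations (for example, mean activations on one class of prompts minus another).

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block; in practice you would hook a
# real model's residual-stream module (e.g., a decoder layer).
class ToyBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.linear(x)  # residual connection

d_model = 16
block = ToyBlock(d_model)

# Hypothetical steering vector; real ones are typically computed from
# contrasting activation statistics rather than sampled at random.
steering_vector = torch.randn(d_model)
alpha = 4.0  # steering strength, a tunable hyperparameter

def steering_hook(module, inputs, output):
    # Shift the block's output activations along the steering direction.
    # Returning a tensor from a forward hook replaces the module output.
    return output + alpha * steering_vector

handle = block.register_forward_hook(steering_hook)

x = torch.randn(2, 5, d_model)  # (batch, sequence, hidden)
steered = block(x)
handle.remove()                 # detach the hook to restore default behavior
unsteered = block(x)
print((steered - unsteered).norm())  # nonzero: the hook changed activations
```

The appeal of this style of intervention is that it requires no weight updates: the hook can be attached or removed at inference time, which is why activation steering is often studied as a lightweight alternative to fine-tuning for behavior control.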
Papers
February 27, 2024
February 12, 2024
February 9, 2024
February 2, 2024
January 31, 2024
December 1, 2023
April 2, 2023