Paper ID: 2501.02518
CHAIR - Classifier of Hallucination as Improver
Ao Sun
This paper presents a supervised method for detecting hallucinations in large language models. By analyzing token scores (logits) across the layers of the LLaMA model, we derive a set of features, including the maximum, minimum, mean, standard deviation, and slope, which is kept deliberately small to reduce overfitting. We use logistic regression for classification and validate the model on the TruthfulQA and MMLU datasets. The results demonstrate significant performance gains, especially in zero-shot scenarios, highlighting the method's effectiveness and potential for generalization.
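As a rough illustration of the pipeline the abstract describes, the sketch below summarizes one token's per-layer scores into the five named statistics and fits a logistic regression on them. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`layer_features`, `fit_logistic`, `predict_proba`), the least-squares definition of "slope" over the layer index, the use of the population standard deviation, the plain gradient-descent fit, and the toy score profiles and labels are all hypothetical and chosen only to make the example self-contained.

```python
import math
from statistics import fmean, pstdev


def layer_features(scores):
    """Summarize one token's per-layer scores as [max, min, mean, std, slope]."""
    n = len(scores)
    if n < 2:
        raise ValueError("need scores from at least two layers")
    mean = fmean(scores)
    # Assumption: "slope" is the least-squares slope of score vs. layer index.
    x_mean = (n - 1) / 2
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (s - mean) for i, s in enumerate(scores))
    return [max(scores), min(scores), mean, pstdev(scores), sxy / sxx]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Batch gradient descent on the mean logistic loss; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Toy per-layer score profiles (made up for illustration, not real LLaMA logits).
# The labels are arbitrary: 1 marks the class the classifier should flag.
profiles = [
    [0.1, 0.4, 0.7, 1.0],
    [0.2, 0.5, 0.8, 1.1],
    [0.1, 0.1, 0.2, 0.1],
    [0.2, 0.1, 0.1, 0.2],
]
labels = [0, 0, 1, 1]

X = [layer_features(p) for p in profiles]
w, b = fit_logistic(X, labels)
probs = [predict_proba(w, b, x) for x in X]
```

Restricting the input to five summary statistics per token keeps the classifier's parameter count tiny, which is consistent with the abstract's stated aim of reducing overfitting.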
Submitted: Jan 5, 2025