Hallucination Evaluation Benchmark
Hallucination evaluation benchmarks aim to systematically assess the tendency of large language models (LLMs) to generate factually incorrect or unsupported information. Current research focuses on building comprehensive benchmarks that span diverse tasks and modalities (text, image, audio-visual), alongside detection techniques such as contrastive learning on internal LLM states and probabilistic frameworks based on belief propagation. These benchmarks are crucial for advancing LLM development because they provide standardized metrics for evaluating and improving model reliability, ultimately supporting the safe and effective deployment of LLMs in real-world applications.
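To make the idea of a standardized metric concrete, the sketch below computes a simple hallucination rate over a toy benchmark: each item pairs a prompt with reference facts, and an answer counts as hallucinated if it mentions none of them. The `BenchmarkItem` structure, the `generate` stub, and the containment-based support check are hypothetical illustrations, not the methodology of any paper listed here; real benchmarks use far more careful grounding judgments.

```python
# Minimal sketch of a hallucination-rate metric over a small benchmark.
# All names and the string-containment "support" check are illustrative
# placeholders, not an actual benchmark's scoring protocol.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    prompt: str                  # question posed to the model
    supported_facts: List[str]   # reference strings a grounded answer should mention


def is_supported(answer: str, supported_facts: List[str]) -> bool:
    """Crude grounding check: the answer mentions at least one reference fact."""
    answer_lower = answer.lower()
    return any(fact.lower() in answer_lower for fact in supported_facts)


def hallucination_rate(items: List[BenchmarkItem],
                       generate: Callable[[str], str]) -> float:
    """Fraction of benchmark items whose generated answer is unsupported."""
    unsupported = sum(
        not is_supported(generate(item.prompt), item.supported_facts)
        for item in items
    )
    return unsupported / len(items)


if __name__ == "__main__":
    # Toy benchmark and a stub "model" standing in for a real LLM call.
    benchmark = [
        BenchmarkItem("Who wrote 'Pride and Prejudice'?", ["Jane Austen"]),
        BenchmarkItem("What is the boiling point of water at sea level?", ["100", "212"]),
    ]
    stub_model = lambda prompt: "Charles Dickens wrote it."  # deliberately wrong
    print(f"Hallucination rate: {hallucination_rate(benchmark, stub_model):.2f}")
```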
Papers
Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models
Bei Yan, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen
Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models
Yichen Sun, Zhixuan Chu, Zhan Qin, Kui Ren