Answer Correctness

Answer correctness in large language models (LLMs) and vision-language models (VLMs) is a critical area of research focusing on improving the reliability and trustworthiness of AI-generated responses. Current efforts concentrate on developing methods to assess answer reliability, including techniques that analyze consistency across multiple model outputs or decompose complex questions into simpler sub-questions. These advancements aim to mitigate issues like hallucination and overconfidence, ultimately leading to more accurate and dependable AI systems for various applications. The improved evaluation of answer correctness is crucial for advancing the field and ensuring responsible deployment of these powerful technologies.

Papers

February 17, 2024

PEDANTS: Cheap but Effective and Interpretable Answer Equivalence
Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber
Text Generation Question Answering Automatic Evaluation Answer Correctness

January 24, 2024

CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering
Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Boyd-Graber
Large Language Model Question Answering Open Domain Question Answering Human Judgment Answer Correctness

January 11, 2024

The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models
Matthew Renze, Erhan Guven
Chain of Thought Prompt Engineering Side Chain Complementary Benefit Thought Reasoning Problem Solving Step by Step Reasoning Answer Correctness

November 29, 2023

October 20, 2023

Self-Consistency of Large Language Models under Ambiguity
Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason Hoelscher-Obermaier, Jacob Pfau
Large Language Model Self Consistency Structural Ambiguity Answer Correctness

October 4, 2023

Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning
Murong Yue, Jie Zhao, Min Zhang, Liang Du, Ziyu Yao
Large Language Model Mixture Component Reasoning Benchmark TF Cascade Answer Correctness Mental Representation

August 30, 2023

Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness
Jiuhai Chen, Jonas Mueller
Large Language Model Language Model Uncertainty Quantification Top Two Answer Confidence Score Answer Correctness

July 24, 2023

Comparative Analysis of Drug-GPT and ChatGPT LLMs for Healthcare Insights: Evaluating Accuracy and Relevance in Patient and HCP Contexts
Giorgos Lysandrou, Roma English Owen, Kirsty Mursec, Grant Le Brun, Elizabeth A. L. Fairley
Language Model Generative Pre Trained Transformer Non Negative Textual Response Answer Correctness Accuracy Measurement Real World Healthcare ChatGPT Like LLM

April 27, 2023

Federated Prompting and Chain-of-Thought Reasoning for Improving LLMs Answering
Xiangyang Liu, Tianqi Pang, Chenyou Fan
Large Language Model Complex Reasoning Question Classification Answer Correctness Federated Prompt Relevant Question

March 31, 2023

Asking Better Questions -- The Art and Science of Forecasting: A mechanism for truer answers to high-stakes questions
Emily Dardaman, Abhishek Gupta
Artificial Intelligence Yes No Question State of the Art Forecasting Functional Mechanism Forecasting Performance Social Science Answer Correctness AI Benchmark

October 21, 2022

Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning
Oyvind Tafjord, Bhavana Dalvi Mishra, Peter Clark
Complex Reasoning Yes No Question Textual Entailment Answer Correctness Truthful Incentive Mechanism Backward Chaining

June 27, 2022

Consistency-preserving Visual Question Answering in Medical Imaging
Sergio Tascon-Morales, Pablo Márquez-Neila, Raphael Sznitman
Visual Question Answering Medical Imaging Answer Correctness

January 26, 2022

SCAI-QReCC Shared Task on Conversational Question Answering
Svitlana Vakulenko, Johannes Kiesel, Maik Fröbe
Question Answering Shared Task Answer Generation Conversational Search Answer Correctness Conversational QA

December 16, 2021

DREAM: Improving Situational QA by First Elaborating the Situation
Yuling Gu, Bhavana Dalvi Mishra, Peter Clark
Language Model Cognitive Science Answer Correctness Situation Understanding Dream Report Scene Description Low Complexity Explanation

September 26, 2020

Techniques to Improve Q&A Accuracy with Transformer-based models on Large Complex Documents
Chejui Liao, Tabish Maniar, Sravanajyothi N, Anantha Sharma
Transformer Based Large Corpus Transformer Based Model Ticket BERT Barzilai Borwein Technique Text Analysis Answer Correctness Small Corpus

Answer Correctness

Papers

PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering

The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models

How to Build an AI Tutor that Can Adapt to Any Course and Provide Accurate Answers Using Large Language Model and Retrieval-Augmented Generation

Universal Self-Consistency for Large Language Model Generation

Self-Consistency of Large Language Models under Ambiguity

Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning

Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness

Comparative Analysis of Drug-GPT and ChatGPT LLMs for Healthcare Insights: Evaluating Accuracy and Relevance in Patient and HCP Contexts

Federated Prompting and Chain-of-Thought Reasoning for Improving LLMs Answering

Asking Better Questions -- The Art and Science of Forecasting: A mechanism for truer answers to high-stakes questions

Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning

Consistency-preserving Visual Question Answering in Medical Imaging

SCAI-QReCC Shared Task on Conversational Question Answering

DREAM: Improving Situational QA by First Elaborating the Situation

Techniques to Improve Q&A Accuracy with Transformer-based models on Large Complex Documents