LLM Evaluation
Evaluating large language models (LLMs) focuses on establishing their reliability, safety, and suitability for various applications. Current research emphasizes developing robust and comprehensive evaluation frameworks, moving beyond simple accuracy metrics to assess aspects like data privacy, bias, explainability, and the ability to combine different skills. This rigorous evaluation is crucial for responsible LLM development and deployment, informing both the scientific understanding of these models and their safe integration into real-world applications across diverse fields.
Papers
MERA: A Comprehensive LLM Evaluation in Russian
Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Katerina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Denis Dimitrov, Alexander Panchenko, Sergei Markov
The Critique of Critique
Shichao Sun, Junlong Li, Weizhe Yuan, Ruifeng Yuan, Wenjie Li, Pengfei Liu