Unknown Language
Research on unknown languages focuses on developing and evaluating computational methods to analyze and process text and speech across a wide range of languages, particularly those with limited digital resources. Current efforts concentrate on improving large language models (LLMs) for multilingual tasks, including translation, question answering, and toxicity detection, often employing techniques like self-supervised learning, preference tuning, and multilingual feedback mechanisms. This work is crucial for advancing natural language processing capabilities globally, enabling more equitable access to technology and fostering deeper cross-cultural understanding within the scientific community and various practical applications.
Papers
LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages
Milind Agarwal, Md Mahfuz Ibn Alam, Antonios Anastasopoulos
Beyond Shared Vocabulary: Increasing Representational Word Similarities across Languages for Multilingual Machine Translation
Di Wu, Christof Monz
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov
Scaling Speech Technology to 1,000+ Languages
Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli
How do languages influence each other? Studying cross-lingual data sharing during LM fine-tuning
Rochelle Choenni, Dan Garrette, Ekaterina Shutova
Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale
Marta R. Costa-jussà, Pierre Andrews, Eric Smith, Prangthip Hansanti, Christophe Ropers, Elahe Kalbassi, Cynthia Gao, Daniel Licht, Carleigh Wood
PrOnto: Language Model Evaluations for 859 Languages
Luke Gessler
PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India
Ashok Urlana, Pinzhen Chen, Zheng Zhao, Shay B. Cohen, Manish Shrivastava, Barry Haddow
Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages
Chunlan Ma, Ayyoob ImaniGooghari, Haotian Ye, Renhao Pei, Ehsaneddin Asgari, Hinrich Schütze
A Crosslingual Investigation of Conceptualization in 1335 Languages
Yihong Liu, Haotian Ye, Leonie Weissweiler, Philipp Wicke, Renhao Pei, Robert Zangenfeind, Hinrich Schütze