Indian Language

Research on Indian languages focuses on developing and evaluating natural language processing (NLP) models for the diverse linguistic landscape of India, addressing the challenges posed by low-resource languages and significant dialectal variation. Current efforts concentrate on adapting and fine-tuning multilingual transformer models, such as BERT and its variants, for tasks like machine translation, question answering, and sentiment analysis, alongside developing new benchmarks and datasets to facilitate robust evaluation. This work is crucial for bridging the digital divide, enabling wider access to technology and information in India, and advancing the broader field of multilingual NLP.

Papers

January 11, 2024

End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2
Aniket Tathe, Anand Kamble, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra
Automatic Speech Recognition Neural Machine Translation Text to Speech Cross Lingual Indian Language Wav2vec U Annotated End State

January 8, 2024

Overview of the 2023 ICON Shared Task on Gendered Abuse Detection in Indic Languages
Aatman Vaidya, Arnav Arora, Aditya Joshi, Tarunima Prabhakar
Related Task Indian Language Novel Dataset Hateful Content

January 4, 2024

L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages
Aishwarya Mirashi, Srushti Sonavane, Purva Lingayat, Tejas Padhiyar, Raviraj Joshi
Indian Language News Article Multilingual BERT News Dataset L3Cube MahaSocialNER Headline Classification

December 23, 2023

Multilingual Bias Detection and Mitigation for Indian Languages
Ankita Maity, Anubhav Sharma, Rudra Dhar, Tushar Abhishek, Manish Gupta, Vasudeva Varma
Indian Language Faulty Negative Mitigation Wikipedia Article Multilingual Transformer BIAS Detection Multilingual Bias

December 15, 2023

IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages
Saiful Haq, Ashutosh Sharma, Pushpak Bhattacharyya
Indian Language Multilingual Dataset Neural Information Retrieval Monolingual Retrieval Neural Information

December 2, 2023

Enabling Quantum Natural Language Processing for Hindi Language
Naman Srivastava, Gaurang Belekar, Sunil Saumya, Aswath Babu H
Natural Language Processing Indian Language Quantum Natural Language Processing

November 29, 2023

Mukhyansh: A Headline Generation Dataset for Indic Languages
Lokesh Madasu, Gopichand Kanumolu, Nirmal Surange, Manish Shrivastava
Natural Language Processing Indian Language Headline Generation Article Headline Pair

November 15, 2023

Assessing Translation capabilities of Large Language Models involving English and Indian Languages
Vandan Mujadia, Ashok Urlana, Yash Bhaskar, Penumalla Aditya Pavani, Kukkapalli Shravya, Parameswari Krishnamurthy, Dipti Misra Sharma
Large Language Model Machine Translation Indian Language Generative Large Language Model Translation Capability

October 26, 2023

Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge
Tanel Alumäe, Jiaming Kong, Daniil Robnikov
Data Augmentation Automatic Speech Recognition Challenge Task Low Resource Indian Language Prefix Tuning

October 23, 2023

SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras
Nithya R, Malavika S, Jordan F, Arjun Gangwar, Metilda N J, S Umesh, Rithik Sarab, Akhilesh Kumar Dubey, Govind Divakaran, Samudra Vijaya K, Suryakanth V Gangashetty
Speech Analysis Speech Data Indian Language NLP Community ASR System Tamil Language

October 15, 2023

MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages
Sayan Mahapatra, Debtanu Datta, Shubham Soni, Adrijit Goswami, Saptarshi Ghosh
New Benchmark Machine Translation View Translation Indian Language Legal Document Machine Translation Model Vision Science JUSTICE Legal Corpus

October 12, 2023

Structural analysis of Hindi online handwritten characters for character recognition
Anand Sharma, A. G. Ramakrishnan
Indian Language Character Recognition Stroke Patient Handwritten Character Structural Analysis Character Classifier

October 3, 2023

Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages
Ananya Joshi, Raviraj Joshi
Hate Speech Hate Speech Detection Indian Language Offensive Content Offensive Language Detection Wire Harness Low Resource Indian Language Sentence Transformer

September 23, 2023

Hindi to English: Transformer-Based Neural Machine Translation
Kavit Gangar, Hardik Ruparel, Shreyas Lele
Machine Translation Neural Machine Translation Indian Language Back Translation Hindi English

September 19, 2023

Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi
Md Nishat Raihan, Dhiman Goswami, Antara Mahmud
BERT Model Indian Language Bangla Text Code Mixed Distilled BERT

August 19, 2023

Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi
Maithili Sabane, Onkar Litake, Aman Chadha
Data Set Natural Language Question Answering Low Resource Language Yes No Question Indian Language Question Answering System Language Barrier

July 28, 2023

Multilingual Tourist Assistance using ChatGPT: Comparing Capabilities in Hindi, Telugu, and Kannada
Sanjana Kolar, Rohit Kumar
Language Model ChatGPT Generated Conversation Indian Language Translation Quality Human Translation Machine Translated Artificial Intelligence Language Model

July 11, 2023

ISLTranslate: Dataset for Translating Indian Sign Language
Abhinav Joshi, Susmit Agrawal, Ashutosh Modi
Data Set Sign Language Indian Language Sign Language Translation End to End Sign Language

Indian Language

Papers

End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2

Overview of the 2023 ICON Shared Task on Gendered Abuse Detection in Indic Languages

L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Multilingual Bias Detection and Mitigation for Indian Languages

IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages

Enabling Quantum Natural Language Processing for Hindi Language

Mukhyansh: A Headline Generation Dataset for Indic Languages

Assessing Translation capabilities of Large Language Models involving English and Indian Languages

Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge

SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras

MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages

Structural analysis of Hindi online handwritten characters for character recognition

Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages

Hindi to English: Transformer-Based Neural Machine Translation

Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi

Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages

Cyberbullying Detection for Low-resource Languages and Dialects: Review of the State of the Art

Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi

Multilingual Tourist Assistance using ChatGPT: Comparing Capabilities in Hindi, Telugu, and Kannada

ISLTranslate: Dataset for Translating Indian Sign Language