BharatGen Research Papers

BharatGen’s research addresses the foundational challenges of building AI for India’s linguistic and cultural diversity. Our work spans large language models, tokenization for Indic scripts, text-to-speech synthesis, speech translation and visual document understanding.

The papers below are published on arXiv and in peer-reviewed venues. They reflect the breadth of our work: from the core infrastructure of multilingual AI to applied systems for agriculture, document analysis and low-resource languages.

No.	Title	Description	Authors	Date	Venue
1.	Hi-SEMFLOW: Lie Algebra-Based Semantic Flow for Span-Level Informal Language Identification in Hindi	Proposes a Lie algebra-based semantic flow model for span-level informal language identification in Hindi.	Manikandan Ravikiran, Tanmay Tiwari, Vibhu Gupta, Rohit Saluja	May 2026	CHiPSAL @ LREC 2026
2.	UniPROT: Uniform Prototype Selection via Partial Optimal Transport with Submodular Guarantees	Addresses imbalanced domain mixtures for language model training via Partial Optimal Transport with uniform weightage across domains to mitigate class imbalance.	Prateek Chanda, Prayas Agrawal, Karthik S. Gurumoorthy, Ganesh Ramakrishnan, Bamdev Mishra, Pratik Jawanpuria	Apr 2026	AISTATS 2026
3.	Do Tokenizers Fail on Informal Hindi Expressions? Evidence from Static, Downstream, and Robustness Analyses	Studies how existing multilingual tokenizers handle informal Hindi expressions through static, downstream, and robustness analyses, revealing significant performance gaps on romanized Hindi text.	Manikandan Ravikiran, Tanmay Tiwari, Vibhu Gupta, Rakesh Prakash, Rohit Saluja, Shayan Mohanty	Mar 2026	LoResLM @ EACL 2026
4.	Post-ASR Correction in Hindi: Comparing Language Models and Large Language Models in Low-Resource Scenarios	Shows that smaller fine-tuned models like ByT5 and mT5 outperform larger LLMs for Hindi ASR post-correction, exhibiting inverse scaling under zero-shot in-context learning and generalizing across Marathi and Telugu.	Rishabh Kumar, Amrith Krishna, Ganesh Ramakrishnan, Preethi Jyothi	Mar 2026	EACL 2026
5.	Tables Decoded: DELTA for Structure, TARQA for Understanding	Converts table images into structured text for accurate table recognition and visual question answering.	Jahanvi Rajput, Dhruv Kudale, Saikiran Kasturi, Utkarsh Verma, Ganesh Ramakrishnan	Mar 2026	WACV 2026
6.	Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning	Proposes an adaptive SVD-based full fine-tuning method that constrains updates to orthogonal subspaces of prior tasks, achieving near-zero catastrophic forgetting without adding new parameters.	Nikhil Shivakumar Nayak, Krishnateja Killamsetty, Ligong Han, Abhishek Bhandwaldar, Prateek Chanda, Kai Xu, Hao Wang, Aldo Pareja, Oleg Silkin, Mustafa Eyceoz, Akash Srivastava	Jan 2026	ICLR 2026
7.	IndicParam: Benchmark to Evaluate LLMs on Low-Resource Indic Languages	A 13K-question benchmark across 11 low-resource Indic languages showing current LLMs perform poorly (maximum ~58%), highlighting limitations in cross-lingual understanding and reasoning.	Ayush Maheshwari, Kaushal Sharma, Vivek Patel, Aditya Maheshwari	Nov 2025	arXiv 2025
8.	HiLearners: Non-Native Spoken Hindi Error Correction	Addresses automatic error correction in spoken Hindi produced by non-native speakers, covering phonological, lexical, and grammatical errors across diverse language backgrounds.	Sourava Kumar Behera, Rohit Saluja	Dec 2025	IJCNLP-AACL 2025
9.	Towards Scene Text Recognition in Rainy Weather Conditions	Investigates the robustness of scene text recognition models in rainy conditions and proposes methods to mitigate rain-induced visual degradation in natural scene images.	Anandita Jamwal, Lalithya Koneti, Manikandan Ravikiran, Dinesh Singh, Rohit Saluja	Dec 2025	ICDAR 2025
10.	Multi-Feature Graph Convolution Network for Hindi OCR Verification	Uses graph convolutional networks with multiple visual features to verify and correct Hindi OCR outputs, improving reliability for complex Devanagari script recognition.	Shikhar Dubey, Krish Mittal, Sourava Kumar Behera, Manikandan Ravikiran, Nitin Kumar, Saurabh Shigwan, Rohit Saluja	Dec 2025	BHASHA Workshop 2025
11.	AnciDev: A Dataset for High-Accuracy Handwritten Text Recognition of Ancient Devanagari Manuscripts	Presents a high-quality annotated dataset of ancient Devanagari manuscript images to support the training and evaluation of handwritten text recognition systems for historical documents.	Vriti Sharma, Rajat Verma, Rohit Saluja	Dec 2025	BHASHA Workshop 2025
12.	INDRA: Iterative Difficulty Refinement Attention for MCQ Difficulty Estimation for Indic Languages	Proposes an iterative attention mechanism that progressively refines MCQ difficulty estimation for Indic languages, improving prediction accuracy through multiple refinement passes.	Manikandan Ravikiran, Rohit Saluja, Arnav Bhavsar	Dec 2025	BHASHA Workshop 2025
13.	Bandit Guided Submodular Curriculum for Adaptive Subset Selection	Recasts curriculum learning as a bandit over submodular selectors and introduces ONLINESUBMOD, a no-regret online greedy method that leverages validation rewards to outperform prior subset selection approaches.	Prateek Chanda, Prayas Agrawal, Saral Sureka, Lokesh Reddy Polu, Atharv Kshirsagar, Ganesh Ramakrishnan	Sep 2025	NeurIPS 2025
14.	TaskMixPGM: Task Mixtures via Probabilistic Graphical Modelling for Language Model Finetuning	Introduces TASKPGM, an MRF-based framework that optimizes task mixture proportions using behavioral divergences, providing a closed-form, theoretically grounded solution that improves LLM fine-tuning performance and interpretability.	Prateek Chanda, Saral Sureka, Parth Pratim Chatterjee, Krishnateja Killamsetty, Nikhil Shivakumar Nayak, Ganesh Ramakrishnan	Sep 2025	arXiv 2025
15.	FairPO: Robust Preference Optimization for Fair Multi-Label Learning	Introduces FairPO, a preference-driven and group-robust framework that enhances underperforming labels through DPO-style ranking corrections while preserving performance on other labels via constrained optimization and adaptive balancing.	Soumen Kumar Mondal, Prateek Chanda, Akshit Varmora, Ganesh Ramakrishnan	Sep 2025	NeurIPS 2025
16.	ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects	A 17K-question Hindi benchmark featuring graduate-level, culturally grounded Indian questions, showing current LLMs achieve only ~56% accuracy and highlighting gaps in domain-specific reasoning.	Ayush Maheshwari, Kaushal Sharma, Vivek Patel, Aditya Maheshwari	Aug 2025	arXiv 2025
17.	HiSlang-4.9k: A Benchmark Dataset for Hindi Slang Detection and Identification	Introduces a 4,900-instance annotated dataset for detecting and identifying slang expressions in Hindi text, supporting informal language understanding and social media content analysis.	Tanmay Tiwari, Vibhu Gupta, Manikandan Ravikiran, Rohit Saluja	Aug 2025	ICNLSP 2025
18.	The Art of Breaking Words: Rethinking Multilingual Tokenizer Design	Proposes improved pre-tokenization techniques for Indic scripts, reducing token-to-word ratios by approximately 6% and enhancing the efficiency of multilingual language models.	Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Piyush Sawarkar, Viraj Thakur, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan	Aug 2025	arXiv 2025
19.	Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering	Presents Krishi Sathi, an AI chatbot for Indian farmers that leverages multi-turn intent tracking and retrieval-augmented generation to provide accurate agricultural assistance in Hindi and English.	Abhay Vijayvargia, Ajay Nagpal, Kundeshwar Pundalik, Atharva Savarkar, Smita Gautam, Pankaj Singh, Rohit Saluja, Ganesh Ramakrishnan	Aug 2025	arXiv 2025
20.	CASSA: Context-Aware Self-Attention with Global Context Suppression and Relevance Modulation for MCQ Difficulty Estimation	Proposes a self-attention mechanism with global context suppression and relevance modulation to improve multiple-choice question difficulty estimation in educational assessment systems.	Manikandan Ravikiran, Tarun Sharma, Arnav Bhavsar, Rohit Saluja	Jul 2025	AIED 2025
21.	ScoreCLIQ: A Dynamic LLM-Based Framework for Item Difficulty Estimation	A dynamic framework for educational item difficulty estimation that combines BERT-based scoring with LLM-driven paraphrastic refinement using reinforcement learning techniques.	S. Sarkar, Manikandan Ravikiran, Rohit Saluja	Jul 2025	AIED 2025
22.	How Far Are We from Automatic Grading of Handwritten Cloze Form Questions?	Benchmarks AI models against human evaluators on handwritten cloze-form examination responses, highlighting significant challenges in achieving reliable and accurate automated grading.	Shrey Chandola, Manikandan Ravikiran, Rohit Saluja	Jul 2025	AIED 2025
23.	PARAM-1 BharatGen 2.9B Model	A 2.9B-parameter bilingual language model for Hindi and English, featuring morphology-aware tokenization and new evaluation benchmarks for Indic code-switching and multilingual language understanding.	Kundeshwar Pundalik, Piyush Sawarkar, Nihar Sahoo, Abhishek Shinde, Prateek Chanda, Vedant Goswami, Ajay Nagpal, Atul Singh, Viraj Thakur, Vijay Dewane, Aamod Thakur, Bhargav Patel, Smita Gautam, Bhagwan Panditi, Shyam Pawar, Madhav Kotcha, Suraj Racha, Saral Sureka, Pankaj Singh, Rishi Bal, Rohit Saluja, Ganesh Ramakrishnan	Jul 2025	arXiv 2025
24.	A2TTS: TTS for Low Resource Indian Languages	A diffusion-based, speaker-conditioned text-to-speech system for multiple Indian languages, enabling high-quality zero-shot speech generation for previously unseen speakers.	Ayush Singh Bhadoriya, Abhishek Nikunj Shinde, Isha Pandey, Ganesh Ramakrishnan	Jul 2025	arXiv 2025
25.	Impact of Duration Prediction on Speaker-Specific TTS for Indian Languages	Examines how different duration prediction strategies influence speech naturalness, intelligibility, and speaker identity preservation in speaker-specific text-to-speech systems for low-resource Indian languages.	Isha Pandey, Pranav Gaikwad, Amruta Parulekar, Ganesh Ramakrishnan	Jul 2025	arXiv 2025
26.	DRISHTIKON: Visual Grounding at Multiple Granularities in Documents	Locates specific text regions in complex multilingual document images at block, line, word, and point levels, supported by a new benchmark dataset for document visual grounding.	Badri Vishal Kasuba, Parag Chaudhuri, Ganesh Ramakrishnan	Jun 2025	arXiv 2025
27.	Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights	Demonstrates that extremely low-resource Indian languages benefit significantly from tokenizers trained on related higher-resource languages, supported by a comprehensive evaluation across 17 Indic languages.	N J Karthika, Maharaj Brahma, Rohit Saluja, Ganesh Ramakrishnan, Maunendra Sankar Desarkar	Jun 2025	arXiv 2025
28.	LexGen: Domain-aware Multilingual Lexicon Generation	Creates multilingual dictionaries of domain-specific terms for Indian languages, improving technical vocabulary coverage and language resource development in low-resource settings.	Ayush Maheshwari, Atul Kumar Singh, N J Karthika, Krishnakant Bhatt, Preethi Jyothi, Ganesh Ramakrishnan	May 2025	ACL 2025
29.	Language Translation and Change of Accent for Speech-to-Speech Using Diffusion Models	Reformulates speech translation and accent adaptation as a unified conditional generation task, producing target-language speech with an adapted accent directly from source phoneme representations.	Abhishek Mishra, Ritesh Sur Chowdhury, Vartul Bahuguna, Isha Pandey, Ganesh Ramakrishnan	May 2025	arXiv 2025
30.	MorphTok: Morphologically Grounded Tokenization for Indian Languages	Introduces morphology-aware preprocessing and Constrained BPE for Indic scripts, along with EvalTok, a human-centered evaluation metric for assessing tokenization quality.	Maharaj Brahma, N J Karthika, Atul Singh, Devaraj Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar	Apr 2025	TokShop 2025