BharatGen Research Papers

BharatGen’s research addresses the foundational challenges of building AI for India’s linguistic and cultural diversity. Our work spans large language models, tokenization for Indic scripts, text-to-speech synthesis, speech translation and visual document understanding.

The papers below are published on arXiv and in peer-reviewed venues. They reflect the breadth of our work: from the core infrastructure of multilingual AI to applied systems for agriculture, document analysis and low-resource languages.

No. | Paper | Date | Read Paper
1.

Tables Decoded: DELTA for Structure, TARQA for Understanding
Published at WACV 2026. Converts table images to structured text for precise table recognition and visual question answering.

Mar 2026

2.

The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
Proposes improved pre-tokenization for Indic scripts, reducing token-to-word ratios by ~6% and improving multilingual model efficiency.

Aug 2025

3.

Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering
Presents Krishi Sathi, an AI chatbot for Indian farmers using multi-turn intent tracking and retrieval-augmented generation in Hindi and English.

Aug 2025

4.

Impact of Duration Prediction on Speaker-Specific TTS for Indian Languages
Examines how different duration prediction strategies affect speech naturalness and speaker identity in low-resource Indian language TTS.

Jul 2025

5.

A2TTS: TTS for Low Resource Indian Languages
A diffusion-based, speaker-conditioned TTS system for multiple Indian languages with zero-shot generation for unseen speakers.

Jul 2025

6.

PARAM-1 BharatGen 2.9B Model
A 2.9B parameter bilingual language model for Hindi and English, with morphology-aware tokenization and new evaluation benchmarks for Indic code-switching.

Jul 2025

7.

DRISHTIKON: Visual Grounding at Multiple Granularities in Documents
Locates specific text regions in complex multilingual document images at block, line, word, and point levels, with a new benchmark dataset.

Jun 2025

8.

Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights
Shows that extremely low-resource Indian languages benefit significantly from tokenizers trained on related higher-resource languages.

Jun 2025

9.

Language Translation and Change of Accent for Speech-to-Speech Using Diffusion Models
Reformulates speech translation and accent adaptation as a single conditional generation task, generating target-language speech with adapted accent from source phonemes.

May 2025

10.

MorphTok: Morphologically Grounded Tokenization for Indian Languages
Introduces morphology-aware preprocessing and Constrained BPE for Indic scripts, with a human evaluation metric called EvalTok.

Apr 2025
