India is home to one of the world’s most linguistically diverse populations—yet, until now, artificial intelligence (AI) has largely ignored this fact. Today, we’re proud to introduce Param-1, a groundbreaking 2.9 billion parameter bilingual language model designed with India at its core.
The Problem: AI’s Language Barrier
Modern large language models like GPT-4, LLaMA, and others are impressive—but they’re primarily trained on English. Consider this: Meta’s LLaMA dedicates just 0.01% of its training data to Indic languages, despite India accounting for nearly 18% of the global population.
This creates major accessibility issues for the 1.4+ billion Indians who speak 22 official languages and 100+ dialects:
- Poor comprehension of Indian languages and contexts
- Culturally irrelevant or biased outputs
- Ineffective tokenization of Indian scripts
- Lack of support for regional and government use cases
Introducing Param-1: AI Built for India
Param-1 is not a retrofitted Western model—it’s been built from the ground up for India’s multilingual reality. A bilingual Hindi-English model, Param-1 understands and responds naturally in both languages, providing equitable access to high-quality AI tools.
Core Design Principles of Param-1
- Equitable Language Representation
Param-1 allocates roughly 37% of its training data to Hindi, a significant leap from the negligible share in most existing models: out of 7.5 trillion training tokens, 2.77 trillion are dedicated to Hindi content.
- Tokenization Fairness
Param-1 uses a custom SentencePiece tokenizer with a 128K vocabulary, optimized for Indian scripts and morphology. This dramatically improves processing for Hindi and related languages compared to Western-trained tokenizers.
- Culturally Aligned Evaluation
The model is evaluated on India-specific benchmarks, including code-mixed reasoning and socio-linguistic robustness, not just English-centric tasks.
Technical Architecture: Built for Performance
Param-1 follows a decoder-only causal language model architecture similar to GPT and LLaMA, with optimizations for multilingual performance and India-specific content:
| Architecture attribute | Value |
| --- | --- |
| model_type | causal-language-model |
| hidden_size | 2048 |
| intermediate_size | 7168 |
| max_position_embeddings | 2048 |
| num_of_attention_heads | 16 |
| rope_theta | 10000 |
| num_of_decoder_blocks | 32 |
| seq_length | 2048 |
| num_of_key_value_heads | 8 |
| activation_function | SwiGLU |
| attention | Grouped-query attention |
| precision | bf16-mixed |
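To make the arithmetic concrete, here is a minimal Python sketch (illustrative only, not BharatGen's training code) that restates the table and shows how grouped-query attention pairs the 16 query heads with 8 shared key-value heads:

```python
# Param-1 hyperparameters, restated from the table above.
config = {
    "hidden_size": 2048,
    "intermediate_size": 7168,       # SwiGLU feed-forward width
    "num_attention_heads": 16,
    "num_key_value_heads": 8,        # grouped-query attention (GQA)
    "num_decoder_blocks": 32,
    "max_position_embeddings": 2048,
    "rope_theta": 10000,
}

head_dim = config["hidden_size"] // config["num_attention_heads"]                # 2048 / 16 = 128
queries_per_kv = config["num_attention_heads"] // config["num_key_value_heads"]  # 16 / 8 = 2

# With GQA, each pair of query heads shares one K/V head, roughly halving
# KV-cache memory relative to full multi-head attention at this size.
print(head_dim, queries_per_kv)  # -> 128 2
```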
Three-Phase Pre-training Strategy
Phase 1: Bootstrap Training

- 7.5T tokens (4.73T English + 2.77T Hindi)
- Trained on 64 nodes with 8× NVIDIA H100 GPUs each
- Datasets: FineWeb-Edu, DCLM, Nemotron-CC, Sangraha, Books OCR, Udaan
- Data curation techniques: FastText language identification, toxic content removal, Unicode normalization, deduplication, PII removal
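For a flavor of what such a pipeline involves, here is a toy Python sketch of two of the listed steps, FastText language identification (using the public lid.176.bin model) and exact deduplication by content hash; the actual BharatGen pipeline is necessarily far more elaborate:

```python
import hashlib
import unicodedata

import fasttext  # pip install fasttext; download lid.176.bin from fasttext.cc

lang_id = fasttext.load_model("lid.176.bin")

def curate(docs, keep_langs=("hi", "en"), min_confidence=0.8):
    """Toy curation pass: Unicode normalization, language filter, exact dedup."""
    seen = set()
    for doc in docs:
        text = unicodedata.normalize("NFC", doc).replace("\n", " ").strip()
        labels, probs = lang_id.predict(text)
        lang = labels[0].removeprefix("__label__")
        if lang not in keep_langs or probs[0] < min_confidence:
            continue  # drop text that is not confidently Hindi or English
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:  # skip exact duplicates
            seen.add(digest)
            yield text
```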
Phase 2: Factual Consistency

- Focused on improving knowledge retention and factual correctness
- Uses enhanced monolingual and bilingual corpora with Indian context emphasis
Phase 3: Long-Context Adaptation
- 500B tokens (250B English + 250B Hindi)
- Document lengths range from 2K to 16K+ tokens
- Strengthens long-range dependency modeling and retention over extended contexts
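One plausible way to organize such a curriculum (the thresholds below are illustrative assumptions, not published details) is to bucket documents by token length before sampling:

```python
from collections import Counter

def length_bucket(num_tokens: int) -> str:
    """Assign a document to an illustrative long-context training bucket."""
    if num_tokens < 2_048:
        return "<2K"
    if num_tokens < 8_192:
        return "2K-8K"
    if num_tokens < 16_384:
        return "8K-16K"
    return "16K+"

# Hypothetical per-document token counts from a corpus scan.
token_counts = [1_500, 4_096, 12_000, 20_000]
print(Counter(length_bucket(n) for n in token_counts))
# -> Counter({'<2K': 1, '2K-8K': 1, '8K-16K': 1, '16K+': 1})
```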
Custom Tokenizer Advantage
The BharatGen tokenizer outperforms popular alternatives in Indian language efficiency:
| Language | BharatGen-128K | LLaMA | Qwen |
| --- | --- | --- | --- |
| Hindi | 1.43 | 2.65 | 4.66 |

Lower scores mean better token efficiency: fewer tokens are needed to encode the same text.
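The figures read like a fertility metric, i.e. average tokens per word; assuming that interpretation, here is a small sketch of how to compute it with SentencePiece (the model path is a placeholder):

```python
import sentencepiece as spm

def fertility(sp: spm.SentencePieceProcessor, text: str) -> float:
    """Average number of tokens per whitespace-separated word."""
    tokens = sp.encode(text, out_type=str)
    return len(tokens) / len(text.split())

# "param1.model" is a placeholder path for the BharatGen tokenizer file.
sp = spm.SentencePieceProcessor(model_file="param1.model")
print(fertility(sp, "भारत एक विविधतापूर्ण देश है"))  # lower is better
```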
Benchmark Evaluation & Baseline Comparisons
To thoroughly evaluate Param-1’s capabilities, we benchmarked its pretraining checkpoint against a range of open-weight language models of similar scale (2–3B parameters). These evaluations span general reasoning, knowledge recall, and most importantly, Indian language and cultural understanding.
Param-1 demonstrates competitive performance across standard benchmarks while excelling in Indian language tasks. All results below are for pre-trained (PT) checkpoints.
| Task | Param-1 2.9B | Sarvam-1 2B | Qwen-2.5 3B | Gemma-2 2B | Llama-3.2 3B | Granite-3.1 2B |
| --- | --- | --- | --- | --- | --- | --- |
| HellaSwag (Hi) | 45.7* | 43.8* | 32.9 | 39.1* | 40.6* | 31 |
| MMLU (Hi) | 36.1* | 41.4* | 38.32 | 35.8* | 37.5* | 29 |
| SANSKRITI | 60.15 | 52.61 | 69.72 | 69.76 | 55.47 | 60.95 |
| MILU (Hi) | 30.17 | 28.48 | 33.6 | 29.17 | 29.36 | 26.06 |
| MILU (En) | 36.3 | 32.12 | 49.84 | 44.65 | 37.63 | 36.08 |

Table 1 – Performance on India-centric benchmarks
| Task | Param-1 2.9B | Sarvam-1 2B | Qwen-2.5 3B | Gemma-2 2B | Llama-3.2 3B | Granite-3.1 2B |
| --- | --- | --- | --- | --- | --- | --- |
| ARC Challenge | 52.9* | 54.4* | 47.4 | 52.9* | 50.8* | 47.2 |
| ARC Easy | 74.6 | 80.3 | 73.2 | 80.3 | 71.7 | 76.8 |
| HellaSwag | 73.4* | 67.6* | 73.6 | 74.6* | 76.3* | 75.5 |
| MMLU (En) | 46* | 47.7* | 64.9 | 52.6* | 54.8* | 47.8 |
| PIQA | 79.3 | 76.4 | 78.84 | 78.3 | 77.31 | 79.4 |
| TriviaQA | 38.5 | 32.2 | 42.27 | 32.9 | 50.83 | 26.2 |
| LogiQA | 28.3 | 30.1 | 33.49 | 30.4 | 30.41 | 29.5 |
| Winogrande | 61.6 | 61.2 | 68.27 | 68.5 | 68.9 | 71.7 |
| TruthfulQA (gen, BLEU) | 38.2 | 36.3 | 36.96 | 29.7 | 21.8 | 34 |
| LAMBADA (OpenAI, acc) | 61.9 | 61 | 66.89 | 70 | 70.1 | 71.4 |
| LAMBADA (standard, acc) | 57.6 | 56.3 | 59.09 | 64.1 | 63.8 | 65.7 |

Table 2 – Performance on general language understanding benchmarks

* – Few-shot results
Param-1 2.9B delivers robust, competitive performance across a wide range of general language understanding benchmarks, matching or exceeding other leading models on several key tasks.
Baseline Models for Comparison
We compare Param-1 with several leading models in the 2–3B parameter range:
- Sarvam-1 (2B): A Hindi-focused language model trained specifically for Indian use cases, offering a direct comparison for Param-1’s bilingual capabilities.
- Qwen 2.5 (3B): A high-performance model with strong reasoning abilities, thanks to training on expert-curated datasets for coding and mathematics. It’s a valuable baseline for multilingual and logic-based tasks.
- Gemma-2 (2B): DeepMind’s smallest Gemma variant, emphasizing efficiency, long-context comprehension, and numeracy with English and code-heavy datasets.
- Granite 3.1 (2B Dense): IBM’s multilingual model trained on 12 trillion tokens, designed for reasoning, instruction following, and enterprise-grade use.
- LLaMA 3.2 (3B): A distilled variant of Meta’s LLaMA 3.1 70B model, optimized for general-purpose tasks. Despite strong performance, it includes minimal Indic representation (0.01%), making it a benchmark for contrast.
Benchmark Suite
Knowledge & Reasoning: ARC (Challenge & Easy), MMLU & MMLU (Hi), TriviaQA, LogiQA, TruthfulQA, LAMBADA (standard & OpenAI variants).
Commonsense & Physical Reasoning: HellaSwag (EN & HI), Winogrande, PIQA
Indian Language & Cultural Proficiency: MILU (Multilingual India-Level Understanding), SANSKRITI (Focuses on Indian culture, with over 21K multiple-choice questions across history, festivals, cuisine, and more)
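The post does not name the evaluation harness, but most of these tasks ship with EleutherAI's lm-evaluation-harness; a hedged reproduction sketch (the model id is a placeholder) might look like:

```python
import lm_eval  # pip install lm-eval (EleutherAI lm-evaluation-harness)

# Placeholder model id; the starred table entries above are few-shot results.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bharatgen/param-1,dtype=bfloat16",
    tasks=["hellaswag", "arc_challenge", "arc_easy", "piqa", "winogrande"],
    num_fewshot=5,
)
print(results["results"])
```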
Alignment and Safety: Building Responsible AI
Prometheus Evaluation (Instruction Tuning)
Assessed with 1,000 prompts across categories:
- Helpfulness
- Factual correctness
- Reasoning
- Safety
This evaluation provides a quantitative preference score without human annotation, offering a scalable and objective measure of Param-1's alignment. Notably, Param-1 demonstrates high alignment fidelity on bilingual queries, responding accurately and helpfully across both English and Hindi prompts.
Toxicity Evaluation
We evaluated safety and toxicity mitigation using the LLM360 Safety Evaluation Suite, a robust benchmark designed to test LLMs on potentially harmful content generation.
Using the toxicity sub-suite, Param-1 was prompted with a curated set of adversarial and stereotype-sensitive queries in both English and Hindi. Toxicity was measured using classification tools like Detoxify and Perspective API, with breakdowns by content category:
- Identity-based toxicity
- Profanity and slurs
- Threatening or violent language
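For intuition, a minimal auditing sketch with Detoxify is shown below; whether Detoxify's multilingual checkpoint covers Hindi well is an assumption here, and the Perspective API would serve as a second, independent scorer:

```python
from detoxify import Detoxify  # pip install detoxify

scorer = Detoxify("multilingual")  # XLM-R-based multilingual toxicity model
scores = scorer.predict("a model response to audit")
# Returned categories include toxicity, identity_attack, insult, threat, ...
flagged = {k: float(v) for k, v in scores.items() if v > 0.5}
print(flagged or "low toxicity across all categories")
```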

Param-1 maintains low toxicity scores and safe, culturally aware behavior across both monolingual and bilingual prompts, even under adversarial inputs.
The instruction-tuned checkpoint is particularly effective at rejecting or redirecting unsafe queries, supporting trust in sensitive deployments.
Real-World Use Cases
Param-1 is ready to power India’s most critical sectors:
- Governance – Multilingual digital services
- Education – AI tutors for diverse learners
- Healthcare – Region-specific medical assistants
- Legal – Understanding local laws and regulations
- Agriculture – Farmer outreach in regional languages
Instruction Fine-Tuning for Real Impact
The instruction-tuned Param-1 is built on a rigorously curated fine-tuning corpus:
- 400K high-quality instruction-response pairs
- Bilingual corpus from Indian domains
- Domain-specific content (governance, education, culture)
- Rigorous safety and relevance checks
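As a purely illustrative example (the real schema and fields are not published in this post), one bilingual instruction-response record might look like:

```python
# Hypothetical record format; field names and content are assumptions.
example_pair = {
    "instruction": "प्रधानमंत्री फसल बीमा योजना क्या है? संक्षेप में समझाइए।",
    # "What is the Pradhan Mantri Fasal Bima Yojana? Explain briefly."
    "response": (
        "प्रधानमंत्री फसल बीमा योजना (PMFBY) किसानों को प्राकृतिक आपदाओं से हुए "
        "फसल नुकसान के विरुद्ध बीमा कवर देने वाली सरकारी योजना है।"
    ),
    "language": "hi",
    "domain": "governance",
    "safety_checked": True,
}
```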
Infrastructure and Scale
Trained using Yotta’s high-performance SLURM-managed cluster:
- 8× NVIDIA H100 GPUs per node
- High-speed InfiniBand interconnect
- NeMo Framework for efficient distributed training
Why Param-1 Matters: A Paradigm Shift
Param-1 is more than a model—it’s a blueprint for equitable AI. It proves that linguistic inclusivity and technical excellence are not mutually exclusive.
By embedding diversity into its foundation—not as an afterthought—Param-1 sets the standard for AI made in and for the Global South.
Get Started with Param-1
The Param-1 model is available on AIKosh and Hugging Face for public use and further research.
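A minimal loading sketch with Hugging Face transformers follows; the repository id below is a placeholder, so check the official BharatGen pages for the exact name:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "bharatgen/Param-1"  # placeholder id; confirm on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)

prompt = "भारत की राजधानी क्या है?"  # "What is the capital of India?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```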
Join us in shaping the future of inclusive AI.
About BharatGen
Developed by the BharatGen team, Param-1 is a pioneering step toward building democratic, multilingual AI that reflects and serves India’s unique linguistic landscape.