Param-1: Revolutionizing AI for India — A 2.9B Parameter Language Model Built from the Ground Up

Published By: BharatGen

India is home to one of the world’s most linguistically diverse populations—yet, until now, artificial intelligence (AI) has largely ignored this fact. Today, we’re proud to introduce Param-1, a groundbreaking 2.9 billion parameter bilingual language model designed with India at its core.

The Problem: AI’s Language Barrier

Modern large language models like GPT-4, LLaMA, and others are impressive—but they’re primarily trained on English. Consider this: Meta’s LLaMA dedicates just 0.01% of its training data to Indic languages, despite India accounting for nearly 18% of the global population.

This creates major accessibility issues for the 1.4+ billion Indians who speak 20+ official languages and 100+ dialects:

  • Poor comprehension of Indian languages and contexts
  • Culturally irrelevant or biased outputs
  • Ineffective tokenization of Indian scripts
  • Lack of support for regional and government use cases

Introducing Param-1: AI Built for India

Param-1 is not a retrofitted Western model—it’s been built from the ground up for India’s multilingual reality. A bilingual Hindi-English model, Param-1 understands and responds naturally in both languages, providing equitable access to high-quality AI tools.

Core Design Principles of Param-1

  1. Equitable Language Representation

Param-1 allocates 25% of its training data to Hindi, a significant leap compared to the negligible representation in traditional models. Out of 7.5 trillion tokens, 2.77 trillion are dedicated to Hindi content.

  2. Tokenization Fairness

Param-1 uses a custom SentencePiece tokenizer with a 128K vocabulary, optimized for Indian scripts and morphology. This dramatically improves processing for Hindi and related languages compared to Western-trained tokenizers. 

  3. Culturally Aligned Evaluation

The model is evaluated on India-specific benchmarks, including code-mixed reasoning and socio-linguistic robustness, not just English-centric tasks.

Technical Architecture: Built for Performance

Param-1 follows a decoder-only causal language model architecture similar to GPT and LLaMA, with optimizations for multilingual and India-specific performance:

 

| Architecture attribute | Value |
|---|---|
| model_type | causal-language-model |
| hidden_size | 2048 |
| intermediate_size | 7168 |
| max_position_embeddings | 2048 |
| num_of_attention_heads | 16 |
| rope_theta | 10000 |
| num_of_decoder_blocks | 32 |
| seq_length | 2048 |
| num_of_key_value_heads | 8 |
| activation_function | SwiGLU |
| attention | Grouped-query attention |
| precision | bf16-mixed |
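To make the grouped-query attention setting concrete, the sketch below estimates the key/value cache size implied by the attributes above. It assumes a per-head dimension of hidden_size / num_of_attention_heads (128) and 2 bytes per cached value for bf16; neither is stated in this article, and the function name is illustrative only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Values taken from the architecture attributes listed above.
hidden_size = 2048
num_attention_heads = 16
num_key_value_heads = 8
num_decoder_blocks = 32
seq_length = 2048

# Assumption: head_dim = hidden_size / num_attention_heads.
head_dim = hidden_size // num_attention_heads

gqa = kv_cache_bytes(num_decoder_blocks, num_key_value_heads, head_dim, seq_length)
mha = kv_cache_bytes(num_decoder_blocks, num_attention_heads, head_dim, seq_length)

print(f"GQA (8 KV heads):  {gqa / 2**20:.0f} MiB per sequence")
print(f"MHA (16 KV heads): {mha / 2**20:.0f} MiB per sequence")
```

Under these assumptions, sharing each key/value head across two query heads halves the cache relative to standard multi-head attention at the same sequence length.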

Three-Phase Pre-training Strategy    

 Phase 1: Bootstrap Training

[Figure: Pre-Training Phase-1 Data Mixture]

 

  • 7.5T tokens (4.73T English + 2.77T Hindi)
  • Trained on 64 nodes with 8× NVIDIA H100 GPUs each
  • Datasets: FineWeb-Edu, DCLM, Nemotron-CC, Sangraha, Books OCR, Udaan
  • Data curation techniques: FastText detection, toxic content removal, Unicode normalization, deduplication, PII removal
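Three of the curation steps listed above (Unicode normalization, deduplication, PII removal) can be sketched with the Python standard library alone. This is a minimal illustration, not BharatGen's pipeline: the regular expressions, placeholder tags, and function names are assumptions, and the FastText and toxicity filters are omitted because they need trained models.

```python
import hashlib
import re
import unicodedata

# Hypothetical PII patterns: email addresses and 10-digit Indian mobile numbers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?<!\d)(?:\+91[\s-]?)?[6-9]\d{9}(?!\d)")


def normalize(text):
    # NFC gives canonically equivalent strings a single representation.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace so trivial spacing differences disappear.
    return " ".join(text.split())


def scrub_pii(text):
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def curate(docs):
    """Normalize, scrub, and exact-deduplicate a list of documents."""
    seen = set()
    kept = []
    for doc in docs:
        clean = scrub_pii(normalize(doc))
        digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
        if clean and digest not in seen:
            seen.add(digest)
            kept.append(clean)
    return kept


sample = [
    "संपर्क करें:  test@example.com",
    "संपर्क करें: test@example.com",
    "Call 9876543210 for details.",
]
cleaned = curate(sample)
for line in cleaned:
    print(line)
```

The first two sample documents differ only in spacing, so they collapse to one entry after normalization. Exact hashing catches only identical text; near-duplicate detection would need a fuzzier method such as MinHash.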

Phase 2: Factual Consistency

[Figure: Pre-Training Phase-2 Data Mixture]

 

  • Focused on improving knowledge retention and factual correctness
  • Uses enhanced monolingual and bilingual corpora with Indian context emphasis

Phase 3: Long-Context Adaptation

 

  • 500B tokens (250B English + 250B Hindi)
  • Document lengths range from 2K to 16K+ tokens
  • Trains the model to track long-range dependencies and retain information across long documents

Custom Tokenizer Advantage

The BharatGen tokenizer outperforms popular alternatives in Indian language efficiency:

| Language | BharatGen-128K | LLaMA | Qwen |
|---|---|---|---|
| Hindi | 1.43 | 2.65 | 4.66 |

Lower scores mean better token efficiency.
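The article does not name the metric behind these scores. One common choice for this kind of comparison is fertility, the average number of tokens produced per word, and the sketch below assumes that definition. The two tokenizers are toy stand-ins, not the BharatGen, LLaMA, or Qwen tokenizers.

```python
def fertility(texts, tokenize):
    """Average number of tokens produced per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words


# Toy stand-ins for real tokenizers (hypothetical, for illustration only).
def word_tokenizer(text):
    return text.split()


def byte_tokenizer(text):
    return list(text.encode("utf-8"))


corpus = ["भारत एक विशाल देश है", "यह मॉडल हिंदी समझता है"]

print(f"word-level fertility: {fertility(corpus, word_tokenizer):.2f}")
print(f"byte-level fertility: {fertility(corpus, byte_tokenizer):.2f}")
```

To measure a real tokenizer, pass its encode function as `tokenize`. A tokenizer whose vocabulary covers Devanagari well should land much closer to the word-level figure than to the byte-level one.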

Benchmark Evaluation & Baseline Comparisons

To thoroughly evaluate Param-1’s capabilities, we benchmarked its pretraining checkpoint against a range of open-weight language models of similar scale (2–3B parameters). These evaluations span general reasoning, knowledge recall, and most importantly, Indian language and cultural understanding.

Param-1 demonstrates competitive performance across standard benchmarks, with its strongest showing on Hindi tasks such as HellaSwag (Hi), where it leads the compared models. The benchmark results reported here are for pre-trained (PT) checkpoints.

| Task | Param1 2.9B | Sarvam1 2B | Qwen2.5 3B | Gemma2 2B | Llama3.2 3B | Granite3.1 2B |
|---|---|---|---|---|---|---|
| HellaSwag (Hi) | 45.7* | 43.8* | 32.9 | 39.1* | 40.6* | 31 |
| MMLU (Hi) | 36.1* | 41.4* | 38.32 | 35.8* | 37.5* | 29 |
| SANSKRITI | 60.15 | 52.61 | 69.72 | 69.76 | 55.47 | 60.95 |
| MILU (Hi) | 30.17 | 28.48 | 33.6 | 29.17 | 29.36 | 26.06 |
| MILU (En) | 36.3 | 32.12 | 49.84 | 44.65 | 37.63 | 36.08 |

Table 1 – Performance on India-centric benchmarks
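For readers who want to compare the models programmatically, the sketch below transcribes the Table 1 scores and reports the best model and Param-1's rank for each task. The helper name is illustrative. Starred (few-shot) and unstarred results are treated alike here, so the ranks are only a rough guide.

```python
MODELS = [
    "Param1 2.9B", "Sarvam1 2B", "Qwen2.5 3B",
    "Gemma2 2B", "Llama3.2 3B", "Granite3.1 2B",
]

# Scores transcribed from Table 1, in the same column order as MODELS.
TABLE_1 = {
    "HellaSwag (Hi)": [45.7, 43.8, 32.9, 39.1, 40.6, 31.0],
    "MMLU (Hi)": [36.1, 41.4, 38.32, 35.8, 37.5, 29.0],
    "SANSKRITI": [60.15, 52.61, 69.72, 69.76, 55.47, 60.95],
    "MILU (Hi)": [30.17, 28.48, 33.6, 29.17, 29.36, 26.06],
    "MILU (En)": [36.3, 32.12, 49.84, 44.65, 37.63, 36.08],
}


def rank_of(model, scores):
    """1-based rank of `model` when scores are sorted highest first."""
    target = scores[MODELS.index(model)]
    return 1 + sum(s > target for s in scores)


for task, scores in TABLE_1.items():
    best = MODELS[scores.index(max(scores))]
    rank = rank_of("Param1 2.9B", scores)
    print(f"{task:15s} best: {best:14s} Param1 rank: {rank}/{len(MODELS)}")
```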

| Task | Param1 2.9B | Sarvam1 2B | Qwen2.5 3B | Gemma2 2B | Llama3.2 3B | Granite3.1 2B |
|---|---|---|---|---|---|---|
| ARC Challenge | 52.9* | 54.4* | 47.4 | 52.9* | 50.8* | 47.2 |
| ARC Easy | 74.6 | 80.3 | 73.2 | 80.3 | 71.7 | 76.8 |
| HellaSwag | 73.4* | 67.6* | 73.6 | 74.6* | 76.3* | 75.5 |
| MMLU (En) | 46* | 47.7* | 64.9 | 52.6* | 54.8* | 47.8 |
| PIQA | 79.3 | 76.4 | 78.84 | 78.3 | 77.31 | 79.4 |
| TriviaQA | 38.5 | 32.2 | 42.27 | 32.9 | 50.83 | 26.2 |
| LogiQA | 28.3 | 30.1 | 33.49 | 30.4 | 30.41 | 29.5 |
| Winogrande | 61.6 | 61.2 | 68.27 | 68.5 | 68.9 | 71.7 |
| TruthfulQA (gen, BLEU) | 38.2 | 36.3 | 36.96 | 29.7 | 21.8 | 34 |
| lambada_openai_acc | 61.9 | 61 | 66.89 | 70 | 70.1 | 71.4 |
| lambada_standard_acc | 57.6 | 56.3 | 59.09 | 64.1 | 63.8 | 65.7 |

Table 2 – Performance on various general language understanding benchmarks

* – Few-shot results

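Param1 2.9B is competitive across a wide range of general language understanding benchmarks. It records the highest TruthfulQA score (38.2) among the compared models and comes within 0.1 points of the best PIQA result (79.3 versus 79.4), though larger gaps remain on MMLU (En), Winogrande, and LAMBADA.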

 

Baseline Models for Comparison

We compare Param-1 with several leading models in the 2–3B parameter range:

  • Sarvam-1 (2B): A Hindi-focused language model trained specifically for Indian use cases, offering a direct comparison for Param-1’s bilingual capabilities.
  • Qwen 2.5 (3B): A high-performance model with strong reasoning abilities, thanks to training on expert-curated datasets for coding and mathematics. It’s a valuable baseline for multilingual and logic-based tasks.
  • Gemma-2 (2B): DeepMind’s smallest Gemma variant, emphasizing efficiency, long-context comprehension, and numeracy with English and code-heavy datasets.
  • Granite 3.1 (2B Dense): IBM’s multilingual model trained on 12 trillion tokens, designed for reasoning, instruction following, and enterprise-grade use.
  • LLaMA 3.2 (3B): A compact Meta model produced by pruning LLaMA 3.1 8B and distilling from the larger LLaMA 3.1 models, optimized for general-purpose tasks. Despite strong performance, it includes minimal Indic representation (0.01%), making it a useful point of contrast.

 

Benchmark Suite

Knowledge & Reasoning: ARC (Challenge & Easy), MMLU & MMLU-Hi, TriviaQA, LogiQA, TruthfulQA, LAMBADA (standard and OpenAI variants).

Commonsense & Physical Reasoning: HellaSwag (EN & HI), Winogrande, PIQA

Indian Language & Cultural Proficiency: MILU (Multi-task Indic Language Understanding), SANSKRITI (focuses on Indian culture, with over 21K multiple-choice questions across history, festivals, cuisine, and more)

 

Alignment and Safety: Building Responsible AI

Prometheus Evaluation (Instruction Tuning)

Assessed with 1,000 prompts across categories:

  • Helpfulness
  • Factual correctness
  • Reasoning
  • Safety

Param-1 earns strong preference scores in bilingual settings, indicating reliable responses across English and Hindi.

This evaluation provides a quantitative preference score without human annotation, offering a scalable and objective measure of Param-1’s alignment. Notably, Param-1 demonstrated high alignment fidelity in bilingual queries, showcasing its ability to respond accurately and helpfully across English and Hindi prompts.

LLM360 Safety Suite

Tested against:

  • Identity-based toxicity
  • Threatening content
  • Profanity and slurs

Param-1 maintains low toxicity and demonstrates safe, culturally aware behavior even under adversarial prompts.

 

Toxicity Evaluation

We evaluated safety and toxicity mitigation using the LLM360 Safety Evaluation Suite, a robust benchmark designed to test LLMs on potentially harmful content generation.

 

Using the toxicity sub-suite, Param-1 was prompted with a curated set of adversarial and stereotype-sensitive queries in both English and Hindi. Toxicity was measured using classification tools like Detoxify and Perspective API, with breakdowns by content category:

  • Identity-based toxicity
  • Profanity and slurs
  • Threatening or violent language

Param-1 maintains low toxicity scores across both monolingual and bilingual prompts.

Its instruction-tuned checkpoint is more effective in rejecting or redirecting unsafe queries, ensuring higher trust in sensitive deployments.
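The per-category breakdown described above can be sketched as a small aggregation over scored responses. The scorer here is a hypothetical stub: a real run would plug in a classifier such as Detoxify or Perspective API, which this sketch does not call. The 0.5 flagging threshold, record fields, and function names are also assumptions.

```python
from collections import defaultdict
from statistics import mean


def summarize_toxicity(records, score_fn, threshold=0.5):
    """Mean toxicity and flagged rate per (category, language) bucket."""
    buckets = defaultdict(list)
    for rec in records:
        key = (rec["category"], rec["lang"])
        buckets[key].append(score_fn(rec["response"]))
    return {
        key: {
            "mean": mean(scores),
            "flagged": sum(s >= threshold for s in scores) / len(scores),
        }
        for key, scores in buckets.items()
    }


# Hypothetical stand-in for a real toxicity classifier.
def dummy_score(text):
    return 0.9 if "BADWORD" in text else 0.1


records = [
    {"category": "identity", "lang": "en", "response": "I can't help with that."},
    {"category": "identity", "lang": "hi", "response": "मैं इसमें मदद नहीं कर सकता।"},
    {"category": "profanity", "lang": "en", "response": "BADWORD"},
    {"category": "profanity", "lang": "en", "response": "Let's keep this respectful."},
]

summary = summarize_toxicity(records, dummy_score)
for (category, lang), stats in sorted(summary.items()):
    print(f"{category:10s} {lang}  mean={stats['mean']:.2f}  flagged={stats['flagged']:.0%}")
```

Reporting both a mean score and a flagged rate keeps a few highly toxic outputs from being hidden by many benign ones.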

 

Real-World Use Cases

Param-1 is ready to power India’s most critical sectors:

  • Governance – Multilingual digital services
  • Education – AI tutors for diverse learners
  • Healthcare – Region-specific medical assistants
  • Legal – Understanding local laws and regulations
  • Agriculture – Farmer outreach in regional languages

Instruction Fine-Tuning for Real Impact

Param-1’s instruction-tuned checkpoint is trained on:

  • 400K high-quality instruction-response pairs
  • Bilingual corpus from Indian domains
  • Domain-specific content (governance, education, culture)
  • Rigorous safety and relevance checks

Infrastructure and Scale

Trained using Yotta’s high-performance SLURM-managed cluster:

 

  • 8× NVIDIA H100 GPUs per node
  • High-speed InfiniBand interconnect
  • NeMo Framework for efficient distributed training

 

Why Param-1 Matters: A Paradigm Shift

Param-1 is more than a model—it’s a blueprint for equitable AI. It proves that linguistic inclusivity and technical excellence are not mutually exclusive.

 

By embedding diversity into its foundation—not as an afterthought—Param-1 sets the standard for AI made in and for the Global South.

 

Get Started with Param-1

The Param-1 model is available on AIkosh and Hugging Face for public use and further research.

Join us in shaping the future of inclusive AI.

About BharatGen

Developed by the BharatGen team, Param-1 is a pioneering step toward building democratic, multilingual AI that reflects and serves India’s unique linguistic landscape.
