Senior Linguist (Text LLM – Model Evaluation)
Build India’s sovereign AI stack for a billion people and shape the future of technology


Job Summary
As the Senior Linguist for Text LLM Model Evaluation, you will own the end-to-end process of evaluating BharatGen’s text-based large language models. You will design human evaluation frameworks, test sets, rubrics, and metrics that assess model outputs across multiple tasks and languages. Working closely with ML engineers, linguists, and data operations, you’ll ensure that every model iteration is measured with rigor, fairness, and linguistic precision.
Key Responsibilities
- Model Evaluation Design:
- Design and manage evaluation frameworks for BharatGen’s Text LLM, covering diverse tasks such as summarization, dialogue, question answering, and reasoning.
- Define evaluation dimensions (e.g., coherence, factuality).
- Develop human evaluation rubrics and task-specific test sets for multiple languages.
- Establish evaluation workflows using human judgment that complement automated metrics.
- Create documentation, checklists, and SOPs to ensure replicability of evaluations across model versions.
- Human Evaluation Pipeline Oversight:
- Collaborate with the Data Ops Manager to execute large-scale human evaluations across languages, aligning on throughput, timelines, and cost controls.
- Review and refine annotation guidelines to ensure inter-annotator consistency.
- Design sampling and spot-checking methods for maintaining high data integrity.
- Implement inter-annotator agreement tracking and quality audits (see the agreement-tracking sketch after this list).
- Analytical Evaluation & Reporting:
- Conduct error and trend analysis on model outputs across evaluation rounds.
- Interpret results to highlight strengths, regressions, or recurring weaknesses.
- Present findings and recommendations to ML engineers and leadership in structured, data-backed reports.
- Collaborate with the ML team to refine models based on evaluation results and feedback loops.
- Metrics & Tooling:
- Identify or adapt suitable automatic evaluation metrics (e.g., BLEU, ROUGE, BERTScore, toxicity classifiers) to complement human evaluation; a minimal scoring sketch follows this list.
- Use simple scripts/dashboards to track scores, trends, and evaluation throughput.
- Partner with Data Ops and product engineers to improve internal tools for managing evaluation tasks and results visualization.
- Cross-functional Collaboration:
- Train and mentor junior linguists in designing high-quality evaluation schemes.
- Participate in design reviews to ensure evaluation insights are integrated into model training and product goals.
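
To make the agreement-tracking responsibility concrete, here is a minimal sketch of pairwise weighted Cohen’s kappa over a small rating batch, assuming the pandas and scikit-learn packages; the annotator names, item IDs, and 1–5 quality scale are illustrative placeholders, not BharatGen internals.

```python
# Minimal sketch: pairwise Cohen's kappa for annotators on a shared batch.
# Labels, annotator names, and the rating scale are illustrative placeholders.
from itertools import combinations

import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings: one row per (item, annotator), on a 1-5 quality scale.
ratings = pd.DataFrame({
    "item_id":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "annotator": ["A", "B", "C"] * 3,
    "score":     [4, 4, 5, 2, 3, 2, 5, 5, 5],
})

# Pivot to items x annotators so each column holds one annotator's labels.
matrix = ratings.pivot(index="item_id", columns="annotator", values="score")

# Report weighted kappa for every annotator pair on the items both rated.
for a, b in combinations(matrix.columns, 2):
    pair = matrix[[a, b]].dropna()
    kappa = cohen_kappa_score(pair[a], pair[b], weights="quadratic")
    print(f"{a} vs {b}: weighted kappa = {kappa:.2f} over {len(pair)} items")
```

In practice, scores like these would feed a dashboard or a quality-audit threshold (for example, flagging batches that fall below an agreed kappa floor); the exact tooling is for this role to define.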
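
Likewise, for the automatic metrics listed above, the sketch below scores a toy hypothesis against a reference with corpus-level BLEU and ROUGE-L, assuming the open-source sacrebleu and rouge_score packages; the texts are placeholders, and the choice of metrics would depend on the task.

```python
# Minimal sketch: corpus-level BLEU and ROUGE-L to complement human judgments.
# Model outputs and references here are toy placeholders.
import sacrebleu
from rouge_score import rouge_scorer

references = ["The cabinet approved the new railway budget on Friday."]
hypotheses = ["The cabinet approved a new railway budget on Friday."]

# Corpus BLEU: sacrebleu takes a list of hypothesis strings and a list of
# reference streams (one stream per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# Sentence-level ROUGE-L F1, averaged over the evaluation set.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(
    scorer.score(ref, hyp)["rougeL"].fmeasure
    for ref, hyp in zip(references, hypotheses)
) / len(references)
print(f"ROUGE-L F1: {rouge_l:.3f}")

# BERTScore or a toxicity classifier could slot into the same loop; both
# require model downloads, so they are omitted from this sketch.
```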
Minimum Qualifications and Experience
- Master’s or PhD in Linguistics/Computational Linguistics and 3+ years of experience on NLP or GenAI projects, including exposure to model evaluation, test-set design, or linguistic quality assessment.
Required Expertise
- Proficiency in Python and agentic frameworks such as LangGraph, DSPy, AutoGen, or CrewAI.
- Experience collaborating with ML or data science teams on evaluation or model analysis workflows.
- Experience managing multi-language annotation or evaluation projects is preferred.
- Experience with multilingual LLMs or Indian language NLP.
- Exposure to instruction-tuning, safety evaluation, or RLHF workflows.
- Familiarity with bias detection, toxicity analysis, or fairness evaluation.
- Prior experience training or mentoring annotation/evaluation teams.
