What If Your AI Could Read a Document the Way You Do and Show You Exactly What It Found?

Published By: BharatGen

Every day, billions of documents change hands. A hospital sends a discharge summary. A bank issues a loan agreement. A logistics company generates a customs form. An analyst reviews a quarterly earnings filing. These aren’t simple text files. They’re visual objects pages designed for human eyes, with numbers nested inside tables, text wrapping around images, fine print curling along a poster edge, labels arching over a chart. Some are born digitally as clean PDFs. Others are photographs of paper tilted, faded, stamped, signed. The same information can live in a table cell, a chart bar, a watermarked footer, or curved text on a banner. No two documents look the same.

Humans navigate this effortlessly. We skim, scan, zoom in on a number, flip back to cross-reference a clause. We instinctively know the revenue figure is in the top-right cell of that table, not in the footnote three lines below it.

AI has gotten surprisingly good at answering questions from documents. You hand it a contract and ask “what’s the penalty clause?”, it tells you. You upload a blood report and ask “what’s abnormal?”, it answers. This field is called Document Visual Question Answering, and modern AI models have made remarkable progress on it.

But there’s a catch that nobody talks about enough. These models tell you what the answer is. They don’t show you where it came from. For casual use, that’s fine. But a lawyer verifying a contract clause, a doctor reviewing lab values, a compliance officer checking a regulatory filing, none of them can afford to just trust the AI’s word. They need to see the evidence. The exact sentence. The specific row. The precise cell the model is pointing at.

And showing that evidence is genuinely hard. The word “Total” might appear fourteen times on a single invoice, which one is the answer? Some documents have text that curves around a logo or runs at an angle; a rectangular box drawn around it captures mostly background noise.

M3Grounder, short for Mask-Based Multi-Span and Multi-Granular Grounding for Document QA, was built to close exactly this gap. Not just answering the question; but highlighting, on the actual page, the exact evidence behind it. Like a careful human reader with a highlighter, working in real time.

This work comes out of BharatGen’s vision team, in collaboration with IIIT Hyderabad, a founding member of the BharatGen consortium. BharatGen is India’s first sovereign AI initiative, built on a core belief that AI deployed in high-stakes settings must be trustworthy, not just accurate. M3Grounder is part of that effort. Think of it as a recipe for building grounding capability into document AI systems, and one part of a growing suite of models under the Patram ecosystem, India’s first Vision-Language Document Foundation Model built for Indic document intelligence.

What makes it distinct is where the grounding actually happens. Most existing systems that attempt to localize answers operate on the text modality, matching the predicted answer string back to OCR output and drawing a box around it. M3Grounder skips that entirely. It grounds end-to-end, directly on the image, producing pixel-level segmentation masks without ever relying on OCR as an intermediate step. That is not just a technical difference. It is what makes the system reliable on documents where OCR pipelines routinely break down.

“The future of document AI isn’t just answering questions. It’s answering and showing its work.”

The Problem With Document AI Today

Approach 1: “Just output coordinates.”

Some models are trained to print the bounding box of the answer — literally generating numbers like [x1, y1, x2, y2] as part of their text output. This seems reasonable until you realise how brittle it is. The model is juggling two completely different jobs at once inside a single output stream: understanding language and producing precise pixel coordinates. A small error in one corrupts the other. And because these coordinates are generated token-by-token like regular words, the model has no native spatial awareness, it’s guessing geometry through grammar.

Approach 2: “Match the answer text back to OCR.”

Other systems run OCR first, extract all the text from the document, find where the answer string appears in the OCR output, and draw a box around it. This works for simple cases but fails the moment reality gets complicated: What if the word “Total” appears six times on a page? What if the answer comes from a curved banner or a chart axis where OCR alignment is already unreliable? The pipeline breaks down and the highlighted region becomes guesswork. Both approaches share a deeper problem: they try to do language and spatial reasoning in the same place, through the same mechanism. The result is that one always suffers.

The naive approach: run OCR to find text, then separately prompt a segmentation model. Two stages, two failure points, no end-to-end learning

The Solution: Stay in the Latent Space (A smart highlighter)

M3Grounder’s key insight is that you should never have to leave the model’s internal representation to do grounding. Instead of going outside the latent space to extract text and then prompt a separate model, the VLM can natively learn latent representations that act as tool calls to a segmentation model. Think of it like function calling in LLMs, but for vision: the VLM doesn’t output coordinates or text strings; it produces a learned internal signal that directly invokes SAM to generate masks.

When a language model processes a document and generates an answer like “BOULEVARD”, it has already built up a rich internal understanding of that word, what it means, where it appeared, what visual context surrounds it. That understanding lives as a vector of numbers inside the model: a hidden state. Most approaches throw this away — either forcing the model to convert it back to raw coordinates (brittle), or exiting the model entirely to run OCR and match text strings (slow and ambiguous). M3Grounder does something different: it uses the hidden state directly as a visual prompt.

The system is built around two modules working in parallel:

A Vision-Language Model (VLM) reads the document and question, then generates the answer word by word. We add a special [GROUND] token to the model's vocabulary and train the VLM to learn two things simultaneously: what to answer, and when to ground within that answer. Whenever the model determines that a piece of the answer needs evidence, it predicts a [GROUND] token right there; mid-answer, as a trigger. That trigger's hidden state is a compact encoding of everything the model understood about that span: its meaning, context, and relationship to the document.

The handshake and the key insight is that the [GROUND] token’s hidden state is projected by a small MLP into the exact format SAM expects as a prompt. From SAM’s perspective, it’s receiving a normal input prompt and doing what it was always designed to do: segment. But that prompt wasn’t typed by a human or extracted by OCR. It was natively learned by the VLM during training a latent representation that encodes what to ground and where, produced entirely within the model’s own internal space. This is a tool calling for vision: the VLM invokes SAM with a learned signal, never leaving the latent space, and SAM produces a precise pixel mask. No OCR, no coordinate generation, no staged pipeline.

The VLM never has to “speak coordinates.” SAM never has to “understand language.” Each does what it’s natively good at. The hidden state is the bridge: a natively learned latent representation that serves as a tool call from one model to the other.

Overview of M3Grounder. Given a document image and question, a VLM (1) encodes visual features and autoregressively generates an answer sequence (2) interleaved with [GROUND] tokens. Hidden states of the [GROUND] tokens (3) from the VLM decoder are passed through granularity-specific MLP projection heads, producing phrase, line, and block level prompt embeddings (4). In parallel, the promptable segmentation module (5) extracts dense image embeddings reused across all spans and granularities. The prompt embeddings (4) are decoded by the mask decoder to generate hierarchical grounding masks (6), mapping each answer span to its corresponding grounding masks across multiple granularities.

This design has a meaningful side effect: because the grounding signal comes from the model’s own latent understanding rather than string matching, M3Grounder can ground text it has never seen OCR’d curved text on posters, rotated labels in diagrams, stamps. The geometry is handled by SAM’s pixel-level machinery, not by fragile text coordinates. The result: a single forward pass produces both the answer and its evidence, end-to-end.

Three levels of detail you choose how zoomed-in you need:

🔤

Phrase Level

Highlights the exact words: names, numbers, dates

📏

Line Level

Highlights the full sentence for reading context

🔲

Block Level

Highlights the whole section for big-picture view

Each level strictly contains the one above it: phrase ⊂ line ⊂ block

One Pass. No Staged Pipeline.

A staged pipeline — where you first run OCR, then separately prompt a segmentation model — has to solve a hard alignment problem every time: which of the OCR-detected tokens corresponds to the model’s intended answer? In documents with repeated phrases, dense tables, or ambiguous layouts, that alignment frequently fails.

M3Grounder sidesteps this entirely. The [GROUND] token is generated immediately after the answer span in the same autoregressive pass that produced the answer. Its hidden state already knows what the answer is and where in the document it came from because it was computed while reading that document. There is no external lookup, no string matching, no second model that needs to be told what to find. The only cost of adding grounding to the system is 63 milliseconds of extra latency per query, the time for SAM to decode the masks. The VLM’s answer quality is completely unchanged.

Example of multi-granular grounding in document QA. The model grounds the same answer at three levels of granularity-phrase, line, and block, showing its ability to localize evidence at different contextual scopes.

A precise date? Phrase level. The surrounding paragraph? Block level. The choice gives users and downstream applications the right level of evidence for the task at hand.

Why Does This Matter?

In industries where decisions carry real consequences, answers alone are not enough. A financial analyst needs to know which line in the quarterly report the AI is referencing. A legal reviewer needs to verify the exact clause in a contract. A doctor reviewing clinical notes needs to see which sentence flagged a risk.

🏦

Finance

Pinpoint exact figures in balance sheets, earnings reports, and invoices.

⚖️

Legal

Highlight the precise clause or regulation being referenced.

🏥

Healthcare

Surface the exact clinical text supporting a diagnosis.

🏢

Enterprise

Make document workflows auditable; every AI answer has a traceable source.

The Numbers

To build and test M3Grounder, the team created an entirely new dataset GroundingDocQA, the largest of its kind for pixel-level document grounding.

200K Documents in training dataset

2M QA pairs with pixel masks

3 Granularity levels

#1 State-of-the-art on all 4 benchmarks

How Does It Compare?

M3Grounder was tested against leading commercial and open-source AI systems. Here’s a simplified look at performance on the hardest metrics curved and multi-span text grounding:

Model	Multi-span grounding	Curved text grounding	Open source
GPT-4o (OpenAI)	✗ 5.2%	✗ 0.8%	✗Closed
Gemini-2.5 Pro (Google)	53.8%	14.0%	✗Closed
Qwen3-VL fine-tuned	68.1%	37.7%	✓Open
M3Grounder (Ours) ★	✓ 84.2%	✓ 63.5%	✓Open

The gains are especially large for curved text, the hardest case, where rectangular boxes simply can’t work. This is where segmentation-based grounding shines.

Table 1. Quantitative results on document grounding benchmarks. We evaluate all models at the phrase level across four benchmarks: BoundingDocs (BD-Test), DOGR-Bench, MMDocBench, and GroundingDocQA-Bench (ours). BD-Test and DOGR-Bench: Grounding is measured using F1g (IoU > 0.5) over bounding boxes, while answer quality (AQ) is evaluated with G-Eval. MMDocBench: Official metrics are Exact Match (EM), word-level F1ₜₓₜ, and IoU for region alignment. GroundingDocQA-Bench: SS = single-span grounding F1 (IoU > 0.5); MS = multi-span grounding F1 (IoU > 0.5); CS = curved/skew-answer grounding F1 (IoU > 0.5); F1g = overall grounding F1 (IoU > 0.5) aggregated across spans; AQ = G-Eval answer quality. Higher values are better for all metrics. Bold indicates the best result among open-source models, while underlined values (not preserved in plain text) indicate the best result among commercial models.

What's Next

M3Grounder was presented at CVPR 2026 in Nashville, one of the most competitive venues in computer vision, where fewer than 25% of submitted papers are accepted. For BharatGen’s Vision team, it’s a validation that research built around India’s document challenges belongs at the frontier of the field, not at the periphery of it.

Grounding is one of the capabilities we’re investing in most deeply as part of the Patram ecosystem. There’s more coming from this direction.

But the reason any of this matters comes down to something simple. Across India, consequential documents like land records, medical reports, court filings, and government certificates are still handled as physical objects, photographed and passed around as images. The people who depend on these documents deserve AI that doesn’t just read them but can show its work, pointing at the exact line, the precise field, the specific stamp that supports what it found. That’s what pixel-precise grounding is for. Not a research benchmark. A practical guarantee.

This blog describes research conducted by BharatGen‘s Vision team in collaboration with IIIT Hyderabad, a founding consortium member. The paper was accepted and presented at CVPR 2026.

Research by: Venkat Kesav Venna, Sai Madhusudan Gunda, Jyothi Swaroopa Jinka, Hrithik Sagar Rachakonda, Anirudh Srinivasan, Ravi Kiran Sarvadevabhatla

Project page: m3grounder.github.io