Public Evidence · 100 scored runs
Does Big Indexer actually help AI coding assistants?
We ran the same architectural prompts on 5 real open-source repos — with and without Big Indexer — and published every raw output, every score, and every methodology decision.
Results at a glance
All numbers are from the post-shipment BGI-TWIN refresh (20 runs, p01–p04 × 5 repos, MCP + twin_context mode). Full comparison in the table below.
- Actionability: 4.25–4.85/5 (cross-model TWIN range: deepseek / GPT-4o / Gemini)
- Boundary Accuracy: 1.00 (perfect, 20/20 runs)
- Hallucinations: 0 (across all 100 scored runs)
- Median Latency: 68s (vs 134s baseline, 49% faster)
- p04 Evidence Coverage: 96% (safe-implementation slice, 5 repos)
Three-stage core comparison
Each stage adds a layer. BGI-MCP fixed boundary & latency. BGI-TWIN fixed actionability. Three-model replication (GPT-4o, Gemini auto) is shown in the next section.
| Metric | No BGI Baseline (20 runs) | BGI MCP, Pre-shipment (20 runs) | BGI MCP + TWIN, Post-shipment (20 runs) |
|---|---|---|---|
| Actionability (1–5) | 4.0 | 4.0 (flat) | 4.75 (+0.75) |
| Evidence Coverage | 78.7% | 84.9% (+6.2 pp) | 79.9% / 96%† |
| Boundary Accuracy | 0.95 | 1.00 (+0.05) | 1.00 (held) |
| Hallucinations | 0 | 0 | 0 |
| Median Latency | 133.8s | 66.2s (51% faster) | 68.5s (≈ held) |
† 79.9% is the all-prompt mean for the refresh batch; 96% is the p04 (safe implementation path) slice across 5 repos.
The harder p01–p03 prompts pull the mean down; p04 evidence is the most actionability-relevant slice.
What changed between stage 2 and stage 3?
BGI-TWIN shipped three new MCP tools — task_fingerprint, behavioral_twins, and twin_context —
that give the AI model a fingerprinted task context, the top-3 behaviorally similar code units from the real repo,
the strongest architectural seam for the task, and an explicit 5-point actionability rubric.
That is why actionability moved from 4.0 to 4.75 while boundary accuracy and hallucinations stayed perfect.
Three-model independent replication (deepseek + GPT-4o + Gemini auto)
We re-ran the full TWIN protocol (p01–p04 × 5 repos) on azure/gpt-4o and gemini/auto to add model diversity and check cross-model consistency.
| Metric | BGI TWIN (deepseek) | BGI TWIN (GPT-4o) | BGI TWIN (Gemini auto) |
|---|---|---|---|
| Actionability (1–5) | 4.75 | 4.85 | 4.25 |
| Evidence coverage (strict) | 79.9% (96% p04) | 47.9% (49.3% p04) | 62.4% (80% p04) |
| Evidence (tag-relaxed, 2nd score) | 94.8% (100% p04) | 59.5% (62.7% p04) | 83.4% |
| Boundary accuracy | 1.00 | 1.00 | 0.95 |
| Hallucinations | 0 | 0 | 0 |
| Median latency | 68.5s | 41.6s | 65.8s |
Replication interpretation: actionability, boundary accuracy (≥0.95), and zero hallucinations held across all three models.
Evidence coverage varies with each model's tagging style — see the callout below.
Gemini auto's 0.95 boundary score reflects one genuine architectural miss (django/p02: the model went depth-first into query.py), not a pattern-calibration issue.
Why evidence coverage varies across models — and why it is not a quality signal
Evidence coverage is scored as checklist recall from explicit, claim-level evidence statements. Across the same TWIN protocol, deepseek outputs contained 278 explicit
VERIFIED/HYPOTHESIS/UNKNOWN tags (13.9/run), Gemini outputs contained
167 (8.35/run), and GPT-4o outputs contained 139 (6.95/run).
On the p04 implementation slice, actionability stayed high across all three models (deepseek 4.75, GPT-4o 4.85, Gemini 4.25), boundary stayed ≥0.95, and hallucinations stayed zero.
Interpretation: this is a model/rubric interaction (explicit tagging style), not a loss of BGI grounding.
deepseek follows the VERIFIED/HYPOTHESIS/UNKNOWN protocol most strictly; Gemini auto is intermediate; GPT-4o gives correct answers without exhaustive tagging.
The tag-relaxed second score (94.8% / 83.4% / 59.5%) closes most of this gap by crediting concrete repo-anchored claims without explicit labels.
We keep scores unnormalized and publish raw outputs so readers can audit this directly.
Second evidence score formula (tag-relaxed): `min(100, evidence_coverage_pct + min(25, (unlabeled_repo_anchor_lines / checklist_items) * 100 * 0.15))`.
This adds conservative credit for non-log lines that cite concrete repo anchors (*.py, *.go, *.ts, etc.) without explicit evidence tags.
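For anyone re-scoring, the formula transcribes directly into Python. A minimal sketch, where the anchor regex and the tag check are our illustrative stand-ins for the definitions in validation/SCORING_RUBRIC.md (the non-log-line filter is omitted for brevity):

```python
import re

# Illustrative "concrete repo anchor" pattern; the canonical extension list
# and the non-log-line filter are defined in the published rubric.
REPO_ANCHOR = re.compile(r"[\w/.-]+\.(?:py|go|ts|tsx|rs|js)\b")
EVIDENCE_TAGS = ("VERIFIED", "HYPOTHESIS", "UNKNOWN")

def tag_relaxed_score(evidence_coverage_pct: float,
                      output_lines: list[str],
                      checklist_items: int) -> float:
    """Second evidence score: strict coverage plus capped credit for
    untagged lines that still cite a concrete repo file."""
    unlabeled_anchor_lines = sum(
        1 for line in output_lines
        if REPO_ANCHOR.search(line)
        and not any(tag in line for tag in EVIDENCE_TAGS)
    )
    bonus = min(25, (unlabeled_anchor_lines / checklist_items) * 100 * 0.15)
    return min(100, evidence_coverage_pct + bonus)
```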
Raw-output comparison (same prompt, two models)
Prompt: fastapi p03 — "blast radius for solve_dependencies"
deepseek (opencode_mcp_p03_twin_refresh_r2):
| Definition spans lines 598–735 ... | VERIFIED |
| get_request_handler ... | VERIFIED |
| get_websocket_app ... | VERIFIED |
VERIFIED: Only 3 call sites exist in the codebase.
GPT-4o (opencode_mcp_p03_twin_refresh_gpt4o_r1):
1. VERIFIED: Changes to solve_dependencies will directly impact tests.
2. HYPOTHESIS: Other clusters may have dependent implications.
Source artifacts: deepseek run · GPT-4o run
What is BGI-TWIN?
Three deterministic MCP tools that turn "here is the architecture" into "here is exactly what to change and where."
🔬 task_fingerprint(task)
Converts a natural-language task into a COV (co-occurrence vocabulary) token set. Deterministic — same task always produces the same fingerprint. No LLM call.
🧬 behavioral_twins(task)
Ranks all indexed code units by Jaccard similarity against the task fingerprint. Returns the top-3 units that do the most similar thing in the actual codebase — with optional source snippets.
🎯 twin_context(task)
Composite output: task COV + top twins + the strongest seam boundary for the task + an explicit 5-point actionability rubric + confidence-gated escalation. The AI gets a fully structured implementation brief.
BGI-TWIN is a deterministic context compiler — it does not generate code, does not call an LLM, and does not speculate about behavior. Every output is derived directly from the indexed graph and fuse artifacts you already built with bgi scan.
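To make the ranking concrete, here is a minimal sketch of the fingerprint-and-rank idea, assuming a COV fingerprint reduces to a token set. The tokenizer and the toy index are our simplifications; only the determinism, the Jaccard ranking, and the top-3 behavior come from the tool descriptions above:

```python
import re

def task_fingerprint(text: str) -> frozenset[str]:
    """Deterministic fingerprint: the same text always yields the same
    token set. (Toy tokenizer; the real COV vocabulary is richer.)"""
    return frozenset(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))

def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def behavioral_twins(task: str, index: dict[str, frozenset[str]], k: int = 3):
    """Rank indexed code units (unit name -> fingerprint) by similarity
    to the task fingerprint and return the top-k twins."""
    fp = task_fingerprint(task)
    ranked = sorted(index.items(), key=lambda kv: jaccard(fp, kv[1]), reverse=True)
    return ranked[:k]

# Toy index; in practice the fingerprints come from the `bgi scan` artifacts.
index = {
    "validators/core.py:validate_field": task_fingerprint("validate a field value against the model schema"),
    "models/base.py:save": task_fingerprint("persist a model instance to the database"),
}
print(behavioral_twins("add stricter validation for email fields", index))
```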
Why actionability was flat — and how BGI-TWIN fixed it
The pre-shipment MCP improved evidence and latency but did not move actionability. Here is why.
The problem: Pre-shipment MCP gave the AI model a blast-radius summary and an architecture overview.
That told the model what the architecture looked like, but not where to make a specific change for a specific task.
The AI produced correct, well-grounded answers — but at 4.0/5 they were still "here's the pattern, you figure out the file."
The fix: BGI-TWIN surfaces the 3 code units that are most behaviorally similar to the task at hand —
real functions, real files, real line numbers from the codebase being changed.
The AI gets "this task looks like validators/core.py:validate_field, which lives inside seam boundary validators/↔models/ — start here."
That is the difference between 4.0 and 4.75.
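For concreteness, the structured brief might look roughly like this. This is a hypothetical shape: the field names are ours, not the tool's actual schema, but the contents mirror the twin_context description above:

```python
# Hypothetical twin_context brief -- field names are illustrative only.
twin_context_brief = {
    "task_cov": ["validate", "field", "email", "schema"],
    "twins": [  # top-3 behaviorally similar units from the real repo
        {"unit": "validators/core.py:validate_field", "similarity": 0.41},
    ],
    "strongest_seam": "validators/ <-> models/",
    "actionability_rubric": "1-5 scale, 5 = copy-paste-ready change plan",
    "escalation": "confidence-gated: ask for more context on weak twins",
}
```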
Per-prompt actionability after BGI-TWIN (refresh batch)
p04 = "safe implementation path" prompt. p01 = architecture overview, p02 = boundary identification, p03 = blast-radius analysis.
Per-repo breakdown
5 repos across 3 languages. Core table shows baseline/MCP/TWIN (deepseek), with GPT-4o and Gemini auto replication per repo below.
| Repo | Mode | Runs | Median Latency | Evidence Cov. | Boundary | Actionability |
|---|---|---|---|---|---|---|
| django/django (Python) | Baseline | 4 | 99.8s | 73.3% | 1.00 | 4.0 |
| | BGI MCP | 4 | 73.1s | 84.0% | 1.00 | 4.0 |
| | BGI TWIN | 4 | 60.9s | 75.3% | 1.00 | 5.0 |
| tiangolo/fastapi (Python) | Baseline | 3 | 131.3s | 93.2% | 1.00 | 4.3 |
| | BGI MCP | 3 | 54.8s | 66.7% | 1.00 | 4.3 |
| | BGI TWIN | 4 | 79.5s | 82.0% | 1.00 | 5.0 |
| pydantic/pydantic-core (Python + Rust) | Baseline | 4 | 192.2s | 48.6% | 0.75 | 4.0 |
| | BGI MCP | 4 | 63.3s | 86.7% | 1.00 | 4.0 |
| | BGI TWIN | 4 | 47.5s | 71.3% | 1.00 | 4.7 |
| prometheus/prometheus (Go) | Baseline | 6 | 89.9s | 90.0% | 1.00 | 4.0 |
| | BGI MCP | 6 | 119.9s | 90.0% | 1.00 | 4.0 |
| | BGI TWIN | 4 | 70.0s | 80.8% | 1.00 | 5.0 |
| vercel/next.js (TypeScript) | Baseline | 3 | 291.8s | 89.2% | 1.00 | 3.7 |
| | BGI MCP | 3 | 66.4s | 91.7% | 1.00 | 3.7 |
| | BGI TWIN | 4 | 88.9s | 63.4% | 1.00 | 4.0 |
BGI TWIN rows are from the post-shipment refresh batch (p01–p04, fresh runs, explicit twin_context invocation, CallToolRequest evidence in every run).
Boundary accuracy is 0–1 per run (correct seam identification). Actionability is 1–5.
| Repo | Mode | Runs | Median Latency | Evidence Cov. | Boundary | Actionability |
|---|---|---|---|---|---|---|
| tiangolo/fastapi | BGI TWIN (GPT-4o) | 4 | 31.6s | 45.4% | 1.00 | 5.0 |
| django/django | BGI TWIN (GPT-4o) | 4 | 48.3s | 47.0% | 1.00 | 5.0 |
| pydantic/pydantic-core | BGI TWIN (GPT-4o) | 4 | 40.4s | 43.8% | 1.00 | 4.5 |
| prometheus/prometheus | BGI TWIN (GPT-4o) | 4 | 53.4s | 59.6% | 1.00 | 4.8 |
| vercel/next.js | BGI TWIN (GPT-4o) | 4 | 33.6s | 44.0% | 1.00 | 5.0 |
| tiangolo/fastapi | BGI TWIN (Gemini auto) | 4 | 65.8s | 78.7% | 1.00 | 4.75 |
| django/django | BGI TWIN (Gemini auto) | 4 | 61.9s | 45.7% | 0.75 | 4.00 |
| pydantic/pydantic-core | BGI TWIN (Gemini auto) | 4 | 54.8s | 47.5% | 1.00 | 3.50 |
| prometheus/prometheus | BGI TWIN (Gemini auto) | 4 | 69.7s | 71.4% | 1.00 | 4.25 |
| vercel/next.js | BGI TWIN (Gemini auto) | 4 | 96.2s | 68.5% | 1.00 | 4.75 |
Notable findings
The most important things the data actually says.
pydantic-core — the clearest win in the dataset
Baseline p01: evidence coverage 0%, boundary accuracy 0.
The model had no idea about the Rust/Python split and described a pure-Python architecture that does not exist.
BGI MCP p01: evidence coverage 80%, boundary accuracy 1.0.
BGI injected the exact pyo3 bridge seam; the model identified it correctly on the first attempt.
BGI TWIN p04: actionability 5/5, evidence 100%.
The safe-implementation prompt produced a copy-paste-ready patch path with exact file + function references.
fastapi — what the data actually shows (honest)
MCP evidence coverage dropped on p03/p04 (33.3% and 66.7% vs baseline 90% and 100%).
This is real and worth understanding: the baseline model, having no architecture summary, was forced to read
every source file individually and built a 10-item verified-claim table. The MCP model received blast-radius
context (1,614 impacted units) and accepted that as the architectural picture — making fewer granular
verifications.
What this reveals: MCP architecture context trades granular file-reading depth for boundary
accuracy. On well-structured repos where baseline exploration is already strong, the evidence-coverage gain
is smaller. Boundary accuracy was perfect (1.0) in all fastapi modes — the architecture was correctly understood.
BGI-TWIN p01–p04 refresh recovered with 82% mean evidence and 5/5 actionability because behavioral twins
anchor the model to specific files rather than a summary.
Raw outputs:
validation/runs/fastapi/
Prometheus (Go) — cross-language neutral result
Evidence coverage is flat at 90.0% in baseline and MCP modes. MCP is slower (119.9s vs 89.9s).
BGI-TWIN refresh improved actionability to 5/5 and brought latency down to 70s.
This is the most important neutral finding: MCP gains on accuracy are largest when baseline exploration is architecturally blind (pydantic-core: 0→80%). On well-explored Go codebases, the accuracy lift is smaller. BGI-TWIN's actionability gain holds across both cases.
next.js (TypeScript) — large monorepo signal
Largest baseline latency in the set (291.8s) — BGI reduces it to 33–89s depending on mode/model. Evidence coverage is lower in the GPT-4o replication slice, but boundary accuracy remains perfect (1.0) and actionability improves with TWIN context.
Hallucination rate: 0 across all 100 scored runs
Not a single factually incorrect module or file claim appeared in any baseline, MCP, or BGI-TWIN run across all three models. MCP and BGI-TWIN ground the model on real indexed artifacts — they do not introduce new errors.
Limitations — read this first
We publish limitations before you find them. A reader who discovers a limitation on their own trusts the evidence less than one we disclosed up front.
Self-reported scoring
Ground-truth checklists were written by us and scored by us. The checklists were defined by reading the actual source code before scoring (not after). The full scoring rubric is at
validation/SCORING_RUBRIC.md.
Every raw AI output is committed to
validation/runs/ on GitHub.
Anyone can re-score independently. If you score any run differently, open an issue with your reasoning — we will update the record.
100 runs across 5 repos is stronger, but still limited
Python + Go + TypeScript is better than Python-only, and we now have three-model independent replication (deepseek, GPT-4o, Gemini auto), but 5 repos is still a small sample. The pydantic-core finding (0% → 80% evidence on a Rust/Python hybrid) is strong enough to stand independently. For stronger statistical confidence, we still need more repos and at least one external team replication.
BGI-TWIN refresh is MCP-only — no updated baseline
The post-shipment refresh only ran the MCP+TWIN mode. The baseline numbers in the per-repo table are from the earlier A/B batch. We have no reason to believe the baseline changed (same repos, same prompts), but this is a real limitation of the experimental design.
Evidence coverage is sensitive to explicit tagging style
This rubric rewards explicit claim-level VERIFIED/HYPOTHESIS/UNKNOWN tagging with file/line citations. deepseek follows this protocol most strictly (13.9 tags/run), Gemini auto is intermediate (8.35/run), and GPT-4o tags least explicitly (6.95/run). All three produce correct answers with high actionability and zero hallucinations — the coverage gap is a rubric/style interaction, not a quality gap. The tag-relaxed second score accounts for this.
One MCP run did not invoke tools
One next.js p04 run in the original A/B batch was confirmed as invalid (no CallToolRequest in output) and is explicitly marked in runs.csv as unscored. All 60 BGI-TWIN refresh runs (deepseek, GPT-4o, Gemini) have CallToolRequest evidence present.
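This validity check is mechanical and easy to re-run. A minimal sketch, assuming the raw transcripts are text files under validation/runs/ (adjust the glob to the actual file layout):

```python
from pathlib import Path

# Flag any run transcript that lacks MCP tool-call evidence.
for run in sorted(Path("validation/runs").rglob("*")):
    if run.is_file() and "CallToolRequest" not in run.read_text(errors="ignore"):
        print(f"no tool-call evidence: {run}")
```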
We still need external replication
We completed three-model independent replication (GPT-4o and Gemini auto). The next credibility jump is an external replication: one engineer who has never used Big Indexer running this protocol on their own repo and publishing raw outputs. If you are willing to do this, open an issue and we will send the exact protocol.
Methodology
Enough detail to reproduce every number on this page.
Repos
- tiangolo/fastapi (Python, web framework)
- django/django (Python, web framework)
- pydantic/pydantic-core (Python + Rust)
- prometheus/prometheus (Go, monitoring)
- vercel/next.js (TypeScript, monorepo)
Prompts (4 per repo)
- p01 — Architecture overview
- p02 — Boundary identification
- p03 — Blast-radius analysis
- p04 — Safe implementation path
Tooling
- CLI: opencode 1.14.41 / gemini CLI (auto)
- Models: deepseek-v4-flash + azure/gpt-4o + gemini/auto
- MCP:
bgi mcp --graph ... --fuse-graph ...
- TWIN:
twin_context explicit in prompt
Scoring rubric
- Evidence coverage: recall vs ground-truth checklist
- Evidence (2nd): tag-relaxed coverage with capped anchor credit
- Boundary accuracy: 0/1 correct seam identification
- Actionability: 1–5 (5 = copy-paste ready)
- Hallucination flags: count of incorrect claims
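A minimal sketch of the two most mechanical scores, under the simplifying assumption that checklist matching is a substring check (real scoring is a human judgment against the published rubric):

```python
def evidence_coverage(output: str, checklist: list[str]) -> float:
    """Strict coverage: recall of ground-truth checklist items that the
    output states as explicit claims (substring match is a simplification)."""
    hits = sum(1 for item in checklist if item in output)
    return 100.0 * hits / len(checklist)

def boundary_accuracy(identified_seam: str, true_seam: str) -> int:
    """0/1 per run: did the model name the correct architectural seam?"""
    return int(identified_seam == true_seam)
```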
Reproduce this yourself
```bash
# 1. Install
pip install bigindexer

# 2. Clone any repo and build the BGI index
git clone --depth 1 https://github.com/tiangolo/fastapi
bgi scan fastapi/ --out output/

# 3. Start the MCP context server (includes the BGI-TWIN tools)
bgi mcp --graph output/bgi-graph.json --fuse-graph output/fuse-graph.json

# 4. Point your AI coding client at it (opencode.json in the repo dir):
# {
#   "mcp": {
#     "bgi": {
#       "command": "bgi",
#       "args": ["mcp", "--graph", "output/bgi-graph.json",
#                "--fuse-graph", "output/fuse-graph.json"]
#     }
#   }
# }

# 5. Run with a task-anchored prompt to trigger twin_context:
opencode  # → AI receives architecture summary + behavioral twins + seam + rubric
```
The three BGI-TWIN tools are available as MCP tools automatically once the server starts. No additional configuration needed.
Want to run the full validation protocol?
The exact prompt texts for p01–p04 are committed at
validation/.
Run against any repo, score using the rubric, and post your results — we will add them to the evidence record.