Public Evidence · 100 scored runs
Does Big Indexer actually help AI coding assistants?
We ran the same architectural prompts on 5 real open-source repos — with and without Big Indexer — and published every raw output, every score, and every methodology decision.
Results at a glance
All numbers are from the post-shipment BGI-TWIN refresh (20 runs, p01–p04 × 5 repos, MCP + twin_context mode). Full comparison in the table below.
- Actionability: 4.25–4.85/5 (cross-model TWIN range: deepseek / GPT-4o / Gemini)
- Boundary Accuracy: 1.00 (perfect, 20/20 runs)
- Hallucinations: 0 (across all 100 scored runs)
- Median Latency: 68s (vs 134s baseline, 49% faster)
- p04 Evidence Coverage: 96% (safe-implementation slice, 5 repos)
Three-stage core comparison
Each stage adds a layer. BGI-MCP fixed boundary & latency. BGI-TWIN fixed actionability. Three-model replication (GPT-4o, Gemini auto) is shown in the next section.
| Metric | No BGI Baseline (20 runs) | BGI MCP, Pre-shipment (20 runs) | BGI MCP + TWIN, Post-shipment (20 runs) |
|---|---|---|---|
| Actionability (1–5) | 4.0 | 4.0 (flat) | 4.75 (+0.75) |
| Evidence Coverage | 78.7% | 84.9% (+6.2 pp) | 79.9% / 96%† |
| Boundary Accuracy | 0.95 | 1.00 (+0.05) | 1.00 (held) |
| Hallucinations | 0 | 0 | 0 |
| Median Latency | 133.8s | 66.2s (51% faster) | 68.5s (≈ held) |
† 79.9% is the all-prompt mean for the refresh batch; 96% is the p04 (safe implementation path) slice across 5 repos.
The harder p01–p03 prompts pull the mean down; p04 evidence is the most actionability-relevant slice.
What changed between stage 2 and stage 3?
BGI-TWIN shipped three new MCP tools — task_fingerprint, behavioral_twins, and twin_context —
that give the AI model a fingerprinted task context, the top-3 behaviorally similar code units from the real repo,
the strongest architectural seam for the task, and an explicit 5-point actionability rubric.
That is why actionability moved from 4.0 to 4.75 while boundary accuracy and hallucinations stayed perfect.
Three-model independent replication (deepseek + GPT-4o + Gemini auto)
We re-ran the full TWIN protocol (p01–p04 × 5 repos) on azure/gpt-4o and gemini/auto to add model diversity and check cross-model consistency.
| Metric | BGI TWIN (deepseek) | BGI TWIN (GPT-4o) | BGI TWIN (Gemini auto) |
|---|---|---|---|
| Actionability (1–5) | 4.75 | 4.85 | 4.25 |
| Evidence coverage (strict) | 79.9% (96% p04) | 47.9% (49.3% p04) | 62.4% (80% p04) |
| Evidence (tag-relaxed, 2nd score) | 94.8% (100% p04) | 59.5% (62.7% p04) | 83.4% |
| Boundary accuracy | 1.00 | 1.00 | 0.95 |
| Hallucinations | 0 | 0 | 0 |
| Median latency | 68.5s | 41.6s | 65.8s |
Replication interpretation: actionability, boundary accuracy (≥0.95), and zero hallucinations held across all three models.
Evidence coverage varies with each model's tagging style — see the callout below.
Gemini auto's 0.95 boundary score reflects one genuine architectural miss (django/p02: the model went depth-first into query.py), not a pattern-calibration issue.
Why evidence coverage varies across models — and why it is not a quality signal
Evidence coverage is scored as checklist recall from explicit, claim-level evidence statements. Across the same TWIN protocol, deepseek outputs contained 278 explicit
VERIFIED/HYPOTHESIS/UNKNOWN tags (13.9/run), Gemini outputs contained
167 (8.35/run), and GPT-4o outputs contained 139 (6.95/run).
On the p04 implementation slice, actionability stayed high across all three models (deepseek 4.75, GPT-4o 4.85, Gemini 4.25), boundary stayed ≥0.95, and hallucinations stayed zero.
Interpretation: this is a model/rubric interaction (explicit tagging style), not a loss of BGI grounding.
deepseek follows the VERIFIED/HYPOTHESIS/UNKNOWN protocol most strictly; Gemini auto is intermediate; GPT-4o gives correct answers without exhaustive tagging.
The tag-relaxed second score (94.8% / 83.4% / 59.5%) closes most of this gap by crediting concrete repo-anchored claims without explicit labels.
We keep scores unnormalized and publish raw outputs so readers can audit this directly.
Second evidence score formula (tag-relaxed): `min(100, evidence_coverage_pct + min(25, (unlabeled_repo_anchor_lines / checklist_items) * 100 * 0.15))`.
This adds conservative credit for non-log lines that cite concrete repo anchors (*.py, *.go, *.ts, etc.) without explicit evidence tags.
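For anyone re-scoring, the formula transcribes directly into Python. A minimal sketch, where the anchor regex and the tag check are our illustrative stand-ins for the definitions in validation/SCORING_RUBRIC.md (the non-log-line filter is omitted for brevity):

```python
import re

# Illustrative "concrete repo anchor" pattern; the canonical extension list
# and the non-log-line filter are defined in the published rubric.
REPO_ANCHOR = re.compile(r"[\w/.-]+\.(?:py|go|ts|tsx|rs|js)\b")
EVIDENCE_TAGS = ("VERIFIED", "HYPOTHESIS", "UNKNOWN")

def tag_relaxed_score(evidence_coverage_pct: float,
                      output_lines: list[str],
                      checklist_items: int) -> float:
    """Second evidence score: strict coverage plus capped credit for
    untagged lines that still cite a concrete repo file."""
    unlabeled_anchor_lines = sum(
        1 for line in output_lines
        if REPO_ANCHOR.search(line)
        and not any(tag in line for tag in EVIDENCE_TAGS)
    )
    bonus = min(25, (unlabeled_anchor_lines / checklist_items) * 100 * 0.15)
    return min(100, evidence_coverage_pct + bonus)
```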
Raw-output comparison (same prompt, two models)
Prompt: fastapi p03 — "blast radius for solve_dependencies"
deepseek (opencode_mcp_p03_twin_refresh_r2):
| Definition spans lines 598–735 ... | VERIFIED |
| get_request_handler ... | VERIFIED |
| get_websocket_app ... | VERIFIED |
VERIFIED: Only 3 call sites exist in the codebase.
GPT-4o (opencode_mcp_p03_twin_refresh_gpt4o_r1):
1. VERIFIED: Changes to solve_dependencies will directly impact tests.
2. HYPOTHESIS: Other clusters may have dependent implications.
Source artifacts: deepseek run · GPT-4o run
What is BGI-TWIN?
Three deterministic MCP tools that turn "here is the architecture" into "here is exactly what to change and where."
🔬 task_fingerprint(task)
Converts a natural-language task into a COV (co-occurrence vocabulary) token set. Deterministic — same task always produces the same fingerprint. No LLM call.
🧬 behavioral_twins(task)
Ranks all indexed code units by Jaccard similarity against the task fingerprint. Returns the top-3 units that do the most similar thing in the actual codebase — with optional source snippets.
🎯 twin_context(task)
Composite output: task COV + top twins + the strongest seam boundary for the task + an explicit 5-point actionability rubric + confidence-gated escalation. The AI gets a fully structured implementation brief.
BGI-TWIN is a deterministic context compiler — it does not generate code, does not call an LLM, and does not speculate about behavior. Every output is derived directly from the indexed graph and fuse artifacts you already built with bgi scan.
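To make the ranking concrete, here is a minimal sketch of the fingerprint-and-rank idea, assuming a COV fingerprint reduces to a token set. The tokenizer and the toy index are our simplifications; only the determinism, the Jaccard ranking, and the top-3 behavior come from the tool descriptions above:

```python
import re

def task_fingerprint(text: str) -> frozenset[str]:
    """Deterministic fingerprint: the same text always yields the same
    token set. (Toy tokenizer; the real COV vocabulary is richer.)"""
    return frozenset(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))

def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def behavioral_twins(task: str, index: dict[str, frozenset[str]], k: int = 3):
    """Rank indexed code units (unit name -> fingerprint) by similarity
    to the task fingerprint and return the top-k twins."""
    fp = task_fingerprint(task)
    ranked = sorted(index.items(), key=lambda kv: jaccard(fp, kv[1]), reverse=True)
    return ranked[:k]

# Toy index; in practice the fingerprints come from the `bgi scan` artifacts.
index = {
    "validators/core.py:validate_field": task_fingerprint("validate a field value against the model schema"),
    "models/base.py:save": task_fingerprint("persist a model instance to the database"),
}
print(behavioral_twins("add stricter validation for email fields", index))
```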
Why actionability was flat — and how BGI-TWIN fixed it
The pre-shipment MCP improved evidence and latency but did not move actionability. Here is why.
The problem: Pre-shipment MCP gave the AI model a blast-radius summary and an architecture overview.
That told the model what the architecture looked like, but not where to make a specific change for a specific task.
The AI produced correct, well-grounded answers — but at 4.0/5 they were still "here's the pattern, you figure out the file."
The fix: BGI-TWIN surfaces the 3 code units that are most behaviorally similar to the task at hand —
real functions, real files, real line numbers from the codebase being changed.
The AI gets "this task looks like validators/core.py:validate_field, which lives inside seam boundary validators/↔models/ — start here."
That is the difference between 4.0 and 4.75.
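For concreteness, the structured brief might look roughly like this. This is a hypothetical shape: the field names are ours, not the tool's actual schema, but the contents mirror the twin_context description above:

```python
# Hypothetical twin_context brief -- field names are illustrative only.
twin_context_brief = {
    "task_cov": ["validate", "field", "email", "schema"],
    "twins": [  # top-3 behaviorally similar units from the real repo
        {"unit": "validators/core.py:validate_field", "similarity": 0.41},
    ],
    "strongest_seam": "validators/ <-> models/",
    "actionability_rubric": "1-5 scale, 5 = copy-paste-ready change plan",
    "escalation": "confidence-gated: ask for more context on weak twins",
}
```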
Per-prompt actionability after BGI-TWIN (refresh batch)
p04 = "safe implementation path" prompt. p01 = architecture overview, p02 = boundary identification, p03 = blast-radius analysis.
Per-repo breakdown
5 repos across 3 languages. Core table shows baseline/MCP/TWIN (deepseek), with GPT-4o and Gemini auto replication per repo below.
| Repo | Mode | Runs | Median Latency | Evidence Cov. | Boundary | Actionability |
|---|---|---|---|---|---|---|
| django/django (Python) | Baseline | 4 | 99.8s | 73.3% | 1.00 | 4.0 |
| | BGI MCP | 4 | 73.1s | 84.0% | 1.00 | 4.0 |
| | BGI TWIN | 4 | 60.9s | 75.3% | 1.00 | 5.0 |
| tiangolo/fastapi (Python) | Baseline | 3 | 131.3s | 93.2% | 1.00 | 4.3 |
| | BGI MCP | 3 | 54.8s | 66.7% | 1.00 | 4.3 |
| | BGI TWIN | 4 | 79.5s | 82.0% | 1.00 | 5.0 |
| pydantic/pydantic-core (Python + Rust) | Baseline | 4 | 192.2s | 48.6% | 0.75 | 4.0 |
| | BGI MCP | 4 | 63.3s | 86.7% | 1.00 | 4.0 |
| | BGI TWIN | 4 | 47.5s | 71.3% | 1.00 | 4.7 |
| prometheus/prometheus (Go) | Baseline | 6 | 89.9s | 90.0% | 1.00 | 4.0 |
| | BGI MCP | 6 | 119.9s | 90.0% | 1.00 | 4.0 |
| | BGI TWIN | 4 | 70.0s | 80.8% | 1.00 | 5.0 |
| vercel/next.js (TypeScript) | Baseline | 3 | 291.8s | 89.2% | 1.00 | 3.7 |
| | BGI MCP | 3 | 66.4s | 91.7% | 1.00 | 3.7 |
| | BGI TWIN | 4 | 88.9s | 63.4% | 1.00 | 4.0 |
BGI TWIN rows are from the post-shipment refresh batch (p01–p04, fresh runs, explicit twin_context invocation, CallToolRequest evidence in every run).
Boundary accuracy is 0–1 per run (correct seam identification). Actionability is 1–5.
| Repo | Mode | Runs | Median Latency | Evidence Cov. | Boundary | Actionability |
|---|---|---|---|---|---|---|
| tiangolo/fastapi | BGI TWIN (GPT-4o) | 4 | 31.6s | 45.4% | 1.00 | 5.0 |
| django/django | BGI TWIN (GPT-4o) | 4 | 48.3s | 47.0% | 1.00 | 5.0 |
| pydantic/pydantic-core | BGI TWIN (GPT-4o) | 4 | 40.4s | 43.8% | 1.00 | 4.5 |
| prometheus/prometheus | BGI TWIN (GPT-4o) | 4 | 53.4s | 59.6% | 1.00 | 4.8 |
| vercel/next.js | BGI TWIN (GPT-4o) | 4 | 33.6s | 44.0% | 1.00 | 5.0 |
| tiangolo/fastapi | BGI TWIN (Gemini auto) | 4 | 65.8s | 78.7% | 1.00 | 4.75 |
| django/django | BGI TWIN (Gemini auto) | 4 | 61.9s | 45.7% | 0.75 | 4.00 |
| pydantic/pydantic-core | BGI TWIN (Gemini auto) | 4 | 54.8s | 47.5% | 1.00 | 3.50 |
| prometheus/prometheus | BGI TWIN (Gemini auto) | 4 | 69.7s | 71.4% | 1.00 | 4.25 |
| vercel/next.js | BGI TWIN (Gemini auto) | 4 | 96.2s | 68.5% | 1.00 | 4.75 |
Notable findings
The most important things the data actually says.
pydantic-core — the clearest win in the dataset
Baseline p01: evidence coverage 0%, boundary accuracy 0.
The model had no idea about the Rust/Python split and described a pure-Python architecture that does not exist.
BGI MCP p01: evidence coverage 80%, boundary accuracy 1.0.
BGI injected the exact pyo3 bridge seam; the model identified it correctly on the first attempt.
BGI TWIN p04: actionability 5/5, evidence 100%.
The safe-implementation prompt produced a copy-paste-ready patch path with exact file + function references.
fastapi — what the data actually shows (honest)
MCP evidence coverage dropped on p03/p04 (33.3% and 66.7% vs baseline 90% and 100%).
This is real and worth understanding: the baseline model, having no architecture summary, was forced to read
every source file individually and built a 10-item verified-claim table. The MCP model received blast-radius
context (1,614 impacted units) and accepted that as the architectural picture — making fewer granular
verifications.
What this reveals: MCP architecture context trades granular file-reading depth for boundary
accuracy. On well-structured repos where baseline exploration is already strong, the evidence-coverage gain
is smaller. Boundary accuracy was perfect (1.0) in all fastapi modes — the architecture was correctly understood.
BGI-TWIN p01–p04 refresh recovered with 82% mean evidence and 5/5 actionability because behavioral twins
anchor the model to specific files rather than a summary.
Raw outputs:
validation/runs/fastapi/
Prometheus (Go) — cross-language neutral result
Evidence coverage is flat at 90.0% in baseline and MCP modes. MCP is slower (119.9s vs 89.9s).
BGI-TWIN refresh improved actionability to 5/5 and brought latency down to 70s.
This is the most important neutral finding: MCP gains on accuracy are largest when baseline exploration is architecturally blind (pydantic-core: 0→80%). On well-explored Go codebases, the accuracy lift is smaller. BGI-TWIN's actionability gain holds across both cases.
next.js (TypeScript) — large monorepo signal
Largest baseline latency in the set (291.8s) — BGI reduces it to 33–89s depending on mode/model. Evidence coverage is lower in the GPT-4o replication slice, but boundary accuracy remains perfect (1.0) and actionability improves with TWIN context.
Hallucination rate: 0 across all 100 scored runs
Not a single factually incorrect module or file claim appeared in any baseline, MCP, or BGI-TWIN run across all three models. MCP and BGI-TWIN ground the model on real indexed artifacts — they do not introduce new errors.
Limitations — read this first
We publish limitations before you find them. A reader who discovers a limitation on their own trusts the evidence less than one we disclosed up front.
Self-reported scoring
Ground-truth checklists were written by us and scored by us. The checklists were defined by reading the actual source code before scoring (not after). The full scoring rubric is at
validation/SCORING_RUBRIC.md.
Every raw AI output is committed to
validation/runs/ on GitHub.
Anyone can re-score independently. If you score any run differently, open an issue with your reasoning — we will update the record.
100 runs across 5 repos is stronger, but still limited
Python + Go + TypeScript is better than Python-only, and we now have three-model independent replication (deepseek, GPT-4o, Gemini auto), but 5 repos is still a small sample. The pydantic-core finding (0% → 80% evidence on a Rust/Python hybrid) is strong enough to stand independently. For stronger statistical confidence, we still need more repos and at least one external team replication.
BGI-TWIN refresh is MCP-only — no updated baseline
The post-shipment refresh only ran the MCP+TWIN mode. The baseline numbers in the per-repo table are from the earlier A/B batch. We have no reason to believe the baseline changed (same repos, same prompts), but this is a real limitation of the experimental design.
Evidence coverage is sensitive to explicit tagging style
This rubric rewards explicit claim-level VERIFIED/HYPOTHESIS/UNKNOWN tagging with file/line citations. deepseek follows this protocol most strictly (13.9 tags/run), Gemini auto is intermediate (8.35/run), and GPT-4o tags least explicitly (6.95/run). All three produce correct answers with high actionability and zero hallucinations — the coverage gap is a rubric/style interaction, not a quality gap. The tag-relaxed second score accounts for this.
One MCP run did not invoke tools
One next.js p04 run in the original A/B batch was confirmed as invalid (no CallToolRequest in output) and is explicitly marked in runs.csv as unscored. All 60 BGI-TWIN refresh runs (deepseek, GPT-4o, Gemini) have CallToolRequest evidence present.
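This validity check is mechanical and easy to re-run. A minimal sketch, assuming the raw transcripts are text files under validation/runs/ (adjust the glob to the actual file layout):

```python
from pathlib import Path

# Flag any run transcript that lacks MCP tool-call evidence.
for run in sorted(Path("validation/runs").rglob("*")):
    if run.is_file() and "CallToolRequest" not in run.read_text(errors="ignore"):
        print(f"no tool-call evidence: {run}")
```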
We still need external replication
We completed three-model independent replication (GPT-4o and Gemini auto). The next credibility jump is an external replication: one engineer who has never used Big Indexer running this protocol on their own repo and publishing raw outputs. If you are willing to do this, open an issue and we will send the exact protocol.
Methodology
Enough detail to reproduce every number on this page.
Repos
- tiangolo/fastapi (Python, web framework)
- django/django (Python, web framework)
- pydantic/pydantic-core (Python + Rust)
- prometheus/prometheus (Go, monitoring)
- vercel/next.js (TypeScript, monorepo)
Prompts (4 per repo)
- p01 — Architecture overview
- p02 — Boundary identification
- p03 — Blast-radius analysis
- p04 — Safe implementation path
Tooling
- CLI: opencode 1.14.41 / gemini CLI (auto)
- Models: deepseek-v4-flash + azure/gpt-4o + gemini/auto
- MCP:
bgi mcp --graph ... --fuse-graph ...
- TWIN:
twin_context explicit in prompt
Scoring rubric
- Evidence coverage: recall vs ground-truth checklist
- Evidence (2nd): tag-relaxed coverage with capped anchor credit
- Boundary accuracy: 0/1 correct seam identification
- Actionability: 1–5 (5 = copy-paste ready)
- Hallucination flags: count of incorrect claims
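A minimal sketch of the two most mechanical scores, under the simplifying assumption that checklist matching is a substring check (real scoring is a human judgment against the published rubric):

```python
def evidence_coverage(output: str, checklist: list[str]) -> float:
    """Strict coverage: recall of ground-truth checklist items that the
    output states as explicit claims (substring match is a simplification)."""
    hits = sum(1 for item in checklist if item in output)
    return 100.0 * hits / len(checklist)

def boundary_accuracy(identified_seam: str, true_seam: str) -> int:
    """0/1 per run: did the model name the correct architectural seam?"""
    return int(identified_seam == true_seam)
```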
Reproduce this yourself
```bash
# 1. Install
pip install bigindexer

# 2. Clone any repo and build the BGI index
git clone --depth 1 https://github.com/tiangolo/fastapi
bgi scan fastapi/ --out output/

# 3. Start the MCP context server (includes the BGI-TWIN tools)
bgi mcp --graph output/bgi-graph.json --fuse-graph output/fuse-graph.json

# 4. Point your AI coding client at it (opencode.json in the repo dir):
# {
#   "mcp": {
#     "bgi": {
#       "command": "bgi",
#       "args": ["mcp", "--graph", "output/bgi-graph.json",
#                "--fuse-graph", "output/fuse-graph.json"]
#     }
#   }
# }

# 5. Run with a task-anchored prompt to trigger twin_context:
opencode  # → AI receives architecture summary + behavioral twins + seam + rubric
```
The three BGI-TWIN tools are available as MCP tools automatically once the server starts. No additional configuration needed.
Want to run the full validation protocol?
The exact prompt texts for p01–p04 are committed at
validation/.
Run against any repo, score using the rubric, and post your results — we will add them to the evidence record.