The Multi-Chunk Search Problem: When One Document Dominates Your Results
dailytech

How adaptive chunking broke search relevance — and the field collapsing solution that fixed it. A real engineering story from the markdown-vault-mcp project.

Aniket Karne
Senior DevOps Engineer

Search relevance is one of those problems that looks solved until it isn’t. You implement full-text search, tune your scoring, and suddenly your top-5 results are all from the same document. That’s exactly what happened in markdown-vault-mcp — and the fix reveals something important about how AI-powered search differs from traditional keyword search.

The Problem: One Document, Five Slots

markdown-vault-mcp is an MCP server that indexes markdown vaults with FTS5 full-text search and semantic similarity. It uses adaptive chunking — instead of splitting documents into fixed-size blocks, it respects semantic boundaries (sections, headings, code blocks). This produces better search results because chunks are coherent units rather than arbitrary slices.

But adaptive chunking has a dark side: one document can produce many high-scoring chunks. A 150-chunk reference book, for instance, might have 5 chunks that all score well for a given query. The naive approach — return top-K raw rows — would show 5 results from the same document while genuinely relevant files never appear.

The user-reported reproduction was stark: searching for a topic filled 4 of the 5 “similar” slots with chunks from one document. The signal was real (that document was genuinely relevant), but the presentation was broken. Diversity had collapsed.

The Investigation: Why Naive Top-K Fails

The root cause is in how search results are ranked. Each chunk gets an independent score (BM25 for keyword, cosine similarity for semantic, or RRF-fused for hybrid), and nothing forces chunks from the same document to compete for a single slot. A document’s best chunk earns its place in the ranking, but if that document has 5 chunks above the cutoff, it takes 5 of the K slots.
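
To make the failure mode concrete, here is a minimal sketch. The paths and scores are made up for illustration and do not come from the project. Because every chunk is ranked on its own, sorting and slicing lets one file fill most of the slots.

```python
from collections import Counter

# Hypothetical (path, score) pairs, one per chunk; values are illustrative only.
chunks = [
    ("reference-book.md", 0.91),
    ("reference-book.md", 0.89),
    ("reference-book.md", 0.88),
    ("reference-book.md", 0.86),
    ("notes/indexing.md", 0.85),
    ("notes/ranking.md", 0.84),
    ("howto/fts5.md", 0.80),
]


def naive_top_k(rows, k):
    """Return the k highest-scoring chunks with no per-document grouping."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:k]


top5 = naive_top_k(chunks, k=5)
# Count how many of the five slots each file occupies.
slots_per_file = Counter(path for path, _ in top5)
print(slots_per_file)
```

With these sample scores, the reference book takes four of the five slots and two relevant notes never surface.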

This isn’t unique to AI search — traditional Elasticsearch has the same problem and solves it with collapse. The difference is that AI-powered search amplifies the issue because semantic similarity tends to cluster within documents. A long technical document will have multiple sections that all score high on a related query.

The Solution: Field Collapsing with MaxP Aggregation

The fix merged today (9d20177) introduces field collapsing across all search modes: search, get_similar, and get_context.similar. The implementation is in _group_by_path:

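def _group_by_path(rows, *, chunks_per_file, file_limit):
    # rows are assumed to arrive sorted by score, best first
    groups = {}
    for row in rows:
        # Stop once file_limit files are admitted and a new file shows up
        if row.path not in groups and len(groups) >= file_limit:
            break
        if row.path not in groups:
            groups[row.path] = []
        # Keep at most chunks_per_file sections per file
        if len(groups[row.path]) < chunks_per_file:
            groups[row.path].append(row)
    # The excerpt omits the return; handing back the mapping is an assumption
    return groups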

The key insight is MaxP aggregation: a file’s score equals the maximum score of its sections, not the sum or average. This means one excellent section is enough to surface the file, and other strong sections appear as sections[] sub-entries rather than competing for top-K slots. This follows established precedent: passage-aggregation work such as PARADE, Elasticsearch collapse, Vespa grouping, and Qdrant’s query_points_groups all group or aggregate chunk-level hits per document.
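
The choice of aggregate matters more than it might seem. The sketch below uses invented section scores for three hypothetical files and ranks them by max, mean, and sum. It is only meant to illustrate the trade-off, not to reproduce the project’s scoring.

```python
from statistics import mean

# Hypothetical per-section scores; values are illustrative only.
section_scores = {
    "reference-book.md": [0.91, 0.40, 0.35, 0.30],
    "short-note.md": [0.70, 0.68],
    "long-rambling.md": [0.50, 0.50, 0.50, 0.50, 0.50],
}


def rank_files(scores_by_file, aggregate):
    """Order file paths by an aggregate of their section scores, best first."""
    return sorted(
        scores_by_file,
        key=lambda path: aggregate(scores_by_file[path]),
        reverse=True,
    )


by_max = rank_files(section_scores, max)    # MaxP: the best section decides
by_mean = rank_files(section_scores, mean)  # weak sections drag a file down
by_sum = rank_files(section_scores, sum)    # many mediocre sections inflate a file
```

In this toy data, max puts the reference book first because of its one excellent section, mean demotes it below the short note, and sum promotes the file with many mediocre sections.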

The result type changed from list[SearchResult] to list[GroupedResult], where each group wraps one file with sections: list[SectionHit]. The limit parameter now means K files, not K chunks — a semantic change that required updating MCP tool signatures and documentation.
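
A rough sketch of what the grouped shape could look like follows. The type names GroupedResult and SectionHit come from the change described above, but every field name, the collapse helper, and the sample rows are assumptions made for illustration. Unlike the excerpt above, this variant keeps scanning after the file budget is spent so that later sections of already-admitted files are still collected.

```python
from dataclasses import dataclass, field


@dataclass
class SectionHit:
    heading: str  # hypothetical field
    score: float


@dataclass
class GroupedResult:
    path: str
    score: float = 0.0  # MaxP: score of the file's best section
    sections: list[SectionHit] = field(default_factory=list)


def collapse(rows, *, limit, chunks_per_file=3):
    """Collapse (path, heading, score) rows into at most `limit` files."""
    groups: dict[str, GroupedResult] = {}
    for path, heading, score in sorted(rows, key=lambda r: r[2], reverse=True):
        group = groups.get(path)
        if group is None:
            if len(groups) >= limit:
                continue  # file budget spent; skip files not yet admitted
            # Rows are visited best first, so the first score seen is the max.
            group = groups[path] = GroupedResult(path=path, score=score)
        if len(group.sections) < chunks_per_file:
            group.sections.append(SectionHit(heading, score))
    return list(groups.values())


# Illustrative rows only.
rows = [
    ("reference-book.md", "Intro", 0.91),
    ("reference-book.md", "Ch 2", 0.89),
    ("reference-book.md", "Ch 3", 0.88),
    ("reference-book.md", "Ch 4", 0.86),
    ("notes/indexing.md", "FTS5", 0.85),
    ("notes/ranking.md", "BM25", 0.84),
]
results = collapse(rows, limit=2)
```

Here limit=2 yields two files rather than two chunks: the reference book with its three best sections, followed by the indexing note.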

The Performance Fix: Index Gap on sections(document_id)

The field collapsing PR revealed a performance issue during second-opinion code review. The keyword search path had a correlated subquery that joined sections.document_id — but unlike links, document_tags, and document_aliases, the sections table had no index on that column. The fix (bf849ad) adds idx_sections_docid:

CREATE INDEX IF NOT EXISTS idx_sections_docid ON sections(document_id);

Without this index, every keyword search result triggered an O(n_sections) scan. With it, the join becomes an index lookup. The migration runs idempotently via CREATE INDEX IF NOT EXISTS.
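
One way to see the effect is to compare SQLite query plans before and after creating the index. The schema below is a stripped-down stand-in, not the project’s real tables, and the lookup is a simplified version of the join condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema; the real tables have more columns.
conn.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE sections (
        id INTEGER PRIMARY KEY,
        document_id INTEGER REFERENCES documents(id),
        heading TEXT
    );
    """
)


def plan(connection):
    """Return the EXPLAIN QUERY PLAN detail text for a lookup by document_id."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT heading FROM sections WHERE document_id = ?",
        (1,),
    ).fetchall()
    return " | ".join(row[3] for row in rows)


before = plan(conn)
conn.execute("CREATE INDEX IF NOT EXISTS idx_sections_docid ON sections(document_id)")
# Running the same statement again is a no-op, so the migration can be re-applied safely.
conn.execute("CREATE INDEX IF NOT EXISTS idx_sections_docid ON sections(document_id)")
after = plan(conn)

print(before)
print(after)
```

The first plan reports a scan of sections and the second a search using idx_sections_docid. The exact wording varies between SQLite versions.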

The Length-Downweight Bug: Penalizing Authority

A second issue surfaced post-merge (dd4ac42): length-downweighting was compounding with field collapsing in similarity searches. Length-downweight (α=0.25, introduced in PR #433) penalizes longer documents to favor concise, focused results over verbose ones. For regular search queries, this bias is useful: users typically want the specific section, not the entire chapter.

But for get_similar and get_context.similar, it’s harmful. When you ask “what’s similar to this document?”, a reference book with 150 chunks and raw cosine similarity of 0.90 should rank at the top. Instead, the compound penalty crushed its score to ~0.40, and bibliographies at 0.55-0.64 started appearing above it.

The fix is surgical: skip length-downweight for similarity queries (alpha=0.0), keep it for search queries. Field collapsing already structurally dedupes multi-chunk dominators — the downweight is redundant for similarity and actively harmful to authoritative long documents.
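
The article does not show the downweight formula, so the sketch below assumes a logarithmic penalty, score / (1 + α·ln(n_chunks)). That form is a guess. With α=0.25 and 150 chunks it happens to land near the ~0.40 figure quoted above, but treat it as illustrative only. The mode names mirror the tools mentioned earlier; the helper functions and the one-chunk bibliography are hypothetical.

```python
import math


def downweight(score, n_chunks, alpha):
    """Assumed penalty form; not confirmed as the project's formula."""
    return score / (1.0 + alpha * math.log(max(n_chunks, 1)))


def alpha_for(mode):
    """Keep the penalty for search, disable it for similarity modes."""
    return 0.25 if mode == "search" else 0.0


# A 150-chunk reference book with raw cosine similarity 0.90.
penalized_book = downweight(0.90, n_chunks=150, alpha=0.25)
# A hypothetical one-chunk bibliography with raw similarity 0.60.
bibliography = downweight(0.60, n_chunks=1, alpha=0.25)
# After the fix: similarity queries skip the penalty entirely.
fixed_book = downweight(0.90, n_chunks=150, alpha=alpha_for("get_similar"))
```

Under this assumed formula the penalized book drops below the bibliography, while alpha=0.0 leaves its raw 0.90 untouched and restores the expected ordering.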

What This Means for Multi-Agent Memory Systems

Field collapsing is directly relevant to anyone building agent memory or RAG systems. The pattern — adaptive chunking, semantic similarity, multi-chunk dominance — appears everywhere. The fix is general:

  1. Use field collapsing (MaxP aggregation) whenever your index produces multiple chunks per document
  2. Index the foreign-key columns you join on, not just primary keys
  3. Separate relevance tuning for search (short query → focused result) from similarity (document → document)
  4. Test with realistic data: a 150-chunk reference book will expose issues that 10-chunk documents won’t

The markdown-vault-mcp project is a useful reference implementation because it surfaces these problems in a concrete, testable system rather than a hypothetical RAG pipeline. The commit history (particularly PRs #469, #471, #472, #473) documents the full problem-solution cycle — worth studying if you’re building anything that indexes and searches structured markdown.

Aniket Karne
Senior DevOps Engineer at Nationale-Nederlanden, Amsterdam. Building with AI agents, Kubernetes, and cloud infrastructure. Writing about what's actually being built.

Written by Aniket Karne

May 12, 2026 at 12:00 AM UTC