Parent Document

Source

Synced from packages/sayou-chunking/examples/quick_start_parent_document.py.

Setup¶

Create parent–child chunk pairs using ParentDocumentSplitter.

ParentDocumentSplitter implements the Parent Document Retrieval pattern: it first splits the document into large parent chunks that preserve broad context, then splits each parent into smaller child chunks optimised for embedding similarity.

At query time, the retriever matches child chunks (small, precise) but returns the parent chunk (large, contextually rich) to the language model. This gives the model enough surrounding context to generate a grounded answer while keeping the embedding comparison sharp.

Chunk hierarchy:

Text Only

Document
  └─ Parent chunk 0  (large window — context carrier)
       ├─ Child chunk 0-0  (small window — embedding target)
       ├─ Child chunk 0-1
       └─ Child chunk 0-2
  └─ Parent chunk 1
       ├─ Child chunk 1-0
       └─ Child chunk 1-1

Python

import json

from sayou.chunking.pipeline import ChunkingPipeline
from sayou.chunking.splitter.parent_document_splitter import \
    ParentDocumentSplitter

pipeline = ChunkingPipeline(extra_splitters=[ParentDocumentSplitter])
print("Pipeline initialized.")

LONG_TEXT = (
    """\
Retrieval-Augmented Generation (RAG) systems combine a retriever and a
language model.  The retriever finds relevant passages; the language model
generates an answer conditioned on those passages.

Chunking strategy is one of the most impactful design decisions in a RAG
pipeline.  Chunk too coarsely and retrieval becomes imprecise.  Chunk too
finely and you lose the surrounding context the model needs to reason.

The Parent Document strategy addresses this trade-off directly.  Small
child chunks are embedded and indexed; when a child chunk is retrieved the
system fetches its parent — a larger, semantically complete section — and
passes that to the language model.

Implementation in Sayou uses RecursiveSplitter for both levels.  The parent
uses a larger window; children subdivide each parent with a smaller window.
Every child stores a `parent_id` that points back to its parent chunk, and
every parent stores a `child_ids` list for forward navigation.
"""
    * 3
)  # repeat to ensure multiple parents

Basic Parent–Child Split¶

parent_chunk_size controls how large each parent window is. chunk_size controls how large each child window is.

The pipeline returns parents and children interleaved in document order: parent 0, children of parent 0, parent 1, children of parent 1, …

Python

chunks = pipeline.run(
    {
        "content": LONG_TEXT,
        "metadata": {"id": "rag_doc"},
        "config": {
            "parent_chunk_size": 600,
            "chunk_size": 150,
            "chunk_overlap": 0,
        },
    },
    strategy="parent_document",
)

parents = [c for c in chunks if c.metadata.get("doc_level") == "parent"]
children = [c for c in chunks if c.metadata.get("doc_level") == "child"]

print("=== Basic Parent–Child Split ===")
print(f"  Parents  : {len(parents)}")
print(f"  Children : {len(children)}")
print(f"  Total    : {len(chunks)}")

Link Integrity¶

Every child's parent_id matches a parent's chunk_id. Every parent's child_ids contains the IDs of its children.

Python

parent_map = {p.metadata["chunk_id"]: p for p in parents}

print("\n=== Link Integrity ===")
for parent in parents:
    pid = parent.metadata["chunk_id"]
    child_ids = parent.metadata.get("child_ids", [])
    my_children = [c for c in children if c.metadata.get("parent_id") == pid]
    print(
        f"  {pid:30s}  children registered={len(child_ids)}  "
        f"children found={len(my_children)}"
    )

Size Comparison¶

Parents are significantly larger than children. Print average sizes to confirm the ratio matches the configured parent_chunk_size / chunk_size.

Python

avg_parent = sum(len(p.content) for p in parents) / max(len(parents), 1)
avg_child = sum(len(c.content) for c in children) / max(len(children), 1)

print("\n=== Size Comparison ===")
print(f"  Average parent length : {avg_parent:6.0f} chars")
print(f"  Average child length  : {avg_child:6.0f} chars")
print(f"  Ratio parent/child    : {avg_parent / max(avg_child, 1):.1f}x")

Structure-based Parents¶

Set parent_strategy="structure" to use StructureSplitter for the parent pass. This is useful when the document has clear section boundaries (e.g. legislation, contracts, or API documentation) that should not be split mid-section at the parent level.

Python

structured_chunks = pipeline.run(
    {
        "content": LONG_TEXT,
        "metadata": {"id": "rag_doc_structured"},
        "config": {
            "parent_chunk_size": 600,
            "chunk_size": 150,
            "chunk_overlap": 0,
            "parent_strategy": "structure",
        },
    },
    strategy="parent_document",
)

s_parents = [c for c in structured_chunks if c.metadata.get("doc_level") == "parent"]
s_children = [c for c in structured_chunks if c.metadata.get("doc_level") == "child"]

print("\n=== Structure-based Parents ===")
print(f"  Parents  : {len(s_parents)}")
print(f"  Children : {len(s_children)}")

Save Results¶

Save only child chunks — these are the units that get embedded and indexed. Parents are stored separately for context retrieval at query time.

Python

output = {
    "parents": [p.model_dump() for p in parents],
    "children": [c.model_dump() for c in children],
}
with open("parent_document_chunks.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2, ensure_ascii=False)

print(
    f"\nSaved {len(parents)} parents and {len(children)} children to parent_document_chunks.json"
)