Structure

Source

Synced from packages/sayou-chunking/examples/quick_start_structure.py.

Setup¶

Split structured text and record lists using StructureSplitter.

StructureSplitter has two operating modes selected automatically based on the block's type and content:

Mode	Trigger	Use case
Text mode	`content` is a string	Legal text, contracts, articles
Record mode	`type="record"` and `content` is a list	Transcripts, log lines, CSV rows

Text mode: splits on a user-defined regex pattern first (structure_pattern), then applies RecursiveSplitter to any section that still exceeds chunk_size.

Record mode: groups list records by one of:

- chapter_intervals: explicit (start, end, title) time ranges
- window_size: a rolling time window (e.g., every 300 seconds)
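The mode selection in the table above can be pictured as a small dispatch on the block's type and content. This is only a minimal sketch: `select_mode` is a hypothetical helper written for illustration, not part of the sayou API, and the real splitter may apply additional checks.

```python
from typing import Any


def select_mode(block_type: str, content: Any) -> str:
    """Return "record" or "text" following the trigger table (hypothetical helper)."""
    # Record mode needs both conditions: type "record" and list content.
    if block_type == "record" and isinstance(content, list):
        return "record"
    # Any string content falls back to text mode.
    if isinstance(content, str):
        return "text"
    raise TypeError(f"unsupported content type: {type(content).__name__}")


print(select_mode("text", "Article 1 ..."))
print(select_mode("record", [{"text": "hello", "start": 0.0}]))
```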

Python

import json

from sayou.chunking.pipeline import ChunkingPipeline
from sayou.chunking.splitter.structure_splitter import StructureSplitter

pipeline = ChunkingPipeline(extra_splitters=[StructureSplitter])
print("Pipeline initialized.")

Text Mode: Article-based Split¶

Legal and regulatory documents often use numbered articles as primary structural boundaries. Pass a regex structure_pattern that matches those boundaries; StructureSplitter splits there first and then handles any oversized sections with recursive sub-splitting.

Each chunk carries:

- metadata.parent_structure_idx: zero-based section index
- metadata.chunk_id: {doc_id}_s{section}, or {doc_id}_s{section}_p{part} for the parts of a sub-split section

Python

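# Sample Korean terms of service with four articles:
# 제1조 Purpose, 제2조 Scope of Use, 제3조 Prohibited Conduct, 제4조 Disclaimer.
legal_text = """\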
제1조 (목적)
이 약관은 Sayou Fabric 플랫폼의 이용 조건을 규정합니다.
모든 이용자는 본 약관에 동의하는 것으로 간주됩니다.

제2조 (이용 범위)
이용자는 상업적 제품에 플랫폼을 통합할 수 있으며,
문서 및 사용자 접점에서 출처 표기를 유지해야 합니다.

제3조 (금지 행위)
역설계, 무단 배포, 재라이선싱은 엄격히 금지됩니다.
위반 시 즉각적인 이용 정지 조치가 취해집니다.

제4조 (면책 조항)
플랫폼은 현 상태로 제공되며 운영사는 특정 목적 적합성을
보증하지 않습니다.
"""

text_chunks = pipeline.run(
    {
        "content": legal_text,
        "metadata": {"id": "contract"},
        "config": {
            "chunk_size": 300,
            "chunk_overlap": 0,
            # "제N조" is the Korean heading for "Article N".
            "structure_pattern": r"제\d+조",
        },
    },
    strategy="structure",
)

print("=== Text Mode: Article-based Split ===")
for chunk in text_chunks:
    idx = chunk.metadata.get("parent_structure_idx", "?")
    print(f"  [section {idx}] {chunk.content[:70]!r}")
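To make the text-mode behavior above more concrete, here is a standard-library sketch of the same idea: cut the text at every regex match, then break any section that is still longer than chunk_size into parts. It rests on several assumptions. `split_by_structure` and `pack_lines` are hypothetical names, the greedy line packing is a simple stand-in for RecursiveSplitter, and the zero-based part index and the handling of text before the first match are guesses rather than documented behavior.

```python
import re
from typing import Dict, List


def pack_lines(section: str, chunk_size: int) -> List[str]:
    """Greedily pack whole lines into parts of at most chunk_size characters.

    A single line longer than chunk_size is kept whole, so a part can
    still exceed the limit in this simplified stand-in.
    """
    parts: List[str] = []
    current = ""
    for line in section.splitlines():
        candidate = f"{current}\n{line}" if current else line
        if current and len(candidate) > chunk_size:
            parts.append(current)
            current = line
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts


def split_by_structure(text: str, pattern: str, chunk_size: int, doc_id: str) -> List[Dict]:
    # Every regex match starts a new section; text before the first match
    # (if any) becomes its own leading section (an assumption).
    starts = [m.start() for m in re.finditer(pattern, text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    sections = [s for s in sections if s]

    chunks: List[Dict] = []
    for idx, section in enumerate(sections):
        if len(section) <= chunk_size:
            chunks.append(
                {"chunk_id": f"{doc_id}_s{idx}", "parent_structure_idx": idx, "content": section}
            )
            continue
        # Oversized section: emit one chunk per part, tagged with _p{part}.
        for part_idx, part in enumerate(pack_lines(section, chunk_size)):
            chunks.append(
                {
                    "chunk_id": f"{doc_id}_s{idx}_p{part_idx}",
                    "parent_structure_idx": idx,
                    "content": part,
                }
            )
    return chunks


sample = "Article 1\nScope.\nArticle 2\nFirst duty.\nSecond duty.\nThird duty.\n"
for chunk in split_by_structure(sample, r"Article \d+", chunk_size=24, doc_id="demo"):
    print(chunk["chunk_id"], repr(chunk["content"]))
```

Because sections are cut at match positions, each chunk begins with its own article heading, which keeps the heading attached to the text it governs.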

Text Mode: HTML / Code Fence Detection¶

When strategy="auto", StructureSplitter is selected automatically for HTML and code-fence content; for HTML, that means the block type is "html" or the content starts with <html. Pass strategy="structure" explicitly to force its selection.

Python

html_content = """\
<html>
<body>
<h1>Introduction</h1>
<p>This document covers the basics of RAG pipelines.</p>
<h2>Chunking</h2>
<p>Chunking divides documents into retrieval-ready pieces.</p>
</body>
</html>
"""

html_chunks = pipeline.run(
    {
        "content": html_content,
        "metadata": {"id": "html_doc"},
        "config": {"chunk_size": 200, "chunk_overlap": 0},
    },
    strategy="structure",
)

print("\n=== Text Mode: HTML ===")
for chunk in html_chunks:
    print(f"  [{chunk.metadata.get('chunk_id')}] {chunk.content[:60]!r}")
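The HTML part of the auto-selection rule described in this section amounts to a two-condition check. The sketch below is illustrative only: `looks_like_html_block` is a hypothetical helper, and ignoring leading whitespace and letter case is an assumption, not confirmed behavior of the library.

```python
from typing import Any


def looks_like_html_block(block_type: str, content: Any) -> bool:
    """Hypothetical check mirroring the documented HTML trigger."""
    # Condition 1: the block is explicitly typed as HTML.
    if block_type == "html":
        return True
    # Condition 2: the content itself starts with an <html tag.
    # Stripping whitespace and lowercasing are assumptions of this sketch.
    return isinstance(content, str) and content.lstrip().lower().startswith("<html")


print(looks_like_html_block("text", "<html>\n<body>...</body>\n</html>"))
print(looks_like_html_block("text", "Plain prose without markup."))
```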

Record Mode: Chapter Intervals¶

Transcript cues (for example, from YouTube or a podcast) form a list of timed records. Define chapter boundaries as (start_sec, end_sec, title) tuples. StructureSplitter assigns each cue to the chapter whose time range contains the cue's start value, then merges the cue texts of each chapter into one chunk.

Each chunk carries:

- metadata.chapter_title: chapter name
- metadata.sayou:startTime: start time of the first cue
- metadata.sayou:endTime: end time of the last cue
- metadata.record_count: number of cues merged

Python

from sayou.core.schemas import SayouBlock

transcript_cues = [
    {"text": "Welcome to the session.", "start": 0.0, "duration": 3.0},
    {"text": "Today we cover chunking.", "start": 3.0, "duration": 4.0},
    {"text": "Recursive splitting is flexible.", "start": 65.0, "duration": 5.0},
    {"text": "Fixed length is predictable.", "start": 70.0, "duration": 4.0},
    {"text": "Semantic splitting groups ideas.", "start": 130.0, "duration": 5.0},
    {"text": "Choose based on your use case.", "start": 135.0, "duration": 3.0},
    {"text": "Thanks for joining us today.", "start": 185.0, "duration": 3.0},
]

chapter_intervals = [
    (0, 60, "Introduction"),
    (60, 120, "Text Splitting Strategies"),
    (120, 180, "Semantic Splitting"),
    (180, 240, "Closing"),
]

record_block = SayouBlock(
    type="record",
    content=transcript_cues,
    metadata={
        "id": "lecture_001",
        "config": {"chapter_intervals": chapter_intervals},
    },
)

chapter_chunks = pipeline.run(record_block, strategy="structure")

print("\n=== Record Mode: Chapter Intervals ===")
for chunk in chapter_chunks:
    m = chunk.metadata
    # Fall back to "?" so the :30s format spec never receives None.
    title = m.get("chapter_title") or "?"
    print(
        f"  [{title:30s}]  "
        f"cues={m.get('record_count')}  "
        f"start={m.get('sayou:startTime')}s  "
        f"end={m.get('sayou:endTime')}s"
    )
    print(f"    {chunk.content[:80]!r}")
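The chapter assignment described in this section can be sketched with the standard library alone. This is a rough illustration under stated assumptions: `group_by_chapters` is a hypothetical function, intervals are treated as half-open ranges [start, end), the end time is computed as start plus duration of the last cue, cue texts are joined with a single space, and chapters with no cues are skipped. The library may differ on any of these points.

```python
from typing import Dict, List, Tuple


def group_by_chapters(
    cues: List[Dict], intervals: List[Tuple[float, float, str]]
) -> List[Dict]:
    chunks: List[Dict] = []
    for start, end, title in intervals:
        # A cue belongs to the chapter whose range contains its start value.
        members = [c for c in cues if start <= c["start"] < end]
        if not members:
            continue  # assumption: empty chapters produce no chunk
        last = members[-1]
        chunks.append(
            {
                "content": " ".join(c["text"] for c in members),
                "chapter_title": title,
                "sayou:startTime": members[0]["start"],
                "sayou:endTime": last["start"] + last.get("duration", 0.0),
                "record_count": len(members),
            }
        )
    return chunks


demo_cues = [
    {"text": "Welcome.", "start": 0.0, "duration": 3.0},
    {"text": "Agenda.", "start": 3.0, "duration": 4.0},
    {"text": "Main topic.", "start": 65.0, "duration": 5.0},
]
demo_intervals = [(0, 60, "Introduction"), (60, 120, "Main"), (120, 180, "Unused")]

for chunk in group_by_chapters(demo_cues, demo_intervals):
    print(chunk["chapter_title"], chunk["record_count"], repr(chunk["content"]))
```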

Record Mode: Time Window¶

When chapter boundaries are unknown, use window_size (seconds) to group cues into fixed-duration chunks. Consecutive cues are accumulated until the window duration is reached, then a new chunk begins.

Python

window_block = SayouBlock(
    type="record",
    content=transcript_cues,
    metadata={
        "id": "lecture_002",
        "config": {"window_size": 60, "window_key": "start"},
    },
)

window_chunks = pipeline.run(window_block, strategy="structure")

print("\n=== Record Mode: Time Window (60s) ===")
for chunk in window_chunks:
    m = chunk.metadata
    print(
        f"  cues={m.get('record_count')}  "
        f"start={m.get('sayou:startTime')}s  "
        f"duration={m.get('sayou:duration', 0):.1f}s  "
        f"text={chunk.content[:60]!r}"
    )
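The windowing rule above (accumulate consecutive cues until the window duration is reached, then start a new chunk) can be sketched as follows. `group_by_window` is a hypothetical function, and anchoring each window at the first cue of the current chunk, rather than on a fixed time grid, is an assumption of this sketch.

```python
from typing import Dict, List


def group_by_window(
    cues: List[Dict], window_size: float, window_key: str = "start"
) -> List[Dict]:
    groups: List[List[Dict]] = []
    current: List[Dict] = []
    for cue in cues:
        # Close the chunk once this cue lies a full window past the chunk's first cue.
        if current and cue[window_key] - current[0][window_key] >= window_size:
            groups.append(current)
            current = []
        current.append(cue)
    if current:
        groups.append(current)
    return [
        {
            "content": " ".join(c["text"] for c in group),
            "start": group[0][window_key],
            "record_count": len(group),
        }
        for group in groups
    ]


demo_cues = [
    {"text": "a", "start": 0.0},
    {"text": "b", "start": 10.0},
    {"text": "c", "start": 59.0},
    {"text": "d", "start": 60.0},
    {"text": "e", "start": 125.0},
]
for chunk in group_by_window(demo_cues, window_size=60):
    print(chunk["start"], chunk["record_count"], repr(chunk["content"]))
```

With this anchoring, a long silence simply starts the next window at the next cue, so no empty chunks are produced.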

Save Results¶

Serialise text-mode and record-mode chunks to JSON.

Python

output = {
    "text_mode": [c.model_dump() for c in text_chunks],
    "record_chapters": [c.model_dump() for c in chapter_chunks],
    "record_windows": [c.model_dump() for c in window_chunks],
}
with open("structure_chunks.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2, ensure_ascii=False)

print("\nSaved to structure_chunks.json")
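The save step passes ensure_ascii=False, which matters here because the text-mode chunks contain Korean. The short sketch below, using only the standard library, shows the difference and a round trip through a temporary file. The record and its chunk_id value are illustrative, not real pipeline output.

```python
import json
import tempfile
from pathlib import Path

# Illustrative record; real chunks come from c.model_dump().
records = [{"chunk_id": "contract_s0", "content": "제1조 (목적)"}]

escaped = json.dumps(records)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(records, ensure_ascii=False)  # Korean text stays human-readable

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "structure_chunks.json"
    path.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")
    # Reading back with the same encoding restores the original data.
    restored = json.loads(path.read_text(encoding="utf-8"))

print(escaped)
print(readable)
print(restored == records)
```

Both forms load back to the same data; ensure_ascii=False only changes how readable the file is on disk, which is why the file must then be opened with encoding="utf-8".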