
# Markdown

## Setup

Split Markdown documents using MarkdownSplitter.

MarkdownSplitter understands the structure of Markdown. It first splits on headers (#, ##, ###, …), creating one chunk per heading, and then recursively splits the body text beneath each header if it exceeds chunk_size.

Every chunk is enriched with semantic metadata so downstream retrieval systems can reconstruct document hierarchy, navigate to the correct section, or filter by heading level.

Protected patterns prevent code fences (```), tables, and base64 images from being torn apart mid-block.

````python
import json

from sayou.chunking.pipeline import ChunkingPipeline
from sayou.chunking.plugins.markdown_splitter import MarkdownSplitter

pipeline = ChunkingPipeline(extra_splitters=[MarkdownSplitter])
print("Pipeline initialized.")

MARKDOWN = """\
# Sayou Fabric Overview

Sayou Fabric is a collection of data-processing libraries for the LLM era.

## Architecture

The system is composed of eight libraries coordinated by Brain.

### Connector

Connector collects raw data from external sources: files, databases,
APIs, and SaaS platforms.

### Chunking

Chunking breaks documents into retrieval-ready pieces.
It supports recursive, semantic, code-aware, and structure-aware strategies.

## Getting Started

Install the core package and at least one library:

```bash
pip install sayou-core sayou-chunking
```

Then initialise a pipeline and call run().

## Data Flow

| Stage   | Library   | Output      |
|---------|-----------|-------------|
| Collect | Connector | SayouPacket |
| Parse   | Document  | SayouBlock  |
| Refine  | Refinery  | SayouBlock  |
| Chunk   | Chunking  | SayouChunk  |
"""
````
## Header Chunking

Each Markdown heading becomes one `SayouChunk` with:

| metadata key      | value                           |
|-------------------|---------------------------------|
| `is_header`       | `True`                          |
| `semantic_type`   | `"h1"` / `"h2"` / `"h3"` / … |
| `level`           | heading depth as integer        |
| `chunk_id`        | unique string id                |

Body text beneath a heading becomes a separate chunk, linked to its header, with:

| metadata key      | value                                   |
|-------------------|-----------------------------------------|
| `parent_id`       | `chunk_id` of the preceding header      |
| `section_title`   | plain text of the preceding heading     |
| `semantic_type`   | `"text"`, `"table"`, `"code_block"`, … |

```python
chunks = pipeline.run(
    {"content": MARKDOWN, "config": {"chunk_size": 500}},
    strategy="markdown",
)

print("=== Header Chunking ===")
for chunk in chunks:
    is_hdr = chunk.metadata.get("is_header", False)
    s_type = chunk.metadata.get("semantic_type", "")
    level = chunk.metadata.get("level", "")
    parent = chunk.metadata.get("section_title", "")
    tag = f"H{level}" if is_hdr else f"body [{s_type}]"
    print(f"  {tag:20s} | parent={parent!r:30s} | {chunk.content[:50]!r}")
```

## Semantic Type Classification

Body chunks are classified into semantic types:

- `"text"`: plain prose
- `"code_block"`: content starting with ```
- `"table"`: content starting with `|`
- `"list_item"`: content starting with `-` or `*`

Use semantic_type to apply different embedding strategies per content type — code blocks and tables often benefit from specialised encoders.

```python
code_chunks = [c for c in chunks if c.metadata.get("semantic_type") == "code_block"]
table_chunks = [c for c in chunks if c.metadata.get("semantic_type") == "table"]
text_chunks = [c for c in chunks if c.metadata.get("semantic_type") == "text"]

print("\n=== Semantic Type Distribution ===")
print(f"  text blocks  : {len(text_chunks)}")
print(f"  code blocks  : {len(code_chunks)}")
print(f"  table blocks : {len(table_chunks)}")
```

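To act on that advice, here is a minimal routing sketch. The encoders below (`embed_text`, `embed_code`, `embed_table`) are hypothetical stand-ins for whatever embedding models you actually use; only `chunk.content`, `chunk.metadata`, and the `semantic_type` values come from the splitter.

```python
# Hypothetical per-type encoders; replace these stubs with real embedding models.
def embed_text(text: str) -> list[float]:
    return [float(len(text))]          # stub vector

def embed_code(text: str) -> list[float]:
    return [float(text.count("\n"))]   # stub vector

def embed_table(text: str) -> list[float]:
    return [float(text.count("|"))]    # stub vector

ENCODERS = {"text": embed_text, "code_block": embed_code, "table": embed_table}

def embed_chunk(chunk):
    # Route on the splitter's semantic_type; fall back to the text encoder.
    encoder = ENCODERS.get(chunk.metadata.get("semantic_type"), embed_text)
    return encoder(chunk.content)

vectors = [embed_chunk(c) for c in chunks]
```
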
## Header Hierarchy

Collect all header chunks and print the document outline. The heading level is stored as an integer in metadata["level"].

```python
headers = [c for c in chunks if c.metadata.get("is_header")]
print("\n=== Document Outline ===")
for h in headers:
    indent = "  " * (h.metadata["level"] - 1)
    print(f"  {indent}{'#' * h.metadata['level']} {h.content}")
```

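Beyond the flat outline, the `parent_id` and `chunk_id` keys from the tables above let you reattach body chunks to their headers. A short sketch using only those documented metadata keys:

```python
from collections import defaultdict

# Index body chunks by the chunk_id of the header they belong to.
children = defaultdict(list)
for c in chunks:
    parent_id = c.metadata.get("parent_id")
    if parent_id:
        children[parent_id].append(c)

print("\n=== Sections with their body chunks ===")
for h in headers:
    body = children.get(h.metadata.get("chunk_id"), [])
    print(f"  {h.content}: {len(body)} body chunk(s)")
```
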
## Small chunk_size Triggers Sub-splitting

When body content exceeds chunk_size, MarkdownSplitter recursively splits it using the standard separator list. The resulting sub-chunks all share the same parent_id and section_title.

```python
fine_chunks = pipeline.run(
    {"content": MARKDOWN, "config": {"chunk_size": 100}},
    strategy="markdown",
)

body_chunks = [c for c in fine_chunks if not c.metadata.get("is_header")]
print("\n=== Sub-splitting at chunk_size=100 ===")
print(f"  Total chunks : {len(fine_chunks)}")
print(f"  Header chunks: {len(fine_chunks) - len(body_chunks)}")
print(f"  Body chunks  : {len(body_chunks)}")
```

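As a quick check of the claims above, the snippet below groups the fine-grained body chunks by `section_title` and prints any `code_block` chunks, which should come through whole thanks to the protected patterns. It relies only on the `content` and `metadata` fields already used on this page.

```python
from collections import Counter

# Sub-chunks split from the same section share section_title (and parent_id).
per_section = Counter(c.metadata.get("section_title") for c in body_chunks)
for title, count in per_section.most_common():
    print(f"  {title!r}: {count} body chunk(s)")

# Protected patterns: the fenced bash block should survive as a single chunk.
for c in fine_chunks:
    if c.metadata.get("semantic_type") == "code_block":
        print(f"  intact code block: {c.content!r}")
```
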
## Save Results

Serialise all chunks to JSON. Each entry is the chunk's full model_dump() structure, including its content and metadata.

```python
with open("markdown_chunks.json", "w", encoding="utf-8") as f:
    json.dump([c.model_dump() for c in chunks], f, indent=2, ensure_ascii=False)

print(f"\nSaved {len(chunks)} chunks to markdown_chunks.json")
```
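
If the chunks are consumed by a separate indexing job, the file can be read back with plain json. Assuming the model_dump() layout above, each entry is a dict with `content` and `metadata` keys:

```python
with open("markdown_chunks.json", encoding="utf-8") as f:
    records = json.load(f)

# Rebuild the document outline from the saved metadata alone.
for rec in records:
    if rec["metadata"].get("is_header"):
        print(f"{'#' * rec['metadata']['level']} {rec['content']}")
```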