
JSON

Setup

Split JSON data using JsonSplitter.

JsonSplitter operates on structured data rather than plain text. It accepts a JSON string, a Python list, or a Python dict as content and groups items into chunks whose serialised size stays under chunk_size bytes.

Two routing modes:

Input type   Strategy                   chunk_type metadata
list         Batch consecutive items    "json_list"
dict         Batch key-value pairs      "json_object"

Oversized items within a list are split further using the dict strategy. Oversized values within a dict are split recursively by key.

Use JsonSplitter for structured records from databases, API responses, transcript cue lists, or any other tabular data.

Python
import json

from sayou.chunking.pipeline import ChunkingPipeline
from sayou.chunking.plugins.json_splitter import JsonSplitter

pipeline = ChunkingPipeline(extra_splitters=[JsonSplitter])
print("Pipeline initialized.")

Split a JSON Array

The most common use case: a list of records fetched from a database or API.

Each chunk contains as many consecutive items as fit within chunk_size characters (after JSON serialisation). Metadata records:

  • chunk_type — "json_list"
  • item_count — number of items in this chunk
  • index_start — first item's original index
  • index_end — last item's original index
Python
records = [
    {"id": i, "name": f"user_{i:03d}", "score": round(i * 1.7, 2)} for i in range(30)
]

chunks = pipeline.run(
    {
        "content": json.dumps(records),
        "config": {"chunk_size": 300, "chunk_overlap": 0},
    },
    strategy="json",
)

print("=== Split a JSON Array ===")
for chunk in chunks:
    m = chunk.metadata
    print(
        f"  items {m['index_start']:2d}{m['index_end']:2d}  "
        f"count={m['item_count']}  "
        f"bytes={len(chunk.content):4d}"
    )

total = sum(c.metadata["item_count"] for c in chunks)
print(f"  Total items recovered: {total} / {len(records)}")

Split a JSON Object

A large dict is split by key: key-value pairs are batched so that each chunk's serialised size stays within chunk_size. Each chunk carries:

  • chunk_type — "json_object"
  • chunk_id_suffix — "part_0", "part_1", …

Useful for wide API responses, configuration objects, or document metadata dicts that are too large to embed in a single call.

Python
large_dict = {f"field_{i:03d}": "x" * 40 for i in range(20)}

dict_chunks = pipeline.run(
    {
        "content": json.dumps(large_dict),
        "config": {"chunk_size": 200},
    },
    strategy="json",
)

print("\n=== Split a JSON Object ===")
for chunk in dict_chunks:
    keys = list(json.loads(chunk.content).keys())
    print(f"  suffix={chunk.metadata['chunk_id_suffix']:8s}  keys={keys}")

Transcript Cue Lists

YouTube and podcast transcripts are delivered as lists of timed cue dicts:

Python
[{"text": "Hello.", "start": 0.0, "duration": 1.5}, ]

JsonSplitter batches consecutive cues until the chunk reaches chunk_size. When a "start" key is present, metadata also records:

  • sayou:startTime — start time of the first cue in the chunk
  • sayou:endTime — end time of the last cue in the chunk
Python
transcript = [
    {"text": f"Sentence {i}.", "start": i * 3.0, "duration": 2.8} for i in range(20)
]

transcript_chunks = pipeline.run(
    {
        "content": json.dumps(transcript),
        "config": {"chunk_size": 400},
    },
    strategy="json",
)

print("\n=== Transcript Cue Lists ===")
for chunk in transcript_chunks:
    m = chunk.metadata
    print(
        f"  cues {m['index_start']:2d}{m['index_end']:2d}  "
        f"start={m.get('sayou:startTime', '?'):5}s  "
        f"end={m.get('sayou:endTime', '?'):5}s"
    )
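
The timing metadata makes it easy to attach human-readable timestamp ranges to each chunk, for example as citation labels in a retrieval pipeline. A small sketch, assuming the start and end times are numeric seconds:

Python
def to_mmss(seconds):
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

for chunk in transcript_chunks:
    m = chunk.metadata
    if "sayou:startTime" in m and "sayou:endTime" in m:
        label = f"[{to_mmss(m['sayou:startTime'])}-{to_mmss(m['sayou:endTime'])}]"
        print(f"  {label}  cues {m['index_start']}-{m['index_end']}")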

Block Type Input

Pass type="json" in the block so JsonSplitter.can_handle() returns 1.0 without needing the explicit strategy="json" argument.

Python
from sayou.core.schemas import SayouBlock

block = SayouBlock(
    type="json",
    content=json.dumps(records[:10]),
    metadata={"config": {"chunk_size": 200}, "source": "api_response"},
)

auto_chunks = pipeline.run(block, strategy="auto")
print(f"\n=== Block Type Input (auto) ===")
print(f"  Chunks: {len(auto_chunks)}")
for c in auto_chunks:
    print(f"    item_count={c.metadata['item_count']}  bytes={len(c.content)}")

Save Results

Write all list chunks to JSON for downstream processing.

Python
with open("json_chunks.json", "w", encoding="utf-8") as f:
    json.dump([c.model_dump() for c in chunks], f, indent=2, ensure_ascii=False)

print(f"\nSaved {len(chunks)} chunks to json_chunks.json")