Refinery

The Universal Data Cleaning & Normalization Engine for Sayou Fabric.

sayou-refinery acts as the "Cleaning Plant" in your data pipeline. It transforms heterogeneous raw data (JSON Documents, HTML, DB Records) into a standardized stream of SayouBlocks.

It ensures that downstream components (like Chunkers or LLMs) receive clean, uniform data regardless of whether the source was a messy web scrape or a structured database row.

1. Architecture & Role

Refinery operates in two distinct stages to guarantee data quality: Normalization (Shape Shifting) and Processing (Hygiene).

graph LR
    Raw[Raw Input] --> Pipeline[Refinery Pipeline]

    subgraph Stage1 [Normalization]
        Doc[Doc Normalizer]
        Html[Html Normalizer]
        Json[Json Normalizer]
    end

    subgraph Stage2 [Processing Chain]
        Space[Whitespace]
        PII[PII Masker]
        Link[Link Extractor]
    end

    Pipeline --> Stage1
    Stage1 --> Stage2
    Stage2 --> Blocks[Clean SayouBlocks]
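
The two-stage flow in the diagram can be sketched in plain Python. This is a minimal illustration of the idea, not sayou-refinery's implementation: `Block`, `normalize_lines`, `collapse_whitespace`, and `run_pipeline` are hypothetical names invented for this example, and the real SayouBlock class may carry more fields.

```python
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical stand-in for a SayouBlock; the real class may differ.
@dataclass
class Block:
    type: str
    content: str


Normalizer = Callable[[object], List[Block]]
Processor = Callable[[Block], Block]


def normalize_lines(raw: object) -> List[Block]:
    """Stage 1 (normalization): turn raw input into a flat list of blocks."""
    return [Block("text", line) for line in str(raw).splitlines() if line.strip()]


def collapse_whitespace(block: Block) -> Block:
    """Stage 2 step (processing): squeeze whitespace runs and trim the ends."""
    return Block(block.type, " ".join(block.content.split()))


def run_pipeline(raw: object, normalizer: Normalizer,
                 processors: List[Processor]) -> List[Block]:
    """Normalize once, then apply every processor to every block in order."""
    blocks = normalizer(raw)
    for process in processors:
        blocks = [process(block) for block in blocks]
    return blocks


blocks = run_pipeline(
    "  Hello    world \n\n second   line",
    normalize_lines,
    [collapse_whitespace],
)
for block in blocks:
    print(f"[{block.type}] {block.content}")
```

Keeping normalization separate from processing means each processor only ever sees one shape of data, so the same chain can be reused for every input format.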

1.1. Core Features

Normalization: Flattens complex structures (Nested JSON, HTML Trees) into a linear list of blocks.
Hygiene: Removes invisible characters, normalizes Unicode, and fixes broken encoding.
Safety: Automatically masks sensitive information (PII) like emails or phone numbers before they reach the LLM.
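
As a rough illustration of the hygiene and safety steps, the sketch below uses only the Python standard library. It is an assumption about how such cleaning could look, not the library's own code: `clean_text` and both regular expressions are hypothetical, the email pattern is deliberately simple, and phone-number masking and encoding repair are left out.

```python
import re
import unicodedata

# Deliberately simple email pattern for illustration; real PII detection needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Zero-width characters and the BOM, which often survive scraping and copy/paste.
INVISIBLE_RE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop invisible characters, mask emails, tidy whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width letters become ASCII
    text = INVISIBLE_RE.sub("", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return " ".join(text.split())


cleaned = clean_text("Contact:\u200b   admin@sayou.ai  ")
print(cleaned)
```

Masking runs after Unicode normalization so that addresses written with full-width or otherwise unusual characters are more likely to match the pattern.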

2. Available Strategies

sayou-refinery provides strategies tailored to specific input formats.

| Strategy Key | Target Format | Description |
| --- | --- | --- |
| `standard_doc` | Sayou Document | [Default] Converts parsed document dictionaries into Markdown blocks. Applies standard text cleaning. |
| `html` | Web Pages | Strips HTML tags, extracts links, and converts the DOM tree into readable text blocks. |
| `json` | API/DB Records | Flattens JSON objects into key-value pairs or text representations. |
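
To make the flattening behind the `json` strategy concrete, here is a small standalone sketch. It shows one common way to turn nested JSON into key-value pairs; the `flatten` helper and the dotted-key format are assumptions for illustration, and the strategy's actual output format may differ.

```python
def flatten(value, prefix=""):
    """Yield (path, leaf_value) pairs from nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Join nested keys with dots: {"user": {"name": ...}} -> "user.name"
            yield from flatten(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            # Mark list positions with brackets: "tags[0]", "tags[1]", ...
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


record = {"user": {"name": "Ada", "tags": ["admin", "ops"]}, "active": True}
pairs = list(flatten(record))
lines = [f"{key}: {value}" for key, value in pairs]
print("\n".join(lines))
```

Each resulting line can then be treated as a text representation of one field, which is the kind of linear structure the downstream processing chain expects.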

3. Installation

Bash

pip install sayou-refinery

4. Usage

The RefineryPipeline orchestrates the normalization and processing chain.

Case A: Document Cleaning (Standard)

Cleans messy OCR output or parsed document text.

Python

from sayou.refinery import RefineryPipeline

raw_doc = {
    "metadata": {"title": "Test Doc"},
    "pages": [{
        "elements": [
            {"type": "text", "text": "Contact:   admin@sayou.ai  "},
            {"type": "text", "text": "Generic    Whitespace   Error"}
        ]
    }]
}

blocks = RefineryPipeline.process(
    data=raw_doc,
    strategy="standard_doc"
)

# Print each cleaned block as "[type] content".
for block in blocks:
    print(f"[{block.type}] {block.content}")

# Expected output:
# [text] Contact: [EMAIL]
# [text] Generic Whitespace Error

Case B: HTML Processing

Converts web content into clean text while preserving hyperlinks.