Trafilatura

Setup

Extract and archive the main article content from any web page using TrafilaturaFetcher.

Under the hood, trafilatura downloads the raw HTML and strips ads, navigation bars, and other boilerplate, returning the article body as clean Markdown.
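To see what that involves, here is a minimal sketch of calling trafilatura directly, outside the pipeline. The URL is a placeholder, and the output_format="markdown" argument assumes a recent trafilatura release (older versions return plain text only):

Python
import trafilatura

# Download the page; fetch_url() returns the HTML string, or None on failure.
html = trafilatura.fetch_url("https://example.com/article")  # placeholder URL

if html is not None:
    # Strip navigation, ads, and other boilerplate, keeping the article body.
    # output_format="markdown" assumes a recent trafilatura release.
    body = trafilatura.extract(html, output_format="markdown")
    print(body)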

Install the dependency before running with a real URL:

Bash
pip install trafilatura
python quick_start_trafilatura.py

The example below uses a mock so it runs without an internet connection. Remove setup_mock() and substitute a real URL to collect live content.

Python
import json
import sys
from unittest.mock import MagicMock

from sayou.brain.pipelines.transfer import TransferPipeline

OUTPUT_DIR = "./sayou_archive/trafilatura"

Mock Setup

TrafilaturaFetcher calls trafilatura.fetch_url() then trafilatura.extract(). The mock below returns a fixed Markdown string so the full pipeline path is exercised without a network connection.

To switch to live mode, delete this function and the setup_mock() call below.

Python
def setup_mock():
    mock = MagicMock()
    mock.fetch_url.return_value = "<html><body><p>Article body.</p></body></html>"
    mock.extract.return_value = (
        "# How Trafilatura Works\n\n"
        "Trafilatura downloads the raw HTML of a page and removes boilerplate "
        "content such as navigation menus, advertisements, and footers.\n\n"
        "The extracted Markdown is ready for downstream chunking or indexing."
    )
    sys.modules["trafilatura"] = mock

Transfer a Single URL

Prefix the target URL with trafilatura:// to route through TrafilaturaGenerator and TrafilaturaFetcher.

Internally, packet.data carries the extracted Markdown string; TransferPipeline writes it to a file under destination.

Python
setup_mock()

stats = TransferPipeline.process(
    source="trafilatura://https://example.com/article",
    destination=OUTPUT_DIR,
    strategies={"connector": "trafilatura"},
)

print("=== Transfer a Single URL ===")
print(json.dumps(stats, indent=2))
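The returned stats include the read, written, and failed counters that the batch example below aggregates across calls.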

Transfer Multiple URLs

Call TransferPipeline.process() once per URL. All strategies are stateless — multiple calls are safe and independent.

Python
urls = [
    "trafilatura://https://example.com/page-1",
    "trafilatura://https://example.com/page-2",
    "trafilatura://https://example.com/page-3",
]

total = {"read": 0, "written": 0, "failed": 0}
for url in urls:
    s = TransferPipeline.process(
        source=url,
        destination=f"{OUTPUT_DIR}/batch",
        strategies={"connector": "trafilatura"},
    )
    for k in total:
        total[k] += s.get(k, 0)

print("=== Transfer Multiple URLs ===")
print(json.dumps(total, indent=2))

Validate Output

Each URL produces one file under destination. Inspect the first file to confirm that extraction produced readable content.

Python
import os

if os.path.isdir(OUTPUT_DIR):
    files = [
        n for n in os.listdir(OUTPUT_DIR) if os.path.isfile(os.path.join(OUTPUT_DIR, n))
    ]
    print(f"\nArchived {len(files)} file(s) in '{OUTPUT_DIR}'.")
    if files:
        sample_path = os.path.join(OUTPUT_DIR, files[0])
        with open(sample_path, encoding="utf-8") as f:
            preview = f.read(200)
        print(f"Preview of '{files[0]}':\n{preview}")