HTML Splitter - Upsonic AI

Overview

The HTML splitter parses the HTML DOM and groups content into semantic blocks. It follows a multi-stage pipeline: parse and sanitize the document, segment it by tags, chunk the text within each block, and merge small chunks. The splitter preserves document structure and extracts rich metadata.

Splitter Class: `HTMLChunker`
Config Class: `HTMLChunkingConfig`
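To make the four stages concrete, here is a minimal, standard-library sketch of the same idea. It is not Upsonic's implementation: the names `BlockCollector`, `split_text`, `merge_small`, and `chunk_html` are hypothetical, `html.parser` stands in for BeautifulSoup and lxml, and the merge rule (fold a chunk shorter than `min_ratio * chunk_size` into the previous chunk if the result still fits) is an assumption about how small-chunk merging might work.

```python
from html.parser import HTMLParser

IGNORE = {"script", "style", "nav", "footer"}   # tags dropped during sanitizing
SPLIT_ON = {"h1", "h2", "h3", "p", "li"}        # tags that open a new block


class BlockCollector(HTMLParser):
    """Stages 1-2: skip ignored tags and start a new block at each split tag.

    Text outside any split tag is dropped in this simplified sketch.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []        # collected (tag, text) pairs
        self._skip_depth = 0    # > 0 while inside an ignored tag
        self._tag = None        # split tag currently open
        self._parts = []        # text fragments of the open block

    def handle_starttag(self, tag, attrs):
        if tag in IGNORE:
            self._skip_depth += 1
        elif self._skip_depth == 0 and tag in SPLIT_ON:
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in IGNORE and self._skip_depth:
            self._skip_depth -= 1
        elif tag == self._tag:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0 and self._tag and data.strip():
            self._parts.append(data.strip())

    def _flush(self):
        if self._tag and self._parts:
            self.blocks.append((self._tag, " ".join(self._parts)))
        self._tag, self._parts = None, []


def split_text(text, chunk_size):
    """Stage 3: cut an oversized block on word boundaries."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def merge_small(chunks, chunk_size, min_ratio):
    """Stage 4: fold undersized chunks into the previous one when it still fits."""
    merged = []
    for chunk in chunks:
        is_small = len(chunk) < chunk_size * min_ratio
        if merged and is_small and len(merged[-1]) + 1 + len(chunk) <= chunk_size:
            merged[-1] = f"{merged[-1]} {chunk}"
        else:
            merged.append(chunk)
    return merged


def chunk_html(html, chunk_size=40, min_ratio=0.3):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    chunks = []
    for _tag, text in parser.blocks:
        chunks.extend(split_text(text, chunk_size))
    return merge_small(chunks, chunk_size, min_ratio)
```

Calling `chunk_html` on a small page with a `<nav>` element, a heading, a short paragraph, and a long paragraph drops the navigation text, merges the heading with the short paragraph, and splits the long paragraph at the size limit. The real splitter additionally tracks metadata and delegates oversized blocks to a configurable text chunker, which this sketch leaves out.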

Dependencies

uv pip install beautifulsoup4 lxml

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders.html import HTMLLoader
from upsonic.loaders.config import HTMLLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter.html_chunker import HTMLChunker, HTMLChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure splitter
splitter_config = HTMLChunkingConfig(
    chunk_size=512,
    chunk_overlap=50,
    split_on_tags=["h1", "h2", "h3", "p"],
    preserve_whole_tags=["table", "pre"]
)
splitter = HTMLChunker(splitter_config)

# Setup KnowledgeBase
loader = HTMLLoader(HTMLLoaderConfig())
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="html_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["https://example.com/article"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[splitter]
)

# Query with Agent
agent = Agent("anthropic/claude-sonnet-4-5")
task = Task("Extract main content", context=[kb])
result = agent.do(task)
print(result)

Parameters

| Parameter | Type | Description | Default | Source |
| --- | --- | --- | --- | --- |
| `chunk_size` | `int` | Target size of each chunk | 1024 | Base |
| `chunk_overlap` | `int` | Overlapping units between chunks | 200 | Base |
| `min_chunk_size` | `int \| None` | Minimum size for a chunk | None | Base |
| `length_function` | `Callable[[str], int]` | Function to measure text length | `len` | Base |
| `strip_whitespace` | `bool` | Strip leading/trailing whitespace | False | Base |
| `split_on_tags` | `list[str]` | HTML tags that signify boundaries | `["h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "table"]` | Specific |
| `tags_to_ignore` | `list[str]` | Tags to remove before processing | `["script", "style", "nav", "footer", "aside", "header", "form", "head", "meta", "link"]` | Specific |
| `tags_to_extract` | `list[str] \| None` | Allowlist of tags to process | None | Specific |
| `preserve_whole_tags` | `list[str]` | Indivisible tag types | `["table", "pre", "code", "ul", "ol"]` | Specific |
| `extract_link_info` | `bool` | Transform links to Markdown format | True | Specific |
| `preserve_html_content` | `bool` | Preserve original HTML content | False | Specific |
| `text_chunker_to_use` | `BaseChunker` | Chunker for oversized blocks | RecursiveChunker | Specific |
| `merge_small_chunks` | `bool` | Merge small chunks with adjacent chunks | True | Specific |
| `min_chunk_size_ratio` | `float` | Minimum ratio for merging (0.0-1.0) | 0.3 | Specific |
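The `extract_link_info` option is described as transforming links to Markdown format. As an illustration of what such a transformation could look like, the sketch below rewrites `<a href="...">text</a>` as `[text](url)` using only the standard library. The class `LinkToMarkdown` and the function `links_to_markdown` are hypothetical names, and the exact output Upsonic produces may differ.

```python
from html.parser import HTMLParser


class LinkToMarkdown(HTMLParser):
    """Emit page text, rewriting anchors that carry an href as Markdown links."""

    def __init__(self):
        super().__init__()
        self.out = []            # output fragments in document order
        self._href = None        # href of the anchor currently open, if any
        self._link_text = []     # text collected inside the open anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._link_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._link_text).strip()
            self.out.append(f"[{label}]({self._href})")
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._link_text.append(data)   # inside a link: buffer the label
        else:
            self.out.append(data)          # plain text passes through


def links_to_markdown(html):
    parser = LinkToMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Keeping the URL next to its label in this way means the link target survives in the chunk text after the surrounding markup is stripped. Anchors without an `href` are passed through as plain text in this sketch.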
