Overview
The HTML splitter parses the HTML DOM and groups content into semantic blocks. It follows a four-stage pipeline: parse and sanitize the document, segment it by tag, chunk the text within each block, and merge small chunks. The splitter preserves document structure and extracts rich metadata.
Splitter Class: HTMLChunker
Config Class: HTMLChunkingConfig
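To make the first two pipeline stages concrete, here is a minimal sketch that sanitizes and segments HTML using only Python's standard `html.parser`. It is not the `HTMLChunker` implementation: `BlockCollector` and `segment_html` are hypothetical names, the tag sets are a subset of the documented defaults, and indivisible tags (`preserve_whole_tags`) and metadata extraction are not modeled.

```python
from html.parser import HTMLParser

# Subsets of the documented defaults. Void tags such as "meta" and "link"
# are left out on purpose: they carry no text and have no closing tag.
SPLIT_ON_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "table"}
TAGS_TO_IGNORE = {"script", "style", "nav", "footer", "aside", "header", "form", "head"}


class BlockCollector(HTMLParser):
    """Hypothetical helper: collects (tag, text) blocks, skipping ignored subtrees."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished (tag, text) pairs
        self._ignore_depth = 0  # > 0 while inside an ignored tag
        self._tag = None        # boundary tag of the block being built
        self._parts = []        # text fragments of the block being built

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.blocks.append((self._tag or "text", text))
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TAGS_TO_IGNORE:
            self._ignore_depth += 1
        elif self._ignore_depth == 0 and tag in SPLIT_ON_TAGS:
            self._flush()       # a boundary tag starts a new block
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in TAGS_TO_IGNORE:
            self._ignore_depth = max(0, self._ignore_depth - 1)
        elif self._ignore_depth == 0 and tag in SPLIT_ON_TAGS:
            self._flush()       # closing a boundary tag ends the block
            self._tag = None

    def handle_data(self, data):
        if self._ignore_depth == 0:
            self._parts.append(data)


def segment_html(html):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector._flush()          # keep any trailing text outside a boundary tag
    return collector.blocks


sample = """
<html><head><title>Demo</title></head><body>
<nav><a href="/">Home</a></nav>
<h1>Guide</h1>
<p>First paragraph.</p>
<script>var x = 1;</script>
<p>Second paragraph.</p>
</body></html>
"""
blocks = segment_html(sample)
for tag, text in blocks:
    print(tag, "->", text)
```

The `head`, `nav`, and `script` subtrees are dropped before segmentation, so only the heading and the two paragraphs come out as blocks.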
Dependencies
Examples
Parameters
| Parameter | Type | Description | Default | Source |
|---|---|---|---|---|
| chunk_size | int | Target size of each chunk | 1024 | Base |
| chunk_overlap | int | Overlapping units between chunks | 200 | Base |
| min_chunk_size | int \| None | Minimum size for a chunk | None | Base |
| length_function | Callable[[str], int] | Function to measure text length | len | Base |
| strip_whitespace | bool | Strip leading/trailing whitespace | False | Base |
| split_on_tags | list[str] | HTML tags that signify boundaries | ["h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "table"] | Specific |
| tags_to_ignore | list[str] | Tags to remove before processing | ["script", "style", "nav", "footer", "aside", "header", "form", "head", "meta", "link"] | Specific |
| tags_to_extract | list[str] \| None | Allowlist of tags to process | None | Specific |
| preserve_whole_tags | list[str] | Indivisible tag types | ["table", "pre", "code", "ul", "ol"] | Specific |
| extract_link_info | bool | Transform links to Markdown format | True | Specific |
| preserve_html_content | bool | Preserve original HTML content | False | Specific |
| text_chunker_to_use | BaseChunker | Chunker for oversized blocks | RecursiveChunker | Specific |
| merge_small_chunks | bool | Merge small chunks with adjacent | True | Specific |
| min_chunk_size_ratio | float | Minimum ratio for merging (0.0-1.0) | 0.3 | Specific |
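
The sketch below illustrates how `chunk_size`, `chunk_overlap`, and `min_chunk_size_ratio` might interact during the chunking and merging stages. It is an assumption-laden illustration, not the library's code: `split_text` and `merge_small` are hypothetical helpers, lengths are measured with `len`, and the rule that a chunk counts as "small" when it is shorter than `chunk_size * min_chunk_size_ratio` is a guess at the semantics rather than documented behavior.

```python
def split_text(text, chunk_size=1024, chunk_overlap=200):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text) - chunk_overlap, step)]


def merge_small(chunks, chunk_size=1024, min_chunk_size_ratio=0.3):
    """Append each small chunk to the chunk before it.

    Assumption: "small" means shorter than chunk_size * min_chunk_size_ratio.
    A merged chunk may end up longer than chunk_size in this sketch.
    """
    threshold = chunk_size * min_chunk_size_ratio
    merged = []
    for chunk in chunks:
        if merged and len(chunk) < threshold:
            merged[-1] += " " + chunk
        else:
            merged.append(chunk)
    return merged


# Overlapping windows: each chunk repeats the last character of the previous one.
print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))

# With chunk_size=20 and ratio 0.3 the threshold is 6 characters,
# so the trailing "End" is folded into the chunk before it.
print(merge_small(["Intro", "A much longer paragraph of body text.", "End"],
                  chunk_size=20, min_chunk_size_ratio=0.3))
```

With the documented defaults (`chunk_size=1024`, `min_chunk_size_ratio=0.3`), this assumed rule would treat any chunk shorter than about 307 units as a merge candidate.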

