Documentation ¶
Overview ¶
Package htmlutils provides utilities for extracting readable text content from raw HTML. It uses golang.org/x/net/html for robust parsing that gracefully handles malformed, truncated, and noisy HTML from the wild web.
Functions ¶
func CollapseWhitespace ¶
CollapseWhitespace normalises whitespace: removes blank lines, collapses runs of spaces, and limits consecutive newlines to at most two.
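The page does not show the function's signature or source, so the following is only a minimal sketch of the described behaviour. It assumes a `string` in, `string` out shape, and it reads "removes blank lines" as "empties whitespace-only lines and then caps newline runs at two". The name `collapseWhitespaceSketch` and both regular expressions are hypothetical, not part of htmlutils.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Runs of spaces or tabs inside a line.
	spaceRuns = regexp.MustCompile(`[ \t]+`)
	// Three or more consecutive newlines.
	newlineRuns = regexp.MustCompile(`\n{3,}`)
)

// collapseWhitespaceSketch is a hypothetical stand-in for CollapseWhitespace.
// Assumed behaviour: collapse space/tab runs to one space, trim each line,
// and limit consecutive newlines to at most two.
func collapseWhitespaceSketch(s string) string {
	s = strings.ReplaceAll(s, "\r\n", "\n")
	lines := strings.Split(s, "\n")
	for i, line := range lines {
		// Whitespace-only lines become empty strings here.
		lines[i] = strings.TrimSpace(spaceRuns.ReplaceAllString(line, " "))
	}
	s = strings.Join(lines, "\n")
	s = newlineRuns.ReplaceAllString(s, "\n\n")
	return strings.TrimSpace(s)
}

func main() {
	in := "Title   here\n\n\n\n  body\ttext  \n   \nend"
	fmt.Printf("%q\n", collapseWhitespaceSketch(in))
	// prints "Title here\n\nbody text\n\nend"
}
```

Working line by line keeps the space-collapsing step from touching newlines, so paragraph breaks survive as a single blank line.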
func ExtractText ¶
ExtractText parses raw HTML and returns only the readable text content.
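The parser (golang.org/x/net/html) follows the HTML5 parsing algorithm, so it tolerates the usual edge cases without special handling:
- Truncated/unclosed tags (e.g., a 200KB __NEXT_DATA__ <script> cut off at the 512KB response cap; the parser treats everything after the opening tag as the content of a single <script> element)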
- Embedded JSON blobs (safely contained inside <script> elements)
- Malformed/nested HTML
This typically reduces 500KB of raw HTML to ~5–15KB of clean text.