htmlutils

package
v0.1.5
Published: Mar 1, 2026 License: Apache-2.0 Imports: 4 Imported by: 0

Documentation

Overview

Package htmlutils provides utilities for extracting readable text content from raw HTML. It uses golang.org/x/net/html for robust parsing that gracefully handles malformed, truncated, and noisy HTML from the wild web.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CollapseWhitespace

func CollapseWhitespace(s string) string

CollapseWhitespace normalises whitespace: removes blank lines, collapses runs of spaces, and limits consecutive newlines to at most two.
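
The normalisation described above can be approximated with the standard library alone. The sketch below is not the package's implementation: collapseWhitespaceSketch is a hypothetical name, and it assumes that "runs of spaces" includes tabs, that whitespace-only lines count as blank, and that leading and trailing whitespace is trimmed.

```go
package main

import (
	"regexp"
	"strings"
)

var (
	// spaceRuns matches runs of spaces and tabs within a single line.
	spaceRuns = regexp.MustCompile(`[ \t]+`)
	// newlineRuns matches three or more consecutive newlines.
	newlineRuns = regexp.MustCompile(`\n{3,}`)
)

// collapseWhitespaceSketch is a hypothetical stand-in for CollapseWhitespace.
// Assumed behaviour: each line is trimmed, inner runs of spaces and tabs
// become one space, and runs of newlines are capped at two.
func collapseWhitespaceSketch(s string) string {
	// Normalise Windows line endings so the newline rules apply uniformly.
	s = strings.ReplaceAll(s, "\r\n", "\n")

	lines := strings.Split(s, "\n")
	for i, line := range lines {
		// Whitespace-only lines become empty strings here.
		lines[i] = strings.TrimSpace(spaceRuns.ReplaceAllString(line, " "))
	}

	// Cap consecutive newlines at two (at most one empty line between blocks).
	s = newlineRuns.ReplaceAllString(strings.Join(lines, "\n"), "\n\n")
	return strings.TrimSpace(s)
}
```

Under these assumptions, collapseWhitespaceSketch("Title   here\n\n\n\n  body\ttext  \n") returns "Title here\n\nbody text".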

func ExtractText

func ExtractText(rawHTML string) string

ExtractText parses raw HTML and returns only the readable text content.

The parser (golang.org/x/net/html) follows the HTML5 parsing algorithm, so it tolerates the input problems that are common on the wild web:

  • Truncated/unclosed tags (e.g., a 200KB __NEXT_DATA__ <script> cut off at the 512KB response cap; the parser treats the unterminated script as a single element)
  • Embedded JSON blobs (safely contained inside <script> elements)
  • Malformed/nested HTML

This typically reduces 500KB of raw HTML to ~5–15KB of clean text.

Types

This section is empty.
