internal

package

v1.3.1 Latest Published: Mar 4, 2026 License: MIT Imports: 24 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/cybergodev/html

Documentation ¶

Overview ¶

Package internal provides caching functionality for content extraction results. It implements a thread-safe LRU cache with TTL support to improve performance for repeated extractions of the same content.

Package internal provides centralized constant definitions for internal use.

Package internal provides character encoding detection and conversion functionality. It supports 15+ encodings including Unicode variants, Western European, and East Asian character sets, with intelligent auto-detection capabilities.

Package internal provides implementation details for the cybergodev/html library. It contains content extraction, table processing, and text manipulation functionality that is not part of the public API.

Package internal provides pooled resources for memory allocation optimization.

Package internal provides implementation details for the cybergodev/html library. This file contains the Scorer interface and default implementation for content scoring.

Package internal provides unsafe utility functions for zero-allocation conversions.

Package internal provides URL parsing and resolution utilities.

Index ¶

Constants
Variables
func BytesToString(b []byte) string
func CalculateContentDensity(n *html.Node) float64
func CleanContentNode(node *html.Node) *html.Node
func CleanText(text string, whitespaceRegex *regexp.Regexp) string
func ConvertToUTF8(data []byte, charset string) ([]byte, error)
func CountChildElements(n *html.Node, tag string) int
func CountTags(n *html.Node) int
func DetectAndConvertToUTF8(data []byte) ([]byte, string, error)
func DetectAndConvertToUTF8String(data []byte, forcedEncoding string) (string, string, error)
func DetectAudioType(url string) string
func DetectCharsetFromBytes(data []byte) string
func DetectVideoType(url string) string
func ExtractBaseFromURL(url string) string
func ExtractDomain(url string) string
func ExtractTextWithStructureAndImages(node *html.Node, sb *strings.Builder, imageCounter *int, linkCounter *int, ...)
func FindElementByTag(doc *html.Node, tagName string) *html.Node
func GetBuffer() *bytes.Buffer
func GetBuilder() *strings.Builder
func GetHash128() hash.Hash
func GetLinkDensity(node *html.Node) float64
func GetNamespacePrefix(tag string) string
func GetTextContent(node *html.Node) string
func GetTextLength(node *html.Node) int
func GetTransformBuffer() *[]byte
func IsBlockElement(tag string) bool
func IsDifferentDomain(baseURL, targetURL string) bool
func IsExternalURL(url string) bool
func IsInlineElement(tag string) bool
func IsKnownInlineNamespacePrefix(prefix string) bool
func IsNamespaceTag(tag string) bool
func IsNonContentElement(tag string) bool
func IsParagraphLevelBlockElement(tag string) bool
func IsValidURL(url string) bool
func IsVideoURL(url string) bool
func MatchesPattern(value string, patterns map[string]bool) bool
func NormalizeBaseURL(baseURL string) string
func PutBuffer(buf *bytes.Buffer)
func PutBuilder(sb *strings.Builder)
func PutHash128(h hash.Hash)
func PutTransformBuffer(buf *[]byte)
func RemoveTagContent(content, tag string) string
func ReplaceHTMLEntities(text string) string
func ResolveURL(baseURL, relativeURL string) string
func SanitizeHTML(htmlContent string) string
func SanitizeHTMLWithAudit(htmlContent string, audit AuditRecorder) string
func ScoreAttributes(n *html.Node) int
func ScoreContentNode(node *html.Node) int
func SelectBestCandidate(candidates map[*html.Node]int) *html.Node
func SetPoolLogger(logger func(format string, args ...any))
func ShouldRemoveElement(n *html.Node) bool
func ShouldTreatAsBlockElement(node *html.Node) bool
func ShouldTreatNamespaceTagAsInline(node *html.Node) bool
func StringToBytes(s string) []byte
func TableProcessor() *table.Processor
func WalkNodes(node *html.Node, fn func(*html.Node) bool)
type AuditRecorder
type Cache
- func NewCache(maxEntries int, ttl time.Duration) *Cache
- func (c *Cache) Clear()
- func (c *Cache) Get(key string) any
- func (c *Cache) Len() int
- func (c *Cache) Set(key string, value any)
- func (c *Cache) StartCleanup(interval time.Duration) context.CancelFunc
- func (c *Cache) StopCleanup()
type DefaultScorer
- func NewDefaultScorer() *DefaultScorer
- func NewDefaultScorerWithConfig(config *ScoringConfig) *DefaultScorer
- func (s *DefaultScorer) Score(node *html.Node) int
- func (s *DefaultScorer) ScoreAttributes(n *html.Node) int
- func (s *DefaultScorer) ShouldRemove(node *html.Node) bool
type EncodingDetector
- func NewEncodingDetector() *EncodingDetector
- func (ed *EncodingDetector) DetectAndConvert(data []byte) ([]byte, string, error)
- func (ed *EncodingDetector) DetectCharset(data []byte) string
- func (ed *EncodingDetector) DetectCharsetBasic(data []byte) string
- func (ed *EncodingDetector) DetectCharsetSmart(data []byte) EncodingMatch
- func (ed *EncodingDetector) SetMaxSampleSize(size int) *EncodingDetector
- func (ed *EncodingDetector) ToUTF8(data []byte, charset string) ([]byte, error)
type EncodingMatch
type NoOpAuditRecorder
- func (NoOpAuditRecorder) RecordBlockedAttr(attr, value string)
- func (NoOpAuditRecorder) RecordBlockedTag(tag string)
- func (NoOpAuditRecorder) RecordBlockedURL(url, reason string)
type Scorer
type ScoringConfig
- func DefaultScoringConfig() *ScoringConfig

Constants ¶

const (

	// URL validation limits
	MaxURLLength     = 2000   // Maximum URL length
	MaxDataURILength = 100000 // Maximum data URL length (100KB)

)

const (
	DefaultCacheCleanupInterval = 5 * time.Minute
)

Default cache cleanup configuration

Variables ¶

var BufferPool = sync.Pool{
	New: func() any {
		return bytes.NewBuffer(make([]byte, 0, bufferPoolInitialCapacity))
	},
}

BufferPool is a sync.Pool for bytes.Buffer instances. Use this for functions that work with byte slices to reduce allocations.

For most use cases, prefer the helper functions GetBuffer() and PutBuffer():

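buf := internal.GetBuffer()
defer internal.PutBuffer(buf)
buf.Grow(estimatedSize)
// ... use buf ...
// Copy before returning: buf.Bytes() aliases memory that the pool will reuse.
return append([]byte(nil), buf.Bytes()...)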

Direct pool access is also available for advanced use cases:

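// Use the pooled pointer directly. Copying the bytes.Buffer by value
// would reset only the copy and leave the pooled buffer dirty.
buf := internal.BufferPool.Get().(*bytes.Buffer)
defer func() {
    buf.Reset()
    internal.BufferPool.Put(buf)
}()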

var BuilderPool = sync.Pool{
	New: func() any {
		sb := &strings.Builder{}
		sb.Grow(builderPoolInitialCapacity)
		return sb
	},
}

BuilderPool is a sync.Pool for strings.Builder instances. Use this for functions that build strings incrementally to reduce allocations.

For most use cases, prefer the helper functions GetBuilder() and PutBuilder():

sb := internal.GetBuilder()
defer internal.PutBuilder(sb)
sb.Grow(estimatedSize)
// ... use sb ...
return sb.String()

Direct pool access is also available for advanced use cases:

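// Use the pooled pointer directly. A strings.Builder must not be copied
// after first use; writing to a copy panics.
sb := internal.BuilderPool.Get().(*strings.Builder)
defer func() {
    sb.Reset()
    internal.BuilderPool.Put(sb)
}()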

var Hash128Pool = sync.Pool{
	New: func() any {
		return fnv.New128a()
	},
}

Hash128Pool is a sync.Pool for FNV-128a hash instances. Use this for cache key generation to avoid repeated allocations.

Usage pattern:

h := internal.GetHash128()
defer internal.PutHash128(h)
h.Write(data)
var buf [16]byte
sum := h.Sum(buf[:0])
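
Because this package is internal and cannot be imported from outside the module, the same pattern can be tried with only the standard library. The sketch below is an illustration, not the package's implementation: hashPool and cacheKey are hypothetical names, and hex-encoding the 16-byte sum is an assumption about how a key might be rendered.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"hash"
	"hash/fnv"
	"sync"
)

// hashPool is a hypothetical stand-in for Hash128Pool.
var hashPool = sync.Pool{
	New: func() any { return fnv.New128a() },
}

// cacheKey (hypothetical helper) hashes data with a pooled FNV-128a hasher
// and returns the digest as a 32-character hex string.
func cacheKey(data []byte) string {
	h := hashPool.Get().(hash.Hash)
	defer func() {
		h.Reset() // leave the hasher clean for the next caller
		hashPool.Put(h)
	}()
	h.Write(data)
	var buf [16]byte
	sum := h.Sum(buf[:0]) // appends into buf's backing array, no heap allocation for the digest
	return hex.EncodeToString(sum)
}

func main() {
	fmt.Println(cacheKey([]byte("<html><body>hello</body></html>")))
}
```

Summing into a stack-allocated [16]byte keeps the digest itself off the heap; only the final hex string is allocated.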

var TransformBufferPool = sync.Pool{
	New: func() any {
		buf := make([]byte, 0, 8192)
		return &buf
	},
}

TransformBufferPool is a sync.Pool for byte slices used in encoding transformation. These buffers are used for charset conversion operations.
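
No usage pattern is given for this pool, so here is a minimal standard-library sketch of how a pooled *[]byte is typically borrowed and returned. The names transformPool and upperASCII are hypothetical, and the ASCII upper-casing merely stands in for a real charset conversion.

```go
package main

import (
	"fmt"
	"sync"
)

// transformPool is a hypothetical stand-in for TransformBufferPool.
var transformPool = sync.Pool{
	New: func() any {
		buf := make([]byte, 0, 8192)
		return &buf
	},
}

// upperASCII (hypothetical) upper-cases ASCII letters using a pooled
// scratch buffer, then copies the result out before the buffer is reused.
func upperASCII(in []byte) string {
	bufPtr := transformPool.Get().(*[]byte)
	buf := (*bufPtr)[:0] // zero length, retained capacity
	defer func() {
		*bufPtr = buf[:0] // store back so any grown capacity is kept
		transformPool.Put(bufPtr)
	}()
	for _, c := range in {
		if 'a' <= c && c <= 'z' {
			c -= 'a' - 'A'
		}
		buf = append(buf, c)
	}
	return string(buf) // string() copies, so the result does not alias the pool
}

func main() {
	fmt.Println(upperASCII([]byte("charset: utf-8")))
}
```

Pooling a pointer to the slice rather than the slice itself avoids an extra allocation each time the value is boxed into the pool's interface type.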

Functions ¶

func BytesToString ¶ added in v1.3.0

func BytesToString(b []byte) string

BytesToString converts a byte slice to string without memory allocation. The returned string shares memory with the input slice.

WARNING: The caller must ensure the byte slice is not modified after this call. Modifying the slice will cause undefined behavior in the returned string.

Use this only when the byte slice is guaranteed to remain unchanged, such as when converting read-only data or when the result has a short lifetime.
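
For orientation, a zero-allocation conversion of this kind can be written with the unsafe helpers added in Go 1.20. This is a sketch of the general technique, not necessarily how BytesToString is implemented; bytesToString is a hypothetical name.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString (hypothetical) returns a string that shares memory with b.
// The caller must not modify b while the string is in use.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	data := []byte("read-only payload")
	s := bytesToString(data)
	fmt.Println(s)
	// Both values point at the same backing array, so no copy was made.
	fmt.Println(unsafe.StringData(s) == unsafe.SliceData(data))
}
```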

func CalculateContentDensity ¶

func CalculateContentDensity(n *html.Node) float64

CalculateContentDensity calculates text-to-tag ratio. This is the exported version that uses the internal calculateDensityFromMetrics.

func CleanContentNode ¶

func CleanContentNode(node *html.Node) *html.Node

CleanContentNode removes non-content elements from the node tree.

func CleanText ¶

func CleanText(text string, whitespaceRegex *regexp.Regexp) string

func ConvertToUTF8 ¶ added in v1.2.0

func ConvertToUTF8(data []byte, charset string) ([]byte, error)

ConvertToUTF8 is a convenience function that converts data from the given charset to UTF-8.

func CountChildElements ¶

func CountChildElements(n *html.Node, tag string) int

CountChildElements counts child elements of specific tag type.

func CountTags ¶

func CountTags(n *html.Node) int

CountTags counts all element nodes in the subtree.

func DetectAndConvertToUTF8 ¶ added in v1.2.0

func DetectAndConvertToUTF8(data []byte) ([]byte, string, error)

DetectAndConvertToUTF8 is a convenience function that detects the charset and converts the data to UTF-8.

func DetectAndConvertToUTF8String ¶ added in v1.2.0

func DetectAndConvertToUTF8String(data []byte, forcedEncoding string) (string, string, error)

DetectAndConvertToUTF8String detects encoding and converts to UTF-8 string. If forcedEncoding is not empty, it will use that encoding instead of auto-detection. Returns a UTF-8 string and the detected/used encoding. Uses safe string conversion to ensure memory safety.

func DetectAudioType ¶

func DetectAudioType(url string) string

DetectAudioType detects the audio MIME type from a URL.

func DetectCharsetFromBytes ¶ added in v1.2.0

func DetectCharsetFromBytes(data []byte) string

DetectCharsetFromBytes is a convenience function that detects the charset from byte data.

func DetectVideoType ¶

func DetectVideoType(url string) string

DetectVideoType detects the video MIME type from a URL.

func ExtractBaseFromURL ¶ added in v1.2.0

func ExtractBaseFromURL(url string) string

ExtractBaseFromURL extracts the base URL (scheme://domain/) from a URL. Returns the base URL including trailing slash, or empty string for invalid URLs.

func ExtractDomain ¶ added in v1.2.0

func ExtractDomain(url string) string

ExtractDomain extracts the domain from a URL. Returns the domain portion (scheme://domain) or empty string for invalid URLs.

func ExtractTextWithStructureAndImages ¶

func ExtractTextWithStructureAndImages(node *html.Node, sb *strings.Builder, imageCounter *int, linkCounter *int, tableFormat string)

ExtractTextWithStructureAndImages extracts text content from an HTML node tree while preserving document structure (headings, paragraphs, lists, tables).

func FindElementByTag ¶

func FindElementByTag(doc *html.Node, tagName string) *html.Node

func GetBuffer ¶ added in v1.3.0

func GetBuffer() *bytes.Buffer

GetBuffer gets a bytes.Buffer from the pool. The returned buffer has been reset and is ready for use. Call PutBuffer when done to return it to the pool.

IMPORTANT: Callers MUST ensure PutBuffer is called even on error paths. Use defer immediately after GetBuffer to guarantee cleanup:

buf := internal.GetBuffer()
defer internal.PutBuffer(buf)
// ... use buf ...

Failure to return the buffer to the pool will not cause memory leaks (the GC will collect it), but will reduce the effectiveness of the pool.

func GetBuilder ¶ added in v1.3.0

func GetBuilder() *strings.Builder

GetBuilder gets a strings.Builder from the pool. The returned builder has been reset and is ready for use. Call PutBuilder when done to return it to the pool.

IMPORTANT: Callers MUST ensure PutBuilder is called even on error paths. Use defer immediately after GetBuilder to guarantee cleanup:

sb := internal.GetBuilder()
defer internal.PutBuilder(sb)
// ... use sb ...

Failure to return the builder to the pool will not cause memory leaks (the GC will collect it), but will reduce the effectiveness of the pool.

func GetHash128 ¶ added in v1.3.0

func GetHash128() hash.Hash

GetHash128 gets an FNV-128a hasher from the pool. The returned hasher has been reset and is ready for use. Call PutHash128 when done to return it to the pool.

func GetLinkDensity ¶

func GetLinkDensity(node *html.Node) float64

func GetNamespacePrefix ¶ added in v1.3.0

func GetNamespacePrefix(tag string) string

GetNamespacePrefix extracts the namespace prefix from a namespaced tag. For "ix:nonnumeric", it returns "ix".

func GetTextContent ¶

func GetTextContent(node *html.Node) string

Example ¶

ExampleGetTextContent demonstrates the GetTextContent function with HTML entities.

html := `<p>&nbsp;&copy; 2025 &mdash; All rights reserved&nbsp;</p>`
doc, _ := stdxhtml.Parse(strings.NewReader(html))
result := GetTextContent(doc)
fmt.Println(result)

Output:

© 2025 — All rights reserved

func GetTextLength ¶

func GetTextLength(node *html.Node) int

func GetTransformBuffer ¶ added in v1.3.0

func GetTransformBuffer() *[]byte

GetTransformBuffer gets a byte slice from the transform buffer pool. The returned slice has zero length but retained capacity.

func IsBlockElement ¶

func IsBlockElement(tag string) bool

IsBlockElement returns true if the tag is a known block-level element.

func IsDifferentDomain ¶ added in v1.2.0

func IsDifferentDomain(baseURL, targetURL string) bool

IsDifferentDomain checks if two URLs have different domains. Returns false if either URL is not external.

func IsExternalURL ¶

func IsExternalURL(url string) bool

IsExternalURL checks if a URL is an external HTTP(S) URL or protocol-relative URL.

func IsInlineElement ¶ added in v1.2.0

func IsInlineElement(tag string) bool

IsInlineElement returns true if the tag is a known inline element. Inline elements should not add newlines or paragraph spacing.

func IsKnownInlineNamespacePrefix ¶ added in v1.3.0

func IsKnownInlineNamespacePrefix(prefix string) bool

IsKnownInlineNamespacePrefix checks if the prefix is a known inline namespace prefix.

func IsNamespaceTag ¶ added in v1.3.0

func IsNamespaceTag(tag string) bool

IsNamespaceTag checks if a tag is a namespaced tag (contains ':'). Examples: ix:nonnumeric, xbrl:value, dei:CityAreaCode
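
The two namespace helpers, IsNamespaceTag and GetNamespacePrefix, can be approximated with the strings package. The sketch below uses hypothetical names, and returning an empty prefix for a tag without a colon is an assumption; the documentation only specifies the "ix:nonnumeric" case.

```go
package main

import (
	"fmt"
	"strings"
)

// isNamespaceTag (hypothetical) reports whether tag contains a ':'.
func isNamespaceTag(tag string) bool {
	return strings.Contains(tag, ":")
}

// namespacePrefix (hypothetical) returns the part before the first ':'.
// Assumption: tags without a colon yield an empty prefix.
func namespacePrefix(tag string) string {
	prefix, _, found := strings.Cut(tag, ":")
	if !found {
		return ""
	}
	return prefix
}

func main() {
	for _, tag := range []string{"ix:nonnumeric", "xbrl:value", "dei:CityAreaCode", "div"} {
		fmt.Printf("%-18s namespaced=%-5t prefix=%q\n", tag, isNamespaceTag(tag), namespacePrefix(tag))
	}
}
```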

func IsNonContentElement ¶

func IsNonContentElement(tag string) bool

IsNonContentElement returns true if the tag is typically not part of main content.

func IsParagraphLevelBlockElement ¶ added in v1.3.0

func IsParagraphLevelBlockElement(tag string) bool

IsParagraphLevelBlockElement returns true if the element is a block element that should be separated by paragraph spacing (double newlines) in the output.

Paragraph-level block elements create visual separation with blank lines in Markdown:

Text containers: p, div, pre, blockquote
Headings: h1-h6
Semantic sections: article, section, main, figure, figcaption, address
Lists: ul, ol, dl
Tables: table
Forms: fieldset
Interactive: details, summary, dialog
Media: canvas

Block elements WITHOUT paragraph spacing (treated as inline blocks):

List items: li, dt, dd
Table structure: thead, tbody, tfoot, tr, td, th
Self-closing: hr
Structural: body, html, head
Semantic (non-content): nav, aside, header, footer, form

func IsValidURL ¶ added in v1.2.0

func IsValidURL(url string) bool

IsValidURL checks if a URL is valid and safe for processing. This is a centralized URL validation function with size limits for security.

func IsVideoURL ¶

func IsVideoURL(url string) bool

IsVideoURL checks if a URL points to a video, based on its file extension or a known embed pattern.

func MatchesPattern ¶

func MatchesPattern(value string, patterns map[string]bool) bool

MatchesPattern checks if value contains any pattern from the map with word boundaries. This is exported for testing purposes.

func NormalizeBaseURL ¶ added in v1.2.0

func NormalizeBaseURL(baseURL string) string

NormalizeBaseURL ensures a base URL ends with a slash. Returns empty string for non-HTTP URLs (javascript:, data:, mailto:, etc.).

func PutBuffer ¶ added in v1.3.0

func PutBuffer(buf *bytes.Buffer)

PutBuffer returns a bytes.Buffer to the pool. The buffer is reset before being returned to the pool. It is safe to call PutBuffer with a nil pointer (no-op).

func PutBuilder ¶ added in v1.3.0

func PutBuilder(sb *strings.Builder)

PutBuilder returns a strings.Builder to the pool. The builder is reset before being returned to the pool. It is safe to call PutBuilder with a nil pointer (no-op).

func PutHash128 ¶ added in v1.3.0

func PutHash128(h hash.Hash)

PutHash128 returns an FNV-128a hasher to the pool. The hasher is reset before being returned to the pool. It is safe to call PutHash128 with a nil pointer (no-op).

func PutTransformBuffer ¶ added in v1.3.0

func PutTransformBuffer(buf *[]byte)

PutTransformBuffer returns a byte slice to the transform buffer pool. The slice is reset to zero length before being returned. It is safe to call PutTransformBuffer with a nil pointer (no-op).

func RemoveTagContent ¶

func RemoveTagContent(content, tag string) string

RemoveTagContent removes all occurrences of the specified HTML tag and its content. This function uses string-based parsing as the primary method to handle edge cases like unclosed tags, malformed HTML, and to preserve original character case.

func ReplaceHTMLEntities ¶

func ReplaceHTMLEntities(text string) string

ReplaceHTMLEntities replaces HTML entities with their corresponding characters. It handles both named entities (like &amp;amp; and &amp;nbsp;) and numeric entities (like &amp;#65; and &amp;#x41;). For unknown entities, it falls back to the standard library's html.UnescapeString. Optimized with a fast path for the most common entities.

Example ¶

ExampleReplaceHTMLEntities demonstrates the ReplaceHTMLEntities function.

input := "&nbsp;&copy; 2025 &mdash; Test &euro;100"
result := ReplaceHTMLEntities(input)
fmt.Println(result)

Output:

© 2025 — Test €100

func ResolveURL ¶ added in v1.2.0

func ResolveURL(baseURL, relativeURL string) string

ResolveURL resolves a relative URL against a base URL. Handles absolute URLs, protocol-relative URLs, absolute paths, and relative paths.
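
The four cases listed above follow standard RFC 3986 reference resolution, which net/url already implements. The sketch below shows that general technique; resolveURL is a hypothetical name, and returning the input unchanged on a parse error is an assumption, not documented behavior of ResolveURL.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURL (hypothetical) resolves ref against base using net/url.
// Assumption: on a parse error the reference is returned unchanged.
func resolveURL(base, ref string) string {
	b, err := url.Parse(base)
	if err != nil {
		return ref
	}
	r, err := url.Parse(ref)
	if err != nil {
		return ref
	}
	return b.ResolveReference(r).String()
}

func main() {
	base := "https://example.com/docs/guide/index.html"
	refs := []string{
		"https://other.org/x",      // absolute URL
		"//cdn.example.net/lib.js", // protocol-relative URL
		"/static/app.css",          // absolute path
		"images/a.png",             // relative path
	}
	for _, ref := range refs {
		fmt.Printf("%-26s -> %s\n", ref, resolveURL(base, ref))
	}
}
```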

func SanitizeHTML ¶

func SanitizeHTML(htmlContent string) string

func SanitizeHTMLWithAudit ¶ added in v1.3.0

func SanitizeHTMLWithAudit(htmlContent string, audit AuditRecorder) string

SanitizeHTMLWithAudit sanitizes HTML content and records security events. The audit recorder receives events for blocked tags, attributes, and URLs.

func ScoreAttributes ¶

func ScoreAttributes(n *html.Node) int

ScoreAttributes calculates a score based on element attributes. This function delegates to the default Scorer implementation.

func ScoreContentNode ¶

func ScoreContentNode(node *html.Node) int

ScoreContentNode calculates a relevance score for content extraction. Higher scores indicate more likely main content. Negative scores suggest non-content elements. This function delegates to the default Scorer implementation.

func SelectBestCandidate ¶

func SelectBestCandidate(candidates map[*html.Node]int) *html.Node

func SetPoolLogger ¶ added in v1.3.0

func SetPoolLogger(logger func(format string, args ...any))

SetPoolLogger sets a logger function for pool corruption warnings. Pass nil to disable logging. This is a no-op if poolDebug is false. The logger function should be thread-safe.

func ShouldRemoveElement ¶

func ShouldRemoveElement(n *html.Node) bool

ShouldRemoveElement determines if a node should be removed from the content tree. This function delegates to the default Scorer implementation.

func ShouldTreatAsBlockElement ¶ added in v1.3.0

func ShouldTreatAsBlockElement(node *html.Node) bool

ShouldTreatAsBlockElement dynamically determines if an unknown/custom tag should be treated as a block-level element based on its structure and content. This enables proper handling of custom tag formats like SEC documents.

func ShouldTreatNamespaceTagAsInline ¶ added in v1.3.0

func ShouldTreatNamespaceTagAsInline(node *html.Node) bool

ShouldTreatNamespaceTagAsInline determines if a namespaced tag should be treated as an inline element based on context, content, and namespace.

func StringToBytes ¶ added in v1.3.0

func StringToBytes(s string) []byte

StringToBytes converts a string to a byte slice without memory allocation. The returned slice shares memory with the original string.

WARNING: The returned slice MUST NOT be modified. Go strings are immutable, and modifying the returned slice would violate this immutability, potentially causing undefined behavior in other code holding references to the string.

Use this only for short-lived operations where the string is guaranteed to remain in scope, such as passing strings to functions that accept []byte.
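
The reverse conversion can be sketched the same way with the unsafe helpers from Go 1.20. This illustrates the general technique under a hypothetical name (stringToBytes) and is not necessarily the package's implementation; returning nil for the empty string is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"unsafe"
)

// stringToBytes (hypothetical) returns a byte slice that shares memory
// with s. The slice must be treated as read-only.
func stringToBytes(s string) []byte {
	if s == "" {
		return nil // assumption: the empty string maps to a nil slice
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	doc := "<!DOCTYPE html><html></html>"
	// Read-only use: pass the string to an API that accepts []byte.
	fmt.Println(bytes.HasPrefix(stringToBytes(doc), []byte("<!DOCTYPE")))
}
```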

func TableProcessor ¶ added in v1.3.0

func TableProcessor() *table.Processor

TableProcessor returns the table processor with default accessor and walker.

func WalkNodes ¶

func WalkNodes(node *html.Node, fn func(*html.Node) bool)

WalkNodes traverses the HTML node tree iteratively using an explicit stack to avoid potential stack overflow on deeply nested documents. The fn callback is called for each node. If fn returns false, traversal stops for that branch (node's children are not visited).
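
A stack-based traversal with branch pruning can be sketched as follows. To stay self-contained, the example uses a hypothetical node type that mirrors only the FirstChild and NextSibling links of html.Node; visiting in document order is an assumption, since the documentation does not state the order.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, minimal stand-in for html.Node.
type node struct {
	Tag                     string
	FirstChild, NextSibling *node
}

// walk visits nodes iteratively. If fn returns false, the children of
// that node are skipped, but the rest of the tree is still visited.
func walk(root *node, fn func(*node) bool) {
	if root == nil {
		return
	}
	stack := []*node{root}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if !fn(n) {
			continue // prune this branch
		}
		// Collect children, then push in reverse so the first child is popped first.
		var children []*node
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			children = append(children, c)
		}
		for i := len(children) - 1; i >= 0; i-- {
			stack = append(stack, children[i])
		}
	}
}

// sampleTree builds html -> (head -> title), (body -> p).
func sampleTree() *node {
	head := &node{Tag: "head", FirstChild: &node{Tag: "title"}}
	head.NextSibling = &node{Tag: "body", FirstChild: &node{Tag: "p"}}
	return &node{Tag: "html", FirstChild: head}
}

// visitOrder returns the visited tags, pruning below the tag named skip.
func visitOrder(root *node, skip string) string {
	var tags []string
	walk(root, func(n *node) bool {
		tags = append(tags, n.Tag)
		return n.Tag != skip
	})
	return strings.Join(tags, ",")
}

func main() {
	fmt.Println(visitOrder(sampleTree(), "head")) // title is skipped
}
```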

Types ¶

type AuditRecorder ¶ added in v1.3.0

type AuditRecorder interface {
	// RecordBlockedTag records when a dangerous tag is removed.
	RecordBlockedTag(tag string)
	// RecordBlockedAttr records when a dangerous attribute is removed.
	RecordBlockedAttr(attr, value string)
	// RecordBlockedURL records when a dangerous URL is blocked.
	RecordBlockedURL(url, reason string)
}

AuditRecorder defines the interface for recording security audit events. This interface is used internally to decouple the sanitization code from the main audit implementation.
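
As an illustration of what an implementation might look like, the sketch below copies the interface locally (the internal package cannot be imported) and defines a hypothetical countingRecorder that tallies events. Guarding the counters with a mutex is a precaution chosen for the example, not a documented requirement.

```go
package main

import (
	"fmt"
	"sync"
)

// AuditRecorder is copied from the documentation above.
type AuditRecorder interface {
	RecordBlockedTag(tag string)
	RecordBlockedAttr(attr, value string)
	RecordBlockedURL(url, reason string)
}

// countingRecorder (hypothetical) counts each kind of blocked item.
type countingRecorder struct {
	mu                sync.Mutex
	tags, attrs, urls int
}

func (r *countingRecorder) RecordBlockedTag(tag string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.tags++
}

func (r *countingRecorder) RecordBlockedAttr(attr, value string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.attrs++
}

func (r *countingRecorder) RecordBlockedURL(url, reason string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.urls++
}

// summary formats the counters for logging.
func (r *countingRecorder) summary() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return fmt.Sprintf("tags=%d attrs=%d urls=%d", r.tags, r.attrs, r.urls)
}

// Compile-time check that countingRecorder satisfies the interface.
var _ AuditRecorder = (*countingRecorder)(nil)

func main() {
	rec := &countingRecorder{}
	rec.RecordBlockedTag("script")
	rec.RecordBlockedAttr("onclick", "alert(1)")
	rec.RecordBlockedURL("javascript:alert(1)", "dangerous scheme")
	fmt.Println(rec.summary())
}
```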

type Cache ¶

type Cache struct {
	// contains filtered or unexported fields
}

Cache is a thread-safe LRU cache with optional TTL support. It uses a doubly-linked list for LRU ordering with sentinel nodes to simplify edge case handling.

TTL Behavior:

ttl > 0: Entries expire after the specified duration
ttl = 0: Entries never expire based on time (only LRU eviction)
ttl < 0: Treated as 0 (no time-based expiration)

Thread Safety: All public methods are safe for concurrent use. Get() uses a write lock to prevent TOCTOU race conditions.
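
The behavior described above can be sketched in a few dozen lines. This is not the package's implementation: it swaps in container/list for the hand-written sentinel list, uses the hypothetical names lruCache and newLRU, and assumes that a miss or an expired entry returns nil.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

type entry struct {
	key     string
	value   any
	expires time.Time // zero value means "never expires"
}

// lruCache is a hypothetical, simplified LRU cache with optional TTL.
type lruCache struct {
	mu    sync.Mutex
	max   int
	ttl   time.Duration
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func newLRU(max int, ttl time.Duration) *lruCache {
	if ttl < 0 {
		ttl = 0 // negative TTL is treated as no expiration
	}
	return &lruCache{max: max, ttl: ttl, order: list.New(), items: map[string]*list.Element{}}
}

func (c *lruCache) Set(key string, value any) {
	if c.max <= 0 {
		return // cache disabled
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	var exp time.Time
	if c.ttl > 0 {
		exp = time.Now().Add(c.ttl)
	}
	if el, ok := c.items[key]; ok {
		el.Value = &entry{key, value, exp}
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, value, exp})
	if c.order.Len() > c.max {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func (c *lruCache) Get(key string) any {
	c.mu.Lock() // exclusive lock: Get reorders the list and may delete
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil
	}
	e := el.Value.(*entry)
	if !e.expires.IsZero() && time.Now().After(e.expires) {
		c.order.Remove(el)
		delete(c.items, key)
		return nil
	}
	c.order.MoveToFront(el)
	return e.value
}

func main() {
	c := newLRU(2, time.Hour)
	c.Set("a", 1)
	c.Set("b", 2)
	c.Get("a")    // "a" becomes most recently used
	c.Set("c", 3) // evicts "b"
	fmt.Println(c.Get("a"), c.Get("b"), c.Get("c"))
}
```

Get takes the exclusive lock because even a read mutates shared state: it moves the entry to the front and may delete an expired one. Checking under a read lock and then upgrading would open the check-then-act window that the TOCTOU note refers to.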

func NewCache ¶

func NewCache(maxEntries int, ttl time.Duration) *Cache

NewCache creates a new LRU cache with the specified maximum entries and TTL. If maxEntries is 0 or negative, the cache is disabled (Set becomes a no-op). If ttl is 0 or negative, entries never expire based on time.

func (*Cache) Clear ¶

func (c *Cache) Clear()

func (*Cache) Get ¶

func (c *Cache) Get(key string) any

func (*Cache) Len ¶ added in v1.3.0

func (c *Cache) Len() int

Len returns the current number of entries in the cache. This is useful for monitoring and debugging.

func (*Cache) Set ¶

func (c *Cache) Set(key string, value any)

func (*Cache) StartCleanup ¶ added in v1.3.0

func (c *Cache) StartCleanup(interval time.Duration) context.CancelFunc

StartCleanup starts a background goroutine that periodically cleans up expired entries. This is useful when TTL is enabled and the cache receives many one-time accesses, as expired entries would otherwise only be cleaned when accessed or during eviction.

The cleanup goroutine runs at the specified interval until StopCleanup is called or the cache is garbage collected. If interval is 0, DefaultCacheCleanupInterval is used.

This method is idempotent - calling it multiple times has no additional effect.

IMPORTANT: While runtime.SetFinalizer ensures cleanup when the Cache is garbage collected, it is still recommended to call StopCleanup() explicitly for deterministic resource release, especially in long-running applications.

Usage:

cache := NewCache(1000, time.Hour)
cache.StartCleanup(5 * time.Minute)
defer cache.StopCleanup()

func (*Cache) StopCleanup ¶ added in v1.3.0

func (c *Cache) StopCleanup()

StopCleanup stops the background cleanup goroutine if it was started. It is safe to call this method multiple times. This method also clears the finalizer to prevent double cleanup.

type DefaultScorer ¶ added in v1.3.0

type DefaultScorer struct {
	// contains filtered or unexported fields
}

DefaultScorer is the default implementation of the Scorer interface.

func NewDefaultScorer ¶ added in v1.3.0

func NewDefaultScorer() *DefaultScorer

NewDefaultScorer creates a new DefaultScorer with the default configuration.

func NewDefaultScorerWithConfig ¶ added in v1.3.0

func NewDefaultScorerWithConfig(config *ScoringConfig) *DefaultScorer

NewDefaultScorerWithConfig creates a new DefaultScorer with custom configuration. If config is nil, the default configuration is used.

func (*DefaultScorer) Score ¶ added in v1.3.0

func (s *DefaultScorer) Score(node *html.Node) int

Score calculates a relevance score for a content node.

func (*DefaultScorer) ScoreAttributes ¶ added in v1.3.0

func (s *DefaultScorer) ScoreAttributes(n *html.Node) int

ScoreAttributes calculates a score based on element attributes. This is the public version for external use.

func (*DefaultScorer) ShouldRemove ¶ added in v1.3.0

func (s *DefaultScorer) ShouldRemove(node *html.Node) bool

ShouldRemove determines if a node should be removed from the content tree.

type EncodingDetector ¶ added in v1.2.0

type EncodingDetector struct {
	// User-specified encoding override (optional)
	ForcedEncoding string

	// Smart detection options
	EnableSmartDetection bool // Enable intelligent encoding detection
	MaxSampleSize        int  // Max bytes to analyze for statistical detection (default: 10KB, max: 1MB)
}

EncodingDetector handles charset detection and conversion.

IMPORTANT: The data slice passed to detection methods must not be modified during the detection process. For concurrent access, pass a copy of the data.

func NewEncodingDetector ¶ added in v1.2.0

func NewEncodingDetector() *EncodingDetector

NewEncodingDetector creates a new encoding detector with smart detection enabled. The default MaxSampleSize is 10KB which is sufficient for most HTML documents.

func (*EncodingDetector) DetectAndConvert ¶ added in v1.2.0

func (ed *EncodingDetector) DetectAndConvert(data []byte) ([]byte, string, error)

DetectAndConvert detects the charset and converts the data to UTF-8 in one step.

func (*EncodingDetector) DetectCharset ¶ added in v1.2.0

func (ed *EncodingDetector) DetectCharset(data []byte) string

DetectCharset attempts to detect the character encoding from HTML content.

func (*EncodingDetector) DetectCharsetBasic ¶ added in v1.2.0

func (ed *EncodingDetector) DetectCharsetBasic(data []byte) string

DetectCharsetBasic performs basic charset detection (BOM, meta tags, UTF-8 validation). It is optimized with a fast path for pure ASCII/UTF-8 content to avoid string allocation.
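
Two of the three checks named above, the byte order mark and UTF-8 validation, can be sketched with the standard library. The example is a simplification under stated assumptions: detectBasic is a hypothetical name, the meta-tag step is omitted, UTF-32 BOMs are not distinguished, and an empty result for undecided input is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// detectBasic (hypothetical) checks for a BOM first, then falls back to
// UTF-8 validation. It returns "" when neither check is conclusive.
func detectBasic(data []byte) string {
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "utf-8"
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "utf-16le"
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "utf-16be"
	}
	// utf8.Valid works on the byte slice directly, so no string is allocated.
	if utf8.Valid(data) {
		return "utf-8"
	}
	return "" // a real detector would continue with meta tags or statistics
}

func main() {
	fmt.Printf("%q\n", detectBasic([]byte("\xEF\xBB\xBF<html>")))
	fmt.Printf("%q\n", detectBasic([]byte("plain ascii")))
	fmt.Printf("%q\n", detectBasic([]byte{0xE9, 'a'})) // Latin-1 byte, not valid UTF-8
}
```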

func (*EncodingDetector) DetectCharsetSmart ¶ added in v1.2.0

func (ed *EncodingDetector) DetectCharsetSmart(data []byte) EncodingMatch

DetectCharsetSmart performs intelligent charset detection using statistical analysis.

func (*EncodingDetector) SetMaxSampleSize ¶ added in v1.3.0

func (ed *EncodingDetector) SetMaxSampleSize(size int) *EncodingDetector

SetMaxSampleSize sets the maximum sample size for statistical detection. Values <= 0 use the default (10KB). Values > 1MB are capped at 1MB to prevent memory exhaustion. This method returns the detector for method chaining.
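
The clamping rule and the chaining style can be illustrated with a small stand-in type. The detector type and the constants are hypothetical, and interpreting 10KB as 10*1024 bytes and 1MB as 1024*1024 bytes is an assumption.

```go
package main

import "fmt"

const (
	defaultSampleSize = 10 * 1024   // assumption: 10KB means 10240 bytes
	maxSampleSize     = 1024 * 1024 // assumption: 1MB means 1048576 bytes
)

// detector is a hypothetical stand-in for EncodingDetector.
type detector struct {
	MaxSampleSize int
}

// SetMaxSampleSize clamps size into (0, maxSampleSize] and returns the
// receiver so calls can be chained.
func (d *detector) SetMaxSampleSize(size int) *detector {
	switch {
	case size <= 0:
		size = defaultSampleSize
	case size > maxSampleSize:
		size = maxSampleSize
	}
	d.MaxSampleSize = size
	return d
}

func main() {
	d := (&detector{}).SetMaxSampleSize(64 * 1024)
	fmt.Println(d.MaxSampleSize)
	fmt.Println(d.SetMaxSampleSize(-1).MaxSampleSize)
	fmt.Println(d.SetMaxSampleSize(8 << 20).MaxSampleSize)
}
```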

func (*EncodingDetector) ToUTF8 ¶ added in v1.2.0

func (ed *EncodingDetector) ToUTF8(data []byte, charset string) ([]byte, error)

ToUTF8 converts the given data from the specified charset to UTF-8.

type EncodingMatch ¶ added in v1.2.0

type EncodingMatch struct {
	Charset    string
	Confidence int  // 0-100
	Score      int  // Detailed score
	Valid      bool // Whether decoding produced valid UTF-8
}

EncodingMatch represents a detected encoding with a confidence score.

type NoOpAuditRecorder ¶ added in v1.3.0

type NoOpAuditRecorder struct{}

NoOpAuditRecorder is an audit recorder that does nothing. Used when audit logging is disabled.

func (NoOpAuditRecorder) RecordBlockedAttr ¶ added in v1.3.0

func (NoOpAuditRecorder) RecordBlockedAttr(attr, value string)

RecordBlockedAttr does nothing.

func (NoOpAuditRecorder) RecordBlockedTag ¶ added in v1.3.0

func (NoOpAuditRecorder) RecordBlockedTag(tag string)

RecordBlockedTag does nothing.

func (NoOpAuditRecorder) RecordBlockedURL ¶ added in v1.3.0

func (NoOpAuditRecorder) RecordBlockedURL(url, reason string)

RecordBlockedURL does nothing.

type Scorer ¶ added in v1.3.0

type Scorer interface {
	// Score calculates a relevance score for a content node.
	// Higher scores indicate more likely main content.
	Score(node *html.Node) int
	// ShouldRemove determines if a node should be removed from the content tree.
	ShouldRemove(node *html.Node) bool
}

Scorer defines the interface for content scoring algorithms. Implementations can provide custom scoring logic for content extraction.

type ScoringConfig ¶ added in v1.3.0

type ScoringConfig struct {
	// PositiveStrongPatterns maps pattern strings to their strong positive scores.
	PositiveStrongPatterns map[string]int
	// PositiveMediumPatterns maps pattern strings to their medium positive scores.
	PositiveMediumPatterns map[string]int
	// NegativeStrongPatterns maps pattern strings to their strong negative scores.
	NegativeStrongPatterns map[string]int
	// NegativeMediumPatterns maps pattern strings to their medium negative scores.
	NegativeMediumPatterns map[string]int
	// NegativeWeakPatterns maps pattern strings to their weak negative scores.
	NegativeWeakPatterns map[string]int
	// RemovePatterns maps pattern strings to a boolean indicating removal.
	RemovePatterns map[string]bool
	// TagScores maps tag names to their base scores.
	TagScores map[string]int
}

ScoringConfig holds the configuration for the default scorer.

func DefaultScoringConfig ¶ added in v1.3.0

func DefaultScoringConfig() *ScoringConfig

DefaultScoringConfig returns the default scoring configuration.

Directories ¶

Path	Synopsis
table	Package table provides HTML table extraction and rendering functionality.
testutil	Package testutil provides common test utilities for the html package.