Documentation
Overview
Package pholcus provides a distributed, high-concurrency web crawler written in pure Go.
Pholcus (Ghost Spider) is aimed at web data collection. It gives users with basic Go or JavaScript skills a crawler whose behavior is controlled mainly through custom rules.
It supports three operation modes (standalone, server, and client) and three interfaces (Web, GUI, and command line). Rules are simple and flexible, tasks run concurrently in batches, and results can be written to MySQL, MongoDB, Kafka, CSV, Excel, and other outputs; shared demos are available. It also supports horizontal and vertical crawling, simulated login, and task pause and cancel.
Official QQ group: Go Big Data 42731170
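The three operation modes named in the overview can be pictured as a small enumeration that is selected at startup. The sketch below is illustrative only: `Mode`, `ParseMode`, and `Describe` are hypothetical names, not part of the pholcus API, and the role descriptions are assumptions based on the `distribute` package synopsis (master-slave node communication).

```go
package main

import (
	"fmt"
	"strings"
)

// Mode is a hypothetical enumeration of the three operation modes
// mentioned in the overview. It is not the type pholcus itself uses.
type Mode int

const (
	Standalone Mode = iota
	Server
	Client
)

// String returns the lowercase name of the mode.
func (m Mode) String() string {
	switch m {
	case Standalone:
		return "standalone"
	case Server:
		return "server"
	case Client:
		return "client"
	}
	return "unknown"
}

// ParseMode maps a user-supplied name to a Mode, ignoring case and
// surrounding whitespace. Unknown names produce an error.
func ParseMode(name string) (Mode, error) {
	switch strings.ToLower(strings.TrimSpace(name)) {
	case "standalone":
		return Standalone, nil
	case "server":
		return Server, nil
	case "client":
		return Client, nil
	}
	return Standalone, fmt.Errorf("unknown mode %q", name)
}

// Describe gives an assumed one-line role for each mode.
func Describe(m Mode) string {
	switch m {
	case Server:
		return "hands tasks out to connected clients (assumed role)"
	case Client:
		return "receives tasks from a server and crawls (assumed role)"
	default:
		return "schedules and crawls on a single machine (assumed role)"
	}
}

func main() {
	for _, name := range []string{"standalone", "server", "client"} {
		m, err := ParseMode(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %s\n", m, Describe(m))
	}
}
```

Keeping the mode as a typed constant rather than a raw string lets the compiler catch misspelled modes everywhere except at the single parsing boundary.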
Directories
| Path | Synopsis |
|---|---|
| app | Package app provides the main entry and task scheduling for the crawler application. |
| app/aid/history | Package history provides persistence and inheritance of success and failure request records. |
| app/aid/proxy | Package proxy provides proxy IP pool management and online filtering. |
| app/crawler | Package crawler provides the core crawler engine for request scheduling and page downloading. |
| app/distribute | Package distribute provides distributed task scheduling and master-slave node communication. |
| app/distribute/teleport | Package teleport provides a high-concurrency API framework for distributed systems. |
| app/downloader | Package downloader defines the page downloader interface. |
| app/downloader/request | Package request provides encapsulation and deduplication of crawl requests. |
| app/downloader/surfer | Package surfer provides a high-concurrency web downloader written in Go. |
| app/downloader/surfer/agent | Package agent generates user agent strings for well-known browsers and for custom browsers. |
| app/downloader/surfer/example (command) | |
| app/pipeline | Package pipeline provides the data collection and output pipeline. |
| app/pipeline/collector | Package collector implements result collection and output. |
| app/pipeline/collector/data | Package data provides storage structure definitions for data and file cells. |
| app/scheduler | Package scheduler provides crawl task scheduling and resource allocation. |
| app/spider | Package spider provides spider rule definition, species registration, and parsing. |
| app/spider/common | Package common provides HTML cleaning, form parsing, and other utility functions for spider rules. |
| cmd | Package cmd implements the command-line interface for Pholcus. |
| common | |
| common/beanstalkd | Package beanstalkd provides a client wrapper for the Beanstalkd job queue. |
| common/bytes | Package bytes provides byte unit conversion and parsing. |
| common/closer | Package closer provides utilities for closing resources with error logging. |
| common/gc | Package gc provides manual garbage collection to release heap memory. |
| common/goquery | Package goquery implements features similar to jQuery, including the chainable syntax, to manipulate and query an HTML document. |
| common/kafka | Package kafka provides a wrapper for sending messages to a Kafka message queue. |
| common/mahonia | Package mahonia is a character-set conversion library for Go. |
| common/mahonia/mahoniconv (command) | |
| common/mgo | Package mgo provides a wrapper for MongoDB database connections and operations. |
| common/mysql | Package mysql provides a wrapper for MySQL database connections and operations. |
| common/ping | Package ping provides ICMP network connectivity detection. |
| common/pinyin | Package pinyin provides Chinese character to Pinyin conversion. |
| common/pool | Package pool provides a generic resource pool with dynamic growth and idle resource recycling. |
| common/queue | Package queue provides a bounded channel-based queue. |
| common/session | Package session provider |
| common/simplejson | Package simplejson provides simplified JSON parsing and manipulation. |
| common/util | Package util provides common utility functions such as MD5, random numbers, path handling, etc. |
| common/websocket | Package websocket implements a client and server for the WebSocket protocol as specified in RFC 6455. |
| config | Package config provides software configuration, path, and runtime parameter loading and management. |
| exec | Package exec provides entry points to launch the CMD or Web interface based on run mode. |
| logs | Package logs provides multi-output logging. |
| rules (module) | |
| runtime | |
| runtime/cache | Package cache provides common configuration and cache for task runtime. |
| runtime/status | Package status provides runtime mode, data header type, and status constant definitions. |
| web | Package web provides HTTP service, routing, and embedded resources for the Web interface. |


