pholcus

package module v1.4.0
Published: Mar 3, 2026 License: Apache-2.0 Imports: 0 Imported by: 0

README


Pholcus (Ghost Spider)

A distributed, high-concurrency crawler framework written in pure Go


Quick Start · Core Features · Architecture · Interfaces · Rule Writing · FAQ


Disclaimer

This software is intended for academic research only. Users must comply with the laws and regulations of their jurisdiction; do not use it for illegal purposes!

In mainland China, for example, crawler developers have repeatedly been involved in lawsuits and compliance incidents (see the news).

Solemn declaration: users bear full responsibility for any consequences arising from illegal or non-compliant use!


Core Features

Run Modes

  • Standalone mode — works out of the box
  • Server mode — distributes tasks
  • Client mode — receives and executes tasks

Interfaces

  • Web UI — cross-platform, operated from a browser
  • GUI — native Windows interface
  • Cmd — batch scheduling from the command line

Data Outputs

  • MySQL / MongoDB
  • Kafka / Beanstalkd
  • CSV / Excel
  • Raw file downloads

Crawler Rules

  • Static rules (Go) — high performance, deep customization
  • Dynamic rules (JS/XML) — hot-loaded, no compilation required
  • 30+ built-in example rules

More highlights:

  • Three download engines via surfer: Surf (high-concurrency HTTP) / PhantomJS / Chrome (headless Chromium that executes JS automatically)
  • Smart cookie management: fixed UserAgent with cookies saved automatically, or random UserAgent with cookies disabled
  • Simulated login, custom headers, POST form submission
  • Proxy IP pool with automatic rotation at a configurable frequency
  • Random pauses that mimic human browsing behavior
  • Configurable collection volume and number of concurrent goroutines
  • Automatic request deduplication + automatic retry of failed requests
  • Persisted success records, supporting resumption of interrupted crawls
  • Full-duplex socket framework for distributed communication

Architecture

(Diagrams: module structure, project architecture, distributed architecture)

Directory Structure

pholcus/
├── app/                    core logic
│   ├── crawler/            crawler engine & concurrency pool
│   ├── downloader/         downloader (surfer)
│   ├── pipeline/           data pipeline & output backends
│   ├── scheduler/          request scheduler
│   ├── spider/             spider rule engine
│   ├── distribute/         distributed master/slave communication
│   └── aid/                helpers (history records, proxy IPs)
├── config/                 configuration management
├── exec/                   entry points & platform adaptation
├── cmd/                    command-line mode
├── gui/                    GUI mode (Windows)
├── web/                    Web UI mode
├── common/                 shared utilities (DB drivers, encodings, queues, etc.)
├── logs/                   logging module
├── runtime/                runtime cache & state
└── sample/                 sample program & 30+ spider rules

Quick Start

Requirements

  • Go 1.18+ (1.22+ recommended)

Get the Source

git clone https://github.com/andeya/pholcus.git
cd pholcus

Write an Entry Point

Create main.go (or see sample/main.go):

package main

import (
    "github.com/andeya/pholcus/exec"
    _ "github.com/andeya/pholcus/sample/static_rules"  // built-in rule library
    // _ "yourproject/rules"                            // your own rule library
)

func main() {
    // Interface to launch: web / gui / cmd.
    // Can be overridden with the -a_ui runtime flag.
    exec.DefaultRun("web")
}

Build and Run

# Build (GUI packages are excluded automatically on non-Windows platforms)
go build -o pholcus ./sample/

# List all available flags
./pholcus -h

To hide the cmd window on Windows, build with:

go build -ldflags="-H=windowsgui -linkmode=internal" -o pholcus.exe ./sample/

Command-Line Flags

./pholcus -h

(Screenshot: command-line help)


Interfaces

Web UI

After startup, open http://localhost:2015 in a browser to select spiders, configure parameters, and start or stop tasks.

(Screenshot: Web interface)

GUI (Windows only)

A native desktop client with the same functionality as the Web version.

(Screenshot: GUI)

Cmd

Suited to server deployments and cron-scheduled jobs.

pholcus -a_ui=cmd -a_mode=0 -c_spider=3,8 -a_outtype=csv -a_thread=20 \
    -a_batchcap=5000 -a_pause=300 -a_proxyminute=0 \
    -a_keyins="<pholcus><golang>" -a_limit=10 -a_success=true -a_failure=true

Rule Writing

Pholcus supports two kinds of rules: static rules (Go) and dynamic rules (JS/XML).

Static Rules (Go)

Compiled into the binary for the best performance; suited to heavyweight crawling projects. Just create a Go file under sample/static_rules/:

package rules

import (
    "net/http"
    "github.com/andeya/pholcus/app/downloader/request"
    "github.com/andeya/pholcus/app/spider"
)

func init() {
    mySpider.Register()
}

var mySpider = &spider.Spider{
    Name:         "示例爬虫",
    Description:  "示例爬虫 [Auto Page] [http://example.com]",
    EnableCookie: true,
    RuleTree: &spider.RuleTree{
        Root: func(ctx *spider.Context) {
            ctx.AddQueue(&request.Request{
                URL:  "http://example.com",
                Rule: "首页",
            })
        },
        Trunk: map[string]*spider.Rule{
            "首页": {
                ParseFunc: func(ctx *spider.Context) {
                    ctx.Output(map[int]interface{}{
                        0: ctx.GetText(),
                    })
                },
            },
        },
    },
}

More examples live in sample/static_rules/, covering 30+ sites including Baidu, JD, Taobao, and Zhihu.

Dynamic Rules (JS/XML)

Hot-loaded without compilation; suited to lightweight crawling. Put a .pholcus.xml file in the dyn_rules/ directory:

<Spider>
    <Name>百度搜索</Name>
    <Description>百度搜索 [Auto Page] [http://www.baidu.com]</Description>
    <Pausetime>300</Pausetime>
    <EnableLimit>false</EnableLimit>
    <EnableCookie>true</EnableCookie>
    <EnableKeyin>true</EnableKeyin>
    <NotDefaultField>false</NotDefaultField>
    <Namespace><Script></Script></Namespace>
    <SubNamespace><Script></Script></SubNamespace>
    <Root>
        <Script param="ctx">
        ctx.JsAddQueue({
            URL: "http://www.baidu.com/s?wd=" + ctx.GetKeyin(),
            Rule: "搜索结果"
        });
        </Script>
    </Root>
    <Rule name="搜索结果">
        <ParseFunc>
            <Script param="ctx">
            ctx.Output({
                "标题": ctx.GetDom().Find("title").Text(),
                "内容": ctx.GetText()
            });
            </Script>
        </ParseFunc>
    </Rule>
</Spider>

The legacy .pholcus.html format is also supported. The content of <Script> tags is wrapped in CDATA automatically, so special characters need no manual escaping.


Downloader

Pholcus ships three built-in download engines, switched via DownloaderID:

ID Name Description
0 Surf Default engine. Pure-Go HTTP client with high concurrency; suits most static-page crawling
1 PhantomJS Headless browser based on PhantomJS (no longer maintained); executes JS; low concurrency
2 Chrome Headless browser based on Chromium (chromedp); executes JS and can pass security checks; recommended for sites with strict anti-crawling

Usage in static rules (Go)

import "github.com/andeya/pholcus/app/downloader/request"

// Use the default Surf engine (DownloaderID may be omitted)
ctx.AddQueue(&request.Request{
    URL:  "https://example.com",
    Rule: "页面",
})

// Use the Chrome headless browser engine
ctx.AddQueue(&request.Request{
    URL:          "https://www.baidu.com/s?wd=pholcus",
    Rule:         "搜索结果",
    DownloaderID: request.ChromeID,
})

Usage in dynamic rules (JS/XML)

<Script param="ctx">
ctx.JsAddQueue({
    URL: "https://www.baidu.com/s?wd=pholcus",
    Rule: "搜索结果",
    DownloaderID: 2
});
</Script>

Chrome Engine Notes

The Chrome engine depends on a locally installed Chromium / Google Chrome browser, driven via chromedp.

Use it when:

  • the target site renders content with JS (SPA / CSR pages)
  • the target site gates access behind security checks (e.g. Baidu's) that require a browser to execute JS and redirect
  • you need to emulate a real browser environment to evade anti-bot detection

Requirements:

  • Chrome / Chromium must be installed locally
  • macOS: brew install --cask google-chrome or brew install chromium
  • Linux: apt install chromium-browser or yum install chromium
  • Windows: install Google Chrome

Caveats:

  • The Chrome engine launches a separate headless browser instance per request, so it consumes far more resources than Surf
  • Prefer Surf; fall back to Chrome only when Surf cannot fetch the content
  • The engine has built-in anti-automation countermeasures (hides navigator.webdriver, disables automation flags, etc.)

Configuration

Runtime Directory

├── pholcus                    executable
├── dyn_rules/                 dynamic-rule directory (configurable in config.ini)
│   └── xxx.pholcus.xml        dynamic rule file
└── pholcus_pkg/               runtime files
    ├── config.ini             configuration file
    ├── proxy.lib              proxy IP list
    ├── phantomjs              PhantomJS binary
    ├── text_out/              text output directory
    ├── file_out/              file output directory
    ├── logs/                  log directory
    ├── history/               history records
    └── cache/                 temporary cache

Proxy IPs

List proxy addresses, one per line, in pholcus_pkg/proxy.lib:

http://183.141.168.95:3128
https://60.13.146.92:8088
http://59.59.4.22:8090

Enable via the "proxy rotation frequency" option in the UI or the -a_proxyminute command-line flag.

Note: on macOS the proxy feature requires root privileges; otherwise available proxies cannot be detected via ping.
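A proxy.lib file in this format could be parsed along these lines. This is a stdlib-only sketch; `parseProxyLib` is a hypothetical helper, not Pholcus's real loader:

```go
package main

import (
	"bufio"
	"fmt"
	"net/url"
	"strings"
)

// parseProxyLib reads one proxy URL per line, skipping blank lines
// and entries that lack a scheme or host.
// (Hypothetical sketch of how a proxy.lib file might be loaded.)
func parseProxyLib(content string) []string {
	var proxies []string
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		u, err := url.Parse(line)
		if err != nil || u.Scheme == "" || u.Host == "" {
			continue // skip malformed entries
		}
		proxies = append(proxies, line)
	}
	return proxies
}

func main() {
	lib := "http://183.141.168.95:3128\n\nnot a proxy\nhttps://60.13.146.92:8088\n"
	fmt.Println(parseProxyLib(lib)) // two valid proxies survive
}
```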


Built-in Spider Rules

Category Rules
Search engines Baidu Search, Baidu News, Google Search, JD Search, Taobao Search
E-commerce JD, Taobao, Kaola, Miyabaobei, SF Express Haitao, Holland & Barrett
News China News Service, NetEase News, People's Daily Online
Social & Q&A Zhihu Daily, Zhihu editor picks, Wukong Q&A, Weibo followers
Real estate & autos Fang.com second-hand homes, Autohome
Digital & tech ZOL phones, ZOL PCs, ZOL tablets, LeWa
Classifieds Ganji companies, nationwide area codes
Social tools QQ avatars
Academic journals IJGUC
Other Alibaba, 技版, file download test

FAQ

Are duplicate URLs in the request queue deduplicated automatically?

Yes, by default. To allow repeat requests, set Request.Reloadable = true.
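The deduplicate-unless-Reloadable behavior can be illustrated with a small stdlib-only sketch (`dedupFilter` is hypothetical; Pholcus's internal deduplication differs in detail):

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// dedupFilter fingerprints each URL and drops repeats, unless the
// request is explicitly marked reloadable.
// (Simplified sketch, not Pholcus's real code.)
type dedupFilter struct {
	seen map[[16]byte]bool
}

func newDedupFilter() *dedupFilter {
	return &dedupFilter{seen: make(map[[16]byte]bool)}
}

// Accept reports whether the request should be enqueued.
func (f *dedupFilter) Accept(url string, reloadable bool) bool {
	if reloadable {
		return true // reloadable requests bypass deduplication
	}
	key := md5.Sum([]byte(url))
	if f.seen[key] {
		return false // duplicate, drop it
	}
	f.seen[key] = true
	return true
}

func main() {
	f := newDedupFilter()
	fmt.Println(f.Accept("http://example.com", false)) // true: first time
	fmt.Println(f.Accept("http://example.com", false)) // false: duplicate
	fmt.Println(f.Accept("http://example.com", true))  // true: reloadable
}
```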

Can the framework detect whether a page's content has changed?

There is no built-in change detection, but you can implement it yourself inside a rule.

What counts as a successful request?

Whether the server returned a response stream, not the HTTP status code. A 404 page therefore still counts as a "successful request".

How are failed requests retried?

Each URL is attempted the configured number of times; if it still fails, it enters a deferred queue that is retried automatically after the current task finishes normally. URLs that fail again are saved to the failure history. The next time the same rule runs, you can opt to inherit the historical failures and retry them automatically.


Contributing

Issues and pull requests are welcome!

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Commit your changes: git commit -m 'Add your feature'
  4. Push the branch: git push origin feature/your-feature
  5. Open a pull request

License

This project is released under the Apache License 2.0.


Created by andeya — if you find it helpful, please give it a star!

Documentation

Overview

Package pholcus provides a distributed, high-concurrency web crawler written in pure Go.

Pholcus (Ghost Spider) targets web data collection and offers a powerful crawler for users with basic Go or JavaScript skills, focusing on rule customization.

It supports three operation modes: standalone, server, and client; three interfaces: Web, GUI, and command-line; simple flexible rules; batch task concurrency; and rich output formats (MySQL, MongoDB, Kafka, CSV, Excel, etc.) with shared demos. It also supports horizontal and vertical crawling modes, simulated login, and advanced features such as task pause and cancel.

Official QQ group: Go Big Data 42731170

Directories

Path Synopsis
app
Package app provides the main entry and task scheduling for the crawler application.
aid/history
Package history provides persistence and inheritance of success and failure request records.
aid/proxy
Package proxy provides proxy IP pool management and online filtering.
crawler
Package crawler provides the core crawler engine for request scheduling and page downloading.
distribute
Package distribute provides distributed task scheduling and master-slave node communication.
distribute/teleport
Package teleport provides a high-concurrency API framework for distributed systems.
downloader
Package downloader defines the page downloader interface.
downloader/request
Package request provides encapsulation and deduplication of crawl requests.
downloader/surfer
Package surfer provides a high-concurrency web downloader written in Go.
downloader/surfer/agent
Package agent generates user-agent strings for well-known browsers and for custom browsers.
pipeline
Package pipeline provides the data collection and output pipeline.
pipeline/collector
Package collector implements result collection and output.
pipeline/collector/data
Package data provides storage structure definitions for data and file cells.
scheduler
Package scheduler provides crawl task scheduling and resource allocation.
spider
Package spider provides spider rule definition, species registration, and parsing.
spider/common
Package common provides HTML cleaning, form parsing, and other utility functions for spider rules.
cmd
Package cmd implements the command-line interface for Pholcus.
common
beanstalkd
Package beanstalkd provides a client wrapper for the Beanstalkd job queue.
bytes
Package bytes provides byte unit conversion and parsing.
closer
Package closer provides utilities for closing resources with error logging.
gc
Package gc provides manual garbage collection to release heap memory.
goquery
Package goquery implements features similar to jQuery, including the chainable syntax, to manipulate and query an HTML document.
kafka
Package kafka provides a Kafka message queue sending wrapper.
mahonia
This package is a character-set conversion library for Go.
mgo
Package mgo provides a MongoDB database connection and operation wrapper.
mysql
Package mysql provides a MySQL database connection and operation wrapper.
ping
Package ping provides ICMP network connectivity detection.
pinyin
Package pinyin provides Chinese character to Pinyin conversion.
pool
Package pool provides a generic resource pool with dynamic growth and idle resource recycling.
queue
Package queue provides a bounded channel-based queue.
session
Package session provider
simplejson
Package simplejson provides simplified JSON parsing and manipulation.
util
Package util provides common utility functions such as MD5, random numbers, path handling, etc.
websocket
Package websocket implements a client and server for the WebSocket protocol as specified in RFC 6455.
config
Package config provides software configuration, path, and runtime parameter loading and management.
exec
Package exec provides entry points to launch CMD or Web interface based on run mode.
gui
logs
Package logs provides multi-output logging.
rules module
runtime
cache
Package cache provides common configuration and cache for task runtime.
status
Package status provides runtime mode, data header type, and status constant definitions.
web
Package web provides HTTP service, routing, and embedded resources for the Web interface.
