Documentation ¶
Overview ¶
Package colly implements an HTTP scraping framework.
Index ¶
- type Collector
- func NewCollector() *Collector
- func (c *Collector) DisableCookies()
- func (c *Collector) Init()
- func (c *Collector) Limit(rule *LimitRule) error
- func (c *Collector) Limits(rules []*LimitRule) error
- func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)
- func (c *Collector) OnRequest(f RequestCallback)
- func (c *Collector) OnResponse(f ResponseCallback)
- func (c *Collector) Post(URL string, requestData map[string]string) error
- func (c *Collector) SetRequestTimeout(timeout time.Duration)
- func (c *Collector) Visit(URL string) error
- func (c *Collector) Wait()
- func (c *Collector) WithTransport(transport *http.Transport)
- type Context
- type HTMLCallback
- type HTMLElement
- func (h *HTMLElement) Attr(k string) string
- type LimitRule
- type Request
- type RequestCallback
- type Response
- type ResponseCallback
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Collector ¶
type Collector struct {
// UserAgent is the User-Agent string used by HTTP requests
UserAgent string
// MaxDepth limits the recursion depth of visited URLs.
// Set it to 0 for infinite recursion (default).
MaxDepth int
// AllowedDomains is a domain whitelist.
// Leave it blank to allow any domains to be visited
AllowedDomains []string
// AllowURLRevisit allows multiple downloads of the same URL
AllowURLRevisit bool
// MaxBodySize limits the retrieved response body. `0` means unlimited.
// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes)
MaxBodySize int
// contains filtered or unexported fields
}
Collector provides the scraper instance for a scraping job
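The effect of MaxDepth, AllowedDomains and AllowURLRevisit can be pictured as a check that runs before each request. The following is a minimal sketch using only the standard library; `visitFilter` and `allow` are hypothetical names, and the exact comparisons (for example whether depth is compared with `>` or `>=`, or whether ports are ignored) are assumptions rather than colly's actual implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// visitFilter is a hypothetical helper, not part of colly. It mirrors three
// of the Collector fields documented above.
type visitFilter struct {
	MaxDepth        int
	AllowedDomains  []string
	AllowURLRevisit bool
	visited         map[string]bool
}

// allow reports whether rawURL may be fetched at the given depth and records
// the URL as visited when it is accepted.
func (f *visitFilter) allow(rawURL string, depth int) bool {
	// MaxDepth == 0 means no depth limit.
	if f.MaxDepth > 0 && depth > f.MaxDepth {
		return false
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	// An empty AllowedDomains list allows every domain.
	if len(f.AllowedDomains) > 0 {
		found := false
		for _, d := range f.AllowedDomains {
			if d == u.Hostname() {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	// Reject repeated downloads unless revisits are explicitly allowed.
	if !f.AllowURLRevisit && f.visited[rawURL] {
		return false
	}
	if f.visited == nil {
		f.visited = make(map[string]bool)
	}
	f.visited[rawURL] = true
	return true
}

func main() {
	f := &visitFilter{MaxDepth: 2, AllowedDomains: []string{"example.com"}}
	fmt.Println(f.allow("http://example.com/a", 1)) // first visit
	fmt.Println(f.allow("http://example.com/a", 1)) // same URL again
	fmt.Println(f.allow("http://other.org/", 1))    // domain not listed
	fmt.Println(f.allow("http://example.com/b", 3)) // deeper than MaxDepth
}
```

Only the first call is accepted in this sketch; the other three are rejected by the revisit, domain and depth checks respectively.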
func NewCollector ¶
func NewCollector() *Collector
NewCollector creates a new Collector instance with default configuration
func (*Collector) DisableCookies ¶
func (c *Collector) DisableCookies()
DisableCookies turns off cookie handling for this collector
func (*Collector) Init ¶
func (c *Collector) Init()
Init initializes the Collector's private variables and sets default configuration for the Collector
func (*Collector) OnHTML ¶
func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)
OnHTML registers a function that is executed on every HTML element matched by the `goquerySelector` parameter. `goquerySelector` is a selector as used by https://github.com/PuerkitoBio/goquery
func (*Collector) OnRequest ¶
func (c *Collector) OnRequest(f RequestCallback)
OnRequest registers a function that is executed on every request made by the Collector
func (*Collector) OnResponse ¶
func (c *Collector) OnResponse(f ResponseCallback)
OnResponse registers a function that is executed on every response received by the Collector
func (*Collector) Post ¶
func (c *Collector) Post(URL string, requestData map[string]string) error
Post starts a collecting job by creating a POST request. Post also calls the previously registered OnRequest, OnResponse and OnHTML callbacks
func (*Collector) SetRequestTimeout ¶
func (c *Collector) SetRequestTimeout(timeout time.Duration)
SetRequestTimeout overrides the default timeout (10 seconds) for this collector
func (*Collector) Visit ¶
func (c *Collector) Visit(URL string) error
Visit starts the Collector's collecting job by creating a request to the URL given as its parameter. Visit also calls the previously registered OnRequest, OnResponse and OnHTML callbacks
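The relationship between callback registration and Visit can be sketched as follows. This is an illustration only: `miniCollector` and `stubTransport` are hypothetical types written for this example, the callback signatures are simplified, and OnHTML is left out because it needs an HTML parser. The stub transport returns a canned page so the sketch runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// miniCollector is a hypothetical, stripped-down stand-in for Collector.
type miniCollector struct {
	client     *http.Client
	onRequest  []func(*http.Request)
	onResponse []func(status int, body []byte)
}

// OnRequest registers a callback that runs before each request is sent.
func (c *miniCollector) OnRequest(f func(*http.Request)) {
	c.onRequest = append(c.onRequest, f)
}

// OnResponse registers a callback that runs after each response is read.
func (c *miniCollector) OnResponse(f func(status int, body []byte)) {
	c.onResponse = append(c.onResponse, f)
}

// Visit performs a GET request and invokes the registered callbacks in order.
func (c *miniCollector) Visit(rawURL string) error {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return err
	}
	for _, f := range c.onRequest {
		f(req)
	}
	resp, err := c.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	for _, f := range c.onResponse {
		f(resp.StatusCode, body)
	}
	return nil
}

// stubTransport answers every request with a fixed page.
type stubTransport struct{}

func (stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader("<html>hello</html>")),
		Request:    req,
	}, nil
}

// runDemo registers two callbacks, visits one URL and returns what they saw.
func runDemo() ([]string, error) {
	var events []string
	c := &miniCollector{client: &http.Client{Transport: stubTransport{}}}
	c.OnRequest(func(r *http.Request) {
		events = append(events, "request "+r.URL.String())
	})
	c.OnResponse(func(status int, body []byte) {
		events = append(events, fmt.Sprintf("response %d %s", status, body))
	})
	err := c.Visit("http://example.com/")
	return events, err
}

func main() {
	events, err := runDemo()
	fmt.Println(events, err)
}
```

Callbacks must be registered before Visit is called, since Visit only runs the callbacks that exist at the time of the request.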
func (*Collector) Wait ¶
func (c *Collector) Wait()
Wait returns when the collector jobs are finished
func (*Collector) WithTransport ¶
func (c *Collector) WithTransport(transport *http.Transport)
WithTransport allows you to set a custom http.Transport for this collector.
type Context ¶
type Context struct {
// contains filtered or unexported fields
}
Context provides a tiny layer for passing data between callbacks
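Context exposes no exported fields or methods in this listing, so the following is only a sketch of what such a layer could look like: a key-value store guarded by a mutex, so that a request callback can leave data for a later response callback. The type `sharedContext` and its `Put` and `Get` methods are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedContext is a hypothetical stand-in for Context: a small key-value
// store that is safe to use from concurrently running callbacks.
type sharedContext struct {
	mu   sync.RWMutex
	data map[string]string
}

func newSharedContext() *sharedContext {
	return &sharedContext{data: make(map[string]string)}
}

// Put stores value under key, replacing any previous value.
func (c *sharedContext) Put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// Get returns the value stored under key, or an empty string if none exists.
func (c *sharedContext) Get(key string) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[key]
}

func main() {
	ctx := newSharedContext()
	// A request callback could record something about the request...
	ctx.Put("requested_url", "http://example.com/")
	// ...and a response callback could read it back later.
	fmt.Println(ctx.Get("requested_url"))
}
```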
type HTMLCallback ¶
type HTMLCallback func(*HTMLElement)
HTMLCallback is the function type of OnHTML callbacks
type HTMLElement ¶
type HTMLElement struct {
// Name is the name of the tag
Name string
Text string
// Request is the request object of the element's HTML document
Request *Request
// Response is the Response object of the element's HTML document
Response *Response
// DOM is the goquery parsed DOM object of the page. DOM is relative
// to the current HTMLElement
DOM *goquery.Selection
// contains filtered or unexported fields
}
HTMLElement is the representation of an HTML tag.
func (*HTMLElement) Attr ¶
func (h *HTMLElement) Attr(k string) string
Attr returns the selected attribute of an HTMLElement, or an empty string if the attribute is not found
type LimitRule ¶
type LimitRule struct {
// DomainRegexp is a regular expression to match against domains
DomainRegexp string
// DomainGlob is a glob pattern to match against domains
DomainGlob string
// Delay is the duration to wait before creating a new request to the matching domains
Delay time.Duration
// Parallelism is the maximum number of concurrent requests allowed to the matching domains
Parallelism int
// contains filtered or unexported fields
}
LimitRule provides connection restrictions for domains. There are two kinds of limitation:
- Parallelism: Set limit for the number of concurrent requests to a domain
- Delay: Set rate limit for a domain (this means no parallelism on the matching domains)
type Request ¶
type Request struct {
// URL is the parsed URL of the HTTP request
URL *url.URL
// Headers contains the Request's HTTP headers
Headers *http.Header
// Ctx is a context between a Request and a Response
Ctx *Context
// Depth is the number of the parents of this request
Depth int
// contains filtered or unexported fields
}
Request is the representation of an HTTP request made by a Collector
func (*Request) AbsoluteURL ¶
AbsoluteURL returns the resolved absolute URL of a URL chunk. It returns an empty string if the URL chunk is a fragment or cannot be parsed
type RequestCallback ¶
type RequestCallback func(*Request)
RequestCallback is the function type of OnRequest callbacks
type Response ¶
type Response struct {
// StatusCode is the status code of the Response
StatusCode int
// Body is the content of the Response
Body []byte
// Ctx is a context between a Request and a Response
Ctx *Context
// Request is the Request object of the response
Request *Request
// Headers contains the Response's HTTP headers
Headers *http.Header
}
Response is the representation of an HTTP response received by a Collector
type ResponseCallback ¶
type ResponseCallback func(*Response)
ResponseCallback is the function type of OnResponse callbacks
Directories ¶
| Path | Synopsis |
|---|---|
| examples | |
| examples/basic (command) | |
| examples/coursera_courses (command) | |
| examples/max_depth (command) | |
| examples/parallel (command) | |
| examples/rate_limit (command) | |
| examples/request_context (command) | |