Crawler alternatives and similar packages
Based on the "HTTP" category.

- mint - Functional HTTP client for Elixir with support for HTTP/1 and HTTP/2 🌱
- PlugAttack - A plug building toolkit for blocking and throttling abusive requests
- spell - A Web Application Messaging Protocol (WAMP) client implementation in Elixir. WAMP is an open standard WebSocket subprotocol that provides two application messaging patterns in one unified protocol: Remote Procedure Calls + Publish & Subscribe: http://wamp.ws/
- web_socket - An exploration into a stand-alone library for Plug applications to easily adopt WebSockets.
- http_proxy - An HTTP proxy in Elixir. Waits for requests on multiple ports and forwards them to the corresponding URIs.
- explode - An easy utility for responding with standard HTTP/JSON error payloads in Plug- and Phoenix-based applications
- Mechanize - Build web scrapers and automate interaction with websites in Elixir with ease!
- SpiderMan - A fast, high-level web crawling & scraping framework for Elixir, based on Broadway.
- ivar - An adapter-based HTTP client that provides the ability to build composable HTTP requests.
- fuzzyurl - An Elixir library for non-strict parsing, manipulation, and wildcard matching of URLs.
- http_digex - An HTTP Digest Auth library to create the auth header to be used with HTTP Digest Authentication
- lhttpc - What used to be here -- this is a backwards-compat user and repo
- Ralitobu.Plug - Elixir Plug for Ralitobu, the rate limiter with token bucket algorithm
Crawler
A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.
Features
- Crawl assets (JavaScript, CSS and images).
- Save to disk.
- Hook for scraping content.
- Restrict crawlable domains, paths or content types.
- Limit concurrent crawlers.
- Limit rate of crawling.
- Set the maximum crawl depth.
- Set timeouts.
- Set retries strategy.
- Set crawler's user agent.
- Manually pause/resume/stop the crawler.
Architecture
Below is a very high-level architecture diagram demonstrating how Crawler works.
Usage
```elixir
Crawler.crawl("http://elixir-lang.org", max_depths: 2)
```
There are several ways to access the crawled page data:

- Use `Crawler.Store`
- Tap into the registry [`Crawler.Store.DB`](lib/crawler/store.ex)
- Use your own scraper
- If the `:save_to` option is set, pages will be saved to disk in addition to the above mentioned places
- Provide your own custom parser and manage how data is stored and accessed yourself
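As a minimal sketch of the first approach: the snippet below assumes a `Crawler.Store.find/1` lookup that returns a page struct with a `body` field - the exact function name and return shape are assumptions here, so check https://hexdocs.pm/crawler for the actual API.

```elixir
# Kick off a crawl, then read a page back from the store.
Crawler.crawl("http://elixir-lang.org", max_depths: 2)

# Later, once the page has been fetched (function name assumed;
# see https://hexdocs.pm/crawler for the real lookup API):
page = Crawler.Store.find("http://elixir-lang.org")
if page, do: IO.puts(byte_size(page.body))
```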
Configurations
Option | Type | Default Value | Description |
---|---|---|---|
`:assets` | list | `[]` | Whether to fetch any asset files, available options: `"css"`, `"js"`, `"images"`. |
`:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
`:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
`:interval` | integer | `0` | Rate limit control - number of milliseconds before crawling more pages; the default `0` is effectively no rate limit. |
`:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
`:timeout` | integer | `5000` | Timeout value for fetching a page, in ms. Can also be set to `:infinity`, useful when combined with `Crawler.pause/1`. |
`:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent by the fetch requests. |
`:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful for restricting crawlable domains, paths or content types. |
`:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful for retrying failed crawls. |
`:modifier` | module | `Crawler.Fetcher.Modifier` | Custom modifier, useful for adding custom request headers or options. |
`:scraper` | module | `Crawler.Scraper` | Custom scraper, useful for scraping content as soon as the parser parses it. |
`:parser` | module | `Crawler.Parser` | Custom parser, useful for handling parsing differently or adding extra functionality. |
`:encode_uri` | boolean | `false` | When set to `true`, applies `URI.encode/1` to the URL before it is crawled. |
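Putting several of the options above together (the values are illustrative, not recommendations):

```elixir
Crawler.crawl("http://elixir-lang.org",
  max_depths: 2,              # follow links at most 2 levels deep
  workers: 5,                 # at most 5 concurrent workers
  interval: 500,              # wait 500ms between pages
  timeout: 10_000,            # give up on a page after 10s
  assets: ["css", "images"],  # also fetch stylesheets and images
  save_to: "/tmp/crawled"     # persist pages to disk
)
```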
Custom Modules
It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:
Retrier
See [`Crawler.Fetcher.Retrier`](lib/crawler/fetcher/retrier.ex).

Crawler uses ElixirRetry's exponential backoff strategy by default.

```elixir
defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
```
URL Filter
See [`Crawler.Fetcher.UrlFilter`](lib/crawler/fetcher/url_filter.ex).

```elixir
defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
```
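A sketch of a filter that restricts crawling to a single domain. The `filter/2` callback name and `{:ok, boolean}` return shape are assumptions based on typical behaviour specs - verify them against the actual Spec in [lib/crawler/fetcher/url_filter.ex](lib/crawler/fetcher/url_filter.ex).

```elixir
defmodule SingleDomainFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Assumed callback shape: return {:ok, true} to crawl the URL,
  # {:ok, false} to skip it. Check the Spec module for the real contract.
  def filter(url, _opts) do
    {:ok, URI.parse(url).host == "elixir-lang.org"}
  end
end
```

It would then be passed in via the `:url_filter` option, e.g. `Crawler.crawl(url, url_filter: SingleDomainFilter)`.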
Scraper
See [`Crawler.Scraper`](lib/crawler/scraper.ex).

```elixir
defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
```
Parser
See [`Crawler.Parser`](lib/crawler/parser.ex).

```elixir
defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end
```
Modifier
See [`Crawler.Fetcher.Modifier`](lib/crawler/fetcher/modifier.ex).

```elixir
defmodule CustomModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec
end
```
Pause / Resume / Stop Crawler
Crawler provides `pause/1`, `resume/1` and `stop/1`, see below.
```elixir
{:ok, opts} = Crawler.crawl("http://elixir-lang.org")

Crawler.pause(opts)
Crawler.resume(opts)
Crawler.stop(opts)
```
Please note that when pausing Crawler, you need to set a large enough `:timeout` (or even set it to `:infinity`), otherwise the parser will time out due to unprocessed links.
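Combining the two points above - an `:infinity` timeout with a manual pause - looks like this:

```elixir
# An :infinity timeout keeps in-flight fetches from timing out
# while the crawler sits paused.
{:ok, opts} = Crawler.crawl("http://elixir-lang.org", timeout: :infinity)

Crawler.pause(opts)
# ... do other work ...
Crawler.resume(opts)
```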
API Reference
Please see https://hexdocs.pm/crawler.
Changelog
Please see [CHANGELOG.md](CHANGELOG.md).
License
Licensed under MIT.