Crawler alternatives and similar packages
Based on the "HTTP" category.
Alternatively, view Crawler alternatives based on common mentions on social networks and blogs.
- mint: Functional HTTP client for Elixir with support for HTTP/1 and HTTP/2 🌱
- PlugAttack: A plug building toolkit for blocking and throttling abusive requests.
- spell: A Web Application Messaging Protocol (WAMP) client implementation in Elixir. WAMP is an open standard WebSocket subprotocol that provides two application messaging patterns in one unified protocol: Remote Procedure Calls + Publish & Subscribe: http://wamp.ws/
- web_socket: An exploration into a stand-alone library for Plug applications to easily adopt WebSockets.
- http_proxy: An HTTP proxy in Elixir that listens on multiple ports and forwards requests to the configured URIs.
- explode: An easy utility for responding with standard HTTP/JSON error payloads in Plug- and Phoenix-based applications.
- Mechanize: Build web scrapers and automate interaction with websites in Elixir with ease!
- ivar: An adapter-based HTTP client that provides the ability to build composable HTTP requests.
- fuzzyurl: An Elixir library for non-strict parsing, manipulation, and wildcard matching of URLs.
- SpiderMan: A fast, high-level web crawling & scraping framework for Elixir, based on Broadway.
- lhttpc: A backwards-compatibility user and repo for what used to live here.
- http_digex: An HTTP Digest Auth library for creating the auth header used with HTTP Digest Authentication.
- Ralitobu.Plug: An Elixir Plug for Ralitobu, the rate limiter with token bucket algorithm.
README
Crawler
A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.
Features
- Crawl assets (JavaScript, CSS and images).
- Save to disk.
- Hook for scraping content.
- Restrict crawlable domains, paths or content types.
- Limit concurrent crawlers.
- Limit rate of crawling.
- Set the maximum crawl depth.
- Set timeouts.
- Set retry strategy.
- Set the crawler's user agent.
- Manually pause/resume/stop the crawler.
Architecture
Below is a very high-level architecture diagram demonstrating how Crawler works.
Usage
Crawler.crawl("http://elixir-lang.org", max_depths: 2)
There are several ways to access the crawled page data:
- Use `Crawler.Store` (see the sketch after this list)
- Tap into the registry, [Crawler.Store.DB](lib/crawler/store.ex)
- Use your own scraper
- If the `:save_to` option is set, pages will be saved to disk in addition to the above-mentioned places
- Provide your own custom parser and manage how data is stored and accessed yourself
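For instance, a minimal sketch of reading a page back out of the store after a crawl. The `Crawler.Store.find/1` lookup and the page struct fields shown are assumptions based on [lib/crawler/store.ex](lib/crawler/store.ex), so verify the exact API there:

```elixir
# Minimal sketch -- Crawler.Store.find/1 and the page fields are assumptions
# based on lib/crawler/store.ex; check that module for the exact lookup API.
Crawler.crawl("http://elixir-lang.org", max_depths: 1)

# Once the crawl has finished, look the page up by its URL.
page = Crawler.Store.find("http://elixir-lang.org")

if page, do: IO.puts(page.body)
```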
Configurations
Option | Type | Default Value | Description |
---|---|---|---|
`:assets` | list | `[]` | Whether to fetch any asset files; available options: `"css"`, `"js"`, `"images"`. |
`:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
`:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
`:interval` | integer | `0` | Rate limit control: the number of milliseconds to wait before crawling more pages. Defaults to `0`, which is effectively no rate limit. |
`:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
`:timeout` | integer | `5000` | Timeout value for fetching a page, in ms. Can also be set to `:infinity`, useful when combined with `Crawler.pause/1`. |
`:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent by the fetch requests. |
`:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful for restricting crawlable domains, paths or content types. |
`:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful for retrying failed crawls. |
`:modifier` | module | `Crawler.Fetcher.Modifier` | Custom modifier, useful for adding custom request headers or options. |
`:scraper` | module | `Crawler.Scraper` | Custom scraper, useful for scraping content as soon as the parser parses it. |
`:parser` | module | `Crawler.Parser` | Custom parser, useful for handling parsing differently or adding extra functionality. |
`:encode_uri` | boolean | `false` | When set to `true`, applies `URI.encode` to the URL to be crawled. |
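Putting a few of these options together, a crawl that saves pages and assets to disk with a modest rate limit might look like this (the path and values are purely illustrative):

```elixir
Crawler.crawl("http://elixir-lang.org",
  assets: ["css", "js", "images"],
  save_to: "/tmp/crawls",        # illustrative path
  workers: 5,
  interval: 500,                 # wait 500ms between page fetches
  max_depths: 2,
  timeout: 10_000,
  user_agent: "MyCrawler/1.0"
)
```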
Custom Modules
It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:
Retrier
See [Crawler.Fetcher.Retrier](lib/crawler/fetcher/retrier.ex).
Crawler uses ElixirRetry's exponential backoff strategy by default.
```elixir
defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
```
URL Filter
See [Crawler.Fetcher.UrlFilter](lib/crawler/fetcher/url_filter.ex).
```elixir
defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
```
Scraper
See [Crawler.Scraper](lib/crawler/scraper.ex).
```elixir
defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
```
Parser
See [Crawler.Parser](lib/crawler/parser.ex).
```elixir
defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end
```
Modifier
See [Crawler.Fetcher.Modifier](lib/crawler/fetcher/modifier.ex).
```elixir
defmodule CustomModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec
end
```
Pause / Resume / Stop Crawler
Crawler provides `pause/1`, `resume/1` and `stop/1`, see below.
```elixir
{:ok, opts} = Crawler.crawl("http://elixir-lang.org")

Crawler.pause(opts)
Crawler.resume(opts)
Crawler.stop(opts)
```
Please note that when pausing Crawler, you need to set a large enough `:timeout` (or even set it to `:infinity`), otherwise the parser would time out due to unprocessed links.
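For example, a crawl that is meant to be paused and resumed later might be started with an unlimited timeout:

```elixir
{:ok, opts} = Crawler.crawl("http://elixir-lang.org", timeout: :infinity)

Crawler.pause(opts)
# ... do other work, then pick the crawl back up ...
Crawler.resume(opts)
```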
API Reference
Please see https://hexdocs.pm/crawler.
Changelog
Please see [CHANGELOG.md](CHANGELOG.md).
License
Licensed under MIT.
*Note that all licence references and agreements mentioned in the Crawler README section above are relevant to that project's source code only.*