Popularity

9.4

Stable

Activity

7.9

Stars 918

Watchers 32

Forks 89

Last Commit 7 months ago

Monthly Downloads: 162

Programming language: Elixir

License: MIT License

Tags: HTTP

Latest version: v1.1.1

Crawler alternatives and similar packages

Based on the "HTTP" category.

mint

9.6 6.9 Crawler VS mint

Functional HTTP client for Elixir with support for HTTP/1 and HTTP/2 🌱
gun

9.5 5.5 Crawler VS gun

HTTP/1.1, HTTP/2, Websocket client (and more) for Erlang/OTP.

finch

9.5 8.2 Crawler VS finch

Elixir HTTP client, focused on performance
Crawly

9.3 6.6 Crawler VS Crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
PlugAttack

8.4 0.0 Crawler VS PlugAttack

A plug building toolkit for blocking and throttling abusive requests
scrape

8.4 0.0 Crawler VS scrape

Scrape any website, article or RSS/Atom Feed with ease!
neuron

8.2 2.1 Crawler VS neuron

A GraphQL client for Elixir
Ace

8.1 0.0 Crawler VS Ace

HTTP web server and client, supports http1 and http2
webdriver

7.2 0.0 Crawler VS webdriver

WebDriver client for Elixir.
spell

6.3 0.0 Crawler VS spell

DISCONTINUED. Spell is a Web Application Messaging Protocol (WAMP) client implementation in Elixir. WAMP is an open standard WebSocket subprotocol that provides two application messaging patterns in one unified protocol: Remote Procedure Calls + Publish & Subscribe: http://wamp.ws/
web_socket

6.2 0.0 Crawler VS web_socket

An exploration into a stand-alone library for Plug applications to easily adopt WebSockets.
cauldron

5.9 0.0 Crawler VS cauldron

I wonder what kind of Elixir is boiling in there.
river

5.8 0.0 Crawler VS river

An HTTP/2 client for Elixir (a work in progress!)
AbsintheClient

5.8 3.7 Crawler VS AbsintheClient

A GraphQL client designed for Elixir Absinthe.
http_proxy

5.2 0.0 Crawler VS http_proxy

http proxy with Elixir. wait request with multi port and forward to each URIs
bolt

4.3 0.0 Crawler VS bolt

DISCONTINUED. Simple and fast http proxy living in the Erlang VM
sparql_client

4.2 5.0 Crawler VS sparql_client

A SPARQL client for Elixir
explode

4.2 2.7 Crawler VS explode

An easy utility for responding with standard HTTP/JSON error payloads in Plug- and Phoenix-based applications
Mechanize

4.1 10.0 Crawler VS Mechanize

Build web scrapers and automate interaction with websites in Elixir with ease!
mnemonic_slugs

3.6 0.0 Crawler VS mnemonic_slugs

An Elixir library for generating memorable slugs.
SpiderMan

3.5 5.1 Crawler VS SpiderMan

SpiderMan, a fast, high-level web crawling & scraping framework for Elixir, based on Broadway.
etag_plug

3.3 2.8 Crawler VS etag_plug

A simple to use shallow ETag plug
Tube

3.3 0.0 Crawler VS Tube

WebSocket client library written in pure Elixir
ivar

3.3 0.0 Crawler VS ivar

Ivar is an adapter based HTTP client that provides the ability to build composable HTTP requests.
uri_template

3.1 0.0 Crawler VS uri_template

RFC 6570 compliant URI template processor for Elixir
fuzzyurl

3.1 0.0 Crawler VS fuzzyurl

An Elixir library for non-strict parsing, manipulation, and wildcard matching of URLs.
uri_query

3.0 5.2 Crawler VS uri_query

URI encode nested GET parameters and array values in Elixir
httprot

2.7 0.0 Crawler VS httprot

Prot prot prot.
yuri

2.3 0.0 Crawler VS yuri

Elixir module for easier URI manipulation.
http_digex

1.3 0.0 Crawler VS http_digex

HTTP Digest Auth Library to create auth header to be used with HTTP Digest Authentication
lhttpc

1.0 0.0 Crawler VS lhttpc

What used to be here -- this is a backwards-compat user and repo m(
plug_wait1

0.8 0.0 Crawler VS plug_wait1

Plug adapter for the wait1 protocol
Ralitobu.Plug

0.8 0.0 Crawler VS Ralitobu.Plug

Elixir Plug for Ralitobu, the Rate Limiter with Token Bucket algorithm

README

Crawler

A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.

Features

Crawl assets (javascript, css and images).
Save to disk.
Hook for scraping content.
Restrict crawlable domains, paths or content types.
Limit concurrent crawlers.
Limit rate of crawling.
Set the maximum crawl depth.
Set timeouts.
Set the retry strategy.
Set the crawler's user agent.
Manually pause/resume/stop the crawler.

Architecture

Below is a very high-level architecture diagram demonstrating how Crawler works.

Usage

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

There are several ways to access the crawled page data:

Use Crawler.Store.
Tap into the registry directly: [Crawler.Store.DB](lib/crawler/store.ex).
Use your own scraper.
If the :save_to option is set, pages are saved to disk in addition to the places mentioned above.
Provide your own custom parser and manage how data is stored and accessed yourself.

Configurations

| Option | Type | Default Value | Description |
|---|---|---|---|
| `:assets` | list | `[]` | Asset types to fetch. Available options: `"css"`, `"js"`, `"images"`. |
| `:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
| `:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
| `:interval` | integer | `0` | Rate limit control: number of milliseconds to wait before crawling more pages. The default of `0` is effectively no rate limit. |
| `:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
| `:timeout` | integer | `5000` | Timeout for fetching a page, in milliseconds. Can also be set to `:infinity`, which is useful when combined with `Crawler.pause/1`. |
| `:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent by the fetch requests. |
| `:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful for restricting crawlable domains, paths or content types. |
| `:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful for retrying failed crawls. |
| `:modifier` | module | `Crawler.Fetcher.Modifier` | Custom modifier, useful for adding custom request headers or options. |
| `:scraper` | module | `Crawler.Scraper` | Custom scraper, useful for scraping content as soon as the parser parses it. |
| `:parser` | module | `Crawler.Parser` | Custom parser, useful for handling parsing differently or adding extra functionality. |
| `:encode_uri` | boolean | `false` | When set to `true`, applies `URI.encode` to the URL to be crawled. |
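
As a rough illustration of how several of these options combine, the sketch below bundles a rate-limited configuration into a small helper module. The module name `PoliteCrawl` and all of the values are hypothetical; only the option names come from the table above. Calling `run/1` assumes the `:crawler` dependency is available.

```elixir
defmodule PoliteCrawl do
  # Hypothetical helper module. Option names come from the configuration
  # table; the values are illustrative only.
  def options do
    [
      workers: 2,
      # Wait one second before crawling more pages.
      interval: 1_000,
      max_depths: 2,
      timeout: 10_000,
      assets: ["css", "images"],
      user_agent: "MyBot/1.0 (hypothetical)"
    ]
  end

  # Requires the :crawler dependency. Returns {:ok, opts}, as in the
  # Pause / Resume / Stop example.
  def run(url), do: Crawler.crawl(url, options())
end
```

With the dependency installed, `PoliteCrawl.run("http://elixir-lang.org")` would start a crawl with these settings.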

Custom Modules

It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:

Retrier

See [Crawler.Fetcher.Retrier](lib/crawler/fetcher/retrier.ex).

Crawler uses ElixirRetry's exponential backoff strategy by default.

defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
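
A fuller sketch might look like the following. It assumes the behaviour defines a `perform/2` callback that receives a zero-arity fetch function plus the crawl options, and that a failed fetch returns `{:error, reason}`; both are assumptions to check against the source file linked above. The module name `FixedPauseRetrier`, the fixed pause and the attempt limit are hypothetical, and this deliberately swaps the default exponential backoff for a simpler policy.

```elixir
defmodule FixedPauseRetrier do
  # Assumption: the behaviour defines perform/2, taking a zero-arity fetch
  # function and the crawl options.
  @behaviour Crawler.Fetcher.Retrier.Spec

  # Hypothetical policy: at most 3 attempts, with a fixed 100 ms pause.
  @max_attempts 3
  @pause_ms 100

  def perform(fetch_url, _opts), do: attempt(fetch_url, 1)

  defp attempt(fetch_url, n) do
    case fetch_url.() do
      # Assumption: failures are reported as {:error, reason}.
      {:error, _reason} when n < @max_attempts ->
        Process.sleep(@pause_ms)
        attempt(fetch_url, n + 1)

      # Success, or the final failed attempt, is returned unchanged.
      result ->
        result
    end
  end
end
```

It would be enabled by passing `retrier: FixedPauseRetrier` to `Crawler.crawl/2`.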

URL Filter

See [Crawler.Fetcher.UrlFilter](lib/crawler/fetcher/url_filter.ex).

defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
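
For instance, a filter that keeps the crawl on a single host could be sketched as below. It assumes the behaviour defines a `filter/2` callback that receives the URL and the crawl options and returns `{:ok, boolean}`; confirm this against the source file linked above. The module name `SameHostUrlFilter` and the allowed host are hypothetical.

```elixir
defmodule SameHostUrlFilter do
  # Assumption: the behaviour defines filter/2 and expects {:ok, boolean}.
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Hypothetical rule: only follow links on this host.
  @allowed_host "elixir-lang.org"

  def filter(url, _opts) do
    # URI.parse/1 yields a nil host for relative URLs, which are rejected.
    {:ok, URI.parse(url).host == @allowed_host}
  end
end
```

It would be enabled by passing `url_filter: SameHostUrlFilter` to `Crawler.crawl/2`.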

Scraper

See [Crawler.Scraper](lib/crawler/scraper.ex).

defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
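
As an illustration, the sketch below logs each page's title as it is scraped. It assumes the behaviour defines a `scrape/1` callback that receives a page carrying `:url` and `:body` and must return `{:ok, page}`; check the source file linked above before relying on that. The module name `TitleScraper` and the `title/1` helper are hypothetical.

```elixir
defmodule TitleScraper do
  # Assumption: the behaviour defines scrape/1 and expects {:ok, page}.
  @behaviour Crawler.Scraper.Spec

  def scrape(%{url: url, body: body} = page) when is_binary(body) do
    IO.puts("#{url}: #{inspect(title(body))}")
    {:ok, page}
  end

  # Pages without a binary body are passed through untouched.
  def scrape(page), do: {:ok, page}

  # Hypothetical helper: extracts the <title> text with a regex. A real
  # scraper would more likely use an HTML parser.
  def title(body) do
    case Regex.run(~r/<title[^>]*>(.*?)<\/title>/is, body) do
      [_full, text] -> String.trim(text)
      nil -> nil
    end
  end
end
```

It would be enabled by passing `scraper: TitleScraper` to `Crawler.crawl/2`.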

Parser

See [Crawler.Parser](lib/crawler/parser.ex).

defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end

Modifier

See [Crawler.Fetcher.Modifier](lib/crawler/fetcher/modifier.ex).

defmodule CustomModifier do
  @behaviour Crawler.Fetcher.Modifier.Spec
end
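
A minimal sketch of a modifier that adds one request header follows. It assumes the behaviour defines `headers/1` (returning a list of `{name, value}` tuples) and `opts/1` (returning a keyword list of extra request options); verify both against the source file linked above. The module name `LanguageModifier` and the header value are hypothetical.

```elixir
defmodule LanguageModifier do
  # Assumption: the behaviour defines headers/1 and opts/1, each receiving
  # the crawl options.
  @behaviour Crawler.Fetcher.Modifier.Spec

  # Hypothetical extra header sent with every fetch request.
  def headers(_opts), do: [{"Accept-Language", "en"}]

  # No extra request options in this sketch.
  def opts(_opts), do: []
end
```

It would be enabled by passing `modifier: LanguageModifier` to `Crawler.crawl/2`.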

Pause / Resume / Stop Crawler

Crawler provides pause/1, resume/1 and stop/1, as shown below.

{:ok, opts} = Crawler.crawl("http://elixir-lang.org")

Crawler.pause(opts)

Crawler.resume(opts)

Crawler.stop(opts)

Note that when pausing Crawler, you need to set a large enough :timeout (or even set it to :infinity); otherwise the parser will time out due to unprocessed links.
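
The timeout advice can be combined with the calls above as in the following sketch. The module name `PausableCrawl` and its functions are hypothetical wrappers; only `Crawler.crawl/2`, `Crawler.pause/1`, `Crawler.resume/1` and the `:timeout` option come from this README. Running it assumes the `:crawler` dependency is available.

```elixir
defmodule PausableCrawl do
  # Start the crawl with no fetch timeout, so a long pause does not cause
  # timeouts on links that are still waiting to be processed.
  def start(url), do: Crawler.crawl(url, timeout: :infinity)

  # Pause the crawl, wait for the given number of milliseconds, then resume.
  # `opts` is the value returned in {:ok, opts} by start/1.
  def pause_for(opts, ms) do
    Crawler.pause(opts)
    Process.sleep(ms)
    Crawler.resume(opts)
  end
end
```

A typical session would be `{:ok, opts} = PausableCrawl.start("http://elixir-lang.org")` followed by `PausableCrawl.pause_for(opts, 5_000)`.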

API Reference

Please see https://hexdocs.pm/crawler.

Changelog

Please see [CHANGELOG.md](CHANGELOG.md).

License

Licensed under MIT.

*Note that all licence references and agreements mentioned in the Crawler README section above are relevant to that project's source code only.
