Crawly alternatives and similar packages
Based on the "HTTP" category.
Alternatively, view Crawly alternatives based on common mentions on social networks and blogs.
- mint: Functional HTTP client for Elixir with support for HTTP/1 and HTTP/2 🌱
- PlugAttack: A plug building toolkit for blocking and throttling abusive requests
- spell: A Web Application Messaging Protocol (WAMP) client implementation in Elixir. WAMP is an open standard WebSocket subprotocol that provides two application messaging patterns in one unified protocol: Remote Procedure Calls + Publish & Subscribe: http://wamp.ws/
- web_socket: An exploration into a stand-alone library for Plug applications to easily adopt WebSockets.
- http_proxy: An HTTP proxy in Elixir. Waits for requests on multiple ports and forwards them to the configured URIs.
- explode: An easy utility for responding with standard HTTP/JSON error payloads in Plug- and Phoenix-based applications
- Mechanize: Build web scrapers and automate interaction with websites in Elixir with ease!
- ivar: An adapter-based HTTP client that provides the ability to build composable HTTP requests.
- fuzzyurl: An Elixir library for non-strict parsing, manipulation, and wildcard matching of URLs.
- SpiderMan: A fast, high-level web crawling & scraping framework for Elixir, based on Broadway.
- http_digex: An HTTP Digest Auth library for creating the auth header used with HTTP Digest Authentication.
- Ralitobu.Plug: An Elixir Plug for Ralitobu, the rate limiter with token bucket algorithm.
README
Crawly
Overview
Crawly is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications like data mining, information processing, or historical archival.
Requirements
- Elixir ~> 1.10
- Works on GNU/Linux, Windows, macOS, and BSD.
Quickstart
- Add Crawly as a dependency:
# mix.exs
defp deps do
  [
    {:crawly, "~> 0.13.0"},
    {:floki, "~> 0.26.0"}
  ]
end
- Fetch dependencies:
$ mix deps.get
- Create a spider:
# lib/crawly_example/books_to_scrape.ex
defmodule BooksToScrape do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://books.toscrape.com/"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://books.toscrape.com/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Parse the response body into a document
    {:ok, document} = Floki.parse_document(response.body)

    # Create items (for pages where items exist)
    items =
      document
      |> Floki.find(".product_pod")
      |> Enum.map(fn x ->
        %{
          title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
          price: Floki.find(x, ".product_price .price_color") |> Floki.text()
        }
      end)

    # Turn the "next page" links into absolute-URL requests
    next_requests =
      document
      |> Floki.find(".next a")
      |> Floki.attribute("href")
      |> Enum.map(fn url ->
        Crawly.Utils.build_absolute_url(url, response.request.url)
        |> Crawly.Utils.request_from_url()
      end)

    %{items: items, requests: next_requests}
  end
end
- Configure Crawly
By default, Crawly does not require any configuration, but you will likely want to configure it to fine-tune your crawls:
# in config.exs
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
  ],
  pipelines: [
    # Validate the fields the spider actually emits (:title and :price),
    # otherwise every item would be dropped for the missing field
    {Crawly.Pipelines.Validate, fields: [:title, :price]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
- Start the crawl:
$ iex -S mix
iex(1)> Crawly.Engine.start_spider(BooksToScrape)
- Results can be seen with:
$ cat /tmp/BooksToScrape.jl
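Each item passes through the pipelines configured above: after validation and duplicate filtering it is JSON-encoded and appended as one line to the .jl file, so the output looks roughly like this (values shown are illustrative):
# /tmp/BooksToScrape.jl (one JSON object per line)
{"title":"A Light in the Attic","price":"£51.77"}
{"title":"Tipping the Velvet","price":"£53.74"}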
Need more help?
I have created a public Telegram channel, so it's now easy to stay connected, ask questions, and get answers faster!
Please join me on: https://t.me/crawlyelixir
Browser rendering
Crawly can be configured so that all fetched pages are browser-rendered, which can be very useful if you need to extract data from pages that have lots of asynchronous elements (for example, parts loaded by AJAX).
You can read more in the Crawly documentation.
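As a minimal sketch, browser rendering can be enabled by swapping the default fetcher for the Splash fetcher that ships with Crawly; this assumes you have a Splash instance running locally on port 8050:
# in config.exs
# Assumes a local Splash instance, e.g. started with:
#   docker run -p 8050:8050 scrapinghub/splash
config :crawly,
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
Because the fetcher is just a configuration entry, the same mechanism lets you plug in any other HTTP client or rendering backend.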
Experimental UI
The CrawlyUI project is an add-on that aims to provide an interface for managing and rapidly developing spiders.
Check out the code on GitHub or try it online at CrawlyUIDemo.
See more at Experimental UI
Roadmap
- [x] Pluggable HTTP client
- [x] Retries support
- [x] Cookies support
- [x] XPath support - can actually be done with meeseeks (see the sketch after this list)
- [ ] Project generators (spiders)
- [ ] UI for jobs management
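As a minimal sketch of the XPath support mentioned above, here is how meeseeks can select elements with XPath instead of Floki's CSS selectors. This assumes {:meeseeks, "~> 0.17"} has been added to deps in mix.exs, and the HTML string is a toy example:
# XPath-based extraction with meeseeks
import Meeseeks.XPath

html = ~s(<article class="product_pod"><h3><a title="Sample Book">Sample...</a></h3></article>)

titles =
  html
  |> Meeseeks.parse()
  |> Meeseeks.all(xpath("//article//h3/a"))
  |> Enum.map(&Meeseeks.attr(&1, "title"))

# => ["Sample Book"]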
Articles
- Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html
- Blog post about using Crawly inside a machine learning project with Tensorflow (Tensorflex): https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html
- Web scraping with Crawly and Elixir. Browser rendering: https://medium.com/@oltarasenko/web-scraping-with-elixir-and-crawly-browser-rendering-afcaacf954e8
- Web scraping with Elixir and Crawly. Extracting data behind authentication: https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13
- What is web scraping, and why might you want to use it?
- Using Elixir and Crawly for price monitoring
- Building a Chrome-based fetcher for Crawly
Example projects
- Blog crawler: https://github.com/oltarasenko/crawly-spider-example
- E-commerce websites: https://github.com/oltarasenko/products-advisor
- Car shops: https://github.com/oltarasenko/crawly-cars
- JavaScript based website (Splash example): https://github.com/oltarasenko/autosites
Contributors
We would gladly accept your contributions!
Documentation
Please find the documentation on HexDocs.
Production usage
Using Crawly in production? Please let us know about your use case!
Copyright and License
Copyright (c) 2019 Oleg Tarasenko
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.