# Spidr

## Description

Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links, or infinitely. Spidr is designed to be fast and easy to use.

## Features
- Follows:
  - `a` tags.
  - `iframe` tags.
  - `frame` tags.
  - Cookie protected links.
  - HTTP 300, 301, 302, 303 and 307 Redirects.
  - Meta-Refresh Redirects.
  - HTTP Basic Auth protected links.
- Black-list or white-list URLs based upon:
  - URL scheme.
  - Host name.
  - Port number.
  - Full link.
  - URL extension.
- Optional `/robots.txt` support.
- Provides callbacks for:
  - Every visited Page.
  - Every visited URL.
  - Every visited URL that matches a specified pattern.
  - Every origin and destination URI of a link.
  - Every URL that failed to be visited.
- Provides action methods to:
  - Pause spidering.
  - Skip processing of pages.
  - Skip processing of links.
- Restore the spidering queue and history from a previous session.
- Custom User-Agent strings.
- Custom proxy settings.
- HTTPS support.
## Examples

Start spidering from a URL:

```ruby
Spidr.start_at('https://www.ruby-lang.org/en/') do |agent|
  # ...
end
```
Spider a host:

```ruby
Spidr.host('www.ruby-lang.org') do |agent|
  # ...
end
```
Spider a domain (and any sub-domains):

```ruby
Spidr.domain('ruby-lang.org') do |agent|
  # ...
end
```
Spider a site:

```ruby
Spidr.site('https://www.ruby-lang.org/') do |agent|
  # ...
end
```
Spider multiple hosts:

```ruby
Spidr.start_at('https://www.ruby-lang.org/en/', hosts: ['ruby-lang.org', /.*\.ruby-lang\.org\z/]) do |agent|
  # ...
end
```
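The `hosts:` list above mixes an exact `String` name with a `Regexp` pattern. As a rough illustration of how such a mixed list can distinguish hostnames (this is a sketch of the pattern-matching idea, not Spidr's internal rule matcher), consider:

```ruby
# Illustration only: matching candidate hostnames against a mixed list of
# exact String names and Regexp patterns. The helper name `allowed?` is
# hypothetical; Spidr's internal matching may differ.
hosts = ['ruby-lang.org', /.*\.ruby-lang\.org\z/]

def allowed?(hosts, name)
  hosts.any? do |rule|
    rule.is_a?(Regexp) ? rule.match?(name) : rule == name
  end
end

allowed?(hosts, 'ruby-lang.org')      # exact String match
allowed?(hosts, 'www.ruby-lang.org')  # matches the Regexp
allowed?(hosts, 'example.com')        # matches neither
```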
Do not spider certain links:

```ruby
Spidr.site('https://www.ruby-lang.org/', ignore_links: [%r{\A/blog/}]) do |agent|
  # ...
end
```
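The `%r{\A/blog/}` pattern uses `\A` to anchor the match at the very start of the string, so only links that *begin* with `/blog/` are rejected. A small pure-Ruby sketch of that behavior:

```ruby
# Illustration only: what the %r{\A/blog/} pattern does and does not match.
# \A anchors the match at the start of the string.
pattern = %r{\A/blog/}

pattern.match?('/blog/2024/01/post')  # starts with /blog/
pattern.match?('/en/about/')          # does not match
pattern.match?('/news/blog/')         # contains /blog/ but not at the start
```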
Do not spider links on certain ports:

```ruby
Spidr.site('https://www.ruby-lang.org/', ignore_ports: [8000, 8010, 8080]) do |agent|
  # ...
end
```
Do not spider links blacklisted in robots.txt:

```ruby
Spidr.site('https://www.ruby-lang.org/', robots: true) do |agent|
  # ...
end
```
Print out visited URLs:

```ruby
Spidr.site('https://www.ruby-lang.org/') do |spider|
  spider.every_url { |url| puts url }
end
```
Build a URL map of a site:

```ruby
url_map = Hash.new { |hash,key| hash[key] = [] }

Spidr.site('https://www.ruby-lang.org/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end
```
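The `Hash.new { ... }` default block supplies a fresh `Array` the first time a key is read, so `<<` can be used without `nil` checks. A sketch with toy link data (the paths here are made up for illustration):

```ruby
# Illustration with toy data: the default block creates an empty Array
# per destination on first access, accumulating every origin that links to it.
url_map = Hash.new { |hash, key| hash[key] = [] }

links = [
  ['/en/',      '/en/about/'],
  ['/en/news/', '/en/about/'],
  ['/en/',      '/en/news/']
]

links.each { |origin, dest| url_map[dest] << origin }

url_map['/en/about/']  # => ["/en/", "/en/news/"]
url_map['/en/news/']   # => ["/en/"]
```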
Print out the URLs that could not be requested:

```ruby
Spidr.site('https://www.ruby-lang.org/') do |spider|
  spider.every_failed_url { |url| puts url }
end
```
Find all pages which have broken links:

```ruby
url_map = Hash.new { |hash,key| hash[key] = [] }

spider = Spidr.site('https://www.ruby-lang.org/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

spider.failures.each do |url|
  puts "Broken link #{url} found in:"

  url_map[url].each { |page| puts "  #{page}" }
end
```
Search HTML and XML pages:

```ruby
Spidr.site('https://www.ruby-lang.org/') do |spider|
  spider.every_page do |page|
    puts ">>> #{page.url}"

    page.search('//meta').each do |meta|
      name  = (meta.attributes['name'] || meta.attributes['http-equiv'])
      value = meta.attributes['content']

      puts "  #{name} = #{value}"
    end
  end
end
```
Print out the titles from every page:

```ruby
Spidr.site('https://www.ruby-lang.org/') do |spider|
  spider.every_html_page do |page|
    puts page.title
  end
end
```
Print out every HTTP redirect:

```ruby
Spidr.host('www.ruby-lang.org') do |spider|
  spider.every_redirect_page do |page|
    puts "#{page.url} -> #{page.headers['Location']}"
  end
end
```
Find what kinds of web servers a host is using, by accessing the headers:

```ruby
require 'set'

servers = Set[]

Spidr.host('www.ruby-lang.org') do |spider|
  spider.all_headers do |headers|
    servers << headers['server']
  end
end
```
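A `Set` keeps one entry per distinct server string no matter how many responses repeat it, which is why it suits this kind of survey. A sketch with toy header values (the server names are made up):

```ruby
require 'set'

# Illustration with toy data: duplicate server strings collapse into
# a single Set entry, leaving only the distinct values.
servers = Set[]

['nginx', 'Apache', 'nginx', 'nginx'].each { |s| servers << s }

servers.to_a  # => ["nginx", "Apache"]
```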
Pause the spider on a forbidden page:

```ruby
Spidr.host('www.ruby-lang.org') do |spider|
  spider.every_forbidden_page do |page|
    spider.pause!
  end
end
```
Skip the processing of a page:

```ruby
Spidr.host('www.ruby-lang.org') do |spider|
  spider.every_missing_page do |page|
    spider.skip_page!
  end
end
```
Skip the processing of links:

```ruby
Spidr.host('www.ruby-lang.org') do |spider|
  spider.every_url do |url|
    if url.path.split('/').find { |dir| dir.to_i > 1000 }
      spider.skip_link!
    end
  end
end
```
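The path check above relies on `String#to_i` returning `0` for non-numeric segments, so only directories that are large numbers trigger the skip. A pure-Ruby sketch of that predicate (the helper name `big_numeric_dir?` is hypothetical, introduced only for illustration):

```ruby
# Illustration only: String#to_i returns 0 for non-numeric segments,
# so only numeric directories greater than 1000 satisfy the check.
def big_numeric_dir?(path)
  !path.split('/').find { |dir| dir.to_i > 1000 }.nil?
end

big_numeric_dir?('/product/1234/view')  # => true  ("1234".to_i == 1234)
big_numeric_dir?('/en/about/')          # => false (every segment's to_i is 0)
big_numeric_dir?('/page/42')            # => false (42 is not > 1000)
```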
## Requirements

## Install

```shell
$ gem install spidr
```
## License

See {file:LICENSE.txt} for license information.
