# Kimuraframework
Write web scrapers in Ruby using a clean, AI-assisted DSL. Kimurai uses AI to figure out where the data lives, then caches the selectors and scrapes with pure Ruby. Get the intelligence of an LLM without the per-request latency or token costs:
```ruby
# google_spider.rb
require 'kimurai'

class GoogleSpider < Kimurai::Base
  @start_urls = ['https://www.google.com/search?q=web+scraping+ai']
  @delay = 1

  def parse(response, url:, data: {})
    results = extract(response) do
      array :organic_results do
        object do
          string :title
          string :snippet
          string :url
        end
      end

      array :sponsored_results do
        object do
          string :title
          string :snippet
          string :url
        end
      end

      array :people_also_search_for, of: :string
      string :next_page_link
      number :current_page_number
    end

    save_to 'google_results.json', results, format: :json

    if results[:next_page_link] && results[:current_page_number] < 3
      request_to :parse, url: absolute_url(results[:next_page_link], base: url)
    end
  end
end

GoogleSpider.crawl!
```
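If you're curious how a block like `extract(response) do ... end` can turn declarations into a schema, here is a toy sketch of the general technique: `instance_eval` the block against a builder object that records each declared field. This is purely illustrative, not Kimurai's actual internals — the class and method bodies below are assumptions.

```ruby
# Toy schema collector in the spirit of the `extract` DSL. NOT Kimurai's
# real implementation; it only shows how instance_eval can gather the
# declared fields into a plain hash.
class SchemaBuilder
  attr_reader :fields

  def initialize
    @fields = {}
  end

  def string(name)
    @fields[name] = :string
  end

  def number(name)
    @fields[name] = :number
  end

  # Supports both `array :name, of: :string` and `array :name do ... end`
  def array(name, of: nil, &block)
    @fields[name] = of ? [of] : [instance_eval(&block)]
  end

  # `object do ... end` collects nested fields into their own hash
  def object(&block)
    sub = SchemaBuilder.new
    sub.instance_eval(&block)
    sub.fields
  end
end

schema = SchemaBuilder.new
schema.instance_eval do
  array :organic_results do
    object do
      string :title
      string :url
    end
  end
  array :people_also_search_for, of: :string
  string :next_page_link
end
# schema.fields now mirrors the declared structure as a hash
```

The payoff of this style is that the schema is plain Ruby data by the time the spider runs, so it can be serialized, diffed, or handed to an LLM prompt.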
How it works:

- On the first request, `extract` sends the HTML plus your schema to an LLM
- The LLM generates XPath selectors and caches them in `google_spider.json`
- All subsequent requests use the cached XPath: zero AI calls, pure fast Ruby extraction
- Supports OpenAI, Anthropic, Gemini, or local LLMs via Nukitori
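The caching step above boils down to a "call once, then read from disk" pattern. Here is a minimal sketch of that idea using only the Ruby standard library — the function name, cache file name, and the sample XPath are all hypothetical, and the block stands in for the LLM call:

```ruby
require 'json'
require 'tmpdir'

# Sketch of selector caching: the expensive step (the block, standing in
# for an LLM call) runs only when no cache file exists; later lookups
# read the cached selectors from disk. Illustrative, not Kimurai internals.
def cached_selectors(cache_path)
  return JSON.parse(File.read(cache_path)) if File.exist?(cache_path)

  selectors = yield # e.g. ask the LLM to generate XPath for the schema
  File.write(cache_path, JSON.generate(selectors))
  selectors
end

llm_calls = 0
path = File.join(Dir.mktmpdir, 'google_spider_selectors.json')

2.times do
  cached_selectors(path) do
    llm_calls += 1
    { 'next_page_link' => "//a[@id='pnnext']/@href" } # hypothetical selector
  end
end
# llm_calls is 1: the second lookup hit the cache
```

Because the cache is an ordinary JSON file, you can check it into version control or hand-edit a selector when a site changes slightly, without re-running the AI step.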
## Traditional Mode
Prefer writing your own selectors? Kimurai works great as a traditional scraper too — with headless antidetect Chromium, Firefox, or simple HTTP requests:
```ruby
# github_spider.rb
require 'kimurai'

class GithubSpider < Kimurai::Base
  @engine = :chrome
  @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
  @delay = 3..5

  def parse(response, url:, data: {})
    response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
    end

    if next_page = response.at_xpath("//a[@rel='next']")
      request_to :parse, url: absolute_url(next_page[:href], base: url)
    end
  end

  def parse_repo_page(response, url:, data: {})
    item = {}

    item[:owner] = response.xpath("//a[@rel='author']").text.squish
    item[:repo_name] = response.xpath("//strong[@itemprop='name']").text.squish
    item[:repo_url] = url
    item[:description] = response.xpath("//div[h2[text()='About']]/p").text.squish
    item[:tags] = response.xpath("//div/a[contains(@title, 'Topic')]").map { |a| a.text.squish }
    item[:watch_count] = response.xpath("//div/h3[text()='Watchers']/following-sibling::div[1]/a/strong").text.squish
    item[:star_count] = response.xpath("//div/h3[text()='Stars']/following-sibling::div[1]/a/strong").text.squish
    item[:fork_count] = response.xpath("//div/h3[text()='Forks']/following-sibling::div[1]/a/strong").text.squish
    item[:last_commit] = response.xpath("//div[@data-testid='latest-commit-details']//relative-time/text()").text.squish

    save_to "results.json", item, format: :pretty_json
  end
end

GithubSpider.crawl!
```
<details>
<summary>Run: <code>$ ruby github_spider.rb</code></summary>
```
$ ruby github_spider.rb
I, [2025-12-16 12:15:48] INFO -- github_spider: Spider: started: github_spider
I, [2025-12-16 12:15:48] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
I, [2025-12-16 12:16:01] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
I, [2025-12-16 12:16:01] INFO -- github_spider: Info: visits: requests: 1, responses: 1
I, [2025-12-16 12:16:01] INFO -- github_spider: Browser: started get request to: https://github.com/sparklemotion/mechanize
I, [2025-12-16 12:16:06] INFO -- github_spider: Browser: finished get request to: https://github.com/sparklemotion/mechanize
I, [2025-12-16 12:16:06] INFO -- github_spider: Info: visits: requests: 2, responses: 2
I, [2025-12-16 12:16:06] INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector
I, [2025-12-16 12:16:11] INFO -- github_spider: Browser: finished get request to: https://github.com/jaimeiniesta/metainspector
I, [2025-12-16 12:16:11] INFO -- github_spider: Info: visits: requests: 3, responses: 3
I, [2025-12-16 12:16:11] INFO -- github_spider: Browser: started get request to: https://github.com/Germey/AwesomeWebScraping
I, [2025-12-16 12:16:13] INFO -- github_spider: Browser: finished get request to: https://github.com/Germey/AwesomeWebScraping
I, [2025-12-16 12:16:13] INFO -- github_spider: Info: visits: requests: 4, responses: 4
I, [2025-12-16 12:16:13] INFO -- github_spider: Browser: started get request to: https://github.com/vifreefly/kimuraframework
I, [2025-12-16 12:16:17] INFO -- github_spider: Browser: finished get request to: https://github.com/vifreefly/kimuraframework
...
```
</details>
<details>
<summary>results.json</summary>
```json
[
  {
    "owner": "sparklemotion",
    "repo_name": "mechanize",
    "repo_url": "https://github.com/sparklemotion/mechanize",
    "description": "Mechanize is a ruby library that makes automated web interaction easy.",
    "tags": ["ruby", "web", "scraping"],
    "watch_count": "79",
    "star_count": "4.4k",
    "fork_count": "480",
    "last_commit": "Sep 30, 2025",
    "position": 1
  },
  {
    "owner": "jaimeiniesta",
    "repo_name": "metainspector",
    "repo_url": "https://github.com/jaimeiniesta/metainspector",
    "description": "Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...",
    "tags": [],
    "watch_count": "20",
    "star_count": "1k",
    "fork_count": "166",
    "last_commit": "Oct 8, 2025",
    "position": 2
  },
  {
    "owner": "Germey",
    "repo_name": "AwesomeWebScraping",
    "repo_url": "https://github.com/Germey/AwesomeWebScraping",
    "description": "List of libraries, tools and APIs for web scraping and data processing.",
    "tags": ["javascript", "ruby", "python", "golang", "php", "awesome", "captcha", "proxy", "web-scraping", "aswsome-list"],
    "watch_count": "5",
    "star_count": "253",
    "fork_count": "33",
    "last_commit": "Apr 5, 2024",
    "position": 3
  },
  {
    "owner": "vifreefly",
    "repo_name": "kimuraframework",
    "repo_url": "https://github.com/vifreefly/kimuraframework",
    "description": "Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites",
    "tags": ["crawler", "scraper", "scrapy", "headless-chrome", "kimurai"],
    "watch_count": "28",
    "star_count": "1k",
    "fork_count": "158",
    "last_commit": "Dec 12, 2025",
    "position": 4
  },
  // ...
  {
    "owner": "citixenken",
    "repo_name": "web_scraping_with_ruby",
    "repo_url": "https://github.com/citixenken/web_scraping_with_ruby",
    "description": "",
    "tags": [],
    "watch_count": "1",
    "star_count": "0",
    "fork_count": "0",
    "last_commit": "Aug 29, 2022",
    "position": 118
  }
]
```
</details><br>
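The GithubSpider above leans on `absolute_url` to turn relative hrefs like `/sparklemotion/mechanize` into full URLs before following them. A minimal stdlib equivalent (a sketch of the idea, not necessarily the framework's exact code) is RFC 3986 reference resolution via `URI.join`:

```ruby
require 'uri'

# Resolve an href against the page it was found on, the way a browser would.
# Sketch of what a helper like Kimurai's absolute_url does.
def absolute_url(href, base:)
  URI.join(base, href).to_s
end

resolved = absolute_url('/sparklemotion/mechanize',
                        base: 'https://github.com/search?q=ruby+web+scraping')
# An absolute path replaces the base's path and query entirely,
# yielding https://github.com/sparklemotion/mechanize
```

Note that resolution drops the base URL's query string for absolute-path hrefs, which is exactly what you want when following search-result links.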
Okay, that was easy. How about JavaScript-rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:
```ruby
# infinite_scroll_spider.rb
require 'kimurai'

class InfiniteScrollSpider < Kimurai::Base
  @engine = :chrome
  @start_urls = ["https://infinite-scroll.com/demo/full-page/"]

  def parse(response, url:, data: {})
    posts_headers_path = "//article/h2"
    count = response.xpath(posts_headers_path).count

    loop do
      browser.execute_script("window.scrollBy(0,10000)") ; sleep 2
      response = browser.current_response

      new_count = response.xpath(posts_headers_path).count
      if count == new_count
        logger.info "> Pagination is done" and break
      else
        count = new_count
        logger.info "> Continue scrolling, current posts count is #{count}..."
      end
    end

    posts_headers = response.xpath(posts_headers_path).map(&:text)
    logger.info "> All posts from page: #{posts_headers.join('; ')}"
  end
end

InfiniteScrollSpider.crawl!
```
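The loop in `parse` is a general "scroll until the item count stops growing" pattern, independent of any particular browser driver. A browser-free sketch of just that control flow, where the block stands in for counting `//article/h2` nodes after each scroll (the helper name and the fake counts are invented for illustration):

```ruby
# Repeat until the observed count stops changing. The block plays the role
# of "scroll, wait, count //article/h2 in browser.current_response".
def scroll_until_stable
  count = yield # initial count before any scrolling
  loop do
    new_count = yield # count again after a simulated scroll
    break if new_count == count
    count = new_count
  end
  count
end

# Fake per-scroll post counts, mimicking the log output below:
snapshots = [5, 9, 11, 13, 15, 15].each
final = scroll_until_stable { snapshots.next }
# final is 15: the loop stops once two consecutive counts match
```

In a real spider you would also cap the number of iterations, since a page that keeps loading forever would otherwise never terminate the loop.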
<details>
<summary>Run: <code>$ ruby infinite_scroll_spider.rb</code></summary>
```
$ ruby infinite_scroll_spider.rb
I, [2025-12-16 12:47:05] INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
I, [2025-12-16 12:47:05] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
I, [2025-12-16 12:47:09] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
I, [2025-12-16 12:47:09] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
I, [2025-12-16 12:47:11] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 5...
I, [2025-12-16 12:47:13] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 9...
I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 11...
I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from pa
...
```
</details>