# Kimuraframework
Write web scrapers in Ruby using a clean, AI-assisted DSL. Kimurai uses AI to figure out where the data lives, then caches the selectors and scrapes with pure Ruby. Get the intelligence of an LLM without the per-request latency or token costs:
```ruby
# google_spider.rb
require 'kimurai'

class GoogleSpider < Kimurai::Base
  @start_urls = ['https://www.google.com/search?q=web+scraping+ai']
  @delay = 1

  def parse(response, url:, data: {})
    results = extract(response) do
      array :organic_results do
        object do
          string :title
          string :snippet
          string :url
        end
      end

      array :sponsored_results do
        object do
          string :title
          string :snippet
          string :url
        end
      end

      array :people_also_search_for, of: :string
      string :next_page_link
      number :current_page_number
    end

    save_to 'google_results.json', results, format: :json

    if results[:next_page_link] && results[:current_page_number] < 3
      request_to :parse, url: absolute_url(results[:next_page_link], base: url)
    end
  end
end

GoogleSpider.crawl!
```
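If you're curious how a block like `extract(response) do ... end` can turn declarations into a schema, here is a toy sketch of the general technique: `instance_eval` the block against a builder object that records each declared field. This is purely illustrative, not Kimurai's actual internals — the class and method bodies below are assumptions.

```ruby
# Toy schema collector in the spirit of the `extract` DSL. NOT Kimurai's
# real implementation; it only shows how instance_eval can gather the
# declared fields into a plain hash.
class SchemaBuilder
  attr_reader :fields

  def initialize
    @fields = {}
  end

  def string(name)
    @fields[name] = :string
  end

  def number(name)
    @fields[name] = :number
  end

  # Supports both `array :name, of: :string` and `array :name do ... end`
  def array(name, of: nil, &block)
    @fields[name] = of ? [of] : [instance_eval(&block)]
  end

  # `object do ... end` collects nested fields into their own hash
  def object(&block)
    sub = SchemaBuilder.new
    sub.instance_eval(&block)
    sub.fields
  end
end

schema = SchemaBuilder.new
schema.instance_eval do
  array :organic_results do
    object do
      string :title
      string :url
    end
  end
  array :people_also_search_for, of: :string
  string :next_page_link
end
# schema.fields now mirrors the declared structure as a hash
```

The payoff of this style is that the schema is plain Ruby data by the time the spider runs, so it can be serialized, diffed, or handed to an LLM prompt.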
How it works:

- On the first request, `extract` sends the HTML plus your schema to an LLM
- The LLM generates XPath selectors and caches them in `google_spider.json`
- All subsequent requests use the cached XPath: zero AI calls, pure fast Ruby extraction
- Supports OpenAI, Anthropic, Gemini, or local LLMs via Nukitori
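The caching step above boils down to a "call once, then read from disk" pattern. Here is a minimal sketch of that idea using only the Ruby standard library — the function name, cache file name, and the sample XPath are all hypothetical, and the block stands in for the LLM call:

```ruby
require 'json'
require 'tmpdir'

# Sketch of selector caching: the expensive step (the block, standing in
# for an LLM call) runs only when no cache file exists; later lookups
# read the cached selectors from disk. Illustrative, not Kimurai internals.
def cached_selectors(cache_path)
  return JSON.parse(File.read(cache_path)) if File.exist?(cache_path)

  selectors = yield # e.g. ask the LLM to generate XPath for the schema
  File.write(cache_path, JSON.generate(selectors))
  selectors
end

llm_calls = 0
path = File.join(Dir.mktmpdir, 'google_spider_selectors.json')

2.times do
  cached_selectors(path) do
    llm_calls += 1
    { 'next_page_link' => "//a[@id='pnnext']/@href" } # hypothetical selector
  end
end
# llm_calls is 1: the second lookup hit the cache
```

Because the cache is an ordinary JSON file, you can check it into version control or hand-edit a selector when a site changes slightly, without re-running the AI step.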
## Traditional Mode
Prefer writing your own selectors? Kimurai works great as a traditional scraper too — with headless antidetect Chromium, Firefox, or simple HTTP requests:
```ruby
# github_spider.rb
require 'kimurai'

class GithubSpider < Kimurai::Base
  @engine = :chrome
  @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
  @delay = 3..5

  def parse(response, url:, data: {})
    response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
    end

    if next_page = response.at_xpath("//a[@rel='next']")
      request_to :parse, url: absolute_url(next_page[:href], base: url)
    end
  end

  def parse_repo_page(response, url:, data: {})
    item = {}

    item[:owner] = response.xpath("//a[@rel='author']").text.squish
    item[:repo_name] = response.xpath("//strong[@itemprop='name']").text.squish
    item[:repo_url] = url
    item[:description] = response.xpath("//div[h2[text()='About']]/p").text.squish
    item[:tags] = response.xpath("//div/a[contains(@title, 'Topic')]").map { |a| a.text.squish }
    item[:watch_count] = response.xpath("//div/h3[text()='Watchers']/following-sibling::div[1]/a/strong").text.squish
    item[:star_count] = response.xpath("//div/h3[text()='Stars']/following-sibling::div[1]/a/strong").text.squish
    item[:fork_count] = response.xpath("//div/h3[text()='Forks']/following-sibling::div[1]/a/strong").text.squish
    item[:last_commit] = response.xpath("//div[@data-testid='latest-commit-details']//relative-time/text()").text.squish

    save_to "results.json", item, format: :pretty_json
  end
end

GithubSpider.crawl!
```
<details>
<summary>Run: <code>$ ruby github_spider.rb</code></summary>
```
$ ruby github_spider.rb
I, [2025-12-16 12:15:48] INFO -- github_spider: Spider: started: github_spider
I, [2025-12-16 12:15:48] INFO -- github_spider: Browser: started get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
I, [2025-12-16 12:16:01] INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
I, [2025-12-16 12:16:01] INFO -- github_spider: Info: visits: requests: 1, responses: 1
I, [2025-12-16 12:16:01] INFO -- github_spider: Browser: started get request to: https://github.com/sparklemotion/mechanize
I, [2025-12-16 12:16:06] INFO -- github_spider: Browser: finished get request to: https://github.com/sparklemotion/mechanize
I, [2025-12-16 12:16:06] INFO -- github_spider: Info: visits: requests: 2, responses: 2
I, [2025-12-16 12:16:06] INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector
I, [2025-12-16 12:16:11] INFO -- github_spider: Browser: finished get request to: https://github.com/jaimeiniesta/metainspector
I, [2025-12-16 12:16:11] INFO -- github_spider: Info: visits: requests: 3, responses: 3
I, [2025-12-16 12:16:11] INFO -- github_spider: Browser: started get request to: https://github.com/Germey/AwesomeWebScraping
I, [2025-12-16 12:16:13] INFO -- github_spider: Browser: finished get request to: https://github.com/Germey/AwesomeWebScraping
I, [2025-12-16 12:16:13] INFO -- github_spider: Info: visits: requests: 4, responses: 4
I, [2025-12-16 12:16:13] INFO -- github_spider: Browser: started get request to: https://github.com/vifreefly/kimuraframework
I, [2025-12-16 12:16:17] INFO -- github_spider: Browser: finished get request to: https://github.com/vifreefly/kimuraframework
...
```
</details>
<details>
<summary>results.json</summary>
```json
[
  {
    "owner": "sparklemotion",
    "repo_name": "mechanize",
    "repo_url": "https://github.com/sparklemotion/mechanize",
    "description": "Mechanize is a ruby library that makes automated web interaction easy.",
    "tags": ["ruby", "web", "scraping"],
    "watch_count": "79",
    "star_count": "4.4k",
    "fork_count": "480",
    "last_commit": "Sep 30, 2025",
    "position": 1
  },
  {
    "owner": "jaimeiniesta",
    "repo_name": "metainspector",
    "repo_url": "https://github.com/jaimeiniesta/metainspector",
    "description": "Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...",
    "tags": [],
    "watch_count": "20",
    "star_count": "1k",
    "fork_count": "166",
    "last_commit": "Oct 8, 2025",
    "position": 2
  },
  {
    "owner": "Germey",
    "repo_name": "AwesomeWebScraping",
    "repo_url": "https://github.com/Germey/AwesomeWebScraping",
    "description": "List of libraries, tools and APIs for web scraping and data processing.",
    "tags": ["javascript", "ruby", "python", "golang", "php", "awesome", "captcha", "proxy", "web-scraping", "aswsome-list"],
    "watch_count": "5",
    "star_count": "253",
    "fork_count": "33",
    "last_commit": "Apr 5, 2024",
    "position": 3
  },
  {
    "owner": "vifreefly",
    "repo_name": "kimuraframework",
    "repo_url": "https://github.com/vifreefly/kimuraframework",
    "description": "Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites",
    "tags": ["crawler", "scraper", "scrapy", "headless-chrome", "kimurai"],
    "watch_count": "28",
    "star_count": "1k",
    "fork_count": "158",
    "last_commit": "Dec 12, 2025",
    "position": 4
  },
  // ...
  {
    "owner": "citixenken",
    "repo_name": "web_scraping_with_ruby",
    "repo_url": "https://github.com/citixenken/web_scraping_with_ruby",
    "description": "",
    "tags": [],
    "watch_count": "1",
    "star_count": "0",
    "fork_count": "0",
    "last_commit": "Aug 29, 2022",
    "position": 118
  }
]
```
</details><br>
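The GithubSpider above leans on `absolute_url` to turn relative hrefs like `/sparklemotion/mechanize` into full URLs before following them. A minimal stdlib equivalent (a sketch of the idea, not necessarily the framework's exact code) is RFC 3986 reference resolution via `URI.join`:

```ruby
require 'uri'

# Resolve an href against the page it was found on, the way a browser would.
# Sketch of what a helper like Kimurai's absolute_url does.
def absolute_url(href, base:)
  URI.join(base, href).to_s
end

resolved = absolute_url('/sparklemotion/mechanize',
                        base: 'https://github.com/search?q=ruby+web+scraping')
# An absolute path replaces the base's path and query entirely,
# yielding https://github.com/sparklemotion/mechanize
```

Note that resolution drops the base URL's query string for absolute-path hrefs, which is exactly what you want when following search-result links.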
Okay, that was easy. How about JavaScript-rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:
```ruby
# infinite_scroll_spider.rb
require 'kimurai'

class InfiniteScrollSpider < Kimurai::Base
  @engine = :chrome
  @start_urls = ["https://infinite-scroll.com/demo/full-page/"]

  def parse(response, url:, data: {})
    posts_headers_path = "//article/h2"
    count = response.xpath(posts_headers_path).count

    loop do
      browser.execute_script("window.scrollBy(0,10000)") ; sleep 2
      response = browser.current_response

      new_count = response.xpath(posts_headers_path).count
      if count == new_count
        logger.info "> Pagination is done" and break
      else
        count = new_count
        logger.info "> Continue scrolling, current posts count is #{count}..."
      end
    end

    posts_headers = response.xpath(posts_headers_path).map(&:text)
    logger.info "> All posts from page: #{posts_headers.join('; ')}"
  end
end

InfiniteScrollSpider.crawl!
```
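The loop in `parse` is a general "scroll until the item count stops growing" pattern, independent of any particular browser driver. A browser-free sketch of just that control flow, where the block stands in for counting `//article/h2` nodes after each scroll (the helper name and the fake counts are invented for illustration):

```ruby
# Repeat until the observed count stops changing. The block plays the role
# of "scroll, wait, count //article/h2 in browser.current_response".
def scroll_until_stable
  count = yield # initial count before any scrolling
  loop do
    new_count = yield # count again after a simulated scroll
    break if new_count == count
    count = new_count
  end
  count
end

# Fake per-scroll post counts, mimicking the log output below:
snapshots = [5, 9, 11, 13, 15, 15].each
final = scroll_until_stable { snapshots.next }
# final is 15: the loop stops once two consecutive counts match
```

In a real spider you would also cap the number of iterations, since a page that keeps loading forever would otherwise never terminate the loop.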
<details>
<summary>Run: <code>$ ruby infinite_scroll_spider.rb</code></summary>
```
$ ruby infinite_scroll_spider.rb
I, [2025-12-16 12:47:05] INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
I, [2025-12-16 12:47:05] INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
I, [2025-12-16 12:47:09] INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
I, [2025-12-16 12:47:09] INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
I, [2025-12-16 12:47:11] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 5...
I, [2025-12-16 12:47:13] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 9...
I, [2025-12-16 12:47:15] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 11...
I, [2025-12-16 12:47:17] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
I, [2025-12-16 12:47:19] INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > Pagination is done
I, [2025-12-16 12:47:21] INFO -- infinite_scroll_spider: > All posts from pa
...
```
</details>