<div align="center"> <a href="https://github.com/vifreefly/kimuraframework"> <img width="312" height="200" src="https://hsto.org/webt/_v/mt/tp/_vmttpbpzbt-y2aook642d9wpz0.png"> </a> <h1>Kimurai: AI-First Web Scraping Framework for Ruby</h1> </div>

Write web scrapers in Ruby using a clean, AI-assisted DSL. Kimurai uses AI to figure out where the data lives, then caches the selectors and scrapes with pure Ruby. Get the intelligence of an LLM without the per-request latency or token costs:

# google_spider.rb
require 'kimurai'

class GoogleSpider < Kimurai::Base
  @start_urls = ['https://www.google.com/search?q=web+scraping+ai']
  @delay = 1

  def parse(response, url:, data: {})
    results = extract(response) do
      array :organic_results do
        object do
          string :title
          string :snippet
          string :url
        end
      end

      array :sponsored_results do
        object do
          string :title
          string :snippet
          string :url
        end
      end

      array :people_also_search_for, of: :string

      string :next_page_link
      number :current_page_number
    end

    save_to 'google_results.json', results, format: :json

    if results[:next_page_link] && results[:current_page_number] < 3
      request_to :parse, url: absolute_url(results[:next_page_link], base: url)
    end
  end
end

GoogleSpider.crawl!

How it works:

  1. On the first request, extract sends the HTML + your schema to an LLM
  2. The LLM generates XPath selectors and caches them in google_spider.json
  3. All subsequent requests use cached XPath — zero AI calls, pure fast Ruby extraction
  4. Supports OpenAI, Anthropic, Gemini, or local LLMs via Nukitori
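The cache-hit path in step 3 can be illustrated with plain stdlib Ruby. This is a conceptual sketch, not Kimurai's actual internals: once selectors exist, extraction is just XPath over the response, with no LLM involved. The field names and sample HTML here are made up for the example.

```ruby
require 'rexml/document'

# Apply a cached { field => xpath } map to an HTML/XML string.
# On a cache hit, this is the whole extraction step — pure Ruby, zero AI calls.
def extract_with_cache(html, cache)
  doc = REXML::Document.new(html)
  cache.each_with_object({}) do |(field, xpath), out|
    out[field] = REXML::XPath.match(doc, xpath).map(&:text)
  end
end

html = '<html><body><h3 class="title">First result</h3><h3 class="title">Second result</h3></body></html>'
# What the LLM would have produced on the first request and written to the cache file:
cached_selectors = { 'titles' => "//h3[@class='title']" }

extract_with_cache(html, cached_selectors)
# → {"titles"=>["First result", "Second result"]}
```

The expensive LLM call happens once per schema; every later run is bounded only by HTTP and XPath evaluation time.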

Traditional Mode

Prefer writing your own selectors? Kimurai works great as a traditional scraper too — with headless antidetect Chromium, Firefox, or simple HTTP requests:

# github_spider.rb
require 'kimurai'

class GithubSpider < Kimurai::Base
  @engine = :chrome
  @start_urls = ["https://github.com/search?q=ruby+web+scraping&type=repositories"]
  @delay = 3..5

  def parse(response, url:, data: {})
    response.xpath("//div[@data-testid='results-list']//div[contains(@class, 'search-title')]/a").each do |a|
      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
    end

    if next_page = response.at_xpath("//a[@rel='next']")
      request_to :parse, url: absolute_url(next_page[:href], base: url)
    end
  end

  def parse_repo_page(response, url:, data: {})
    item = {}

    item[:owner] = response.xpath("//a[@rel='author']").text.squish
    item[:repo_name] = response.xpath("//strong[@itemprop='name']").text.squish
    item[:repo_url] = url
    item[:description] = response.xpath("//div[h2[text()='About']]/p").text.squish
    item[:tags] = response.xpath("//div/a[contains(@title, 'Topic')]").map { |a| a.text.squish }
    item[:watch_count] = response.xpath("//div/h3[text()='Watchers']/following-sibling::div[1]/a/strong").text.squish
    item[:star_count] = response.xpath("//div/h3[text()='Stars']/following-sibling::div[1]/a/strong").text.squish
    item[:fork_count] = response.xpath("//div/h3[text()='Forks']/following-sibling::div[1]/a/strong").text.squish
    item[:last_commit] = response.xpath("//div[@data-testid='latest-commit-details']//relative-time/text()").text.squish

    save_to "results.json", item, format: :pretty_json
  end
end

GithubSpider.crawl!
<details> <summary>Run: <code>$ ruby github_spider.rb</code></summary>
$ ruby github_spider.rb

I, [2025-12-16 12:15:48]  INFO -- github_spider: Spider: started: github_spider
I, [2025-12-16 12:15:48]  INFO -- github_spider: Browser: started get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
I, [2025-12-16 12:16:01]  INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=ruby+web+scraping&type=repositories
I, [2025-12-16 12:16:01]  INFO -- github_spider: Info: visits: requests: 1, responses: 1
I, [2025-12-16 12:16:01]  INFO -- github_spider: Browser: started get request to: https://github.com/sparklemotion/mechanize
I, [2025-12-16 12:16:06]  INFO -- github_spider: Browser: finished get request to: https://github.com/sparklemotion/mechanize
I, [2025-12-16 12:16:06]  INFO -- github_spider: Info: visits: requests: 2, responses: 2
I, [2025-12-16 12:16:06]  INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector
I, [2025-12-16 12:16:11]  INFO -- github_spider: Browser: finished get request to: https://github.com/jaimeiniesta/metainspector
I, [2025-12-16 12:16:11]  INFO -- github_spider: Info: visits: requests: 3, responses: 3
I, [2025-12-16 12:16:11]  INFO -- github_spider: Browser: started get request to: https://github.com/Germey/AwesomeWebScraping
I, [2025-12-16 12:16:13]  INFO -- github_spider: Browser: finished get request to: https://github.com/Germey/AwesomeWebScraping
I, [2025-12-16 12:16:13]  INFO -- github_spider: Info: visits: requests: 4, responses: 4
I, [2025-12-16 12:16:13]  INFO -- github_spider: Browser: started get request to: https://github.com/vifreefly/kimuraframework
I, [2025-12-16 12:16:17]  INFO -- github_spider: Browser: finished get request to: https://github.com/vifreefly/kimuraframework

...
</details> <details> <summary>results.json</summary>
[
  {
    "owner": "sparklemotion",
    "repo_name": "mechanize",
    "repo_url": "https://github.com/sparklemotion/mechanize",
    "description": "Mechanize is a ruby library that makes automated web interaction easy.",
    "tags": ["ruby", "web", "scraping"],
    "watch_count": "79",
    "star_count": "4.4k",
    "fork_count": "480",
    "last_commit": "Sep 30, 2025",
    "position": 1
  },
  {
    "owner": "jaimeiniesta",
    "repo_name": "metainspector",
    "repo_url": "https://github.com/jaimeiniesta/metainspector",
    "description": "Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...",
    "tags": [],
    "watch_count": "20",
    "star_count": "1k",
    "fork_count": "166",
    "last_commit": "Oct 8, 2025",
    "position": 2
  },
  {
    "owner": "Germey",
    "repo_name": "AwesomeWebScraping",
    "repo_url": "https://github.com/Germey/AwesomeWebScraping",
    "description": "List of libraries, tools and APIs for web scraping and data processing.",
    "tags": ["javascript", "ruby", "python", "golang", "php", "awesome", "captcha", "proxy", "web-scraping", "aswsome-list"],
    "watch_count": "5",
    "star_count": "253",
    "fork_count": "33",
    "last_commit": "Apr 5, 2024",
    "position": 3
  },
  {
    "owner": "vifreefly",
    "repo_name": "kimuraframework",
    "repo_url": "https://github.com/vifreefly/kimuraframework",
    "description": "Kimurai is a modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites",
    "tags": ["crawler", "scraper", "scrapy", "headless-chrome", "kimurai"],
    "watch_count": "28",
    "star_count": "1k",
    "fork_count": "158",
    "last_commit": "Dec 12, 2025",
    "position": 4
  },
  // ...
  {
    "owner": "citixenken",
    "repo_name": "web_scraping_with_ruby",
    "repo_url": "https://github.com/citixenken/web_scraping_with_ruby",
    "description": "",
    "tags": [],
    "watch_count": "1",
    "star_count": "0",
    "fork_count": "0",
    "last_commit": "Aug 29, 2022",
    "position": 118
  }
]
</details><br>
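The `absolute_url` helper used in the spiders above resolves a (possibly relative) `href` against the current page URL. Under standard RFC 3986 resolution this is presumably equivalent to stdlib `URI.join`; here is a minimal re-implementation as a sketch (Kimurai's own helper may handle more edge cases):

```ruby
require 'uri'

# Sketch of an absolute_url-style helper: resolve href against the page URL.
# Query strings on the base are dropped and absolute hrefs pass through,
# per standard URI reference resolution.
def absolute_url(href, base:)
  URI.join(base, href).to_s
end

absolute_url('/sparklemotion/mechanize', base: 'https://github.com/search?q=ruby+web+scraping')
# → "https://github.com/sparklemotion/mechanize"
```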

Okay, that was easy. How about JavaScript-rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:

# infinite_scroll_spider.rb
require 'kimurai'

class InfiniteScrollSpider < Kimurai::Base
  @engine = :chrome
  @start_urls = ["https://infinite-scroll.com/demo/full-page/"]

  def parse(response, url:, data: {})
    posts_headers_path = "//article/h2"
    count = response.xpath(posts_headers_path).count

    loop do
      browser.execute_script("window.scrollBy(0,10000)") ; sleep 2
      response = browser.current_response

      new_count = response.xpath(posts_headers_path).count
      if count == new_count
        logger.info "> Pagination is done" and break
      else
        count = new_count
        logger.info "> Continue scrolling, current posts count is #{count}..."
      end
    end

    posts_headers = response.xpath(posts_headers_path).map(&:text)
    logger.info "> All posts from page: #{posts_headers.join('; ')}"
  end
end

InfiniteScrollSpider.crawl!
<details> <summary>Run: <code>$ ruby infinite_scroll_spider.rb</code></summary>
$ ruby infinite_scroll_spider.rb

I, [2025-12-16 12:47:05]  INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
I, [2025-12-16 12:47:05]  INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
I, [2025-12-16 12:47:09]  INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
I, [2025-12-16 12:47:09]  INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
I, [2025-12-16 12:47:11]  INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 5...
I, [2025-12-16 12:47:13]  INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 9...
I, [2025-12-16 12:47:15]  INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 11...
I, [2025-12-16 12:47:17]  INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 13...
I, [2025-12-16 12:47:19]  INFO -- infinite_scroll_spider: > Continue scrolling, current posts count is 15...
I, [2025-12-16 12:47:21]  INFO -- infinite_scroll_spider: > Pagination is done
I, [2025-12-16 12:47:21]  INFO -- infinite_scroll_spider: > All posts from page: ...
</details>
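The `save_to` helper used in the spiders above writes each item to the output file as it is scraped. A hypothetical pure-Ruby approximation (the real implementation may buffer or stream differently, and the `ITEMS` accumulator here is an assumption for the sketch):

```ruby
require 'json'
require 'tempfile'

# Hypothetical approximation of save_to with format: :json / :pretty_json —
# accumulate items in memory and rewrite the file as a JSON array on each call.
ITEMS = []

def save_to(path, item, format: :json)
  ITEMS << item
  json = format == :pretty_json ? JSON.pretty_generate(ITEMS) : JSON.generate(ITEMS)
  File.write(path, json)
end

file = Tempfile.new(['results', '.json'])
save_to(file.path, { owner: 'sparklemotion', repo_name: 'mechanize' }, format: :pretty_json)
save_to(file.path, { owner: 'vifreefly', repo_name: 'kimuraframework' }, format: :pretty_json)
parsed = JSON.parse(File.read(file.path))
```

Each call leaves the file as valid JSON, so a crash mid-crawl still leaves a parseable partial result.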
