Gammo

A pure Ruby HTML5-compliant parser with CSS selector and XPath 1.0 traversal

Gammo - A pure-Ruby HTML5 parser

Gammo provides a pure-Ruby, HTML5-compliant parser along with CSS selector and XPath support for traversing the DOM tree it builds. Its implementation of the HTML5 parsing algorithm conforms to the WHATWG specification. Given an HTML string, Gammo parses it and builds a DOM tree using the tokenization and tree-construction algorithms defined by the WHATWG parsing algorithm. All of this is implemented without any external dependencies.

The name Gammo is inspired by Gumbo, although a gammo is actually a fried tofu fritter made with vegetables.

require 'gammo'
require 'open-uri'

parser = URI.open('https://google.com') { |f| Gammo.new(f.read) }
document = parser.parse #=> #<Gammo::Node::Document>

puts document.css('title').first.inner_text #=> 'Google'

Overview

Features

Tokenization

Gammo::Tokenizer implements the tokenization algorithm defined by WHATWG. You can get tokens in order by calling Gammo::Tokenizer#next_token.

Here is a simple example that runs only the tokenizer.

def dump_for(token)
  puts "data: #{token.data}, class: #{token.class}"
end

tokenizer = Gammo::Tokenizer.new('<!doctype html><input type="button"><frameset>')
dump_for tokenizer.next_token #=> data: html, class: Gammo::Tokenizer::DoctypeToken
dump_for tokenizer.next_token #=> data: input, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: frameset, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: end of string, class: Gammo::Tokenizer::ErrorToken

The parser described below depends on this tokenizer: it applies the WHATWG parsing algorithm to the tokens, in the order in which the tokenizer emits them.
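
The repeated next_token calls in the example above can be wrapped in a loop. The following is a minimal sketch, not part of Gammo's API: collect_tokens is a hypothetical helper that assumes only what the example shows, namely that next_token returns tokens in order and that the end of input is signalled by an error token. The class that marks the end of input is passed in as a parameter, so the helper works with any object that responds to next_token.

```ruby
# Hypothetical helper (not part of Gammo): collects tokens from any object
# that responds to #next_token, stopping at the first token that is an
# instance of stop_class. The stop token itself is not included.
def collect_tokens(tokenizer, stop_class)
  tokens = []
  loop do
    token = tokenizer.next_token
    break if token.is_a?(stop_class)
    tokens << token
  end
  tokens
end
```

With Gammo loaded, calling collect_tokens(Gammo::Tokenizer.new('<!doctype html><input type="button"><frameset>'), Gammo::Tokenizer::ErrorToken) should return the doctype, input, and frameset tokens from the example above.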

Token types

Each token generated by the tokenizer belongs to one of the following types:

<table>
  <thead>
    <tr> <th>Token type</th> <th>Description</th> </tr>
  </thead>
  <tbody>
    <tr> <td><code>Gammo::Tokenizer::ErrorToken</code></td> <td>Represents an error token; it usually means end-of-string.</td> </tr>
    <tr> <td><code>Gammo::Tokenizer::TextToken</code></td> <td>Represents a text token like "foo", which is the inner text of an element.</td> </tr>
    <tr> <td><code>Gammo::Tokenizer::StartTagToken</code></td> <td>Represents a start tag token like <code>&lt;a&gt;</code>.</td> </tr>
    <tr> <td><code>Gammo::Tokenizer::EndTagToken</code></td> <td>Represents an end tag token like <code>&lt;/a&gt;</code>.</td> </tr>
    <tr> <td><code>Gammo::Tokenizer::SelfClosingTagToken</code></td> <td>Represents a self-closing tag token like <code>&lt;img /&gt;</code>.</td> </tr>
    <tr> <td><code>Gammo::Tokenizer::CommentToken</code></td> <td>Represents a comment token like <code>&lt;!-- comment --&gt;</code>.</td> </tr>
    <tr> <td><code>Gammo::Tokenizer::DoctypeToken</code></td> <td>Represents a doctype token like <code>&lt;!doctype html&gt;</code>.</td> </tr>
  </tbody>
</table>

Parsing

Gammo::Parser implements the tree-construction stage, consuming the tokens produced by the tokenizer described above.

After a successful parse, the parser exposes the root document through its document accessor (the same value that Gammo::Parser#parse returns). From that document, you can traverse the DOM tree constructed by the parser.

require 'gammo'
require 'pp'

document = Gammo.new('<!doctype html><input type="button">').parse

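# Recursively converts a node and its descendants into nested hashes,
# appending each node's hash to strm.
def dump_for(node, strm)
  strm << node.to_h
  # Return the accumulator (not nil) when the node has no children.
  return strm unless node && (child = node.first_child)
  while child
    # Collect the children under the :children key of the current node's hash.
    dump_for(child, (strm.last[:children] ||= []))
    child = child.next_sibling
  end
  strm
end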

pp dump_for(document, [])

Notes

The parser itself only builds the DOM tree. To query the tree with CSS selectors or XPath, as you would with Nokogiri, use the helpers described under DOM Tree Traversal.

Node

Each node generated by the parser belongs to one of the following types:

<table>
  <thead>
    <tr> <th>Node type</th> <th>Description</th> </tr>
  </thead>
  <tbody>
    <tr> <td><code>Gammo::Node::Error</code></td> <td>Represents an error node; it usually means end-of-string.</td> </tr>
    <tr> <td><code>Gammo::Node::Text</code></td> <td>Represents a text node like "foo", which is the inner text of an element.</td> </tr>
    <tr> <td><code>Gammo::Node::Document</code></td> <td>Represents the root document type. It's always returned by <code>Gammo::Parser#document</code>.</td> </tr>
    <tr> <td><code>Gammo::Node::Element</code></td> <td>Represents any HTML element, like <code>&lt;p&gt;</code>.</td> </tr>
    <tr> <td><code>Gammo::Node::Comment</code></td> <td>Represents a comment like <code>&lt;!-- foo --&gt;</code>.</td> </tr>
    <tr> <td><code>Gammo::Node::Doctype</code></td> <td>Represents a doctype like <code>&lt;!doctype html&gt;</code>.</td> </tr>
  </tbody>
</table>

Some nodes, such as Gammo::Node::Element and Gammo::Node::Document, hold references to related nodes, which are available through accessors such as Gammo::Node#next_sibling and Gammo::Node#first_child. In addition, APIs such as Gammo::Node#append_child and Gammo::Node#remove_child, which perform operations defined in the DOM Living Standard, are also provided.
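
The first_child and next_sibling accessors are enough to walk a whole tree. The following is a minimal sketch, not part of Gammo's API: descendants_and_self is a hypothetical helper that assumes both accessors return nil when there is no such node, as the while loop in the parsing example above implies.

```ruby
# Hypothetical helper (not part of Gammo): returns the given node followed by
# all of its descendants in document order (depth-first, pre-order).
# It relies only on #first_child and #next_sibling, and assumes both return
# nil when no such node exists.
def descendants_and_self(node, acc = [])
  acc << node
  child = node.first_child
  while child
    descendants_and_self(child, acc)
    child = child.next_sibling
  end
  acc
end
```

Given a parsed document, descendants_and_self(document).grep(Gammo::Node::Element) would then keep only the element nodes.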

DOM Tree Traversal

CSS selectors and XPath 1.0 are the two ways to traverse the DOM tree built by Gammo.

XPath 1.0 (experimental)

Gammo has its own lexer and parser for XPath 1.0, exposed as a helper on the DOM tree built by Gammo. Here is a simple example:

document = Gammo.new('<!doctype html><input type="button">').parse
node_set = document.xpath('//input[@type="button"]') #=> "<Gammo::XPath::NodeSet>"

node_set.length #=> 1
node_set.first #=> "<Gammo::Node::Element>"

Since this is implemented from scratch, Gammo provides XPath support as a highly experimental feature. Please file an issue if you find a bug.

Example

Before going into the details of XPath support, let's look at a few simple examples. Consider the following sample HTML and the DOM tree built from it:

document = Gammo.new(<<-EOS).parse
<!DOCTYPE html>
<html>
<head>
</head>
<body>
  <h1>namusyaka.com</h1>
  <p class="description">Here is a sample web site.</p>
  <ul>
    <li>hello</li>
    <li>world</li>
  </ul>
  <ul id="links">
    <li>Google <a href="https://google.com/">google.com</a></li>
    <li>GitHub <a href="https://github.com/namusyaka">github.com/namusyaka</a></li>
  </ul>
</body>
</html>
EOS

The following XPath expression selects all li elements and prints their text content:

document.xpath('//li').each do |elm|
  puts elm.inner_text
end
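
To keep the matched text instead of printing it, the same loop can fill an array. The following is a minimal sketch, not part of Gammo's API: inner_texts is a hypothetical helper that relies only on the each and inner_text calls used above, so it makes no assumption about which other collection methods the node set supports.

```ruby
# Hypothetical helper (not part of Gammo): gathers the inner text of every
# node in a node set into a plain Array. It uses only #each on the set and
# #inner_text on each node, as in the example above.
def inner_texts(node_set)
  texts = []
  node_set.each { |node| texts << node.inner_text }
  texts
end
```

For the sample document, inner_texts(document.xpath('//li')) would return one string per li element, in document order.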

The following XPath expression selects all li elements directly under the ul element whose id attribute is links:

document.xpath('//ul[@id="links"]/li').
