Gumbo.jl
Julia wrapper around Google's gumbo C library for parsing HTML
Gumbo.jl is a Julia wrapper around the gumbo library for parsing HTML.
Warning: The underlying C library is currently unmaintained. Use at your own risk.
Getting started is very easy:
julia> using Gumbo
julia> parsehtml("<h1> Hello, world! </h1>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:
<HTML>
<head></head>
<body>
<h1>
Hello, world!
</h1>
</body>
</HTML>
Read on for further documentation.
Installation
using Pkg
Pkg.add("Gumbo")
or activate Pkg mode in the REPL by typing ], and then:
add Gumbo
Basic usage
The workhorse is the parsehtml function, which takes a single
argument, a valid UTF8 string, which is interpreted as HTML data to be
parsed, e.g.:
parsehtml("<h1> Hello, world! </h1>")
Parsing an HTML file named filename can be done using:
julia> parsehtml(read(filename, String))
The result of a call to parsehtml is an HTMLDocument, a type which
has two fields: doctype, which is the doctype of the parsed document
(this will be the empty string if no doctype is provided), and root,
which is a reference to the HTMLElement that is the root of the
document.
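As a small illustration of the two fields described above, the following sketch parses the same snippet used earlier and inspects the result. It uses only the fields and functions this README names; the variable names are arbitrary.

```julia
using Gumbo

doc = parsehtml("<h1> Hello, world! </h1>")

# No doctype was supplied, so the doctype field is the empty string.
isempty(doc.doctype)          # true

# gumbo fills in the missing <html>, <head> and <body> elements.
root = doc.root
tag(root)                     # :HTML
length(root.children)         # 2 (head and body)
```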
Note that gumbo is a very permissive HTML parser, designed to gracefully handle the insanity that passes for HTML out on the wild, wild web. It will return a valid HTML document for any input, doing all sorts of algorithmic gymnastics to twist what you give it into valid HTML.
If you want an HTML validator, this is probably not your library. That
said, parsehtml does take an optional Bool keyword argument, strict,
which, if true, causes an InvalidHTMLError to be thrown if the call to
the gumbo C library produces any errors.
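A minimal sketch of how the strict keyword might be used to check whether gumbo reported any parse errors. The helper name parses_cleanly is hypothetical, and the catch clause is deliberately broad because this sketch does not assume the exact name of the exception type.

```julia
using Gumbo

# Hypothetical helper (not part of Gumbo.jl): returns true when gumbo
# reports no parse errors for the given input, false otherwise.
function parses_cleanly(html::AbstractString)
    try
        parsehtml(html; strict=true)
        return true
    catch
        # Strict mode throws when the C library reports errors. The
        # exception type is not matched here, so any error counts as
        # "not clean".
        return false
    end
end

parses_cleanly("<!DOCTYPE html><html><head><title>t</title></head><body></body></html>")
```

Because gumbo follows the HTML5 parsing rules, even small omissions may count as errors in strict mode, so treat the result as a rough signal rather than a full validation.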
HTML types
This library defines a number of types for representing HTML.
HTMLDocument
HTMLDocument is what is returned from a call to parsehtml. It has a
doctype field, which contains the doctype of the parsed document,
and a root field, which is a reference to the root of the document.
HTMLNodes
A document contains a tree of HTML nodes, which are represented as
subtypes of the HTMLNode abstract type. The first of these is
HTMLElement.
HTMLElement
mutable struct HTMLElement{T} <: HTMLNode
    children::Vector{HTMLNode}
    parent::HTMLNode
    attributes::Dict{String, String}
end
HTMLElement is probably the most interesting and frequently used
type. An HTMLElement is parameterized by a symbol representing its
tag. So an HTMLElement{:a} is a different type from an
HTMLElement{:body}, etc. An empty HTMLElement of a given tag can be
constructed as follows:
julia> HTMLElement(:div)
HTMLElement{:div}:
<div></div>
HTMLElements have a parent field, which refers to another
HTMLNode. parent will always be an HTMLElement, unless the
element has no parent (as is the case with the root of a document), in
which case it will be a NullNode, a special type of HTMLNode which
exists for just this purpose. Empty HTMLElements constructed as in
the example above will also have a NullNode for a parent.
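The parent relationships described above can be checked with a short sketch. It reuses the "Hello, world!" snippet from the start of this README, where the second child of the root is the body element.

```julia
using Gumbo

doc = parsehtml("<h1> Hello, world! </h1>")
body = doc.root[2]                      # second child of the root: <body>

doc.root.parent isa NullNode            # the root has no parent
body.parent isa HTMLElement             # body's parent is the root element
tag(body.parent)                        # :HTML

HTMLElement(:div).parent isa NullNode   # freshly constructed elements have no parent
```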
HTMLElements also have children, which is a vector of
HTMLNode containing the children of this element, and
attributes, which is a Dict mapping attribute names to values.
HTMLElements implement getindex, setindex!, and push!;
indexing into or pushing onto an HTMLElement operates on its
children array.
There are a number of convenience methods for working with HTMLElements:
- tag(elem): get the tag of this element as a symbol
- attrs(elem): return the attributes dict of this element
- children(elem): return the children array of this element
- getattr(elem, name): get the value of attribute name or raise a KeyError. Also supports being called with a default value (getattr(elem, name, default)) or function (getattr(f, elem, name)).
- setattr!(elem, name, value): set the value of attribute name to value
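The methods listed above can be combined to build a small element by hand. This is only a sketch: the URL and attribute names are made up for illustration, and the children are accessed through indexing and the children field rather than any further helpers.

```julia
using Gumbo

link = HTMLElement(:a)

# Set an attribute and append a text child.
setattr!(link, "href", "https://example.com")
push!(link, HTMLText("Example"))

tag(link)                                   # :a
getattr(link, "href")                       # "https://example.com"
getattr(link, "title", "none")              # "none" (attribute missing, default used)
getattr(() -> "computed", link, "title")    # "computed" (fallback function called)
haskey(attrs(link), "href")                 # true

# Indexing operates on the children array.
link[1] isa HTMLText                        # true
length(link.children)                       # 1
```

Calling getattr(link, "title") without a default would raise a KeyError here, since no title attribute was set.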
HTMLText
mutable struct HTMLText <: HTMLNode
    parent::HTMLNode
    text::String
end
Represents text appearing in an HTML document. For example:
julia> doc = parsehtml("<h1> Hello, world! </h1>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:
<HTML>
<head></head>
<body>
<h1>
Hello, world!
</h1>
</body>
</HTML>
julia> doc.root[2][1][1]
HTML Text: Hello, world!
This type is quite simple, just a reference to its parent and the
actual text it represents (this is also accessible by a text
function). You can construct HTMLText instances as follows:
julia> HTMLText("Example text")
HTML Text: Example text
Just as with HTMLElements, the parent of an instance so constructed
will be a NullNode.
Tree traversal
Use the iterators defined in AbstractTrees.jl, e.g.:
julia> using AbstractTrees
julia> using Gumbo
julia> doc = parsehtml("""
<html>
<body>
<div>
<p></p> <a></a> <p></p>
</div>
<div>
<span></span>
</div>
</body>
</html>
""");
julia> for elem in PreOrderDFS(doc.root) println(tag(elem)) end
HTML
head
body
div
p
a
p
div
span
julia> for elem in PostOrderDFS(doc.root) println(tag(elem)) end
head
p
a
p
div
span
div
body
HTML
julia> for elem in StatelessBFS(doc.root) println(tag(elem)) end
HTML
head
body
div
div
p
a
p
span
julia>
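The examples above contain only empty elements. In documents with text, the traversal also visits HTMLText nodes, so it may help to filter by node type before calling element-specific functions such as tag or getattr. A minimal sketch, with a made-up input string:

```julia
using AbstractTrees
using Gumbo

doc = parsehtml("<div><p>one</p><a href=\"/x\">link</a><p>two</p></div>")

# Collect the href of every <a> element, relying on the tag being part
# of the element's type.
links = [getattr(node, "href")
         for node in PreOrderDFS(doc.root)
         if node isa HTMLElement{:a}]

# Collect the contents of every text node, in document order.
texts = [node.text
         for node in PreOrderDFS(doc.root)
         if node isa HTMLText]

println(links)
println(texts)
```

Filtering on HTMLElement{:a} uses dispatch on the tag parameter described in the HTMLElement section; comparing tag(node) == :a after an isa HTMLElement check would work equally well.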
TODOS
- support CDATA
- support comments
