AdvancedHTMLParser
Fast, indexed python HTML parser which builds a DOM node tree and provides the common getElementsBy* functions for scraping, testing, modification, and formatting. It also supports XPath.
AdvancedHTMLParser is an Advanced HTML Parser, with support for adding, removing, modifying, and formatting HTML.
It aims to provide the same interface you would find in a compliant browser through javascript (i.e. all the getElement methods, appendChild, etc.), an XPath implementation, and many more complex and sophisticated features not available through a browser. And most importantly, it's in python!
There are many potential applications, not limited to:
- Webpage Scraping / Data Extraction
- Testing and Validation
- HTML Modification/Insertion
- Outputting your website
- Debugging
- HTML Document generation
- Web Crawling
- Formatting HTML documents or web pages
It is especially good for servlets/webpages. It is quick to take an expertly crafted page in raw HTML/CSS, have your servlets ingest it with AdvancedHTMLParser, and create/insert data elements into the existing view using a simple and well-known interface (javascript-like + HTML DOM).
Another useful scenario is creating automated testing suites, which can operate much more quickly and reliably (and at a deeper function level) than in-browser testing suites.
Full API
The full API documentation can be found at http://htmlpreview.github.io/?https://github.com/kata198/AdvancedHTMLParser/blob/master/doc/AdvancedHTMLParser.html?vers=8.1.8
Examples
Various examples can be found in the "tests" directory. A very old, simple example can also be found as "example.py" in the root directory.
Short Doc
The Package and Modules
The top-level module in this package is "AdvancedHTMLParser."
import AdvancedHTMLParser
Most everything "public" is available through this top-level module, but some corner-case usages may require importing from a submodule. All of these associations can be found through the pydocs.
For example, to access AdvancedTag, the recommended path is just to import the top-level, and use dot-access:
import AdvancedHTMLParser
myTag = AdvancedHTMLParser.AdvancedTag('div')
However, you can also import AdvancedTag through this top-level module:
import AdvancedHTMLParser
from AdvancedHTMLParser import AdvancedTag
Or, you can import from the specific sub-module, directly:
import AdvancedHTMLParser
from AdvancedHTMLParser.Tags import AdvancedTag
All examples below are written as if "import AdvancedHTMLParser" has already been performed, and all relations in examples are based off usages from the top-level import, only.
AdvancedHTMLParser
Think of this like "document" in a browser.
The AdvancedHTMLParser can read in a file (or string) of HTML, and will create a modifiable DOM tree from it. It can also be constructed manually from AdvancedHTMLParser.AdvancedTag objects.
To populate an AdvancedHTMLParser from existing HTML:
parser = AdvancedHTMLParser.AdvancedHTMLParser()
# Parse an HTML string into the document
parser.parseStr(htmlStr)
# Parse an HTML file into the document
parser.parseFile(filename)
The parser then exposes many "standard" functions as you'd find on the web for accessing the data, and some others:
getElementsByTagName - Returns a list of all elements matching a tag name
getElementsByName - Returns a list of all elements with a given name attribute
getElementById - Returns the single AdvancedTag matching the provided ID, or None if no element matches
getElementsByClassName - Returns a list of all elements containing one or more space-separated class names
getElementsByAttr - Returns a list of all elements matching a particular attribute/value pair
getElementsByXPathExpression - Returns a TagCollection (list) of all elements matching a given XPath expression
getElementsWithAttrValues - Returns a list of all elements with a specific attribute name containing one of a list of values
getElementsCustomFilter - Provide a function/lambda that takes a tag argument, and returns True to "match" it. Returns all matched objects
getRootNodes - Get a list of nodes at root level (0)
getAllNodes - Get all the nodes contained within this document
getHTML - Returns string of HTML representing this DOM
getFormattedHTML - Returns a formatted string (using AdvancedHTMLFormatter; see below) of the HTML. Takes as argument an indent (defaults to four spaces)
getMiniHTML - Returns a "mini" HTML representation which disregards all whitespace and indentation beyond the functional single-space
With the exception of getElementById, which returns a single AdvancedTag (or None), the results of these getElement* functions are TagCollection objects. A TagCollection is a special kind of list which contains additional functions. See the "TagCollection" section below for more info.
The tags in these collections can be modified, and the changes will be reflected in the parent DOM.
The parser also contains some expected properties, like:
head - The "head" tag associated with this document, or None
body - The "body" tag associated with this document, or None
forms - All "forms" on this document as a TagCollection
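The query functions and properties listed above can be exercised roughly as follows. This is a minimal sketch, not an official example: SAMPLE_HTML and the summarize helper are hypothetical names introduced here, and the guarded import only keeps the snippet loadable on a machine where the package is not installed.

```python
try:
    # Third-party package, assumed to be installed (e.g. via pip).
    import AdvancedHTMLParser
except ImportError:
    # Guard only: keeps this snippet loadable when the package is missing.
    AdvancedHTMLParser = None

# Hypothetical sample page used only for this illustration.
SAMPLE_HTML = """<html>
<head><title>Shop</title></head>
<body>
<div id="list">
<span class="item onsale" name="apple">Apple</span>
<span class="item" name="pear">Pear</span>
</div>
<form id="search"></form>
</body>
</html>"""


def summarize(html):
    """Parse html and collect a few query results as plain python values."""
    parser = AdvancedHTMLParser.AdvancedHTMLParser()
    parser.parseStr(html)

    container = parser.getElementById('list')      # one AdvancedTag, or None
    items = parser.getElementsByClassName('item')  # TagCollection
    spans = parser.getElementsByTagName('span')    # TagCollection
    pears = parser.getElementsByName('pear')       # TagCollection

    return {
        'container_tag': container.tagName,
        'missing': parser.getElementById('no-such-id'),
        'item_count': len(items),
        'span_count': len(spans),
        'pear_text': pears[0].innerHTML,
        'body_tag': parser.body.tagName,
        'form_count': len(parser.forms),
    }
```

Calling summarize(SAMPLE_HTML) returns a dict of plain values, which makes it easy to check each query independently of the others.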
General Attributes
In general, attributes can be accessed with dot-syntax, i.e.
tagEm.id = "Hello"
will set the "id" attribute. If it works in HTML javascript on a tag element, it should work on an AdvancedTag element with python.
setAttribute, getAttribute, and removeAttribute are more explicit and recommended ways of getting/setting/deleting attributes on elements.
The same names are used in python as in the javascript/DOM, such as 'className' corresponding to a space-separated string of the 'class' attribute, 'classList' corresponding to a list of classes, etc.
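A short sketch of the attribute access styles described above, on a tag built by hand. The attribute_demo helper is a hypothetical name used only for this illustration, and the guarded import is there only so the snippet loads where the package is not installed.

```python
try:
    # Third-party package, assumed to be installed (e.g. via pip).
    import AdvancedHTMLParser
except ImportError:
    # Guard only: keeps this snippet loadable when the package is missing.
    AdvancedHTMLParser = None


def attribute_demo():
    """Set, read, and remove attributes on a standalone tag."""
    tag = AdvancedHTMLParser.AdvancedTag('div')

    # Dot-syntax and the explicit setter both write HTML attributes.
    tag.id = 'main'
    tag.setAttribute('class', 'item onsale')

    snapshot = {
        'id_via_get': tag.getAttribute('id'),
        'class_name': tag.className,        # space-separated string
        'class_list': list(tag.classList),  # one entry per class
    }

    # removeAttribute deletes the attribute from the tag.
    tag.removeAttribute('id')
    snapshot['has_id_after_remove'] = tag.hasAttribute('id')
    snapshot['id_after_remove'] = tag.getAttribute('id')
    return snapshot
```

The explicit setAttribute/getAttribute/removeAttribute calls are the safer choice when an attribute name could collide with a method or property of the tag object.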
Style Attribute
Style attributes can be manipulated just like in javascript, so element.style.position = 'relative' for setting, or element.style.position for access.
You can also assign the tag.style as a string, like:
myTag.style = "display: block; float: right; font-weight: bold"
in addition to individual properties:
myTag.style.display = 'block'
myTag.style.float = 'right'
myTag.style.fontWeight = 'bold'
You can remove a style property by setting its value to an empty string.
For example, to clear "display" property:
myTag.style.display = ''
A standard method, setProperty, can also be used to set or remove individual properties.
For example:
myTag.style.setProperty("display", "block") # Set display: block
myTag.style.setProperty("display", '') # Clear display: property
The naming conventions are the same as in javascript, like "element.style.paddingTop" for the "padding-top" property.
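The style operations above can be combined in one place, as in the sketch below. The style_demo helper is a hypothetical name, the exact string form of the serialized style is not assumed, and the guarded import only keeps the snippet loadable where the package is missing.

```python
try:
    # Third-party package, assumed to be installed (e.g. via pip).
    import AdvancedHTMLParser
except ImportError:
    # Guard only: keeps this snippet loadable when the package is missing.
    AdvancedHTMLParser = None


def style_demo():
    """Assign a style string, then adjust individual properties."""
    tag = AdvancedHTMLParser.AdvancedTag('div')

    # Assign the whole style as a string.
    tag.style = 'display: block; font-weight: bold'

    # camelCase names map to dash-separated CSS names (padding-top here).
    tag.style.paddingTop = '4px'

    # An empty value clears the property.
    tag.style.setProperty('display', '')

    return {
        'font_weight': tag.style.fontWeight,
        'padding_top': tag.style.paddingTop,
        'display': tag.style.display,
        'css': str(tag.style),
    }
```

After the call, the 'css' entry holds the serialized style, which should mention padding-top and no longer mention display.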
TagCollection
A TagCollection can be used like a list. Every element has a unique uuid associated with it, and a TagCollection will ensure that the same element does not appear twice within its list (so it acts like an ordered set).
It also exposes the various getElement* functions which operate on the elements within the list (and their children).
For example:
# Get from the document all tags with "item" in their class
tagCollection = document.getElementsByClassName('item')
# Return all nodes which are nested within any class="item" object
# and also contain the class name "onsale"
itemsWithOnSaleClass = tagCollection.getElementsByClassName('onsale')
To operate just on items in the list, you can use the TagCollection method filterCollection, which takes a lambda/function that returns True for each tag to retain in the result.
For example:
# Get from the document all tags with "item" in their class
tagCollection = document.getElementsByClassName('item')
# Provide a lambda to filter this collection, returning in tagCollection2
# those items which have a "value" attribute > 20 and contain at least
# 1 child element with "specialPrice" class
tagCollection2 = tagCollection.filterCollection( lambda node : int(node.getAttribute('value') or 0) > 20 and len(node.getElementsByClassName('specialPrice')) >= 1 )
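For a version of this filter that can be run end to end, here is a minimal sketch against a small sample page. SAMPLE_HTML, its ids, and the special_items helper are hypothetical and exist only for this illustration. The guarded import keeps the snippet loadable where the package is missing.

```python
try:
    # Third-party package, assumed to be installed (e.g. via pip).
    import AdvancedHTMLParser
except ImportError:
    # Guard only: keeps this snippet loadable when the package is missing.
    AdvancedHTMLParser = None

# Hypothetical sample page: three items, only "a" passes both conditions.
SAMPLE_HTML = """<html><body>
<div id="a" class="item" value="30"><span class="specialPrice">25</span></div>
<div id="b" class="item" value="10"><span class="specialPrice">8</span></div>
<div id="c" class="item" value="50"><span>no special price</span></div>
</body></html>"""


def special_items(html):
    """Return the ids of items with value > 20 and a specialPrice child."""
    parser = AdvancedHTMLParser.AdvancedHTMLParser()
    parser.parseStr(html)

    items = parser.getElementsByClassName('item')

    # filterCollection tests only the tags already in the collection;
    # the lambda itself looks one level deeper for a specialPrice element.
    kept = items.filterCollection(
        lambda node: int(node.getAttribute('value') or 0) > 20
        and len(node.getElementsByClassName('specialPrice')) >= 1
    )
    return [node.getAttribute('id') for node in kept]
```

Item "b" is rejected on its value and item "c" on the missing specialPrice element, so only "a" remains.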
TagCollections also support advanced filtering (find/filter methods); see the "Advanced Filtering" section below.
AdvancedTag
The AdvancedTag represents a single tag and its inner text. It exposes many of the functions and properties you would expect to be present if using javascript. Each AdvancedTag also supports the same getElementsBy* functions as the parser.
It adds several additional functions that are not found in javascript, such as peers and arbitrary attribute searching.
Some of these include:
appendText - Append text to this element
appendChild - Append a child to this element
appendBlock - Append a block (text or AdvancedTag) to this element
append - alias of appendBlock
removeChild - Removes a child
removeText - Removes first occurrence of some text from any text nodes
removeTextAll - Removes ALL occurrences of some text from any text nodes
insertBefore - Inserts a child before an existing child
insertAfter - Inserts a child after an existing child
getChildren - Returns the children as a list
getStartTag - Start Tag, with attributes
getEndTag - End Tag
getPeersByName - Gets "peers" (elements with same parent, at same level in tree) with a given name
getPeersByAttr - Gets peers by an arbitrary attribute/value combination
getPeersWithAttrValues - Gets peers by an arbitrary attribute/values combination.
getPeersByClassName - Gets peers that contain a given class name
getElement* - Same as above, but act on the children of this element.
getParentElementCustomFilter - Takes a lambda/function and applies on all parents of this element upward until the document root. Returns the first node that when passed to this function returns True, or None if no matches on any parent nodes
getHTML / toHTML / asHTML - Get the HTML representation using this node as a root (so start t
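Several of the AdvancedTag functions listed above can be tried on a small tree built by hand, as in this sketch. The build_menu helper and the menu contents are hypothetical. The exact markup returned by getHTML is not assumed, and the guarded import only keeps the snippet loadable where the package is missing.

```python
try:
    # Third-party package, assumed to be installed (e.g. via pip).
    import AdvancedHTMLParser
except ImportError:
    # Guard only: keeps this snippet loadable when the package is missing.
    AdvancedHTMLParser = None


def build_menu():
    """Build a <ul> by hand, then insert, remove, and look up nodes."""
    ul = AdvancedHTMLParser.AdvancedTag('ul')
    ul.setAttribute('id', 'menu')

    # Create two list items, each holding one piece of text.
    items = {}
    for label in ('Home', 'About'):
        li = AdvancedHTMLParser.AdvancedTag('li')
        li.appendText(label)
        ul.appendChild(li)
        items[label] = li

    # Insert a new child before an existing one.
    docs = AdvancedHTMLParser.AdvancedTag('li')
    docs.appendText('Docs')
    ul.insertBefore(docs, items['About'])

    # Remove the first item again.
    ul.removeChild(items['Home'])

    # Walk upward from a child until the filter function matches.
    owner = docs.getParentElementCustomFilter(
        lambda node: node.tagName == 'ul'
    )

    return {
        'labels': [child.innerHTML for child in ul.getChildren()],
        'owner_id': owner.getAttribute('id') if owner is not None else None,
        'html': ul.getHTML(),
    }
```

The 'labels' entry reflects the order of the remaining children, and 'html' holds the serialized tree rooted at the ul tag.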
