AdvancedHTMLParser

Fast Indexed python HTML parser which builds a DOM node tree, providing common getElementsBy* functions for scraping, testing, modification, and formatting. Also XPath.

AdvancedHTMLParser is an Advanced HTML Parser, with support for adding, removing, modifying, and formatting HTML.

It aims to provide the same interface you would find in a compliant browser through javascript (e.g. all the getElement methods, appendChild, etc.), an XPath implementation, and many more complex and sophisticated features not available through a browser. And most importantly, it's in Python!

There are many potential applications, not limited to:

  • Webpage Scraping / Data Extraction
  • Testing and Validation
  • HTML Modification/Insertion
  • Outputting your website
  • Debugging
  • HTML Document generation
  • Web Crawling
  • Formatting HTML documents or web pages

It is especially good for servlets/webpages. You can quickly take an expertly crafted page in raw HTML / CSS, have your servlet ingest it with AdvancedHTMLParser, and create/insert data elements into the existing view using a simple and well-known interface (javascript-like + HTML DOM).

Another useful scenario is creating automated testing suites, which can operate much more quickly and reliably (and at a deeper function level) than in-browser testing suites.

Full API

The full API documentation can be found at http://htmlpreview.github.io/?https://github.com/kata198/AdvancedHTMLParser/blob/master/doc/AdvancedHTMLParser.html?vers=8.1.8 .

Examples

Various examples can be found in the "tests" directory. A very old, simple example can also be found as "example.py" in the root directory.

Short Doc

The Package and Modules

The top-level module in this package is "AdvancedHTMLParser".

import AdvancedHTMLParser

Most everything "public" is available through this top-level module, but some corner-case usages may require importing from a submodule. All of these associations can be found through the pydocs.

For example, to access AdvancedTag, the recommended path is just to import the top-level, and use dot-access:

import AdvancedHTMLParser

myTag = AdvancedHTMLParser.AdvancedTag('div')

However, you can also import AdvancedTag through this top-level module:

import AdvancedHTMLParser

from AdvancedHTMLParser import AdvancedTag

Or, you can import from the specific sub-module, directly:

import AdvancedHTMLParser

from AdvancedHTMLParser.Tags import AdvancedTag

All examples below are written as if "import AdvancedHTMLParser" has already been performed, and all relations in examples are based off usages from the top-level import, only.

AdvancedHTMLParser

Think of this like "document" in a browser.

The AdvancedHTMLParser can read in a file (or string) of HTML, and will create a modifiable DOM tree from it. It can also be constructed manually from AdvancedHTMLParser.AdvancedTag objects.

To populate an AdvancedHTMLParser from existing HTML:

parser = AdvancedHTMLParser.AdvancedHTMLParser()

# Parse an HTML string into the document
parser.parseStr(htmlStr)

# Parse an HTML file into the document
parser.parseFile(filename)

The parser then exposes many "standard" functions as you'd find on the web for accessing the data, and some others:

getElementsByTagName   - Returns a list of all elements matching a tag name

getElementsByName      - Returns a list of all elements with a given name attribute

getElementById         - Returns the single AdvancedTag matching the provided ID, or None if no element matches

getElementsByClassName - Returns a list of all elements containing one or more space-separated class names

getElementsByAttr       - Returns a list of all elements matching a particular attribute/value pair.

getElementsByXPathExpression - Return a TagCollection (list) of all elements matching a given XPath expression

getElementsWithAttrValues - Returns a list of all elements with a specific attribute name containing one of a list of values

getElementsCustomFilter - Provide a function/lambda that takes a tag argument, and returns True to "match" it. Returns all matched objects

getRootNodes            - Get a list of nodes at root level (0)

getAllNodes             - Get all the nodes contained within this document

getHTML                 - Returns string of HTML representing this DOM

getFormattedHTML        - Returns a formatted string (using AdvancedHTMLFormatter; see below) of the HTML. Takes as argument an indent (defaults to four spaces)

getMiniHTML             - Returns a "mini" HTML representation which disregards all whitespace and indentation beyond the functional single-space

The getElements* functions return TagCollection objects (getElementById returns a single AdvancedTag or None). A TagCollection is a special kind of list which contains additional functions. See the "TagCollection" section below for more info.

The returned tags can be modified, and the changes will be reflected in the parent DOM.

The parser also contains some expected properties, like

head                    - The "head" tag associated with this document, or None

body                    - The "body" tag associated with this document, or None

forms                   - All "forms" on this document as a TagCollection

General Attributes

In general, attributes can be accessed with dot-syntax, i.e.

tagEm.id = "Hello"

will set the "id" attribute. If it works in HTML javascript on a tag element, it should work on an AdvancedTag element with python.

setAttribute, getAttribute, and removeAttribute are more explicit and recommended ways of getting/setting/deleting attributes on elements.

The same names are used in python as in the javascript/DOM, such as 'className' corresponding to a space-separated string of the 'class' attribute, 'classList' corresponding to a list of classes, etc.

Style Attribute

Style attributes can be manipulated just like in javascript, so element.style.position = 'relative' for setting, or element.style.position for access.

You can also assign the tag.style as a string, like:

myTag.style = "display: block; float: right; font-weight: bold"

in addition to individual properties:

myTag.style.display = 'block'
myTag.style.float = 'right'
myTag.style.fontWeight = 'bold'

You can remove a style property by setting its value to an empty string.

For example, to clear "display" property:

myTag.style.display = ''

The standard method setProperty can also be used to set or remove individual properties.

For example:

myTag.style.setProperty("display", "block") # Set display: block

myTag.style.setProperty("display", '') # Clear display: property

The naming conventions are the same as in javascript, like "element.style.paddingTop" for the "padding-top" property.

TagCollection

A TagCollection can be used like a list. Every element has a uuid associated with it, and a TagCollection will ensure that the same element does not appear twice within its list (so it acts like an ordered set).

It also exposes the various getElement* functions which operate on the elements within the list (and their children).

For example:

# Get from the parser all tags with "item" in class ("document" is an AdvancedHTMLParser instance)
tagCollection = document.getElementsByClassName('item')

# Return all nodes which are nested within any class="item" object
#  and also contains the class name "onsale"
itemsWithOnSaleClass = tagCollection.getElementsByClassName('onsale')

To operate just on items in the list, you can use the TagCollection method filterCollection, which takes a lambda/function that returns True for each tag to retain in the result.

For example:

# Get from the parser all tags with "item" in class ("document" is an AdvancedHTMLParser instance)
tagCollection = document.getElementsByClassName('item')

# Provide a lambda to filter this collection, returning in tagCollection2
#   those items which have a "value" attribute > 20 and contains at least
#   1 child element with "specialPrice" class
tagCollection2 = tagCollection.filterCollection( lambda node : int(node.getAttribute('value') or 0) > 20 and len(node.getElementsByClassName('specialPrice')) > 0 )

TagCollections also support advanced filtering (find/filter methods), see "Advanced Filtering" section below.

AdvancedTag

The AdvancedTag represents a single tag and its inner text. It exposes many of the functions and properties you would expect to be present if using javascript. Each AdvancedTag also supports the same getElementsBy* functions as the parser.

It adds several additional methods that are not found in javascript, such as peers and arbitrary attribute searching.

Some of the methods available on an AdvancedTag include:

appendText              - Append text to this element

appendChild             - Append a child to this element

appendBlock             - Append a block (text or AdvancedTag) to this element

append                  - alias of appendBlock

removeChild             - Removes a child

removeText              - Removes first occurrence of some text from any text nodes

removeTextAll           - Removes ALL occurrences of some text from any text nodes

insertBefore            - Inserts a child before an existing child

insertAfter             - Inserts a child after an existing child

getChildren             - Returns the children as a list

getStartTag             - Start Tag, with attributes

getEndTag               - End Tag

getPeersByName          - Gets "peers" (elements with same parent, at same level in tree) with a given name

getPeersByAttr          - Gets peers by an arbitrary attribute/value combination

getPeersWithAttrValues  - Gets peers by an arbitrary attribute/values combination.

getPeersByClassName   - Gets peers that contain a given class name

getElement*             - Same as above, but act on the children of this element.

getParentElementCustomFilter - Takes a lambda/function and applies on all parents of this element upward until the document root. Returns the first node that when passed to this function returns True, or None if no matches on any parent nodes

getHTML / toHTML / asHTML - Get the HTML representation using this node as a root (so start t
