A REXX XML PARSER

the big picture

Function

This is a Rexx XML parser that you can append to your own Rexx source. You can then parse xml files into an in-memory model and access the model via a DOM-like API.

This version has been tested on:

z/OS v2.3 TSO (using TSO/REXX)
Windows (using Regina Rexx 3.9.1)
Linux (using Regina Rexx 3.9.1)

Installation

Copy the distribution library to your Rexx library.
Execute the REXXPP INCLUDE pre-processor by running:

REXXPP yourlib(PRETTY) PRETTYPP

This will append PARSEXML to PRETTY and create PRETTYPP. You will be prompted for the location of each include file that cannot be found. Obviously, you can just use the editor if you prefer!
Execute PRETTYPP to parse the TESTXML file by:

PRETTYPP yourlib(TESTXML) (NOBLANKS

If you supply any options, be sure to put a space before the opening parenthesis. A closing parenthesis is optional.
Repeat steps 2 and 3 for each of the other sample Rexx procedures if you want.

Usage

Initialize the parser by:

call initParser [options...]
Parse the XML file to build an in-memory model

returncode = parseFile('filename')

...or...

returncode = parseString('xml in a string')

Navigate the in-memory model with the DOM API. For example:

say 'The document element is called',
                      getName(getDocumentElement())
say 'Children of the document element are:'
node = getFirstChild(getDocumentElement())
do while node <> ''
    if isElementNode(node)
    then say 'Element node:' getName(node)
    else say '   Text node:' getText(node)
    node = getNextSibling(node)
end

Optionally, destroy the in-memory model:

call destroyParser

Input

The input to the parser consists of either an XML file or a string containing:

An optional XML prolog:

0 or 1 XML declaration:

<?xml version="1.0" encoding="..." ...?>

0 or more comments, Processing Instructions (PI's), and whitespace:
```

<?target string?>
```

0 or 1 document type declaration. Formats:

<!DOCTYPE root SYSTEM "sysid">
<!DOCTYPE root PUBLIC "pubid" SYSTEM "sysid">
<!DOCTYPE root [internal dtd]>

An XML body:

1 Document element containing 0 or more child elements. For example:

<doc attr1="value1" attr2="value2"...>
  Text of doc element
  <child1 attr1="value1">
    Text of child1 element
  </child1>
  More text of doc element
  <!-- an empty child element follows -->
  <child2/>
  Even more text of doc element
</doc>

Elements may contain:
- Unparsed character data:
```
<![CDATA[...unparsed data...]]>
```
- Entity references:
```
&name;
```
- Character references:
```
&#nnnnn;
&#xXXXX;
```

An XML epilog (which is ignored):
- 0 or more comments, Processing Instructions (PIs), and whitespace.

Application Programming Interface

The basic setup/teardown API calls are:

initParser [options]

Initialises the parser's global variables and remembers any runtime options you specify. The options recognized are:

| Option | Meaning | | -------- | ------------------------------ | | NOBLANKS | Suppress whitespace-only nodes | | DEBUG | Display some debugging info | | DUMP | Display the parse tree |

parseFile(filename)

Parses the XML data in the specified filename and builds an in-memory model that can be accessed via the DOM API (see below).

parseString(text)

Parses the XML data in the specified string.

destroyParser

Destroys the in-memory model and miscellaneous global variables.
In addition, the following utility API calls can be used:

removeWhitespace(text)

Returns the supplied text string but with all whitespace characters removed, multiple spaces replaced with single spaces, and leading and trailing spaces removed.

removeQuotes(text)

Returns the supplied text string but with any enclosing apostrophes or double-quotes removed.

escapeText(text)

Returns the supplied text string but with special characters encoded (for example, < becomes <)

toString(node)

Walks the document tree (beginning at the specified node) and returns a string in XML format.

DOM API

The DOM (or DOM-like) calls that you can use are listed below:

Document query/navigation API calls

getRoot()

Returns the node number of the root node. This can be used in calls requiring a node argument. In this implementation, getDocumentElement() and getRoot() are (incorrectly) synonymous - this may change, so you should use getDocumentElement() in preference to getRoot().

getDocumentElement()

Returns the node number of the document element. The document element is the topmost element node. You should use this in preference to getRoot() (see above).

getName(node)

Returns the name of the specified node.

getNodeValue(node) or getText(node)

Returns the text content of an unnamed node. A node without a name can only contain text. It cannot have attributes or children.

getAttributeCount(node)

Returns the number of attributes present on the specified node.

getAttributeMap(node)

Builds a map of the attributes of the specified node. The map can be accessed via the following variables:

| Variable | Content | | ----------------- | ------------------------------- | | g.0ATTRIBUTE.0 | The number of attributes mapped | | g.0ATTRIBUTE.n | The name of attribute number n (in order of appearance) where n > 0 | | g.0ATTRIBUTE.name | The value of the attribute called name |

getAttributeName(node,n)

Returns the name of the nth attribute of the specified node (1 is first, 2 is second, etc).

getAttributeNames(node)

Returns a space-delimited list of the names of the attributes of the specified node.

getAttribute(node,name)

Returns the value of the attribute called name of the specified node.

getAttribute(node,n)

Returns the value of the nth attribute of the specified node (1 is first, 2 is second, etc).

setAttribute(node,name,value)

Updates the value of the attribute called 'name' of the specified node. If no attribute exists with that name, then one is created.

setAttributes(node,name1,value1,name2,value2,...)

Updates the attributes of the specified node. Zero or more name/value pairs are be specified as the arguments.

hasAttribute(node,name)

Returns 1 if the specified node has an attribute with the specified name, else 0.

getParentNode(node) or getParent(node)

Returns the node number of the specified node's parent. If the node number returned is 0, then the specified node is the root node. All nodes have a parent (except the root node).

getFirstChild(node)

Returns the node number of the specified node's first child node.

getLastChild(node)

Returns the node number of the specified node's last child node.

getChildNodes(node) or getChildren(node)

Returns a space-delimited list of node numbers of the children of the specified node. You can use this list to step through the children as follows:
```
children = getChildren(node)
say 'Node' node 'has' words(children) 'children'
do i = 1 to words(children)
  child = word(children,i)
  say 'Node' child 'is' getName(child)
end
```
getChildrenByName(node,name)

Returns a space-delimited list of node numbers of the immediate children of the specified node which are called name. Names are case-sensitive.

getElementsByTagName(node,name)

Returns a space-delimited list of node numbers of the descendants of the specified node which are called name. Names are case-sensitive.

getNextSibling(node)

Returns the node number of the specified node's next sibling node. That is, the next node sharing the same parent.

getPreviousSibling(node)

Returns the node number of the specified node's previous sibling node. That is, the previous node sharing the same parent.

getProcessingInstruction(name)

Returns the value of the Processing Instruction (PI) with the specified target name.

getProcessingInstructionList()

Returns a space-delimited list of the names of all PI target names.

getNodeType(node)

Returns a number representing the specified node's type. The possible values can be compared to the following

Rexxxmlparser

Install / Use

README