Erlsom
XML parser for Erlang
- Introduction
- SAX Mode
- Simple DOM Mode
- Data Binder Mode
- Installation
- Examples
- Character encoding
- Creation of atoms
- Limitations
- Reference
<a name="introduction">Introduction</a>
Erlsom is an Erlang library to parse (and generate) XML documents.
Erlsom can be used in three very different modes:
- As a SAX parser. This is a more or less standardized model for parsing XML. Every time the parser has processed a meaningful part of the XML document (such as a start tag), it will tell your application about this. The application can process this information (potentially in parallel) while the parser continues to parse the rest of the document. The SAX parser will allow you to efficiently parse XML documents of arbitrary size, but it may take some time to get used to it. If you invest some effort, you may find that it fits very well with the Erlang programming model (personally I have always been very happy about my choice to use a SAX parser as the basis for the rest of Erlsom).
- As a simple sort of DOM parser. Erlsom can translate your XML to the ‘simple form’ that is used by Xmerl. This is a form that is easy to understand, but you have to search your way through the output to get to the information that you need.
- As a ‘data binder’. Erlsom can translate the XML document to an Erlang data structure that corresponds to an XML Schema. It has the advantage over the SAX parser that it validates the XML document, and that you know exactly what the layout of the output will be. This makes it easy to access the elements that you need in a very direct way. (Look here for a general description of XML data binding.)
For all modes the following applies:
- If the document is too big to fit into memory, or if the document arrives in some kind of data stream, it can be passed to the parser in blocks of arbitrary size.
- The parser can work directly on binaries. There is no need to transform binaries to lists before passing the data to Erlsom. Using binaries as input has a positive effect on memory usage and speed (provided that you are using Erlang R12B or later; with an older Erlang version the parser will be faster if you transform binaries to lists). The binaries can be latin-1, utf-8 or utf-16 encoded.
- The parser has an option to produce output in binary form (only the character data: names of elements and attributes are always strings). This may be convenient if you want to minimize memory usage, and/or if you need the result in binary format for further processing. Note that it will slow down the parser slightly. If you select this option the encoding of the result will be utf-8 (irrespective of the encoding of the input document).
<a name="example">Example XML document</a>
Unless otherwise indicated, the examples in the next sections will use the following, very simple XML document as input:
<foo attr="baz"><bar>x</bar><bar>y</bar></foo>
This document is stored in a file called "minimal.xml", and read into a variable called Xml by the following commands in the shell:
1> {ok, Xml} = file:read_file("minimal.xml").
{ok,<<"<foo attr=\"baz\"><bar>x</bar><bar>y</bar></foo>\r\n">>}
The following, corresponding XSD ("minimal.xsd") is used in the first example for the data binder:
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="foo" type="foo_type"/>
  <xsd:complexType name="foo_type">
    <xsd:sequence>
      <xsd:element name="bar" type="xsd:string"
                   maxOccurs="unbounded"/>
    </xsd:sequence>
    <xsd:attribute name="attr" type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>
<a name="sax">SAX Mode</a>
The example below shows how the example XML can be processed using the SAX parser:
2> erlsom:parse_sax(Xml, [], fun(Event, Acc) -> io:format("~p~n", [Event]), Acc end).
startDocument
{startElement,[],"foo",[],[{attribute,"attr",[],[],"baz"}]}
{startElement,[],"bar",[],[]}
{characters,"x"}
{endElement,[],"bar",[]}
{startElement,[],"bar",[],[]}
{characters,"y"}
{endElement,[],"bar",[]}
{endElement,[],"foo",[]}
endDocument
{ok,[],"\r\n"}
The function erlsom:parse_sax takes three arguments: the XML document, an initial accumulator value and an ‘event processing function’. The parser calls this function for each part of the XML document that it has parsed. In this example, the function simply prints these events and returns the accumulator unchanged.
The next example does something slightly more meaningful: it counts the number of times the "bar" element occurs in the XML document. Ok, maybe not very useful, but at least this example will produce a result, not only side effects.
3> CountBar = fun(Event, Acc) -> case Event of {startElement, _, "bar", _, _} -> Acc + 1; _ -> Acc end end.
#Fun<erl_eval.12.113037538>
4> erlsom:parse_sax(Xml, 0, CountBar).
{ok,2,"\r\n"}
To describe it in a rather formal way: parse_sax(Xml, Acc0, Fun) calls Fun(Event, AccIn) on successive ‘XML events’ that result from parsing Xml, starting with AccIn == Acc0. Fun/2 must return a new accumulator which is passed to the next call. The function returns {ok, AccOut, Tail}, where AccOut is the final value of the accumulator and Tail is the list of characters that follow the last tag of the XML document. In this example AccOut == 2, since the "bar" start tag occurs twice.
(Notice how similar this is to lists:foldl(Fun, Acc0, Sax_events), assuming that Sax_events is the list of Sax events - I more or less copied this description from the documentation of the lists module.)
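To make the comparison with lists:foldl concrete, here is a minimal sketch that does not call Erlsom at all. It writes out the events printed by the first example as a plain list and folds the counting logic over them. The module and function names (sax_fold_demo, events/0, count_bar/2, count_bars/0) are made up for this illustration.

```erlang
-module(sax_fold_demo).
-export([events/0, count_bar/2, count_bars/0]).

%% The events printed by the first parse_sax example, written out as a list.
events() ->
    [startDocument,
     {startElement, [], "foo", [], [{attribute, "attr", [], [], "baz"}]},
     {startElement, [], "bar", [], []},
     {characters, "x"},
     {endElement, [], "bar", []},
     {startElement, [], "bar", [], []},
     {characters, "y"},
     {endElement, [], "bar", []},
     {endElement, [], "foo", []},
     endDocument].

%% Same logic as the CountBar fun: add one for every "bar" start tag.
count_bar({startElement, _Uri, "bar", _Prefix, _Attrs}, Acc) -> Acc + 1;
count_bar(_Event, Acc) -> Acc.

%% Fold the handler over the event list, starting with an accumulator of 0.
count_bars() ->
    lists:foldl(fun count_bar/2, 0, events()).
```

Calling sax_fold_demo:count_bars() gives 2, the same accumulator value that parse_sax returned above.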
It may still not be very clear to you how this SAX parser can be used to produce useful results. There are some additional examples in the examples directory of the Erlsom distribution. If you are still not convinced you can try to decipher the source code for the ‘data binder’ mode (erlsom_parse.erl) - this was also built on top of the SAX parser.
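As a slightly more useful illustration, the sketch below collects the character data of every "bar" element. The accumulator is a pair that remembers whether the parser is currently inside a "bar" element, plus the texts seen so far. The module name sax_collect_demo, the atoms in_bar and outside, and the helper run/1 are assumptions made for this example; run/1 takes a ready-made list of events so that the logic can be tried without Erlsom.

```erlang
-module(sax_collect_demo).
-export([collect/2, run/1]).

%% Accumulator: {in_bar | outside, TextsCollectedSoFar (newest first)}.
collect({startElement, _Uri, "bar", _Prefix, _Attrs}, {_State, Texts}) ->
    {in_bar, Texts};
collect({characters, Chars}, {in_bar, Texts}) ->
    %% Character data directly inside a "bar" element: keep it.
    {in_bar, [Chars | Texts]};
collect({endElement, _Uri, "bar", _Prefix}, {_State, Texts}) ->
    {outside, Texts};
collect(_Event, Acc) ->
    %% Every other event leaves the accumulator unchanged.
    Acc.

%% Fold the handler over a list of events and return the texts in document order.
run(Events) ->
    {_State, Texts} = lists:foldl(fun collect/2, {outside, []}, Events),
    lists:reverse(Texts).
```

With the real parser, the same handler should be usable as erlsom:parse_sax(Xml, {outside, []}, fun sax_collect_demo:collect/2); the collected texts are then in the second element of the returned accumulator, in reverse order.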
<a name="sax_events">SAX Events</a>
startDocument
endDocument
Will NOT be sent out in case of an error
{startPrefixMapping, Prefix, URI}
Begin the scope of a prefix - URI namespace mapping. Will be sent immediately before the corresponding startElement event.
{endPrefixMapping, Prefix}
End the scope of a prefix - URI namespace mapping. Will be sent immediately before the corresponding endElement event.
{startElement, Uri, LocalName, Prefix, [Attributes]}
The beginning of an element. There will be a corresponding endElement (even when the element is empty). All three name components will be provided.
[Attributes] is a list of attribute records, see sax.hrl. Namespace attributes (xmlns:*) will not be reported. There will be NO attribute values for defaulted attributes!
Providing 'Prefix' instead of 'Qualified name' is probably not quite in line with the SAX spec, but it appears to be more convenient.
{endElement, Uri, LocalName, Prefix}
The end of an element.
{characters, Characters}
Character data. All character data will be in one chunk, except if there is a CDATA section included inside a character section. In that case there will be separate events for the characters before the CDATA, the CDATA section and the characters following it (if any, of course).
{ignorableWhitespace, Characters}
If a character data section (as it would be reported by the 'characters' event, see above) consists ONLY of whitespace, it will be reported as ignorableWhitespace.
{processingInstruction, Target, Data}
{error, Description}
{internalError, Description}
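The event shapes listed above can be matched directly in function heads. The following sketch has one clause per event and uses it to tally how often each kind of event occurs. It also shows one way to turn the attribute list of a startElement event into name/value pairs, matching on the tuple form shown in the example output rather than on the record from sax.hrl. The module and function names are assumptions made for this example.

```erlang
-module(sax_event_kinds).
-export([kind/1, tally/2, attribute_pairs/1]).

%% One clause per event shape listed above. An event that is not in the
%% list would fail with a function_clause error.
kind(startDocument) -> startDocument;
kind(endDocument) -> endDocument;
kind({startPrefixMapping, _Prefix, _Uri}) -> startPrefixMapping;
kind({endPrefixMapping, _Prefix}) -> endPrefixMapping;
kind({startElement, _Uri, _LocalName, _Prefix, _Attributes}) -> startElement;
kind({endElement, _Uri, _LocalName, _Prefix}) -> endElement;
kind({characters, _Characters}) -> characters;
kind({ignorableWhitespace, _Characters}) -> ignorableWhitespace;
kind({processingInstruction, _Target, _Data}) -> processingInstruction;
kind({error, _Description}) -> error;
kind({internalError, _Description}) -> internalError.

%% Event handler: the accumulator is a map from event kind to a count.
tally(Event, Counts) ->
    maps:update_with(kind(Event), fun(N) -> N + 1 end, 1, Counts).

%% Attributes appear as 5-tuples in the example output, with the name in
%% the second position and the value in the last one.
attribute_pairs(Attributes) ->
    [{Name, Value} || {attribute, Name, _, _, Value} <- Attributes].
```

Following the parse_sax signature described earlier, the tally handler would be started with an empty map: erlsom:parse_sax(Xml, #{}, fun sax_event_kinds:tally/2).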
<a name="DOM">Simple DOM Mode</a>
This mode translates the XML document to a generic data structure. It doesn’t really follow the DOM standard; instead it provides a very simple format. In fact, it is very similar to the format that is defined as the ‘simple-form’ in the Xmerl documentation.
An example will probably be sufficient to explain it:
erlsom:simple_form(Xml).
{ok,{"foo",
[{"attr","baz"}],
[{"bar",[],["x"]},{"bar",[],["y"]}]},
"\r\n"}
Result = {ok, Element, Tail}, where Element = {Tag, Attributes, Content}, Tag is a string (there is an option that allows you to format Tag differently, see the reference section below), Attributes = [{AttributeName, Value}], and Content is a list of Elements and/or strings.
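Because the simple form is built from ordinary tuples, lists and strings, you find your way through it with plain pattern matching. The sketch below assumes the default format described above (Tag and attribute names as strings) and shows two hypothetical helpers: one that looks up an attribute value, and one that returns the text of each direct child element with a given tag.

```erlang
-module(simple_form_demo).
-export([attribute/2, child_texts/2]).

%% Value of the attribute Name, or undefined if the element does not have it.
attribute({_Tag, Attributes, _Content}, Name) ->
    proplists:get_value(Name, Attributes).

%% Text of each direct child element whose tag is ChildTag.
%% Strings in Content do not match the 3-tuple pattern and are skipped.
child_texts({_Tag, _Attributes, Content}, ChildTag) ->
    [text(ChildContent) || {Tag, _, ChildContent} <- Content, Tag =:= ChildTag].

%% Concatenate the strings in a content list, ignoring nested elements.
text(Content) ->
    lists:append([Item || Item <- Content, is_list(Item)]).
```

For the Element in the result above, attribute(Element, "attr") gives "baz" and child_texts(Element, "bar") gives ["x","y"].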
<a name="binder">Data Binder Mode</a>
In this mode, Erlsom parses XML documents that are associated with an XSD (or Schema). It checks whether the XML document conforms to the Schema, and it translates the document to an Erlang structure that is based on the types defined in the Schema. This section tries to explain the relation between the Schema and the Erlang data structure that is produced by Erlsom.
First a quick example using the same XML that was used for the other modes. Before we can parse the document we need to ‘compile’ the XML Schema (similar to how you might compile a regular expression).
10> {ok, Model} = erlsom:compile_xsd_file("minimal.xsd").
{ok,{model,[{typ…
Now you can use this compiled model:
11> {ok, Result, _} = erlsom:scan(Xml, Model).
{ok,{foo_type,[],"baz",["x","y"]},"\r\n"}
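The result has the shape of an Erlang record named foo_type with three fields. As a rough sketch of what a matching definition could look like: the field names attr and bar follow the schema, while the name anyAttribs for the second element is an assumption here, and the generated header file may well differ in its details.

```erlang
-module(foo_type_demo).
-export([bars/1, attr/1]).

%% Hand-written stand-in for the generated record definition.
%% The field name anyAttribs is an assumption, not taken from the text above.
-record(foo_type, {anyAttribs, attr, bar}).

%% Values of all "bar" elements, in document order.
bars(#foo_type{bar = Bars}) -> Bars.

%% Value of the "attr" attribute.
attr(#foo_type{attr = Attr}) -> Attr.
```

Applied to the tuple returned by erlsom:scan above, bars/1 gives ["x","y"] and attr/1 gives "baz".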
Assuming that you have defined a suitable record #foo_type{} (erlsom:write_xsd_hrl_file() can do it for you), you can use the following in your program (it won’t work in the shell):
BarValues = Result#foo_type.bar,
AttrValue = Result#foo_type.attr,
Nice and compact, as you see, but it ma
