    README file for sgrep version 1.91-alpha - 
a tool to search and index text, SGML, XML and HTML files using
structured patterns

    Copyright (C) 1998  University of Helsinki, 
                        Department of Computer Science

    Authors: Jani Jaakkola           Jani.Jaakkola@cs.helsinki.fi
             Pekka Kilpelainen       Pekka.Kilpelainen@cs.helsinki.fi

This README file is intended for describing the new features of sgrep-1.91a. If you want to know what sgrep is and what the old features are, see: http://www.cs.helsinki.fi/~jjaakkol/sgrep.html

See the section "NEW QUERY LANGUAGE FEATURES" for description of the new operators available in version 1.91a.

Sgrep-1.91 supports 16-bit wide characters and Unicode in XML-documents. See the section "WIDE CHARACTER SUPPORT" for information on wide characters and UTF-8 and UTF-16 encodings.

This file (and newer versions of this file) is available from http://www.cs.helsinki.fi/~jjaakkol/sgrep/README.txt

Sgrep is distributed under GNU General Public License. See file COPYING for details.

This piece of software is still under development. This means that:

  • New features might be included before final sgrep-2.0 release.
  • Existing features might be changed.
  • It is guaranteed to have bugs.
  • All suggestions are welcome.
  • All available documentation of the new features is contained in this file.

NEW FEATURES

Major new features since sgrep-1.0 which are already present:

  • Indexing of both structure and content.
  • SGML/XML/HTML scanner.
  • Official Win32 binary.
  • sgtool has been dumped. It never really worked and even when it did, it wasn't very useful.
  • Should be completely compatible with older versions of sgrep.
  • Sgrep now supports direct containment. In the SGML and XML world this means that you can query children or parents of given elements.
  • Sgrep uses GNU autoconf
  • Also the sources are now available
  • Operators for supporting direct containment
  • Nearness operators
  • 16-bit wide characters and Unicode support

Features which will be present in sgrep-2.0:

  • Proper documentation
  • Support for querying notations, element type declarations and attribute list declarations inside SGML/XML document prolog
  • Scanning of all well-formed XML-documents.

Features which probably won't be present in sgrep-2.0:

  • Regular expressions, since they are probably better handled by other software, like Perl. However, sgrep still needs some new options for better Perl support.

Win32-BINARY RELEASE

The Win32-binary release contains both sgrep binary and m4 binary. Sgrep binary is compiled with MSVC and requires no additional libraries.

Please note that the examples in this README file and in the sgrep WWW-pages have been written using sh shell-syntax. When you use sgrep under the Windows shell "COMMAND.COM", you have to either use the -f option or translate the query from:

% sgrep 'word("foo") or word("bar")' foobar

to

C:> sgrep "word(\"foo\") or word(\"bar\")" foobar

Alternatively, you can install bash from the Cygnus Cygwin project.
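
The translation above can be checked mechanically. The following Python sketch is only an illustration and not part of sgrep: it uses the standard library function subprocess.list2cmdline, which applies the Microsoft C runtime quoting rules, and it assumes that the MSVC-compiled sgrep binary parses its command line with those same rules.

```python
import subprocess

# The query exactly as it would appear between single quotes in sh.
query = 'word("foo") or word("bar")'

# list2cmdline wraps arguments containing spaces in double quotes and
# turns every embedded double quote into \" (MS C runtime convention).
command_line = subprocess.list2cmdline(["sgrep", query, "foobar"])

print(command_line)
# prints: sgrep "word(\"foo\") or word(\"bar\")" foobar
```

Using the -f option avoids the quoting problem altogether, since the query never passes through the shell.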

The m4 binary comes from the Cygnus Cygwin project. See http://sourceware.cygnus.com/cygwin/ for details. The included binary release of m4 requires the cygwin.dll library. Both of them are distributed under the GNU General Public License (GPL). See file COPYING for details.


SGML-SCANNER

Sgrep has a built-in scanner for XML, SGML and HTML-documents. This means that complex macros for querying SGML-files are no longer needed. However, sgrep still does not contain a full blown SGML-parser: the thing which it does contain could be described as an SGML-scanner. It does not recognize any syntax errors, it does not provide a parse tree and it does not provide any event stream. It just recognizes regions from SGML-files corresponding to different SGML tokens: start tags, end tags, attributes, etc.

Since version 1.90a the SGML-scanner maintains an element stack. This is needed to support direct containment in queries to SGML/XML-files. The query language primitive 'elements' returns all elements of the queried XML/SGML-documents (see the 'childrening' and 'parenting' operators for examples).
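
To illustrate why an element stack is enough to answer direct containment questions, here is a minimal Python sketch. It is not sgrep's scanner: the regular expression, the function name scan_elements and the tuple layout are assumptions made for this example. Like the scanner described above, it does no syntax checking and simply ignores end tags that do not match the innermost open element.

```python
import re

# Start tags, end tags and empty-element tags; comments and PIs are skipped.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>")

def scan_elements(text):
    """Return (name, start, end, parent name) for every element region."""
    stack = []      # currently open elements as (name, start offset)
    elements = []
    for m in TAG.finditer(text):
        closing, name, empty = m.group(1), m.group(2), m.group(3)
        if closing:
            if stack and stack[-1][0] == name:
                _, start = stack.pop()
                parent = stack[-1][0] if stack else None
                elements.append((name, start, m.end(), parent))
        elif empty:
            parent = stack[-1][0] if stack else None
            elements.append((name, m.start(), m.end(), parent))
        else:
            stack.append((name, m.start()))
    return elements

elements = scan_elements("<doc><sec><p>hi</p><br/></sec></doc>")
# Direct children of <sec>: elements whose parent on the stack was "sec".
children_of_sec = [name for name, _, _, parent in elements if parent == "sec"]
```

The element on top of the stack when a tag is seen is by definition the direct parent, so no parse tree is needed.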

Since version 1.91a sgrep has support for 16-bit wide characters in query terms and support for UTF-8 and UTF-16 encodings in the SGML-scanner. See the "WIDE CHARACTER SUPPORT" below.

SGML has many features which make it very difficult to parse. The SGML-scanner implemented in sgrep does not attempt to be a complete and error-free SGML-parser; valid SGML-documents might confuse it. However, my goal is that all well-formed XML-documents will be parsed correctly.

The scanner has two modes:

  • SGML/HTML-mode
      o Names are case insensitive
      o PIs end with '>'
  • XML-mode
      o Names are case sensitive
      o PIs end with '?>'
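
The difference in how processing instructions (PIs) end can be sketched in a few lines of Python. This is an illustration only, not sgrep's code; the function name find_pi_regions and the half-open (start, end) offsets are assumptions of the example.

```python
def find_pi_regions(text, xml_mode):
    """Return (start, end) offsets of PIs; end is one past the terminator."""
    terminator = "?>" if xml_mode else ">"
    regions = []
    pos = text.find("<?")
    while pos != -1:
        end = text.find(terminator, pos + 2)
        if end == -1:
            break               # unterminated PI: stop scanning
        end += len(terminator)
        regions.append((pos, end))
        pos = text.find("<?", end)
    return regions

sample = "<?pi a>b?>"
xml_regions = find_pi_regions(sample, xml_mode=True)    # [(0, 10)]
sgml_regions = find_pi_regions(sample, xml_mode=False)  # [(0, 7)]
```

The same input therefore yields different PI regions depending on the mode, which is one reason to select the right mode for the files being searched.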

Sgrep will recognize empty XML elements (<ELEMENT/>) in both modes.

The scanner does not automatically include entity references. However, it can automatically add external parsed entities defined in the internal document type definition subset to the scanned files. E.g., if you have a line <!ENTITY chapter1 SYSTEM "chapter1.sgml"> in your document, the scanner can automatically add the file "chapter1.sgml" to the list of scanned files when it sees this line in the internal document type definition subset. To use this feature you need to use the "-g include-entities" option.
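
As a rough illustration of what collecting such files involves, the following Python sketch extracts system identifiers from general entity declarations. It is not how sgrep implements "-g include-entities"; the regular expression and the function name external_entity_files are assumptions of this example, and it skips parameter entities and does not distinguish unparsed (NDATA) entities.

```python
import re

# Matches declarations such as <!ENTITY chapter1 SYSTEM "chapter1.sgml">.
# A name starting with '%' (parameter entity) is deliberately not matched.
ENTITY_DECL = re.compile(r'<!ENTITY\s+([^\s%]\S*)\s+SYSTEM\s+(["\'])(.*?)\2')

def external_entity_files(internal_subset):
    """Return (entity name, system identifier) pairs in document order."""
    return [(m.group(1), m.group(3))
            for m in ENTITY_DECL.finditer(internal_subset)]

subset = (
    '<!ENTITY chapter1 SYSTEM "chapter1.sgml">\n'
    '<!ENTITY copy "(c)">\n'
    "<!ENTITY chapter2 SYSTEM 'chapter2.sgml'>\n"
)
files = external_entity_files(subset)
```

Here files contains the two external entities, while the internal entity "copy" is ignored because it has no system identifier.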


WIDE CHARACTER SUPPORT

Sgrep version 1.91a introduces 16-bit wide character support in index terms and in the SGML-parser.

Since the sgrep query language is still strictly 8-bit, wide characters in queries need to be encoded. I chose to use an encoding which looks just like character entity references in SGML: "#<decimal number>;" for a character number in decimal and "#x<hex number>;" for a character number in hexadecimal. Therefore the ISO-8859-1 letter a with two dots on top of it ('ä', assuming you are reading this file with an ISO-8859-1 font; the '&auml;' entity in HTML, '&#228;' as a decimal character reference and '&#xe4;' as a hexadecimal character reference) can be encoded in an sgrep query either as "#228;" or as "#xe4;".

So the Finnish word "älämölö" ("&auml;l&auml;m&ouml;l&ouml;" in HTML) can be queried either with a query like 'word("älämölö")', since the sgrep query language supports 8-bit characters, or with an encoded query like 'word("#228;l#228;m#246;l#246;")' or 'word("#xe4;l#xe4;m#xf6;l#xf6;")'.
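
Encoding a term by hand is error-prone, so a small helper can do it. The Python sketch below is an illustration under stated assumptions, not a tool shipped with sgrep: the function name encode_query_term is hypothetical, it encodes every character above 7-bit ASCII, it rejects characters that do not fit in 16 bits, and it does not deal with a literal '#' in the term, since this README does not say how that should be written.

```python
def encode_query_term(text, use_hex=False):
    """Encode non-ASCII characters as "#<decimal>;" or "#x<hex>;"."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("character does not fit in 16 bits: %r" % ch)
        if code < 0x80:
            parts.append(ch)                 # plain ASCII is passed through
        elif use_hex:
            parts.append("#x%x;" % code)
        else:
            parts.append("#%d;" % code)
    return "".join(parts)

decimal_term = encode_query_term("\u00e4l\u00e4m\u00f6l\u00f6")
hex_term = encode_query_term("\u00e4l\u00e4m\u00f6l\u00f6", use_hex=True)
# decimal_term is "#228;l#228;m#246;l#246;"
# hex_term is "#xe4;l#xe4;m#xf6;l#xf6;"
```

The result can be placed inside a query, for example as the argument of word("...").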

The SGML-parser supports UTF-8 and UTF-16 encodings. You can select the encoding with the -g option:

  • "-g encoding=utf-8" selects UTF-8 encoding. This is the default if you are using the SGML-scanner in XML-mode (with -g xml option).
  • "-g encoding=utf-16" selects UTF-16 encoding. Note that currently (in version 1.91a) this is a synonym for "-g encoding=utf-8", since sgrep switches automatically from UTF-8 mode to UTF-16 mode when it sees the byte order mark (this also means that you must have the byte order mark if you are using UTF-16).
  • "-g encoding=iso-8859-1" selects ISO-8859-1 encoding, which is also the default encoding when the SGML-scanner is in any other mode than XML.
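
The automatic switch from UTF-8 to UTF-16 on a byte order mark can be pictured with the following Python sketch. It only illustrates the idea and is not taken from sgrep; the function name pick_encoding and the returned encoding labels are assumptions of the example.

```python
import codecs

def pick_encoding(data, default="utf-8"):
    """Choose a decoding for raw file bytes based on a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_LE):     # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):     # bytes FE FF
        return "utf-16-be"
    return default                               # no BOM: stay in the default mode

little_endian = "\ufeff<a/>".encode("utf-16-le")
big_endian = "\ufeff<a/>".encode("utf-16-be")
chosen = [pick_encoding(little_endian),
          pick_encoding(big_endian),
          pick_encoding(b"<a/>")]
```

A UTF-16 file without a byte order mark falls through to the default, which matches the requirement stated above that UTF-16 files must carry the mark.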

The SGML-scanner currently recognizes character entity references only in character data content. Character entity references in attribute values or entity literals are not recognized. No entity references other than character entity references are expanded, not even "&amp;", "&gt;" and "&lt;". I plan to fix this before the next release.

The XML-scanner recognizes the encoding parameter in XML-declarations and can switch encoding accordingly (if not overridden with the -g encoding option). Currently "us-ascii", "iso-8859-1", "utf-8" and "utf-16" encodings are recognized. Note that in XML-mode the SGML-parser interprets all characters classified as "Letter" in the XML specification as word characters by default.

Currently only the SGML-scanner is aware of different encodings. The output module does not do any conversions: it just dumps the result regions from the queried files exactly as they were encoded there, even when different files use different encodings (this probably needs to be fixed).

Here is an example using Murata Makoto's example XML-documents in Japanese (see http://www.oasis-open.org/cover/xmlJapaneseExamples.html ). The Unicode character 0x771f appears in Murata Makoto's name as written in Japanese.

% sgrep -o"%f:%l\n" -g xml 'word("#x771f;")' pr-xml-little-endian.xml pr-xml-utf-16.xml pr-xml-utf-8.xml weekly-utf-8.xml
pr-xml-little-endian.xml:2
pr-xml-utf-16.xml:2
pr-xml-utf-8.xml:3
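
The numbers 2, 2 and 3 in the output above are consistent with the encoded size of the matched character, assuming that "%l" in the output style prints the length of the matching region in bytes (an assumption of this note, not something stated in this README). A short Python check of the encoded sizes:

```python
ch = "\u771f"   # the character queried above

utf8_len = len(ch.encode("utf-8"))        # three bytes: E7 9C 9F
utf16_len = len(ch.encode("utf-16-le"))   # one 16-bit code unit, two bytes

print(utf8_len, utf16_len)
# prints: 3 2
```

Since the output module does no conversion, the same one-character match occupies a different number of bytes in differently encoded files.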


NEW QUERY LANGUAGE FEATURES

The example file "example.sgml" and its DTD "example.dtd" are included in this distribution.

New query language features in version 1.91a and later:

  • near(distance)

Finds regions of the left hand side and the right hand side having at most 'distance' bytes between them. 'A near(0) B' would return regions of A and B which "touch" each other (in other words, there are no bytes between them). I know that using bytes is not the best way to measure distance in a text search engine, but the way Sgrep works makes this kind of query very fast. If you really need a nearness operator with words as a measure of
