Pub2TEI
Service for converting and enhancing heterogeneous publisher XML formats into TEI
Project goal
This project aims at converting XML documents encoded in various scientific publisher formats into a common TEI XML format. Often called document ingestion, converting a myriad of heterogeneous publisher formats into a common working format is a painful and time-consuming sub-task for building scientific digital library applications.
The target TEI XML is the same as the Grobid TEI XML format, which makes it possible to ingest various publisher XML or PDF documents into the same XML format and avoids writing multiple format-specific parsers. The publisher XML transformation should normally preserve all the information from the source XML.
In addition to avoiding any loss of publisher XML information, the converter offers several ways to enhance the publisher XML:
- when the input publisher XML has raw strings for affiliations and bibliographical references, Grobid can be used automatically to further parse the raw strings into a structured representation that is added to the final TEI document,
- sentence segmentation is possible for the final TEI document, using the same sentence segmenter as Grobid,
- the converter service fixes various problems such as empty nodes, duplicated XML identifiers, and attribute values that are not valid NCNames.
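To make the last point more concrete, here is a minimal Python sketch of what repairing invalid and duplicated `xml:id` values can look like. It is an illustration only and not Pub2TEI's actual implementation: the function names `to_ncname` and `fix_ids` are hypothetical, and the NCName check is a simplified ASCII-only approximation of the XML rule.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Assumption: simplified, ASCII-only approximation of the NCName production.
NCNAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_.\-]*$")


def to_ncname(value):
    """Turn an arbitrary string into an ASCII NCName-like identifier (hypothetical helper)."""
    cleaned = re.sub(r"[^A-Za-z0-9_.\-]", "_", value)
    # An NCName cannot start with a digit, a dot or a hyphen.
    if not cleaned or not re.match(r"[A-Za-z_]", cleaned[0]):
        cleaned = "_" + cleaned
    return cleaned


def fix_ids(root):
    """Repair invalid xml:id values and rename duplicates, in place (hypothetical helper)."""
    seen = set()
    for elem in root.iter():
        value = elem.get(XML_ID)
        if value is None:
            continue
        if not NCNAME.match(value):
            value = to_ncname(value)
        # Append a numeric suffix until the identifier is unique in the document.
        candidate, n = value, 1
        while candidate in seen:
            n += 1
            candidate = f"{value}_{n}"
        seen.add(candidate)
        elem.set(XML_ID, candidate)
    return root
```

A real fix also has to update any pointer that refers to a renamed identifier (for example a `target="#..."` attribute), which this sketch leaves out.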
With Pub2TEI, it is thus possible to obtain TEI XML documents with at least the same level of structuring as Grobid TEI XML created from PDF, while preserving the high quality of publisher full text XML encoding.
Coverage
The following publisher XML formats should be processed properly:
- BMJ: metadata, header, bibliography, body
- Elsevier (journals and conferences): metadata, header, bibliography, body
- IOP: metadata, header, bibliography
- NPG (Nature): metadata, header, bibliography, body
- NLM/JATS: metadata, header, bibliography, body
- OUP: metadata, header, bibliography, body
- PNAS: metadata, header, bibliography, body
- RSC: metadata, header, bibliography, body
- Sage: metadata, header
- ScholarOne: metadata, header
- Springer: metadata, header, bibliography, body
- Wiley: metadata, header, bibliography, body
Coverage of NLM and JATS should be comprehensive (all known versions), covering in particular PMC, PLOS and bioRxiv XML. However, JATS is so loose that a new JATS flavor might require some stylesheet adjustments. If you observe issues in the resulting TEI XML for a new JATS publisher flavor, please file an issue in this project.
How it works
The project offers a web service for transforming and enhancing publisher XML in an efficient parallelized manner.
It uses a set of stylesheets for converting XML documents encoded in various scientific publisher formats into a common TEI XML format. These stylesheets were first developed in the context of the European project PEER and have been further extended over the last years, in particular in the context of the ISTEX project. Depending on the publisher (see above), the encoding of bibliographical information, abstracts, citations and full texts is supported.
Enhancement is then performed by Grobid, which selects the appropriate model dynamically based on the raw fields identified in the publisher XML that can be further structured.
The simplest way to run the converter is to use the Docker image and the web service API. The Docker image contains all the required stylesheets, the Grobid deep learning models, the sentence segmenter utility and an XSLT 2.0 processor for XML transformation. The service compiles the stylesheets at startup and keeps them "warm" for the transformation requests.
Running the project with Docker
Start the Pub2TEI service as follows:
docker run --rm --gpus all --init --ulimit core=0 -p 8060:8060 grobid/pub2tei:0.2
By default, the service is started on port 8060. The host port can be changed, for example to 8080, as follows:
docker run --rm --gpus all --init --ulimit core=0 -p 8080:8060 grobid/pub2tei:0.2
Python client
After starting the service, a simple Python client is provided to easily process directories of XML files:
git clone https://github.com/kermitt2/Pub2TEI
cd Pub2TEI/client
python3 pub2tei_client.py --help
usage: pub2tei_client.py [-h] --input INPUT [--output OUTPUT] [--config CONFIG] [--n N] [--consolidate_references] [--segment_sentences]
[--generate_ids] [--grobid_refine] [--force] [--verbose]
Client for Pub2TEI services
optional arguments:
-h, --help show this help message and exit
--input INPUT path to the directory containing XML files to process: .xml
--output OUTPUT path to the directory where to put the results (optional)
--config CONFIG path to the config file, default is ./config.json
--n N concurrency for service usage
--consolidate_references
use GROBID for consolidation of the bibliographical references
--segment_sentences segment sentences in the text content of the document with additional <s> elements
--generate_ids Generate idenfifier for each text item
--grobid_refine use Grobid to structure/enhance raw fields: affiliations, references, person, dates
--force force re-processing pdf input files when tei output files already exist
--verbose print information about processed files in the console
For example, to recursively process all the .xml files in a given directory with sentence segmentation, writing the transformed files alongside the input files:
python3 pub2tei_client.py --input ~/test/input/ --segment_sentences
To recursively process all the .xml files in a given input directory and write the results to a given output directory, using Grobid to further enhance the transformed documents and consolidate the references:
python3 pub2tei_client.py --input ~/test/input/ --output ~/test/output/ --grobid_refine --consolidate_references
Note that consolidation is performed with the consolidation service indicated in the configuration file of the Pub2TEI server, under pub2tei/resources/config/config.yml. This selected consolidation service overrides the one possibly indicated in the Grobid configuration file.
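When the client is driven from another script rather than typed by hand, the command line can be assembled programmatically. The following is a small sketch that only uses the options listed in the help output above; the helper name `build_client_command` and its keyword names are hypothetical and not part of the client itself.

```python
import sys

# Boolean options taken from the client help output above.
BOOLEAN_FLAGS = (
    "consolidate_references",
    "segment_sentences",
    "generate_ids",
    "grobid_refine",
    "force",
    "verbose",
)


def build_client_command(input_dir, output_dir=None, concurrency=None,
                         script="pub2tei_client.py", **flags):
    """Return the argument list for one client run (hypothetical helper)."""
    unknown = set(flags) - set(BOOLEAN_FLAGS)
    if unknown:
        raise ValueError(f"unknown client flags: {sorted(unknown)}")
    cmd = [sys.executable, script, "--input", str(input_dir)]
    if output_dir is not None:
        cmd += ["--output", str(output_dir)]
    if concurrency is not None:
        cmd += ["--n", str(concurrency)]
    # Emit each enabled Boolean option as a bare "--flag".
    cmd += [f"--{name}" for name in BOOLEAN_FLAGS if flags.get(name)]
    return cmd


# Possible usage, to be run from the directory that contains the client script:
#   import subprocess
#   cmd = build_client_command("input/", output_dir="output/", grobid_refine=True)
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with directory names that contain spaces. Note that `~` is not expanded without a shell, so paths such as `~/test/input/` would need `os.path.expanduser` first.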
Web services
Transform a publisher XML into TEI XML format, with optional enhancements.
| method | request type | response type | parameters | requirement | description |
|--- |--- |--- |-------------------------|--- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| POST | multipart/form-data | application/xml | input | required | publisher XML file to be processed |
| | | | segmentSentences | optional | Boolean, if true the paragraph structures in the resulting TEI will be further segmented into sentence elements <s> |
| | | | grobidRefine | optional | Boolean, if true the raw affiliations and raw bibliographical reference strings will be parsed with Grobid and the resulting structured information added to the transformed TEI XML |
| | | | consolidateReferences | optional | Consolidate all the bibliographical references; consolidateReferences is a string with value 0 (no consolidation, default value), 1 (consolidate and inject all extra metadata), or 2 (consolidate the citation and inject DOI only). |
| | | | generateIDs | optional | Inject the attribute xml:id in the textual elements (title, note, term, keywords, p, s) |
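The parameters in the table can be assembled into a `multipart/form-data` POST request with the Python standard library alone. The sketch below is illustrative and rests on two assumptions that the table does not settle: the service URL (including the endpoint path) must be supplied by the caller, and Boolean parameters are sent as the string `"1"` (adjustable through `true_value`). The function names are hypothetical.

```python
import uuid
import urllib.request


def pub2tei_fields(segment_sentences=False, grobid_refine=False,
                   consolidate_references=0, generate_ids=False, true_value="1"):
    """Map Python options to the form parameter names of the table above."""
    if consolidate_references not in (0, 1, 2):
        raise ValueError("consolidate_references must be 0, 1 or 2")
    fields = {}
    # Assumption: the wire encoding of Booleans is not stated in the table.
    if segment_sentences:
        fields["segmentSentences"] = true_value
    if grobid_refine:
        fields["grobidRefine"] = true_value
    if generate_ids:
        fields["generateIDs"] = true_value
    if consolidate_references:
        fields["consolidateReferences"] = str(consolidate_references)
    return fields


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file part; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", str(value)]
    lines += [f"--{boundary}",
              f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"',
              "Content-Type: application/xml",
              "", ""]
    head = "\r\n".join(lines).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + file_bytes + tail, f"multipart/form-data; boundary={boundary}"


def build_request(url, xml_bytes, filename="input.xml", **options):
    """Build (but do not send) the POST request for one publisher XML file."""
    body, content_type = encode_multipart(
        pub2tei_fields(**options), "input", filename, xml_bytes)
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST")
```

The request can then be sent with `urllib.request.urlopen(request)`, once `url` points at the actual service endpoint.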
Response status codes:
| HTTP Status code | reason |
|--- |--- |
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be provided |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Internal service error, further described by a provided message |
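A caller will usually want to turn these status codes into either a result or an error. Here is a minimal sketch of such a mapping; the names `Pub2TEIError` and `handle_response` are hypothetical, and the body is returned as raw bytes so that no assumption is made about the encoding of the returned XML.

```python
class Pub2TEIError(Exception):
    """Raised when the service reports a failed request (hypothetical class)."""


def handle_response(status, body=b""):
    """Return the TEI bytes for 200, None for 204, and raise otherwise."""
    if status == 200:
        return body
    if status == 204:
        # The process completed, but there is no content to return.
        return None
    # The message accompanying an error is decoded leniently for display only.
    message = body.decode("utf-8", errors="replace")
    if status == 400:
        raise Pub2TEIError(f"wrong request or missing parameter/header: {message}")
    if status == 500:
        raise Pub2TEIError(f"internal service error: {message}")
    raise Pub2TEIError(f"unexpected status {status}: {message}")
```

With `urllib.request`, 4xx and 5xx responses are raised as `urllib.error.HTTPError`; its `code` attribute and `read()` method can be passed to a function like this one.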
Assuming that the service is started on the default port :8060 of a local machine, here is a curl example:
