JATSdecoder

A metadata and text extraction and text manipulation tool set for the statistical programming language R.

JATSdecoder facilitates text mining projects on scientific articles by enabling an individual selection of metadata and text parts. Its function JATSdecoder() extracts metadata, sectioned text and reference list from NISO-JATS coded XML files. The function study.character() uses the JATSdecoder() result to perform fine-tuned text extraction tasks to identify key study characteristics like statistical methods used, alpha-error, statistical results reported in text and others.

Note:

  • PDF article collections can be converted to NISO-JATS coded XML files with the open source software CERMINE.
  • To extract statistical test results reported in simple/unpublished PDF documents with JATSdecoder::get.stats(), the function pdf_text() from the R package pdftools can be used to extract the textual content (be aware that tabulated content may yield corrupted text).

Note too:

  • A minimal web app to extract statistical results from textual resources with get.stats() is hosted at:
    https://get-stats.app
  • An interactive web application to analyze study characteristics of articles stored in the PubMed Central database and to select individual articles by study characteristics is hosted at:
    https://scianalyzer.com/

JATSdecoder also supplies convenient functions for working with textual input in general. text2sentences() is designed to break running text with scientific content (references, results) into sentences. text2num() converts written numbers and special notations (percent, fraction, e+10) into digits. ngram() extracts an adjustable number of words around a pattern match in a sentence. letter.convert() converts hexadecimal character codes to Unicode; when CERMINE-generated CERMXML files are processed, it additionally applies CERMINE-specific error correction and letter unification, which is essential for the ability of get.stats() to extract and recompute statistical results in text.
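As a minimal sketch of how the helper functions described above can be chained on plain text: the input string is invented for illustration, and the argument names pattern and ngram are taken from the package documentation as I understand it, so check them against the CRAN manual before relying on them.

```r
# Sketch only: the example sentence is made up, not taken from a real article
library(JATSdecoder)

txt <- "We recruited one hundred twenty-eight students. The effect was significant, t(126) = 2.31, p = .02."

# convert written numbers and special notations to digits
txt_num <- text2num(txt)

# break the running text into sentences
sentences <- text2sentences(txt_num)

# extract words around a pattern match; ngram = c(-2, 2) is assumed to
# request two words before and two words after the match
context <- ngram(sentences, pattern = "students", ngram = c(-2, 2))

print(sentences)
print(context)
```

Converting numbers before splitting into sentences keeps each step on the input type it is documented for (a text string for text2num(), a character vector for ngram()).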

The contained functions are listed below. For a detailed description, see the documentation on CRAN.

  • JATSdecoder::JATSdecoder() uses functions that can also be applied stand-alone to NISO-JATS coded XML files or text input:

    • get.title() # extracts title
    • get.author() # extracts author/s as vector
    • get.aff() # extracts involved affiliation/s as vector
    • get.journal() # extracts journal
    • get.vol() # extracts journal volume as vector
    • get.doi() # extracts Digital Object Identifier
    • get.history() # extracts publishing history as vector with available date stamps
    • get.country() # extracts country/countries of origin as vector with unique countries
    • get.type() # extracts document type
    • get.subject() # extracts subject/s as vector
    • get.keywords() # extracts keyword/s as vector
    • get.abstract() # extracts abstract
    • get.text() # extracts sections and text as list
    • get.references() # extracts reference list as vector
  • JATSdecoder::study.character() applies several functions to specific elements of the JATSdecoder() result. These functions can also be used stand-alone on any plain textual input:

    • get.n.studies() # extracts number of studies from sections or abstract
    • get.alpha.error() # extracts alpha error from text
    • get.method() # extracts statistical methods from method and result section with ngram()
    • get.stats() # extracts statistical results reported in text (abstract and full text, method and result section, or result section only) and compares extracted with recalculated p-values where possible
    • get.software() # extracts software name/s mentioned in method and result section with dictionary search
    • get.R.package() # extracts mentioned R package/s in method and result section with dictionary search on all available R packages created with available.packages()
    • get.power() # extracts power (1-beta-error) if mentioned in text
    • get.assumption() # extracts mentioned assumptions from method and result section with dictionary search
    • get.multiple.comparison() # extracts correction method for multiple testing from method and result section with dictionary search
    • get.sig.adjectives() # extracts common inadequate adjectives used before significant and not significant
  • JATSdecoder helper functions are useful for many text mining projects and straightforward to use on any textual input:

    • text2sentences() # breaks floating text into sentences
    • text2num() # converts spelled-out numbers, fractions, powers, percentages and numbers denoted with e+num to decimal numbers
    • ngram() # creates ±n-gram bag of words around a pattern match in text
    • strsplit2() # splits text at pattern match with option "before" or "after" and without removing the pattern match
    • grep2() # extension of grep(). Allows connecting multiple search patterns with logical AND operator
    • letter.convert() # converts most hexadecimal and HTML character codes to Unicode, unifies many character variants and performs CERMINE-specific error correction
    • which.term() # returns hit vector for a set of patterns to search for in text (can be reduced to hits only)
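
The stand-alone use on plain text mentioned in the lists above might look like the following sketch. The input sentences are invented, and the argument names for grep2() and strsplit2() (pattern, x, split, type) are assumptions based on the package documentation; consult the CRAN manual for the authoritative signatures.

```r
# Sketch only: the input sentences are made up for illustration
library(JATSdecoder)

x <- c("The alpha level was set to .05.",
       "The mean difference was significant, t(18) = 2.5, p < .05.",
       "Age and gender were not related, r(45) = .12, p = .42.")

# extract statistical results reported in the text
stats <- get.stats(x)

# extract the reported alpha error
alpha <- get.alpha.error(x)

# keep only elements that match BOTH patterns (logical AND)
both <- grep2(pattern = c("significant", "p <"), x = x)

# split after each pattern match without removing the match
# (type = "after" is assumed to be the option named in the list above)
parts <- strsplit2(x = x[2], split = "significant,", type = "after")

print(stats)
print(alpha)
```

No expected output is shown because the exact structure of the returned objects depends on the package version.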

Built With

How to cite JATSdecoder

JATSdecoder: A Metadata and Text Extraction and Manipulation Tool Set. Ingmar Böschen (2026). R package version 1.3.0

Resources

Articles:

  • Böschen, I. (2021). Software review: The JATSdecoder package—extract metadata, abstract and sectioned text from NISO-JATS coded XML documents; Insights to PubMed central’s open access database. Scientometrics. https://doi.org/10.1007/s11192-021-04162-z. [link to repo]

  • Böschen, I. (2021). Evaluation of JATSdecoder as an automated text extraction tool for statistical results in scientific reports. Scientific Reports 11, 19525. https://doi.org/10.1038/s41598-021-98782-3. [link to repo]

  • Böschen, I. (2023). Evaluation of the extraction of methodological study characteristics with JATSdecoder. Scientific Reports 13, 139. https://doi.org/10.1038/s41598-022-27085-y. [link to repo]

  • Böschen, I. (2023). Changes in methodological study characteristics in psychology between 2010-2021. PLOS ONE 18(5). https://doi.org/10.1371/journal.pone.0283353. [link to repo]

  • Böschen, I. (2024). statcheck is flawed by design and no valid spell checker for statistical results. https://arxiv.org/abs/2408.07948. [link to repo]

  • Böschen, I. (2026). Extraction of tabulated statistical results with tableParser. https://arxiv.org/abs/2408.07948. [link to repo]

Evaluation data and code:

https://github.com/ingmarboeschen/JATSdecoderEvaluation/

JATSdecoder on CRAN:

https://CRAN.R-project.org/package=JATSdecoder/

Getting Started

To install JATSdecoder, use one of the following options:

Installation

Option 1: Install JATSdecoder from CRAN

install.packages("JATSdecoder")

Option 2: Install JATSdecoder from GitHub with the devtools package

# install devtools first if it is not yet available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("ingmarboeschen/JATSdecoder")

Usage for a single XML file

The following example downloads a NISO-JATS coded XML file with download.file() and processes it:

# load package
library(JATSdecoder)
# download example XML file via URL
URL <- "https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0114876&type=manuscript"
download.file(URL,"file.xml")
# convert full article to list with metadata, sectioned text and reference list
JATSdecoder("file.xml")
# extract specific content (here: abstract)
JATSdecoder("file.xml",output="abstract")
get.abstract("file.xml")
# extract study characteristics as list
study.character("file.xml")
# extract specific study characteristic (here: statistical results)
study.character("file.xml",output=c("stats","standardStats")) 
# reduce to checkable results only
study.character("file.xml",output="standardStats",stats.mode="checkable")
# compare with result of statcheck's function checkHTML() (Epskamp & Nuijten, 2018)
install.packages("statcheck")
library(statcheck)
checkHTML("file.xml")

# extract results with get.stats() from simple/unpublished manuscripts with pdftools::pdf_text()
x<-pdftools::pdf_text("path2file.pdf")
x<-unlist(strsplit(x,"\\n"))
JATSdecoder::get.stats(x)

Usage for a collection of XML files

The PubMed Central database offers more than 5.4 million documents related to the biology an
