tidypmc

Parse full text XML documents from PubMed Central

The Open Access subset of PubMed Central (PMC) includes 2.5 million articles from biomedical and life sciences journals. The full text XML files are freely available for text mining from the REST service or FTP site but can be challenging to parse. For example, section tags are nested to arbitrary depths, formulas and tables may return incomprehensible text blobs, and superscripted references are pasted at the end of words. The functions in the tidypmc package are intended to return readable text and maintain the document structure, so gene names and other terms can be associated with specific sections, paragraphs, sentences or table rows.
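To make the nesting problem concrete, here is a minimal sketch that builds a full section path for each paragraph using the xml2 package. The XML snippet is made up, and `section_path` is a hypothetical helper written for this illustration; it is not the package's own implementation.

```r
library(xml2)

# Toy article body with sections nested two levels deep (made-up content).
body <- read_xml(
  "<body>
     <sec><title>Methods</title>
       <p>Overview paragraph.</p>
       <sec><title>Clustering analysis</title>
         <p>Genes were clustered.</p>
       </sec>
     </sec>
   </body>"
)

# Hypothetical helper: join the titles of every enclosing <sec>, outermost first.
section_path <- function(p) {
  parents <- xml_parents(p)                       # nearest ancestor first
  secs    <- parents[xml_name(parents) == "sec"]  # keep only section nodes
  titles  <- xml_text(xml_find_first(secs, "./title"))
  paste(rev(titles), collapse = "; ")
}

paragraphs <- xml_find_all(body, "//sec/p")
paths <- vapply(paragraphs, section_path, character(1))
paths
```

Each paragraph keeps the titles of all enclosing sections, so a sentence deep in a subsection can still be traced back to its top-level section.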

Installation

Use remotes to install the package.

remotes::install_github("ropensci/tidypmc")

Load XML

Download a single XML document like PMC2231364 from the REST service using the pmc_xml function.

library(tidypmc)
library(tidyverse)
doc <- pmc_xml("PMC2231364")
doc
#  {xml_document}
#  <article article-type="research-article" xmlns:xlink="http://www.w3.org/1999/xlink">
#  [1] <front>\n  <journal-meta>\n    <journal-id journal-id-type="nlm-ta">BMC Microbiol</journal-id ...
#  [2] <body>\n  <sec>\n    <title>Background</title>\n    <p><italic>Yersinia pestis </italic>is th ...
#  [3] <back>\n  <ack>\n    <sec>\n      <title>Acknowledgements</title>\n      <p>We thank Dr. Chen ...

The europepmc package includes additional functions to search PMC and download full text. Be sure to include OPEN_ACCESS:Y in the search query, since open access articles are the only ones with full text XML available.

library(europepmc)
yp <- epmc_search("title:(Yersinia pestis virulence) OPEN_ACCESS:Y")
#  19 records found, returning 19
select(yp, pmcid, pubYear, title) %>%
  print(n=5)
#  # A tibble: 19 x 3
#    pmcid      pubYear title                                                                          
#    <chr>      <chr>   <chr>                                                                          
#  1 PMC5505154 2017    Crystal structure of Yersinia pestis virulence factor YfeA reveals two polyspe…
#  2 PMC3521224 2012    Omics strategies for revealing Yersinia pestis virulence.                      
#  3 PMC2704395 2009    Involvement of the post-transcriptional regulator Hfq in Yersinia pestis virul…
#  4 PMC2736372 2009    The NlpD lipoprotein is a novel Yersinia pestis virulence factor essential for…
#  5 PMC3109262 2011    A comprehensive study on the role of the Yersinia pestis virulence markers in …
#  # … with 14 more rows

Save all 19 results to a list of XML documents using the epmc_ftxt or pmc_xml function.

docs <- map(yp$pmcid, epmc_ftxt)
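A single failed download stops the whole `map` call. One possible way to keep going is to wrap the downloader with `purrr::possibly`. The sketch below uses a stand-in `fetch` function and a made-up failing ID so that it runs offline; in practice the wrapped function would be a downloader such as `epmc_ftxt`.

```r
library(purrr)

# Stand-in for a real downloader; hypothetical, so the sketch runs without network access.
fetch <- function(id) {
  if (id == "PMC0000000") stop("no full text for ", id)
  paste0("<article id='", id, "'/>")
}

# possibly() returns NULL instead of raising an error when a download fails.
safe_fetch <- possibly(fetch, otherwise = NULL)

ids  <- c("PMC2231364", "PMC0000000")       # the second ID is made up to force a failure
docs_demo <- set_names(map(ids, safe_fetch), ids)
docs_demo <- compact(docs_demo)             # drop the failed (NULL) entries
names(docs_demo)
```

Naming the list by ID before dropping failures makes it easy to see which articles were actually retrieved.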

See the PMC FTP vignette for details on parsing the large XML files on the FTP site with 10,000 articles each.

Parse XML

The package includes five functions to parse the xml_document.

| R function    | Description                                                                 |
| :------------ | :-------------------------------------------------------------------------- |
| pmc_text      | Split section paragraphs into sentences with full path to subsection titles |
| pmc_caption   | Split figure, table and supplementary material captions into sentences      |
| pmc_table     | Convert table nodes into a list of tibbles                                  |
| pmc_reference | Format references cited into a tibble                                       |
| pmc_metadata  | List journal and article metadata in front node                             |

The pmc_text function uses the tokenizers package to split section paragraphs into sentences. The function also removes any tables, figures or formulas that are nested within paragraph tags, replaces superscripted references with brackets, adds carets and underscores to other superscripts and subscripts and includes the full path to the subsection title.

txt <- pmc_text(doc)
#  Note: removing disp-formula nested in sec/p tag
txt
#  # A tibble: 194 x 4
#     section    paragraph sentence text                                                                         
#     <chr>          <int>    <int> <chr>                                                                        
#   1 Title              1        1 Comparative transcriptomics in Yersinia pestis: a global view of environment…
#   2 Abstract           1        1 Environmental modulation of gene expression in Yersinia pestis is critical f…
#   3 Abstract           1        2 Using cDNA microarray technology, we have analyzed the global gene expressio…
#   4 Abstract           2        1 To provide us with a comprehensive view of environmental modulation of globa…
#   5 Abstract           2        2 Almost all known virulence genes of Y. pestis were differentially regulated …
#   6 Abstract           2        3 Clustering enabled us to functionally classify co-expressed genes, including…
#   7 Abstract           2        4 Collections of operons were predicted from the microarray data, and some of …
#   8 Abstract           2        5 Several regulatory DNA motifs, probably recognized by the regulatory protein…
#   9 Abstract           3        1 The comparative transcriptomics analysis we present here not only benefits o…
#  10 Background         1        1 Yersinia pestis is the etiological agent of plague, alternatively growing in…
#  # … with 184 more rows
count(txt, section, sort=TRUE)
#  # A tibble: 21 x 2
#     section                                                                                                   n
#     <chr>                                                                                                 <int>
#   1 Results and Discussion; Clustering analysis and functional classification of co-expressed gene clust…    22
#   2 Background                                                                                               20
#   3 Results and Discussion; Virulence genes in response to multiple environmental stresses                   20
#   4 Methods; Collection of microarray expression data                                                        17
#   5 Results and Discussion; Computational discovery of regulatory DNA motifs                                 16
#   6 Methods; Gel mobility shift analysis of Fur binding                                                      13
#   7 Results and Discussion; Verification of predicted operons by RT-PCR                                      10
#   8 Abstract                                                                                                  8
#   9 Methods; Discovery of regulatory DNA motifs                                                               8
#  10 Methods; Clustering analysis                                                                              7
#  # … with 11 more rows
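
The superscript and subscript handling described for pmc_text can be illustrated with xml2 on a made-up paragraph. This is only a sketch of the idea: the exact replacement strings (a leading space before the bracket, `^` and `_` prefixes) are assumptions and may differ from what the package emits.

```r
library(xml2)

# Made-up paragraph with a cited reference, a superscript and a subscript.
p <- read_xml(
  '<p>Plague is caused by <italic>Y. pestis</italic><sup><xref ref-type="bibr" rid="B1">1</xref></sup> and grows at 10<sup>6</sup> CFU in H<sub>2</sub>O.</p>'
)

# Text inside superscripted bibliography references becomes " [n]".
refs <- xml_find_all(p, ".//sup[xref[@ref-type='bibr']]//text()")
xml_text(refs) <- paste0(" [", xml_text(refs), "]")

# Other superscripts get a caret, subscripts an underscore.
sups <- xml_find_all(p, ".//sup[not(xref)]/text()")
xml_text(sups) <- paste0("^", xml_text(sups))
subs <- xml_find_all(p, ".//sub/text()")
xml_text(subs) <- paste0("_", xml_text(subs))

clean <- xml_text(p)
clean
```

Without these replacements the plain text would read "Y. pestis1" and "106 CFU in H2O", which is the kind of output that makes reference numbers and exponents indistinguishable from words.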

Load the tidytext package for further text processing.

library(tidytext)
x1 <- unnest_tokens(txt, word, text) %>%
  anti_join(stop_words) %>%
  filter(!word %in% 1:100)
#  Joining, by = "word"
filter(x1, str_detect(section, "^Results"))
#  # A tibble: 1,269 x 4
#     section                paragraph sentence word         
#     <chr>                      <int>    <int> <chr>        
#   1 Results and Discussion         1        1 comprehensive
#   2 Results and Discussion         1        1 analysis     
#   3 Results and Discussion         1        1 sets         
#   4 Results and Discussion         1        1 microarray   
#   5 Results and Discussion         1        1 expression   
#   6 Results and Discussion         1        1 data         
#   7 Results and Discussion         1        1 dissect      
#   8 Results and Discussion         1        1 bacterial    
#   9 Results and Discussion         1        1 adaptation   
#  10 Results and Discussion         1        1 environments 
#  # … with 1,259 more rows
filter(x1, str_detect(section, "^Results")) %>%
  count(word, sort = TRUE)
#  # A tibble: 595 x 2
#     word           n
#     <chr>      <int>
#   1 genes         45
#   2 cluster       24
#   3 expression    21
#   4 pestis        21
#   5 data          19
#   6 dna           15
#   7 gene          15
#   8 figure        13
#   9 fur           12
#  10 operons       12
#  # … with 585 more rows
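
Because each sentence keeps its section and paragraph, terms such as gene identifiers can be located directly. The following is a minimal sketch on a toy tibble shaped like the pmc_text output; the sentences are invented, and the pattern (the prefix YPO followed by four digits) is an assumed locus tag format used only for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Toy rows shaped like pmc_text() output; the sentences are invented.
txt_demo <- tibble(
  section   = c("Results", "Results", "Methods"),
  paragraph = c(1L, 1L, 1L),
  sentence  = c(1L, 2L, 1L),
  text      = c("The operon YPO1087-YPO1088 was induced.",
                "No change was observed.",
                "Primers targeted YPO2439.")
)

# Assumed pattern: "YPO" followed by four digits.
hits <- txt_demo %>%
  mutate(tag = str_extract_all(text, "YPO[0-9]{4}")) %>%
  unnest(cols = tag) %>%                      # one row per match; sentences without a match drop out
  select(section, paragraph, sentence, tag)
hits
```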

The pmc_table function formats tables by collapsing multiline headers, expanding rowspan and colspan attributes and adding subheadings into a new column.

tbls <- pmc_table(doc)
#  Parsing 4 tables
#  Adding footnotes to Table 1
map_int(tbls, nrow)
#  Table 1 Table 2 Table 3 Table 4 
#       39      23       4      34
tbls[[1]]
#  # A tibble: 39 x 5
#     subheading              `Potential operon (r va… `Gene ID`   `Putative or predicted functi… `Reference (s)`
#     <chr>                   <chr>                 
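
The header handling that pmc_table performs (expanding colspan and collapsing multiline headers) can be sketched in base R. The labels, the colspan values and the ": " separator below are all assumptions chosen for illustration, not the package's actual output format.

```r
# First header row as read from the XML: two cells, the second spanning two columns
# (made-up labels and colspan values).
cells   <- c("Gene", "Expression")
colspan <- c(1L, 2L)

# Expanding colspan repeats each cell across the columns it covers.
row1 <- rep(cells, times = colspan)

# Second header row, one cell per column; "" marks a column with no sub-label.
row2 <- c("", "26C", "37C")

# Collapse the two rows into one name per column; the ": " separator is an assumption.
header <- ifelse(row2 == "", row1, paste(row1, row2, sep = ": "))
header
```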
