PuRSSE
Pubmed Research Search String Extraction (PuRSSE)
Install / Use
/learn @NCBI-Hackathons/PuRSSEREADME
PuRSSE (PubMed Research Search String Extraction)
<img src="https://raw.githubusercontent.com/NCBI-Hackathons/SystematicReviews/master/PuRSSE.png" height=30 /> This project aims to create a pipeline for taking a set of known PMIDs and discovering the Shortest Precise Search Strategy (SPSS) for PubMed that (a) retrieves all the original PMIDs, and (b) retrieves other articles related to the original topic(s) extracted from the known PMIDs. This process uses the article-level metadata provided by NLM (title and abstract plus MeSH terms and keywords). Topic modeling, specifically TF-IDF, word embedding and latent Dirichlet allocation (LDA), are used on the title and abstract. TF-IDF and word embedding are used on the MeSH terms. The topics generated will be used to create search strings constructed from the corresponding MeSH headings and keywords.
A demo of the projected process <a href="http://htmlpreview.github.io/?https://github.com/NCBI-Hackathons/PuRSSE/blob/master/demo/PuRSSEDemo.html">can be viewed here</a>:<br> Known PMIDs (paste list or search into interface) --> retrieve articles' metadata --> topic modeling and clustering --> search string construction using MeSH --> New Search Strategy (cut & paste search into PubMed.gov)
Team Members
Melanie Huston, James Lavinder, Richard Lusk, Franklin Sayre
Three Goals/Projects
- Create clusters of articles based on topic modeling (TF-IDF, LDA, Word Embedding) from any PubMed-compliant XML file
- Based on a set of known articles, find other articles that are similar using topic modeling (via either direct similarity comparison OR by building a new search string from metadata associated with topic clusters)
- Compare known set of PMIDs with larger set of PMIDs to verify 100% recall of known set and ideal larger set size for optimal precision
Why is this useful?
- It's cool!
- Researchers need ways of doing topic modeling on PubMed literature easily
- Creating a "shortest precise search strategy" based on a set of known PMIDs that retrieves those PMIDs and others like them could be useful for systematic reviews and other information retrieval tasks
- Researchers/instructors need ways of quickly getting precision and recall scores for a set of PMIDs within another set of PMIDs
How could this be used for systematic reviews
The first stage of creating a systematic review often involves taking a known set of articles (mostly available in PubMed and with PMIDs) and then iteratively looking through metadata and keywords to create an extensive search string that can find both those target articles and other similar articles, without retrieving too much. This could potentially be used to help with that process by recommending a search string.
How could this be used for topic modelling PubMed literature
This could be used to help with topic modeling PubMed literature by providing a pipeline that takes a PubMed compliant XML file (generated from PubMed.gov, or downloaded from PubMed FTP servers, or retrieved through EDirect) and outputing a set of topic models. This could be attached to other projects.
To do/Issues
- document ways of getting PubMed compliant XML files (ftp, PubMed.gov)
- see if EDirect gives compliant XML
- determine optimum way(s) to model topics (metadata and methods)
- create front end interface for end users
- find optimum method for stemming medical terms
- gain expertise in MeSH hierarchy for search string creation
Process
Get PubMed Data
- Download XML from NLM FTP. Approx 200GB. Benefits: all the data all the time. Negitives: with addition of new publications the list becomes out of date quickly, requires a lot of memory
- API. Slow. Doesn't require server space. Can't get everything.
- Pubrunner
- EDirect Local Data Cache
Extract useful metadata from XML
- Metadata was extracted from PubMed XML files using python's lmxl model. The following features were extracted:
- PMIDs
- Title
- Abstract Text
- MeSH Major Headings
- MeSH Subheadings
- Keywords
- The title and abstract are concatanted together, this is for older PubMed records with no abstract, and then are cleaned for processing. The cleaning process includes tokenizing, lowercasing, stemming and the removal of stop words. ** Note: full text was not included in this extraction. **
NLP Modeling on Title and Abstract
- TF-IDF is run on both Abstract+Title
- Document embeddings (Doc2Vec) are run on documents for features to measure similarity across documents
- Latent Dirichlet Allocation (LDA) is run for Topic Modeling purposes on Title and Abstract
NLP Modeling on MeSH terms
- TF-IDF across MeSH terms to find co-occurence
- Word embeddings (Word2vec) are run across the entire corpus of words
Retrieve most similar documents to initial ones provided
- Determining document similarity through NLP feature vectors and similarity metric (e.g. cosine similarity)
- apply a nearest-neighbors approach (K-D Tree) to retrieve the most similar documents to initial list of PMIDs
- Retrieve documents using pre-set number PMIDs (e.g. n=200)
Test against known PMIDs
- Use a corpus of a previously performed systematic review to validate results (Gout)
- Embed relevant articles (~300) among a larger semi-random corpus of ~10,000 other PubMed abstracts
- Measure performance through standard measures of precision and recall
Map MeSH & Keyword Strings associated with newest retrieved documents
- Using the newly retrieve PMIDs, map to the associated MeSH terms and identify terms most relevant to the original search documents
- Apply a similar nearest-neighbor approach as done with PMIDs abstracts, but now retrieving new MeSH terms
- Traverse the hierarchical structure of MeSH ID strings and subset terms based on their similarity (i.e. Levenshtein Distance)
- Combine the results of both approaches (nearest neighbor search, MeSH string matching) to form a final subset list of MeSH terms
Create Shortest Precise Search String
- Penalize longer search strings - apply higher weight to MeSH terms deeper in MeSH tree
- Sort and prioritize final MeSH terms based on frequency found within retrieved PubMed articles
Miscellaneous excercises
- Perform hierarchical clustering on new MeSH terms in order to facilitate topic discovery and as an additional validation procedure to our approach
Related Skills
YC-Killer
2.7kA library of enterprise-grade AI agents designed to democratize artificial intelligence and provide free, open-source alternatives to overvalued Y Combinator startups. If you are excited about democratizing AI access & AI agents, please star ⭐️ this repository and use the link in the readme to join our open source AI research team.
best-practices-researcher
The most comprehensive Claude Code skills registry | Web Search: https://skills-registry-web.vercel.app
groundhog
400Groundhog's primary purpose is to teach people how Cursor and all these other coding agents work under the hood. If you understand how these coding assistants work from first principles, then you can drive these tools harder (or perhaps make your own!).
last30days-skill
19.5kAI agent skill that researches any topic across Reddit, X, YouTube, HN, Polymarket, and the web - then synthesizes a grounded summary
