MedEmbed: Medical-Focused Embedding Models
MedEmbed is a collection of embedding models fine-tuned specifically for medical and clinical data, aimed at enhancing performance in healthcare-related natural language processing (NLP) tasks.

Note: These models are a work in progress. Successive releases will aim to improve performance further and to create a medical embedding leaderboard on MTEB. Work on late-interaction retrievers based on ColBERT is also almost done.
Blog Post: Click Here
Model Download Links: v0.1
Dataset Download Links: v1
Support the Project
Developing MedEmbed requires significant resources. If you find it valuable, consider supporting the project. Your contribution helps sustain and improve this open-source initiative.
Project Overview
MedEmbed provides high-quality embedding models tailored for use in medical and clinical contexts. These models are designed to capture the nuances and complexities of medical terminology and concepts, making them particularly useful for a wide range of healthcare-related NLP tasks.
Key Features
- Fine-tuned embedding models focused on medical and clinical data
- Improved performance on healthcare-specific NLP tasks
- Multiple model variants to suit different use cases and computational requirements
- Extensive evaluation on medical NLP benchmarks
Model Variants
MedEmbed includes several model variants, each fine-tuned using different strategies:
- MedEmbed-Small-v1
- MedEmbed-Base-v1
- MedEmbed-Large-v1
Note: We have also fine-tuned ColBERT-v2 models; benchmarking is in progress.
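For readers unfamiliar with ColBERT-style late interaction, the following is a minimal sketch of the MaxSim scoring rule: each query token is matched to its most similar document token, and the per-token maxima are summed. The function names and the toy 2-dimensional token embeddings are illustrative assumptions; this is not code from the MedEmbed repository.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the best document-token match."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    total = 0.0
    for qv in q:
        # Best cosine similarity between this query token and any document token
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


# Toy token embeddings (hypothetical values, one vector per token)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # only a partial match for each query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Unlike a single-vector embedding model, this scheme keeps one vector per token, which is why such retrievers are evaluated separately from the variants listed above.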
Performance
Our models have been evaluated on various medical NLP benchmarks for retrieval, including:
- ArguAna
- MedicalQARetrieval
- NFCorpus
- PublicHealthQA
- TRECCOVID
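This README does not state which metric was used. Retrieval tasks in MTEB are typically reported with nDCG@10, so the sketch below assumes that metric and shows how it could be computed for one query. The function names and the toy relevance judgments are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """nDCG@k for one query: DCG of the system ranking divided by the ideal DCG."""
    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments: d1 and d3 are relevant to the query, d2 is not
qrels = {"d1": 1, "d3": 1}

perfect = ndcg_at_k(["d1", "d3", "d2"], qrels)
imperfect = ndcg_at_k(["d2", "d1", "d3"], qrels)
print(perfect, imperfect)
```

A benchmark score is then the mean of this value over all queries in the task.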
Key Findings
- Small Models:
  - MedEmbed-Small-v1 consistently outperformed the base BAAI/bge-small-en-v1.5 model across all benchmarks.
- Base Models:
  - MedEmbed-Base-v0 showed significant improvements over the base BAAI/bge-base-en-v1.5 model.
- Large Models:
  - MedEmbed-Large-v0 demonstrated superior performance compared to the base BAAI/bge-large-en-v1.5 model.
- Cross-Size Comparison:
  - In a comparison across different model sizes, MedEmbed-Large-v0 showed the best overall performance.
  - Notably, the medical-tuned small and base models often outperformed the larger base models, indicating significant improvements from domain-specific fine-tuning.
Note: More comparisons with other frontier models will be added, along with a results table.
Data Generation and Training Process
Our models are trained using a sophisticated synthetic data generation pipeline, leveraging the power of large language models and real-world clinical data.
Synthetic Data Generation Process

- Clinical Notes: We start with a large corpus of patient clinical notes from PubMed Central (PMC).
- LLM Processing: These notes are processed through the LLaMA 2 70B model to generate high-quality query-response pairs.
- Negative Sampling: We perform negative sampling to create challenging negative examples.
- Triplet Formation: The positive and negative examples are combined to form triplets (query, positive response, negative response).
- Contrastive Learning: These triplets are used to train our models using contrastive learning techniques inspired by ColBERT and BERT.
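The negative sampling and triplet formation steps can be sketched as follows. The README does not describe how negatives are chosen, so this example assumes a simple stand-in: for each query, pick the other queries' positive passage with the highest word overlap, which yields a plausible but wrong answer. The helper names (`jaccard`, `build_triplets`) and the toy pairs are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard overlap between two strings (a crude similarity proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_triplets(pairs):
    """Turn (query, positive) pairs into (query, positive, negative) triplets.

    The negative is the positive passage of a *different* pair that overlaps
    most with the query. This is an assumed heuristic, not the project's method.
    """
    triplets = []
    for i, (query, positive) in enumerate(pairs):
        candidates = [p for j, (_, p) in enumerate(pairs) if j != i]
        if not candidates:
            continue  # a single pair has no in-batch negative to draw from
        negative = max(candidates, key=lambda p: jaccard(query, p))
        triplets.append((query, positive, negative))
    return triplets


# Toy query-response pairs standing in for the LLM-generated ones
pairs = [
    ("what treats type 2 diabetes",
     "metformin is a first-line treatment for type 2 diabetes"),
    ("what are symptoms of type 1 diabetes",
     "type 1 diabetes often presents with thirst and weight loss"),
    ("how is asthma managed",
     "asthma is managed with inhaled corticosteroids"),
]

triplets = build_triplets(pairs)
for query, positive, negative in triplets:
    print(query, "|", positive, "|", negative)
```

In practice, hard negatives are often mined with an embedding model rather than word overlap; the lexical proxy here only keeps the example dependency-free.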
This innovative approach allows us to leverage the vast knowledge encoded in large language models while grounding our training data in real-world clinical information, resulting in embedding models that are both comprehensive and medically accurate.
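To make the contrastive step concrete, here is a minimal sketch of one common triplet objective: a softmax cross-entropy (InfoNCE-style) loss over the cosine similarities of the positive and the negative. The exact loss used for MedEmbed is not given in this README, and the temperature value is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_contrastive_loss(query, positive, negative, temperature=0.05):
    """-log softmax probability of the positive among {positive, negative}.

    temperature=0.05 is an illustrative choice, not a value from the project.
    """
    s_pos = cosine(query, positive) / temperature
    s_neg = cosine(query, negative) / temperature
    # Numerically stable log-sum-exp for the softmax denominator
    m = max(s_pos, s_neg)
    log_denominator = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denominator - s_pos


# Toy embeddings: the query is close to `near` and far from `far`
query_vec = [1.0, 0.0]
near = [0.9, 0.1]
far = [0.1, 0.9]

good_loss = triplet_contrastive_loss(query_vec, near, far)  # positive ranked first
bad_loss = triplet_contrastive_loss(query_vec, far, near)   # positive ranked last
print(good_loss, bad_loss)
```

Minimizing this loss pulls the query embedding toward the positive response and pushes it away from the negative one.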
Getting Started
Usage
[To be added]
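Until official usage instructions are published, the sketch below illustrates how an embedding model is typically used for retrieval: encode the query and the documents, then rank the documents by cosine similarity. It is not the project's documented API. The `toy_encode` function is a bag-of-words stand-in so the example runs without downloads, and the commented-out lines assume that the checkpoints load with the sentence-transformers library like their BGE base models do, which this README does not confirm. `MODEL_ID` is a hypothetical placeholder.

```python
import math
from collections import Counter


def toy_encode(texts):
    """Stand-in encoder: bag-of-words counts over a shared vocabulary.

    This is NOT a MedEmbed model; it only makes the example self-contained.
    """
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    return [[Counter(toks)[word] for word in vocab] for toks in tokenized]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(query, documents, encode):
    """Return (score, document) pairs sorted from most to least similar."""
    vectors = encode([query] + documents)
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, dv), doc) for dv, doc in zip(doc_vecs, documents)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


documents = [
    "chest pain radiating to the left arm",
    "seasonal allergies cause sneezing",
    "fracture of the left wrist",
]
results = rank_documents("patient reports chest pain", documents, toy_encode)
for score, doc in results:
    print(f"{score:.3f}  {doc}")

# Assumed swap-in for a real model (requires `pip install sentence-transformers`):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer(MODEL_ID)  # MODEL_ID: take it from the Model Download Links
#   results = rank_documents("patient reports chest pain", documents, model.encode)
```

Passing the encoder as an argument keeps the ranking logic independent of any particular model variant.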
Note: This project is actively evolving. We recommend using this repository as a template for similar projects, as significant code modifications may be necessary to adapt it to your specific needs. Please check for updates regularly and be prepared to adjust your implementation accordingly.
Contributing
[To be added]
Citation
If you use MedEmbed in your research, please cite our work:
@software{balachandran2024medembed,
author = {Balachandran, Abhinand},
title = {MedEmbed: Medical-Focused Embedding Models},
year = {2024},
url = {https://github.com/abhinand5/MedEmbed}
}
License
This project is licensed under the Apache License Version 2.0. See the LICENSE file for details.
Contact
For any queries regarding the codebase or research, please reach out to Abhinand Balachandran at abhinandb.ml@gmail.com.

