# AffilGood 🕺🏾
AffilGood is a Python library for extracting and structuring research institution information from raw affiliation strings (e.g. those found in scientific publications, project beneficiaries, or metadata dumps).
It is designed to work in real-world, multilingual, noisy settings, while remaining:
- 🧩 modular
- 🛡️ defensive
- 🧪 fully testable
- 🔌 easy to extend
AffilGood focuses on stable output semantics: regardless of which internal components are enabled, the public output schema remains consistent.

## 📄 Publication

This repository accompanies the paper "AffilGood: Building reliable institution name disambiguation tools to improve scientific literature analysis", published at the Scholarly Document Processing (SDP) Workshop @ ACL 2024.

- Paper: https://aclanthology.org/2024.sdp-1.13/
- Slides: https://docs.google.com/presentation/d/1wX7zInjoUrjO1hRL3U8tpSzxU6KOX0FknTaEqSf6ML0
## ✨ What AffilGood does

Given an affiliation string like:

```
SELMET, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France
```
AffilGood can:
- detect institutions (ORG), sub-organizations (SUBORG), and subunits (SUB) via NER
- link institutions to registries (ROR) using a three-stage cascade pipeline
- translate non-Latin scripts (Chinese, Japanese, Arabic, Russian, etc.) before processing
- enrich results with geolocation (city, country, NUTS regions, coordinates)
- fill missing locations from ROR data when NER misses geographic entities
- detect language of affiliation strings
- structure everything into a stable, user-friendly schema
## 🚀 Quick start

### Installation

```shell
git clone https://github.com/sirisacademic/affilgood.git
cd affilgood
pip install -e ".[all]"
```

🐍 Python ≥ 3.10 recommended
### Download data files

AffilGood requires pre-built data files (ROR registry, FAISS index, NUTS shapefiles) that are too large for the git repository. They are hosted as a GitHub Release asset.

Automatic (recommended):

```shell
python setup_data.py
```

Manual:

- Download `affilgood-data-v2.0.0.zip` from HuggingFace
- Extract into the repo root:

```shell
unzip affilgood-data-v2.0.0.zip -d .
```

Verify:

```shell
python setup_data.py  # will report ✓ for each file if already extracted
```
The data files include:
| File | Size | Description |
|---|---|---|
| ror_records.jsonl | ~80 MB | ROR registry (active + inactive records) |
| faiss.index | ~200 MB | Pre-built HNSW index (1024-dim, inner product) |
| faiss_ids.json | ~10 MB | Record IDs for each index vector |
| faiss_texts.json | ~40 MB | Indexed text variants |
| NUTS shapefiles | ~5 MB | EU NUTS region boundaries |
### Basic usage

```python
from affilgood import AffilGood

ag = AffilGood()
result = ag.process("Universitat Autònoma de Barcelona, Spain")
print(result)
```
### Recommended configuration (best accuracy)

```python
from affilgood import AffilGood

ag = AffilGood(
    enable_entity_linking=True,
    linking_config={
        "reranker": None,  # retrieval-only (Acc@1=0.905)
        "threshold": 0.5,
    },
    enable_language_detect=True,
    language_config={"method": "combined_langdetect"},
    enable_normalization=True,
    add_nuts=True,
    verbose=True,
)

result = ag.process("SELMET, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France")
```
## 🧩 Pipeline overview

AffilGood runs a defensive, modular pipeline with seven stages:

```
Input → Span → Language → Translation → NER → Entity Linking → Geocoding → Output
```
| Stage | Description | Default |
|---|---|---|
| 1. Span identification | Splits multi-affiliation strings | Always on |
| 2. Language detection | Detects language of each span | Off; enable with `enable_language_detect=True` |
| 3. Translation | Translates non-Latin scripts to English | Off; enable with `translate_config={...}` |
| 4. NER | Extracts ORG, SUBORG, SUB, CITY, COUNTRY | Always on |
| 5. Entity linking | Links ORG/SUBORG to ROR registry | Enabled with `enable_entity_linking=True` |
| 6. Geocoding | Resolves locations via OSM Nominatim | Off; enable with `enable_normalization=True` |
| 6b. ROR→Geocode feedback | Fills missing locations from ROR data | Automatic when both EL and geocoding are enabled |
### Design guarantees

Each stage is optional, never crashes the pipeline, never deletes previous results, and operates on a shared, well-defined internal schema.
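The "never crashes, never deletes" guarantee can be pictured with a small wrapper. This is an illustrative sketch only: `run_stage` and the toy language stage are not AffilGood's actual internals.

```python
# Hypothetical sketch of the defensive-stage pattern: a failing stage
# logs a warning and leaves its input untouched, so earlier results
# always survive and the pipeline never crashes.
import logging

logger = logging.getLogger("pipeline")

def run_stage(stage_fn, records, stage_name):
    """Apply one optional stage; on any error, keep previous results."""
    try:
        updated = stage_fn(records)
        # A stage may only add keys, never remove earlier ones.
        for before, after in zip(records, updated):
            assert set(before) <= set(after), "stage removed fields"
        return updated
    except Exception as exc:
        logger.warning("Stage %s failed (%s); output unchanged.", stage_name, exc)
        return records

def detect_language(records):
    # Toy stage: annotate each record with a language guess.
    return [{**r, "language": "en"} for r in records]

records = [{"raw_text": "Universitat Autònoma de Barcelona, Spain"}]
records = run_stage(detect_language, records, "language_detection")
```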
## 🔗 Entity linking

Entity linking matches NER-extracted organizations against ROR (the Research Organization Registry) using a three-stage cascade:

### Stage 1 — Direct match

Exact name + country lookup against all ROR names, aliases, acronyms, and labels. Handles ~35% of entities at ~98% precision with negligible latency.
Features:
- Unicode-safe normalization — "Selçuk Üniversitesi" + "TÜRKİYE" matches correctly (Turkish İ, accents)
- Inactive record resolution — INRA (withdrawn) automatically resolves to its successor INRAE (active)
- Acronym support — "CNRS" + "France" resolves directly when unambiguous
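A minimal sketch of the Unicode-safe normalization idea behind the direct match. The `normalize` function and its exact rules are illustrative, not AffilGood's implementation; it shows why "Selçuk Üniversitesi" and "TÜRKİYE" can collide with their plain-ASCII forms.

```python
# Accent- and case-insensitive lookup keys via Unicode decomposition.
import unicodedata

def normalize(text: str) -> str:
    """Build a lookup key that ignores case, accents, and Turkish I forms."""
    # Map Turkish dotted/dotless I first: casefold() alone turns "İ"
    # into "i" plus a combining dot, which would not match plain "i".
    text = text.replace("İ", "i").replace("ı", "i")
    text = text.casefold()
    # Decompose accented characters and drop combining marks, so
    # "Selçuk Üniversitesi" and "Selcuk Universitesi" produce one key.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

assert normalize("Selçuk Üniversitesi") == normalize("SELCUK UNIVERSITESI")
assert normalize("TÜRKİYE") == normalize("turkiye")
```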
### Stage 2 — Dense retrieval

FAISS HNSW index with the SIRIS-Lab/affilgood-dense-retriever encoder (1024-dim XLM-RoBERTa). Queries use structured tokens matching the encoder's training format:

```
[MENTION] Univ Montpellier [CITY] Montpellier [COUNTRY] France
```
Key feature: multi-variant queries — each entity generates 2–4 geographic variants (ORG+CITY+COUNTRY, ORG+COUNTRY, ORG only) and results are merged by max score. This is critical for R@1=0.905.
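The multi-variant strategy can be sketched as follows. `build_variants`, `link`, and the `search` stand-in are hypothetical names for illustration; the real FAISS lookup and variant set may differ.

```python
# Each entity is rendered as several geographic query variants; every
# variant is scored against the index, and per-candidate scores are
# merged by max, so one well-matching variant is enough.
def build_variants(org, city=None, country=None):
    """Structured queries in the encoder's training format."""
    variants = []
    if city and country:
        variants.append(f"[MENTION] {org} [CITY] {city} [COUNTRY] {country}")
    if country:
        variants.append(f"[MENTION] {org} [COUNTRY] {country}")
    variants.append(f"[MENTION] {org}")
    return variants

def link(org, city, country, search):
    """search(query) -> [(ror_id, score), ...]; merge by max score."""
    best = {}
    for query in build_variants(org, city, country):
        for ror_id, score in search(query):
            best[ror_id] = max(score, best.get(ror_id, float("-inf")))
    return max(best.items(), key=lambda kv: kv[1]) if best else None

# Toy index stand-in: scores depend on whether the variant carries a city.
def fake_search(query):
    return [("ror1", 0.9 if "[CITY]" in query else 0.4), ("ror2", 0.5)]
```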
### Stage 3 — LLM judge (optional)
For low-confidence results, a small instruction-following LLM sees all candidates simultaneously and picks the best match. Uses first-token logit scoring (one forward pass, no generation). Handles acronym confusion, same-name disambiguation, and complex affiliation chains.
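First-token logit scoring can be illustrated with plain arithmetic. The function below is a hypothetical sketch, assuming a prompt that lists candidates as "A) ... B) ... C) ..." and asks for a single letter: instead of generating text, one forward pass is run and only the logits of the option-letter tokens at the first output position are compared.

```python
# One forward pass yields both a choice and a confidence: softmax over
# the option-letter logits gives a probability distribution restricted
# to the candidates.
import math

def judge_from_logits(first_token_logits, letter_token_ids):
    """first_token_logits: {token_id: logit} at the next-token position.
    letter_token_ids: {"A": id, "B": id, ...} for the option letters."""
    scores = {letter: first_token_logits[tid]
              for letter, tid in letter_token_ids.items()}
    # Numerically stable softmax restricted to the option letters.
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    probs = {k: v / total for k, v in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]
```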
### Optional: Cross-encoder reranking with score fusion

A cross-encoder reranker can be added between retrieval and final selection. Retrieval and reranker scores are fused (`alpha * retrieval + (1 - alpha) * reranker`) to prevent the reranker from overriding correct retriever results.
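A sketch of that fusion rule, with illustrative candidate scores: a confident retriever result survives a reranker that prefers a different candidate.

```python
# Linear score fusion between retriever and reranker.
def fuse(retrieval_score, reranker_score, alpha=0.5):
    """alpha=1.0 trusts the retriever only, alpha=0.0 the reranker only."""
    return alpha * retrieval_score + (1 - alpha) * reranker_score

# Correct candidate: strong retrieval, weak reranker score.
correct = fuse(0.9, 0.4)  # ≈ 0.65
# Wrong candidate the reranker prefers: weak retrieval, strong reranker.
wrong = fuse(0.3, 0.8)    # ≈ 0.55
assert correct > wrong    # fusion keeps the retriever's answer
```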
## ⚙️ Configuration guide

### Minimal (NER only, no linking)

```python
ag = AffilGood()
```
### With entity linking (recommended)

```python
ag = AffilGood(
    enable_entity_linking=True,
    linking_config={
        "reranker": None,  # retrieval-only mode
        "threshold": 0.5,  # cosine similarity threshold
    },
)
```
### With geocoding and NUTS regions

```python
ag = AffilGood(
    enable_entity_linking=True,
    linking_config={
        "reranker": None,
        "threshold": 0.5,
    },
    enable_normalization=True,
    add_nuts=True,
)
```
### With language detection

```python
ag = AffilGood(
    enable_language_detect=True,
    language_config={"method": "combined_langdetect"},
    enable_entity_linking=True,
    linking_config={"reranker": None, "threshold": 0.5},
    enable_normalization=True,
    add_nuts=True,
    verbose=True,
)
```
### With non-Latin script translation

```python
ag = AffilGood(
    enable_language_detect=True,
    language_config={"method": "combined_langdetect"},
    translate_config={
        "model_name": "Qwen/Qwen2.5-0.5B-Instruct",  # ~1GB
        "device": "cpu",
    },
    enable_entity_linking=True,
    linking_config={"reranker": None, "threshold": 0.5},
    enable_normalization=True,
    verbose=True,
)

# Chinese affiliation → translated → NER → linked → geocoded
result = ag.process("清华大学计算机科学与技术系, 北京, 中国")
```
Translation auto-detects and only activates for non-Latin scripts: Chinese, Japanese, Korean, Arabic, Russian, Persian, Greek, Thai, Hindi, Ukrainian, and more.
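The gating idea can be sketched in a few lines. This is an illustrative stand-in for AffilGood's detector, not its actual code; it checks whether any alphabetic character falls outside the Latin script.

```python
# Translation only needs to fire for non-Latin input. unicodedata.name()
# exposes the script in the character name ("CJK ...", "CYRILLIC ...").
import unicodedata

def needs_translation(text: str) -> bool:
    """True if any alphabetic character belongs to a non-Latin script."""
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if not name.startswith("LATIN"):
                return True
    return False

assert needs_translation("清华大学计算机科学与技术系, 北京, 中国")
# Accented Latin characters ("Autònoma") stay in the Latin script:
assert not needs_translation("Universitat Autònoma de Barcelona, Spain")
```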
### With cross-encoder reranking + score fusion

```python
ag = AffilGood(
    enable_entity_linking=True,
    linking_config={
        "reranker": "cross_encoder",
        "reranker_model": "cometadata/jina-reranker-v2-multilingual-affiliations-v5",
        "score_fusion_alpha": 0.5,  # 0=reranker only, 1=retriever only
        "threshold": 0.5,
    },
)
```
### With LLM judge for hard cases

```python
ag = AffilGood(
    enable_entity_linking=True,
    linking_config={
        "reranker": None,
        "threshold": 0.5,
        "llm_judge": "Qwen/Qwen2.5-0.5B-Instruct",  # ~1GB, or 3B for better accuracy
        "llm_threshold": 0.7,  # invoke LLM when retrieval score < 0.7
    },
)
```
### Full configuration (all features)

```python
data_dir = "entity_linking/data"  # adjust to where you extracted the data files

ag = AffilGood(
    enable_entity_linking=True,
    device="cpu",
    linking_config={
        "data_dir": str(data_dir),
        "encoder_model": "SIRIS-Lab/affilgood-dense-retriever",
        "threshold": 0.038,
        "reranker": "cross_encoder",
        "reranker_model": "cometadata/jina-reranker-v2-multilingual-affiliations-large",
        "reranker_threshold": 0.038,
        "llm_judge": "Qwen/Qwen2.5-1.5B-Instruct",
        "llm_threshold": 0.3,
    },
    enable_language_detect=True,
    language_config={"method": "combined_langdetect"},
    verbose=True,
    enable_normalization=True,
    add_nuts=True,
)
```
### Custom data directory (pre-built index)

```python
linking_config={
    "data_dir": "/path/to/entity_linking/data",
    ...
}
```
## 📤 Output schema

### Normalized output (default)

```python
result = ag.process("SELMET, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France")
```

```json
{
  "raw_text": "SELMET, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France",
  "outputs": [
    {
      "input": "SELMET, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France",
      "institutions": [
        {
```