AffilGood

AffilGood provides annotated datasets and tools to improve the accuracy of attributing scientific works to research organizations, especially in multilingual and complex contexts.

Install / Use

/learn @sirisacademic/Affilgood

README

AffilGood 🕺🏾

AffilGood is a Python library for extracting and structuring research institution information from raw affiliation strings (e.g. those found in scientific publications, project beneficiaries, or metadata dumps).

It is designed to work in real-world, multilingual, noisy settings, while remaining:

  • 🧩 modular
  • 🛡️ defensive
  • 🧪 fully testable
  • 🔌 easy to extend

AffilGood focuses on stable output semantics: regardless of which internal components are enabled, the public output schema remains consistent.

AffilGood Pipeline

📄 Publication

This repository accompanies the paper "AffilGood: Building reliable institution name disambiguation tools to improve scientific literature analysis", published at the Scholarly Document Processing (SDP) Workshop @ ACL 2024.

  • Paper: https://aclanthology.org/2024.sdp-1.13/
  • Slides: https://docs.google.com/presentation/d/1wX7zInjoUrjO1hRL3U8tpSzxU6KOX0FknTaEqSf6ML0


✨ What AffilGood does

Given an affiliation string like:

SELMET, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France

AffilGood can:

  • detect institutions (ORG), sub-organizations (SUBORG), and subunits (SUB) via NER
  • link institutions to registries (ROR) using a three-stage cascade pipeline
  • translate non-Latin scripts (Chinese, Japanese, Arabic, Russian, etc.) before processing
  • enrich results with geolocation (city, country, NUTS regions, coordinates)
  • fill missing locations from ROR data when NER misses geographic entities
  • detect language of affiliation strings
  • structure everything into a stable, user-friendly schema

🚀 Quick start

Installation

git clone https://github.com/sirisacademic/affilgood.git
cd affilgood
pip install -e ".[all]"

🐍 Python ≥ 3.10 recommended

Download data files

AffilGood requires pre-built data files (ROR registry, FAISS index, NUTS shapefiles) that are too large to keep in the git repository, so they are distributed as a separate downloadable archive.

Automatic (recommended):

python setup_data.py

Manual:

  1. Download affilgood-data-v2.0.0.zip from HuggingFace
  2. Extract into the repo root:
unzip affilgood-data-v2.0.0.zip -d .

Verify:

python setup_data.py  # will report ✓ for each file if already extracted

The data files include:

| File | Size | Description |
|---|---|---|
| ror_records.jsonl | ~80 MB | ROR registry (active + inactive records) |
| faiss.index | ~200 MB | Pre-built HNSW index (1024-dim, inner product) |
| faiss_ids.json | ~10 MB | Record IDs for each index vector |
| faiss_texts.json | ~40 MB | Indexed text variants |
| NUTS shapefiles | ~5 MB | EU NUTS region boundaries |


Basic usage

from affilgood import AffilGood

ag = AffilGood()
result = ag.process("Universitat Autònoma de Barcelona, Spain")
print(result)
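To get a feel for the result, here is a hedged sketch of walking the nested structure shown in the "Output schema" section. The field names (`outputs`, `institutions`, `name`) follow that section, but the exact nesting may vary by version, and `sample` below is a hand-written stand-in, not real AffilGood output.

```python
def institution_names(result):
    """Collect institution names from an AffilGood-style result dict."""
    names = []
    for span in result.get("outputs", []):
        for inst in span.get("institutions", []):
            names.append(inst.get("name"))
    return names

# Hand-written stand-in mirroring the documented schema shape
sample = {
    "raw_text": "Universitat Autònoma de Barcelona, Spain",
    "outputs": [
        {
            "input": "Universitat Autònoma de Barcelona, Spain",
            "institutions": [{"name": "Universitat Autònoma de Barcelona"}],
        }
    ],
}

print(institution_names(sample))
```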

Recommended configuration (best accuracy)

from affilgood import AffilGood

ag = AffilGood(
    enable_entity_linking=True,
    linking_config={
        "reranker": None,         # retrieval-only (Acc@1=0.905)
        "threshold": 0.5,
    },
    enable_language_detect=True,
    language_config={"method": "combined_langdetect"},
    enable_normalization=True,
    add_nuts=True,
    verbose=True,
)

result = ag.process("SELMET, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France")

🧩 Pipeline overview

AffilGood runs a defensive, modular pipeline with seven stages:

Input → Span → Language → Translation → NER → Entity Linking → Geocoding → Output

| Stage | Description | Default |
|---|---|---|
| 1. Span identification | Splits multi-affiliation strings | Always on |
| 2. Language detection | Detects language of each span | Off (enable_language_detect=True) |
| 3. Translation | Translates non-Latin scripts to English | Off (translate_config={...}) |
| 4. NER | Extracts ORG, SUBORG, SUB, CITY, COUNTRY | Always on |
| 5. Entity linking | Links ORG/SUBORG to ROR registry | On (enable_entity_linking=True) |
| 6. Geocoding | Resolves locations via OSM Nominatim | Off (enable_normalization=True) |
| 6b. ROR→Geocode feedback | Fills missing locations from ROR data | Automatic when both EL and geocoding are enabled |

Design guarantees

Each stage is optional, never crashes the pipeline, never deletes previous results, and operates on a shared, well-defined internal schema.
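These guarantees can be pictured with a toy pipeline runner. This is an illustrative sketch of the pattern, not the library's internals: a stage that raises is logged and skipped, and the record it received survives untouched.

```python
import logging

def run_pipeline(record, stages):
    """Run named stages defensively: a failing stage is logged and skipped."""
    for name, stage in stages:
        try:
            # work on a copy so a half-finished stage can't corrupt the record
            record = stage(dict(record))
        except Exception:
            logging.exception("stage %r failed; keeping previous results", name)
    return record

def ner_stage(rec):
    rec["entities"] = [{"text": "CIRAD", "type": "ORG"}]
    return rec

def broken_stage(rec):
    raise RuntimeError("model not available")

out = run_pipeline({"raw_text": "CIRAD, France"},
                   [("ner", ner_stage), ("geo", broken_stage)])
# `out` still holds raw_text and the NER entities despite the geo failure
```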


🔗 Entity linking

Entity linking matches NER-extracted organizations against the ROR (Research Organization Registry) using a three-stage cascade:

Stage 1 — Direct match

Exact name + country lookup against all ROR names, aliases, acronyms, and labels. Handles ~35% of entities at ~98% precision with zero latency.

Features:

  • Unicode-safe normalization — "Selçuk Üniversitesi" + "TÜRKİYE" matches correctly (Turkish İ, accents)
  • Inactive record resolution — INRA (withdrawn) automatically resolves to its successor INRAE (active)
  • Acronym support — "CNRS" + "France" resolves directly when unambiguous
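A minimal sketch of the kind of Unicode-safe lookup key this stage relies on (illustrative, not AffilGood's actual code; the registry value below is a placeholder, not a real ROR ID):

```python
import unicodedata

def norm_key(name: str, country: str) -> tuple[str, str]:
    def norm(s: str) -> str:
        # NFKD-decompose, strip combining marks (accents, the dot of the
        # Turkish dotted İ), then casefold for case-insensitive matching
        s = unicodedata.normalize("NFKD", s)
        s = "".join(c for c in s if not unicodedata.combining(c))
        return s.casefold().strip()
    return norm(name), norm(country)

# placeholder registry entry (not a real ROR ID)
index = {norm_key("Selçuk Üniversitesi", "Türkiye"): "ror:PLACEHOLDER"}

# differently-cased, differently-accented query still hits the same key
hit = index.get(norm_key("SELCUK UNIVERSITESI", "TÜRKİYE"))
```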

Stage 2 — Dense retrieval

FAISS HNSW index with the SIRIS-Lab/affilgood-dense-retriever encoder (1024-dim XLM-RoBERTa). Queries use structured tokens matching the encoder's training format:

[MENTION] Univ Montpellier [CITY] Montpellier [COUNTRY] France

Key feature: multi-variant queries — each entity generates 2–4 geographic variants (ORG+CITY+COUNTRY, ORG+COUNTRY, ORG only) and results are merged by max score. This is critical for R@1=0.905.
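The variant expansion and max-merge can be sketched as follows (behavior assumed from the description above; function names are illustrative):

```python
def build_variants(org, city=None, country=None):
    """Expand one entity into 2-4 structured query variants."""
    variants = []
    if city and country:
        variants.append(f"[MENTION] {org} [CITY] {city} [COUNTRY] {country}")
    if country:
        variants.append(f"[MENTION] {org} [COUNTRY] {country}")
    variants.append(f"[MENTION] {org}")
    return variants

def merge_by_max(results_per_variant):
    """Merge per-variant candidate scores, keeping each candidate's max."""
    merged = {}
    for scores in results_per_variant:
        for cid, score in scores.items():
            merged[cid] = max(score, merged.get(cid, float("-inf")))
    return merged
```

Taking the max means a candidate only needs to score well on one variant; a variant whose extra geographic tokens confuse the encoder cannot drag a good match down.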

Stage 3 — LLM judge (optional)

For low-confidence results, a small instruction-following LLM sees all candidates simultaneously and picks the best match. Uses first-token logit scoring (one forward pass, no generation). Handles acronym confusion, same-name disambiguation, and complex affiliation chains.
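First-token logit scoring can be illustrated with a toy example. The logits and vocabulary IDs below are fake; a real judge would take them from one forward pass of the LLM over a prompt that lists the candidates as A, B, C, ...

```python
import math

def pick_candidate(next_token_logits, label_token_ids):
    """Score each candidate by its label's first-token logit; a softmax
    over just the label tokens yields a confidence alongside the winner."""
    scores = {lab: next_token_logits[tid]
              for lab, tid in label_token_ids.items()}
    z = sum(math.exp(v) for v in scores.values())
    probs = {lab: math.exp(v) / z for lab, v in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]

fake_logits = [0.0] * 10
fake_logits[3], fake_logits[7] = 2.0, 0.5   # pretend ids 3/7 encode "A"/"B"
best, conf = pick_candidate(fake_logits, {"A": 3, "B": 7})
```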

Optional: Cross-encoder reranking with score fusion

A cross-encoder reranker can be added between retrieval and final selection. Retrieval and reranker scores are fused (alpha * retrieval + (1-alpha) * reranker) to prevent the reranker from overriding correct retriever results.
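The fusion rule reads directly as code (a one-line sketch; parameter names are illustrative):

```python
def fuse(retrieval_score, reranker_score, alpha=0.5):
    """alpha=1.0 trusts the retriever alone; alpha=0.0 the reranker alone."""
    return alpha * retrieval_score + (1 - alpha) * reranker_score
```

With alpha = 0.5 the reranker can still demote a candidate, but only in proportion to how strongly it disagrees with the retriever, which is what keeps it from overriding correct retrieval results.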


⚙️ Configuration guide

Minimal (NER only, no linking)

ag = AffilGood()

With entity linking (recommended)

ag = AffilGood(
    enable_entity_linking=True,
    linking_config={
        "reranker": None,       # retrieval-only mode
        "threshold": 0.5,       # cosine similarity threshold
    },
)

With geocoding and NUTS regions

ag = AffilGood(
    enable_entity_linking=True,
    linking_config={
        "reranker": None,
        "threshold": 0.5,
    },
    enable_normalization=True,
    add_nuts=True,
)

With language detection

ag = AffilGood(
    enable_language_detect=True,
    language_config={"method": "combined_langdetect"},
    enable_entity_linking=True,
    linking_config={"reranker": None, "threshold": 0.5},
    enable_normalization=True,
    add_nuts=True,
    verbose=True,
)

With non-Latin script translation

ag = AffilGood(
    enable_language_detect=True,
    language_config={"method": "combined_langdetect"},
    translate_config={
        "model_name": "Qwen/Qwen2.5-0.5B-Instruct",   # ~1GB
        "device": "cpu",
    },
    enable_entity_linking=True,
    linking_config={"reranker": None, "threshold": 0.5},
    enable_normalization=True,
    verbose=True,
)

# Chinese affiliation → translated → NER → linked → geocoded
result = ag.process("清华大学计算机科学与技术系, 北京, 中国")

Translation auto-detects and only activates for non-Latin scripts: Chinese, Japanese, Korean, Arabic, Russian, Persian, Greek, Thai, Hindi, Ukrainian, and more.
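One way such script detection could work is via character names from `unicodedata` (an assumed heuristic for illustration; AffilGood's actual detector may differ):

```python
import unicodedata

def has_non_latin_script(text: str, threshold: float = 0.3) -> bool:
    """True when at least `threshold` of the letters are non-Latin."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    non_latin = sum(1 for c in letters
                    if "LATIN" not in unicodedata.name(c, ""))
    return non_latin / len(letters) >= threshold

has_non_latin_script("清华大学, 北京")            # True  → translation activates
has_non_latin_script("Univ Montpellier, France")  # False → translation skipped
```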

With cross-encoder reranking + score fusion

ag = AffilGood(
    enable_entity_linking=True,
    linking_config={
        "reranker": "cross_encoder",
        "reranker_model": "cometadata/jina-reranker-v2-multilingual-affiliations-v5",
        "score_fusion_alpha": 0.5,   # 0=reranker only, 1=retriever only
        "threshold": 0.5,
    },
)

With LLM judge for hard cases

ag = AffilGood(
    enable_entity_linking=True,
    linking_config={
        "reranker": None,
        "threshold": 0.5,
        "llm_judge": "Qwen/Qwen2.5-0.5B-Instruct",   # ~1GB, or 3B for better accuracy
        "llm_threshold": 0.7,   # invoke LLM when retrieval score < 0.7
    },
)

Full configuration (all features)


ag = AffilGood(
    enable_entity_linking=True,
    device="cpu",
    linking_config={
        "data_dir": str(data_dir),
        "encoder_model": "SIRIS-Lab/affilgood-dense-retriever",
        "threshold": 0.038,
        "reranker": "cross_encoder",
        "reranker_model": "cometadata/jina-reranker-v2-multilingual-affiliations-large",
        "reranker_threshold": 0.038,
        "llm_judge": "Qwen/Qwen2.5-1.5B-Instruct",
        "llm_threshold": 0.3,
    },
    enable_language_detect=True,
    language_config={"method": "combined_langdetect"},
    verbose=True,
    enable_normalization=True,
    add_nuts=True,
)

Custom data directory (pre-built index)

linking_config={
    "data_dir": "/path/to/entity_linking/data",
    ...
}

📤 Output schema

Normalized output (default)

result = ag.process("SELMET, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France")
{
  "raw_text": "SELMET, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France",
  "outputs": [
    {
      "input": "SELMET, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France",
      "institutions": [
        {
          ...
        }
      ]
    }
  ]
}
