Spacyfishing

A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata

<img src="./docs/spacyfishing-logo.png" width="200" align="right">

spaCy fishing

Python Version PyPI version License: MIT Tests Built with spaCy

A spaCy wrapper for entity-fishing, a tool for named entity recognition, linking and disambiguation against Wikidata.

This extension lets you use entity-fishing as a spaCy pipeline component to disambiguate named entities (recognized by custom or pretrained spaCy NER models) and link them to the Wikidata knowledge base (KB).

Installation

normal

pip install spacyfishing

development

git clone https://github.com/Lucaterre/spacyfishing.git
cd spacyfishing
virtualenv --python=/usr/bin/python3.8 venv
source venv/bin/activate
pip install -r requirements_dev.txt

Usage

First, install a pre-trained spaCy language model for the NER task:

python -m spacy download en_core_web_sm

Note that it is possible to use custom NER models.

Simple example

import spacy

text_en = "Victor Hugo and Honoré de Balzac are French writers who lived in Paris."

nlp_model_en = spacy.load("en_core_web_sm")

nlp_model_en.add_pipe("entityfishing")

doc_en = nlp_model_en(text_en)

for ent in doc_en.ents:
    print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))
('Victor Hugo', 'PERSON', 'Q535', 'https://www.wikidata.org/wiki/Q535', 0.972)
('Honoré de Balzac', 'PERSON', 'Q9711', 'https://www.wikidata.org/wiki/Q9711', 0.9724)
('French', 'NORP', 'Q121842', 'https://www.wikidata.org/wiki/Q121842', 0.3739)
('Paris', 'GPE', 'Q90', 'https://www.wikidata.org/wiki/Q90', 0.5652)
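
The score attached to each entity can be used to keep only the more confident links. The following is a minimal sketch, not part of the library: `confident_links` and `fake_ent` are hypothetical helper names, the 0.5 threshold is arbitrary, and the code assumes that an entity without a link carries no score (`None`). The stand-in objects only mimic the `ent.text` and `ent._.…` attributes shown above so that the snippet runs without a model; in practice you would pass `doc_en.ents`.

```python
from types import SimpleNamespace


def confident_links(ents, min_score=0.5):
    """Return (text, QID) pairs for entities scored at or above min_score."""
    links = []
    for ent in ents:
        score = ent._.nerd_score
        # Assumption: an entity that was not linked has no score (None).
        if score is not None and score >= min_score:
            links.append((ent.text, ent._.kb_qid))
    return links


def fake_ent(text, qid, score):
    """Hypothetical stand-in for a spaCy Span, for illustration only."""
    return SimpleNamespace(text=text, _=SimpleNamespace(kb_qid=qid, nerd_score=score))


# Values copied from the output above.
sample_ents = [
    fake_ent("Victor Hugo", "Q535", 0.972),
    fake_ent("Honoré de Balzac", "Q9711", 0.9724),
    fake_ent("French", "Q121842", 0.3739),
    fake_ent("Paris", "Q90", 0.5652),
]

print(confident_links(sample_ents, min_score=0.5))
# [('Victor Hugo', 'Q535'), ('Honoré de Balzac', 'Q9711'), ('Paris', 'Q90')]
```

With this threshold, 'French' (0.3739) is dropped while 'Paris' (0.5652) is kept; a suitable cut-off depends on your data.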

Batching example

import spacy

texts_en = [
  "Victor Hugo and Honoré de Balzac are French writers who lived in Paris.",
  "Momofuko Ando is Taiwanese Japanese Business Magnate that invented instant ramen."
]

nlp_model_en = spacy.load("en_core_web_sm")

nlp_model_en.add_pipe("entityfishing")

# set number of documents to be processed at once via batch_size
docs_en = nlp_model_en.pipe(texts_en, batch_size=128)

for doc_en in docs_en:
    for ent in doc_en.ents:
        print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))
('Victor Hugo', 'PERSON', 'Q535', 'https://www.wikidata.org/wiki/Q535', 0.972)
('Honoré de Balzac', 'PERSON', 'Q9711', 'https://www.wikidata.org/wiki/Q9711', 0.9724)
('French', 'NORP', 'Q121842', 'https://www.wikidata.org/wiki/Q121842', 0.3739)
('Paris', 'GPE', 'Q90', 'https://www.wikidata.org/wiki/Q90', 0.5652)
('Momofuko Ando', 'PERSON', 'Q317858', 'https://www.wikidata.org/wiki/Q317858', 0.4598)
('Taiwanese', 'NORP', 'Q707908', 'https://www.wikidata.org/wiki/Q707908', 0.5424)
('Japanese', 'NORP', 'Q188712', 'https://www.wikidata.org/wiki/Q188712', 0.4956)
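
When many documents are processed in a batch, it can be handy to group the surface forms by the Wikidata item they were linked to. The sketch below is an illustration only: `mentions_by_qid` is a hypothetical helper, and it works on tuples shaped like the ones printed above (a subset of those rows is copied in as sample data).

```python
from collections import defaultdict


def mentions_by_qid(rows):
    """Group entity texts by QID from (text, label, qid, url, score) tuples."""
    grouped = defaultdict(list)
    for text, _label, qid, _url, _score in rows:
        # Skip rows without a QID (assumed to mean the entity was not linked).
        if qid is not None:
            grouped[qid].append(text)
    return dict(grouped)


# A few rows copied from the output above.
rows = [
    ("Victor Hugo", "PERSON", "Q535", "https://www.wikidata.org/wiki/Q535", 0.972),
    ("Paris", "GPE", "Q90", "https://www.wikidata.org/wiki/Q90", 0.5652),
    ("Momofuko Ando", "PERSON", "Q317858", "https://www.wikidata.org/wiki/Q317858", 0.4598),
]

print(mentions_by_qid(rows))
# {'Q535': ['Victor Hugo'], 'Q90': ['Paris'], 'Q317858': ['Momofuko Ando']}
```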

Get extra information from Wikidata

By default, as shown above, the component attaches only the QID, the Wikidata URL, and the score to each span. It can also retrieve additional information, such as a short description of the entity, a normalized term, and identifiers from other knowledge bases linked to the Wikidata concept (for example, the GeoNames ID or the VIAF ID).

To access this extra information about a Wikidata entity, set the extra_info parameter to True in the component configuration:


import spacy

text_en = "Victor Hugo and Honoré de Balzac are French writers who lived in Paris."

nlp_model_en = spacy.load("en_core_web_sm")

# specify configuration:
nlp_model_en.add_pipe("entityfishing", config={"extra_info": True})

doc_en = nlp_model_en(text_en)

# Access the description with ent._.description:
for ent in doc_en.ents:
    print((ent.text, ent.label_, ent._.kb_qid, ent._.normal_term, ent._.description, ent._.src_description, ent._.other_ids))
('Victor Hugo', 'PERSON', 'Q535', 'Victor Hugo', "'''''' (; 26 February 1802 – 22 May 1885) was a French poet, novelist, and dramatist of the [[Romanticism|Romantic movement]]. Hugo is considered to be one of the greatest and best-known French writers. Outside of France, his most famous works are the novels '''', 1862, and ''[[The Hunchback of Notre-Dame]]'', 1831. In France, Hugo is known primarily for his poetry collections, such as '''' (''The Contemplations'') and '''' (''The Legend of the Ages'').", 'wikipedia-en', [{'propertyName': 'Sycomore ID', 'propertyId': 'P1045', 'value': '8795'}, {'propertyName': 'image', 'propertyId': 'P18', 'value': 'Victor Hugo.jpg'}, {'propertyName': 'signature', 'propertyId': 'P109', 'value': 'Victor Hugo Signature.svg'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q82955'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q214917'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q6625963'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q15296811'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q8178443'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q11774202'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q11774156'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q36180'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q644687'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q3579035'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q49757'}, {'propertyName': 'country of citizenship', 'propertyId': 'P27', 'value': 'Q142'}, {'propertyName': 'child', 'propertyId': 'P40', 'value': 'Q3271923'}, {'propertyName': 'child', 'propertyId': 'P40', 'value': 'Q2082427'}, {'propertyName': 'child', 'propertyId': 'P40', 'value': 'Q663856'}, {'propertyName': 'child', 'propertyId': 'P40', 'value': 'Q3083678'}, {'propertyName': 'father', 'propertyId': 'P22', 
'value': 'Q2299673'}, {'propertyName': 'mother', 'propertyId': 'P25', 'value': 'Q3491058'}, {'propertyName': 'spouse', 'propertyId': 'P26', 'value': 'Q2825429'}, {'propertyName': 'place of birth', 'propertyId': 'P19', 'value': 'Q37776'}, {'propertyName': 'place of interment', 'propertyId': 'P119', 'value': 'Q188856'}, {'propertyName': 'sex or gender', 'propertyId': 'P21', 'value': 'Q6581097'}, {'propertyName': 'VIAF ID', 'propertyId': 'P214', 'value': '9847974'}, {'propertyName': 'BnF ID', 'propertyId': 'P268', 'value': '11907966z'}, {'propertyName': 'GND ID', 'propertyId': 'P227', 'value': '118554654'}, {'propertyName': 'Commons category', 'propertyId': 'P373', 'value': 'Victor Hugo'}, {'propertyName': 'Library of Congress authority ID', 'propertyId': 'P244', 'value': 'n79091479'}, {'propertyName': 'place of death', 'propertyId': 'P20', 'value': 'Q90'}, {'propertyName': 'MusicBrainz artist ID', 'propertyId': 'P434', 'value': 'c0c99c8f-4779-4c35-9497-67d60a73310a'}, {'propertyName': 'unmarried partner', 'propertyId': 'P451', 'value': 'Q440119'}, {'propertyName': 'unmarried partner', 'propertyId': 'P451', 'value': 'Q3271708'}, {'propertyName': 'member of', 'propertyId': 'P463', 'value': 'Q161806'}, {'propertyName': 'member of', 'propertyId': 'P463', 'value': 'Q12759592'}, {'propertyName': 'member of', 'propertyId': 'P463', 'value': 'Q2822385'}, {'propertyName': 'NDL Auth ID', 'propertyId': 'P349', 'value': '00443985'}, {'propertyName': 'SUDOC authorities', 'propertyId': 'P269', 'value': '026927608'}, {'propertyName': 'date of death', 'propertyId': 'P570', 'value': {'time': '+1885-05-22T00:00:00Z', 'timezone': 0, 'before': 0, 'after': 0, 'precision': 11, 'calendarmodel': 'http://www.wikidata.org/entity/Q1985727'}}, {'propertyName': 'date of birth', 'propertyId': 'P569', 'value': {'time': '+1802-02-26T00:00:00Z', 'timezone': 0, 'before': 0, 'after': 0, 'precision': 11, 'calendarmodel': 'http://www.wikidata.org/entity/Q1985727'}}, {'propertyName': 'NKCR AUT ID', 
'propertyId': 'P691', 'value': 'jn19990003739'}, {'propertyName': 'given name', 'propertyId': 'P735', 'value': 'Q539581'}, {'propertyName': 'given name', 'propertyId': 'P735', 'value': 'Q632104'}, {'propertyName': "topic's main category", 'propertyId': 'P910', 'value': 'Q7367470'}, {'propertyName': 'educated at', 'propertyId': 'P69', 'value': 'Q209842'}, {'propertyName': 'educated at', 'propertyId': 'P69', 'value': 'Q1059546'}, {'propertyName': 'ISNI', 'propertyId': 'P213', 'value': '0000 0001 2120 0982'}, {'propertyName': 'ULAN ID', 'propertyId': 'P245', 'value': '500032572'}, {'propertyName': 'instance of', 'propertyId': 'P31', 'value': 'Q5'}, {'propertyName': 'SELIBR', 'propertyId': 'P906', 'value': '206651'}, {'propertyName': 'NLA (Australia) ID', 'propertyId': 'P409', 'value': '35212404'}, {'propertyName': 'BNE ID', 'propertyId': 'P950', 'value': 'XX874892'}, {'propertyName': 'BAV ID', 'propertyId': 'P1017', 'value': 'ADV10201285'}, {'propertyName': 'National The
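
As the output above shows, `ent._.other_ids` is a list of dictionaries with `propertyName`, `propertyId`, and `value` keys, and a property such as `occupation` can appear several times. The following sketch, using a hypothetical helper named `ids_by_property`, picks out selected properties from such a list. The sample entries are copied from the output above; in practice you would pass `ent._.other_ids`.

```python
def ids_by_property(other_ids, wanted):
    """Map each wanted property ID to the list of values found for it."""
    found = {}
    for entry in other_ids:
        pid = entry["propertyId"]
        if pid in wanted:
            # A property can occur more than once, so collect values in a list.
            found.setdefault(pid, []).append(entry["value"])
    return found


# Entries copied from the Victor Hugo output above.
sample_other_ids = [
    {"propertyName": "occupation", "propertyId": "P106", "value": "Q82955"},
    {"propertyName": "occupation", "propertyId": "P106", "value": "Q214917"},
    {"propertyName": "VIAF ID", "propertyId": "P214", "value": "9847974"},
    {"propertyName": "BnF ID", "propertyId": "P268", "value": "11907966z"},
]

print(ids_by_property(sample_other_ids, {"P214", "P268"}))
# {'P214': ['9847974'], 'P268': ['11907966z']}
```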
