Spacyfishing
A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata
spaCy fishing
A spaCy wrapper for entity-fishing, a tool for named entity recognition, linking and disambiguation against Wikidata.
This extension lets you use the entity-fishing tool as a spaCy pipeline component that disambiguates named entities (recognized by custom or pretrained spaCy NER models) and links them to the Wikidata knowledge base (KB).
Table of contents
- Installation
- Usage (examples)
- Configuration parameters
- Attributes
- Recommendations
- Visualise results
- External resources
- About
Installation
normal
pip install spacyfishing
development
git clone https://github.com/Lucaterre/spacyfishing.git
cd spacyfishing
virtualenv --python=/usr/bin/python3.8 venv
source venv/bin/activate
pip install -r requirements_dev.txt
Usage
First, install a pre-trained spaCy language model for the NER task:
python -m spacy download en_core_web_sm
Note that it is possible to use custom NER models.
Simple example
import spacy
text_en = "Victor Hugo and Honoré de Balzac are French writers who lived in Paris."
nlp_model_en = spacy.load("en_core_web_sm")
nlp_model_en.add_pipe("entityfishing")
doc_en = nlp_model_en(text_en)
for ent in doc_en.ents:
    print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))
('Victor Hugo', 'PERSON', 'Q535', 'https://www.wikidata.org/wiki/Q535', 0.972)
('Honoré de Balzac', 'PERSON', 'Q9711', 'https://www.wikidata.org/wiki/Q9711', 0.9724)
('French', 'NORP', 'Q121842', 'https://www.wikidata.org/wiki/Q121842', 0.3739)
('Paris', 'GPE', 'Q90', 'https://www.wikidata.org/wiki/Q90', 0.5652)
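The scores above vary widely (0.972 for 'Victor Hugo', 0.3739 for 'French'), so it may be useful to keep only links above a confidence cutoff. The following is a minimal sketch that works on tuples laid out like the printed output; the helper name `filter_confident_links`, the 0.5 threshold, and the handling of a missing score are illustrative assumptions, not part of spacyfishing.

```python
from typing import Iterable, List, Optional, Tuple

# (text, label, kb_qid, url_wikidata, nerd_score), same layout as the printed tuples
LinkedEntity = Tuple[str, str, str, str, Optional[float]]


def filter_confident_links(
    entities: Iterable[LinkedEntity], threshold: float = 0.5
) -> List[LinkedEntity]:
    """Keep entities whose score is at least `threshold` (hypothetical helper).

    Entities without a score are dropped; whether the component can leave
    the score unset is an assumption handled defensively here.
    """
    return [ent for ent in entities if ent[4] is not None and ent[4] >= threshold]


# Values copied from the example output above
linked = [
    ("Victor Hugo", "PERSON", "Q535", "https://www.wikidata.org/wiki/Q535", 0.972),
    ("Honoré de Balzac", "PERSON", "Q9711", "https://www.wikidata.org/wiki/Q9711", 0.9724),
    ("French", "NORP", "Q121842", "https://www.wikidata.org/wiki/Q121842", 0.3739),
    ("Paris", "GPE", "Q90", "https://www.wikidata.org/wiki/Q90", 0.5652),
]

confident = filter_confident_links(linked)
print([ent[0] for ent in confident])  # ['Victor Hugo', 'Honoré de Balzac', 'Paris']
```

A suitable threshold depends on the corpus and on how costly a wrong link is, so the value is best tuned on a sample of your own data.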
Batching example
import spacy
texts_en = [
"Victor Hugo and Honoré de Balzac are French writers who lived in Paris.",
"Momofuko Ando is Taiwanese Japanese Business Magnate that invented instant ramen."
]
nlp_model_en = spacy.load("en_core_web_sm")
nlp_model_en.add_pipe("entityfishing")
# set number of documents to be processed at once via batch_size
docs_en = nlp_model_en.pipe(texts_en, batch_size=128)
for doc_en in docs_en:
    for ent in doc_en.ents:
        print((ent.text, ent.label_, ent._.kb_qid, ent._.url_wikidata, ent._.nerd_score))
('Victor Hugo', 'PERSON', 'Q535', 'https://www.wikidata.org/wiki/Q535', 0.972)
('Honoré de Balzac', 'PERSON', 'Q9711', 'https://www.wikidata.org/wiki/Q9711', 0.9724)
('French', 'NORP', 'Q121842', 'https://www.wikidata.org/wiki/Q121842', 0.3739)
('Paris', 'GPE', 'Q90', 'https://www.wikidata.org/wiki/Q90', 0.5652)
('Momofuko Ando', 'PERSON', 'Q317858', 'https://www.wikidata.org/wiki/Q317858', 0.4598)
('Taiwanese', 'NORP', 'Q707908', 'https://www.wikidata.org/wiki/Q707908', 0.5424)
('Japanese', 'NORP', 'Q188712', 'https://www.wikidata.org/wiki/Q188712', 0.4956)
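When processing a batch, the printed tuples no longer show which input text an entity came from. The sketch below flattens a batch into dictionary rows that carry the document position; `collect_links` and the row keys are hypothetical names chosen for illustration, and the function relies only on the attributes used in the example above.

```python
from typing import Any, Dict, Iterable, List


def collect_links(docs: Iterable[Any]) -> List[Dict[str, Any]]:
    """Flatten the entities of a batch of docs into one list of rows.

    `docs` is expected to be an iterable of spaCy Doc objects processed by
    the entityfishing component. The "doc" key is the position of the
    document in the batch.
    """
    rows = []
    for doc_index, doc in enumerate(docs):
        for ent in doc.ents:
            rows.append(
                {
                    "doc": doc_index,
                    "text": ent.text,
                    "label": ent.label_,
                    "qid": ent._.kb_qid,
                    "url": ent._.url_wikidata,
                    "score": ent._.nerd_score,
                }
            )
    return rows
```

With the pipeline above, this could be called as `rows = collect_links(nlp_model_en.pipe(texts_en, batch_size=128))`. Because `pipe` yields documents in input order, the `doc` index maps back to the position in `texts_en`.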
Get extra information from Wikidata
By default, as shown above, the component attaches only the QID, the Wikidata URL and the score to each span. It can also retrieve further information, such as a short description of the entity, a standardized term, and identifiers from other knowledge bases linked to the Wikidata concept (for example the GeoNames ID or the VIAF ID).
To access this extra information about a Wikidata entity, set the extra_info parameter to True in the component configuration:
import spacy
text_en = "Victor Hugo and Honoré de Balzac are French writers who lived in Paris."
nlp_model_en = spacy.load("en_core_web_sm")
# specify configuration:
nlp_model_en.add_pipe("entityfishing", config={"extra_info": True})
doc_en = nlp_model_en(text_en)
# Access the description with ent._.description:
for ent in doc_en.ents:
    print((ent.text, ent.label_, ent._.kb_qid, ent._.normal_term, ent._.description, ent._.src_description, ent._.other_ids))
('Victor Hugo', 'PERSON', 'Q535', 'Victor Hugo', "'''''' (; 26 February 1802 – 22 May 1885) was a French poet, novelist, and dramatist of the [[Romanticism|Romantic movement]]. Hugo is considered to be one of the greatest and best-known French writers. Outside of France, his most famous works are the novels '''', 1862, and ''[[The Hunchback of Notre-Dame]]'', 1831. In France, Hugo is known primarily for his poetry collections, such as '''' (''The Contemplations'') and '''' (''The Legend of the Ages'').", 'wikipedia-en', [{'propertyName': 'Sycomore ID', 'propertyId': 'P1045', 'value': '8795'}, {'propertyName': 'image', 'propertyId': 'P18', 'value': 'Victor Hugo.jpg'}, {'propertyName': 'signature', 'propertyId': 'P109', 'value': 'Victor Hugo Signature.svg'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q82955'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q214917'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q6625963'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q15296811'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q8178443'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q11774202'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q11774156'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q36180'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q644687'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q3579035'}, {'propertyName': 'occupation', 'propertyId': 'P106', 'value': 'Q49757'}, {'propertyName': 'country of citizenship', 'propertyId': 'P27', 'value': 'Q142'}, {'propertyName': 'child', 'propertyId': 'P40', 'value': 'Q3271923'}, {'propertyName': 'child', 'propertyId': 'P40', 'value': 'Q2082427'}, {'propertyName': 'child', 'propertyId': 'P40', 'value': 'Q663856'}, {'propertyName': 'child', 'propertyId': 'P40', 'value': 'Q3083678'}, {'propertyName': 'father', 'propertyId': 'P22', 
'value': 'Q2299673'}, {'propertyName': 'mother', 'propertyId': 'P25', 'value': 'Q3491058'}, {'propertyName': 'spouse', 'propertyId': 'P26', 'value': 'Q2825429'}, {'propertyName': 'place of birth', 'propertyId': 'P19', 'value': 'Q37776'}, {'propertyName': 'place of interment', 'propertyId': 'P119', 'value': 'Q188856'}, {'propertyName': 'sex or gender', 'propertyId': 'P21', 'value': 'Q6581097'}, {'propertyName': 'VIAF ID', 'propertyId': 'P214', 'value': '9847974'}, {'propertyName': 'BnF ID', 'propertyId': 'P268', 'value': '11907966z'}, {'propertyName': 'GND ID', 'propertyId': 'P227', 'value': '118554654'}, {'propertyName': 'Commons category', 'propertyId': 'P373', 'value': 'Victor Hugo'}, {'propertyName': 'Library of Congress authority ID', 'propertyId': 'P244', 'value': 'n79091479'}, {'propertyName': 'place of death', 'propertyId': 'P20', 'value': 'Q90'}, {'propertyName': 'MusicBrainz artist ID', 'propertyId': 'P434', 'value': 'c0c99c8f-4779-4c35-9497-67d60a73310a'}, {'propertyName': 'unmarried partner', 'propertyId': 'P451', 'value': 'Q440119'}, {'propertyName': 'unmarried partner', 'propertyId': 'P451', 'value': 'Q3271708'}, {'propertyName': 'member of', 'propertyId': 'P463', 'value': 'Q161806'}, {'propertyName': 'member of', 'propertyId': 'P463', 'value': 'Q12759592'}, {'propertyName': 'member of', 'propertyId': 'P463', 'value': 'Q2822385'}, {'propertyName': 'NDL Auth ID', 'propertyId': 'P349', 'value': '00443985'}, {'propertyName': 'SUDOC authorities', 'propertyId': 'P269', 'value': '026927608'}, {'propertyName': 'date of death', 'propertyId': 'P570', 'value': {'time': '+1885-05-22T00:00:00Z', 'timezone': 0, 'before': 0, 'after': 0, 'precision': 11, 'calendarmodel': 'http://www.wikidata.org/entity/Q1985727'}}, {'propertyName': 'date of birth', 'propertyId': 'P569', 'value': {'time': '+1802-02-26T00:00:00Z', 'timezone': 0, 'before': 0, 'after': 0, 'precision': 11, 'calendarmodel': 'http://www.wikidata.org/entity/Q1985727'}}, {'propertyName': 'NKCR AUT ID', 
'propertyId': 'P691', 'value': 'jn19990003739'}, {'propertyName': 'given name', 'propertyId': 'P735', 'value': 'Q539581'}, {'propertyName': 'given name', 'propertyId': 'P735', 'value': 'Q632104'}, {'propertyName': "topic's main category", 'propertyId': 'P910', 'value': 'Q7367470'}, {'propertyName': 'educated at', 'propertyId': 'P69', 'value': 'Q209842'}, {'propertyName': 'educated at', 'propertyId': 'P69', 'value': 'Q1059546'}, {'propertyName': 'ISNI', 'propertyId': 'P213', 'value': '0000 0001 2120 0982'}, {'propertyName': 'ULAN ID', 'propertyId': 'P245', 'value': '500032572'}, {'propertyName': 'instance of', 'propertyId': 'P31', 'value': 'Q5'}, {'propertyName': 'SELIBR', 'propertyId': 'P906', 'value': '206651'}, {'propertyName': 'NLA (Australia) ID', 'propertyId': 'P409', 'value': '35212404'}, {'propertyName': 'BNE ID', 'propertyId': 'P950', 'value': 'XX874892'}, {'propertyName': 'BAV ID', 'propertyId': 'P1017', 'value': 'ADV10201285'}, {'propertyName': 'National The
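As the output shows, `other_ids` is a flat list of dictionaries with `propertyName`, `propertyId` and `value` keys, and one property (for example `occupation`, P106) can appear several times. A minimal sketch for looking up values by property follows; `index_other_ids` is a hypothetical helper, and the sample entries are copied from the output above.

```python
from collections import defaultdict
from typing import Any, Dict, List


def index_other_ids(other_ids: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group statement values by Wikidata property ID (hypothetical helper)."""
    grouped = defaultdict(list)
    for statement in other_ids:
        grouped[statement["propertyId"]].append(statement["value"])
    return dict(grouped)


# A few entries taken from the Victor Hugo output above
sample = [
    {"propertyName": "VIAF ID", "propertyId": "P214", "value": "9847974"},
    {"propertyName": "BnF ID", "propertyId": "P268", "value": "11907966z"},
    {"propertyName": "occupation", "propertyId": "P106", "value": "Q82955"},
    {"propertyName": "occupation", "propertyId": "P106", "value": "Q214917"},
]

by_property = index_other_ids(sample)
print(by_property["P214"])  # ['9847974']
print(by_property["P106"])  # ['Q82955', 'Q214917']
```

In a pipeline configured with `extra_info` set to True, the same helper could be applied to `ent._.other_ids` for each entity.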