Coreferee
Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages
Author: <a href="mailto:richard@explosion.ai">Richard Paul Hudson, Explosion AI</a>
- 1. Introduction
- 2. Interacting with the data model
- 3. How it works
- 4. Adding support for a new language
- 5. Adding support for a custom spaCy model
- 6. Version history
- 7. Open issues/requests for assistance
<a id="introduction"></a>
1. Introduction
<a id="the-basic-idea"></a>
1.1 The basic idea
Coreferences are situations where two or more words within a text refer to the same entity, e.g. John went home because he was tired. Resolving coreferences is an important general task within the natural language processing field.
Coreferee is a Python 3 library (tested with versions 3.6 to 3.10) that is used together with spaCy (tested with versions 3.0.0 to 3.3.0) to resolve coreferences within English, French, German and Polish texts. It is designed so that it is easy to add support for new languages. It uses a mixture of neural networks and programmed rules.
The library was originally developed at msg systems, but is now being maintained at Explosion AI. Please direct any new issues or discussions to the Explosion repository.
<a id="getting-started"></a>
1.2 Getting started
<a id="getting-started-en"></a>
1.2.1 English
Presuming you have already installed spaCy and one of the English spaCy models, install Coreferee from the command line by typing:
python3 -m pip install coreferee
python3 -m coreferee install en
Note that:
- the required command may be python rather than python3 on some operating systems;
- in order to use the transformer-based spaCy model en_core_web_trf with Coreferee, you will need to install the spaCy model en_core_web_lg as well (see the explanation here).
Then open a Python prompt (type python3 or python at the command line):
>>> import coreferee, spacy
>>> nlp = spacy.load('en_core_web_trf')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x000002DE8E9256D0>
>>>
>>> doc = nlp("Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")
>>>
>>> doc._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)
>>>
>>> doc[16]._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
2: [He(16); wife(19)], they(21), They(26), they(31)
>>>
>>> doc._.coref_chains.resolve(doc[31])
[Peter, wife]
>>>
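The session above can be read as a small data structure: each chain is a list of mentions, and each mention is one or more token indexes. The following is a minimal sketch of that structure in plain Python, not Coreferee's own implementation. The names Chain, chains_containing and resolve are hypothetical, as is the most_specific field, whose values were picked by hand for this example. The chains and token indexes are copied from the printed output above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A mention is a tuple of token indexes; coordination such as [He; wife]
# is a single mention that spans two tokens.
Mention = Tuple[int, ...]


@dataclass(frozen=True)
class Chain:
    index: int
    mentions: Tuple[Mention, ...]
    # ASSUMPTION: position of the most informative mention, chosen by hand here.
    most_specific: int


# Token texts and indexes copied from the printed chains in the session above.
TOKENS: Dict[int, str] = {
    1: "he", 6: "his", 7: "work", 9: "Peter", 14: "it", 16: "He", 18: "his",
    19: "wife", 21: "they", 26: "They", 29: "Spain", 31: "they", 34: "country",
}

CHAINS: List[Chain] = [
    Chain(0, ((1,), (6,), (9,), (16,), (18,)), most_specific=2),  # Peter
    Chain(1, ((7,), (14,)), most_specific=0),                     # work
    Chain(2, ((16, 19), (21,), (26,), (31,)), most_specific=0),   # [He; wife]
    Chain(3, ((29,), (34,)), most_specific=0),                    # Spain
]


def chains_containing(token_index: int, chains: List[Chain]) -> List[Chain]:
    """Return every chain in which the token takes part in some mention."""
    return [c for c in chains if any(token_index in m for m in c.mentions)]


def resolve(token_index: int, chains: List[Chain]) -> List[int]:
    """Follow a token to the most specific mention(s) it refers to.

    Tokens that are not a standalone mention in any chain are returned
    unchanged; that is a simplification made for this sketch.
    """
    for chain in chains:
        if (token_index,) in chain.mentions:
            target = chain.mentions[chain.most_specific]
            if target == (token_index,):
                return [token_index]
            resolved: List[int] = []
            for index in target:
                resolved.extend(resolve(index, chains))
            return resolved
    return [token_index]


print([c.index for c in chains_containing(16, CHAINS)])  # [0, 2]
print([TOKENS[i] for i in resolve(31, CHAINS)])          # ['Peter', 'wife']
```

The two printed results correspond to doc[16]._.coref_chains.print() and doc._.coref_chains.resolve(doc[31]) in the session above. In a real pipeline the chains come from doc._.coref_chains rather than from hand-written tuples.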
<a id="getting-started-fr"></a>
1.2.2 French
Presuming you have already installed spaCy and one of the French spaCy models, install Coreferee from the command line by typing:
python3 -m pip install coreferee
python3 -m coreferee install fr
Note that the required command may be python rather than python3 on some operating systems.
Then open a Python prompt (type python3 or python at the command line):
>>> import coreferee, spacy
>>> nlp = spacy.load('fr_core_news_lg')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x000001F556B4FF10>
>>>
>>> doc = nlp("Même si elle était très occupée par son travail, Julie en avait marre. Alors, elle et son mari décidèrent qu'ils avaient besoin de vacances. Ils allèrent en Espagne car ils adoraient le pays")
>>>
>>> doc._.coref_chains.print()
0: elle(2), son(7), Julie(10), elle(17), son(19)
1: travail(8), en(11)
2: [elle(17); mari(20)], ils(23), Ils(29), ils(34)
3: Espagne(32), pays(37)
>>>
>>> doc[17]._.coref_chains.print()
0: elle(2), son(7), Julie(10), elle(17), son(19)
2: [elle(17); mari(20)], ils(23), Ils(29), ils(34)
>>>
>>> doc._.coref_chains.resolve(doc[34])
[Julie, mari]
>>>
<a id="getting-started-de"></a>
1.2.3 German
Presuming you have already installed spaCy and one of the German spaCy models, install Coreferee from the command line by typing:
python3 -m pip install coreferee
python3 -m coreferee install de
Note that the required command may be python rather than python3 on some operating systems.
Then open a Python prompt (type python3 or python at the command line):
>>> import coreferee, spacy
>>> nlp = spacy.load('de_core_news_lg')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x0000026E84C63B50>
>>>
>>> doc = nlp("Weil er mit seiner Arbeit sehr beschäftigt war, hatte Peter davon genug. Er und seine Frau haben entschieden, dass ihnen ein Urlaub gut tun würde. Sie sind nach Spanien gefahren, weil ihnen das Land sehr gefiel.")
>>>
>>> doc._.coref_chains.print()
0: er(1), seiner(3), Peter(10), Er(14), seine(16)
1: Arbeit(4), davon(11)
2: [Er(14); Frau(17)], ihnen(22), Sie(29), ihnen(36)
3: Spanien(32), Land(38)
>>>
>>> doc[14]._.coref_chains.print()
0: er(1), seiner(3), Peter(10), Er(14), seine(16)
2: [Er(14); Frau(17)], ihnen(22), Sie(29), ihnen(36)
>>>
>>> doc._.coref_chains.resolve(doc[36])
[Peter, Frau]
>>>
<a id="getting-started-pl"></a>
1.2.4 Polish
Presuming you have already installed spaCy and one of the Polish spaCy models, install Coreferee from the command line by typing:
python3 -m pip install coreferee
python3 -m coreferee install pl
Note that the required command may be python rather than python3 on some operating systems.
Then open a Python prompt (type python3 or python at the command line):
>>> import coreferee, spacy
>>> nlp = spacy.load('pl_core_news_lg')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x0000027304C63B50>
>>>
>>> doc = nlp("Ponieważ bardzo zajęty był swoją pracą, Janek miał jej dość. Postanowili z jego żoną, że potrzebują wakacji. Pojechali do Hiszpanii, bo bardzo im się ten kraj podobał.")
>>>
>>> doc._.coref_chains.print()
0: był(3), swoją(4), Janek(7), Postanowili(12), jego(14)
1: pracą(5), jej(9)
2: [Postanowili(12); żoną(15)], potrzebują(18), Pojechali(21), im(27)
3: Hiszpanii(23), kraj(30)
>>>
>>> doc[12]._.coref_chains.print()
0: był(3), swoją(4), Janek(7), Postanowili(12), jego(14)
2: [Postanowili(12); żoną(15)], potrzebują(18), Pojechali(21), im(27)
>>>
>>> doc._.coref_chains.resolve(doc[27])
[Janek, żoną]
>>>
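The four walkthroughs in 1.2.1 to 1.2.4 differ only in the language code and the spaCy model name. The following sketch gathers them in one place. The names EXAMPLE_MODELS, install_commands and load_pipeline are hypothetical helpers and not part of Coreferee. Building the commands around sys.executable is one possible way to avoid the python versus python3 question noted above, because it reuses whichever interpreter is already running.

```python
import sys
from typing import Dict, List

# Language codes and spaCy model names taken from the examples in 1.2.1 to 1.2.4.
# Note from 1.2.1: using en_core_web_trf also requires en_core_web_lg to be installed.
EXAMPLE_MODELS: Dict[str, str] = {
    "en": "en_core_web_trf",
    "fr": "fr_core_news_lg",
    "de": "de_core_news_lg",
    "pl": "pl_core_news_lg",
}


def install_commands(lang: str, python: str = sys.executable) -> List[List[str]]:
    """Build (but do not run) the two install commands for one language."""
    if lang not in EXAMPLE_MODELS:
        raise ValueError(f"no example in this README for language code {lang!r}")
    return [
        [python, "-m", "pip", "install", "coreferee"],
        [python, "-m", "coreferee", "install", lang],
    ]


def load_pipeline(lang: str):
    """Load the example model for `lang` and add Coreferee to the pipeline.

    Requires spaCy, the spaCy model and Coreferee to be installed already,
    so the imports are kept local to this function.
    """
    import coreferee  # noqa: F401  (imported as in the sessions above)
    import spacy

    nlp = spacy.load(EXAMPLE_MODELS[lang])
    nlp.add_pipe("coreferee")
    return nlp


# Show the commands for German without executing them.
for command in install_commands("de", python="python3"):
    print(" ".join(command))
```

Each returned command is a list of arguments, so it can be passed to subprocess.run(command, check=True) once you have confirmed it is what you want to execute.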
<a id="background-information"></a>
1.3 Background information
Handling coreference resolution successfully requires training corpora that have been manually annotated with coreferences. The state of the art in coreference resolution is progressing rapidly, but is largely focussed on techniques that require training corpora that are larger than what is available for most languages and software developers. The CONLL 2012 training corpus, which is most widely used, has the following restrictions:
- CONLL 2012 covers English, Chinese and Arabic; there is nothing of comparable size for most other languages. For example, the corpus we used to train Coreferee for German is around a tenth of the size of CONLL 2012;
- CONLL 2012 is not publicly available and has a relatively restrictive license.
Earlier versions of spaCy had an extension, Neuralcoref, that was excellent but that was never made publicly available for any language other than English. The aim of Coreferee, on the other hand, is to get coreference resolution working for a variety of languages: our focus is less on necessarily achieving the best possible precision and recall for English than on enabling the functionality to be reproduced for new languages as easily and as quickly as possible. Because training data is in such short supply for most languages and is very effort-intensive to produce, it is important to use what is available as effectively as possible.
There are three essential strategies that human readers employ to recognise coreferences within a text:
- Hard grammatical rules that completely preclude entities within a text from coreferring, e.g. The house stood tall. They went on walking. Such rules play an especially important role in languages that have grammatical gender, which includes most continental European languages.
- Pragmatic tendencies, e.g. a word that begins a sentence and that is a grammatical subject is more likely than a word that is in the middle of a sentence and that forms part of a prepositional phrase to be referred back to by a
