SkillAgentSearch skills...

Coreferee

Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages

Install / Use

/learn @msg-systems/Coreferee
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Coreferee

Author: <a href="mailto:richard@explosion.ai">Richard Paul Hudson, Explosion AI</a>

<a id="introduction"></a>

1. Introduction

<a id="the-basic-idea"></a>

1.1 The basic idea

Coreferences are situations where two or more words within a text refer to the same entity, e.g. John went home because he was tired. Resolving coreferences is an important general task within the natural language processing field.

Coreferee is a Python 3 library (tested with versions 3.6—3.10) that is used together with spaCy (tested with versions 3.0.0—3.3.0) to resolve coreferences within English, French, German and Polish texts. It is designed so that it is easy to add support for new languages. It uses a mixture of neural networks and programmed rules.

The library was originally developed at msg systems, but is now being maintained at Explosion AI. Please direct any new issues or discussions to the Explosion repository.

<a id="getting-started"></a>

1.2 Getting started

<a id="getting-started-en"></a>

1.2.1 English

Presuming you have already installed spaCy and one of the English spacy models, install Coreferee from the command line by typing:

python3 -m pip install coreferee
python3 -m coreferee install en

Note that:

  • the required command may be python rather than python3 on some operating systems;
  • in order to use the transformer-based spaCy model en_core_web_trf with Coreferee, you will need to install the spaCy model en_core_web_lg as well (see the explanation here).

Then open a Python prompt (type python3 or python at the command line):

>>> import coreferee, spacy
>>> nlp = spacy.load('en_core_web_trf')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x000002DE8E9256D0>
>>>
>>> doc = nlp("Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")
>>>
>>> doc._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)
>>>
>>> doc[16]._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
2: [He(16); wife(19)], they(21), They(26), they(31)
>>>
>>> doc._.coref_chains.resolve(doc[31])
[Peter, wife]
>>>

<a id="getting-started-fr"></a>

1.2.2 French

Presuming you have already installed spaCy and one of the French spacy models, install Coreferee from the command line by typing:

python3 -m pip install coreferee
python3 -m coreferee install fr

Note that the required command may be python rather than python3 on some operating systems.

Then open a Python prompt (type python3 or python at the command line):

>>> import coreferee, spacy
>>> nlp = spacy.load('fr_core_news_lg')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x000001F556B4FF10>
>>>
>>> doc = nlp("Même si elle était très occupée par son travail, Julie en avait marre. Alors, elle et son mari décidèrent qu'ils avaient besoin de vacances. Ils allèrent en Espagne car ils adoraient le pays")
>>>
>>> doc._.coref_chains.print()
0: elle(2), son(7), Julie(10), elle(17), son(19)
1: travail(8), en(11)
2: [elle(17); mari(20)], ils(23), Ils(29), ils(34)
3: Espagne(32), pays(37)
>>>
>>> doc[17]._.coref_chains.print()
0: elle(2), son(7), Julie(10), elle(17), son(19)
2: [elle(17); mari(20)], ils(23), Ils(29), ils(34)
>>>
>>> doc._.coref_chains.resolve(doc[34])
[Julie, mari]
>>>

<a id="getting-started-de"></a>

1.2.3 German

Presuming you have already installed spaCy and one of the German spacy models, install Coreferee from the command line by typing:

python3 -m pip install coreferee
python3 -m coreferee install de

Note that the required command may be python rather than python3 on some operating systems.

Then open a Python prompt (type python3 or python at the command line):

>>> import coreferee, spacy
>>> nlp = spacy.load('de_core_news_lg')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x0000026E84C63B50>
>>>
>>> doc = nlp("Weil er mit seiner Arbeit sehr beschäftigt war, hatte Peter davon genug. Er und seine Frau haben entschieden, dass ihnen ein Urlaub gut tun würde. Sie sind nach Spanien gefahren, weil ihnen das Land sehr gefiel.")
>>>
>>> doc._.coref_chains.print()
0: er(1), seiner(3), Peter(10), Er(14), seine(16)
1: Arbeit(4), davon(11)
2: [Er(14); Frau(17)], ihnen(22), Sie(29), ihnen(36)
3: Spanien(32), Land(38)
>>>
>>> doc[14]._.coref_chains.print()
0: er(1), seiner(3), Peter(10), Er(14), seine(16)
2: [Er(14); Frau(17)], ihnen(22), Sie(29), ihnen(36)
>>>
>>> doc._.coref_chains.resolve(doc[36])
[Peter, Frau]
>>>

<a id="getting-started-pl"></a>

1.2.4 Polish

Presuming you have already installed spaCy and one of the Polish spacy models, install Coreferee from the command line by typing:

python3 -m pip install coreferee
python3 -m coreferee install pl

Note that the required command may be python rather than python3 on some operating systems.

Then open a Python prompt (type python3 or python at the command line):

>>> import coreferee, spacy
>>> nlp = spacy.load('pl_core_news_lg')
>>> nlp.add_pipe('coreferee')
<coreferee.manager.CorefereeBroker object at 0x0000027304C63B50>
>>>
>>> doc = nlp("Ponieważ bardzo zajęty był swoją pracą, Janek miał jej dość. Postanowili z jego żoną, że potrzebują wakacji. Pojechali do Hiszpanii, bo bardzo im się ten kraj podobał.")
>>>
>>> doc._.coref_chains.print()
0: był(3), swoją(4), Janek(7), Postanowili(12), jego(14)
1: pracą(5), jej(9)
2: [Postanowili(12); żoną(15)], potrzebują(18), Pojechali(21), im(27)
3: Hiszpanii(23), kraj(30)
>>>
>>> doc[12]._.coref_chains.print()
0: był(3), swoją(4), Janek(7), Postanowili(12), jego(14)
2: [Postanowili(12); żoną(15)], potrzebują(18), Pojechali(21), im(27)
>>>
>>> doc._.coref_chains.resolve(doc[27])
[Janek, żoną]
>>>

<a id="background-information"></a>

1.3 Background information

Handling coreference resolution successfully requires training corpora that have been manually annotated with coreferences. The state of the art in coreference resolution is progressing rapidly, but is largely focussed on techniques that require training corpora that are larger than what is available for most languages and software developers. The CONLL 2012 training corpus, which is most widely used, has the following restrictions:

  • CONLL 2012 covers English, Chinese and Arabic; there is nothing of comparable size for most other languages. For example, the corpus we used to train Coreferee for German is around a tenth of the size of CONLL 2012;

  • CONLL 2012 is not publicly available and has a relatively restrictive license.

Earlier versions of spaCy had an extension, Neuralcoref, that was excellent but that was never made publicly available for any language other than English. The aim of Coreferee, on the other hand, is to get coreference resolution working for a variety of languages: our focus is less on necessarily achieving the best possible precision and recall for English than on enabling the functionality to be reproduced for new languages as easily and as quickly as possible. Because training data is in such short supply for most languages and is very effort-intensive to produce, it is important to use what is available as effectively as possible.

There are three essential strategies that human readers employ to recognise coreferences within a text:

  1. Hard grammatical rules that completely preclude entities within a text from coreferring, e.g. The house stood tall. They went on walking. Such rules play an especially important role in languages that have grammatical gender, which includes most continental European languages.

  2. Pragmatic tendencies, e.g. a word that begins a sentence and that is a grammatical subject is more likely than a word that is in the middle of a sentence and that forms part of a prepositional phrase to be referred back to by a

Related Skills

View on GitHub
GitHub Stars199
CategoryDevelopment
Updated1mo ago
Forks20

Languages

Python

Security Score

95/100

Audited on Feb 23, 2026

No findings